This post is devoted to give an introduction to Bayesian modeling using PyMC3, an open source probabilistic programming framework written in Python. Part of this material was presented in the Python Users Berlin (PUB) meet up.

Why PyMC3? As described in the documentation:

PyMC3’s user-facing features are written in pure Python, it leverages Theano to transparently transcode models to C and compile them to machine code, thereby boosting performance.

Theano is a library that allows expressions to be defined using generalized vector data structures called tensors, which are tightly integrated with the popular NumPy ndarray data structure.

In addition, from a practical point of view, PyMC3 syntax is very transparent from the mathematical point of view.

This post is not aimed to give a full treatment of the ~~mathematical details~~, as there are many good (complete and detailed) references around these topics. Also, we are not going to dive deep into PyMC3 as all the details can be found in the documentation. Instead, we are interested in giving an overview of the basic mathematical concepts combined with examples (written in Python code) which should make clear why Monte Carlo simulations are useful in Bayesian modeling.

# 1. Mathematical Background

# 1.1 Bayes Theorem

## Frequentist vs Bayesian

*The essential difference between frequentist inference and Bayesian inference is the same as the difference between the two interpretations of what a “probability” means*.

**Frequentist inference** is a method of statistical inference in which conclusions from data is obtained by emphasizing the frequency or proportion of the data.

**Bayesian inference** is a method of statistical inference in which Bayes’ theorem is used to update the probability for a hypothesis as more evidence or information becomes available.

## Conditional Probability

Let \(A\) and \(B\) be two events, then the *conditional probability* of \(A\) given \(B\) is defined as the ratio

\[\begin{equation} P(A|B):=\frac{P(A\cap B)}{P(B)} \end{equation}\]

*Remark:* Formally we have a probability space \((\Omega, \mathcal{F}, P)\), where \(\Omega\) is the sample space, \(\mathcal{F}\) is a \(\sigma\)-algebra on \(\Omega\) and \(P\) is a probability measure. The events \(A\), and \(B\) are elements of \(\mathcal{F}\) and we assume that \(P(B)\neq 0\).

Observe in particular

\[\begin{equation} P(A|B)P(B)=P(A\cap B)=P(B\cap A) = P(B|A)P(A) \end{equation}\]

## Bayes Theorem

From the last formula we obtain the relation

\[\begin{equation} P(A|B)=\frac{P(B|A)P(A)}{P(B)} \end{equation}\]

which is known as Bayes Theorem.

**Example:** Suppose you are in the U-Bahn and you see a person with long hair. You want to know the probablity that this person is a woman. Consider the events \(A=\) woman \(B=\) long hair. You want to compute \(P(A|B)\). Suppose that you estimate \(P(A)=0.5\), \(P(B)=0.4\) and \(P(B|A)=0.7\) (the probability that a woman has long hair). Then, given these prior estimated probabilities, Bayes theorem gives

\[\begin{equation} P(A|B)=\frac{P(B|A)P(A)}{P(B)} = \frac{0.7\times 0.5}{0.4} = 0.875. \end{equation}\]

## Bayesian Approach to Data Analysis

Assume that you have a sample of observations \(y_1,..., y_n\) of a random variable \(Y\sim f(y|\theta)\), where \(\theta\) is a parameter for the distribution. Here we consider \(\theta\) as a random variable as well. Following Bayes Theorem (its continuous version) we can write:

\[\begin{equation} f(\theta|y)=\frac{f(y|\theta)f(\theta)}{f(y)} = \displaystyle{\frac{f(y|\theta)f(\theta)}{\int f(y|\theta)f(\theta)d\theta}} \end{equation}\]

The function \(f(y|\theta)\) is called the

*likelihood*.\(f(\theta)\) is the

*prior*distribution of \(\theta\).

Note that \(f(y)\) *does not* depend on \(\theta\) (just on the data), thus it can be considered as a “normalizing constant”. In addition, it is often the case that the integral above is not easy to compute. Nevertheless, it is enough to consider the relation:

\[\begin{equation} f(\theta|y) \propto \text{likelihood} \times \text{prior}. \end{equation}\]

(Here \(\propto\) denotes the proportionality relation)

## Example: Poisson Data

In order to give a better sense of the relation above we are going to study a concrete example. Consider \(n\) samples of \(Y\sim \text{Poiss}(\lambda)\). Recall that the Poisson distribution is given by:

\[ \displaystyle{ f(y_i|\lambda)=\frac{e^{-\lambda}\lambda^{y_i}}{y_i!} } \]

where \(\lambda>0\). It is easy to verify that the *expected value* and *variance* are \(\text{E}(Y)=\lambda\) and \(\text{Var}(Y)=\lambda\) respectively. Parallel to the formal discussion, we are going to implement a numerical simulation:

```
import numpy as np
import scipy.stats as ss
# We set a seed so that the results are reproducible.
np.random.seed(5)
# number of samples.
n = 100
# true parameter.
lam_true = 2
# sample array.
y = np.random.poisson(lam=lam_true, size=n)
y
```

```
array([2, 4, 1, 0, 2, 2, 2, 2, 1, 1, 3, 2, 0, 1, 3, 3, 4, 2, 0, 0, 3, 6,
1, 2, 1, 2, 5, 2, 3, 0, 1, 3, 1, 4, 1, 2, 4, 0, 6, 4, 1, 2, 2, 0,
1, 2, 4, 4, 1, 3, 0, 3, 3, 2, 4, 2, 2, 1, 1, 2, 5, 2, 3, 0, 1, 1,
1, 3, 4, 1, 3, 4, 2, 1, 2, 4, 2, 2, 1, 0, 2, 2, 3, 0, 3, 3, 4, 2,
2, 1, 2, 1, 3, 0, 1, 0, 3, 3, 1, 2])
```

```
# mean of the sample.
y.mean()
```

`2.06`

```
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline
# Histogram of the sample.
plt.figure(figsize=(8, 6))
plt.hist(y, bins=15)
plt.title('Histogram of Simulated Data');
```

### Prior: Gamma Distribution

Let us consider a gamma prior distribution for the parameter \(\lambda \sim \Gamma(a,b)\). Recall that the density function for the gamma distribution is

\[\begin{equation} f(\lambda)=\frac{b^a}{\Gamma(a)}\lambda^{a-1} e^{-b\lambda} \end{equation}\]

where \(a>0\) is the *shape* parameter and \(b>0\) is the *rate parameter*. One verifies that

\[ \text{E}(\lambda)=\frac{a}{b} \quad \text{and} \quad \text{Var}(\lambda)=\frac{a}{b^2} \]

Let us plot a gamma distribution for parameters \(a=3.5\) and \(b=2\).

```
# Parameters of the prior gamma distribution.
a = 3.5 # shape
b = 2 # rate = 1/scale
x = np.linspace(start=0, stop=10, num=100)
plt.figure(figsize=(8, 6))
plt.plot(x, ss.gamma.pdf(x,a=a,scale=1/b), 'r-')
plt.title('Gamma Density Function for a={} and b={}'.format(a,b))
# Define the prior distribution.
prior = lambda x: ss.gamma.pdf(x, a=a, scale=1/b)
```

### Likelihood

As the observations are independent the likelihood function is

\[\begin{align} f(y|\lambda)=&\prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} =\frac{e^{-n\lambda}\lambda^{\sum_{i=1}^n y_i}}{\prod_{i=1}^{n}y_i!} \end{align}\]

```
import scipy.special as sp
# Define the likelihood function.
def likelihood(lam,y):
factorials = np.apply_along_axis(
lambda x: sp.gamma(x+1),
axis=0,
arr=y
)
numerator = np.exp(-lam*y.size)*(lam**y.sum())
denominator = np.multiply.reduce(factorials)
return numerator/denominator
```

### Posterior distribution for \(\lambda\) up to a constant

As we are just interested in the structure of the posterior distribution, up to a constant, we see

\[\begin{align} f(\lambda|y)\propto & \text{likelihood} \times \text{prior}\\ \propto & \quad f(y|\lambda)f(\lambda)\\ \propto & \quad e^{-n\lambda}\lambda^{\sum_{i=1}^n y_i} \lambda^{a-1} e^{-b\lambda}\\ \propto & \quad \lambda^{\left(\sum_{i=1}^n y_i+a\right)-1} e^{-(n+b)\lambda}\\ \end{align}\]

```
# Define the posterior distribution.
# (up to a constant)
def posterior_up_to_constant(lam,y):
return likelihood(lam,y)*prior(lam)
# Plot of the prior and (scaled) posterior distribution
# for the parameter lambda.
#
# We multiply the posterior distrubution function
# by the amplitude factor 2.5e74 to make it comparable
# with the prior gamma distribution.
plt.figure(figsize=(8, 6))
plt.plot(x, 2.0e74*posterior_up_to_constant(x,y), label='posterior')
plt.plot(x, ss.gamma.pdf(x,a=a,scale=1/b), 'r-', label='prior')
plt.legend();
```

### True posterior distribution for \(\lambda\)

In fact, as \(f(\lambda|y) \propto\: \lambda^{\left(\sum_{i=1}^n y_i+a\right)-1} e^{-(n+b)\lambda}\), one verifies that the posterior distribution is again a gamma

\[\begin{align} f(\lambda|y) = \Gamma\left(\sum_{i=1}^n y_i+a, n+b\right) \end{align}\]

This means that the gamma and Poisson distribution form a conjugate pair.

```
def posterior(lam,y):
shape = a + y.sum()
rate = b + y.size
return ss.gamma.pdf(lam, shape, scale=1/rate)
plt.figure(figsize=(8, 6))
plt.plot(x, posterior(x,y))
plt.plot(x, ss.gamma.pdf(x,a=a,scale=1/b), 'r-')
plt.title('Prior and Posterior Distributions');
```

We indeed see how the posterior distribution is concentrated around the true parameter \(\lambda=2\).

Note that the posterior mean is

\[\begin{align} \frac{\sum_{i=1}^n y_i+a}{n+b} = \frac{b}{n+b}\frac{a}{b}+\frac{n}{n+b}\frac{\sum_{i=1}^n y_i}{n} \end{align}\]

That is, it is a weighted average of the prior mean \(a/b\) and the sample average \(\bar{y}\). As \(n\) increases,

\[\begin{align} \lim_{n\rightarrow +\infty}\frac{b}{n+b}\frac{a}{b}+\frac{n}{n+b}\frac{\sum_{i=1}^n y_i}{n} = \bar{y}. \end{align}\]

```
# Posterior gamma parameters.
shape = a + y.sum()
rate = b + y.size
# Posterior mean.
shape/rate
```

`2.053921568627451`

# 1.2 Markov Chain Monte Carlo (MCMC) Approach

In the last example the posterior distribution was easy to identify. However, in paractice this is not usually the case and therefore, via Bayes Theorem, we would only know the posterior distribution up to a constant. This motivates the idea of using Monte Carlo simulation methods. How can we sample from a distribution that we do not know? The Metropolis–Hastings algorithm, explained next, is one approach to tackle this problem.

## Metropolis–Hastings algorithm

Let \(\phi\) be a function that is proportional to the desired probability distribution \(f\).

**Initialization:**

Pick \(x_{0}\) to be the first sample, and choose an arbitrary probability density

\[\begin{equation} g(x_{n+1}| x_{n}) \end{equation}\]

that suggests a candidate for the next sample value \(x_{n+1}\). Assume \(g\) is symmetric.

**For each iteration:**

Generate a candidate \(x\) for the next sample by picking from the distribution \(g(x|x_n)\). Calculate the *acceptance ratio*

\[\begin{equation} \alpha := \frac{f(x)}{f(x_n)} = \frac{\phi(x)}{\phi(x_n)} \end{equation}\]

If \(\alpha \geq 1\), automatically accept the candidate by setting

\[\begin{equation} x_{n+1} = x. \end{equation}\]

Otherwise, accept the candidate with probability \(\alpha\). If the candidate is rejected, set

\[\begin{equation} x_{n+1} = x_{n}. \end{equation}\]

Why does this algorithm solve the initial problem? The full explanation is beyond the scope of this post (some references are provided at the end). It relies in the in the following result.

## Ergodic Theorem for Markov Chains

**Theorem (Ergodic Theorem for Markov Chains)** If \(\{x^{(1)} , x^{(2)} , . . .\}\) is an *irreducible*, *aperiodic* and *recurrent* Markov chain, then there is a unique probability distribution \(\pi\) such that as \(N\longrightarrow\infty\),

- \(P(x^{(N)} ∈ A) \longrightarrow \pi(A)\).
- \(\displaystyle{\frac{1}{N}\sum_{n=1}^{N} g(x^{(n)}) \longrightarrow \int g(x)\pi(x) dx }\).

*Recall:*

A Markov chain is said to be

**irreducible**if it is possible to get to any state from any state.A state \(n\) has

**period**\(k\) if any return to state \(n\) must occur in multiples of \(k\) time steps.If \(k=1\), then the state is said to be

**aperiodic**.A state \(n\) is said to be

**transient**if, given that we start in state \(n\), there is a non-zero probability that we will never return to \(i\).A state \(n\) is

**recurrent**if it is not transient.

# 2. PyMC3 Syntax

Now we perform a MCMC simulation for the data described above. Note how easy is to write the model from the mathematical description.

```
import pymc3 as pm
model = pm.Model()
with model:
# Define the prior of the parameter lambda.
lam = pm.Gamma('lambda', alpha=a, beta=b)
# Define the likelihood function.
y_obs = pm.Poisson('y_obs', mu=lam, observed=y)
# Consider 2000 draws and 3 chains.
trace = pm.sample(draws=2000, chains=3)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 4 jobs)
NUTS: [lambda]
Sampling 3 chains: 100%|██████████| 7500/7500 [00:01<00:00, 5660.80draws/s]
```

If we do a trace plot we can see two results:

We see the simulated posterior distribution for 3 independent Markov Chains (so that, when combined, avoid the dependence on the initial point). The 3 different chains correspond to the color blue, green and orange.

The sample value of lambda for each iteration.

`pm.traceplot(trace);`

We can also see the mean and quantile information for the posterior distribution.

`pm.plot_posterior(trace);`

# 4 References and Further Reading

Here we provide some suggested references used in this post and also to go deeper in the subject.