This post gives an introduction to Bayesian modeling using PyMC3, an open source probabilistic programming framework written in Python. Part of this material was presented at the Python Users Berlin (PUB) meetup.
Why PyMC3? As described in the documentation:
- PyMC3’s user-facing features are written in pure Python; it leverages Theano to transparently transcode models to C and compile them to machine code, thereby boosting performance.
- Theano is a library that allows expressions to be defined using generalized vector data structures called tensors, which are tightly integrated with the popular NumPy ndarray data structure.
In addition, from a practical point of view, PyMC3's syntax stays very close to the mathematical formulation of the model.
This post does not aim to give a full treatment of the mathematical details, as there are many good (complete and detailed) references on these topics (some of them collected at the end of the post). Also, we are not going to dive deep into PyMC3, as all the details can be found in the documentation. Instead, we are interested in giving an overview of the basic mathematical concepts combined with examples (written in Python code). Still, we briefly describe the main idea behind Markov Chain Monte Carlo, a sampling method that has inspired many other methods.
Remark: Among the rich literature in Bayesian statistics, I particularly recommend Statistical Rethinking by Richard McElreath. It is an excellent conceptual and practical introduction to the subject. Moreover, the PyMC3 dev team translated all of its code into PyMC3.
Update: This post has been updated to include better integration with arviz and xarray and to update the PyMC3 syntax.
Mathematical Background
Bayes Theorem
Frequentist vs Bayesian
The essential difference between frequentist inference and Bayesian inference is the same as the difference between the two interpretations of what a “probability” means.
Frequentist inference is a method of statistical inference in which conclusions are drawn from data by emphasizing the frequency or proportion of the data.
Bayesian inference is a method of statistical inference in which Bayes’ theorem is used to update the probability for a hypothesis as more evidence or information becomes available.
Conditional Probability
Let \(A\) and \(B\) be two events, then the conditional probability of \(A\) given \(B\) is defined as the ratio
\[ P(A|B):=\frac{P(A\cap B)}{P(B)} \]
Remark: Formally we have a probability space \((\Omega, \mathcal{F}, P)\), where \(\Omega\) is the sample space, \(\mathcal{F}\) is a \(\sigma\)-algebra on \(\Omega\) and \(P\) is a probability measure. The events \(A\) and \(B\) are elements of \(\mathcal{F}\) and we assume that \(P(B)\neq 0\).
Observe in particular that
\[ P(A|B)P(B)=P(A\cap B)=P(B\cap A) = P(B|A)P(A) \]
Bayes Theorem
From the last formula we obtain the relation
\[\begin{equation} P(A|B)=\frac{P(B|A)P(A)}{P(B)} \end{equation}\]
which is known as Bayes Theorem.
Example: Suppose you are in the U-Bahn and you see a person with long hair. You want to know the probability that this person is a woman. Consider the events \(A=\text{woman}\) and \(B= \text{long hair}\). You want to compute \(P(A|B)\). Suppose that you estimate \(P(A)=0.5\), \(P(B)=0.4\) and \(P(B|A)=0.7\) (the probability that a woman has long hair). Then, given these prior estimated probabilities, Bayes theorem gives
\[\begin{equation} P(A|B)=\frac{P(B|A)P(A)}{P(B)} = \frac{0.7\times 0.5}{0.4} = 0.875. \end{equation}\]
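For concreteness, here is the same computation as a short Python check (the variable names are just for illustration):
# Estimated probabilities from the example.
p_a = 0.5          # P(A) = P(woman)
p_b = 0.4          # P(B) = P(long hair)
p_b_given_a = 0.7  # P(B|A) = P(long hair | woman)
# Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)
print(p_b_given_a * p_a / p_b)  # 0.875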
Bayesian Approach to Data Analysis
Assume that you have a sample of observations \(y_1,..., y_n\) of a random variable \(Y\sim f(y|\theta)\), where \(\theta\) is a parameter for the distribution. Here we consider \(\theta\) as a random variable as well. Following Bayes Theorem (its continuous version) we can write:
\[ f(\theta|y)=\frac{f(y|\theta)f(\theta)}{f(y)} = \displaystyle{\frac{f(y|\theta)f(\theta)}{\int f(y|\theta)f(\theta)d\theta}} \]
The function \(f(y|\theta)\) is called the likelihood.
\(f(\theta)\) is the prior distribution of \(\theta\).
Note that \(f(y)\) does not depend on \(\theta\) (just on the data), thus it can be considered as a “normalizing constant”. In addition, it is often the case that the integral above is not easy to compute. Nevertheless, it is enough to consider the relation:
\[ f(\theta|y) \propto \text{likelihood} \times \text{prior}. \]
(Here \(\propto\) denotes the proportionality relation)
Example: Poisson Data
In order to give a better sense of the relation above we are going to study a concrete example. Consider \(n\) samples of \(Y\sim \text{Pois}(\lambda)\). Recall that the Poisson distribution is given by:
\[ f(y_i|\lambda)=\frac{e^{-\lambda}\lambda^{y_i}}{y_i!} \]
where \(\lambda>0\). It is easy to verify that \(E(Y)=\lambda\) and \(Var(Y)=\lambda\). Parallel to the formal discussion, we are going to implement a numerical simulation:
Prepare Notebook
import numpy as np
import pandas as pd
import scipy.special as sp
import scipy.stats as ss
import pymc3 as pm
import arviz as az
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style(
    style='darkgrid',
    rc={'axes.facecolor': '.9', 'grid.color': '.8'}
)
sns.set_palette(palette='deep')
sns_c = sns.color_palette(palette='deep')
plt.rcParams['figure.figsize'] = [10, 7]
Generate Sample Data
Let us generate some sample data (for which we of course know the true parameter \(\lambda\)):
# We set a seed so that the results are reproducible.
SEED = 5
np.random.seed(SEED)
# number of samples.
n = 100
# true parameter.
lam_true = 2
# sample array.
y = np.random.poisson(lam=lam_true, size=n)
y
array([2, 4, 1, 0, 2, 2, 2, 2, 1, 1, 3, 2, 0, 1, 3, 3, 4, 2, 0, 0, 3, 6,
1, 2, 1, 2, 5, 2, 3, 0, 1, 3, 1, 4, 1, 2, 4, 0, 6, 4, 1, 2, 2, 0,
1, 2, 4, 4, 1, 3, 0, 3, 3, 2, 4, 2, 2, 1, 1, 2, 5, 2, 3, 0, 1, 1,
1, 3, 4, 1, 3, 4, 2, 1, 2, 4, 2, 2, 1, 0, 2, 2, 3, 0, 3, 3, 4, 2,
2, 1, 2, 1, 3, 0, 1, 0, 3, 3, 1, 2])
Let us plot the sample distribution:
# Histogram of the sample.
fig, ax = plt.subplots()
sns.countplot(x=y, color=sns_c[0], label=f'mean = {y.mean(): 0.3f}', ax=ax)
ax.legend(loc='upper right')
ax.set(title='Sample Distribution', xlabel='y');
Prior: Gamma Distribution
Let us consider a gamma prior distribution for the parameter \(\lambda \sim \Gamma(a,b)\). Recall that the density function for the gamma distribution is
\[ f(\lambda)=\frac{b^a}{\Gamma(a)}\lambda^{a-1} e^{-b\lambda} \]
where \(a>0\) is the shape parameter and \(b>0\) is the rate parameter.
The expected value and variance of the gamma distribution is
\[ E(\lambda)=\frac{a}{b} \quad \text{and} \quad Var(\lambda)=\frac{a}{b^2} \]
# Parameters of the prior gamma distribution.
a = 3 # shape
b = 1 # rate = 1/scale
x = np.linspace(start=0, stop=10, num=300)
fig, ax = plt.subplots()
ax.plot(x, ss.gamma.pdf(x, a=a, scale=1/b), color=sns_c[3])
ax.axvline(x=a/b, color=sns_c[4], linestyle='--', label='gamma mean')
ax.axvline(x=y.mean(), color=sns_c[1], linestyle='--', label='sample mean')
ax.legend()
ax.set(title=f'Gamma Density Function for a={a} and b={b}');
# Define the prior distribution.
prior = lambda x: ss.gamma.pdf(x, a=a, scale=1/b)
Likelihood
As the observations are independent the likelihood function is
\[ f(y|\lambda)=\prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} =\frac{e^{-n\lambda}\lambda^{\sum_{i=1}^n y_i}}{\prod_{i=1}^{n}y_i!} \]
# Define the likelihood function.
def likelihood(lam, y):
    factorials = np.apply_along_axis(
        lambda x: sp.gamma(x + 1),
        axis=0,
        arr=y
    )
    numerator = np.exp(- lam * y.size) * (lam ** (y.sum()))
    denominator = np.multiply.reduce(factorials)
    return numerator / denominator
Posterior distribution for \(\lambda\) up to a constant
As we are just interested in the structure of the posterior distribution, up to a constant, we see
\[\begin{align} f(\lambda|y)\propto & \: \text{likelihood} \times \text{prior}\\ \propto & \quad f(y|\lambda)f(\lambda)\\ \propto & \quad e^{-n\lambda}\lambda^{\sum_{i=1}^n y_i} \lambda^{a-1} e^{-b\lambda}\\ \propto & \quad \lambda^{\left(\sum_{i=1}^n y_i+a\right)-1} e^{-(n+b)\lambda}\\ \end{align}\]
# Define the posterior distribution.
# (up to a constant)
def posterior_up_to_constant(lam, y):
    return likelihood(lam=lam, y=y) * prior(lam)
Now we plot the prior and the (scaled) posterior distribution for the parameter \(\lambda\). We multiply the posterior distribution function by the amplitude factor \(2.0\times 10^{74}\) to make it visually comparable with the prior gamma distribution.
fig, ax = plt.subplots()
ax.plot(x, ss.gamma.pdf(x, a=a, scale=1/b), color=sns_c[3], label='prior')
ax.plot(
    x, 2.0e74 * posterior_up_to_constant(x, y), color=sns_c[2], label='posterior_up_to_constant'
)
ax.legend()
ax.set(xlabel=r'$\lambda$');
True posterior distribution for \(\lambda\)
In fact, as \(f(\lambda|y) \propto\: \lambda^{\left(\sum_{i=1}^n y_i+a\right)-1} e^{-(n+b)\lambda}\), one verifies that the posterior distribution is again a gamma distribution:
\[\begin{align} \lambda|y \sim \Gamma\left(\sum_{i=1}^n y_i+a,\: n+b\right) \end{align}\]
This means that the gamma and Poisson distribution form a conjugate pair.
def posterior(lam, y):
    shape = a + y.sum()
    rate = b + y.size
    return ss.gamma.pdf(lam, shape, scale=1/rate)
fig, ax = plt.subplots()
ax.plot(x, ss.gamma.pdf(x, a=a, scale=1/b), color=sns_c[3], label='prior')
ax.plot(x, posterior(x,y), color=sns_c[2], label='posterior')
ax.legend()
ax.set(title='Prior and Posterior Distributions', xlabel=r'$\lambda$');
We indeed see how the posterior distribution is concentrated around the true parameter \(\lambda=2\).
Note that the posterior mean is
\[\begin{align} \frac{\sum_{i=1}^n y_i+a}{n+b} = \frac{b}{n+b}\frac{a}{b}+\frac{n}{n+b}\frac{\sum_{i=1}^n y_i}{n} \end{align}\]
That is, it is a weighted average of the prior mean \(a/b\) and the sample average \(\bar{y}\). As \(n\) increases,
\[\begin{align} \lim_{n\rightarrow +\infty}\frac{b}{n+b}\frac{a}{b}+\frac{n}{n+b}\frac{\sum_{i=1}^n y_i}{n} = \bar{y}. \end{align}\]
# Posterior gamma parameters.
shape = a + y.sum()
rate = b + y.size
# Posterior mean.
print(f'Posterior Mean = {shape / rate: 0.3f}')
Posterior Mean = 2.069
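As a quick sanity check, the weighted-average decomposition above reproduces this value (an illustrative snippet using the quantities already defined):
# Weighted average of the prior mean a/b and the sample mean.
prior_weight = b / (n + b)
sample_weight = n / (n + b)
print(f'Weighted Average = {prior_weight * (a / b) + sample_weight * y.mean(): 0.3f}')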
Markov Chain Monte Carlo (MCMC) Approach
In the last example the posterior distribution was easy to identify. However, in practice this is not usually the case, and therefore, via Bayes Theorem, we would only know the posterior distribution up to a constant. This motivates the idea of using Monte Carlo simulation methods. How can we sample from a distribution that we do not know? The Metropolis–Hastings algorithm, explained next, is one approach to tackle this problem.
Metropolis–Hastings algorithm
Let \(\phi\) be a function that is proportional to the desired probability distribution \(f\).
Initialization:
Pick \(x_{0}\) to be the first sample, and choose an arbitrary probability density
\[\begin{equation} g(x_{n+1}| x_{n}) \end{equation}\]
that suggests a candidate for the next sample value \(x_{n+1}\). Assume \(g\) is symmetric.
For each iteration:
Generate a candidate \(x\) for the next sample by picking from the distribution \(g(x|x_n)\). Calculate the acceptance ratio
\[\begin{equation} \alpha := \frac{f(x)}{f(x_n)} = \frac{\phi(x)}{\phi(x_n)} \end{equation}\]
If \(\alpha \geq 1\), automatically accept the candidate by setting
\[\begin{equation} x_{n+1} = x. \end{equation}\]
Otherwise, accept the candidate with probability \(\alpha\). If the candidate is rejected, set
\[\begin{equation} x_{n+1} = x_{n}. \end{equation}\]
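To make these steps concrete, here is a minimal sketch of the algorithm in Python, applied to the Poisson example above. It reuses the posterior_up_to_constant function and the data y defined earlier; the helper name metropolis, the Gaussian proposal and the tuning values are just illustrative choices.
# Minimal Metropolis sampler with a symmetric Gaussian proposal (illustrative sketch).
def metropolis(phi, x0, n_steps=5000, proposal_sd=0.5, seed=SEED):
    rng = np.random.default_rng(seed)
    samples = np.empty(n_steps)
    x_current = x0
    for i in range(n_steps):
        # Propose a candidate from the symmetric proposal g(x | x_current).
        x_candidate = x_current + proposal_sd * rng.normal()
        # Acceptance ratio; phi only needs to be proportional to the target density.
        alpha = phi(x_candidate) / phi(x_current) if x_candidate > 0 else 0.0
        # Accept with probability min(1, alpha), otherwise keep the current state.
        x_current = x_candidate if rng.uniform() < alpha else x_current
        samples[i] = x_current
    return samples

# Sample lambda from its unnormalized posterior (discarding some burn-in samples).
chain = metropolis(lambda lam: posterior_up_to_constant(lam, y), x0=1.0)
print(f'Metropolis posterior mean = {chain[1000:].mean(): 0.3f}')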
Why does this algorithm solve the initial problem? The full explanation is beyond the scope of this post (some references are provided at the end). It relies on the following result.
Ergodic Theorem for Markov Chains
Theorem (Ergodic Theorem for Markov Chains) If \(\{x^{(1)} , x^{(2)} , . . .\}\) is an irreducible, aperiodic and recurrent Markov chain, then there is a unique probability distribution \(\pi\) such that as \(N\longrightarrow\infty\),
- \(P(x^{(N)} \in A) \longrightarrow \pi(A)\).
- \(\displaystyle{\frac{1}{N}\sum_{n=1}^{N} g(x^{(n)}) \longrightarrow \int g(x)\pi(x) dx }\).
Recall:
A Markov chain is said to be irreducible if it is possible to get to any state from any state.
A state \(n\) has period \(k\) if any return to state \(n\) must occur in multiples of \(k\) time steps.
If \(k=1\), then the state is said to be aperiodic.
A state \(n\) is said to be transient if, given that we start in state \(n\), there is a non-zero probability that we will never return to \(n\).
A state \(n\) is recurrent if it is not transient.
Remark: PyMC3 has many modern samplers which are in general better than the Metropolis–Hastings algorithm described above, like Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS). For more details I recommend checking Statistical Rethinking - Chapter 9 and Michael Betancourt’s blog.
PyMC3 Syntax
Now we perform a variant of MCMC simulation for the data described above. Note how easy it is to write the model from its mathematical description.
with pm.Model() as toy_model:
    # Define the prior of the parameter lambda.
    lam = pm.Gamma('lambda', alpha=a, beta=b)
    # Define the likelihood function.
    y_obs = pm.Poisson('y_obs', mu=lam, observed=y)
pm.model_to_graphviz(toy_model)
with toy_model:
    # Consider 4000 draws and 4 chains.
    trace = pm.sample(
        tune=1000,
        draws=4000,
        chains=4,
        cores=-1,
        return_inferencedata=True
    )
Let us plot the trace of the posterior distribution. We see 4 independent chains which do not seem correlated.
az.plot_trace(data=trace);
Let us get some diagnostics of the sampling:
az.summary(trace)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lambda | 2.071 | 0.144 | 1.804 | 2.343 | 0.002 | 0.002 | 3357.0 | 5453.0 | 1.0 |
- The inferred mean of the \(\lambda\) parameter is very close to the true value \(2\).
- The r_hat value denotes the Gelman-Rubin statistic, which indicates convergence of the chains. Values close to 1.0 mean convergence (see the direct computation below).
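If you want to inspect this statistic on its own, ArviZ also exposes it directly; a minimal call on the trace computed above:
# Compute the Gelman-Rubin statistic for all model parameters.
az.rhat(trace)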
We can plot the mean and highest density interval of the posterior distribution.
fig, ax = plt.subplots()
az.plot_posterior(data=trace, ax=ax);
Let us now do some posterior predictive checks, i.e. sample observations using the posterior distribution of the parameter \(\lambda\).
with toy_model:
    forecast_1 = az.from_pymc3(
        posterior_predictive=pm.sample_posterior_predictive(trace=trace)
    )

posterior_samples = forecast_1.posterior_predictive['y_obs'].values.flatten()
We can now compare the distributions:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))
sns.countplot(x=y, color=sns_c[0], ax=ax[0])
ax[0].set(title='Sample Distribution', xlabel='y', xlim=(-1, 10))
sns.countplot(x=posterior_samples, color=sns_c[2], ax=ax[1])
ax[1].set(title='Posterior Predictive Distribution', xlabel='y', xlim=(-1, 10));
The shapes of the distributions look very similar!
References and Further Reading
Here we provide some suggested references used in this post and for going deeper into the subject.