Conjugate Prior.

Bernoulli distribution and Beta prior

The likelihood function for a single observation \(x \in \{0, 1\}\) is

\[p(x \vert \theta) = \theta^x (1 - \theta)^{1-x}\]

The conjugate prior is the beta distribution, whose hyperparameters \(\alpha = (\alpha_1, \alpha_2)\) enter through the exponents \(\alpha_i - 1\):

\[p(\theta \vert \alpha) = K(\alpha) \theta^{\alpha_1 -1} (1- \theta)^{\alpha_2-1}\]

where the normalization factor \(K(\alpha)\) can be obtained analytically:

\[\left.\begin{aligned} K ( \alpha ) & = \left( \int \theta ^ { \alpha _ { 1} - 1} ( 1- \theta ) ^ { \alpha _ { 2} - 1} d \theta \right ) ^ { - 1} \\ & = \frac { \Gamma ( \alpha _ { 1} + \alpha _ { 2} ) } { \Gamma ( \alpha _ { 1} ) \Gamma ( \alpha _ { 2} ) } \end{aligned} \right.\]

Consider in particular \(N\) i.i.d. Bernoulli observations, \(\mathbf{x} = (x_1, \ldots, x_N)^\top\):

\[\begin{align} p ( \theta \vert \mathbf { x } ,\alpha ) &\propto \left( \prod _ { n = 1} ^ { N } \theta ^ { x _ { n } } ( 1- \theta ) ^ { 1- x _ { n } } \right) \theta ^ { \alpha _ { 1} - 1} ( 1- \theta ) ^ { \alpha _ { 2} - 1} \\ &= \theta^ {\sum _ { n = 1} ^ { N } x _ { n } + \alpha _ { 1} - 1} ( 1- \theta ) ^ { N - \sum _ { n = 1} ^ { N } x _ { n } + \alpha _ { 2} - 1}. \end{align}\]

That is, the posterior is \(\text{Beta}(\theta \vert \sum_{n=1}^N x_n + \alpha_1,\; N - \sum_{n=1}^N x_n + \alpha_2)\).

The mean of a generic \(\text{Beta}(\theta \vert \alpha_1, \alpha_2)\) distribution is

\[\mathbb{E}[\theta \vert \alpha] = \int \theta \cdot \text{Beta}(\theta \vert \alpha_1, \alpha_2) \text{d}\theta = \frac{\alpha_1}{\alpha_1 + \alpha_2}.\]

A similar calculation (\(\mathbb{E}[\theta^2 \vert \alpha]\)) yields the variance:

\[\operatorname{Var} [ \theta \vert \alpha ] = \frac { \alpha _ { 1} \alpha _ { 2} } { \left( \alpha _ { 1} + \alpha _ { 2} + 1\right) \left( \alpha _ { 1} + \alpha _ { 2} \right) ^ { 2} }.\]

Applying the results to \(\text{Beta}(\theta \vert \sum _ { n = 1} ^ { N } x _ { n } + \alpha _ { 1}, N - \sum _ { n = 1} ^ { N } x _ { n } + \alpha _ { 2})\) we obtain

\[\begin{align} \mathbb { E } [ \theta \vert \mathbf { x } ,\alpha ] &= \frac { \sum _ { n = 1} ^ { N } x _ { n } + \alpha _ { 1} } { N + \alpha _ { 1} + \alpha _ { 2} } \\ \operatorname{Var} [ \theta \vert \mathbf { x } ,\alpha ] &= \frac { \left( \sum _ { n = 1} ^ { N } x _ { n } + \alpha _ { 1} \right) \left( N - \sum _ { n = 1} ^ { N } x _ { n } + \alpha _ { 2} \right) } { \left( N + \alpha _ { 1} + \alpha _ { 2} + 1\right) \left( N + \alpha _ { 1} + \alpha _ { 2} \right) ^ { 2} }. \end{align}\]

The predictive probability of a new data point \(X\) is calculated as follows:

\[p(X = 1 \vert x, \alpha) = \int p(X=1 \vert \theta) p(\theta \vert x, \alpha) \text{d} \theta = \int \theta p(\theta \vert x, \alpha) \text{d} \theta = \mathbb{E}[\theta \vert x, \alpha].\]
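To make the update concrete, here is a minimal numerical sketch in Python (assuming NumPy and SciPy are available; the hyperparameters, sample size, and success probability below are illustrative, not taken from the text):

```python
import numpy as np
from scipy import stats

alpha1, alpha2 = 2.0, 3.0            # Beta prior hyperparameters (illustrative)
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=50)    # N i.i.d. Bernoulli observations

# Conjugate update: Beta(sum(x) + alpha1, N - sum(x) + alpha2)
a_post = x.sum() + alpha1
b_post = len(x) - x.sum() + alpha2
posterior = stats.beta(a_post, b_post)

print("posterior mean     :", posterior.mean())   # (sum(x)+a1)/(N+a1+a2)
print("posterior variance :", posterior.var())
print("predictive p(X=1|x):", posterior.mean())   # equals E[theta | x, alpha]
```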

Categorical distribution and Dirichlet prior

In this case the likelihood takes the form

\[p ( x \vert \theta ) = \theta _ { 1} ^ { x _ { 1} } \theta _ { 2} ^ { x _ { 2} } \dots \theta _ { K } ^ { x _ { K } }\]

where \(x_k \in \{0, 1\}\) and \(\sum_k x_k = 1\), i.e. \(x\) is a one-hot vector.

The conjugate prior is the Dirichlet distribution:

\[p ( \theta \vert \alpha ) = K ( \alpha ) \theta _ { 1} ^ { \alpha _ { 1} - 1} \theta _ { 2} ^ { \alpha _ { 2} - 1} \dots \theta _ { K } ^ { \alpha _ { K } - 1}\]

where \(\alpha_i > 0\) and

\[K ( \alpha ) = \frac { \Gamma \left( \sum _ { k = 1} ^ { K } \alpha _ { k } \right) } { \prod _ { k = 1} ^ { K } \Gamma \left( \alpha _ { k } \right) }\]

The posterior distribution given \(N\) i.i.d. observations \(x = (x_1, \ldots, x_N)^\top\) is

\[p ( \theta \vert x ,\alpha ) \quad \propto \quad \theta _ { 1} ^ { \sum _ { n = 1} ^ { N } x _ { n 1} + \alpha _ { 1} - 1} \theta _ { 2} ^ { \sum _ { n = 1} ^ { N } x _ { n 2} + \alpha _ { 2} - 1} \ldots \theta _ { K } ^ { \sum _ { n = 1} ^ { N } x _ { n K } + \alpha _ { K } - 1}\]

which is \(\text{Dir} (\theta \vert \sum_{n} x_n + \alpha)\), where \(x_n = (x_{n1}, \ldots, x_{nK})^\top\) is the one-hot encoding of the \(n\)-th observation.

The mean and variance are

\[\begin{align} \mathbb{E}\left[\theta_i \vert \alpha\right] &= \frac{\alpha_i}{\alpha_\bullet} \\ \operatorname{Var}\left[\theta_i \vert \alpha\right] &= \frac{\alpha_i \left(\alpha_\bullet - \alpha_i\right)}{\alpha_\bullet^2 \left(\alpha_\bullet + 1\right)} \end{align}\]

where \(\alpha_\bullet = \sum_{i} \alpha_i\).
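A similar sketch for the Dirichlet-Categorical case, again assuming NumPy and SciPy, with made-up counts and hyperparameters:

```python
import numpy as np
from scipy import stats

alpha = np.array([1.0, 2.0, 3.0])          # Dirichlet hyperparameters, K = 3
rng = np.random.default_rng(0)
labels = rng.choice(3, size=100, p=[0.2, 0.3, 0.5])
counts = np.bincount(labels, minlength=3)  # sum_n x_n in one-hot form

alpha_post = counts + alpha                # Dir(theta | sum_n x_n + alpha)
a_dot = alpha_post.sum()

print("posterior mean     :", alpha_post / a_dot)
print("posterior variance :",
      alpha_post * (a_dot - alpha_post) / (a_dot**2 * (a_dot + 1)))
print("SciPy check (mean) :", stats.dirichlet.mean(alpha_post))
```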

Poisson distribution and Gamma prior

Consider the Poisson distribution:

\[p ( x \vert \theta ) = \frac { \theta ^ { x } e ^ { - \theta } } { x ! }\]

The corresponding conjugate prior, the gamma distribution, has the same functional form in \(\theta\) as the likelihood:

\[p(\theta \vert \alpha) = K(\alpha) \theta^{\alpha_1 - 1} e^{-\alpha_2 \theta}\]

where the normalization factor \(K(\alpha) = \frac{\alpha_2 ^{\alpha_1}}{\Gamma(\alpha_1)}\).

The posterior distribution given \(N\) i.i.d. Poisson observations is:

\[p(\theta \vert \mathbf{x}, \alpha) \propto \theta^{\alpha_1 -1 + \sum_{n=1}^N x_n} e^{-(N+\alpha_2) \theta}\]

which is \(\text{Gamma}(\theta \vert \alpha_1 + \sum_n x_n, \alpha_2 + N)\).

The mean and variance are readily computed as:

\[\begin{align} \mathbb{E}[\theta \vert \alpha] &= \frac{\alpha_1}{\alpha_2} \\ \operatorname{Var}[\theta \vert \alpha] &= \frac{\alpha_1}{\alpha_2^2} \end{align}\]

In the distribution \(\text{Gamma}(\theta \vert \alpha_1, \alpha_2)\), \(\alpha_1\) is known as the shape parameter and \(\alpha_2\) is called the rate parameter (\(1/\alpha_2\) is called the scale parameter).
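A short sketch of the Gamma-Poisson update (NumPy/SciPy assumed; the true rate and prior hyperparameters are arbitrary choices). Note that SciPy's `gamma` is parameterized by shape and scale, so the rate must be inverted:

```python
import numpy as np
from scipy import stats

alpha1, alpha2 = 2.0, 1.0              # shape and rate of the Gamma prior
rng = np.random.default_rng(0)
x = rng.poisson(4.0, size=30)          # N i.i.d. Poisson observations

# Conjugate update: Gamma(alpha1 + sum(x), alpha2 + N)
shape_post = alpha1 + x.sum()
rate_post = alpha2 + len(x)

posterior = stats.gamma(a=shape_post, scale=1.0 / rate_post)
print("posterior mean    :", posterior.mean())   # shape_post / rate_post
print("posterior variance:", posterior.var())    # shape_post / rate_post**2
```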

Univariate Gaussian distribution and Normal-Gamma Priors

Recall that the Gaussian distribution is a two-parameter exponential family of the following form:

\[p(x \mid \mu, \sigma^2) \propto (\sigma^2)^{-1/2} \exp\left\{- \frac { 1} { 2\sigma ^ { 2} } \left( x - \mu \right) ^ { 2} \right\}\]

Conjugacy for the mean

Inspecting the density above, we see that the exponent is the negative of a quadratic form in \(\mu\). Thus we assume a Gaussian prior for the mean:

\[p \left( \mu \vert \mu _ { 0} ,\sigma _ { 0} ^ { 2} \right) \propto \left( \sigma _ { 0} ^ { 2} \right) ^ { - 1/ 2} \exp \left\{ - \frac { 1} { 2\sigma _ { 0} ^ { 2} } \left( \mu - \mu _ { 0} \right) ^ { 2} \right\}\]

where \(\mu_0\) and \(\sigma_0^2\) are the mean and variance of \(\mu\).

Before deriving the posterior distribution \(p(\mu \vert x)\), let us recall a useful property of jointly Gaussian random variables \(Z_1\) and \(Z_2\):

\[\begin{align} \mathbb { E } \left[ Z _ { 1} \vert Z _ { 2} \right] &= \mathbb { E } \left[ Z _ { 1} \right] + \frac { \operatorname{Cov} \left[ Z _ { 1} ,Z _ { 2} \right] } { \operatorname{Var} \left[ Z _ { 2} \right] } \left( Z _ { 2} - \mathbb { E } \left[ Z _ { 2} \right] \right) \\ \operatorname{Var} \left[ Z _ { 1} \vert Z _ { 2} \right] &= \operatorname{Var} \left[ Z _ { 1} \right] - \frac { \operatorname{Cov} ^ { 2} \left[ Z _ { 1} ,Z _ { 2} \right] } { \operatorname{Var} \left[ Z _ { 2} \right] } \end{align}\]

Suppose first that we have only one observation \(x\). We reparameterize \(X\) and \(\mu\) using independent standard normal random variables \(\epsilon, \delta \sim N(0, 1)\):

\[\left.\begin{array} { l } { X = \mu + \sigma \epsilon } \\ { \mu = \mu _ { 0} + \sigma _ { 0} \delta } \end{array} \right.\]

We can now easily calculate:

\[\begin{align} \mathbb{E}[X] &= \mathbb{E}[\mu] + \sigma\mathbb{E}[\epsilon] = \mu_0 \\ \mathbb{V}[X] &= \mathbb{V}[\mu] + \sigma^2 \mathbb{V}[\epsilon] = \sigma^2_0 + \sigma^2 \\ \text{Cov}(X, \mu) &= \mathbb { E } \left[ \left( X - \mu _ { 0} \right) \left( \mu - \mu _ { 0} \right) \right] = \mathbb { E } \left[ \left( \mu + \sigma \epsilon - \mu _ { 0} \right) \left( \mu - \mu _ { 0} \right) \right] = \sigma _ { 0} ^ { 2} \end{align}\]

Treating \(\mu\) as \(Z_1\) and \(X\) as \(Z_2\) we obtain:

\[\begin{align} \mathbb{E}[\mu\vert X = x ] &= \mu _ { 0} + \frac { \sigma _ { 0} ^ { 2} } { \sigma ^ { 2} + \sigma _ { 0} ^ { 2} } \left( x - \mu _ { 0} \right) = \frac { \sigma _ { 0} ^ { 2} } { \sigma ^ { 2} + \sigma _ { 0} ^ { 2} } x + \frac { \sigma ^ { 2} } { \sigma ^ { 2} + \sigma _ { 0} ^ { 2} } \mu _ { 0} \\ \text{Var}(\mu \vert X=x) &= \sigma _ { 0} ^ { 2} - \frac { \sigma _ { 0} ^ { 4} } { \sigma ^ { 2} + \sigma _ { 0} ^ { 2} } = \frac { \sigma ^ { 2}} { \sigma ^ { 2} + \sigma _ { 0} ^ { 2} } \sigma _ { 0} ^ { 2}. \end{align}\]

We can also express the results in terms of the precision. In particular, plugging \(\tau = 1/\sigma^2\) and \(\tau_0 = 1 / \sigma_0^2\) into the above equation yields:

\[\begin{align} \mathbb { E } [ \mu \vert X = x ] = \frac { \tau } { \tau + \tau _ { 0} } x + \frac { \tau _ { 0} } { \tau + \tau _ { 0} } \mu _ { 0} \end{align}\]

The posterior expectation is a convex combination of the observation \(x\) and the prior mean \(\mu_0\). More precisely, the precision of the data weights the data \(x\) and the precision of the prior weights the prior mean \(\mu_0\). If the precision of the data is large relative to the precision of the prior, then the posterior mean is closer to \(x\). Also note that the posterior precision is \(\tau_\text{post} = \tau + \tau_0\), which has a direct interpretation: precisions add.
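The precision-weighted form translates directly into a few lines of Python; the numbers below are arbitrary illustrations, not values from the text:

```python
sigma, sigma0, mu0 = 1.0, 2.0, 0.0   # likelihood sd, prior sd, prior mean
x = 1.5                              # a single observation

tau, tau0 = 1 / sigma**2, 1 / sigma0**2
post_mean = (tau * x + tau0 * mu0) / (tau + tau0)   # convex combination of x and mu0
post_prec = tau + tau0                              # precisions add
post_var = 1 / post_prec

print(post_mean, post_var)
# post_var equals sigma**2 * sigma0**2 / (sigma**2 + sigma0**2)
```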

Let us now consider the posterior distribution when there are multiple observed data points. The likelihood is

\[p ( \mathbf { x } \vert \mu ,\tau ) \propto \tau ^ { n/ 2} \exp\left \{ - \frac { \tau } { 2} \sum _ { i = 1} ^ { n } \left( x _ { i } - \mu \right) ^ { 2} \right\}\]

We rewrite the exponent using a standard trick:

\[\left.\begin{aligned} \sum _ { i = 1} ^ { n } \left( x _ { i } - \mu \right) ^ { 2} & = \sum _ { i = 1} ^ { n } \left( x _ { i } - \overline { x } + \overline { x } - \mu \right) ^ { 2} \\ &= \sum _ { i = 1} ^ { n } \left( x _ { i } - \overline { x } \right) ^ { 2} + n ( \overline { x } - \mu ) ^ { 2} \end{aligned} \right.\]
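Before using this identity, here is a quick numerical sanity check of the decomposition (NumPy assumed; the data and \(\mu\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x, mu = rng.normal(size=10), 0.7
lhs = np.sum((x - mu) ** 2)
rhs = np.sum((x - x.mean()) ** 2) + len(x) * (x.mean() - mu) ** 2
print(np.isclose(lhs, rhs))   # True
```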

The first term yields a constant factor, and we see the problem reduces to an equivalent problem involving only a single random variable (treating \(\bar{X}\) as the random variable), in which \(\bar{X} \sim N(\mu, \sigma^2/n)\):

\[\bar{X} = \mu + \frac{\sigma}{\sqrt{n}} \epsilon\]

and we have

\[\begin{align} \mathbb{E}[\mu \vert x_1,\ldots,x_n] &= \frac{\sigma_0^2}{\sigma^2/n + \sigma_0^2}\bar{x} + \frac{\sigma^2/n}{\sigma^2/n + \sigma_0^2}\mu_0 \\ \operatorname{Var}(\mu \vert x_1,\ldots,x_n) &= \frac{\sigma^2/n}{\sigma^2/n + \sigma_0^2}\sigma_0^2. \end{align}\]

In terms of precisions, \(\tau_\text{post} = n\tau + \tau_0\).

Now consider an unseen data point \(Y \sim N(\mu, \sigma^2)\). We reparameterize it as

\[\begin{align} Y &= \mu + \sigma \epsilon \\ \mu &= \mu_\text{post} + \sigma_\text{post}\delta \end{align}\]

\(Y\) is again a Gaussian random variable, as it is the sum of the Gaussian random variables \(\mu\) and \(\sigma\epsilon\); its predictive mean and variance are

\[\begin{align} \mathbb{E}[Y] &= \mu_\text{post} \\ \text{Var}(Y) &= \sigma^2_\text{post} + \sigma^2. \end{align}\]
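As a sketch, the \(n\)-observation posterior and the predictive distribution can be checked numerically (NumPy assumed; the data-generating values are illustrative):

```python
import numpy as np

sigma, sigma0, mu0 = 1.0, 2.0, 0.0
rng = np.random.default_rng(0)
x = rng.normal(1.5, sigma, size=20)   # n i.i.d. observations
n, xbar = len(x), x.mean()

# Same formulas as the single-observation case with x -> xbar, sigma^2 -> sigma^2 / n
s2n = sigma**2 / n
post_mean = (sigma0**2 * xbar + s2n * mu0) / (s2n + sigma0**2)
post_var = (s2n * sigma0**2) / (s2n + sigma0**2)   # = 1 / (n*tau + tau0)

# Predictive for a new point Y = mu + sigma * eps
pred_mean, pred_var = post_mean, post_var + sigma**2
print(post_mean, post_var, pred_mean, pred_var)
```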

Conjugacy for the variance

Let us now consider the Gaussian distribution with known mean \(\mu\),

\(p \left( \mathbf { x } \vert \mu ,\sigma ^ { 2} \right) \propto \left( \sigma ^ { 2} \right) ^ { a } e ^ { - b / \sigma ^ { 2} }\) where \(a = -n/2\) and \(b = \frac { 1} { 2} \sum _ { i = 1} ^ { n } \left( x _ { i } - \mu \right) ^ { 2}\). This has the flavor of a gamma distribution, but the random variable \(\sigma^2\) appears in the denominator of the exponent; this is the form of an inverse gamma distribution. We thus assume an inverse gamma prior for the variance:

\[p \left( \sigma ^ { 2} \vert \alpha ,\beta \right) = \frac { \beta ^ { \alpha } } { \Gamma ( \alpha ) } \left( \sigma ^ { 2} \right) ^ { - \alpha - 1} e ^ { - \beta / \sigma ^ { 2} }\]

The posterior distribution:

\[\begin{align} p \left( \sigma ^ { 2} \vert \mathbf { x } ,\mu ,\alpha ,\beta \right) \quad & \propto \quad p \left( \mathbf { x } \vert \mu ,\sigma ^ { 2} \right) p \left( \sigma ^ { 2} \vert \alpha ,\beta \right) \\ &\propto \quad \left( \sigma ^ { 2} \right) ^ { - n / 2} e ^ { - \frac { 1} { 2} \sum _ { i = 1} ^ { n } \left( x _ { i } - \mu \right) ^ { 2} / \sigma ^ { 2} } \left( \sigma ^ { 2} \right) ^ { - \alpha - 1} e ^ { - \beta / \sigma ^ { 2} }\\ &= \quad \left( \sigma ^ { 2} \right) ^ { - \left( \alpha + \frac { n } { 2} \right) - 1} e ^ { - \left( \beta + \frac { 1} { 2} \sum _ { i = 1} ^ { n } \left( x _ { i } - \mu \right) ^ { 2} \right) / \sigma ^ { 2} } \end{align}\]

which is an \(\text{Inv-Gamma}(\alpha + \frac{n}{2}, \beta + \frac { 1} { 2} \sum _ { i = 1} ^ { n } \left( x _ { i } - \mu \right) ^ { 2})\).

If we derive the posterior in terms of the precision, we obtain \(\tau \sim \text{Gamma}(\alpha + \frac{n}{2}, \beta + \frac { 1} { 2} \sum _ { i = 1} ^ { n } \left( x _ { i } - \mu \right) ^ { 2})\).

The predictive distribution for an unseen data point \(Y\):

\[\left.\begin{aligned} p \left( Y \vert \mathbf { x } ,\mu ,\alpha ,\beta \right) & = \int p \left( Y \vert \mathbf { x } ,\mu ,\tau \right) p ( \tau \vert \mathbf { x } ,\mu ,\alpha ,\beta ) d \tau \\ & = \int p \left(Y \vert \mu ,\tau \right) p ( \tau \vert \mathbf { x } ,\mu ,\alpha ,\beta ) d \tau \end{aligned} \right.\]

which turns out to be a t-distribution:

\[Y \sim \text{St} \left( \mu ,\frac { \alpha + n / 2} { \beta + \frac { 1} { 2} \sum _ { i = 1} ^ { n } \left( x _ { i } - \mu \right) ^ { 2} } ,2\alpha + n \right).\]
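A minimal sketch of the known-mean case, assuming NumPy and SciPy; the hyperparameters and data are made up, and SciPy's Student-t takes a scale rather than a precision, hence the square-root conversion:

```python
import numpy as np
from scipy import stats

mu, alpha, beta = 0.0, 2.0, 1.0        # known mean and prior hyperparameters
rng = np.random.default_rng(0)
x = rng.normal(mu, 1.5, size=40)
n = len(x)

alpha_post = alpha + n / 2
beta_post = beta + 0.5 * np.sum((x - mu) ** 2)

# Posterior over the precision tau (SciPy's gamma uses scale = 1 / rate)
tau_post = stats.gamma(a=alpha_post, scale=1.0 / beta_post)

# Predictive Student-t: location mu, precision alpha_post / beta_post, 2 * alpha_post df
pred = stats.t(df=2 * alpha_post, loc=mu, scale=np.sqrt(beta_post / alpha_post))
print("E[tau | x] =", tau_post.mean(), "  predictive sd =", pred.std())
```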

Conjugacy for the mean and variance

We make the following specifications:

\[\begin{align} X _ { i } &\sim N ( \mu ,\tau )\quad \quad i=1,\ldots,n \\ \mu &\sim N \left( \mu _ { 0} ,n _ { 0} \tau \right) \\ \tau &\sim \text{Gamma} ( \alpha ,\beta ) \end{align}\]

where the second parameter of \(N(\cdot, \cdot)\) denotes a precision and the \(X_i\) are assumed independent given \(\mu, \tau\). We refer to this prior as the normal-gamma distribution.

To compute the posterior distribution \(p(\mu, \tau \vert x)\), we first compute \(p(\mu \vert \tau, x)\) and then find \(p(\tau \vert x)\).

\[\begin{align} \tau &\sim p(\tau \vert x) \\ \mu &\sim p(\mu \vert \tau, x) \end{align}\]

When \(\tau\) is fixed, we are back in the setting of conjugacy for the mean and can simply use the results, plugging in \(n_0\tau\) in place of \(\tau_0\):

\[\left.\begin{aligned} \mathbb { E } [ \mu \vert \mathbf { x } ] & = \frac { n \tau } { n \tau + n _ { 0} \tau } \overline { x } + \frac { n _ { 0} \tau } { n \tau + n _ { 0} \tau } \mu _ { 0} \\ & = \frac { n } { n + n _ { 0} } \overline { x } + \frac { n _ { 0} } { n + n _ { 0} } \mu _ { 0} \end{aligned} \right.\]

and \(\tau_\text{post} = n\tau + n_0\tau = (n + n_0)\tau\).

We now work out the marginal posterior of \(\tau\). The joint posterior is

\[\begin{align} p\left(\tau, \mu \vert \mathbf{x}, \mu_0, n_0, \alpha, \beta\right) &\propto p(\tau \vert \alpha, \beta)\, p\left(\mu \vert \tau, \mu_0, n_0\right) p(\mathbf{x} \vert \mu, \tau) \\ &\propto \left(\tau^{\alpha - 1} e^{-\beta\tau}\right) \left(\tau^{1/2} e^{-\frac{n_0\tau}{2}\left(\mu - \mu_0\right)^2}\right) \left(\tau^{n/2} e^{-\frac{\tau}{2}\sum_{i=1}^n \left(x_i - \mu\right)^2}\right) \\ &\propto \tau^{\alpha + n/2 - 1} e^{-\left(\beta + \frac{1}{2}\sum_{i=1}^n \left(x_i - \overline{x}\right)^2\right)\tau}\, \tau^{1/2}\, e^{-\frac{\tau}{2}\left[n_0\left(\mu - \mu_0\right)^2 + n\left(\overline{x} - \mu\right)^2\right]} \end{align}\]

We want to integrate out \(\mu\). The last factor can be viewed as (proportional to) the joint density of the two Gaussian random variables \(\mu\) and \(\bar{X}\), which admit the following reparameterization:

\[\begin{aligned} \bar{X} &= \mu + \frac{1}{\sqrt{n\tau}}\epsilon \\ \mu &= \mu_0 + \frac{1}{\sqrt{n_0\tau}}\delta \end{aligned}\]

Thus integrating out \(\mu\) leaves the marginal distribution of \(\bar{X}\), which is still Gaussian, with

\[\mathbb{E}[\bar{X}] = \mu_0, \text{Var}(\bar{X}) = \frac { 1} { n \tau } + \frac { 1} { n _ { 0} \tau }.\]

The probability density function is

\[p(\bar{x}) \propto \tau^{1/2} \exp \left\{ - \frac { n n _ { 0} \tau } { 2\left( n + n _ { 0} \right) } \left( \overline { x } - \mu _ { 0} \right) ^ { 2} \right\}\]

In the end, we have

\[p\left(\tau \vert \mathbf{x}, \mu_0, n_0, \alpha, \beta\right) \propto \tau^{\alpha + n/2 - 1} \exp\left\{-\left(\beta + \frac{1}{2}\sum_{i=1}^n \left(x_i - \overline{x}\right)^2 + \frac{n n_0}{2\left(n + n_0\right)}\left(\overline{x} - \mu_0\right)^2\right)\tau\right\}\]

which is a gamma distribution \(\text{Gamma}(\alpha + n/2, \beta + \frac { 1} { 2} \sum _ { i = 1} ^ { n } \left( x _ { i } - \overline { x } \right) ^ { 2} + \frac { n n _ { 0}} { 2\left( n + n _ { 0} \right) } \left( \overline { x } - \mu _ { 0} \right)^2 )\).
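Finally, a sketch of sampling from the normal-gamma posterior in the order used above (first \(\tau\), then \(\mu \mid \tau\)); NumPy assumed, hyperparameters illustrative:

```python
import numpy as np

mu0, n0, alpha, beta = 0.0, 1.0, 2.0, 1.0
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.5, size=30)
n, xbar = len(x), x.mean()

# Marginal posterior of tau (the gamma distribution derived above)
alpha_post = alpha + n / 2
beta_post = (beta + 0.5 * np.sum((x - xbar) ** 2)
             + n * n0 * (xbar - mu0) ** 2 / (2 * (n + n0)))
tau = rng.gamma(alpha_post, 1.0 / beta_post, size=10_000)

# Conditional posterior of mu given tau: mean (n*xbar + n0*mu0)/(n + n0), precision (n + n0)*tau
mu_mean = (n * xbar + n0 * mu0) / (n + n0)
mu = rng.normal(mu_mean, 1.0 / np.sqrt((n + n0) * tau))

print("E[mu | x] ~", mu.mean(), "  E[tau | x] ~", tau.mean(), "=", alpha_post / beta_post)
```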