Conjugate Prior.

Bernoulli distribution and Beta prior

The likelihood function

$$p(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$$

The Beta distribution is parameterized using exponents $\alpha_i - 1$:

$$p(\theta \mid \alpha) = K(\alpha)\, \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_2 - 1}$$

where the normalization factor K(α) can be obtained analytically:

$$K(\alpha) = \left( \int \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_2 - 1}\, d\theta \right)^{-1} = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)}$$

Consider in particular $N$ i.i.d. Bernoulli observations, $x = (x_1, \ldots, x_N)$:

$$p(\theta \mid x, \alpha) \propto \left( \prod_{n=1}^{N} \theta^{x_n}(1-\theta)^{1-x_n} \right) \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_2 - 1} = \theta^{\sum_{n=1}^{N} x_n + \alpha_1 - 1}\,(1-\theta)^{N - \sum_{n=1}^{N} x_n + \alpha_2 - 1}.$$

That is, the posterior is a $\mathrm{Beta}\!\left(\theta \,\middle|\, \sum_{n=1}^{N} x_n + \alpha_1,\; N - \sum_{n=1}^{N} x_n + \alpha_2\right)$ distribution.

The mean of a $\mathrm{Beta}(\theta \mid \alpha_1, \alpha_2)$ distribution is

$$E[\theta \mid \alpha] = \int \theta\, \mathrm{Beta}(\theta \mid \alpha_1, \alpha_2)\, d\theta = \frac{\alpha_1}{\alpha_1 + \alpha_2}.$$

A similar calculation (of $E[\theta^2 \mid \alpha]$) yields the variance:

$$\mathrm{Var}[\theta \mid \alpha] = \frac{\alpha_1 \alpha_2}{(\alpha_1 + \alpha_2 + 1)(\alpha_1 + \alpha_2)^2}.$$

Applying these results to $\mathrm{Beta}\!\left(\theta \,\middle|\, \sum_{n=1}^{N} x_n + \alpha_1,\; N - \sum_{n=1}^{N} x_n + \alpha_2\right)$ we obtain

$$E[\theta \mid x, \alpha] = \frac{\sum_{n=1}^{N} x_n + \alpha_1}{N + \alpha_1 + \alpha_2}$$

$$\mathrm{Var}[\theta \mid x, \alpha] = \frac{\left(\sum_{n=1}^{N} x_n + \alpha_1\right)\left(N - \sum_{n=1}^{N} x_n + \alpha_2\right)}{(N + \alpha_1 + \alpha_2 + 1)(N + \alpha_1 + \alpha_2)^2}.$$

The predictive probability of a new data point $X$ is calculated as follows:

$$p(X = 1 \mid x, \alpha) = \int p(X = 1 \mid \theta)\, p(\theta \mid x, \alpha)\, d\theta = \int \theta\, p(\theta \mid x, \alpha)\, d\theta = E[\theta \mid x, \alpha].$$
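As a quick numerical sanity check, here is a minimal sketch of the Beta posterior update and the predictive probability; the prior hyperparameters and observations below are hypothetical, and `scipy.stats` is assumed to be available.

```python
import numpy as np
from scipy import stats

# Hypothetical prior hyperparameters and Bernoulli observations.
alpha1, alpha2 = 2.0, 2.0
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
N, s = len(x), x.sum()

# Posterior: Beta(sum(x_n) + alpha1, N - sum(x_n) + alpha2).
posterior = stats.beta(s + alpha1, N - s + alpha2)

# Closed-form posterior mean and variance from the formulas above.
post_mean = (s + alpha1) / (N + alpha1 + alpha2)
post_var = (s + alpha1) * (N - s + alpha2) / \
           ((N + alpha1 + alpha2 + 1) * (N + alpha1 + alpha2) ** 2)

print(np.isclose(posterior.mean(), post_mean))   # True
print(np.isclose(posterior.var(), post_var))     # True

# The predictive probability p(X = 1 | x, alpha) equals the posterior mean.
print("p(X=1 | x, alpha) =", post_mean)
```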

Categorical distribution and Dirichlet prior

In this case the likelihood takes the form

$$p(x \mid \theta) = \theta_1^{x_1} \theta_2^{x_2} \cdots \theta_K^{x_K}$$

where $x_k \in \{0, 1\}$ and $\sum_k x_k = 1$.

This yields the conjugate prior:

$$p(\theta \mid \alpha) = K(\alpha)\, \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \cdots \theta_K^{\alpha_K - 1}$$

where $\alpha_i > 0$ and

$$K(\alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}$$

The posterior distribution given $N$ i.i.d. observations $x = (x_1, \ldots, x_N)$ is

$$p(\theta \mid x, \alpha) \propto \theta_1^{\sum_{n=1}^{N} x_{n1} + \alpha_1 - 1}\, \theta_2^{\sum_{n=1}^{N} x_{n2} + \alpha_2 - 1} \cdots \theta_K^{\sum_{n=1}^{N} x_{nK} + \alpha_K - 1}$$

which is $\mathrm{Dir}\!\left(\theta \,\middle|\, \sum_n x_n + \alpha\right)$, where $x_n = (x_{n1}, \ldots, x_{nK})$.

The mean and variance are

$$E[\theta_i \mid \alpha] = \frac{\alpha_i}{\alpha_\cdot} \qquad \mathrm{Var}[\theta_i \mid \alpha] = \frac{\alpha_i(\alpha_\cdot - \alpha_i)}{\alpha_\cdot^2(\alpha_\cdot + 1)}$$

where $\alpha_\cdot = \sum_i \alpha_i$.
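A minimal sketch of the corresponding Dirichlet update; the hyperparameters and one-hot observations below are hypothetical.

```python
import numpy as np

# Hypothetical Dirichlet prior and one-hot categorical observations x_n.
alpha = np.array([1.0, 2.0, 3.0])
x = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1]])

# Posterior: Dir(theta | sum_n x_n + alpha).
alpha_post = x.sum(axis=0) + alpha

# Posterior mean and variance from the formulas above.
a_dot = alpha_post.sum()
post_mean = alpha_post / a_dot
post_var = alpha_post * (a_dot - alpha_post) / (a_dot ** 2 * (a_dot + 1))
print(alpha_post, post_mean, post_var)
```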

Poisson distribution and Gamma prior

Consider the Poisson distribution:

$$p(x \mid \theta) = \frac{\theta^{x} e^{-\theta}}{x!}$$

The corresponding conjugate prior retains the shape of the likelihood:

$$p(\theta \mid \alpha) = K(\alpha)\, \theta^{\alpha_1 - 1} e^{-\alpha_2 \theta}$$

where the normalization factor is $K(\alpha) = \frac{\alpha_2^{\alpha_1}}{\Gamma(\alpha_1)}$.

The posterior distribution given $N$ i.i.d. Poisson observations is:

$$p(\theta \mid x, \alpha) \propto \theta^{\alpha_1 - 1 + \sum_{n=1}^{N} x_n}\, e^{-(N + \alpha_2)\theta}$$

which is $\mathrm{Gamma}\!\left(\theta \,\middle|\, \alpha_1 + \sum_n x_n,\; \alpha_2 + N\right)$.

The mean and variance are readily computed as:

$$E[\theta \mid \alpha] = \frac{\alpha_1}{\alpha_2} \qquad \mathrm{Var}[\theta \mid \alpha] = \frac{\alpha_1}{\alpha_2^2}$$

In the distribution $\mathrm{Gamma}(\theta \mid \alpha_1, \alpha_2)$, $\alpha_1$ is known as the shape parameter and $\alpha_2$ is called the rate parameter ($1/\alpha_2$ is called the scale parameter).
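A sketch of the Gamma posterior update with hypothetical counts and hyperparameters; note that `scipy.stats.gamma` is parameterized by shape and scale, so the rate $\alpha_2 + N$ enters as its reciprocal.

```python
import numpy as np
from scipy import stats

# Hypothetical Gamma(alpha1, rate alpha2) prior and Poisson counts.
alpha1, alpha2 = 3.0, 1.0
x = np.array([2, 4, 1, 3, 0, 2])
N = len(x)

# Posterior: Gamma(alpha1 + sum(x_n), rate alpha2 + N).
shape_post, rate_post = alpha1 + x.sum(), alpha2 + N
posterior = stats.gamma(shape_post, scale=1.0 / rate_post)

# Matches the shape/rate formulas for the mean and variance.
print(np.isclose(posterior.mean(), shape_post / rate_post))      # True
print(np.isclose(posterior.var(), shape_post / rate_post ** 2))  # True
```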

Univariate Gaussian distribution and Normal-Gamma Priors

Recall that the Gaussian distribution is a two-parameter exponential family of the following form:

$$p(x \mid \mu, \sigma^2) \propto (\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\}$$

Conjugacy for the mean

Inspecting the above density form we see that the exponent is the negative of a quadratic form in μ. Thus we assume

$$p(\mu \mid \mu_0, \sigma_0^2) \propto (\sigma_0^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2 \right\}$$

where $\mu_0$ and $\sigma_0^2$ are the prior mean and variance of $\mu$.

Before deriving the posterior distribution $p(\mu \mid x)$, let us recall some useful properties of jointly Gaussian random variables $(Z_1, Z_2)$:

$$E[Z_1 \mid Z_2] = E[Z_1] + \frac{\mathrm{Cov}[Z_1, Z_2]}{\mathrm{Var}[Z_2]}\,(Z_2 - E[Z_2])$$

$$\mathrm{Var}[Z_1 \mid Z_2] = \mathrm{Var}[Z_1] - \frac{\mathrm{Cov}^2[Z_1, Z_2]}{\mathrm{Var}[Z_2]}$$
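These conditional-moment formulas can be checked empirically. The sketch below uses a hypothetical bivariate Gaussian and compares the formulas against the sample moments of $Z_1$ over draws with $Z_2$ near a fixed value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical jointly Gaussian pair (Z1, Z2).
mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])
z = rng.multivariate_normal(mean, cov, size=500_000)

# Conditional moments at Z2 = z2_star from the formulas above.
z2_star = -1.0
cond_mean = mean[0] + cov[0, 1] / cov[1, 1] * (z2_star - mean[1])
cond_var = cov[0, 0] - cov[0, 1] ** 2 / cov[1, 1]

# Empirical check: keep samples whose Z2 falls near z2_star.
near = np.abs(z[:, 1] - z2_star) < 0.02
print(cond_mean, z[near, 0].mean())
print(cond_var, z[near, 0].var())
```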

Suppose we have only one observation $x$. We reparameterize $X$ and $\mu$ using independent standard Normal random variables $\epsilon, \delta \sim N(0, 1)$ as

$$X = \mu + \sigma\epsilon \qquad \mu = \mu_0 + \sigma_0\delta$$

We can now easily calculate:

$$E[X] = E[\mu] + \sigma E[\epsilon] = \mu_0$$

$$\mathrm{Var}[X] = \mathrm{Var}[\mu] + \sigma^2 \mathrm{Var}[\epsilon] = \sigma_0^2 + \sigma^2$$

$$\mathrm{Cov}(X, \mu) = E[(X - \mu_0)(\mu - \mu_0)] = E[(\mu + \sigma\epsilon - \mu_0)(\mu - \mu_0)] = \sigma_0^2$$

Treating μ as Z1 and X as Z2 we obtain:

$$E[\mu \mid X = x] = \mu_0 + \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2}(x - \mu_0) = \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2}\,x + \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\,\mu_0$$

$$\mathrm{Var}(\mu \mid X = x) = \sigma_0^2 - \frac{\sigma_0^4}{\sigma^2 + \sigma_0^2} = \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\,\sigma_0^2.$$

We can also express the results in terms of the precision. In particular, plugging $\tau = 1/\sigma^2$ and $\tau_0 = 1/\sigma_0^2$ into the above equation yields:

$$E[\mu \mid X = x] = \frac{\tau}{\tau + \tau_0}\,x + \frac{\tau_0}{\tau + \tau_0}\,\mu_0$$

The posterior expectation is a convex combination of the observation $x$ and the prior mean $\mu_0$: the precision of the data multiplies $x$ and the precision of the prior multiplies $\mu_0$. If the precision of the data is large relative to the precision of the prior, the posterior mean is closer to $x$. Also note that the posterior precision is $\tau_{\mathrm{post}} = \tau + \tau_0$, which has a direct interpretation: precisions add.
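In code, the single-observation update in precision form is a couple of lines (all numbers below are hypothetical):

```python
# Hypothetical data precision tau = 1/sigma^2 and prior precision tau0 = 1/sigma0^2.
tau, tau0 = 1.0 / 4.0, 1.0 / 9.0
mu0, x = 0.0, 3.0

tau_post = tau + tau0                          # precisions add
mu_post = (tau * x + tau0 * mu0) / tau_post    # convex combination of x and mu0
print(mu_post, 1.0 / tau_post)                 # posterior mean and variance of mu
```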

Let us now consider the posterior distribution when multiple data points are observed. We have:

$$p(x \mid \mu, \tau) \propto \tau^{n/2} \exp\left\{ -\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right\}$$

We rewrite the exponent using a standard trick:

$$\sum_{i=1}^{n} (x_i - \mu)^2 = \sum_{i=1}^{n} (x_i - \bar{x} + \bar{x} - \mu)^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2$$

The first term yields a constant factor, and the problem reduces to an equivalent problem involving only a single random variable (treating $\bar{X}$ as the random variable), in which $\bar{X} \sim N(\mu, \sigma^2/n)$:

$$\bar{X} = \mu + \sqrt{\frac{\sigma^2}{n}}\,\epsilon$$

and we have

$$E[\mu \mid x_1, \ldots, x_n] = \frac{\sigma_0^2}{\sigma^2/n + \sigma_0^2}\,\bar{x} + \frac{\sigma^2/n}{\sigma^2/n + \sigma_0^2}\,\mu_0$$

$$\mathrm{Var}(\mu \mid x_1, \ldots, x_n) = \frac{\sigma^2/n}{\sigma^2/n + \sigma_0^2}\,\sigma_0^2.$$

In terms of precisions, $\tau_{\mathrm{post}} = n\tau + \tau_0$.

Now consider an unseen data point $Y \sim N(\mu, \sigma^2)$. We reparameterize it as

$$Y = \mu + \sigma\epsilon \qquad \mu = \mu_{\mathrm{post}} + \sigma_{\mathrm{post}}\delta$$

$Y$ is again a Gaussian random variable, being the sum of the independent Gaussian random variables $\mu$ and $\sigma\epsilon$; its predictive mean and variance are:

$$E[Y] = \mu_{\mathrm{post}} \qquad \mathrm{Var}(Y) = \sigma_{\mathrm{post}}^2 + \sigma^2.$$
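A short sketch putting the multi-observation posterior and the predictive moments together; all numbers below are hypothetical, and $\sigma^2$ is assumed known.

```python
import numpy as np

# Hypothetical known noise variance and N(mu0, sigma0^2) prior on mu.
sigma2, mu0, sigma02 = 1.5, 0.0, 4.0
x = np.array([0.8, 1.3, 0.5, 1.9, 1.1])
n, xbar = len(x), x.mean()

# Posterior for mu via the sufficient statistic xbar ~ N(mu, sigma^2 / n).
s2n = sigma2 / n
mu_post = (sigma02 * xbar + s2n * mu0) / (s2n + sigma02)
var_post = s2n * sigma02 / (s2n + sigma02)

# Predictive moments for an unseen point Y = mu + sigma * eps.
pred_mean, pred_var = mu_post, var_post + sigma2
print(mu_post, var_post, pred_mean, pred_var)
```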

Conjugacy for the variance

Let us now consider the Gaussian distribution with known mean $\mu$. For $n$ observations the likelihood takes the form

$$p(x \mid \mu, \sigma^2) \propto (\sigma^2)^{-a} e^{-b/\sigma^2}$$

where $a = n/2$ and $b = \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2$. This has the flavor of the gamma distribution, but the random variable $\sigma^2$ is in the denominator. This is actually an inverse gamma distribution. We thus assume the prior distribution for the variance is an inverse gamma distribution:

$$p(\sigma^2 \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,(\sigma^2)^{-\alpha - 1} e^{-\beta/\sigma^2}$$

The posterior distribution:

$$p(\sigma^2 \mid x, \mu, \alpha, \beta) \propto p(x \mid \mu, \sigma^2)\, p(\sigma^2 \mid \alpha, \beta) \propto (\sigma^2)^{-n/2} e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2/\sigma^2} \cdot (\sigma^2)^{-\alpha - 1} e^{-\beta/\sigma^2} = (\sigma^2)^{-(\alpha + \frac{n}{2}) - 1}\, e^{-\left(\beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2\right)/\sigma^2}$$

which is an $\mathrm{Inv\text{-}Gamma}\!\left(\alpha + \frac{n}{2},\; \beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2\right)$.

If we derive the posterior in terms of the precision, we obtain $\tau \sim \mathrm{Gamma}\!\left(\alpha + \frac{n}{2},\; \beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2\right)$.

The predictive distribution for an unseen data point $Y$ is:

$$p(Y \mid x, \mu, \alpha, \beta) = \int p(Y \mid x, \mu, \tau)\, p(\tau \mid x, \mu, \alpha, \beta)\, d\tau = \int p(Y \mid \mu, \tau)\, p(\tau \mid x, \mu, \alpha, \beta)\, d\tau$$

which turns out to be a Student-$t$ distribution with location $\mu$, precision $\frac{\alpha + n/2}{\beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2}$, and $2\alpha + n$ degrees of freedom:

$$Y \sim \mathrm{St}\!\left(\mu,\; \frac{\alpha + n/2}{\beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2},\; 2\alpha + n\right).$$
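The sketch below (hypothetical data, known mean) checks the inverse-gamma posterior and builds the Student-$t$ predictive with scipy, where a $t$ distribution with precision $\lambda$ corresponds to scale $\lambda^{-1/2}$.

```python
import numpy as np
from scipy import stats

# Hypothetical known mean, Inv-Gamma(alpha, beta) prior on sigma^2, and data.
mu, alpha, beta = 0.0, 2.0, 1.0
x = np.array([0.3, -0.8, 1.2, 0.1, -0.4])
n = len(x)
ss = 0.5 * np.sum((x - mu) ** 2)

# Posterior: sigma^2 ~ Inv-Gamma(alpha + n/2, beta + ss),
# equivalently tau ~ Gamma(alpha + n/2, rate beta + ss).
a_post, b_post = alpha + n / 2, beta + ss
sigma2_post = stats.invgamma(a_post, scale=b_post)
print(sigma2_post.mean())            # = b_post / (a_post - 1)

# Predictive: Student-t with 2*alpha + n dof, location mu, precision a_post / b_post.
predictive = stats.t(df=2 * alpha + n, loc=mu, scale=np.sqrt(b_post / a_post))
print(predictive.mean(), predictive.var())
```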

Conjugacy for the mean and variance

We make the following specifications:

$$X_i \sim N(\mu, \tau), \quad i = 1, \ldots, n$$

$$\mu \sim N(\mu_0, n_0\tau) \qquad \tau \sim \mathrm{Gamma}(\alpha, \beta)$$

where the $X_i$ are assumed independent given $\mu$ and $\tau$, and the second argument of $N(\cdot,\cdot)$ here denotes a precision. We refer to this prior as the normal-gamma distribution.

To compute the posterior distribution $p(\mu, \tau \mid x)$, we first compute $p(\mu \mid \tau, x)$ and then find $p(\tau \mid x)$.

$$\tau \sim p(\tau \mid x) \qquad \mu \sim p(\mu \mid \tau, x)$$

When $\tau$ is fixed, we are back in the setting of conjugacy for the mean and can simply reuse those results, plugging in $n_0\tau$ in place of $\tau_0$:

$$E[\mu \mid x, \tau] = \frac{n\tau}{n\tau + n_0\tau}\,\bar{x} + \frac{n_0\tau}{n\tau + n_0\tau}\,\mu_0 = \frac{n}{n + n_0}\,\bar{x} + \frac{n_0}{n + n_0}\,\mu_0$$

and $\tau_{\mathrm{post}} = n\tau + n_0\tau$.

We proceed to work out the marginal posterior of $\tau$:

$$p(\tau, \mu \mid x, \mu_0, n_0, \alpha, \beta) \propto p(\tau \mid \alpha, \beta)\, p(\mu \mid \tau, \mu_0, n_0)\, p(x \mid \mu, \tau)$$

$$\propto \left(\tau^{\alpha - 1} e^{-\beta\tau}\right)\left(\tau^{1/2} e^{-\frac{n_0\tau}{2}(\mu - \mu_0)^2}\right)\left(\tau^{n/2} e^{-\frac{\tau}{2}\sum_{i=1}^{n}(x_i - \mu)^2}\right)$$

$$\propto \tau^{\alpha + n/2 - 1} e^{-\left(\beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)\tau} \cdot \tau^{1/2} e^{-\frac{\tau}{2}\left[n_0(\mu - \mu_0)^2 + n(\bar{x} - \mu)^2\right]}$$

We want to integrate out $\mu$. Up to a factor of $\tau^{-1/2}$, the last factor can be viewed as the joint density (given $\tau$) of two Gaussian random variables $\mu$ and $\bar{X}$ that admit the following reparameterization:

$$\bar{X} = \mu + \sqrt{\frac{1}{n\tau}}\,\epsilon \qquad \mu = \mu_0 + \sqrt{\frac{1}{n_0\tau}}\,\delta$$

Thus integrating out $\mu$ leaves us with $\tau^{-1/2}$ times the marginal distribution of $\bar{X}$, which is still Gaussian with:

$$E[\bar{X}] = \mu_0, \qquad \mathrm{Var}(\bar{X}) = \frac{1}{n\tau} + \frac{1}{n_0\tau}.$$

The probability density function is

$$p(\bar{x}) \propto \tau^{1/2} \exp\left\{ -\frac{n n_0 \tau}{2(n + n_0)}(\bar{x} - \mu_0)^2 \right\}$$

Its leading $\tau^{1/2}$ cancels the $\tau^{-1/2}$ factor above, and in the end we have

$$p(\tau \mid x, \mu_0, n_0, \alpha, \beta) \propto \tau^{\alpha + n/2 - 1} \exp\left\{ -\left(\beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{n n_0}{2(n + n_0)}(\bar{x} - \mu_0)^2\right)\tau \right\}$$

which is a gamma distribution, $\mathrm{Gamma}\!\left(\alpha + \frac{n}{2},\; \beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{n n_0}{2(n + n_0)}(\bar{x} - \mu_0)^2\right)$.
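Finally, a sketch of the full normal-gamma update with hypothetical prior settings and data, computing the conditional posterior of $\mu$ given $\tau$ and the marginal Gamma posterior of $\tau$ derived above.

```python
import numpy as np

# Hypothetical normal-gamma prior (precision parameterization) and data.
mu0, n0, alpha, beta = 0.0, 1.0, 2.0, 1.0
x = np.array([1.1, 0.4, 1.8, 0.9, 1.3])
n, xbar = len(x), x.mean()

# Marginal posterior of tau: Gamma(alpha + n/2, rate b_post).
a_post = alpha + n / 2
b_post = (beta
          + 0.5 * np.sum((x - xbar) ** 2)
          + n * n0 / (2 * (n + n0)) * (xbar - mu0) ** 2)

# Conditional posterior of mu given tau: mean is a convex combination of xbar and mu0,
# and the conditional precision is (n + n0) * tau.
mu_post = (n * xbar + n0 * mu0) / (n + n0)

print(a_post, b_post, mu_post)
```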