Deep Variational Information Bottleneck @ ICLR 2017.

We regard the internal representation of some intermediate layer as a stochastic encoding \(Z\) of the input source \(X\), defined by a parametric encoder \(p(z \vert x; \theta)\). Our goal is to learn an encoding that is maximally informative about the target \(Y\), as measured by the mutual information \(I(Z, Y)\) between the encoding and the target. At the same time, we want the encoding to discard the information in \(X\) that is irrelevant to \(Y\). This suggests the objective:

\[\max_\theta I(Z, Y; \theta)\quad \mathrm{s.t.}\ I(Z, X; \theta) \leq I_c.\]

Introducing a Lagrange multiplier \(\beta \geq 0\), the constrained problem is equivalent to

\[\max_{\theta} \mathrm{R_{IB}}(\theta) := I (Z, Y; \theta) - \beta I(Z, X; \theta).\]

An intuitive interpretation is that we learn an encoding \(Z\) that is maximally expressive about \(Y\) while being maximally compressive about \(X\). The first term in \(\mathrm{R_{IB}}(\theta)\) encourages \(Z\) to be predictive of \(Y\); the second term encourages \(Z\) to forget \(X\). Essentially, it forces \(Z\) to act like a minimal sufficient statistic of \(X\) for predicting \(Y\).

In VIB, we make the following Markov assumption: \(Y \leftrightarrow X \leftrightarrow Z\), i.e., \(Z\) is conditionally independent of \(Y\) given \(X\), so the joint factorizes as \(p(x, y, z) = p(x)\, p(y \vert x)\, p(z \vert x)\).

Figure: the VIB Markov assumption.
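
To make the objective concrete, here is a minimal numeric sketch (toy distributions of my own choosing, not from the paper): under this factorization, both mutual informations, and hence \(\mathrm{R_{IB}}\), can be evaluated exactly for small discrete variables.

```python
# A toy evaluation of R_IB = I(Z, Y) - beta * I(Z, X) by exact enumeration.
# All probability tables below are made up for illustration.
import numpy as np

p_x = np.array([0.5, 0.3, 0.2])                      # p(x), three input symbols
p_y_given_x = np.array([[0.95, 0.05],                # p(y|x), two labels
                        [0.50, 0.50],
                        [0.10, 0.90]])
p_z_given_x = np.array([[0.9, 0.1],                  # stochastic encoder p(z|x), two codes
                        [0.4, 0.6],
                        [0.2, 0.8]])

# joint p(x, y, z) under the Markov assumption, shape (x, y, z) = (3, 2, 2)
p_xyz = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_x[:, None, :]

def mutual_info(p_ab):
    """I(A, B) in nats from a joint table p(a, b)."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])).sum())

I_zy = mutual_info(p_xyz.sum(axis=0).T)              # from p(z, y)
I_zx = mutual_info(p_xyz.sum(axis=1).T)              # from p(z, x)

beta = 0.1
print(f"I(Z,Y)={I_zy:.4f}  I(Z,X)={I_zx:.4f}  R_IB={I_zy - beta * I_zx:.4f}")
```

In the deep setting neither mutual information can be computed exactly like this, which is what the variational bounds below address.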

The lower bound of MI

\[\begin{aligned} I(Z, X) &= \int \int p(z, x) \log \frac{p(x\vert z)}{p(x)}\ dz\ dx \\ &= \int \int p(z, x) \log p(x \vert z)\ dz\ dx - \int \int p(z, x) \log p(x)\ dz\ dx \\ &= \int p(z) \int p(x\vert z) \log p(x\vert z)\ dx\ dz - \int p(x) \log p(x)\ dx \\ &\geq \int p(z) \int p(x\vert z) \log q(x\vert z)\ dx\ dz + H(X) \end{aligned}\]

The last inequality holds due to \(\int p(x\vert z) \log p(x\vert z)\ dx - \int p(x\vert z) \log q(x\vert z)\ dx = \mathrm{KL}[p(x\vert z) \| q(x\vert z)] \geq 0\).

Since \(H(X) \geq 0\) (which holds for discrete \(X\)), we have

\[I(Z, X) \geq \int \int p(x, z) \log q(x \vert z)\ dx\ dz\]
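
As a quick sanity check, here is a toy numeric verification of this bound (my own numbers, not from the paper); the code also checks the tighter version with the \(+H(X)\) term, which is attained when \(q(x \vert z) = p(x \vert z)\).

```python
# Toy check of I(Z, X) >= E_{p(x,z)}[log q(x|z)] (+ H(X)) on made-up tables.
import numpy as np

p_x = np.array([0.5, 0.3, 0.2])
p_z_given_x = np.array([[0.9, 0.1],
                        [0.4, 0.6],
                        [0.2, 0.8]])
p_xz = p_x[:, None] * p_z_given_x             # joint p(x, z), shape (3, 2)
p_z = p_xz.sum(axis=0)                        # marginal p(z)

I_zx = (p_xz * np.log(p_xz / (p_x[:, None] * p_z))).sum()
H_x = -(p_x * np.log(p_x)).sum()

# an arbitrary (imperfect) variational decoder q(x|z); each column sums to 1
q_x_given_z = np.array([[0.6, 0.2],
                        [0.3, 0.3],
                        [0.1, 0.5]])
bound = (p_xz * np.log(q_x_given_z)).sum()
print(f"I(Z,X) = {I_zx:.4f} >= {bound + H_x:.4f} (with H(X)) >= {bound:.4f}")

# with the true posterior q(x|z) = p(x|z), the +H(X) version is tight
p_x_given_z = p_xz / p_z
print(f"tight:   {(p_xz * np.log(p_x_given_z)).sum() + H_x:.4f}")
```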

Replacing \(X\) with \(Y\), we have

\[I(Z, Y) \geq \int \int p(y, z) \log q(y \vert z)\ dy\ dz\]

According to the Markov assumption we have

\[p(y, z) = \int p(y, z \vert x) p(x)\ dx = \int p(y\vert x) p(z\vert x) p(x)\ dx\]

which gives us a new lower bound

\[I(Z, Y) \geq \int \int p(y\vert x) p(z\vert x) p(x) \log q(y \vert z)\ dx\ dy\ dz\]
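
The same toy setup (my own numbers again) illustrates this bound: for any classifier \(q(y \vert z)\), the triple sum of \(p(x)\, p(y \vert x)\, p(z \vert x) \log q(y \vert z)\) sits below \(I(Z, Y)\); the slack contains both the dropped \(H(Y)\) and the KL gap of \(q\).

```python
# Toy check of I(Z, Y) >= sum_{x,y,z} p(x) p(y|x) p(z|x) log q(y|z).
import numpy as np

p_x = np.array([0.5, 0.3, 0.2])
p_y_given_x = np.array([[0.95, 0.05],
                        [0.50, 0.50],
                        [0.10, 0.90]])
p_z_given_x = np.array([[0.9, 0.1],
                        [0.4, 0.6],
                        [0.2, 0.8]])

p_xyz = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_x[:, None, :]
p_yz = p_xyz.sum(axis=0)                          # p(y, z), shape (2, 2)
p_y, p_z = p_yz.sum(axis=1), p_yz.sum(axis=0)
I_zy = (p_yz * np.log(p_yz / np.outer(p_y, p_z))).sum()

# an arbitrary variational classifier q(y|z); rows indexed by z, each row sums to 1
q_y_given_z = np.array([[0.8, 0.2],
                        [0.3, 0.7]])
# sum_{x,y,z} p(x,y,z) log q(y|z); note q depends on (y, z) only
lower = (p_xyz * np.log(q_y_given_z.T[None, :, :])).sum()

H_y = -(p_y * np.log(p_y)).sum()
print(f"I(Z,Y) = {I_zy:.4f} >= {lower:.4f}; adding back H(Y) gives {lower + H_y:.4f}")
```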

The upper bound of MI

\[\begin{aligned} I(Z, X) &= \int \int p(z, x) \log \frac{p(z\vert x)}{p(z)}\ dz\ dx \\ &= \int \int p(z, x) \log p(z \vert x)\ dz\ dx - \int \int p(z, x) \log p(z)\ dz\ dx \\ \end{aligned}\]

Since \(\mathrm{KL}[p(z\vert x) \| r(z)] \geq 0\), we have \(\int p(z\vert x) \log p(z\vert x)\ dz \geq \int p(z\vert x) \log r(z)\ dz\). Writing \(\log p(z\vert x) = \log p(z, x) - \log p(x)\), this gives \(\int p(z\vert x) \log p(z, x)\ dz \geq \int p(z\vert x) \log r(z)\ dz + \log p(x)\). Since \(p(z, x) \leq p(z)\) (again for discrete \(X\)), it follows that \(\int p(z\vert x) \log p(z)\ dz \geq \int p(z\vert x) \log r(z)\ dz + \log p(x)\), and hence we have

\[\begin{aligned} I(Z, X) &\leq \int \int p(z, x) \log p(z \vert x)\ dz\ dx - \int p(x) \left[\int p(z\vert x) \log r(z)\ dz + \log p(x) \right] \ dx \\ &= \int \int p(z, x) \log p(z \vert x)\ dz\ dx - \int \int p(x) p(z\vert x) \log r(z)\ dz\ dx + H(X)\\ &= \int \int p(z, x) \log \frac{p(z\vert x)}{r(z)}\ dz\ dx + H(X)\\ &= \mathbb{E}_{p(x)} [\mathrm{KL}[p(z\vert x) \| r(z)]] + H(X) \end{aligned}\]

\(H(X)\) is a constant w.r.t. the parameters we aim to optimize, and thus we have the upper bound,

\[I(Z, X) \leq \mathbb{E}_{p(x)} [\mathrm{KL}[p(z\vert x) \| r(z)]] + \mathrm{const}\]
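
Here is a toy numeric check of this upper bound (again my own numbers, not from the paper). In this example the KL term alone already upper-bounds \(I(Z, X)\), which is consistent with the remark below.

```python
# Toy check of I(Z, X) <= E_{p(x)} KL[p(z|x) || r(z)] + H(X) for an arbitrary r(z).
import numpy as np

p_x = np.array([0.5, 0.3, 0.2])
p_z_given_x = np.array([[0.9, 0.1],
                        [0.4, 0.6],
                        [0.2, 0.8]])
p_xz = p_x[:, None] * p_z_given_x
p_z = p_xz.sum(axis=0)

I_zx = (p_xz * np.log(p_xz / (p_x[:, None] * p_z))).sum()
H_x = -(p_x * np.log(p_x)).sum()

r_z = np.array([0.5, 0.5])                    # an arbitrary variational marginal r(z)
kl_per_x = (p_z_given_x * np.log(p_z_given_x / r_z)).sum(axis=1)
upper = (p_x * kl_per_x).sum()                # E_{p(x)} KL[p(z|x) || r(z)]

print(f"I(Z,X) = {I_zx:.4f} <= {upper:.4f} <= {upper + H_x:.4f} (with the +H(X) slack)")
```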

Remark: In the original paper, the authors derive the tighter bound

\[I(Z, X) \leq \int \int p(z, x) \log \frac{p(z\vert x)}{r(z)}\ dz\ dx\]

At first glance this seems to require \(\int p(z\vert x) \log p(z)\ dz \geq \int p(z\vert x) \log r(z)\ dz\) for every \(x\), which does not follow from \(\int p(z) \log p(z)\ dz \geq \int p(z) \log r(z)\ dz\) alone. However, the inequality is only needed after integrating over \(x\): since \(\int \int p(z, x) \log p(z)\ dz\ dx = \int p(z) \log p(z)\ dz \geq \int p(z) \log r(z)\ dz = \int \int p(z, x) \log r(z)\ dz\ dx\), the paper's bound does hold. It is tighter than the bound above by the constant \(H(X)\), so both lead to the same optimization problem.

Lower Bound of \(\mathrm{R_{IB}}\)

Combining the two bounds, we have

\[\begin{aligned} I(Z, Y) - \beta I(Z, X) \geq & \int dx\ dy\ dz\ p(x) p(y \vert x) p(z \vert x) \log q(y \vert z) \\ & - \beta \int dx\ dz\ p(x) p(z \vert x) \log \frac{p(z \vert x)}{r(z)} = L(\theta) \end{aligned}\]

We can approximate \(p(x, y)\) using the empirical data distribution \(p(x, y) = \frac{1}{N} \sum_{n=1}^{N} \delta_{x_n}(x)\, \delta_{y_n}(y)\), and hence we can write

\[\begin{aligned} L(\theta) &\approx \frac{1}{N} \sum_{n=1}^{N} \left[ \int dz\ p(z \vert x_n) \log q(y_n \vert z) - \beta \int dz\ p(z \vert x_n) \log \frac{p(z \vert x_n)}{r(z)} \right] \\ &= \frac{1}{N} \sum_{n=1}^{N} \left[ \int dz\ p(z \vert x_n) \log q(y_n \vert z) - \beta\, \mathrm{KL}[p(Z \vert x_n) \| r(Z)] \right] \end{aligned}\]
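
In the instantiation used in the paper, the encoder is a diagonal Gaussian, \(p(z \vert x) = \mathcal{N}(\mu(x), \operatorname{diag}(\sigma^2(x)))\), and the variational marginal is a standard normal, \(r(z) = \mathcal{N}(0, I)\). In that case the KL term has the standard closed form

\[\mathrm{KL}[p(Z \vert x_n) \| r(Z)] = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j(x_n)^2 + \sigma_j(x_n)^2 - 1 - \log \sigma_j(x_n)^2 \right),\]

so no sampling is needed for the second term.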

Then we can use the reparameterization trick to write

\[p(z \vert x)dz = p(\epsilon)d\epsilon\]

where \(z = f(x, \epsilon)\) is a deterministic function of \(x\) and \(\epsilon\) that is invertible in \(\epsilon\) for fixed \(x\), with \(\epsilon \sim \mathrm{Normal}(0, 1)\), e.g., \(z = \mu(x) + \sigma(x) \epsilon\).

Remark: The equality expresses conservation of probability mass under the change of variables \(z = f(x, \epsilon)\): the probability mass in an infinitesimal volume around a point is unchanged by the transformation.
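
A minimal sketch of the trick for the Gaussian example above (illustrative numbers of my own): samples built as \(z = \mu(x) + \sigma(x)\epsilon\) with \(\epsilon \sim \mathrm{Normal}(0, 1)\) have exactly the distribution \(\mathcal{N}(\mu(x), \sigma(x)^2)\), while \(\mu\) and \(\sigma\) stay inside the deterministic computation, so gradients can pass through them.

```python
# Reparameterization: z = mu + sigma * eps reproduces N(mu, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.3, 0.7                     # pretend encoder outputs for one fixed x
eps = rng.standard_normal(100_000)       # eps ~ Normal(0, 1)
z = mu + sigma * eps                     # samples from p(z|x)

print(f"sample mean {z.mean():.3f} (target {mu}), sample std {z.std():.3f} (target {sigma})")
```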

We can put everything together to get the following objective function, which we try to minimize:

\[J_{IB} = \frac{1}{N} \sum_{n=1}^{N} \left\{ \mathbb{E}_{\epsilon \sim p(\epsilon)} \left[ -\log q(y_n \vert f(x_n, \epsilon)) \right] + \beta\, \mathrm{KL}[p(Z \vert x_n) \| r(Z)] \right\}.\]
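
To close the loop, here is a sketch of \(J_{IB}\) in PyTorch for a diagonal-Gaussian encoder and a softmax decoder. This is my own minimal reading of the objective, not the authors' code: the layer sizes, the single-sample Monte Carlo estimate of the expectation, \(\beta = 10^{-3}\), and \(r(z) = \mathcal{N}(0, I)\) are implementation choices.

```python
# A minimal VIB loss: cross-entropy (the -log q(y_n | f(x_n, eps)) term, one
# eps sample) plus beta times the closed-form Gaussian KL to r(z) = N(0, I).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIB(nn.Module):
    def __init__(self, in_dim=784, z_dim=32, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * z_dim))  # -> (mu, log sigma^2)
        self.decoder = nn.Linear(z_dim, n_classes)               # q(y|z)

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.decoder(z), mu, logvar

def vib_loss(logits, y, mu, logvar, beta=1e-3):
    ce = F.cross_entropy(logits, y)                              # E_eps[-log q(y|z)], 1 sample
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
    return ce + beta * kl

# usage on a random batch
model = VIB()
x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
logits, mu, logvar = model(x)
loss = vib_loss(logits, y, mu, logvar)
loss.backward()
print(float(loss))
```

Minimizing this loss over minibatches is a Monte Carlo approximation of maximizing \(L(\theta)\) above; more \(\epsilon\) samples per example would reduce the variance of the first term at extra cost.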