Mutual Information Neural Estimation (ICML 2018)

Mutual Information

Put simply, mutual information quantifies the dependence between two random variables \(X\) and \(Z\). It has the form

\[I(X; Z) = \int_{\mathcal{X} \times \mathcal{Z}} \log \frac{d\mathbb{P}_{XZ}}{d\left(\mathbb{P}_{X} \otimes \mathbb{P}_{Z}\right)} \, d\mathbb{P}_{XZ} = D_{KL}\left( \mathbb{P}_{XZ} \,\|\, \mathbb{P}_{X} \otimes \mathbb{P}_{Z} \right)\]

where \(\mathbb{P}_{XZ}\) is the joint probability distribution, and \(\mathbb{P}_{X} = \int_{\mathcal{Z}} d\mathbb{P}_{XZ}\) and \(\mathbb{P}_{Z} = \int_{\mathcal{X}} d\mathbb{P}_{XZ}\) are the marginals. In contrast to correlation, mutual information captures non-linear statistical dependencies between variables, and can therefore act as a measure of true dependence. The mutual information between \(X\) and \(Z\) can be understood as the reduction in the uncertainty of \(X\) given \(Z\):

\[I ( X ; Z ) : = H ( X ) - H ( X \vert Z )\]

Despite being a pivotal quantity across data science, mutual information has historically been difficult to compute. Exact computation is generally intractable in high-dimensional settings because the marginal \(\mathbb{P}_{X} = \int_{\mathcal{Z}} d\mathbb{P}_{XZ}\) is itself intractable. The decomposition \(I(X; Z) = H(X) - H(X \vert Z)\) above can be verified directly:

\[\begin{aligned} \int \int p(x, z) \log \frac{p(x, z)}{p(x) p(z)} \,dx\, dz &= \int \int p(x\vert z) p(z) \log \frac{p(x\vert z)}{p(x)} \,dx\, dz \\ &= \int \int p(z) p(x\vert z) \log p(x \vert z) \,dx\, dz - \int p(x) \log p(x) \,dx \\ &= -\int p(z) H(X \vert Z = z) \,dz + H(X)\\ &= -H(X\vert Z) + H(X) \end{aligned}\]
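As a quick sanity check of this decomposition, here is a minimal numerical sketch on a toy discrete joint distribution (the probability table is an arbitrary choice):

```python
import numpy as np

# Toy joint distribution p(x, z) on a 3x2 grid (rows index x, columns index z).
p_xz = np.array([[0.20, 0.10],
                 [0.05, 0.25],
                 [0.25, 0.15]])
p_x = p_xz.sum(axis=1)  # marginal p(x)
p_z = p_xz.sum(axis=0)  # marginal p(z)

# Direct definition: I(X;Z) = sum_{x,z} p(x,z) log[ p(x,z) / (p(x) p(z)) ]
mi_direct = np.sum(p_xz * np.log(p_xz / np.outer(p_x, p_z)))

# Entropy decomposition: I(X;Z) = H(X) - H(X|Z)
h_x = -np.sum(p_x * np.log(p_x))
h_x_given_z = -np.sum(p_xz * np.log(p_xz / p_z))  # p(x|z) = p(x,z) / p(z)
mi_decomposed = h_x - h_x_given_z

assert np.isclose(mi_direct, mi_decomposed)
print(mi_direct)  # ~0.10 nats for this table
```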

Remark:

  • The support of \(\mathbb { P } _ { X } \otimes \mathbb { P } _ { Z }\) includes the support of \(\mathbb { P } _ { X Z }\), so \(I ( X ; Z )\) is always defined.
  • Correlation measures the linear association between two random variables, and it has no obligation to detect any other form of dependence. For instance, let \(\mathbb{P}(X=x) = 1/3\) for \(x \in \{-1, 0, 1\}\) and \(Y = X^2\). Then \(\mathbb{E}[X] = 0\), \(\mathbb{E}[Y] = 2/3\), and \(\mathbb{E}[XY] = \mathbb{E}[X^3] = 0\), so \(\mathrm{Cov}(X, Y) = 0\) even though \(Y\) is a deterministic function of \(X\) (see the sketch below).
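The same example can be checked numerically; a minimal sketch (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([-1, 0, 1], size=100_000)  # X uniform on {-1, 0, 1}
y = x ** 2                                 # Y is a deterministic function of X

# The sample correlation is ~0, so correlation sees no association...
print(np.corrcoef(x, y)[0, 1])

# ...but I(X;Y) = H(Y) - H(Y|X) = H(Y) > 0, since H(Y|X) = 0 here.
h_y = -(1/3) * np.log(1/3) - (2/3) * np.log(2/3)  # p(Y=0)=1/3, p(Y=1)=2/3
print(h_y)  # ~0.64 nats
```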

Idea

The method proposed in the paper is built on the Donsker-Varadhan representation of KL-divergence.

Theorem 1 (Donsker-Varadhan representation). Let \(\mathbb{P}\) and \(\mathbb{Q}\) be two distributions on a compact domain \(\Omega \subset \mathbb{R}^d\), with \(\mathbb{P}\) absolutely continuous with respect to \(\mathbb{Q}\). Then the KL-divergence admits the following dual representation:

\[D _ { K L } ( \mathbb { P } \| \mathbb { Q } ) = \sup _ { T : \Omega \rightarrow \mathbb { R } } \mathbb { E } _ { \mathbb { P } } [ T ] - \log \left( \mathbb { E } _ { \mathbb { Q } } \left[ e ^ { T } \right] \right)\]

where the supremum is taken over all functions \(T: \Omega \rightarrow \mathbb{R}\) such that both expectations are finite.

Remark:

  • A measure \(\mu\) on the Borel subsets of \(\mathbb{R}^d\) is absolutely continuous with respect to the Lebesgue measure \(\lambda\) if for every measurable set \(A\), \(\lambda(A) = 0\) implies \(\mu(A) = 0\). Equivalently, \(\mu(A) > 0\) implies \(\lambda(A) > 0\); that is, the support of \(\lambda\) includes the support of \(\mu\). This is written as \(\mu \ll \lambda\).
  • If \(\mathbb{P}\) and \(\mathbb{Q}\) are both absolutely continuous with respect to a reference measure \(\lambda\) on \(\Omega\), then their densities \(p\) and \(q\) satisfy \(\mathrm{d}\mathbb{P} = p \,\mathrm{d}\lambda\) and \(\mathrm{d}\mathbb{Q} = q \,\mathrm{d}\lambda\).
  • \(\log \left( \mathbb{E}_{\mathbb{Q}}\left[ e^{T} \right] \right) = \log \int_\Omega e^{T(x)} \,\mathrm{d}\mathbb{Q}(x)\) is in fact a continuous analogue of the log-sum-exp.
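Before moving on, the theorem can be sanity-checked numerically for discrete distributions, where the optimal function is \(T^* = \log \frac{d\mathbb{P}}{d\mathbb{Q}}\) up to an additive constant (a minimal sketch; the two distributions are arbitrary toy choices):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # toy P
q = np.array([0.2, 0.3, 0.5])  # toy Q, with P absolutely continuous w.r.t. Q

kl = np.sum(p * np.log(p / q))  # D_KL(P || Q)

def dv_objective(t):
    """Donsker-Varadhan objective E_P[T] - log E_Q[e^T] for a critic t."""
    return np.sum(p * t) - np.log(np.sum(q * np.exp(t)))

t_star = np.log(p / q)  # optimal critic, unique up to an additive constant
assert np.isclose(dv_objective(t_star), kl)        # supremum attains the KL
assert np.isclose(dv_objective(t_star + 7.0), kl)  # invariant to constant shifts
assert dv_objective(np.zeros(3)) <= kl             # any other T gives a lower bound
```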

Insight: restricting the supremum to any subset of functions turns the right-hand side into a lower bound on the KL-divergence \(D_{KL}(\mathbb{P} \| \mathbb{Q})\). The idea is to choose \(\mathcal{F}\) to be the family of functions \(T_{\theta}: \mathcal{X} \times \mathcal{Z} \rightarrow \mathbb{R}\) parameterized by a deep neural network with parameters \(\theta \in \Theta\).

Neural information measure: \(I_{\Theta}(X,Z)\) is defined as

\[I _ { \Theta } ( X , Z ) = \sup _ { \theta \in \Theta } \mathbb { E } _ { \mathbb { P } _ { X Z } } \left[ T _ { \theta } \right] - \log \left( \mathbb { E } _ { \mathbb { P }_{X} \otimes \mathbb { P } _ { Z } } \left[ e ^ { T _ { \theta } } \right] \right)\]

Note that \(I_{\Theta}(X,Z)\) is a lower bound of \(I(X; Z)\), and it can be estimated from empirical samples. In what follows, given a distribution \(\mathbb{P}\), we denote by \(\hat{\mathbb{P}}^{(n)}\) the empirical distribution associated with \(n\) i.i.d. samples.
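Concretely, the statistics network and the empirical bound can be sketched in PyTorch as follows (a minimal sketch: the architecture, hidden width, and the within-batch shuffle are illustrative choices, not prescribed by the paper):

```python
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """T_theta(x, z): maps an (x, z) pair to a real-valued score."""
    def __init__(self, x_dim, z_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dv_lower_bound(T, x, z):
    """Empirical DV bound: E_{P_XZ}[T] - log E_{P_X x P_Z}[e^T].
    Paired (x, z) rows are joint samples; shuffling z within the batch is a
    common way to approximate samples from the product of the marginals."""
    z_marginal = z[torch.randperm(z.size(0))]
    joint_term = T(x, z).mean()
    marginal_term = torch.logsumexp(T(x, z_marginal), dim=0) - math.log(x.size(0))
    return joint_term - marginal_term
```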

Details on the implementation of MINE are provided in Algorithm 1.

Figure: the MINE algorithm (Algorithm 1 in the paper).

Remark:

  • We could simply drop \(x\) or \(z\) from samples \((x, z)\) drawn from \(\mathbb{P}_{XZ}\) to obtain marginal samples. However, to estimate the second term \(\log \left( \mathbb{E}_{\mathbb{P}_{X} \otimes \mathbb{P}_{Z}} \left[ e^{T_{\theta}} \right] \right)\) we need to reasonably guarantee that the samples \(x^{(i)}\) and \(\bar{z}^{(i)}\) are independent; thus we re-draw \(n\) samples from the marginal \(\mathbb{P}_Z\) in the second line of the algorithm.

  • The gradient estimator \(\hat{G} = \frac{1}{b} \sum_{i=1}^{b}\nabla_\theta T_\theta (x^{(i)}, z^{(i)}) - \frac{\frac{1}{b} \sum_{i=1}^{b}\nabla_\theta T_\theta (x^{(i)}, \bar{z}^{(i)}) e^{T_\theta(x^{(i)}, \bar{z}^{(i)})}}{\frac{1}{b} \sum_{i=1}^{b}e^{T_\theta(x^{(i)}, \bar{z}^{(i)})} }\) is a biased estimate of the true gradient because of the sample average in the denominator of the second term.

    To see this, take the expectation on both sides; the equality

    \[\mathbb{E}[\hat{G}] = \mathbb { E } _ { \mathbb { P } _ { X Z } } \left[ \nabla_\theta T _ { \theta } \right] - \frac{\mathbb { E } _ { \mathbb { P }_{X} \otimes \mathbb { P } _ { Z } } \left[ e ^ { T _ { \theta } } \nabla_\theta T _ { \theta } \right]}{\mathbb { E } _ { \mathbb { P }_{X} \otimes \mathbb { P } _ { Z } } \left[ e ^ { T _ { \theta } } \right]}\]

    holds only if the denominator \(\frac{1}{b} \sum_{i=1}^{b}e^{T_\theta(x^{(i)}, \bar{z}^{(i)})}\) equals \(\mathbb{E}_{\mathbb{P}_{X} \otimes \mathbb{P}_{Z}}\left[ e^{T_{\theta}} \right]\), i.e., only if we treat the denominator as a constant when taking the expectation. Note that the first term in \(\hat{G}\) is estimated with samples drawn from the joint distribution whereas the second term uses the marginal distributions; hence, when taking the expectation \(\mathbb{E}[\hat{G}]\) we use the corresponding distribution for each term.

    The bias can be reduced by replacing the estimate in the denominator with an exponential moving average, as in the sketch below.
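Putting this together, a hedged sketch of one MINE update with the moving-average correction, reusing the `StatisticsNetwork` and `dv_lower_bound` helpers above (the EMA rate is an arbitrary choice):

```python
import torch

def mine_step(T, optimizer, x, z, ema_et, ema_rate=0.01):
    """One training step of T_theta. `ema_et` is a running, detached estimate
    of E_{P_X x P_Z}[e^T]; initialize it to 1.0 before the first step."""
    z_marginal = z[torch.randperm(z.size(0))]
    t_joint = T(x, z).mean()
    et_marginal = torch.exp(T(x, z_marginal)).mean()

    # Moving average of E[e^T], kept out of the autograd graph.
    ema_et = (1 - ema_rate) * ema_et + ema_rate * et_marginal.detach()

    # Surrogate loss: its gradient equals the DV gradient with the batch
    # denominator replaced by the moving average, which reduces the bias.
    loss = -(t_joint - et_marginal / ema_et)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Plug-in DV estimate of the mutual information, for monitoring.
    with torch.no_grad():
        mi_estimate = t_joint - torch.log(et_marginal)
    return mi_estimate.item(), ema_et
```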

Theoretical properties

See the appendix.

Applications

Maximizing mutual information to improve GANs

\[\underset { G } { \arg \max } \mathbb { E } [ \log ( D ( G ( [ \boldsymbol { \epsilon } , c ] ) ) ) ] + \beta I ( G ( [ \boldsymbol { \epsilon } , \boldsymbol { c } ] ) ; \boldsymbol { c } )\]
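A hedged sketch of how this objective can be wired up with the MINE bound (the `G` and `D` modules and the flat sample shapes are placeholder assumptions; `dv_lower_bound` is the helper sketched earlier, with its statistics network trained alongside):

```python
import torch

def generator_step(G, D, T, g_optimizer, eps, c, beta=1.0):
    """Generator update maximizing E[log D(G([eps, c]))] + beta * I(G([eps, c]); c).
    Assumes D outputs probabilities and samples are flat vectors."""
    fake = G(torch.cat([eps, c], dim=-1))
    gan_term = torch.log(D(fake)).mean()
    mi_term = dv_lower_bound(T, fake, c)  # neural lower bound on I(fake; c)

    loss = -(gan_term + beta * mi_term)   # ascent on the objective via descent
    g_optimizer.zero_grad()
    loss.backward()
    g_optimizer.step()
    return loss.item()
```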

Maximizing mutual information to improve bi-directional GANs

\[\begin{aligned} & \arg\max_{D} \; \mathbb{E}_{q(\boldsymbol{x}, \boldsymbol{z})}[\log D(\boldsymbol{x}, \boldsymbol{z})] + \mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{z})}[\log(1 - D(\boldsymbol{x}, \boldsymbol{z}))] \\ & \arg\max_{G} \; \mathbb{E}_{q(\boldsymbol{x}, \boldsymbol{z})}[\log(1 - D(\boldsymbol{x}, \boldsymbol{z}))] + \mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{z})}[\log D(\boldsymbol{x}, \boldsymbol{z})] + \beta I_{q}(\boldsymbol{x}, \boldsymbol{z}) \end{aligned}\]

Information Bottleneck

IB seeks an encoder, \(q(Z \vert X)\), that induces the Markovian structure \(X \rightarrow Z \rightarrow Y\).

\[\mathcal { L } [ q ( Z \vert X ) ] = H ( Y \vert Z ) + \beta I ( X , Z )\]

Minimizing the IB objective encourages \(Z\) to reduce the uncertainty of \(Y\) while remaining as independent of \(X\) as possible, so that \(Z\) keeps only the information that contributes to predicting \(Y\).
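To make the objective concrete, a minimal sketch of the IB loss with the \(I(X; Z)\) term estimated by the MINE bound (the `encoder` and `classifier` modules are placeholder assumptions, and `dv_lower_bound` is the helper sketched earlier; in practice the statistics network \(T\) is trained to tighten the bound while the encoder is trained against it):

```python
import torch.nn.functional as F

def ib_loss(encoder, classifier, T, x, y, beta=1e-3):
    """Information-bottleneck surrogate H(Y|Z) + beta * I(X;Z): the classifier's
    cross-entropy stands in for H(Y|Z), and the DV bound estimates I(X;Z)."""
    z = encoder(x)
    h_y_given_z = F.cross_entropy(classifier(z), y)  # y holds class indices
    i_xz = dv_lower_bound(T, x, z)                   # MINE estimate of I(X;Z)
    return h_y_given_z + beta * i_xz
```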