Wasserstein GAN (ICML 2017).

GAN

Discriminator objective:

\[\begin{aligned} \max \mathbb{E}_{x\sim \mathbb{P}_\mathrm{data}}[\log D(x)] + \mathbb{E}_{z\sim \mathbb{P}_\mathrm{noise}}[\log (1 - D(G(z)))]. \end{aligned}\]

Generator objective:

\[\min \mathbb{E}_{z\sim \mathbb{P}_\mathrm{noise}}[\log (1 - D(G(z)))] \quad \text{or}\quad \max \mathbb{E}_{z\sim \mathbb{P}_\mathrm{noise}}[\log D(G(z))].\]
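
As a concrete illustration (this snippet is mine, not from the paper), both objectives can be written as binary cross-entropy losses on the discriminator's pre-sigmoid logits; the tensors d_real_logits and d_fake_logits below are placeholders for \(D(x)\) and \(D(G(z))\).

```python
import torch
import torch.nn.functional as F

# Placeholder logits standing in for D(x) and D(G(z)); in a real model these
# come from a discriminator network that outputs pre-sigmoid scores.
d_real_logits = torch.randn(64, 1)   # D(x), before the sigmoid
d_fake_logits = torch.randn(64, 1)   # D(G(z)), before the sigmoid
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

# Discriminator: maximize E[log D(x)] + E[log(1 - D(G(z)))]
# == minimize BCE(D(x), 1) + BCE(D(G(z)), 0).
d_loss = F.binary_cross_entropy_with_logits(d_real_logits, ones) \
       + F.binary_cross_entropy_with_logits(d_fake_logits, zeros)

# Generator, saturating form: minimize E[log(1 - D(G(z)))]
# (the negative of BCE(D(G(z)), 0)).
g_loss_saturating = -F.binary_cross_entropy_with_logits(d_fake_logits, zeros)

# Generator, non-saturating form: maximize E[log D(G(z))]
# == minimize BCE(D(G(z)), 1).
g_loss_non_saturating = F.binary_cross_entropy_with_logits(d_fake_logits, ones)
```

The non-saturating form is the one commonly used in practice, since its gradient does not vanish when the discriminator confidently rejects the generated samples.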

The training procedure is as follows:

  • We first train a discriminator \(D\) to maximize \(L(D, g_\theta)\),

    \(L(D, g_\theta) = \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{x \sim \mathbb{P}_g}[\log(1 - D(x))].\) The optimal discriminator has the form \(D^*(x) = \frac{p_r(x)}{p_r(x) + p_g(x)}\), and \(L(D^*, g_\theta) = 2\, JSD\left(\mathbb{P}_r \| \mathbb{P}_g\right) - 2 \log 2\).

  • Once we have the optimal discriminator \(D^*\), we minimize \(L(D^*, g_\theta)\) with respect to \(\theta\), which is equivalent to minimizing the Jensen-Shannon divergence \(JSD\left(\mathbb{P}_r \| \mathbb{P}_g\right)\); a short derivation of both facts is sketched below.
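
For reference, the derivation of these two facts (from the original GAN paper) is short. Write the objective as an integral over \(x\),

\[L(D, g_\theta) = \int \left[ p_r(x) \log D(x) + p_g(x) \log\left(1 - D(x)\right) \right] \mathrm{d}x;\]

for fixed \(x\), \(a \log y + b \log(1 - y)\) is maximized over \(y \in [0, 1]\) at \(y = \frac{a}{a + b}\), which gives \(D^*(x) = \frac{p_r(x)}{p_r(x) + p_g(x)}\). Substituting back,

\[\begin{aligned} L(D^*, g_\theta) &= \int p_r(x) \log \frac{p_r(x)}{p_r(x) + p_g(x)} \,\mathrm{d}x + \int p_g(x) \log \frac{p_g(x)}{p_r(x) + p_g(x)} \,\mathrm{d}x \\ &= KL\left(\mathbb{P}_r \,\middle\|\, \tfrac{\mathbb{P}_r + \mathbb{P}_g}{2}\right) + KL\left(\mathbb{P}_g \,\middle\|\, \tfrac{\mathbb{P}_r + \mathbb{P}_g}{2}\right) - 2 \log 2 \\ &= 2\, JSD\left(\mathbb{P}_r \| \mathbb{P}_g\right) - 2 \log 2. \end{aligned}\]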

Distances

Let \(\mathcal{X}\) be a compact metric set and \(\Sigma\) denote the set of all the Borel subsets of \(\mathcal{X}\). Let \(\mathrm{Prob}(\mathcal{X})\) denote the space of probability measures defined on \(\mathcal{X}\). The elementary distances and divergences between two distributions \(\mathbb{P}_r, \mathbb{P}_g \in \mathrm{Prob}(\mathcal{X})\) can be defined as:

  • Total Variation (TV) distance

    \[\delta\left(\mathbb{P}_r, \mathbb{P}_g\right) = \sup_{A \in \Sigma} \left\vert \mathbb{P}_r(A) - \mathbb{P}_g(A) \right\vert\]
  • Kullback-Leibler (KL) divergence

    \(KL\left(\mathbb{P}_r \| \mathbb{P}_g\right) = \int \log\left(\frac{P_r(x)}{P_g(x)}\right) P_r(x) \,\mathrm{d}\mu(x),\) where \(\mathbb{P}_r\) and \(\mathbb{P}_g\) are assumed to be absolutely continuous with respect to the same measure \(\mu\) defined on \(\mathcal{X}\), and therefore admit densities \(P_r\) and \(P_g\).

  • Jensen-Shannon (JS) divergence

    \(JS\left(\mathbb{P}_r, \mathbb{P}_g\right) = \frac{1}{2} KL\left(\mathbb{P}_r \| \mathbb{P}_m\right) + \frac{1}{2} KL\left(\mathbb{P}_g \| \mathbb{P}_m\right),\) where \(\mathbb{P}_m\) is the mixture \(\left(\mathbb{P}_r + \mathbb{P}_g\right)/2\). This divergence is symmetric and always defined, because in the two KL terms we can choose \(\mu = \mathbb{P}_m\), with respect to which both \(\mathbb{P}_r\) and \(\mathbb{P}_g\) are absolutely continuous. With this normalization, \(JS \leq \log 2\), with equality when the two distributions have disjoint supports.

  • Earth-Mover (EM) distance or Wasserstein-1

    \(W\left(\mathbb{P}_r, \mathbb{P}_g\right) = \inf_{\gamma \in \Pi\left(\mathbb{P}_r, \mathbb{P}_g\right)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|_2\right],\) where \(\Pi\left(\mathbb{P}_r, \mathbb{P}_g\right)\) denotes the set of all joint distributions \(\gamma(x, y)\) whose marginals are respectively \(\mathbb{P}_r\) and \(\mathbb{P}_g\). Intuitively, \(\gamma(x, y)\) indicates how much mass must be transported from \(x\) to \(y\), and \(\|x - y\|_2\) is the cost of moving one unit of mass from \(x\) to \(y\). A small numerical sketch of all four quantities is given after this list.
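
The following is a minimal numerical sketch (mine, not from the paper) of the four quantities for discrete distributions on a common 1-D grid; the helper functions and the point-mass test case, which anticipates Example 1 below, are assumptions of the sketch rather than anything defined in the text.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two probability mass vectors."""
    return 0.5 * float(np.abs(p - q).sum())

def kl(p, q, eps=1e-12):
    """KL(p || q); eps clips zero masses so the log stays finite for illustration."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence with the 1/2 weights used above."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1(p, q, dx):
    """Wasserstein-1 on a uniform 1-D grid: integral of |CDF_p - CDF_q|."""
    return float(dx * np.abs(np.cumsum(p) - np.cumsum(q)).sum())

# Two unit point masses a distance theta apart (a 1-D analogue of Example 1 below).
grid = np.linspace(0.0, 1.0, 101)
dx = grid[1] - grid[0]
for theta in [0.5, 0.1, 0.01]:
    p = np.zeros_like(grid); p[0] = 1.0                        # all mass at x = 0
    q = np.zeros_like(grid); q[int(round(theta / dx))] = 1.0   # all mass at x = theta
    print(theta, tv(p, q), kl(p, q), js(p, q), w1(p, q, dx))
    # TV stays 1, KL is huge (-> +inf as eps -> 0), JS stays ~log 2,
    # while W1 ~ theta shrinks smoothly to 0.
```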

Example 1

Let \(Z \sim U[0, 1]\) be the uniform distribution on the unit interval. Let \(\mathbb{P}_0\) be the distribution of \((0, Z) \in \mathbb{R}^2\) (a vertical segment at \(x = 0\)), and let \(\mathbb{P}_\theta\) be the distribution of \((\theta, Z) \in \mathbb{R}^2\), with \(\theta\) a single real parameter. We then have:

Wasserstein distance:

\(W(\mathbb{P}_0, \mathbb{P}_\theta) = \vert \theta \vert\), since the plan that moves each point \((0, z)\) horizontally to \((\theta, z)\) costs exactly \(\vert \theta \vert\), and no coupling can do better: any pair \((x, y)\) with \(x\) on \(\{0\} \times [0, 1]\) and \(y\) on \(\{\theta\} \times [0, 1]\) already has \(\|x - y\|_2 \geq \vert \theta \vert\).

Jensen-Shannon divergence:

\[JS\left(\mathbb{P}_0, \mathbb{P}_\theta\right) = \left\{\begin{array}{ll} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{array}\right.\]

To see this for \(\theta \neq 0\), write \(p_0\) and \(p_\theta\) for the densities of \(\mathbb{P}_0\) and \(\mathbb{P}_\theta\) with respect to the reference measure that puts counting measure on \(x \in \{0, \theta\}\) and Lebesgue measure on \(z \in [0, 1]\), so that \(p_0(0, z) = 1\), \(p_0(\theta, z) = 0\), and symmetrically for \(p_\theta\). Then

\[\begin{aligned} KL(\mathbb{P}_0 \| \mathbb{P}_m) &= \sum_{x \in \{0, \theta\}} \int_{0}^1 p_0(x, z) \log \frac{2\,p_0(x, z)}{p_0(x, z) + p_\theta(x, z)} \,\mathrm{d}z \\ &= \int_{0}^1 p_0(0, z) \log \frac{2\,p_0(0, z)}{p_0(0, z) + p_\theta(0, z)} \,\mathrm{d}z + \int_{0}^1 p_0(\theta, z) \log \frac{2\,p_0(\theta, z)}{p_0(\theta, z) + p_\theta(\theta, z)} \,\mathrm{d}z \\ &= \int_{0}^1 \log 2 \,\mathrm{d}z + 0 = \log 2, \end{aligned}\]

where the second integral vanishes because \(p_0(\theta, z) = 0\). Similarly, \(KL(\mathbb{P}_\theta \| \mathbb{P}_m) = \log 2\), so \(JS(\mathbb{P}_0, \mathbb{P}_\theta) = \frac{1}{2} \log 2 + \frac{1}{2} \log 2 = \log 2\).

Kullback-Leibler divergence:

\(KL\left(\mathbb{P}_\theta \| \mathbb{P}_0\right) = KL\left(\mathbb{P}_0 \| \mathbb{P}_\theta\right) = \left\{\begin{array}{ll} +\infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{array}\right.\) since, for \(\theta \neq 0\), the two distributions have disjoint supports.

Total Variation:

\[\delta\left(\mathbb{P}_0, \mathbb{P}_\theta\right) = \left\{\begin{array}{ll} 1 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{array}\right.\]

When \(\theta \neq 0\), the supremum is attained at the Borel set \(A = \{0\} \times [0, 1]\) (or at \(A = \{\theta\} \times [0, 1]\)): \(\mathbb{P}_0(A) = 1\) while \(\mathbb{P}_\theta(A) = 0\).

Remark: When \(\mathbb{P}_0\) and \(\mathbb{P}_\theta\) have disjoint supports (\(\theta \neq 0\)), KL is infinite and both JS and TV sit at their maximum values, no matter how small \(\vert \theta \vert\) is. When \(\theta_t \rightarrow 0\), the sequence \(\left(\mathbb{P}_{\theta_t}\right)_{t \in \mathbb{N}}\) therefore converges to \(\mathbb{P}_0\) under the EM distance, but not under any of the other three.

Figure: WGAN Example

In terms of the relative strength of the topologies induced by these distances and divergences, KL is the strongest, followed by JS and TV (which are equivalent to one another), and EM is the weakest: convergence in KL implies convergence in JS and TV, which in turn implies convergence under the EM distance, while Example 1 shows the converse implications fail.

Wasserstein GAN

The Kantorovich-Rubinstein duality states that

\[W\left(\mathbb{P}_r, \mathbb{P}_\theta\right) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)].\]

In practice, we work with differentiable functions satisfying \(\|f\|_L \leq K\) (\(K\)-Lipschitz for some constant \(K\)); taking the supremum over this class instead yields \(K \cdot W\left(\mathbb{P}_r, \mathbb{P}_g\right)\), i.e., the Wasserstein distance up to a constant factor. Given a parameterized family of functions \(\left\{f_w\right\}_{w \in \mathcal{W}}\) that are all \(K\)-Lipschitz, we can therefore approximate \(W\left(\mathbb{P}_r, \mathbb{P}_\theta\right)\) with the objective

\[\max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathbb{P}_r}\left[f_w(x)\right] - \mathbb{E}_{z \sim p(z)}\left[f_w\left(g_\theta(z)\right)\right],\]

which we maximize by updating \(w\). Once the maximum is (approximately) attained, we can differentiate the resulting estimate of \(W\left(\mathbb{P}_r, \mathbb{P}_\theta\right)\) with respect to \(\theta\) and take gradient steps on \(\theta\) to minimize the distance.
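
Concretely, Theorem 3 of the paper justifies this step: for an optimal critic \(f\) in the duality above (and up to the constant \(K\) when a \(K\)-Lipschitz \(f_w\) is used instead), the generator gradient is obtained by back-propagating only through the second term,

\[\nabla_\theta W\left(\mathbb{P}_r, \mathbb{P}_\theta\right) = -\,\mathbb{E}_{z \sim p(z)}\left[\nabla_\theta f\left(g_\theta(z)\right)\right],\]

whenever both sides are well-defined.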

Remark: In practice, the Lipschitz constraint on \(\left\{f_w\right\}_{w \in \mathcal{W}}\) has to be enforced explicitly; the WGAN paper does this by clipping the weights \(w\) to a small box, and later work such as Sorting out Lipschitz function approximation studies more principled parameterizations of Lipschitz functions.

The algorithm is shown in the figure below. In lines 2-8, we estimate the distance \(W\left(\mathbb{P}_r, \mathbb{P}_\theta\right)\) by updating \(w\), i.e., optimizing the critic (discriminator); in lines 9-11, we minimize the estimated \(W\left(\mathbb{P}_r, \mathbb{P}_\theta\right)\) by updating \(\theta\), i.e., optimizing the generator. A minimal PyTorch sketch of this loop is given after the figure.

Figure: WGAN Algorithm
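
For illustration, here is a minimal PyTorch sketch of this training loop under assumptions of my own: toy 1-D Gaussian real data, small MLPs for the critic \(f_w\) and the generator \(g_\theta\), RMSprop updates, and weight clipping to keep \(f_w\) Lipschitz as in the paper. All names and hyperparameters (sample_real, n_critic, clip_value, ...) are illustrative.

```python
import torch
import torch.nn as nn

# Toy setup (my own, for illustration): real data is 1-D Gaussian, latent z is 1-D noise.
def sample_real(batch):      # samples from P_r
    return torch.randn(batch, 1) * 0.5 + 2.0

def sample_noise(batch):     # samples from p(z)
    return torch.randn(batch, 1)

generator = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))   # g_theta
critic    = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))   # f_w (no sigmoid)

opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

n_critic, clip_value, batch = 5, 0.01, 64

for step in range(1000):
    # --- lines 2-8: update the critic n_critic times to estimate W(P_r, P_theta) ---
    for _ in range(n_critic):
        x = sample_real(batch)
        z = sample_noise(batch)
        with torch.no_grad():
            fake = generator(z)
        # maximize E[f_w(x)] - E[f_w(g_theta(z))]  <=>  minimize its negation
        loss_c = -(critic(x).mean() - critic(fake).mean())
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
        # weight clipping keeps f_w K-Lipschitz for some K, as in the paper
        for p in critic.parameters():
            p.data.clamp_(-clip_value, clip_value)

    # --- lines 9-11: update the generator to minimize the estimated distance ---
    z = sample_noise(batch)
    loss_g = -critic(generator(z)).mean()   # minimize -E[f_w(g_theta(z))]
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```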