Learning Generative Models with Sinkhorn Divergence@AISTATS18.

## Overview

Figure Computational Graph

From the above figure, we can see that $$P_L$$ is determined by $$\hat{c}$$; there are two flows towards $$\mathcal{W}_c$$, so we can cut off one of them while the model is still differentiable, in the sense that we can still calculate the gradients of $$\theta$$ and $$\varphi$$. Indeed, in OT-GAN, the red flow branch is cut off.

Remark: The proposed generative model actually is very similar to wgan by substituting the 1-Wasserstein loss function with Sinkhorn loss.

## Optimal transport distances

The optimal transport metric between two probability distributions $$\mu, \nu$$ supported on the metric space $$\mathcal{X}$$ is defined as the solution of the linear program:

$\mathcal{W}_c(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} c(\mathbf{x}, \mathbf{y}) \mathrm{d} \pi(\mathbf{x}, \mathbf{y}),$

where the set of couplings is composed of joint probability distributions over the product space $$\mathcal{X} \times \mathcal{X}$$ with imposed marginals $$(\mu, \nu)$$

$\Pi(\mu, \nu) = \{ \pi: P_{1\sharp} \pi = \mu, P_{2\sharp} \pi = \nu \},$

where $$P_{1\sharp}(x, y) = x$$ and $$P_{2\sharp}(x, y) = y$$ are simple projection operators. $$c(\mathbf{x}, \mathbf{y})$$ is the ground cost to move a unit of mass form $$\mathbf{x}$$ to $$\mathbf{y}$$.

In the typical cases, the support of measure $$\mu$$ ($$\nu$$) may be a continuous manifold or very large for discrete measure, so we require to replace the initial loss function by an expectation over mini-batches of size $$(m, n)$$ and consider

\begin{align} \mu &= \frac{1}{m} \sum_{i=1}^m \delta_{\mathbf{x}_i}\\ \nu &= \frac{1}{n} \sum_{i=1}^n \delta_{\mathbf{y}_j} \end{align}

where $$\mathbf{x}_i = g_\theta (\mathbf{z}_i)$$ and $$\mathbf{y}_j$$ are instances of empirical distribution.

## Sinkhorn loss

Sinkhorn loss starts with $$b_0 = \mathbb{1}_m, \ell = 0$$, and iterates

\begin{aligned} a _ { \ell + 1 } &\stackrel { \mathrm { def. } } { = } \frac { 1 _ { n } } { K b _ { \ell } } \\ b _ { \ell + 1 } &\stackrel { \mathrm { def. } } { = } \frac { \mathbb { 1 } _ { m } } { K ^ { \top } a _ { \ell + 1 } }, \end{aligned}

where $$K_{ij} = \exp -\hat{c}_{ij}/\epsilon$$ and $$\hat{c} = [\| f_\varphi(\mathbf{x}_i) - f_\varphi(\mathbf{y}_j)\|]_{i,j} \in \mathbb{R}^{m \times n}$$. Then we have $$P_L = \operatorname { diag } \left( a _ { L } \right) K \operatorname { diag } \left( b _ { L } \right)$$, and the Sinkhorn loss $$\langle \hat{c}, P_L \rangle$$.

## Learning algorithm

Figure Sinkhorn Automatic Differentiation

Figure Sinkhorn Loss