Legendre transformation, its geometric intuition, and applications in machine learning.

Legendre conjugate

For a strictly convex function \(f: \Omega \subset \mathbb{R}^n \to \mathbb{R}\) with \(\nabla^2 f \succ 0\), its Legendre conjugate is defined as

\[f^* (\mathbf{y}) = \max_{\mathbf{x} \in \Omega}\ \langle \mathbf{y}, \mathbf{x} \rangle - f(\mathbf{x}), \quad \mathbf{y} \in \mathbb{R}^n.\]

As \(\mathbf{x} \mapsto \langle \mathbf{y}, \mathbf{x} \rangle - f(\mathbf{x})\) is a strictly concave function of \(\mathbf{x}\) (a linear function plus a strictly concave function), it has a unique point \(\mathbf{x}^*\) that attains the maximum value

\[\langle \mathbf{y}, \mathbf{x}^* \rangle - f(\mathbf{x^*}) \quad \mathrm{with}\quad \nabla f(\mathbf{x}^*) = \mathbf{y}.\]

Hence, we can interpret \(f^*(\mathbf{y})\) as follows: find the point \(\mathbf{x} \in \Omega\) whose gradient \(\nabla f(\mathbf{x})\) equals the given \(\mathbf{y}\), then evaluate \(\langle \mathbf{y}, \mathbf{x} \rangle - f(\mathbf{x})\) there.
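This interpretation is easy to check numerically. A minimal sketch, using \(f(x) = \frac{1}{2}x^2\) as an illustrative choice (not from the original text): its conjugate has the closed form \(f^*(y) = \frac{1}{2}y^2\), attained at \(x^*\) with \(f'(x^*) = x^* = y\).

```python
import numpy as np

# Approximate the Legendre conjugate f*(y) = max_x <y, x> - f(x)
# on a 1-D grid, for the illustrative choice f(x) = 0.5 * x**2.
def conjugate(f, y, xs):
    """Return the maximum of y*x - f(x) over the grid xs, and its maximizer."""
    vals = y * xs - f(xs)
    i = np.argmax(vals)
    return vals[i], xs[i]

f = lambda x: 0.5 * x**2
xs = np.linspace(-10.0, 10.0, 200001)   # fine grid over the domain

y = 3.0
fstar, xstar = conjugate(f, y, xs)
print(fstar)   # ≈ 0.5 * y**2 = 4.5
print(xstar)   # ≈ 3.0, i.e. the maximizer satisfies f'(x*) = x* = y
```

The maximizer found by brute force is exactly the point whose gradient matches \(y\), as the argument above predicts.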

Geometric explanation

Let us approach the Legendre conjugate from the geometric perspective to gain more intuition.

Recall that the epigraph of a convex function \(f\) is

\[\mathcal{O} = \{ (\mathbf{x}, z) \mid \mathbf{x} \in \Omega, z \geq f(\mathbf{x}) \}.\]

How can we describe its boundary \(\partial \mathcal{O}\) ?

Of course, given an \(\mathbf{x} \in \Omega\) we have \((\mathbf{x}, z=f(\mathbf{x})) \in \partial \mathcal{O}\); this represents \(\partial \mathcal{O}\) in the \(\mathbf{x}\)-coordinate.

Let us consider the tangent hyperplane passing through a fixed point \((\mathbf{x}_0, f(\mathbf{x}_0)) \in \partial \mathcal{O}\):

\[z = \langle \nabla f(\mathbf{x}_0), \mathbf{x} - \mathbf{x}_0 \rangle + f(\mathbf{x}_0).\]

The key point here is that only one point of \(\partial \mathcal{O}\) admits a tangent hyperplane with slope \(\nabla f(\mathbf{x}_0)\), since \(\nabla f(\mathbf{x})\) is monotonically increasing in \(\mathbf{x}\) (\(\nabla^2 f \succ 0\)). Therefore, we may also represent the boundary \(\partial \mathcal{O}\) with the parameter \(\mathbf{y} = \nabla f(\mathbf{x})\), i.e., in the \(\mathbf{y}\)-coordinate. In summary, we can identify a point \((\mathbf{x}, z) \in \partial \mathcal{O}\) either by its \(\mathbf{x}\)-coordinate or by its \(\mathbf{y}\)-coordinate (\(\mathbf{y} = \nabla f(\mathbf{x})\)). Namely, we have exhibited a dual coordinate system.
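The dual coordinate system can be made concrete with a small sketch. Taking \(f(x) = e^x\) as an illustrative strictly convex example (an assumption, not from the original text), the slope coordinate is \(y = f'(x) = e^x\), and the inverse map \(x = \log y\) recovers the \(x\)-coordinate:

```python
import numpy as np

# Dual coordinates for the illustrative choice f(x) = exp(x), f'' > 0.
# Each boundary point can be indexed by x or by the slope y = f'(x) = exp(x);
# the two coordinates are related by a bijection with inverse x = log(y).
x = np.linspace(-2.0, 2.0, 9)
y = np.exp(x)          # y-coordinate (the slope at x)
x_back = np.log(y)     # recover the x-coordinate from the slope

print(np.allclose(x, x_back))   # True: either coordinate identifies the point
```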

How can we write the boundary \(\partial \mathcal{O}\) in terms of the \(\mathbf{y}\)-coordinate? Note that any hyperplane with a given slope \(\nabla f(\mathbf{x}_0)\) passing through \(\partial \mathcal{O}\) intersects the \(z\)-axis at a unique point. Such hyperplanes, parameterized by the variable \(\mathbf{p}\), share one equation form:

\[z = \langle \nabla f(\mathbf{x_0}), \mathbf{p} - \mathbf{x} \rangle + f(\mathbf{x}),\]

where \((\mathbf{x}, f(\mathbf{x}))\) is an intersection point between the hyperplane and \(\partial \mathcal{O}\) (there may be multiple intersection points, and any of them is valid). Among these parallel hyperplanes, \(z = \langle \nabla f(\mathbf{x}_0), \mathbf{p} - \mathbf{x}_0 \rangle + f(\mathbf{x}_0)\) is the unique one that attains the minimal \(z\)-intercept

\[-\langle \nabla f(\mathbf{x}_0), \mathbf{x}_0 \rangle + f(\mathbf{x}_0).\]

Remark: This is because \(f\) is convex, so \(f(\mathbf{x}) \geq \langle \nabla f(\mathbf{x}_0), \mathbf{x} - \mathbf{x}_0 \rangle + f(\mathbf{x}_0)\), which yields

\[-\langle \nabla f(\mathbf{x_0}), \mathbf{x} \rangle + f(\mathbf{x}) \geq -\langle \nabla f(\mathbf{x_0}) , \mathbf{x}_0\rangle + f(\mathbf{x}_0).\]
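This minimal-intercept claim can be verified numerically. A sketch, again using the illustrative choice \(f(x) = \frac{1}{2}x^2\) (my assumption, not from the text): among all lines with slope \(y = f'(x_0)\) through points of the graph, the tangent at \(x_0\) has the smallest \(z\)-intercept \(-\langle y, x \rangle + f(x)\).

```python
import numpy as np

# For f(x) = 0.5 * x**2 and slope y = f'(x0) = x0, compute the z-intercept
# -y*x + f(x) of the slope-y line through each graph point (x, f(x)),
# and check that the minimum is attained at x = x0.
f = lambda x: 0.5 * x**2
x0 = 1.5
y = x0                           # slope of the tangent at x0

xs = np.linspace(-5.0, 5.0, 1001)
intercepts = -y * xs + f(xs)     # z-intercept of each parallel line

print(xs[np.argmin(intercepts)])   # ≈ x0 = 1.5
print(np.min(intercepts))          # ≈ -y*x0 + f(x0) = -1.125
```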

We can also see this visually. The hyperplane \(z = \langle \nabla f(\mathbf{x}_0), \mathbf{x} - \mathbf{x}_0 \rangle + f(\mathbf{x}_0)\) has only one intersection with \(\partial \mathcal{O}\), whereas the other hyperplanes \(z = \langle \nabla f(\mathbf{x}_0), \mathbf{p} - \mathbf{x} \rangle + f(\mathbf{x})\) with \(\mathbf{x} \ne \mathbf{x}_0\) all have multiple intersections with \(\partial \mathcal{O}\); we can slide any of the latter down to obtain the former by decreasing the \(z\)-intercept.

The tangent hyperplane equation \(z = \langle \nabla f(\mathbf{x}_0), \mathbf{x} - \mathbf{x}_0 \rangle + f(\mathbf{x}_0)\) can also be written as

\[\left\langle \begin{bmatrix}\nabla f(\mathbf{x}_0) \\ -1 \end{bmatrix}, \begin{bmatrix}\mathbf{x} \\ z \end{bmatrix} - \begin{bmatrix}\mathbf{x}_0 \\ f(\mathbf{x}_0) \end{bmatrix}\right\rangle = 0\]

This form reveals that \([\mathbf{x}, z]^\top \in \mathbb{R}^{n+1}\) is an arbitrary point on the hyperplane and \([\nabla f(\mathbf{x}_0), -1]^\top \in \mathbb{R}^{n+1}\) is the normal vector perpendicular to the hyperplane at the point \([\mathbf{x}_0, f(\mathbf{x}_0)]^\top\).

So we can represent \((\mathbf{x}_0, z=f(\mathbf{x}_0)) \in \partial\mathcal{O}\) with \(\mathbf{y} := \nabla f(\mathbf{x_0})\) as

\[\mathbf{x}_0 = \operatorname{argmin}_{\mathbf{x}} -\langle \mathbf{y}, \mathbf{x} \rangle + f(\mathbf{x}) = \operatorname{argmax}_{\mathbf{x}} \langle \mathbf{y}, \mathbf{x} \rangle - f(\mathbf{x}).\]

In other words, finding the point \(\mathbf{x} \in \Omega\) whose gradient \(\nabla f(\mathbf{x})\) equals the given \(\mathbf{y}\) is equivalent to solving

\[\max_{\mathbf{x} \in \Omega}\ \langle \mathbf{y}, \mathbf{x} \rangle - f(\mathbf{x}), \quad \mathbf{y} \in \mathbb{R}^n.\]

Remark: \(\langle \mathbf{y}, 0 - \mathbf{x} \rangle\) describes the change in the \(z\) value of the hyperplane \(z = \langle \mathbf{y}, \mathbf{p} - \mathbf{x} \rangle + f(\mathbf{x})\) as \(\mathbf{p}\) goes from \(\mathbf{x}\) to \(0\). Given the slope \(\mathbf{y}\), the hyperplane passing through the point \((\mathbf{x}, f(\mathbf{x}))\) with \(\nabla f(\mathbf{x}) = \mathbf{y}\) attains the minimal \(z\)-intercept.

Interpreting the Legendre transformation in one sentence:

\[\max_{\mathbf{x} \in \Omega}\ \langle \mathbf{y}, \mathbf{x} \rangle - f(\mathbf{x}), \quad \mathbf{y} \in \mathbb{R}^n\]

equals the negative of

\[\min_{\mathbf{x} \in \Omega}\ \langle \mathbf{y}, -\mathbf{x} \rangle + f(\mathbf{x}), \quad \mathbf{y} \in \mathbb{R}^n,\]

which is the minimal \(z\)-intercept that the hyperplane

\[z = \langle \mathbf{y}, \mathbf{p} -\mathbf{x} \rangle + f(\mathbf{x}), \quad \mathbf{p} \in \mathbb{R}^n\]

could attain.

Properties

  • \(f^* (\mathbf{y}) = \max_{\mathbf{x} \in \Omega}\ \langle \mathbf{y}, \mathbf{x} \rangle - f(\mathbf{x})\) is a convex function of \(\mathbf{y}\), since it is a pointwise maximum of affine functions of \(\mathbf{y}\).
  • \(f^{**} = f\).
  • \(\nabla f^* = (\nabla f)^{-1}\), equivalently \((\nabla f^*)^{-1} = \nabla f\), when \(\nabla f\) is invertible.
  • \(f(\mathbf{x}) + f^*(\mathbf{y}) \geq \langle \mathbf{y}, \mathbf{x} \rangle\) (the Fenchel–Young inequality).
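The last two properties can be checked numerically. A sketch using the illustrative pair \(f(x) = e^x\) with conjugate \(f^*(y) = y \log y - y\) for \(y > 0\) (a standard closed form, though the choice of \(f\) is my assumption):

```python
import numpy as np

# Check nabla f* = (nabla f)^{-1} and the Fenchel-Young inequality
# for the illustrative pair f(x) = exp(x), f*(y) = y*log(y) - y (y > 0).
f          = lambda x: np.exp(x)
fstar      = lambda y: y * np.log(y) - y
grad_f     = lambda x: np.exp(x)     # nabla f
grad_fstar = lambda y: np.log(y)     # nabla f*, the inverse of nabla f

x = np.linspace(-2.0, 2.0, 101)
y = np.linspace(0.1, 5.0, 101)

# Gradient maps are mutually inverse: log(exp(x)) = x.
print(np.allclose(grad_fstar(grad_f(x)), x))   # True

# Fenchel-Young holds on the whole grid of (x, y) pairs.
X, Y = np.meshgrid(x, y)
print(np.all(f(X) + fstar(Y) - X * Y >= -1e-9))   # True
```

Equality in Fenchel–Young holds exactly when \(y = \nabla f(x)\), matching the maximizer characterization above.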

Applications

Smoothed max operator.
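A common instance of the smoothed max operator (a sketch of the standard construction; the source does not spell it out): taking \(f\) to be the negative entropy \(\sum_i x_i \log x_i\) restricted to the probability simplex, its Legendre conjugate is \(f^*(\mathbf{y}) = \log \sum_i e^{y_i}\), a smooth approximation of \(\max_i y_i\), and its gradient is the softmax map, mirroring \(\nabla f^* = (\nabla f)^{-1}\).

```python
import numpy as np

# log-sum-exp: conjugate of negative entropy on the simplex; a smoothed max.
def logsumexp(y):
    m = np.max(y)                        # shift by the max for numerical stability
    return m + np.log(np.sum(np.exp(y - m)))

# Its gradient is the softmax map, a smoothed argmax.
def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

y = np.array([1.0, 2.0, 5.0])
print(logsumexp(y))   # slightly above max(y) = 5
print(softmax(y))     # probabilities concentrated on the largest entry
```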