Negative entropy, its properties, and applications in machine learning.

Background

Notation: Denoting standard simplex in Rn as n{λR+n:1,λ=1}.

Strongly convex property

We consider the negative entropy defined as

η(x)=x,lnx,xn.

Theorem 1. The negative entropy is a strongly convex on n in the 1-norm with convexity parameter 1, i.e., for any hRn, the following inequality holds

2η(x)h,hh12.

Proof. First note that η(x)=1+lnx and 2η(x)=diag(1/x), and thus

2η(x)h,h=h,hx=i=1nhi2xi.

Now we shall find

minf(x)h,hxs.t.1,x=1.

According to the Proposition 1 the optimal x and optimal dual multiplier λ satisfy

f(x)=(hx)2=1λ(hixi)2=λxi=|hi|λ

in the first step, we have absorbed the minus operator into λ since it is a free variable; the second step follows the fact that hiR and xiR+.

Taking xi=|hi|λ along with i=1nxi=1, we get

f(x)=(in|hi|)2=h12.

Proposition 1. Let x be a local minimum of a differentiable function f subject to the linear equality constraints

xE={xRn|Ax=b},

where A is an m×n-matrix with full row rank, and m<n. Then there exists a vector of multipliers λ such that

f(x)=Aλ.

Proof. Let uiRn,i=1k be the basis of the null space of A and g(v)x+Uv where vRk, U=[u1,,uk]. We observe that g(v)E for any vRk, and let ϕ(v)=f(g(v)). Since ϕ(0) is a local minimum of ϕ then

ϕ(v)|v=0=Uf(x)=0,

i.e., f(x) is orthogonal to the null space of A, hence f(x) must stay in the row space of A which completes the proof.

Remark: dϕ(v)=df(g(v))=f,dg(v)=f,Udv=Uf,dv.

Applications

Differentiable dynamic programming

The max operator can be formulated as

max(x)=supq+q,x

However, it is not a smooth operator w.r.t x and we can smooth it with negative entropy to yield

maxη(x)=supq+q,xη(q)

where the r.h.s is a strongly concave function w.r.t q, more details is available in the paper.

Smoothed Wasserstain distance

Recall the Wasserstain distance between two measure a,b: LC(a,b)minPU(a,b)P,C the r.h.s is a convex function w.r.t P and we can transform it to a ϵ-strongly convex function with negative entropy LCϵ(a,b)minPU(a,b)P,C+ϵη(P).