Notes on Chapter 2 of *Introductory Lectures on Convex Optimization*.

## 2 Smooth convex optimization

We have seen that, in general, the gradient method converges only to a stationary point of the function $$f$$. This motivates us to look for a class of functions $$\mathcal{F}$$ such that

Assumption 2.1.1 For any $$f \in \mathcal{F}$$, the first-order optimality condition $$f'(x) = 0$$ is sufficient for $$x$$ to be a global solution.

We also require the hypothetical class $$\mathcal{F}$$ to be invariant under the following operation.

Assumption 2.1.2 If $$f_1, f_2 \in \mathcal{F}$$ and $$\alpha, \beta \geq 0$$, then $$\alpha f_1 + \beta f_2 \in \mathcal{F}$$.

Let's also introduce one basic element of the class.

Assumption 2.1.3 Any linear function $$f(x) = \alpha + \langle a , x \rangle$$ belongs to $$\mathcal{F}$$.

The reason is natural: for a linear function $$f'(x) = a$$, so $$f'(x) = 0$$ forces $$a = 0$$, i.e., $$f$$ is constant and every point of its domain is a global solution.

Now let's see what conclusions we can draw about our hypothetical class $$\mathcal{F}$$. Consider $$f \in \mathcal{F}$$ and, for a fixed point $$x_0 \in \mathbb{R}^n$$, the function

$\phi ( y ) = f ( y ) - \left\langle f ^ { \prime } \left( x _ { 0} \right) ,y \right\rangle$

This is a non-negative linear combination of $$f(y)$$ and the linear function $$-\left\langle f'(x_0), y\right\rangle$$, and thus $$\phi \in \mathcal{F}$$. Note that

$\phi ^ { \prime } ( y ) \vert _ { y = x _ { 0} } = f ^ { \prime } \left( x _ { 0} \right) - f ^ { \prime } \left( x _ { 0} \right) = 0$

Hence, in view of Assumption 2.1.1, for any $$y \in \mathbb{R}^n$$ we have $$\phi(y) \geq \phi(x_0) = f(x_0) - \left\langle f'(x_0), x_0\right\rangle.$$

Rearranging, we conclude that

$f ( y ) \geq f \left( x _ { 0} \right) + \left\langle f ^ { \prime } \left( x _ { 0} \right) ,y - x _ { 0} \right\rangle$

This inequality defines the class of differentiable convex functions.

Definition 2.1.1 A continuously differentiable function $$f(x)$$ is called convex on $$\mathbb{R}^n$$ (notation $$f \in \mathcal{F}^1(\mathbb{R}^n)$$) if for any $$x, y \in \mathbb{R}^n$$ we have

$f ( y ) \geq f ( x ) + \left\langle f ^ { \prime } ( x ) ,y - x \right\rangle$
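
As a quick numerical sanity check (an illustration of mine, not from the book; the choice $$f(x) = \|x\|^2$$, with $$f'(x) = 2x$$, is arbitrary), the inequality of Definition 2.1.1 can be tested on random points:

```python
import numpy as np

# Sanity-check Definition 2.1.1 for f(x) = ||x||^2, f'(x) = 2x,
# on random pairs of points (illustrative sketch only).
rng = np.random.default_rng(0)
f = lambda x: x @ x
grad = lambda x: 2 * x

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    # f(y) >= f(x) + <f'(x), y - x>, up to rounding error
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-12
print("gradient inequality verified on all samples")
```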

Theorem 2.1.1 If $$f \in \mathcal { F } ^ { 1} \left( \mathbb{R} ^ { n } \right)$$ and $$f'(x^*) = 0$$ then $$x^*$$ is the global minimum of $$f(x)$$ on $$\mathbb{R}^n$$.

Lemma 2.1.1 If $$f_1, f_2 \in \mathcal{F}^1(\mathbb{R}^n)$$ and $$\alpha, \beta \geq 0$$, then the function $$f = \alpha f_1 + \beta f_2$$ also belongs to $$\mathcal{F}^1(\mathbb{R}^n)$$.

Lemma 2.1.2 If $$f \in \mathcal{F}^1(\mathbb{R}^m)$$, $$b \in \mathbb{R}^m$$ and $$A: \mathbb{R}^n \rightarrow \mathbb{R}^m$$ is a linear operator, then

$\phi ( x ) = f ( A x + b ) \in \mathcal{F}^1\left( \mathbb{R}^n \right)$

Proof: Let $$x, y \in \mathbb{R}^n$$ and denote $$\bar{x} = Ax + b$$, $$\bar{y} = Ay + b$$. We have $$f(\bar{y}) \geq f(\bar{x}) + \left\langle f'(\bar{x}), \bar{y} - \bar{x}\right\rangle$$. Note that $$\phi(y) = f(\bar{y})$$, $$\phi(x) = f(\bar{x})$$ and, by the chain rule, $$\phi'(x) = A^T f'(\bar{x})$$. Since $$\bar{y} - \bar{x} = A(y - x)$$, we get $$\left\langle f'(\bar{x}), \bar{y} - \bar{x}\right\rangle = \left\langle A^T f'(\bar{x}), y - x\right\rangle = \left\langle \phi'(x), y - x\right\rangle$$, and therefore $$\phi(y) \geq \phi(x) + \left\langle \phi'(x), y - x\right\rangle$$.
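
A small numerical sketch of Lemma 2.1.2 (my illustration; the random matrix $$A$$, vector $$b$$ and the choice $$f(u) = \|u\|^2$$ are arbitrary), using the chain-rule expression $$\phi'(x) = A^T f'(Ax+b)$$ from the proof above:

```python
import numpy as np

# Check that phi(x) = f(Ax + b) with f(u) = ||u||^2 satisfies
# the gradient inequality; phi'(x) = A^T f'(Ax + b) by the chain rule.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 6))
b = rng.normal(size=4)

phi = lambda x: np.sum((A @ x + b) ** 2)
phi_grad = lambda x: A.T @ (2 * (A @ x + b))

for _ in range(1000):
    x, y = rng.normal(size=6), rng.normal(size=6)
    assert phi(y) >= phi(x) + phi_grad(x) @ (y - x) - 1e-9
print("phi inherits convexity")
```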

$$\square$$

We now provide several equivalent characterizations of convex functions.

Theorem 2.1.2 A continuously differentiable function $$f$$ belongs to the class $$\mathcal{F}^1(\mathbb{R}^n)$$ if and only if for any $$x, y \in \mathbb{R}^n$$ and $$\alpha \in [0, 1]$$ we have

$f ( \alpha x + ( 1- \alpha ) y ) \leq \alpha f ( x ) + ( 1- \alpha ) f ( y ).$

Proof: Denote $$x _ { \alpha } = \alpha x + ( 1- \alpha ) y$$ and let $$f \in \mathcal { F } ^ { 1} \left( \mathbb{R} ^ { n } \right)$$. Then

$\begin{aligned} f\left(x_{\alpha}\right) &\leq f(y) - \left\langle f'\left(x_{\alpha}\right), y - x_{\alpha}\right\rangle = f(y) - \alpha\left\langle f'\left(x_{\alpha}\right), y - x\right\rangle, \\ f\left(x_{\alpha}\right) &\leq f(x) - \left\langle f'\left(x_{\alpha}\right), x - x_{\alpha}\right\rangle = f(x) + (1-\alpha)\left\langle f'\left(x_{\alpha}\right), y - x\right\rangle. \end{aligned}$

Multiplying the first inequality by $$1 - \alpha$$, the second by $$\alpha$$, and adding the results, we get $$f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)$$.

Conversely, let $$f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)$$ hold for all $$x, y \in \mathbb{R}^n$$ and $$\alpha \in [0, 1)$$. Then

$\begin{aligned} f(y) &\geq \frac{1}{1-\alpha}\left[f\left(x_{\alpha}\right) - \alpha f(x)\right] = f(x) + \frac{1}{1-\alpha}\left[f\left(x_{\alpha}\right) - f(x)\right] \\ &= f(x) + \frac{1}{1-\alpha}\left[f(x + (1-\alpha)(y-x)) - f(x)\right]. \end{aligned}$

Tending $$\alpha$$ to 1, the last term converges to the directional derivative $$\left\langle f'(x), y - x\right\rangle$$, and we get $$f(y) \geq f(x) + \left\langle f'(x), y - x\right\rangle$$. $$\square$$
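
To make the limiting step concrete, here is a small sketch (my own illustration with the arbitrary choice $$f(x) = \|x\|^2$$) showing the difference quotient approaching $$\left\langle f'(x), y - x\right\rangle$$ as $$\alpha \to 1$$:

```python
import numpy as np

# As alpha -> 1, [f(x + (1-alpha)(y-x)) - f(x)] / (1-alpha) -> <f'(x), y-x>.
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([1.0, -2.0, 0.5])
y = np.array([0.0, 1.0, 3.0])
target = grad(x) @ (y - x)
for alpha in [0.9, 0.99, 0.999, 0.9999]:
    q = (f(x + (1 - alpha) * (y - x)) - f(x)) / (1 - alpha)
    print(f"alpha={alpha}: quotient={q:.6f}, limit={target:.6f}")
```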

It is worth mentioning that the inequality of Theorem 2.1.2 does not involve derivatives, so it can be taken as the definition of convexity for non-differentiable functions as well.

Theorem 2.1.3 A continuously differentiable function $$f$$ belongs to the class $$\mathcal{F}^1(\mathbb{R}^n)$$ if and only if for any $$x, y \in \mathbb{R}^n$$ we have

$\left\langle f ^ { \prime } ( x ) - f ^ { \prime } ( y ) ,x - y \right\rangle \geq 0$

Proof: The implication $$f \in \mathcal{F}^1(\mathbb{R}^n) \Rightarrow \left\langle f'(x) - f'(y), x - y\right\rangle \geq 0$$ is straightforward: add the inequalities $$f(y) \geq f(x) + \left\langle f'(x), y - x\right\rangle$$ and $$f(x) \geq f(y) + \left\langle f'(y), x - y\right\rangle$$.

Conversely, let $$\left\langle f'(x) - f'(y), x - y\right\rangle \geq 0$$ hold for all $$x, y \in \mathbb{R}^n$$ and denote $$x_\tau = x + \tau(y - x)$$. Then

$\begin{aligned} f(y) &= f(x) + \int_0^1 \left\langle f'(x + \tau(y-x)), y - x\right\rangle d\tau \\ &= f(x) + \left\langle f'(x), y - x\right\rangle + \int_0^1 \left\langle f'\left(x_\tau\right) - f'(x), y - x\right\rangle d\tau \\ &= f(x) + \left\langle f'(x), y - x\right\rangle + \int_0^1 \frac{1}{\tau}\left\langle f'\left(x_\tau\right) - f'(x), x_\tau - x\right\rangle d\tau \\ &\geq f(x) + \left\langle f'(x), y - x\right\rangle. \end{aligned}$

$$\square$$
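
A quick numerical look at Theorem 2.1.3 (my illustration, not from the book): the gradient of the convex function $$f(x) = \log\sum_i e^{x^{(i)}}$$ is the softmax map, and its monotonicity can be checked on random points.

```python
import numpy as np

# Gradient monotonicity (Theorem 2.1.3) for f(x) = log(sum(exp(x))),
# whose gradient is the softmax map.
def softmax(x):
    p = np.exp(x - x.max())      # shift for numerical stability
    return p / p.sum()

rng = np.random.default_rng(2)
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    assert (softmax(x) - softmax(y)) @ (x - y) >= -1e-12
print("monotonicity verified")
```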

Now let us consider the class of twice continuously differentiable convex functions $$\mathcal{F}^2(\mathbb{R}^n)$$.

Theorem 2.1.4 A twice continuously differentiable function $$f$$ belongs to $$\mathcal{F}^2(\mathbb{R}^n)$$ if and only if for any $$x \in \mathbb{R}^n$$ we have

$f''(x) \succeq 0$

Proof: Let $$f \in \mathcal{F}^2(\mathbb{R}^n)$$ and $$x_\tau = x + \tau s$$, $$\tau > 0$$. Then, in view of Theorem 2.1.3, we have

$\begin{aligned} 0 &\leq \frac{1}{\tau}\left\langle f'\left(x_\tau\right) - f'(x), x_\tau - x\right\rangle = \left\langle f'\left(x_\tau\right) - f'(x), s\right\rangle \\ &= \left\langle \int_0^\tau f''(x + \lambda s)\, s \, d\lambda, \; s\right\rangle = \int_0^\tau \left\langle f''(x + \lambda s) s, s\right\rangle d\lambda. \end{aligned}$

Dividing by $$\tau$$ and tending $$\tau$$ to 0, we get $$\left\langle f''(x)s, s\right\rangle \geq 0$$ for any $$s$$, i.e., $$f''(x) \succeq 0$$.

Conversely, let $$f''(x) \succeq 0$$ hold, and denote $$s = y - x$$. Then

$\begin{aligned} f(y) &= f(x) + \int_0^1 \left\langle f'(x + \tau(y-x)), y - x\right\rangle d\tau \\ &= f(x) + \left\langle f'(x), y - x\right\rangle + \int_0^1 \left\langle f'\left(x_\tau\right) - f'(x), y - x\right\rangle d\tau \\ &= f(x) + \left\langle f'(x), y - x\right\rangle + \int_0^1 \int_0^\tau \left\langle f''(x + \lambda s) s, s\right\rangle d\lambda \, d\tau \\ &\geq f(x) + \left\langle f'(x), y - x\right\rangle. \end{aligned}$

$$\square$$
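
Theorem 2.1.4 gives a practical convexity test: compute the Hessian and check its eigenvalues. A sketch (my illustration) for $$f(x) = \log\sum_i e^{x^{(i)}}$$, whose Hessian is $$\operatorname{diag}(p) - pp^T$$ with $$p$$ the softmax vector:

```python
import numpy as np

# Convexity test via Theorem 2.1.4: the Hessian of log-sum-exp is
# diag(p) - p p^T, which should be positive semidefinite.
def lse_hessian(x):
    p = np.exp(x - x.max())
    p = p / p.sum()              # softmax probabilities
    return np.diag(p) - np.outer(p, p)

x = np.random.default_rng(3).normal(size=6)
eigs = np.linalg.eigvalsh(lse_hessian(x))
print("min eigenvalue:", eigs.min())   # nonnegative up to rounding
```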

Let us mention some examples of differentiable convex functions.

According to Lemmas 2.1.1 and 2.1.2, the functions $$f ( x ) = \sum _ { i = 1 } ^ { m } e ^ { \alpha _ { i } + \left\langle a _ { i } , x \right\rangle }$$ and $$f ( x ) = \sum _ { i = 1 } ^ { m } \left\vert \left\langle a _ { i } , x \right\rangle - b _ { i } \right\vert ^ { p }$$ with $$p \geq 1$$ are convex, since the univariate functions $$e^t$$ and $$\vert t \vert^p$$, $$p \geq 1$$, are convex.
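
The first example can be checked directly; a sketch (my illustration, with arbitrary random data $$\alpha_i$$, $$a_i$$) verifying the gradient inequality for $$f(x) = \sum_i e^{\alpha_i + \langle a_i, x\rangle}$$:

```python
import numpy as np

# Verify Definition 2.1.1 for f(x) = sum_i exp(alpha_i + <a_i, x>).
rng = np.random.default_rng(4)
alpha, A = rng.normal(size=3), rng.normal(size=(3, 5))

f = lambda x: np.sum(np.exp(alpha + A @ x))
grad = lambda x: A.T @ np.exp(alpha + A @ x)   # sum_i exp(...) * a_i

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9
print("exp-sum example verified")
```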

Differentiability itself cannot ensure any special smoothness properties of convex functions. Therefore we need to consider problem classes with Lipschitz continuous derivatives of a certain order.

Theorem 2.1.5 All of the conditions below, holding for all $$x, y \in \mathbb{R}^n$$ and $$\alpha \in [0, 1]$$, are equivalent to the inclusion $$f \in \mathcal{F}_L^{1,1}(\mathbb{R}^n)$$:

$\text{(1)}\quad 0 \leq f(y) - f(x) - \left\langle f'(x), y - x\right\rangle \leq \frac{L}{2}\|x - y\|^2$

$\text{(2)}\quad f(x) + \left\langle f'(x), y - x\right\rangle + \frac{1}{2L}\left\|f'(x) - f'(y)\right\|^2 \leq f(y)$

$\text{(3)}\quad \frac{1}{L}\left\|f'(x) - f'(y)\right\|^2 \leq \left\langle f'(x) - f'(y), x - y\right\rangle$

$\text{(4)}\quad \left\langle f'(x) - f'(y), x - y\right\rangle \leq L\|x - y\|^2$

$\text{(5)}\quad \alpha f(x) + (1-\alpha) f(y) \geq f(\alpha x + (1-\alpha)y) + \frac{\alpha(1-\alpha)}{2L}\left\|f'(x) - f'(y)\right\|^2$

$\text{(6)}\quad \alpha f(x) + (1-\alpha) f(y) \leq f(\alpha x + (1-\alpha)y) + \alpha(1-\alpha)\frac{L}{2}\|x - y\|^2$

Proof: (1) follows from the definition of convex functions and Lemma 1.2.3. To prove (2), let us fix $$x$$ and consider the function

$\phi ( y ) = f ( y ) - \left\langle f ^ { \prime } \left( x \right) , y \right\rangle$

Note that $$\phi \in \mathcal{F}_L^{1,1}(\mathbb{R}^n)$$ and its optimal point is $$y^* = x$$. In view of (1), we have

$\phi\left(x\right) = \min_v \phi(v) \leq \min_v \left[\phi(y) + \left\langle \phi'(y), v - y\right\rangle + \frac{L}{2}\|v - y\|^2\right],$

where the rightmost expression attains its minimum at $$v = y - \frac{1}{L}\phi'(y)$$. Substituting this $$v$$ and noting that $$\phi'(y) = f'(y) - f'(x)$$, we get (2).

We get (3) by adding (2) to its copy with $$x$$ and $$y$$ interchanged.

Applying the Cauchy–Schwarz inequality to (3), we get $$\left\|f'(x) - f'(y)\right\| \leq L\|x - y\|$$, i.e., the gradient of $$f$$ is Lipschitz continuous; this gives $$f \in \mathcal{F}_L^{1,1}(\mathbb{R}^n)$$, and hence (1).

We can get (4) from (1) by adding two copies of it with $$x$$ and $$y$$ interchanged.

To get (1) from (4) we apply integration:

$\begin{aligned} f(y) - f(x) - \left\langle f'(x), y - x\right\rangle &= \int_0^1 \frac{1}{\tau}\left\langle f'(x + \tau(y-x)) - f'(x), \tau(y - x)\right\rangle d\tau \\ &\leq \int_0^1 \tau L \|y - x\|^2 \, d\tau = \frac{L}{2}\|y - x\|^2, \end{aligned}$

where the inequality follows by applying (4) to the points $$x$$ and $$x_\tau = x + \tau(y - x)$$. $$\square$$
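
For a convex quadratic the constant $$L$$ is the largest eigenvalue of the Hessian, so conditions (1)–(4) can be tested numerically. A sketch (my illustration; the quadratic and the test points are random):

```python
import numpy as np

# Check conditions (1)-(4) of Theorem 2.1.5 for f(x) = 0.5 x^T Q x,
# Q symmetric PSD, for which L = lambda_max(Q) and f'(x) = Q x.
rng = np.random.default_rng(5)
M = rng.normal(size=(5, 5))
Q = M @ M.T
L = np.linalg.eigvalsh(Q).max()

f = lambda x: 0.5 * x @ Q @ x
g = lambda x: Q @ x
eps = 1e-9

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    gap = f(y) - f(x) - g(x) @ (y - x)
    d2 = np.sum((x - y) ** 2)
    dg2 = np.sum((g(x) - g(y)) ** 2)
    assert -eps <= gap <= L / 2 * d2 + eps              # (1)
    assert dg2 / (2 * L) <= gap + eps                   # (2), rearranged
    assert dg2 / L <= (g(x) - g(y)) @ (x - y) + eps     # (3)
    assert (g(x) - g(y)) @ (x - y) <= L * d2 + eps      # (4)
print("conditions (1)-(4) verified")
```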

Theorem 2.1.6 A twice continuously differentiable function $$f$$ belongs to $$\mathcal{F}_L^{2,1}(\mathbb{R}^n)$$ if and only if for any $$x \in \mathbb{R}^n$$ we have

$0 \preceq f ^ { \prime \prime } ( x ) \preceq L I _ { n }.$

Proof: Let $$f \in \mathcal{F}_L^{2,1}(\mathbb{R}^n)$$ and denote $$x_\tau = x + \tau s$$, $$\tau > 0$$. According to (4) in Theorem 2.1.5,

$\begin{aligned} \tau L \|s\|^2 &\geq \frac{1}{\tau}\left\langle f'(x + \tau s) - f'(x), x_\tau - x\right\rangle = \left\langle f'\left(x_\tau\right) - f'(x), s\right\rangle \\ &= \left\langle \int_0^\tau f''(x + \lambda s)\, s \, d\lambda, \; s\right\rangle = \int_0^\tau \left\langle f''(x + \lambda s) s, s\right\rangle d\lambda, \end{aligned}$

hence we have

$L \|s\|^2 \geq \frac{1}{\tau} \int_0^\tau \left\langle f''(x + \lambda s) s, s\right\rangle d\lambda,$

and tending $$\tau$$ to 0 we get $$\left\langle f''(x) s, s\right\rangle \leq L\|s\|^2$$ for any $$s$$, i.e.,

$f''(x) \preceq L I_n$

(the lower bound $$f''(x) \succeq 0$$ holds by Theorem 2.1.4, since $$f$$ is convex).

Conversely, if $$0 \preceq f''(x) \preceq L I_n$$, then $$f$$ is convex by Theorem 2.1.4, and

$\begin{aligned} L\|y - x\|^2 &\geq \int_0^1 \left\langle f''(x + \lambda(y - x))(y - x), y - x\right\rangle d\lambda \\ &= \left\langle f'(y) - f'(x), y - x\right\rangle, \end{aligned}$

which is condition (4) of Theorem 2.1.5. $$\square$$

Remark: This theorem states

$f ^ { \prime \prime } ( x ) \preceq L I _ { n } \Leftrightarrow \left\langle f ^ { \prime } \left(y \right) - f ^ { \prime } ( x ) , y-x \right\rangle \leq L \|y - x\|^2$

Similarly, we could prove

$f ^ { \prime \prime } ( x ) \succeq \mu I _ { n } \Leftrightarrow \left\langle f ^ { \prime } \left(y \right) - f ^ { \prime } ( x ) , y-x \right\rangle \geq \mu \|y - x\|^2.$
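
In the quadratic case the remark is transparent: for $$f(x) = \frac{1}{2}\langle Qx, x\rangle$$ the Hessian is $$Q$$, so $$\mu$$ and $$L$$ are its extreme eigenvalues. A sketch (my illustration, with a random positive definite $$Q$$):

```python
import numpy as np

# For f(x) = 0.5 x^T Q x, mu and L are the extreme eigenvalues of Q,
# and mu ||y-x||^2 <= <f'(y) - f'(x), y-x> <= L ||y-x||^2.
rng = np.random.default_rng(6)
M = rng.normal(size=(4, 4))
Q = M @ M.T + np.eye(4)                  # strictly positive definite
eigs = np.linalg.eigvalsh(Q)
mu, L = eigs[0], eigs[-1]

for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    inner = (Q @ (y - x)) @ (y - x)      # <f'(y) - f'(x), y - x>
    d2 = np.sum((y - x) ** 2)
    assert mu * d2 - 1e-9 <= inner <= L * d2 + 1e-9
print("two-sided bound verified")
```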

### 2.1.2 Lower complexity bounds for $$\mathcal{F}_L^{\infty,1}\left(\mathbb{R}^n\right)$$

Assumption 2.1.4 An iterative method $$\mathscr{M}$$ generates a sequence of test points $$\{ x_k \}$$ such that

$x_{k} \in x_{0}+\operatorname{Lin}\left\{\nabla f\left(x_{0}\right), \ldots, \nabla f\left(x_{k-1}\right)\right\}, \quad k \geq 1$

Consider the following family of quadratic functions with a fixed constant $$L > 0$$:

$f_k (x) = \frac{L}{4}\left[ \langle x, A_k x \rangle - \langle x, e_1 \rangle \right]$

where $$A_k$$ is a tridiagonal matrix (the original figure is not reproduced here; the form below is reconstructed from the optimality condition and the Hessian bound used in the sequel):

$A_{k}=\begin{pmatrix} B_{k} & 0 \\ 0 & 0 \end{pmatrix}, \qquad B_{k}=\begin{pmatrix} 1 & -\frac{1}{2} & & \\ -\frac{1}{2} & 1 & \ddots & \\ & \ddots & \ddots & -\frac{1}{2} \\ & & -\frac{1}{2} & 1 \end{pmatrix} \in \mathbb{R}^{k \times k}.$

Since the leading block of $$A_k$$ has diagonal 1 and off-diagonal entries $$-\frac{1}{2}$$, Gershgorin's theorem gives $$0 \preceq A_k \preceq 2 I_n$$, and $$\nabla^2 f_k(x) = \frac{L}{2} A_k$$. Thus $$0 \preceq \nabla^{2} f_{k}(x) \preceq L I_{n}$$ and $$f_{k}(\cdot) \in \mathscr{F}_{L}^{\infty, 1}\left(\mathbb{R}^{n}\right), 1 \leq k \leq n$$. Therefore, the equation

$\nabla f_{k}(x)= \frac{L}{4}\left(2 A_{k} x - e_{1}\right) = 0$

has the solution $$\bar{x}_k$$ with coordinates

$\bar{x}_{k}^{(i)}=\left\{\begin{array}{cl}{1-\frac{i}{k+1},} & {i=1, \ldots, k} \\ {0,} & {k+1 \leq i \leq n}\end{array}\right.$
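
A quick sanity check of the reconstructed $$A_k$$ (my sketch): build the matrix, confirm $$2A_k\bar{x}_k = e_1$$, and confirm that the eigenvalues of $$\nabla^2 f_k = \frac{L}{2}A_k$$ lie in $$[0, L]$$.

```python
import numpy as np

# Reconstructed A_k: leading k x k tridiagonal block with 1 on the
# diagonal and -1/2 on the adjacent diagonals, zeros elsewhere.
def A(k, n):
    B = np.eye(n)
    B[np.arange(n - 1), np.arange(1, n)] = -0.5
    B[np.arange(1, n), np.arange(n - 1)] = -0.5
    B[k:, :] = 0.0
    B[:, k:] = 0.0
    return B

k, n, L = 4, 9, 1.0
i = np.arange(1, n + 1)
x_bar = np.where(i <= k, 1 - i / (k + 1), 0.0)   # closed-form solution
e1 = np.eye(n)[0]

print(np.allclose(2 * A(k, n) @ x_bar, e1))      # True
hess_eigs = np.linalg.eigvalsh(L / 2 * A(k, n))
print(hess_eigs.min() >= -1e-12, hess_eigs.max() <= L + 1e-12)  # True True
```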

Hence, the optimal value of the function $$f_k$$ is

$\begin{aligned} f_k^* &= \frac{L}{4}\left[\left\langle \bar{x}_k, A_k \bar{x}_k\right\rangle - \left\langle \bar{x}_k, e_1\right\rangle\right] = \frac{L}{4}\left[\left\langle \bar{x}_k, \tfrac{1}{2} e_1\right\rangle - \left\langle \bar{x}_k, e_1\right\rangle\right] \\ &= -\frac{L}{8}\,\bar{x}_k^{(1)} = \frac{L}{8}\left(-1 + \frac{1}{k+1}\right). \end{aligned}$

Note that

$\sum_{i=1}^{k} i^{2}=\frac{k(k+1)(2 k+1)}{6} \leq \frac{(k+1)^{3}}{3}$

Therefore,

$\begin{aligned}\left\|\bar{x}_{k}\right\|^{2} &=\sum_{i=1}^{n}\left(\bar{x}_{k}^{(i)}\right)^{2}=\sum_{i=1}^{k}\left(1-\frac{i}{k+1}\right)^{2} \\ &=k-\frac{2}{k+1} \sum_{i=1}^{k} i+\frac{1}{(k+1)^{2}} \sum_{i=1}^{k} i^{2} \\ & \leq k-\frac{2}{k+1} \cdot \frac{k(k+1)}{2}+\frac{1}{(k+1)^{2}} \cdot \frac{(k+1)^{3}}{3}=\frac{1}{3}(k+1). \end{aligned}$
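
The closed form for $$\bar{x}_k$$ makes this norm bound easy to confirm numerically (a minimal sketch of mine):

```python
import numpy as np

# ||x_bar_k||^2 <= (k+1)/3, checked from the closed-form coordinates.
for k in [1, 5, 10, 100]:
    x_bar = 1 - np.arange(1, k + 1) / (k + 1)
    assert np.sum(x_bar ** 2) <= (k + 1) / 3
print("norm bound verified")
```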

Let $$\mathbb{R}^{k, n}=\left\{x \in \mathbb{R}^{n} \vert x^{(i)}=0, k+1 \leq i \leq n\right\}$$. Then for any $$x \in \mathbb{R}^{k, n}$$ we have

$f_{p}(x) = f_{k}(x), \quad p=k, \ldots, n$

since for $$x \in \mathbb{R}^{k, n}$$ the value $$\langle x, A_p x \rangle$$ involves only the leading $$k \times k$$ block of $$A_p$$, which coincides with that of $$A_k$$.

Now let us fix $$p, k \leq p \leq n$$.

Lemma 2.1.5 Let $$x_0 = 0$$. Then for any sequence $$\left\{x_{k}\right\}_{k=0}^{p}$$ satisfying the condition

$x_{k} \in \mathscr{L}_{k} \stackrel{\mathrm{def}}{=} \operatorname{Lin}\left\{\nabla f_{p}\left(x_{0}\right), \ldots, \nabla f_{p}\left(x_{k-1}\right)\right\}$

we have $$\mathscr{L}_{k} \subseteq \mathbb{R}^{k, n}$$.

Proof: Since $$x_0 = 0$$, we have $$\nabla f_{p}\left(x_{0}\right)=-\frac{L}{4} e_{1} \in \mathbb{R}^{1, n}$$. Thus $$\mathscr{L}_1 = \mathbb{R}^{1,n}$$. Let $$\mathscr{L}_{k} \subseteq \mathbb{R}^{k, n}$$ for some $$k < p$$. Since $$A_p$$ is tridiagonal, for any $$x \in \mathbb{R}^{k,n}$$ we have $$\nabla f_{p}(x) \in \mathbb{R}^{k+1, n}$$. Therefore $$\mathscr{L}_{k+1} \subseteq \mathbb{R}^{k+1, n}$$. $$\square$$

Remark: $$A_p x \in \mathbb{R}^{k+1, n}$$ for $$x \in \mathbb{R}^{k, n}$$ and $$p \geq k$$; indeed, the only entry of $$A_p$$ coupling coordinate $$k+1$$ to the first $$k$$ coordinates is the one in row $$k+1$$, column $$k$$.

Corollary 2.1.1 For any sequence $$\{x_k\}_{k=0}^p$$ with $$x_0 = 0$$ and $$x_k \in \mathscr{L}_{k}$$, we have

$f_{p}\left(x_{k}\right) \geq f_{k}^{*}.$

Proof: Since $$x_k \in \mathscr{L}_{k} \subseteq \mathbb{R}^{k, n}$$, we have $$f_{p}\left(x_{k}\right)=f_{k}\left(x_{k}\right) \geq f_{k}^{*}$$. $$\square$$
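
The mechanism behind Lemma 2.1.5 and Corollary 2.1.1 can be watched directly: gradient descent on $$f_p$$ started from $$x_0 = 0$$ never leaves $$\mathbb{R}^{k,n}$$ at iteration $$k$$, because each step can activate at most one new coordinate. A sketch (my illustration, reusing the reconstructed $$A_k$$ from above):

```python
import numpy as np

# Gradient descent on f_p from x_0 = 0 stays in R^{k,n} at iteration k.
def A(k, n):
    B = np.eye(n)
    B[np.arange(n - 1), np.arange(1, n)] = -0.5
    B[np.arange(1, n), np.arange(n - 1)] = -0.5
    B[k:, :] = 0.0
    B[:, k:] = 0.0
    return B

n = p = 12
L = 1.0
Ap, e1 = A(p, n), np.eye(n)[0]
grad = lambda x: L / 4 * (2 * Ap @ x - e1)

x = np.zeros(n)
for k in range(1, 6):
    x = x - grad(x) / L                   # step size 1/L
    assert np.allclose(x[k:], 0.0)        # x_k lies in R^{k,n}
    print(k, "nonzero coords:", np.count_nonzero(np.abs(x) > 1e-15))
```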

Theorem 2.1.7 For any $$k, 1 \leq k \leq \frac{1}{2}(n-1)$$, and any $$x_{0} \in \mathbb{R}^{n}$$ there exists a function $$f \in \mathscr{F}_{L}^{\infty, 1}\left(\mathbb{R}^{n}\right)$$ such that for any first-order method $$\mathscr{M}$$ satisfying Assumption 2.1.4 we have

$\begin{array}{l}{f\left(x_{k}\right)-f^{*} \geq \frac{3 L\left\|x_{0}-x^{*}\right\|^{2}}{32(k+1)^{2}}}, \\ {\left\|x_{k}-x^{*}\right\|^{2} \geq \frac{1}{8}\left\|x_{0}-x^{*}\right\|^{2}},\end{array}$

where $$x^*$$ is the minimizer of the function $$f$$ and $$f^* = f(x^*)$$.
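
As an empirical companion to Theorem 2.1.7 (my sketch, not a proof): gradient descent satisfies Assumption 2.1.4, so running it on the resisting function $$f_{2k+1}$$ with $$n = 2k+1$$ must leave a gap $$f(x_k) - f^*$$ above the stated lower bound.

```python
import numpy as np

# Gradient descent on f_{2k+1} from x_0 = 0: the gap f(x_k) - f* must
# stay above 3 L ||x_0 - x*||^2 / (32 (k+1)^2) by Theorem 2.1.7.
def A(k, n):
    B = np.eye(n)
    B[np.arange(n - 1), np.arange(1, n)] = -0.5
    B[np.arange(1, n), np.arange(n - 1)] = -0.5
    B[k:, :] = 0.0
    B[:, k:] = 0.0
    return B

L, k = 1.0, 6
p = n = 2 * k + 1
Ap, e1 = A(p, n), np.eye(n)[0]
f = lambda x: L / 4 * (x @ Ap @ x - x @ e1)
grad = lambda x: L / 4 * (2 * Ap @ x - e1)

x_star = 1 - np.arange(1, n + 1) / (p + 1)       # minimizer of f_p
f_star = L / 8 * (-1 + 1 / (p + 1))

x = np.zeros(n)
for _ in range(k):
    x = x - grad(x) / L
bound = 3 * L * np.sum(x_star ** 2) / (32 * (k + 1) ** 2)
print(f(x) - f_star >= bound)                    # True
```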