Nesterov

Crazy brain melting math

I gave a presentation in a group meeting briefly reviewing the lower complexity bound of first-order convex optimization, and how Nesterov proceeded to match that bound using estimate sequences; the slides are here.

Consider an unconstrained optimization problem: $$ \min_{x \in \mathbb{R}^n} f(x) . $$ Here, $f \in \mathcal{C}^1$ is a convex, $L$-Lipschitz smooth function. We can solve this problem with first-order methods, i.e., iterations of the form:

$$ x_{k} \in x_{0} + \operatorname{Span} \left \{f^{\prime} \left (x_{0} \right ), \ldots, f^{\prime} \left (x_{k-1} \right )\right \}. $$
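
For concreteness, here is a minimal Python sketch of the simplest member of this family, plain gradient descent with step size $1/L$ (the function and argument names are mine); each iterate clearly stays inside the span above:

```python
import numpy as np

def gradient_descent(grad, x0, L, n_steps):
    """Plain gradient descent with step size 1/L.

    Every iterate lies in x0 + span{f'(x0), ..., f'(x_{k-1})},
    so this is a first-order method in the sense above.
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        x = x - grad(x) / L  # move along the newest gradient
    return x
```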

The question is what their fundamental limit is, and how to achieve it.

For fundamental limits, Nesterov constructed a quadratic function whose gradient at any query point reveals only one new coordinate, so that once the dimensionality of the problem becomes large relative to the number of iterations, any first-order method makes slow progress: after $k$ iterations the error is lower bounded by $\Omega(1/k^2)$, provided the dimension $n$ is at least roughly $2k+1$.
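
A sketch of that construction, Nesterov's "worst function in the world", in its standard tridiagonal form (the code and helper names are mine); note how, starting from zero, each gradient query can light up at most one new coordinate:

```python
import numpy as np

def worst_case(n, L):
    """Nesterov's worst-case quadratic: f(x) = L/4 * (x'Ax/2 - x[0]),
    with A tridiagonal (2 on the diagonal, -1 next to it).

    The gradient at x is L/4 * (A x - e1); starting from x = 0, each
    gradient query reveals at most one new coordinate.
    """
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    f = lambda x: L / 4 * (0.5 * x @ A @ x - x[0])
    grad = lambda x: L / 4 * (A @ x - np.eye(n)[0])
    return f, grad

f, grad = worst_case(n=100, L=1.0)
x = np.zeros(100)
for k in range(5):
    x = x - grad(x)                    # one gradient step
    print(k + 1, np.count_nonzero(x))  # prints 1, 2, 3, 4, 5
```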

He then came up with an algorithm to match the lower bound. Despite the complexity of the algorithm itself, the idea is not that (actually it is) complicated:

We use a sequence of pairs $\{ (\lambda_k, \phi_k) \}$, where each $\phi_k$ is a quadratic and $\lambda_k \in [0, 1]$ a weight, to estimate the function along the iterates $x_k$. The pairs have two properties (a standard construction follows the list):

  • They lie below a convex combination of the true function and the initial estimate. $$ \phi_{k}(x) \leq (1-\lambda_{k} ) f(x)+\lambda_{k} \phi_{0}(x) $$
  • The weight on the initial estimate vanishes as the algorithm progresses, so in the limit they lower-bound $f$ itself. $$ \lambda_k \to 0 $$
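
A standard way to build such a pair (up to notation, this is Lemma 2.2.2 in Nesterov's book): pick any quadratic $\phi_0$ with $\lambda_0 = 1$, and at each step mix the current estimate with the linear lower bound that the gradient at a query point $y_k$ provides:

$$ \phi_{k+1}(x) = \left (1-\alpha_{k} \right ) \phi_{k}(x) + \alpha_{k} \left [f \left (y_{k} \right ) + \left \langle f^{\prime} \left (y_{k} \right ), x-y_{k} \right \rangle \right ], \qquad \lambda_{k+1} = \left (1-\alpha_{k} \right ) \lambda_{k}, $$

with $\alpha_k \in (0, 1)$ and $\sum_k \alpha_k = \infty$. The bracketed term lower-bounds $f$ by convexity, so both properties follow by induction, and each $\phi_{k+1}$ remains quadratic.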

So instead of minimizing the function itself, we minimize $\phi_k$ at each step, which is easy because $\phi_k$ is quadratic. If we also maintain $f(x_k) \leq \min_x \phi_k(x)$ along the way, then $f(x_k) - f^{\star} \leq \lambda_k \left ( \phi_0(x^{\star}) - f^{\star} \right )$, and arranging $\lambda_k = \mathcal{O}(1/k^2)$ matches the lower bound!
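
Carrying this program out for an $L$-smooth convex $f$ yields (one common form of) Nesterov's accelerated gradient method. A minimal sketch in Python; `grad`, `x0`, and `n_steps` are placeholders as before:

```python
import numpy as np

def nesterov_agm(grad, x0, L, n_steps):
    """Nesterov's accelerated gradient method (1983-style momentum).

    Guarantees f(x_k) - f* = O(L ||x0 - x*||^2 / k^2),
    matching the lower bound up to a constant.
    """
    x = np.asarray(x0, dtype=float).copy()
    y, t = x.copy(), 1.0
    for _ in range(n_steps):
        x_next = y - grad(y) / L                      # gradient step at the extrapolated point
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2     # momentum schedule
        y = x_next + (t - 1) / t_next * (x_next - x)  # extrapolation ("momentum")
        x, t = x_next, t_next
    return x
```

On the worst-case function above, this closes the gap at the $1/k^2$ rate, while plain gradient descent only manages $1/k$.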

There are of course a bunch of alternative interpretations of this method; one that I found quite intriguing is the dual perspective1, which views every accelerated gradient step as an optimal coupling of a primal (gradient) step and a dual (mirror) step.
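
For the curious, here is a minimal sketch of that coupling in the Euclidean case, where the mirror step reduces to a plain subtraction; the step sizes follow the paper's choice as I remember it, so treat the constants as indicative rather than authoritative:

```python
import numpy as np

def linear_coupling(grad, x0, L, n_steps):
    """Accelerated gradient via linear coupling, Euclidean case.

    Couples a gradient-descent (primal) step with a mirror-descent
    (dual) step; with the mirror map ||.||^2 / 2, the mirror step
    is a plain subtraction.
    """
    y = np.asarray(x0, dtype=float).copy()  # primal iterate (gradient steps)
    z = y.copy()                            # dual iterate (mirror steps)
    for k in range(n_steps):
        alpha = (k + 2) / (2 * L)     # growing mirror step size
        tau = 1 / (alpha * L)         # coupling weight, tau = 2/(k+2)
        x = tau * z + (1 - tau) * y   # couple the two iterates
        g = grad(x)
        y = x - g / L                 # primal: gradient step
        z = z - alpha * g             # dual: mirror step
    return y
```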

Another that I know of is Bubeck's paper. You will have to see the geometric interpretation for yourself; it basically says that the ball enclosing the intersection of two shrinking balls shrinks faster than a single ball would. Personally, I find the YouTube ‘‘momentum videos’’ not super helpful…


  1. Allen-Zhu, Z. and Orecchia, L., 2014. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537. ↩︎

