Friday, 15 January 2021 12:00 Written by

How to choose the right optimization method? Jupyter Notebook

At JCMwave we perform optimizations of photonic devices such as light sources, metamaterials, or solar cells on a regular basis using JCMsuite's analysis and optimization toolkit. In this blog post we want to share some insights in common optimization methods and the question which method is the best for which problem. We focus on problems of shape optimization with a moderate number of parameters ranging from 1 to say 100. That is, many methods designed for problems with a very large number of parameters (such as topology optimization or optimization of neural networks) are not considered. Moreover, we assume that the objective function is sufficiently smooth (at least one time differentiable) and can be observed without noise. This is the usual case for the simulation-based shape optimization of physical systems.

Without loss of generality, we consider minimization problems.

Given a $d$-dimensional parameter space $\mathcal{X} \subset \mathbb{R}^d$ and an objective function $f:\mathcal{X}\rightarrow\mathbb{R}$ find parameter values that minimize the objective, i.e. find $$ \mathbf{x}_{\rm min} = {\rm argmin}_{\mathbf{x}\in\mathcal{X}}f(\mathbf{x})$$

First, we give a short introduction and discussion of several relevant optimization algorithms: L-BFGS-B, downhill simplex, differential evolution, and Bayesian optimization. If you are not interested in the background of these algorithms, you might scroll down to the decision tree on the selection of the best optimization method.

Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm

The BFGS algorithm is a second-order method that models the objective function locally by a second-order Taylor expansion. At iteration $k$ at position $\mathbf{x}_k$ it tries to find a step $\mathbf{p}$ to a local minimum of $f$, i.e. $\nabla f(\mathbf{x}_{k}+\mathbf{p})=0$.

Consider the first-order Taylor expansion of the gradient $$\nabla f(\mathbf{x}_k + \mathbf{p}) \approx \nabla f(\mathbf{x}_k) + \mathbf{H}\mathbf{p}, \;\text{where}\; (\mathbf{H})_{i j} = \left.\frac{\partial^2 f}{\partial x_i \partial x_j}\right|_{(\mathbf{x}=\mathbf{x}_k)}. $$ If $f$ is strongly convex $f$ around $\mathbf{x}_{k}$, the Hessian $\mathbf{H}$ is a positive definite matrix that can be inverted. Then, the step evaluates to $$ \mathbf{p} \approx - \mathbf{H}^{-1} \nabla f(\mathbf{x}_k).$$

As long as the objective is not a quadratic function, the step determined by the second-order expansion will not hit a local minimum. Therefore, the algorithm performs a line search in the direction of $\mathbf{p}$ to minimize $f$. That is the next point $\mathbf{x}_{k+1}$ is found by minimizing $f(\mathbf{x}_k + \gamma\mathbf{p})$ over the scalar $\gamma > 0$.

The gradient $\nabla f(\mathbf{x}_k)$ can be approximated by finite differences. However, this is often computationally very expensive. The function $f$ has to be additionally evaluated at small distances $f(\mathbf{x}_k + \delta\mathbf{e}_1),f(\mathbf{x}_k + \delta\mathbf{e}_2),\dots,f(\mathbf{x}_k + \delta\mathbf{e}_d)$. Moreover, a higher precision of the numerical evaluation of $f$ is required to also determine differences of function values with sufficiently high precision. Therefore, the BFGS algorithm should be rather applied, if the derivatives can be determined along with the function value. The FEM solver of JCMsuite, for example, computes derivatives using fast and precise automatic differentiations of all computation steps.

Approximation of the inverse of the Hessian matrix

The exact determination of the Hessian is often numerically too involving. Even when using automatic differentiation, at every computational step $d(d+1)/2$ additional operations have to be performed. Therefore the inverse Hessian $\mathbf{H}^{-1}$ is build up iteratively using the secant equation $$ \nabla f(\mathbf{x}_{k+1}) = \nabla f(\mathbf{x}_{k}) + \mathbf{H}_{k+1}(\mathbf{x}_{k+1}-\mathbf{x}_{k}).$$

The equation doesn't determine $\mathbf{H}_{k+1}^{-1}$ completely. By imposing that the update $\mathbf{H}_{k+1}^{-1}$ of $\mathbf{H}_{k}^{-1}$ leads to a symmetric matrix that fulfills the secant equation and is minimal in the Eucledian norm, i.e. $\min_{\mathbf{H}_{k+1}^{-1}}\left\lVert\mathbf{H}_{k+1}^{-1}-\mathbf{H}_{k}^{-1}\right\rVert$ one obtains the BFGS update rule

$$ \mathbf{H}_{k+1}^{-1} = \left(I - \frac{\mathbf{s} \mathbf{y}^T}{\mathbf{y}^T \mathbf{s}} \right) \mathbf{H}_{k}^{-1} \left(I - \frac{\mathbf{y} \mathbf{s}^T}{\mathbf{y}^T \mathbf{s}} \right) + \frac{\mathbf{s} \mathbf{s}^T}{\mathbf{y}^T \mathbf{s}} $$

with $\mathbf{y} = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_{k})$ and $\mathbf{s} = \mathbf{x}_{k+1}-\mathbf{x}_{k}$.

In order to visualize the behavior of the algorithm, we consider two dimensional Himmelblau's function $f(x, y) = (x^2+y-11)^2 + (x+y^2-7)^2$, which has four local minima. The animation below shows that after each line search the search continues into a new direction.

In [7]:
%matplotlib notebook
import scipy.optimize
from utils import OptimizationVisualizer
from IPython.display import HTML, clear_output

ov = OptimizationVisualizer(jac=True)
anim = ov.animate()