Wednesday, 25 August 2021 12:00 Written by

Parameter Reconstruction 3: Bayesian Least-Square Reconstruction Jupyter Notebook

This is the third blog post in a series dedicated to parameter reconstruction. For an introduction of the basic concepts, please see this fist blog post.

In this post we will apply a very efficient method to find parameter values $p_1,\dots,p_M$ that fit best to a set of measurements $y^*_1,\dots,y^*_K$. As in the previous post, the physical measurement process is modeled by a vectorial function \begin{equation} \mathbf{f}: \mathbf{p} \in \mathcal{X}\subset\mathbb{R}^M \mapsto \mathbf{y}\in\mathbb{R}^K \end{equation} that maps the $M$-dimensional parameter vector $\mathbf{p}$ to a $K$-dimensional output vector $\mathbf{y}$ that can be measured with some assumed Gaussian added noise with variances $\varepsilon_1^2,\dots,\varepsilon_K^2$.

As a concreteness example we consider again the scatterometry setup introduced in the first blog post. Monochromatic polarized UV light with a wavelength of $\lambda = 266\,{\rm nm}$ hits a periodic lamellar grating at a specific angle $\theta,\phi$. The light is reflected from the sub-wavelength grating and the intensities are measured experimentally. The grating is parameterized by a vector $\mathbf{p} = [cd,h,swa,t,r_{\rm top},r_{\rm bot}]^T$ with six entries (see image below).

In the fist post we have also introduced point estimates $\mathbf{p}_{\rm ML}$ of the most likely model parameters given a vector of measurements $\mathbf{y}^*$:

\begin{equation} \mathbf{p}_{\rm ML} = \underset{\mathbf{p}}{\rm arg\,min} \chi^2(\mathbf{p})\;\text{with}\; \chi^2(\mathbf{p}) = \sum_{i=1}^K \left(\frac{y^*_i - f_i(\mathbf{p})}{\varepsilon_i}\right)^2 \end{equation}

How to solve the minimization problem

The problem of minimizing the function $\chi^2(\mathbf{p})$ can be tackled with several approaches. In many fields of engineering the function $\mathbf{f}(\mathbf{p})$ that models the complete measurement process is very expensive to evaluate. Therefore, the minimization algorithm should use as few function evaluations as possible. For our considered example, the evaluation of $\mathbf{f}(\mathbf{p})$ requires to solve Maxwell's equations for determining the scattering process of the incident light on a nano-structure for many angles $\theta,\phi$. Depending on the computer hardware and degree of parallelization the computations can take from some seconds to more than a minute for each parameter set $\mathbf{p}$.

In a previous blog post we have discussed the application fields of several different optimization algorithms - downhill simplex, L-BFGS-B, differential evolution, and Bayesian optimization. Since we have typically a medium number $M<30$ of parameters to reconstruct and the forward model requires simulation times of typically more than 5 seconds, Bayesian optimization is the best fitting method to minimize $\chi^2(\mathbf{p})$. Bayesian optimization uses statistical inference (Gaussian process regression) to predict function values for unseen parameters based on all previous observations of the target function. For an introduction see the blog posts on Gaussian process regression and Bayesian optimization.

However, since we are dealing with a special kind of minimization problem - namely a non-linear least-square problem, there exist more specialized algorithms such as the Gauss-Newton algorithm and the Levenberg-Marquardt algorithm. Both of them exploit the underlying quadratic structure of $\chi^2(\mathbf{p})$. In order to determine the next sampling point with a hopefully smaller value of $\chi^2(\mathbf{p})$, they expand the model function locally up to linear order. Both of these methods converge quite well. However they have two limitations: (1) They can converge into a non-optimal local minimum and miss the global minimum. (2) The local linear approximation limits the accuracy and efficiency of the approaches.

In the following we consider yet another approach, Bayesian least-square optimization, that combines the advantages of Bayesian optimization (accurate statistical inference, fast global convergence) and the Gauss-Newton and Levenberg-Marquardt approach (exploitation of the underlying quadratic structure of $\chi^2(\mathbf{p})$).

The idea of the approach is to use $K$ Gaussian processes ${\rm GP}_1,\dots,{\rm GP}_K$ that are trained with previous evaluations of the channel functions $f_1(\mathbf{p}),\dots,f_K(\mathbf{p})$. Based on the Gaussian process predictions, the next sampling point is determined that minimizes $\chi^2(\mathbf{p})$ with high probability (see figure below). The approach is based on a method introduced by A. K. Uhrenholt and B. S. Jensen PMLR 2661 (2019). Details on the application of the approach for nano metrology were discussed in M. Plock et al. Proc. SPIE 11783 (2021).

Idea of Bayesian least-square optimization:
  1. Gaussian processes ${\rm GP}_1,\dots,{\rm GP}_K$ are trained with function values for each channel $1,\dots,K$ and for each previously evaluated parameter vector $\mathbf{p}_1,\dots,\mathbf{p}_j$.
  2. For any parameter vector $\mathbf{p}$, the Gaussian processes make predictions in form of $K$ Gaussian random variables $\hat y_1(\mathbf{p}),\dots,\hat y_K(\mathbf{p})$ (specified by mean and variance).
  3. Based on the measurement $\mathbf{y}^*$ and the measurement noise variances $\varepsilon_1^2,\dots,\varepsilon_K^2$ the value of $\chi^2(\mathbf{p})$ can be predicted in form of a random variable that follows a non-central chi-square distribution.
  4. Based on this distribution one determines the next parameter vector $\mathbf{p}_{j+1}$ that maximizes some acquisition function $\alpha(\hat\chi^2)$ (e.g. the lower confidence bound) and as such minimizes $\chi^2(\mathbf{p})$ with high probability.

Scatterometry example

In the following we try to reconstruct the parameters of the line grating shown above based on a set of measurements obtained at the PTB, the National Metrology Institute of Germany. Details of the measurement configuration were reported by M. Wurm et al. Meas. Sci. Technol. 22, 094024 (2011). The measurement set contains intensities for 42 $\theta$ angles between 5° and 87° for $\phi=0^\circ$ and $\phi=90$°. Both sets were obtained for both S and P polarization.

The corresponding model function is implemented in a python package ForwardProblem that comes with this Jupyter notebook. The package contains also the experimental data and a function to run FEM simulations of the scattering process.

In [1]:
from forward_problem import ForwardProblem

#Initialize forward problem using 4 simulations in parallel. Decrease this value to the number of available cores
m = ForwardProblem(Multiplicity=4) 

#Measurement intensities and uncertainties
intensities_meas, uncertainties, thetas = m.exp_data()

#Evaluation of the forward model for a specific parameter set
intensities_exp = m.model_eval(keys=dict(cd=25, h=50, swa=90, t=3, r_top=10, r_bot=10))

The following plot shows the measured scattered intensities including their estimated measurement error (colored error bars) as well as the prediction of the intensities by a numerical simulation of the scattering process using JCMsuite for $cd= 25\,{\rm nm}$, $h = 50\,{\rm nm}$, $swa = 90^\circ$, $t = 3\,{\rm nm}$, $r_{\rm top} = 10\,{\rm nm}$, and $r_{\rm bot} = 10\,{\rm nm}$ (lines).

In [2]:
from IPython.display import set_matplotlib_formats
import matplotlib.pyplot as plt
%matplotlib inline
set_matplotlib_formats('svg') # render plots as svg

plt.figure(figsize=(8, 4))

#plot measurements
plt.errorbar(thetas, intensities_meas['S']['phi_0'],yerr=uncertainties['S']['phi_0'],c='r',fmt='_')
plt.errorbar(thetas, intensities_meas['S']['phi_90'],yerr=uncertainties['S']['phi_90'],c='g',fmt='_')
plt.errorbar(thetas, intensities_meas['P']['phi_0'],yerr=uncertainties['P']['phi_0'],c='b',fmt='_')
plt.errorbar(thetas, intensities_meas['P']['phi_90'],yerr=uncertainties['P']['phi_90'],c='y',fmt='_')

#plot model function values
plt.plot(thetas, intensities_exp['S']['phi_0'],'-',c='r',label='S-polarization $\phi=0$')
plt.plot(thetas, intensities_exp['S']['phi_90'],'-',c='g',label='S-polarization $\phi=90$°')
plt.plot(thetas, intensities_exp['P']['phi_0'],'-',c='b',label='P-polarization $\phi=0$')
plt.plot(thetas, intensities_exp['P']['phi_90'],'-',c='y',label='P-polarization $\phi=90$°')

plt.xlabel(r'Inclination $\theta$ (°)')
plt.ylabel('Scattered intensity (normalized)')
plt.legend()
plt.show()
2021-08-25T15:42:22.740147 image/svg+xml Matplotlib v3.3.4, https://matplotlib.org/

As we see, the chosen parameter values are probably incorrect since the model values are far outside of the uncertainty intervals of the measurement results. Only the measurements for P-polarization and $\phi=0^\circ$ agree reasonably to the model. However, in the previous blog post we saw that these measurements hardly carry any information about the parameter values.

Maximum-likelihood parameter values

So let's find the parameter set with maximum likelihood (i.e. minimum $\chi^2$). First, we create a client object to communicate with the Analysis and Optimization Toolkit, which is automatically started in the background.

In [3]:
import sys,os
sys.path.insert(0,os.path.join(os.getenv('JCMROOT'), 'ThirdPartySupport', 'Python'))
import jcmwave

client = jcmwave.optimizer.client(port=4554)

We define the search domain for the parameters. For later use, we also specify a list of parameter names in LaTeX format.

In [4]:
domain = [
 {'name': 'cd', 'domain': [15, 35]},
 {'name': 'h', 'domain': [40, 60]},
 {'name': 'swa', 'domain': [84, 90]},
 {'name': 't', 'domain': [2.5, 6.5]},
 {'name': 'r_top', 'domain': [3, 18]},
 {'name': 'r_bot', 'domain': [0, 15]},
 {'name': 'pitch', 'type': 'fixed', 'domain': 50}
]

param_names = {dom['name'] : '${' + '}_{\\rm '.join(dom['name'].split('_')) + '}$' for dom in domain[:-1]}

Not for all combinations of parameters in the domain, a valid geometry is defined. For example, the thickness of the oxide layer $t$ cannot be larger than the top radius $r_{\rm top}$, which itself cannot be larger than half of the top width of the structure $\frac{1}{2}\left(cd + \frac{h}{\tan(swa)}\right)$. In total four constraints must be fulfilled that are defined by functions that are negative iff the constraints are fulfilled.

In [5]:
constraints = [
 {'name': 'r_top_t', 'constraint': 't-r_top+1'},
 {'name': 'r_top', 'constraint': 'r_top - cd/2 + h/2/tan(Pi*swa/180) + 1'},
 {'name': 'r_bot', 'constraint': 'r_bot - pitch/2 + cd/2 + h/2/tan(Pi*swa/180) + 1'},
 {'name': 'r_bot_t', 'constraint': 'r_bot + t - pitch/2 + cd/2 + h/2/tan(Pi*swa/180) - t*tan(Pi*swa/2/180) + 1'}
]

The constrained search domain can be regarded as a flat prior-distribution, i.e. we assume that the probability of finding parameter values is constant inside the constrained domain and zero outside.

With the information about the domain and the constraints, a new Study object for the BayesLeastSquare driver can be created.

In [6]:
study = client.create_study(domain=domain, constraints=constraints,
                            driver="BayesLeastSquare",
                            name="Parameter reconstruction",
                            save_dir=os.getcwd(),
                            study_id='parameter_reconstruction')
2021-08-25 15:42:25: The dashboard is accessible via http://localhost:4554/dashboard/parameter_reconstruction

We can configure the study by calling study.set_parameters(). For a least-quare optimization we need to specify the target vector of measurements and the vector of measurement uncertainties. Moreover, we define the optimization budget max_iter=50.

In [7]:
intensities, uncertainties, thetas = m.exp_data()
target_vector = [
        *intensities["S"]["phi_0"],
        *intensities["S"]["phi_90"],
        *intensities["P"]["phi_0"],
        *intensities["P"]["phi_90"],       
    ]
uncertainty_vector = [
        *uncertainties["S"]["phi_0"],
        *uncertainties["S"]["phi_90"],
        *uncertainties["P"]["phi_0"],
        *uncertainties["P"]["phi_90"],       
    ]
study.set_parameters(max_iter=50,target_vector=target_vector,uncertainty_vector=uncertainty_vector)

The model function itself is implemented in a python package ForwardProblem that comes with this Jupyter notebook. For example, it contains the experimental data and a function to run FEM simulations of the scattering process.

Using the package, we can define the objective function of the study that returns an Observation object with a list of all intensities of the reflected light, i.e. the list $[f_1(\mathbf{p}),\dots,f_k(\mathbf{p})]$.

In [8]:
def objective(**kwargs):
    observation = jcmwave.optimizer.Observation()
    intensities = m.model_eval(kwargs)
    observation.add([
        *intensities["S"]["phi_0"],
        *intensities["S"]["phi_90"],
        *intensities["P"]["phi_0"],
        *intensities["P"]["phi_90"],       
    ])
    return observation

study.set_objective(objective)

Now, we can ran the actual least-square minimization.

In [9]:
study.run()

After the study is finished, we can retrieve the parameter with the minimal found $\chi^2$ value by calling study.info()['min_params']. Let's see if the least-square parameter values lead to model function values that agree with the measurements.

In [10]:
intensities_exp = m.model_eval(keys=study.info()['min_params'])
plt.figure(figsize=(8, 4))

#plot measurements
plt.errorbar(thetas, intensities_meas['S']['phi_0'],yerr=uncertainties['S']['phi_0'],c='r',fmt='_')
plt.errorbar(thetas, intensities_meas['S']['phi_90'],yerr=uncertainties['S']['phi_90'],c='g',fmt='_')
plt.errorbar(thetas, intensities_meas['P']['phi_0'],yerr=uncertainties['P']['phi_0'],c='b',fmt='_')
plt.errorbar(thetas, intensities_meas['P']['phi_90'],yerr=uncertainties['P']['phi_90'],c='y',fmt='_')

#plot model function values
plt.plot(thetas, intensities_exp['S']['phi_0'],'-',c='r',label='S-polarization $\phi=0$')
plt.plot(thetas, intensities_exp['S']['phi_90'],'-',c='g',label='S-polarization $\phi=90$°')
plt.plot(thetas, intensities_exp['P']['phi_0'],'-',c='b',label='P-polarization $\phi=0$')
plt.plot(thetas, intensities_exp['P']['phi_90'],'-',c='y',label='P-polarization $\phi=90$°')

plt.xlabel(r'Inclination $\theta$ (°)')
plt.ylabel('Scattered intensity (normalized)')
plt.legend()
plt.show()
2021-08-25T15:43:01.206542 image/svg+xml Matplotlib v3.3.4, https://matplotlib.org/

Apparently, the model predictions for the least-square parameter values fit well to the measurements within the measurement uncertainties.

Confidence bounds

Apart from the value of the reconstructed parameters, it is often important to know their confidence intervals. In the first blog post of this series, we derived that the uncertainty of the parameter values is approximately given as $$\sigma_{p_i} = \sqrt{{\rm MSE} \cdot \mathrm{Cov}_{ii}},$$ where ${\rm MSE} =\frac{\chi^2_{\rm min}}{K-M}$ is the mean-squared error (i.e. squared error per degree of freedom) and $\mathrm{Cov} = \left(\mathrm{J}^{\top}\mathrm{W}\mathrm{J}\right)^{-1}$ is the parameter covariance matrix with the Jacobian matrix $\mathrm{J}$ and the weight matrix $\mathrm{W} = {\rm diag}\left(\varepsilon_1^{-2},\cdots,\varepsilon_K^{-2}\right)$. The ${\rm MSE}$ value and the covariance matrix $\mathrm{Cov}$ can obtained by calling study.driver_info().

In [11]:
import numpy as np
from IPython.display import HTML, display

min_params = study.info()['min_params']

driver_info = study.driver_info()
Cov = driver_info['parameter_covariance']
MSE = driver_info['MSE']
stds = np.sqrt(MSE*np.diag(Cov)) #standard deviation of parameters 

param_name_cells = '' # table cells with parameter names
param_val_cells = '' # table cells with least-square fit of paramters
param_std_cells = '' # table cells with standard deviations of paramters

for idx, param in enumerate(param_names):
    param_name_cells += f'<td>{param_names[param]}</td>'
    param_val_cells += f'<td>{min_params[param]:.3f}</dt>'
    param_std_cells += f'<td>{stds[idx]:.3f}</dt>'
    
display(HTML(f'''
  <table>
     <thead>
       <tr>
         <th>Paramter</th>
         {param_name_cells}
       </tr>
     </thead>
     <tbody>
       <tr>
         <th>Least square fit</th>
         {param_val_cells}
       </tr>
       <tr>
         <th>Uncertainty</th>
         {param_std_cells}
       </tr>
    </tbody>
  </table>'''))
Paramter ${cd}$${h}$${swa}$${t}$${r}_{\rm top}$${r}_{\rm bot}$
Least square fit 25.57548.37887.8095.03910.8627.956
Uncertainty 0.3911.1293.1740.2662.2677.015

We can compare the uncertainties of the reconstructed parameters to the on-axis standard deviation estimated from a global sensitivity analysis of the previous blog post:

Paramter $p$${cd}$${h}$${swa}$${t}$${r}_{\rm top}$${r}_{\rm bot}$
On-axis standard deviation $\sigma_p$0.0360.1020.1210.0390.2570.371

Indeed the parameters with the largest estimated on-axis standard deviation, $r_{\rm top}$ and $r_{\rm bot}$, have also the largest final reconstruction uncertainty whereas the values with the smallest on-axis standard deviation, $cd$ and $t$, can be reconstructed with the smallest uncertainty. This demonstrates again that the global sensitivity analysis can provide important insights into the ability of an experimental setup to reconstruct certain parameters.

Posterior distribution of parameter values

Until now, we have looked at the maximum-likelihood point estimate of the parameter values \begin{equation} \mathbf{p}_{\rm ML} = \underset{\mathbf{p}}{\rm arg\,min} \chi^2(\mathbf{p})\;\text{with}\; \chi^2(\mathbf{p}) = \left(\frac{y^*_i - f_i(\mathbf{p})}{\varepsilon_i}\right)^2 \end{equation} and at confidence intervals depending on the covariance matrix $\mathrm{Cov} = \left(\mathrm{J}^{\top}\mathrm{W}\mathrm{J}\right)^{-1}$ which are based on a linear approximation of the forward model \begin{equation} \mathbf{f}(\mathbf{p}) \approx \mathbf{f}(\mathbf{p}_{\rm ML}) + \mathrm{J} (\mathbf{p} - \mathbf{p}_{\rm ML}) \;\text{with}\; (\mathrm{J})_{ij}=\left.\frac{\partial f_i(\mathbf{p})}{\partial p_j}\right|_{\mathbf{p} = \mathbf{p}_{\rm ML}}. \end{equation}

Now, we go some steps further:

  1. We want to take non-linear dependencies of the model on the parameter values into account. To this end, we use the trained Gaussian processes of all measurement channels.
  2. We want to take prior information on the parameters into account. Until now, we assumed that all parameter values have the same prior probability within the constrained search domain. Now, suppose we know that the parameters of the linear grating follow a normal distribution with $cd = (25 \pm 5){\rm nm}, h=(50\pm5){\rm nm}, swa = 87^\circ \pm 1^\circ, t=(4.5\pm 1){\rm nm}, r_{\rm top} = (10\pm 3){\rm nm}, r_{\rm bot} = (8\pm 3){\rm nm}$.
  3. Typically the measurement errors are not known exactly. Therefore, we want to replace the fixed measurement variances $\varepsilon_1^2,\cdots,\varepsilon_K^2$ by a more flexible error model for the measurement variance $$\varepsilon_1^2 \rightarrow \eta_i^2(\mathbf{p},a,b) = \big(10^{-a} f_i(\mathbf{p})\big)^2 + \big(10^{-b}\big)^2.$$ The error model parameters shall be uniformly distributed with $1 \leq a \leq 3$ and $2 \leq b \leq 4$.

According to the first blog post, given the prior probability of the model and error parameter $p(\mathbf{p},a,b)$, the posterior probability distribution of the parameter values given the measurement results is $$ p(\mathbf{p},a,b | \mathbf{y}^*) \propto p(\mathbf{p},a,b) \prod_{i=1}^K \frac{1}{\sqrt{2\pi}\eta_i(\mathbf{p},a,b)}\exp\left[-\frac{1}{2}\left(\frac{y^*_i - f_i(\mathbf{p})}{\eta_i(\mathbf{p},a,b)}\right)^2\right]. $$

It is possible to draw samples from this probability distribution using Markov chain Monte Carlo (MCMC) sampling. Once provided with a large number of samples from the posterior, one can determine characteristic figures of the distribution. For example, one often states the lower 16% quantile, the 50% (i.e. the median), and the upper 84% quantile of the parameter values.

During the minimization of $\chi^2$ we have already trained the Gaussian processes. To further improve the Gaussian process predictions, we draw some more samples close to the maximum-likelihood point estimated. Using the keyword explore_max_likelihood of the BayesLeastSquare driver one can draw samples at positions with large likelihood where the Gaussian process predictions still have a significant uncertainty. Let's draw 50 more samples (i.e. 100 samples in total).

In [12]:
study.set_parameters(explore_max_likelihood=True, max_iter=100)
study.run()
2021-08-25 15:42:26: Maximum number of observations reached.

Now, we define the Gaussian prior distribution of the parameters, the error model and the distribution of the error model parameters using the driver parameters distribution, error_model, and error_model_parameter_distribution respectively.

In [ ]:
study.set_parameters(
    distribution = [
        {"name":"cd",    "distribution":"normal", "mean":25,  "variance":5**2},
        {"name":"h",     "distribution":"normal", "mean":50,  "variance":5**2},
        {"name":"swa",   "distribution":"normal", "mean":87,  "variance":1**2},
        {"name":"t",     "distribution":"normal", "mean":4.5, "variance":1**2},
        {"name":"r_bot", "distribution":"normal", "mean":10,  "variance":3**2},
        {"name":"r_top", "distribution":"normal", "mean":8,   "variance":3**2},
    ],
    error_model = "sqrt((10^(-a) * y_model)^2 + (10^(-b))^2)",
    error_model_parameter_distribution = [
        {"name":"a", "distribution":"uniform", "domain":[1,3]},
        {"name":"b", "distribution":"uniform", "domain":[2,4]},
    ]
)

We run the MCMC sampling using 32 walkers each running for 10,000 iterations. Drawing 320,000 samples from the Gaussian processes can take more than 30 minutes. However, this time is almost negligible compared to the time if would take to draw the same number of samples using the original forward model. Internally, the Analysis and Optimization Toolkit uses the python package emcee to run the MCMC sampling, which also provides good tutorials with a brief introduction to the theoretical background of MCMC sampling.

In [14]:
mcmc_result = study.run_mcmc(max_iter=1e4,num_walkers=32)
Mcmc sampling: Progress 0.0%.      2021-08-25 15:43:04: Determining initial model and error parameters with maximum a posteriori probability.
Mcmc sampling: Progress 100.0%. Total time required 0:35:04.          2021-08-25 16:18:08: MCMC sampling completed after 0:35:03. Maximum number of iterations 10000 reached. 
2021-08-25 16:18:08: First 2*tau=571 iterations discarded (burnin). Acceptance fraction 36%.
Mcmc sampling: Progress 100.0%. Total time required 0:35:06.      

Many of the the drawn 320,000 samples have to be ignored. During the initial sampling phase (the "burnin" phase) the parameters are not yet distributed according to the probability distribution. Moreover, consecutive samples of each walker are strongly correlated and thus not independent of one another. With the returned subset of the drawn samples, we can generate a corner plot showing cuts through the high-dimensional posterior probability distribution. The blue markers indicate the maximum-likelihood point estimates. The plot also includes the 16%, 50%, and 84% quantiles for each parameter (dashed lines).

In [15]:
import matplotlib.pyplot as plt
import corner 
min_params = study.info()['min_params']

plt.rcParams.update({'font.size': 7})
fig = plt.figure(figsize=(10, 10))

corner.corner(np.array(mcmc_result['samples']), fig=fig, quantiles=(0.16, 0.5, 0.84),
              levels=(1-np.exp(-0.5),), show_titles=True, scale_hist=True,
              labels=list(param_names.values())+['$a$','$b$'],
              truths=[min_params[param] for param in param_names] + [0,0])

plt.show()
2021-08-25T16:18:11.959217 image/svg+xml Matplotlib v3.3.4, https://matplotlib.org/