Wednesday, 25 August 2021 12:00

Parameter Reconstruction 1: Theoretical Introduction

This is the first blog post of a series dedicated to numerical methods for parameter reconstruction. Alongside its finite-element solver for the simulation of nano-photonic systems (including scatterometry experiments), JCMsuite contains dedicated tools for efficient parameter reconstruction and related numerical analyses. This first post gives a brief introduction to the main concepts of parameter reconstruction, while the following posts are dedicated to the use of these tools.

Parameter reconstruction is an important application field for numerical simulation, as it makes it possible to determine the values of physical parameters whose direct measurement is impossible or too costly. To this end, one measures certain accessible quantities $\mathbf{y}^*\in\mathbb{R}^K$ of a physical system. The system and the measurement process are then simulated numerically. The result of this numerical model is a vectorial function \begin{equation} \mathbf{f}: \mathbf{p} \in \mathcal{X}\subset\mathbb{R}^M \mapsto \mathbf{y}\in\mathbb{R}^K \end{equation} that maps an $M$-dimensional vector $\mathbf{p}$ of the "hidden" model parameters that are to be reconstructed to a $K$-dimensional output vector $\mathbf{y}$ of corresponding measurement values.

For the sake of concreteness, we consider a scatterometry setup. Monochromatic UV light with a wavelength of $\lambda = 266\,{\rm nm}$ hits a periodic lamellar grating at specific angles $\theta$ and $\phi$. The electric field is either polarized parallel to the plane of incidence (P polarization) or perpendicular to it (S polarization, from senkrecht, German for perpendicular). The grating has a sub-wavelength periodicity length of $50\,{\rm nm}$, such that only a single diffraction order is reflected from the sample, at the same angle as the incident light. The sample consists of silicon with a thin layer of silicon dioxide, such that only part of the light is reflected while the rest is transmitted into the silicon substrate and absorbed. The grating is parameterized by a vector $\mathbf{p} = [cd,h,swa,t,r_{\rm top},r_{\rm bot}]^T$ with six entries (see image below).

We use an experimental scatterometry data set obtained at the PTB, the National Metrology Institute of Germany. Details of the measurement configuration were reported by M. Wurm et al., Meas. Sci. Technol. 22, 094024 (2011). The data set contains intensities for 42 inclination angles $\theta$ between 5° and 87°, for $\phi=0°$ and $\phi=90°$. Both sets were obtained for both S and P polarization.

The following plot shows the measured scattered intensities including their estimated measurement error (colored error bars) as well as the prediction of the intensities by a numerical simulation of the scattering process using JCMsuite for $cd= 25.6\,{\rm nm}$, $h = 48.3\,{\rm nm}$, $swa = 87.9^\circ$, $t = 5.05\,{\rm nm}$, $r_{\rm top} = 10.9\,{\rm nm}$, and $r_{\rm bot} = 8.0\,{\rm nm}$ (lines).

In [1]:
from forward_problem import ForwardProblem

#Initialize forward problem using 4 simulations in parallel. Decrease this value to the number of available cores
m = ForwardProblem(Multiplicity=4) 

#Measurement intensities and uncertainties
intensities_meas, uncertainties, thetas = m.exp_data()

#Evaluation of the forward model for a specific parameter set
intensities_exp = m.model_eval(keys=dict(cd=25.6, h=48.3, swa=87.9, t=5.05, r_top=10.9, r_bot=8.0))
In [2]:
from IPython.display import set_matplotlib_formats
import matplotlib.pyplot as plt
%matplotlib inline
set_matplotlib_formats('svg') # render plots as svg

plt.figure(figsize=(8, 4))

#plot measurements
plt.errorbar(thetas, intensities_meas['S']['phi_0'],yerr=uncertainties['S']['phi_0'],c='r',fmt='_')
plt.errorbar(thetas, intensities_meas['S']['phi_90'],yerr=uncertainties['S']['phi_90'],c='g',fmt='_')
plt.errorbar(thetas, intensities_meas['P']['phi_0'],yerr=uncertainties['P']['phi_0'],c='b',fmt='_')
plt.errorbar(thetas, intensities_meas['P']['phi_90'],yerr=uncertainties['P']['phi_90'],c='y',fmt='_')

#plot model function values
plt.plot(thetas, intensities_exp['S']['phi_0'],'-',c='r',label=r'S-polarization $\phi=0°$')
plt.plot(thetas, intensities_exp['S']['phi_90'],'-',c='g',label=r'S-polarization $\phi=90°$')
plt.plot(thetas, intensities_exp['P']['phi_0'],'-',c='b',label=r'P-polarization $\phi=0°$')
plt.plot(thetas, intensities_exp['P']['phi_90'],'-',c='y',label=r'P-polarization $\phi=90°$')

plt.xlabel(r'Inclination $\theta$ (°)')
plt.ylabel('Scattered intensity (normalized)')
plt.legend()
plt.show()

Parameter probability distribution

For the following discussion of the probability distribution of the parameter values we make two important assumptions:

Assumptions

  1. The model is sufficiently accurate to describe the physical process. That is, there exists a parameter set $\mathbf{p}$ that is a sufficiently accurate representation of the real sample and the numerical error of the simulation is negligible.
  2. The measurements on the system for different angles are independent and subject to a random Gaussian measurement noise.

From the assumptions, it follows that the measurements are given as \begin{equation} y^*_i = f_i(\mathbf{p}) + \delta_i, \end{equation} where $\delta_i \sim \mathcal{N}(0,\eta_i^2)$ is a Gaussian random variable and $\eta_i^2$ the measurement noise variance for channel $i$.

Sometimes the measurement noise is known. In general, however, one has to find a parameterized model $\eta_i(\mathbf{d})$ for the noise level itself. A common choice is to assume that the error is composed of a constant background term $b$ and a noise contribution that scales linearly with $f_i(\mathbf{p})$: \begin{equation} \eta_i^2(a,b,\mathbf{p}) = b^2 + \left(a\cdot f_i(\mathbf{p})\right)^2. \end{equation} The second term can, e.g., stem from intensity-proportional power fluctuations of the laser light.
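As a minimal sketch (not a JCMsuite function; the helper name `noise_std` and the example values are our own), this noise model can be written as:

```python
import numpy as np

def noise_std(f_p, a, b):
    """Noise model eta_i(a, b) = sqrt(b**2 + (a * f_i)**2): a constant
    background b plus a contribution proportional to the modeled intensity."""
    return np.sqrt(b**2 + (a * np.asarray(f_p, dtype=float))**2)

# For a vanishing model intensity only the background remains; for larger
# intensities the proportional term dominates.
print(noise_std([0.0, 1.0], a=0.03, b=0.01))
```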

Since every entry of the measurement vector follows a normal distribution, $y^*_i \sim \mathcal{N}(f_i(\mathbf{p}),\eta_i^2(\mathbf{d}))$, the joint likelihood of measuring the vector $\mathbf{y}^*$ for a system with parameter vector $\mathbf{p}$ and measurement noise parameters $\mathbf{d}$ is given by the product \begin{equation} p(\mathbf{y}^* | \mathbf{p}, \mathbf{d}) = \prod_{i=1}^K \frac{1}{\sqrt{2\pi}\eta_i(\mathbf{d})}\exp\left[-\frac{1}{2}\left(\frac{y^*_i - f_i(\mathbf{p})}{\eta_i(\mathbf{d})}\right)^2\right]. \end{equation}

What we are interested in is the posterior probability density $p(\mathbf{p}, \mathbf{d}|\mathbf{y}^*)$ of the system parameters $\mathbf{p}$ and noise parameters $\mathbf{d}$ given the measurements $\mathbf{y}^*$. According to Bayes' theorem, it is given as \begin{equation} p(\mathbf{p},\mathbf{d} | \mathbf{y}^*) = \frac{ p(\mathbf{y}^* | \mathbf{p}, \mathbf{d}) p(\mathbf{p},\mathbf{d})}{p(\mathbf{y}^*)}. \end{equation}

The prior probability density $p(\mathbf{p},\mathbf{d})$ is often chosen as a uniform distribution over certain intervals or as a multivariate Gaussian distribution. The marginal probability density \begin{equation} p(\mathbf{y}^*) = \int p(\mathbf{y}^* | \mathbf{p}, \mathbf{d})\, p(\mathbf{p},\mathbf{d})\, {\rm d} \mathbf{p}\, {\rm d} \mathbf{d} \end{equation} can be determined numerically, e.g. by Monte Carlo integration. However, since this involves evaluating the expensive model function $\mathbf{f}(\mathbf{p})$ many times, its determination is computationally very demanding. Luckily, the value of the marginal probability is often irrelevant, as it only plays the role of a normalization constant for the posterior probability distribution $p(\mathbf{p},\mathbf{d}| \mathbf{y}^*)$.

For numerical reasons, one usually evaluates the logarithm of the posterior probability density (log-posterior) \begin{equation} \label{eq:log-post-dens}\tag{7} \begin{split} \underbrace{\log\left( p(\mathbf{p},\mathbf{d} | \mathbf{y}^*)\right)}_{\rm log-posterior} &= \log\left(p(\mathbf{y}^* | \mathbf{p},\mathbf{d})\right) + \log\left( p(\mathbf{p},\mathbf{d})\right) + const.\\ &= \underbrace{- \sum_{i=1}^K \left(\log(\eta_i(\mathbf{d})) + \frac{1}{2}\left(\frac{y^*_i - f_i(\mathbf{p})}{\eta_i(\mathbf{d})}\right)^2\right)}_{\rm log-likelihood} + \underbrace{\log\left( p(\mathbf{p},\mathbf{d})\right)}_{\rm log-prior} + const. \end{split} \end{equation}
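The log-likelihood term of this expression is straightforward to evaluate numerically. The following sketch (our own helper, not a JCMsuite routine) drops the additive constant $-\tfrac{K}{2}\log(2\pi)$:

```python
import numpy as np

def log_likelihood(y_star, f_p, eta):
    """Gaussian log-likelihood term of the log-posterior, up to the
    additive constant -K/2 * log(2*pi)."""
    y_star, f_p, eta = (np.asarray(v, dtype=float) for v in (y_star, f_p, eta))
    r = (y_star - f_p) / eta          # noise-weighted residuals
    return -np.sum(np.log(eta) + 0.5 * r**2)

# Hypothetical example with two measurement channels
print(log_likelihood([1.02, 0.48], [1.00, 0.50], [0.05, 0.05]))
```

A model prediction closer to the measurement always yields a larger log-likelihood, which is what the maximization below exploits.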

Point estimates

The posterior probability distribution contains all information on the parameter reconstruction. In principle, one can draw samples from this distribution using Markov chain Monte Carlo (MCMC) methods. However, in most cases the computation time of the model function $\mathbf{f}(\mathbf{p})$ makes MCMC sampling infeasible, since for practical applications many tens of thousands of samples from the posterior are required.

Therefore, it is often more practical to search for the parameter values that maximize the log-posterior probability density \begin{equation} (\mathbf{p}_{\rm MAP},\mathbf{d}_{\rm MAP}) = \underset{\mathbf{p},\mathbf{d}}{\rm arg\,max}\, \log\left(p(\mathbf{p},\mathbf{d} | \mathbf{y}^*)\right). \end{equation} The maximization can be performed efficiently with global Bayesian optimization. In this blog post, we show a comparison of Bayesian optimization with local optimization methods in the context of parameter reconstruction.
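To illustrate the idea, the following sketch finds the MAP estimate for a hypothetical linear toy model with a flat prior and known, constant noise, using a local SciPy optimizer rather than Bayesian optimization; the model `f`, the data, and all parameter values are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical toy forward model: a straight line sampled at K = 20 points
x = np.linspace(0.0, 1.0, 20)

def f(p):
    return p[0] + p[1] * x

eta = 0.05                      # known, constant measurement noise
p_true = np.array([0.3, 1.2])
y_star = f(p_true) + rng.normal(0.0, eta, x.size)

def neg_log_posterior(p):
    # Flat prior and constant eta: maximizing the posterior reduces
    # to minimizing chi^2 / 2.
    return 0.5 * np.sum(((y_star - f(p)) / eta) ** 2)

p_map = minimize(neg_log_posterior, x0=np.zeros(2)).x
print(p_map)  # close to p_true
```

For an expensive model function the many evaluations needed by such a local optimizer are exactly what Bayesian optimization is designed to avoid.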

If there is no prior information $p(\mathbf{p},\mathbf{d})$ on the parameters (i.e. the prior is "flat") and the measurement uncertainties are known up to a constant factor $a$, i.e. $\eta_i(a) = a \varepsilon_i$, the log-posterior equals the log-likelihood \begin{equation} \begin{split} \log\left( p(\mathbf{p},a | \mathbf{y}^*)\right) &= - \sum_{i=1}^K \left(\log(a\, \varepsilon_i) + \frac{1}{2}\left(\frac{y^*_i - f_i(\mathbf{p})}{a\, \varepsilon_i}\right)^2\right) + const. \\ &= - K \log(a) - \frac{1}{2 a^2} \sum_{i=1}^K \left(\frac{y^*_i - f_i(\mathbf{p})}{\varepsilon_i}\right)^2 + const. \end{split} \end{equation}

In this case, the most probable parameters of the sample are \begin{equation} \mathbf{p}_{\rm ML} = \underset{\mathbf{p}}{\rm arg\,min}\, \chi^2(\mathbf{p})\;\text{with}\; \chi^2(\mathbf{p}) = \sum_{i=1}^K \left(\frac{y^*_i - f_i(\mathbf{p})}{\varepsilon_i}\right)^2. \end{equation} The most probable scaling value $a = a_{\rm ML}$ can be determined by maximizing $\log\left( p(\mathbf{p},a | \mathbf{y}^*)\right)$ for $\chi^2(\mathbf{p}) = \chi^2(\mathbf{p}_{\rm ML}) = \chi^2_{\rm min}$. Setting $$ \frac{\partial}{\partial a} \log\left( p(\mathbf{p},a | \mathbf{y}^*)\right) = - \frac{K}{a} + \frac{\chi^2_{\rm min}}{a^3} \overset{!}{=} 0 $$ yields \begin{equation} a_{\rm ML}^2 = \frac{\chi^2_{\rm min}}{K} \;\text{with}\; \chi^2_{\rm min} = \chi^2(\mathbf{p}_{\rm ML}). \end{equation}

This last consideration has one flaw. Since we use $M$ parameters to minimize $\chi^2(\mathbf{p})$, the value of $\chi^2_{\rm min}$ will typically be smaller than for the true model parameters. Hence, $a_{\rm ML}$ is biased towards too small values. Suppose, e.g., that we have as many degrees of freedom as measurements ($M=K$) to reduce the value of $\chi^2$. In this case it will in general be possible to obtain $\chi^2_{\rm min} = 0$, and hence all measurement errors $\eta_i = a_{\rm ML}\varepsilon_i$ are estimated to be zero. Obviously, as long as $M$ is not much smaller than $K$, we have to account for the bias introduced by the number $M$ of free parameters. To this end, let us regard the random variable $$ z = \sum_{i=1}^K \left(\frac{y^*_i - f_i(\mathbf{p}_{\rm ML})}{a \varepsilon_i}\right)^2$$ that depends on the normally distributed measurements $y^*_i$ with mean $\mu \approx f_i(\mathbf{p}_{\rm ML})$ and standard deviation $\eta_i = a \varepsilon_i$ ($a$ is the true scaling value that is to be reconstructed). Hence, $z$ is approximately a sum of squared normally distributed random variables with zero mean and unit variance. However, fitting $p_1,\cdots,p_M$ to minimize the distance between the model $\mathbf{f}(\mathbf{p})$ and the vector $\mathbf{y}^*$ imposes $M$ constraints on the random variables, such that only $K-M$ of them are independent. Hence, $z$ approximately follows a chi-squared distribution with effectively ${\rm DOF} = K-M$ degrees of freedom, which has an expectation value of $E[z] = K-M$.
Consequently, the point estimate \begin{equation} a_{\rm ub}^2 = \frac{\chi^2_{\rm min}}{K-M} \end{equation} is an unbiased estimate of $a^2$ in the sense that its expectation value is approximately equal to the true value of $a^2$: $$E[a_{\rm ub}^2] = \frac{E[\chi^2_{\rm min}]}{K-M} = a^2 \frac{E[z]}{K-M} \approx a^2.$$ We note that $\chi^2_{\rm min}/(K-M)$ is known as the reduced chi-squared statistic, also denoted $\mathrm{MSE}$ for "mean squared error". Its square root is often called $\mathrm{RSE}$ for "regression standard error" or "residual standard error".
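The unbiasedness of $a_{\rm ub}^2$ can be checked numerically. The following sketch fits a hypothetical linear model to many synthetic data sets and averages $\chi^2_{\rm min}/(K-M)$; all model details (the matrix `A`, the noise scaling `a_true`, the uncertainties `eps`) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 50, 3                 # number of measurements / free parameters
a_true = 2.0                 # true noise scaling to be reconstructed
eps = np.full(K, 0.1)        # assumed (unscaled) uncertainties
A = rng.normal(size=(K, M))  # hypothetical linear model f(p) = A @ p
p_true = rng.normal(size=M)

a2_estimates = []
for _ in range(500):
    y_star = A @ p_true + rng.normal(0.0, a_true * eps)
    # Weighted least squares minimizes chi^2(p); lstsq returns chi^2_min
    # as the residual sum of squares of the whitened system.
    _, chi2_min, _, _ = np.linalg.lstsq(A / eps[:, None], y_star / eps,
                                        rcond=None)
    a2_estimates.append(chi2_min[0] / (K - M))

print(np.mean(a2_estimates))  # close to a_true**2 = 4
```

Dividing by $K$ instead of $K-M$ would systematically underestimate $a^2$, as argued above.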

Confidence intervals of point estimate $\mathbf{p}_{\rm ML}$

It is in principle possible to draw samples according to the posterior probability density using Markov chain Monte Carlo (MCMC) methods. Provided with many samples from the posterior, one can then determine the median parameter values as well as their lower $16\%$ and upper $84\%$ percentiles. These values give a good quantification of the best reconstructed parameters and their uncertainty interval.

However, in most cases the computation time of the model function $\mathbf{f}(\mathbf{p})$ makes MCMC sampling infeasible, since usually many tens of thousands of samples from the posterior are required.

An alternative is to approximate the posterior distribution by a multivariate Gaussian distribution. To this end, one assumes that in a sufficiently large region around $\mathbf{p} = \mathbf{p}_{\rm ML}$ one can approximate the model by its Taylor expansion up to linear order \begin{equation} \mathbf{f}(\mathbf{p}) \approx \mathbf{f}(\mathbf{p}_{\rm ML}) + \mathrm{J} (\mathbf{p} - \mathbf{p}_{\rm ML}), \end{equation} where $\mathrm{J}$ is the Jacobian matrix with entries \begin{equation} J_{ij}=\left.\frac{\partial f_i(\mathbf{p})}{\partial p_j}\right|_{\mathbf{p} = \mathbf{p}_{\rm ML}}. \end{equation} We define the weight matrix of inverse (unscaled) measurement variances \begin{equation} \mathrm{W} = {\rm diag}\left(\frac{1}{\varepsilon_1^2},\cdots,\frac{1}{\varepsilon_K^2}\right) \end{equation} and the vector of residuals $\mathbf{r} = \mathbf{y}^*-\mathbf{f}(\mathbf{p}_{\rm ML})$. With the linear approximation, the probability density of $\mathbf{p}$ evaluates to \begin{equation} \begin{split} p(\mathbf{p}) &\propto {\rm exp}\left[-\frac{1}{2 a_{\rm ub}^2}\left\{\mathbf{r}-\mathrm{J} (\mathbf{p} - \mathbf{p}_{\rm ML})\right\}^{\top}\mathrm{W}\left\{\mathbf{r}-\mathrm{J} (\mathbf{p} - \mathbf{p}_{\rm ML})\right\}\right] \\ &= {\rm exp}\left[-\frac{1}{2 a_{\rm ub}^2}\left\{\mathbf{r}^{\top}\mathrm{W}\mathbf{r}-2\mathbf{r}^{\top}\mathrm{W}\mathrm{J}(\mathbf{p} - \mathbf{p}_{\rm ML}) + (\mathbf{p} - \mathbf{p}_{\rm ML})^{\top}\mathrm{J}^{\top}\mathrm{W}\mathrm{J} (\mathbf{p} - \mathbf{p}_{\rm ML})\right\}\right] \end{split} \end{equation} Since $\mathbf{p}_{\rm ML}$ is an extremum of the probability density, the terms linear in $\mathbf{p} - \mathbf{p}_{\rm ML}$ vanish, i.e. $\mathbf{r}^{\top}\mathrm{W}\mathrm{J}(\mathbf{p} - \mathbf{p}_{\rm ML}) = 0$.
By comparing the exponent of the remaining $\mathbf{p}$-dependent probability distribution \begin{equation} p(\mathbf{p}) \propto {\rm exp}\left[-\frac{1}{2 a_{\rm ub}^2} (\mathbf{p} - \mathbf{p}_{\rm ML})^{\top}\mathrm{J}^{\top}\mathrm{W}\mathrm{J} (\mathbf{p} - \mathbf{p}_{\rm ML})\right] \end{equation} with that of a multivariate Gaussian distribution, the covariance matrix $\mathrm{\Sigma}$ of the probability distribution is given as \begin{equation} \mathrm{\Sigma} = \left(\frac{1}{a_{\rm ub}^2}\mathrm{J}^{\top}\mathrm{W}\mathrm{J}\right)^{-1} = \frac{\chi^2_{\rm min}}{K-M} \left(\mathrm{J}^{\top}\mathrm{W}\mathrm{J}\right)^{-1} \end{equation}

The standard deviation of each parameter $p_i$ is given by the square root of the corresponding diagonal element of this covariance matrix, i.e. \begin{equation} \sigma_{p_i} = \sqrt{\mathrm{\Sigma}_{ii}} = \sqrt{\mathrm{MSE} \cdot \mathrm{Cov}_{ii}}, \end{equation} where $\mathrm{Cov} = \left(\mathrm{J}^{\top}\mathrm{W}\mathrm{J}\right)^{-1}$ is the parameter covariance matrix.
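The following sketch computes these standard deviations from a forward-difference approximation of the Jacobian; it is our own illustration (function name `parameter_uncertainties` and the linear test model are invented), not a JCMsuite routine:

```python
import numpy as np

def parameter_uncertainties(f, p_ml, y_star, eps, h=1e-6):
    """sigma_{p_i} = sqrt(MSE * Cov_ii) with Cov = (J^T W J)^{-1};
    the Jacobian J is approximated by forward differences with step h."""
    p_ml = np.asarray(p_ml, dtype=float)
    eps = np.asarray(eps, dtype=float)
    K, M = y_star.size, p_ml.size
    f0 = f(p_ml)
    J = np.empty((K, M))
    for j in range(M):
        dp = np.zeros(M)
        dp[j] = h
        J[:, j] = (f(p_ml + dp) - f0) / h
    W = np.diag(1.0 / eps**2)
    cov = np.linalg.inv(J.T @ W @ J)          # parameter covariance matrix
    chi2_min = np.sum(((y_star - f0) / eps) ** 2)
    return np.sqrt(chi2_min / (K - M) * np.diag(cov))

# Hypothetical usage on a linear test model (Jacobian is exact here)
rng = np.random.default_rng(2)
A = rng.normal(size=(30, 2))
eps = np.full(30, 0.1)
y_star = A @ np.array([0.5, -1.0]) + rng.normal(0.0, 0.1, 30)
p_ml = np.linalg.lstsq(A / eps[:, None], y_star / eps, rcond=None)[0]
sigma_p = parameter_uncertainties(lambda p: A @ p, p_ml, y_star, eps)
print(sigma_p)
```

For an expensive forward model the $M$ extra evaluations for the finite-difference Jacobian are usually cheap compared to the optimization itself.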

Conclusion and Outlook

In this blog post we have introduced some of the main concepts of parameter reconstruction: we derived the probability distribution of the parameters, their maximum-likelihood point estimates, and the confidence intervals defined by the standard deviation of their approximate Gaussian probability distribution. In the following blog posts, we will apply these concepts in practice.
