Inverse problems with Poisson data: statistical regularization theory, applications and algorithms

Thorsten Hohage; Frank Werner

doi:10.1088/0266-5611/32/9/093001

1. Introduction

In this paper we review recent activities on the reconstruction of unknown quantities from indirect Poisson distributed data. Such problems arise in many important photonic imaging modalities and have received considerable attention by applied mathematicians and statisticians over the last years. Typically the unknown quantity of interest is either a source density of photons or electrons or an object which interacts with some known incident beam of photons or electrons. The latter case occurs in coherent diffraction imaging, and the former case in positron emission tomography (PET), single photon emission computed tomography (SPECT) and the various types of fluorescence microscopy such as confocal microscopy, 4PI-microscopy, STED-microscopy, and RESOLFT schemes. Some of these applications will be discussed in more detail in section 5.

We will formulate the inverse problem in the form of an operator equation

$\begin{eqnarray}&&F(u)={g}^{\dagger }.\end{eqnarray} \tag{ 1 }$

Here the unknown quantity of interest $u$ is described by an element of a real Banach space ${ \mathcal X }$ , the ideal data ${g}^{\dagger }$ is a non-negative, integrable function on some compact manifold ${\mathbb{M}}\subset {{\mathbb{R}}}^{d}$ , and the (possibly nonlinear) forward operator F describes the imaging setup. ${g}^{\dagger }$ can typically be interpreted as a photon density on a measurement manifold ${\mathbb{M}}$ . Alternatively, ${\mathbb{M}}$ may be a time interval or a cylinder in space–time.

Due to the wave-particle duality of light or electromagnetic waves we may always describe the forward operator in terms of Maxwell's equations or some approximation of these equations. For time-harmonic electromagnetic waves the photon or energy density is proportional to the squared absolute value of the electric field. At low energies the quantization of energy is the main source of noise. It is also referred to as shot noise, and the treatment of this type of noise in the context of inverse problems is the topic of this review.

For simplicity let us confine our discussion to measurements without time resolution, i.e. the measurement manifold ${\mathbb{M}}$ is a subset of ${{\mathbb{R}}}^{3}$ . If the photon density is constant or separates into a space-dependent factor and a time-dependent factor, we will argue that the (ideal) photon detections are described by a Poisson point process G, the density of which is the marginal spatial photon density ${g}^{\dagger }$ . Let $\{{x}_{1},\ldots ,{x}_{N}\}\subset {\mathbb{M}}$ denote the positions of the detected photons. The total number N of detected photons is also random. Let us define

$\begin{eqnarray}&&G=\displaystyle \sum _{i=1}^{N}{\delta }_{{x}_{i}}\end{eqnarray} \tag{ 2 }$

as a sum of Dirac-measures at the photon positions, and denote by

$\begin{eqnarray*}&&G(A)=\#\{1\leqslant i\leqslant N| \;{x}_{i}\in A\}\end{eqnarray*}$

the number of photons in any measurable subset $A\subset {\mathbb{M}}$ . Then it is physically evident that

$\begin{eqnarray}&&{\bf{E}}[G(A)]={\int }_{A}{g}^{\dagger }{\rm{d}}x,\end{eqnarray} \tag{ 3 }$

and that the random variables $G({A}_{1}),\ldots ,G({A}_{M})$ for any finite number of disjoint, measurable sets ${A}_{i}\subset {\mathbb{M}}$ are stochastically independent. Hence, by definition, G is a Poisson process, and as a consequence, G(A) is a Poisson distributed integer-valued random variable (see definition 2.1 and proposition 2.2).

We will introduce an additional parameter $t\gt 0$ , which can often be interpreted as an exposure time, and assume that a Poisson process with density ${{tg}}^{\dagger }$ is observed. If ${g}^{\dagger }$ is normalized as a probability density, then t is the expected total number of photons, i.e. $t={\bf{E}}[N]$ .

A central question concerns the convergence and rates of convergence of estimators in the limit $t\to \infty$ . In many imaging methods N can vary over a large range. Results on the speed of convergence of estimators may help to predict how large N has to be chosen to achieve a required accuracy or resolution. Since the number of photons corresponds to the radiation dose, we may rephrase the question above as investigation of the relation between radiation dose and resolution.

In physical experiments positions of photons can never be determined with infinite precision as suggested by the notion of a Poisson process. In practice, photons are rather collected by a finite number of detectors, often charge-coupled-device (CCD) cameras. Then the experimental data are described by vectors or arrays of non-negative integers $G({{\mathbb{M}}}_{j})$ , j = 1,..., J where the ${{\mathbb{M}}}_{j}\subset {\mathbb{M}}$ describe detector areas, and $G({{\mathbb{M}}}_{j})$ is the number of photons collected by the jth detector (neglecting read-out errors for the moment). In a nonparametric setting, i.e. for $\mathrm{dim}\;{ \mathcal X }=\infty$ , we obviously cannot hope for convergence for fixed finite J in general. The finiteness of the bin sizes $\;\mathrm{diam}\;({{\mathbb{M}}}_{j})$ gives rise to a discretization error which has to tend to 0 besides $t\to \infty$ . This will briefly be reviewed in section 2.5. For the sake of simplicity we will work with Poisson processes, i.e. in the limit $J\to \infty$ in most of this paper.

In addition, data are typically contaminated by read-out errors of the detectors and by background noise. We will assume in the following that these sources of error are negligible compared to the shot noise and ignore them.

The plan of this paper is as follows: in section 2 we will give a precise definition of a Poisson process as discussed above and present immediate implications of this data model, including the derivation of the negative log-likelihood functional, the binning, and a concentration inequality which is fundamental for the further analysis. The following section 3 is devoted to estimation of ${u}^{\dagger }$ based on a wavelet-vaguelette decomposition (WVD) of the forward operator. Section 4 deals with variational regularization, including consistency and convergence rates of estimators based on a Tikhonov-type regularization approach. We conclude this section with a brief review of parameter choice methods and extensions to nonlinear forward operators. Some important applications are covered in section 5 including PET, fluorescence microscopy and phase retrieval problems. In section 6 we review algorithms for the minimization problems considered in section 4 and related approaches for phase retrieval problems, before we end with some concluding remarks in section 7.

2. Poisson processes

2.1. Fundamental properties

In the following we collect some basic properties of Poisson processes. As a general reference we refer to [94]. Recall that a point process on ${\mathbb{M}}$ can be seen as a random collection of points $\{{x}_{1},...,{x}_{N}\}\subset {\mathbb{M}}$ satisfying certain measurability properties. A different view on point processes is given by the notion of a sum of Dirac measures (2).

Definition 2.1 (Poisson point process). Let ${g}^{\dagger }\in {{\bf{L}}}^{1}({\mathbb{M}})$ with ${g}^{\dagger }\geqslant 0$ . A point process $G={\sum }_{i=1}^{N}{\delta }_{{x}_{i}}$ is called a Poisson point process or Poisson process with intensity ${g}^{\dagger }$ if

For each choice of disjoint, measurable sets ${A}_{1},...,{A}_{n}\subset {\mathbb{M}}$ the random variables $G({A}_{j})$ are stochastically independent.
${\bf{E}}[G(A)]={\int }_{A}{g}^{\dagger }\;{\rm{d}}x$ for each measurable set $A\subset {\mathbb{M}}$ .

Proposition 2.2. Let $G$ be a Poisson process with intensity ${g}^{\dagger }\in {{\bf{L}}}^{1}({\mathbb{M}})$ . Then for each measurable $A\subset {\mathbb{M}}$ the random variable $G(A)$ is Poisson distributed with parameter $\lambda := {\int }_{A}{g}^{\dagger }\;{\rm{d}}x$ , i.e. ${\bf{P}}[G(A)=k]={{\rm{e}}}^{-\lambda }\tfrac{{\lambda }^{k}}{k!}$ for $k\in {\mathbb{N}}$ .

For the proof we refer to [93, theorem 1.11.8].

A further fundamental property concerns the conditioning of a Poisson process on the observed number of counts:

Proposition 2.3. A Poisson process $G$ with intensity ${g}^{\dagger }\in {L}^{1}({\mathbb{M}})$ conditioned on $G({\mathbb{M}})=N$ is a Bernoulli process with parameter $N$ and probability measure $\mu (A):= {\int }_{A}{g}^{\dagger }\;{\rm{d}}x/{\int }_{{\mathbb{M}}}{g}^{\dagger }\;{\rm{d}}x$ . In other words, conditioned on $G({\mathbb{M}})=N$ the points are distributed like independent random variables ${X}_{1},\ldots ,{X}_{N}$ distributed according to $\mu$ .

Proof. Let ${A}_{1},\ldots ,{A}_{M}\subset {\mathbb{M}}$ be disjoint, measurable subsets and define ${A}_{0}\subset {\mathbb{M}}$ such that ${A}_{0}\cup ...\cup {A}_{M}={\mathbb{M}}$ . For ${n}_{1},...,{n}_{M}\in {{\mathbb{N}}}_{0}$ we define ${n}_{0}:= N-{\sum }_{i=1}^{M}{n}_{i}$ . Using proposition 2.2 we get

$\begin{eqnarray*}&&{\bf{P}}[G({A}_{1})={n}_{1},\ldots ,G({A}_{M})={n}_{M}| G({\mathbb{M}})=N]\\ &&={\bf{P}}[G({A}_{0})={n}_{0},G({A}_{1})={n}_{1},\ldots ,G({A}_{M})={n}_{M}]/{\bf{P}}[G({\mathbb{M}})=N]\\ &&=\displaystyle \prod _{i=0}^{M}\mathrm{exp}\left(-{\displaystyle \int }_{{A}_{i}}{g}^{\dagger }\;{\rm{d}}x\right)\displaystyle \frac{1}{{n}_{i}!}{\left({\displaystyle \int }_{{A}_{i}}{g}^{\dagger }{\rm{d}}x\right)}^{{n}_{i}}/{\left(\mathrm{exp}\left(-{\displaystyle \int }_{{\mathbb{M}}}{g}^{\dagger }{\rm{d}}x\right)\displaystyle \frac{1}{N!}{\left({\displaystyle \int }_{{\mathbb{M}}}{g}^{\dagger }{\rm{d}}x\right)}^{N}\right)}^{-1}\\ &&=N!\displaystyle \prod _{i=0}^{M}\displaystyle \frac{{\left({\displaystyle \int }_{{A}_{i}}{g}^{\dagger }{\rm{d}}x\right)}^{{n}_{i}}}{{n}_{i}!{\left({\displaystyle \int }_{{\mathbb{M}}}{g}^{\dagger }{\rm{d}}x\right)}^{{n}_{i}}}=N!\displaystyle \prod _{i=0}^{M}\displaystyle \frac{\mu {({A}_{i})}^{{n}_{i}}}{{n}_{i}!},\end{eqnarray*}$

where we used $N={n}_{0}+{n}_{1}\quad +\ldots +\quad {n}_{M}$ . Consequently, the quantity ${\bf{P}}[G({A}_{1})={n}_{1},\ldots ,G({A}_{M})={n}_{M}| G({\mathbb{M}})=N]$ describes a multinomial distribution, and the proof is complete. ■

This necessary property of a Poisson process can be used vice versa to construct a Poisson process with a prescribed intensity ${g}^{\dagger }$ and hence prove existence (see [94, section 2.5]).

As a Poisson process $G={\sum }_{j=1}^{N}{\delta }_{{x}_{j}}$ can be seen as a random measure belonging to the space ${\mathfrak{M}}({\mathbb{M}})$ of all signed Borel measures on ${\mathbb{M}}$ , we can define integrals over functions $\varphi \;:{\mathbb{M}}\to {\mathbb{R}}$ w.r.t. G. They are given by

$\begin{eqnarray*}&&{\int }_{{\mathbb{M}}}\varphi {\rm{d}}G=\displaystyle \sum _{i=1}^{N}\varphi ({x}_{i}).\end{eqnarray*}$

Proposition 2.4. For a measurable function $\varphi \;:{\mathbb{M}}\to {\mathbb{R}}$ the formulas

$\begin{eqnarray}&&{\bf{E}}\left[\;{\int }_{{\mathbb{M}}}\varphi {\rm{d}}G\right]={\int }_{{\mathbb{M}}}\varphi {g}^{\dagger }{\rm{d}}x,\end{eqnarray} \tag{ 4a }$

$\begin{eqnarray}&&{\bf{Var}}\left[\;{\int }_{{\mathbb{M}}}\varphi {\rm{d}}G\right]={\int }_{{\mathbb{M}}}{\varphi }^{2}{g}^{\dagger }{\rm{d}}x,\end{eqnarray} \tag{ 4b }$

hold true whenever the integrals on the right-hand sides exist.

If φ takes only a finite number of values, these identities are a straightforward consequence of the definition and proposition 2.2. From this observation the general results can be derived by standard approximation arguments (see e.g. [94, section 3.2]).

2.2. Log-likelihood

The negative log-likelihood as data-fidelity term appears naturally in the Bayesian approach, and it often has good properties from a frequentist perspective. For discrete random variables G with a parameter dependent family of probability measures ${{\bf{P}}}_{g}$ , the negative log-likelihood functional is defined by $g\mapsto -\mathrm{ln}{{\bf{P}}}_{g}[G]$ . For continuous random variables probability densities are used. We will give two derivations of the fact that for a Poisson process and observations ${x}_{1},\ldots ,{x}_{N}$ the negative log-likelihood functional is given by

$\begin{eqnarray}&&g\mapsto {\int }_{{\mathbb{M}}}g\;{\rm{d}}x-\displaystyle \sum _{i=1}^{N}\mathrm{ln}g({x}_{i}),\qquad g\in {{\bf{L}}}^{1}({\mathbb{M}}).\end{eqnarray} \tag{ 5 }$

We first consider the situation of a binned Poisson process as typically encountered in practice. Here with the notation of the introduction the observations are described by a Poisson distributed random vector $Y={({Y}_{j})}_{j=1..J}={(G({A}_{j}))}_{j=1..J}$ with independent entries and means ${\lambda }_{j}={\bf{E}}[{Y}_{j}]={\int }_{{A}_{j}}{g}^{\dagger }\;{\rm{d}}x$ . Then ${{\bf{P}}}_{\lambda }[Y=({k}_{1},...,{k}_{J})]$ $={\prod }_{j=1}^{J}{{\rm{e}}}^{-{\lambda }_{j}}{\lambda }_{j}^{{k}_{j}}/{k}_{j}!$ for $k\in {{\mathbb{N}}}_{0}^{J}$ . One readily calculates the negative log-likelihood functional as

$\begin{eqnarray*}&&\lambda \mapsto \displaystyle \sum _{j=1}^{J}[{\lambda }_{j}-{k}_{j}\mathrm{ln}({\lambda }_{j})+\mathrm{ln}({k}_{j}!)],\qquad \lambda \in {(0,\infty )}^{J}.\end{eqnarray*}$

This functional can obviously be extended to all of ${{\mathbb{R}}}^{J}$ by the conventions

$\begin{eqnarray}&&-\mathrm{ln}(x)=\infty \quad {\rm{for}}\;{\rm{}}x\leqslant 0\qquad {\rm{and}}\quad 0\mathrm{ln}(0)=0,\end{eqnarray} \tag{ 6 }$

which are always used in the following. Furthermore, we can add ${\sum }_{j=1}^{J}[{k}_{j}\mathrm{ln}(| {A}_{j}| )-\mathrm{ln}({k}_{j}!)]$ without affecting the minimizers, so that we treat

$\begin{eqnarray}&&\lambda \mapsto \displaystyle \sum _{j=1}^{J}\left[{\lambda }_{j}-{k}_{j}\mathrm{ln}\left(\displaystyle \frac{{\lambda }_{j}}{| {A}_{j}| }\right)\right],\qquad \lambda \in {{\mathbb{R}}}^{J}\end{eqnarray} \tag{ 7 }$

as discrete negative log-likelihood functional. Letting the bin sizes in (7) tend to zero we obtain (5) as continuous negative log-likelihood functional.

A more direct derivation of (5) uses proposition 2.3. Denoting (conditional) probability densities by ${\bf{p}}$ we have

$\begin{eqnarray*}-\mathrm{ln}\left({{\bf{p}}}_{g}\left[G=\displaystyle \sum _{i=1}^{N}{\delta }_{{x}_{i}}\right]\right) & = & -\mathrm{ln}\left({{\bf{p}}}_{g}\left[G=\displaystyle \sum _{i=1}^{N}{\delta }_{{x}_{i}}| G({\mathbb{M}})=N\right]{{\bf{P}}}_{g}[G({\mathbb{M}})=N]\right)\\ & = & -\mathrm{ln}\left[N!\left(\displaystyle \prod _{i=1}^{N}\displaystyle \frac{g({x}_{i})}{{\displaystyle \int }_{{\mathbb{M}}}g\;{\rm{d}}x}\right)\mathrm{exp}\left(-{\displaystyle \int }_{{\mathbb{M}}}g\;{\rm{d}}x\right)\displaystyle \frac{1}{N!}{\left({\displaystyle \int }_{{\mathbb{M}}}g{\rm{d}}x\right)}^{N}\right]\\ & = & {\displaystyle \int }_{{\mathbb{M}}}g\;{\rm{d}}x-\displaystyle \sum _{i=1}^{N}\mathrm{ln}g({x}_{i}).\end{eqnarray*}$

The additional factor $N!$ in the second line equals the number of combinations ${a}_{1},...,{a}_{N}\in {\mathbb{M}}$ such that ${\sum }_{i=1}^{N}{\delta }_{{a}_{i}}={\sum }_{i=1}^{N}{\delta }_{{x}_{i}}$ with given ${x}_{1},...,{x}_{N}$ .

2.3. Notions of a noise level and data fidelity terms

As discussed in the introduction, we will now introduce a parameter $t\gt 0$ for asymptotic considerations, and the idea is that the data corresponding to $t\gt 0$ was generated by the intensity ${{tg}}^{\dagger }$ . Instead of thinking of different experiments for different values of t, we think of a continuous measurement procedure, where for each fixed $t\gt 0$ , the corresponding data is a rescaled spatial Poisson process:

Definition 2.5. Let $\tilde{G}={\sum }_{j}{\delta }_{{x}_{j},{t}_{j}}$ be a (spatio-temporal) Poisson process on ${\mathbb{M}}\times [0,\infty )$ with intensity ${\tilde{g}}^{\dagger }(x,t)={g}^{\dagger }(x)$ for some ${g}^{\dagger }\in {{\bf{L}}}^{1}({\mathbb{M}})$ . Then a temporally normalized Poisson process ${({G}_{t})}_{t\gt 0}$ with density ${g}^{\dagger }$ is defined by

$\begin{eqnarray*}&&{G}_{t}=\displaystyle \frac{1}{t}{\int }_{0}^{t}\tilde{G}(\cdot ,\tau ){\rm{d}}\tau =\displaystyle \frac{1}{t}\displaystyle \sum _{{t}_{j}\leqslant t}{\delta }_{{x}_{j},{t}_{j}}.\end{eqnarray*}$

Note from definition 2.1 that $\tilde{{G}_{t}}:= {{tG}}_{t}$ is a Poisson process on ${\mathbb{M}}$ with density ${{tg}}^{\dagger }$ for all $t\gt 0$ , which reveals the connection to [88, 155] where a sequence of spatial Poisson processes was used. The definition used here based on a spatio-temporal Poisson process on ${\mathbb{M}}\times [0,\infty )$ offers two advantages: from a statistical point of view it is simpler as now for any t the random data live on the same probability space. Furthermore it is closer to physical reality, as now all photons observed in a subset A at up to time t are also part of all measurements restricted to A at any later time.

The scaling factor $\tfrac{1}{t}$ ensures that the expected value of an integral against G_t is independent of $t\gt 0$ : By (4a) and (4b) we obtain

$\begin{eqnarray*}{\bf{E}}\left[\;{\displaystyle \int }_{{\mathbb{M}}}g{\rm{d}}{G}_{t}\right] & = & \displaystyle \frac{1}{t}{\bf{E}}\left[\;{\displaystyle \int }_{{\mathbb{M}}}g{\rm{d}}\tilde{{G}_{t}}\right]=\displaystyle \frac{1}{t}{\displaystyle \int }_{{\mathbb{M}}}{{gtg}}^{\dagger }{\rm{d}}x={\displaystyle \int }_{{\mathbb{M}}}{{gg}}^{\dagger }{\rm{d}}x,\\ {\bf{Var}}\left[\;{\displaystyle \int }_{{\mathbb{M}}}g{\rm{d}}{G}_{t}\right] & = & \displaystyle \frac{1}{{t}^{2}}{\bf{E}}\left[\;{\displaystyle \int }_{{\mathbb{M}}}{g}^{2}{\rm{d}}\tilde{{G}_{t}}\right]=\displaystyle \frac{1}{t}{\displaystyle \int }_{{\mathbb{M}}}{g}^{2}{g}^{\dagger }{\rm{d}}x.\end{eqnarray*}$

Thus any bounded linear functional of ${g}^{\dagger }$ can be estimated unbiasedly with a variance proportional to $\tfrac{1}{t}$ . This suggests that the noise level should be proportional to $1/\sqrt{t}$ .

A difficulty encountered by several deterministic approaches to the analysis of inverse problems with Poisson data is to come up with a good deterministic analog of Poisson data. In particular, there is no pointwise noise level or a pointwise signal-to-noise ratio for Poisson data which does not depend on a bin size: as ${\bf{E}}[X]={\bf{Var}}[X]=\lambda$ for a Poisson distributed random variable with parameter $\lambda \gt 0$ , the signal-to-noise ratio is ${\bf{E}}[X]/{\bf{Var}}{[X]}^{1/2}=\sqrt{\lambda }$ . Thus, by subdividing bins the pointwise (or binwise) signal-to-noise ratio is decreased although we do not loose any information. Actually we even gain more precise information on the location of the photons resulting in a smaller discretization error.

From (5) we know that the log-likelihood data fidelity term between $\tilde{{G}_{t}}$ and with the intensity $\tilde{g}:= {tg}$ is given (up to an additive constant) given by $\tilde{g}\mapsto {\int }_{{\mathbb{M}}}\tilde{g}{\rm{d}}x-{\int }_{{\mathbb{M}}}\mathrm{ln}(\tilde{g}){\rm{d}}\tilde{{G}_{t}}$ . Dividing this by t and ignoring additional additive constants independent of g we arrive at the following log-likelihood data fidelity term (again using (6)):

$\begin{eqnarray}&&{{ \mathcal S }}_{0}({G}_{t};g):= {\int }_{{\mathbb{M}}}g{\rm{d}}x-{\int }_{{\mathbb{M}}}\mathrm{ln}(g){\rm{d}}{G}_{t}.\end{eqnarray} \tag{ 8 }$

Assume for simplicity that ${\int }_{{\mathbb{M}}}\mathrm{ln}({g}^{\dagger }){g}^{\dagger }\;{\rm{d}}x\lt \infty$ . (Later we will even assume ${g}^{\dagger }\in {{\bf{L}}}^{\infty }({\mathbb{M}})$ .) Then ${\bf{E}}\left[\;{\int }_{{\mathbb{M}}}\mathrm{ln}(g){\rm{d}}{G}_{t}\right]={\int }_{{\mathbb{M}}}\mathrm{ln}(g){g}^{\dagger }\;{\rm{d}}x$ as computed above, which is independent of t and meaningful for all reasonable g. Without rescaling the former property would not be true.

Since in variational regularization the data are only accessed via the data fidelity term, it will be useful to quantify the influence of the data on the data fidelity term. This will lead to a notion of a noise level. Although the minimum of ${{ \mathcal S }}_{0}({G}_{t};\cdot )$ should be attained close to ${g}^{\dagger }$ , the value of the minimum has no significance. Therefore, we subtract ${{ \mathcal S }}_{0}({G}_{t};{g}^{\dagger })$ . It is also natural to ask for a deterministic distance measure between g and ${g}^{\dagger }$ . This can be obtained by taking expectations:

$\begin{eqnarray}&&{\bf{E}}[{{ \mathcal S }}_{0}({G}_{t};g)-{{ \mathcal S }}_{0}({G}_{t};{g}^{\dagger })]={\int }_{{\mathbb{M}}}\left[g-{g}^{\dagger }-{g}^{\dagger }\mathrm{ln}\left(\displaystyle \frac{g}{{g}^{\dagger }}\right)\right]{\rm{d}}x=: {\mathbb{KL}}({g}^{\dagger };g)\end{eqnarray} \tag{ 9 }$

${\mathbb{KL}}({g}^{\dagger };g)$ is called Kullback–Leibler divergence of g from ${g}^{\dagger }$ .

Note that ${\mathbb{KL}}({g}^{\dagger };g)$ may be $\infty$ even if $g\gt 0$ almost everywhere. $g\mapsto {\mathbb{KL}}({g}^{\dagger };g)$ is a strictly convex, lower-semicontinuous functional on ${{\bf{L}}}^{1}({\mathbb{M}})$ (see e.g. [56, corollary 2.2]). Moreover, it is easy to see that ${\mathbb{KL}}({g}^{\dagger };g)\geqslant 0$ for all $g,{g}^{\dagger }$ and ${\mathbb{KL}}({g}^{\dagger };g)=0$ if and only if $g={g}^{\dagger }$ .

We emphasize that the Kullback–Leibler divergence is widely used in statistics, not only in connection with Poisson data [148, section 2.4].

We quantify the influence of the data on the data fidelity functional over a set ${\mathfrak{G}}$ of candidate densities by the random number

$\begin{eqnarray*}{\bf{err}}({\mathfrak{G}}) & := & \underset{g\in {\mathfrak{G}}}{\mathrm{sup}}| {{ \mathcal S }}_{0}({G}_{t};g)-{{ \mathcal S }}_{0}({G}_{t};{g}^{\dagger })-{\bf{E}}[{{ \mathcal S }}_{0}({G}_{t};g)-{{ \mathcal S }}_{0}({G}_{t};{g}^{\dagger })]| \\ & = & \underset{g\in {\mathfrak{G}}}{\mathrm{sup}}\left|{\displaystyle \int }_{{\mathbb{M}}}\mathrm{ln}\left(\displaystyle \frac{g}{{g}^{\dagger }}\right)({\rm{d}}{G}_{t}-{g}^{\dagger }\;{\rm{d}}x)\right|,\end{eqnarray*}$

which we call effective noise level. In the next subsection we will discuss conditions under which ${\bf{P}}[{\bf{err}}({\mathfrak{G}})\geqslant \rho /\sqrt{t}]$ decreases exponentially in ρ uniformly in t. Such so-called concentration or deviation inequalities can only be shown if $| \mathrm{ln}(g/{g}^{\dagger })|$ is uniformly bounded. However in many applications g may be 0 in parts of the domain or get arbitrarily close to 0 such that this condition is violated. This is one of the reasons to introduce a modified data fidelity term ${{ \mathcal S }}_{\sigma }({G}_{t};g)$ with an offset parameter $\sigma \geqslant 0$ such that

$\begin{eqnarray*}&&{\bf{E}}[{{ \mathcal S }}_{\sigma }({G}_{t};g)-{{ \mathcal S }}_{\sigma }({G}_{t};{g}^{\dagger })]={\mathbb{KL}}({g}^{\dagger }+\sigma ;g+\sigma )=: {{ \mathcal T }}_{\sigma }({g}^{\dagger };g).\end{eqnarray*}$

A simple computation shows that this is the case for

$\begin{eqnarray}{{ \mathcal S }}_{\sigma }({G}_{t};g):= \left\{\begin{array}{ll}{\displaystyle \int }_{{\mathbb{M}}}g{\rm{d}}x-{\displaystyle \int }_{{\mathbb{M}}}\mathrm{ln}(g+\sigma )(\;{\rm{d}}{G}_{t}+\sigma \;{\rm{d}}x) & \mathrm{if}\ g\geqslant -\sigma \;{\rm{a.e.}},\\ \infty & \mathrm{else}.\end{array}\right.\end{eqnarray} \tag{ 10 }$

Now the error term is given by

$\begin{eqnarray}{\bf{err}}({\mathfrak{G}}) & := & \underset{g\in {\mathfrak{G}}}{\mathrm{sup}}| {{ \mathcal S }}_{\sigma }({G}_{t};g)-{{ \mathcal S }}_{\sigma }({G}_{t};{g}^{\dagger })-{{ \mathcal T }}_{\sigma }({g}^{\dagger };g)| \\ & = & \underset{g\in {\mathfrak{G}}}{\mathrm{sup}}\left|{\displaystyle \int }_{{\mathbb{M}}}\mathrm{ln}\left(\displaystyle \frac{g+\sigma }{{g}^{\dagger }+\sigma }\right)({\rm{d}}{G}_{t}-{g}^{\dagger }\;{\rm{d}}x)\right|,\end{eqnarray} \tag{ 11 }$

and we get ${\mathrm{sup}}_{g\in {\mathfrak{G}}}{\parallel \mathrm{ln}\tfrac{g+\sigma }{{g}^{\dagger }+\sigma }\parallel }_{\infty }\lt \infty$ if ${\mathrm{sup}}_{g\in {\mathfrak{G}}}{\parallel g\parallel }_{\infty }\lt \infty$ , ${g}^{\dagger }\in {\mathfrak{G}}$ and $g\geqslant 0$ for all $g\in {\mathfrak{G}}$ . This will enable us to obtain concentration inequalities for ${\bf{err}}({\mathfrak{G}})$ . Note that ${{ \mathcal S }}_{\sigma }({G}_{t};\cdot )$ is continuous w.r.t the ${{\bf{L}}}^{\infty }$ -norm if and only if $\sigma \gt 0$ , and this continuity property is used as discussed above to stochastically bound ${\bf{err}}({\mathfrak{G}})$ and also at several other points in our convergence analysis.

Another reason for choosing a small positive offset parameter $\sigma \gt 0$ are instabilities of estimators and slow convergence of optimization methods for the case $\sigma =0$ , particularly in the context of Newton-type methods (see e.g. the discussion of the possible step-sizes in algorithm 1).

The use of positive offset parameter $\sigma$ is common practice in the literature on Poisson inverse problems. Sometimes it is justified by CCD modeling as discussed in section 2.5, but such arguments should be used with care: a background noise level should not be added to the observed data, and if $\sigma$ should denote the variance of Gaussian readout-errors, then ${{ \mathcal S }}_{\sigma }$ is only justified as a log-likelihood functional for large values of $\sigma$ , whereas typical values of $\sigma$ are well below 1. We will rather consider $\sigma$ as a parameter of the inversion methods.

The following lemma describes the relations between ${{ \mathcal T }}_{\sigma }$ and the ${{\bf{L}}}^{2}$ distance:

Lemma 2.6 (Bounds for the exact data fidelity term). Let $\sigma \geqslant 0$ and ${{ \mathcal T }}_{\sigma }$ be defined by (9). Then the following estimates hold true for all functions $g\geqslant 0$ whenever the occurring norms are finite:

$\begin{eqnarray}&&{\parallel g-{g}^{\dagger }\parallel }_{{{\bf{L}}}^{2}({\mathbb{M}})}^{2}\leqslant \left(\displaystyle \frac{2}{3}{\parallel {g}^{\dagger }\parallel }_{{{\bf{L}}}^{\infty }({\mathbb{M}})}+\displaystyle \frac{4}{3}{\parallel g\parallel }_{{{\bf{L}}}^{\infty }({\mathbb{M}})}+2\sigma \right){{ \mathcal T }}_{\sigma }({g}^{\dagger };g),\end{eqnarray} \tag{ 12a }$

$\begin{eqnarray}&&{{ \mathcal T }}_{\sigma }({g}^{\dagger };g)\leqslant {\parallel \displaystyle \frac{1}{g+\sigma }\parallel }_{{{\bf{L}}}^{\infty }({\mathbb{M}})}{\parallel g-{g}^{\dagger }\parallel }_{{{\bf{L}}}^{2}({\mathbb{M}})}^{2},\end{eqnarray} \tag{ 12b }$

$\begin{eqnarray}&&{{ \mathcal T }}_{\sigma }({g}^{\dagger };g)\leqslant {\parallel \displaystyle \frac{{g}^{\dagger }+\sigma }{g+\sigma }\parallel }_{{{\bf{L}}}^{2}({\mathbb{M}})}{\parallel g-{g}^{\dagger }\parallel }_{{{\bf{L}}}^{2}({\mathbb{M}})}.\end{eqnarray} \tag{ 12c }$

Proof. We cite the following inequality from [31, lemma 2.2], which holds true for all $u\geqslant 0$ and $v\gt 0$

$\begin{eqnarray*}&&{(u-v)}^{2}\leqslant \left(\displaystyle \frac{2}{3}u+\displaystyle \frac{4}{3}v\right)\left(u\mathrm{ln}\left(\displaystyle \frac{u}{v}\right)-u+v\right)\end{eqnarray*}$

and can easily be verified by dividing by v and differentiating twice with respect to u/v. If we use this estimate with $u={g}^{\dagger }(x)+\sigma$ , $v=g(x)+\sigma$ and integrate over ${\mathbb{M}}$ , we obtain (12a) by the Cauchy–Schwarz inequality. As $\mathrm{ln}(\tau )\leqslant \tau -1$ we moreover have the pointwise estimate

$\begin{eqnarray*}&&\left(u\mathrm{ln}\left(\displaystyle \frac{u}{v}\right)-u+v\right)\leqslant \displaystyle \frac{1}{v}{(u-v)}^{2}\end{eqnarray*}$

for all $u\geqslant 0$ and $v\gt 0$ . This directly implies (12b) via integration, and from ${v}^{-1}{(u-v)}^{2}=(u/v-1)(u-v)$ and integration we obtain (12c) again by the Cauchy–Schwarz inequality. ■

2.4. Estimating the density of a Poisson process

As we have seen in proposition 2.3, there is a close connection between estimating the intensity of a inhomogeneous Poisson process and estimating a probability density from i.i.d. samples. The latter has been investigated in statistics for many decades, and due to the analogy many nonparametric estimation procedures for the latter have been transfered to the former. We refer to [13, 23, 97, 125, 126, 129, 156], but emphasize that this list is far from being complete. It is further known, that the former problem is asymptotically equivalent to estimating a positive diffusion drift in an SDE [74] provided that the density is bounded away from 0 and Lipschitz continuous. Both is used (only) to ensure that the SDE has a strong solution. A second order asymptotically accurate estimator for the density of a Poisson process was recently suggested in [73].

2.5. Binning, discretization errors, and CCD modeling

Even though the model of a Poisson process as observable data discussed above offers several theoretical advantages, the observed quantities will always be a discrete measurement ${{\bf{g}}}^{{\rm{obs}}}\in {{\mathbb{N}}}_{0}^{J}$ or ${{\mathbb{R}}}^{J}$ in practice. Let us discuss the relations between these concepts (see [88]).

Suppose the measurement area ${\mathbb{M}}$ is covered by J disjoint detector regions (bins) ${{\mathbb{M}}}_{j}$ , ${\mathbb{M}}={\bigcup }_{j=1}^{J}{{\mathbb{M}}}_{j}$ with $| {{\mathbb{M}}}_{j}| \gt 0$ for all j. Moreover, let us assume for the moment that each detector precisely counts the number of photons in ${{\mathbb{M}}}_{j}$ . Then the observed data are given by

$\begin{eqnarray*}&&{{\bf{g}}}_{j}^{{\rm{obs}}}={G}_{t}({{\mathbb{M}}}_{j})=\displaystyle \frac{\#\{1\leqslant i\leqslant n\;| \;{x}_{i}\in {{\mathbb{M}}}_{j}\}}{t},\qquad j=1,\ldots ,\quad J.\end{eqnarray*}$

Note that t is always assumed to be known. By proposition 2.2 it holds

$\begin{eqnarray*}&&{\bf{P}}\left[{{\bf{g}}}_{j}^{{\rm{obs}}}=\displaystyle \frac{k}{t}\right]=\displaystyle \frac{{(t{{\bf{g}}}_{j}^{\dagger })}^{k}}{k!}\mathrm{exp}(-t{{\bf{g}}}_{j}^{\dagger }),\end{eqnarray*}$

where ${{\bf{g}}}_{j}^{\dagger }={\int }_{{{\mathbb{M}}}_{j}}{g}^{\dagger }\;{\rm{d}}x$ , i.e. $t{{\bf{g}}}_{j}^{{\rm{obs}}}$ is a Poisson distributed random variable with parameter $t{{\bf{g}}}_{j}^{\dagger }$ . The corresponding negative log-likelihood functional is then given by (7) which will be denoted by ${{ \mathcal S }}_{J}$ in the following. For the sake of convenience (and due to favorable properties) in most cases not ${{ \mathcal S }}_{J}$ , but ${{ \mathcal S }}_{J}-{{ \mathcal S }}_{J}({{\bf{g}}}^{{\rm{obs}}};{{\bf{g}}}^{{\rm{obs}}})$ is used as fidelity term, which exactly leads to the discretized Kullback–Leibler divergence:

$\begin{eqnarray}&&{{\mathbb{KL}}}_{J}({{\bf{g}}}^{{\rm{obs}}};{\bf{g}}):= \displaystyle \sum _{j=1}^{J}\left[{{\bf{g}}}_{j}-{{\bf{g}}}_{j}^{{\rm{obs}}}-{{\bf{g}}}_{j}^{{\rm{obs}}}\mathrm{ln}\left(\displaystyle \frac{{{\bf{g}}}_{j}}{{{\bf{g}}}_{j}^{{\rm{obs}}}}\right)\right].\end{eqnarray} \tag{ 13 }$

Again note that the additional constants compared to (7) do not affect minimizers.

Let us introduce the disretization operators ${{\bf{S}}}_{J}:{{\bf{L}}}^{1}({\mathbb{M}})\to {{\mathbb{R}}}^{J}$ , ${({{\bf{S}}}_{J}g)}_{j}:= {\int }_{{{\mathbb{M}}}_{j}}g\;{\rm{d}}x$ and ${{\bf{S}}}_{J}^{* }{\bf{g}}:= {\sum }_{j=1}^{J}{| {{\mathbb{M}}}_{j}| }^{-1}{{\bf{g}}}_{j}{1}_{{{\mathbb{M}}}_{j}}$ , which are adjoint to each other w.r.t. the ${{\bf{L}}}^{2}({\mathbb{M}})$ inner product and the inner product $\langle {\bf{g}},{\bf{h}}\rangle := {\sum }_{j=1}^{J}{| {{\mathbb{M}}}_{j}| }^{-1}{{\bf{g}}}_{j}{{\bf{h}}}_{j}$ .

Then ${{\bf{P}}}_{J}:= {{\bf{S}}}_{J}^{* }{{\bf{S}}}_{J}$ is the ${{\bf{L}}}^{2}$ -orthogonal projection onto the subspace of functions, which are constant on each ${{\mathbb{M}}}_{j}$ . Note that we can extend ${{\bf{S}}}_{J}$ naturally to measures such that ${({{\bf{S}}}_{J}({G}_{t}))}_{j}={G}_{t}({{\mathbb{M}}}_{j})={{\bf{g}}}_{j}^{{\rm{obs}}}$ .

With those notations the semi-discrete problem ${{\bf{S}}}_{J}F(u)={{\bf{S}}}_{J}{G}_{t}={{\bf{g}}}^{{\rm{obs}}}$ can be analyzed analogously to the continuous case if ${{ \mathcal S }}_{J}$ is used as a fidelity term. It turns out that the additional discretization error is of the form ${\mathrm{sup}}_{g}| {\mathbb{KL}}({g}^{\dagger };g)-{\mathbb{KL}}({{\bf{P}}}_{J}{{\bf{g}}}^{\dagger };{{\bf{P}}}_{J}{\bf{g}})|$ . As expected, convergence can be achieved only for $J\to \infty$ as $t\to \infty$ .

Remark 2.7. Let us once again discuss the choice of the data fidelity term. Replacing the discrete quantities ${{\bf{g}}}^{{\rm{obs}}},{{\bf{g}}}^{\dagger }$ and ${\bf{g}}$ in (13) by functions ${g}^{{\rm{obs}}},{g}^{\dagger },g\in {{\bf{L}}}^{1}({\mathbb{M}})$ , this often serves as a motivation for using deterministic data models with the Kullback–Leibler divergence (9) as data fidelity term. However, this ignores certain scaling problems in the limiting process: In the limit of smaller and smaller bins, i.e. $J\to \infty$ the data vector ${{\bf{g}}}^{{\rm{obs}}}$ will eventually have only 0's and 1's as entries, and hence ${\sum }_{j=1}^{J}{{\bf{g}}}_{j}^{{\rm{obs}}}\mathrm{ln}({{\bf{g}}}_{j}^{{\rm{obs}}}/| {{\mathbb{M}}}_{j}| )$ will tend to $\infty$ with probability 1 although ${\mathrm{lim}}_{J\to \infty }{\sum }_{j=1}^{J}{{\bf{g}}}_{j}^{\dagger }\mathrm{ln}({{\bf{g}}}_{j}^{\dagger }/| {{\mathbb{M}}}_{j}| )={\int }_{{\mathbb{M}}}{g}^{\dagger }\mathrm{ln}{g}^{\dagger }\;{\rm{d}}x$ under reasonable regularity assumptions. This is related to the fact that a corresponding continuous quantity ${\int }_{{\mathbb{M}}}\mathrm{ln}({G}_{t}){\rm{d}}{G}_{t}$ cannot be defined in a meaningful way. Thus any data model for which (13) is uniformly bounded in J must differ seriously from a true Poissonian data model, and similarly definitions of a noise level based on (13), e.g. ${{\mathbb{KL}}}_{J}({{\bf{g}}}^{{\rm{obs}}};{{\bf{g}}}^{\dagger })$ seem questionable. We point out that this is a common phenomenon in statistical inverse problems: if one considers discrete observations with Gaussian errors modeled analogously as local averages of a Gaussian white noise process, the limit $J\to \infty$ of the standard quadratic log-likelihood functional will diverge with probability 1.

Let us now consider CCD camera detectors in some more detail. Snyder et al [137, 138] model the signal of a CCD camera as a sum of independent random variables

$\begin{eqnarray*}&&{g}_{j}+{b}_{j}+{e}_{j},\end{eqnarray*}$

where g_j and b_j are Poisson distributed and describe the signal and various types of background noise, respectively, whereas e_j describes read-out errors and is assumed to be Gaussian distributed with mean m_j and variance v_j. Moreover, ${\bf{E}}[{g}_{j}]={\beta }_{j}{\int }_{{{\mathbb{M}}}_{j}}{{tg}}^{\dagger }\;{\rm{d}}x$ with an efficiency parameter ${\beta }_{j}$ describing the probability of a photon in ${{\mathbb{M}}}_{j}$ to be detected. If ${\beta }_{j}$ and ${\bf{E}}[{b}_{j}]$ are known from calibration measurements, they may be incorporated into the forward operator F to fit the CCD model with readout errors e_j = 0 into our framework. This turns a linear operator into an affine linear operator. Alternatively, if ${\beta }_{j}$ and ${\bf{E}}[{b}_{j}]$ are constant, but unknown, they may be incorporated in the forward operator F as additional unknowns. In case of ${\bf{E}}[{b}_{j}]$ this leads to a linear operator if F is linear, but the incorporation of ${\beta }_{j}$ yields a nonlinear operator.

Gaussian readout errors e_j do not fit into our framework exactly. If v_j is large and m_j = v_j (other values of m_j can easily be accommodated for by adding ${v}_{j}-{m}_{j}$ to the data), then the distribution of e_j can be well approximated by a Poisson random variable with mean v_j. This leads to the data fidelity term ${{ \mathcal S }}_{\sigma }$ with $\sigma ={v}_{j}$ . The true log-likelihood of the mixture of a Poisson and a Gaussian random variable is more complicated. The use of such likelihoods has been discussed in [20, 137, 138].

2.6. Concentration and deviation inequalities

Talagrand's concentration inequalities in product spaces [146] were an important breakthrough in probability theory. If ${X}_{1},\ldots ,{X}_{n}$ are independent identically distributed random variables on some measurable space $({\mathbb{M}},{ \mathcal M })$ , ${ \mathcal F }$ is a countable family of real-valued measurable functions on $({\mathbb{M}},{ \mathcal M })$ with ${\mathrm{sup}}_{f\in { \mathcal F }}{\parallel f\parallel }_{\infty }\lt \infty$ and $Z:= {\mathrm{sup}}_{f\in { \mathcal F }}{\sum }_{i=1}^{n}f({X}_{i})$ he proved that ${\bf{P}}[| Z-{\bf{E}}[Z]| \gt x]$ decays exponentially in $x\gt 0$ . If only ${\bf{P}}[Z-{\bf{E}}[Z]\gt x]$ is bounded one speaks of deviation inequalities. Later Massart [110] derived a version of Talagrand's inequality with explicit constants. We cite the following deviation inequality for Poisson processes from [125] which has a similar form as Massart's result:

Theorem 2.8 ([125, corollary 2]). Let $G$ be a Poisson process with intensity ${g}^{\dagger }\in {{\bf{L}}}^{1}({\mathbb{M}})$ and let ${ \mathcal F }\subset {L}^{\infty }({\mathbb{M}})$ be a countable family of real-valued functions with ${\parallel f\parallel }_{\infty }\leqslant b$ for all $f\in { \mathcal F }$ . Let

$\begin{eqnarray*}&&Z:= \underset{f\in { \mathcal F }}{\mathrm{sup}}\left|{\int }_{{\mathbb{M}}}f({\rm{d}}G-{g}^{\dagger }{\rm{d}}x)\right|\quad \mathrm{and}\quad {v}_{0}:= \underset{f\in { \mathcal F }}{\mathrm{sup}}{\int }_{{\mathbb{M}}}{f}^{2}{g}^{\dagger }\;{\rm{d}}x.\end{eqnarray*}$

Then for any $\rho ,\varepsilon \gt 0$ the following inequality holds true:

$\begin{eqnarray}&&{\bf{P}}\left[Z\geqslant (1+\varepsilon ){\bf{E}}[Z]+\sqrt{12{{tv}}_{0}\rho }+\left(\displaystyle \frac{5}{4}+\displaystyle \frac{32}{\varepsilon }\right)b\rho \right]\leqslant \mathrm{exp}(-\rho ).\end{eqnarray} \tag{ 14 }$

From this theorem the following deviation inequalities were derived in [155]:

Theorem 2.9. Let ${\mathbb{M}}\subset {{\mathbb{R}}}^{d}$ be a bounded Lipschitz domain, let ${G}_{t}$ be a temporally normalized Poisson process with intensity ${g}^{\dagger }\in {{\bf{L}}}^{1}({\mathbb{M}})$ in the sense of definition 2.5 and let ${H}^{s}({\mathbb{M}})$ denote the ${{\bf{L}}}^{2}$ -based Sobolev space with index $s\gt d/2$ , and ${\mathfrak{G}}(R):= \{g\in {H}^{s}({\mathbb{M}})\;:{\parallel g\parallel }_{{H}^{s}}\leqslant R\}$ . Then there exists a constant $c\gt 0$ depending only on ${\mathbb{M}}$ , s and ${\parallel {g}^{\dagger }\parallel }_{{L}^{1}}$ such that

$\begin{eqnarray*}&&{\bf{P}}\left[\underset{{\mathfrak{g}}\in {\mathfrak{G}}(R)}{\mathrm{sup}}\left|{\int }_{{\mathbb{M}}}{\mathfrak{g}}(\;{\rm{d}}{G}_{t}-{g}^{\dagger }{\rm{d}}x)\right|\geqslant \displaystyle \frac{\rho }{\sqrt{t}}\right]\leqslant \mathrm{exp}\left(-\displaystyle \frac{\rho }{{cR}}\right)\end{eqnarray*}$

for all $R\geqslant 1$ , $t\geqslant 1$ and $\rho \geqslant {cR}$ . Moreover

$\begin{eqnarray}&&{\bf{P}}\left[{\bf{err}}({\mathfrak{G}}(R))\geqslant \displaystyle \frac{\rho }{\sqrt{t}}\right]\leqslant \mathrm{exp}\left(-\displaystyle \frac{\rho }{{C}_{{\rm{conc}}}}\right)\end{eqnarray} \tag{ 15 }$

for all $R\geqslant 1$ , $t\geqslant 1$ and $\rho \geqslant {C}_{{\rm{conc}}}$ where ${C}_{{\rm{conc}}}:= 2\mathrm{max}\{{\sigma }^{-\lfloor s\rfloor -1},| \mathrm{ln}(R)| \}{Rc}$ .

3. Projection methods

For linear statistical inverse problems projection methods are one of the most intensively studied and widely used approaches. Applying such methods to the operator equation (1) consists in constructing an estimator in some (typically finite-dimensional) closed subspace ${{ \mathcal X }}_{j}\subset { \mathcal X }$ . Sometimes the data is also projected onto a closed subspace ${{ \mathcal Y }}_{j}\subset { \mathcal Y }$ .

Suppose in the following that $T\;:{ \mathcal X }\to { \mathcal Y }$ is a bounded, linear operator. Given a finite-dimensional subspace ${{ \mathcal X }}_{j}\subset { \mathcal X }$ , a natural projection-type estimator for Gaussian noise is

$\begin{eqnarray}&&{u}_{j}^{{\rm{P}}}:= \underset{u\in {{ \mathcal X }}_{j}}{\mathrm{argmin}}{\parallel {Tu}-{g}^{{\rm{obs}}}\parallel }_{{ \mathcal Y }}^{2},\end{eqnarray} \tag{ 16 }$

where ${g}^{{\rm{obs}}}\in { \mathcal Y }$ denotes the observed data. As ${{ \mathcal X }}_{j}$ is finite-dimensional, $\parallel {u}_{j}^{{\rm{P}}}\parallel$ has finite variance, but the variance tends to infinity as the dimension of ${{ \mathcal X }}_{j}$ tends to $\infty ,$ i.e. regularization is achieved by the choice of the subspaces. For fixed sequences of subspaces the index j plays the role of a regularization parameter.

For Poisson data, a natural analog of (16) would be given by

$\begin{eqnarray}&&{u}_{j}^{{\rm{ML}}}:= \underset{u\in {{ \mathcal X }}_{j}\cap { \mathcal C }}{\mathrm{argmin}}\;{{ \mathcal S }}_{\sigma }({G}_{t};F(u)),\end{eqnarray} \tag{ 17 }$

where ${ \mathcal C }\subset { \mathcal X }$ is a closed cone (e.g. of non-negative functions) such that $F(u)\geqslant 0$ for all $u\in { \mathcal C }$ for linear F. Note that ${u}_{j}^{{\rm{ML}}}$ is the maximum likelihood estimator if ${{ \mathcal X }}_{j}={ \mathcal X }$ and $\sigma =0$ . Unfortunately it seems difficult to control the variance of ${u}_{j}^{{\rm{ML}}}$ , and we are not aware of a convergence analysis of this estimator. For this reason we will discuss estimators motivated by (16) in the following, which have been studied more intensively in the literature. Still note that (16) does not incorporate information on the data distribution. We refer to Szkutnik [143] for an approach which is similar to the methods introduced below, but also incorporates the data distribution via quasi maximum likelihood.

To define projection-type estimator motivated by (16) for Poisson data consider the estimator

$\begin{eqnarray}&&{u}_{j}^{{\rm{G}}}\in {{ \mathcal X }}_{j}:\qquad {\langle {{Tu}}_{j}^{{\rm{G}}},v\rangle }_{{ \mathcal Y }}=\langle {g}^{{\rm{obs}}},v\rangle \qquad \mathrm{for}\ \mathrm{all}\qquad v\in {{ \mathcal Y }}_{j}.\end{eqnarray} \tag{ 18 }$

If ${{ \mathcal Y }}_{j}=T({{ \mathcal X }}_{j})$ , then ${u}_{j}^{{\rm{P}}}={u}_{j}^{{\rm{G}}}$ . Equation (18) can be used for different choices of ${{ \mathcal Y }}_{j}$ as well, but the connection to (16) is lost in general. Note that (18) can also be used to define an estimator if ${g}^{{\rm{obs}}}$ is replaced by a Poisson process, as the right-hand side in (18) only consists of integrals against ${g}^{{\rm{obs}}}$ which are well-defined.

If T is compact with singular value decomposition $({\sigma }_{k},{u}_{k},{v}_{k})$ , and if we choose ${{ \mathcal X }}_{j}=\mathrm{span}\{{u}_{1},\ldots ,{u}_{j}\}$ and ${{ \mathcal Y }}_{j}=T({{ \mathcal X }}_{j})=\mathrm{span}\{{v}_{1},...,{v}_{j}\}$ , then (18) gives

$\begin{eqnarray}&&{u}_{j}^{{\rm{P}}}=\displaystyle \sum _{k=1}^{j}{\sigma }_{k}^{-1}{\langle {g}^{{\rm{obs}}},{v}_{k}\rangle }_{{ \mathcal Y }}{u}_{k}.\end{eqnarray} \tag{ 19 }$

As observed in the pioneering works of Donoho [51], the singular value decomposition can be significantly outperformed by wavelet and wavelet-vaguelette bases in certain situations. We introduce a simple type of such bases in the following subsection.

3.1. Wavelet bases

Assume a scaling function ϕ and a wavelet function ψ are given. Then we will denote by ${\phi }_{\lambda }$ and ${\psi }_{\lambda }$ the corresponding scaling and wavelet functions at scale j where λ contains the scale parameters j as well as the space parameters k. If the dimension of ${\mathbb{M}}$ is 1, then $\lambda =(j,k)$ with $j,k\in {\mathbb{N}}$ and ${\psi }_{j,k}={2}^{j/2}\psi ({2}^{j}\cdot -k)$ . In the general case we will write $| \lambda | =j$ to indicate that the wavelet ${\psi }_{\lambda }$ has scale j, in case of $| \lambda | \lt j$ we mean that ${\psi }_{\lambda }$ has scale $j^{\prime}$ where $0\leqslant j^{\prime} \lt j$ . In this setup we suppose the following:

Assumption 3.1. Let ${\mathbb{M}}={({\mathbb{R}}/{\mathbb{Z}})}^{d}$ . The generating functions ϕ and ψ are chosen such that

(i)
${\{{\phi }_{\lambda }\}}_{| \lambda | =j}$ is an orthonormal basis of ${{ \mathcal X }}_{j}=\mathrm{span}\{{\phi }_{\lambda }\;| \;| \lambda | \quad =\quad j\}$ such that ${{ \mathcal X }}_{0}\subset {{ \mathcal X }}_{1}\subset ...\subset {{\bf{L}}}^{2}({\mathbb{M}})$ with $\mathrm{dim}({{ \mathcal X }}_{j})={2}^{{jd}}$ , $d=\mathrm{dim}({\mathbb{M}})$ and ${\{{\psi }_{\lambda }\}}_{| \lambda | =j}$ is an orthonormal basis of the orthogonal complement W_j of ${{ \mathcal X }}_{j}$ into ${{ \mathcal X }}_{j+1}$ .
(ii)
Any $u\in {{\bf{L}}}^{2}({\mathbb{M}})$ can be represented by its wavelet decomposition
$\begin{eqnarray*}&&u=\displaystyle \sum _{| \lambda | =0}\langle u,{\phi }_{\lambda }\rangle {\phi }_{\lambda }+\displaystyle \sum _{j=0}^{\infty }\displaystyle \sum _{| \lambda | =j}\langle u,{\psi }_{\lambda }\rangle {\psi }_{\lambda }.\end{eqnarray*}$
To simplify the notation, it is convenient to write ${\{{\psi }_{\lambda }\}}_{| \lambda | =-1}:= {\{{\phi }_{\lambda }\}}_{| \lambda | =0}$ and consider $j\in {{\mathbb{Z}}}_{\geqslant -1}$ in the following.
(iii)
We have ${\parallel {\psi }_{\lambda }\parallel }_{{{\bf{L}}}^{\infty }({\mathbb{M}})}={2}^{| \lambda | d/2}{\parallel \psi \parallel }_{{{\bf{L}}}^{\infty }({\mathbb{M}})}$ .
(iv)
For all $j\in {\mathbb{N}}$ there exists A_j such that ${\parallel v\parallel }_{{{\bf{L}}}^{\infty }({\mathbb{M}})}\leqslant {A}_{j}{\parallel v\parallel }_{{{\bf{L}}}^{2}({\mathbb{M}})}$ for all $v\in {{ \mathcal X }}_{j}$ .

Under those assumptions the corresponding wavelets provide an unconditional basis for Besov spaces. If additionally $\phi ,\psi \in {C}^{L}$ where $L\gt s$ and $\sigma := s+d\left(\tfrac{1}{2}-\tfrac{1}{p}\right)\geqslant 0$ , the norm defined by

$\begin{eqnarray*}&&{\parallel u\parallel }_{s,p,q}:= {\left(\displaystyle \sum _{j=1}^{\infty }{\left({2}^{j\sigma p}\displaystyle \sum _{| \lambda | =j}{| \langle u,{\psi }_{\lambda }\rangle | }^{p}\right)}^{\tfrac{q}{p}}\right)}^{\tfrac{1}{q}}\end{eqnarray*}$

is equivalent to the usual Besov space norm of ${B}_{p,q}^{s}({{\mathbb{R}}}^{d})$ (see [68]).

3.2. Projection-type estimators using a WVD

As discussed above, choosing ${{ \mathcal X }}_{j}$ as a wavelet space in (18) might be a good choice for representing ${u}^{\dagger }$ , but in general T will not be diagonal w.r.t. this basis. Donoho [51] also described a subclass of operators which possess a so-called WVD, which is given by an orthogonal wavelet basis $({\psi }_{\lambda })$ of ${ \mathcal X }$ and two systems of functions $({u}_{\lambda })$ and $({v}_{\lambda })$ in ${ \mathcal Y }$ such that $T{\psi }_{\lambda }={\kappa }_{j}{v}_{\lambda }$ , ${T}^{* }{u}_{\lambda }={\kappa }_{j}{\psi }_{\lambda }$ (here the values ${\kappa }_{j}$ only depend in the scale parameter j). Furthermore, the functions u_λ and v_λ should be bi-orthogonal (i.e. ${\langle {u}_{\lambda },{v}_{\mu }\rangle }_{{ \mathcal Y }}={\delta }_{\lambda ,\mu }$ ) and near-orthogonal. Note that the wavelet basis ${\psi }_{\lambda }$ in the WVD cannot be chosen by the user. This gives rise to the reconstruction formula

$\begin{eqnarray}&&{u}^{\dagger }=\displaystyle \sum _{\lambda }{\kappa }_{j}^{-1}{\langle {{Tu}}^{\dagger },{u}_{\lambda }\rangle }_{{ \mathcal Y }}{\psi }_{\lambda },\end{eqnarray} \tag{ 20 }$

which can readily be used to construct an SVD-type estimator from ${g}^{{\rm{obs}}}$ instead of ${{Tu}}^{\dagger }$ . Furthermore, as ${g}^{{\rm{obs}}}$ will enter the formula (20) only in terms of an integral against u_λ, this can also be used to define linear projection-type estimators for inverse problems with Poisson data as follows:

$\begin{eqnarray}&&{u}_{j}^{{\rm{LIN}}}:= \displaystyle \sum _{| \lambda | \leqslant j}{\kappa }_{j}^{-1}{\int }_{{\mathbb{M}}}{u}_{\lambda }{\rm{d}}{G}_{t}{\psi }_{\lambda }.\end{eqnarray} \tag{ 21 }$

Estimators of this form have been studied by Cavalier and Koo [39] if T is the Radon transform (in this case, the WVD is explicitly known). They provide the best possible rate of convergence in Besov-balls w.r.t. the ${{\bf{L}}}^{2}$ -norm and give conditions, under which (21) achieves this rate. Their results are of similar type as those for the related wavelet-Galerkin method stated below.

For inverse problems with additive Gaussian white noise it is well-known (see [51, section 6]) that linear estimators do not yield order optimal results in case that the exact solution ${u}^{\dagger }$ is supposed to be in a Besov space ${B}_{p,p}^{s}({\mathbb{M}})$ with $1\leqslant p\lt 2$ . In this situation, nonlinear estimators have to be taken into account. This is also true for the results concerning (21).

The necessary nonlinearity is often introduced by applying a soft-thresholding operator

$\begin{eqnarray}&&{\zeta }_{\epsilon }(x):= \mathrm{sgn}(x)\mathrm{max}(| x| -\epsilon ,0)\end{eqnarray} \tag{ 22 }$

to the wavelet coefficients. Note that for a sequence ${({x}_{n})}_{n\in {\mathbb{N}}}$ the sequence ${({\zeta }_{\epsilon }({x}_{n}))}_{n\in {\mathbb{N}}}$ can be seen as a 'sparsened' version of the original sequence as ${\zeta }_{\epsilon }$ maps all coefficients with absolute value less than $\epsilon$ to 0.

This leads to the nonlinear estimator

$\begin{eqnarray}&&{u}_{j}^{{\rm{NL}}}:= \displaystyle \sum _{| \lambda | \leqslant j}{\zeta }_{\epsilon }\left({\kappa }_{j}^{-1}{\int }_{{\mathbb{M}}}{u}_{\lambda }{\rm{d}}{G}_{t}\right){\psi }_{\lambda },\end{eqnarray} \tag{ 23 }$

which has also been studied by Cavalier and Koo [39]. Again they calculate the optimal rate of convergence among all estimators and prove that the nonlinear estimator (23) as well as an adaptive version with soft-thresholding replaced by hardthresholding yields the optimal rate.

3.3. Projection-type estimators using a wavelet-Galerkin approximation

The above use of a WVD is restricted applications where such a decomposition is known explicitly. Otherwise ${T}^{* }$ needs to be inverted numerically to obtain the u_λ's, which causes instabilities due to ill-posedness. To overcome these restrictions, Cohen et al [45] proposed a related, but different technique called wavelet-Galerkin approximation. If T in (18) is self-adjoint and positive definite, we may choose ${{ \mathcal Y }}_{j}={{ \mathcal X }}_{j}$ . This results in a classical Galerkin approximation. If T is not self-adjoint, the normal equation arising from (16) motivates the application of the Galerkin approximation to ${T}^{* }{Tu}={T}^{* }{g}^{{\rm{obs}}}$ , i.e.

$\begin{eqnarray}&&{u}_{j}^{{\rm{Gmod}}}\in {{ \mathcal X }}_{j}:\qquad \langle {{Tu}}_{j}^{{\rm{Gmod}}},{Tv}\rangle =\langle {g}^{{\rm{obs}}},{Tv}\rangle \quad \mathrm{for}\ \mathrm{all}\quad v\in {{ \mathcal X }}_{j}.\end{eqnarray} \tag{ 24 }$

Therefore, we confine the subsequent discussion to the case that T is self-adjoint and positive definite following [4, 45] who claim that similar results can be shown for (24) without stating them explicitely.

Let ${\psi }_{\lambda }$ again be a wavelet basis satisfying assumption 3.1 and set ${{ \mathcal X }}_{j}=\mathrm{span}\{{\psi }_{\lambda }\;| \;| \lambda | \lt j\}$ . Then the linear Galerkin estimator for ${u}^{\dagger }$ is given by

$\begin{eqnarray}&&{u}_{j}^{{\rm{G}}}\in {{ \mathcal X }}_{j}:\qquad \langle {{Tu}}_{j}^{{\rm{G}}},{\psi }_{\lambda }\rangle =\langle {g}^{{\rm{obs}}},{\psi }_{\lambda }\rangle \quad \mathrm{for}\ \mathrm{all}\quad | \lambda | \lt j.\end{eqnarray} \tag{ 25 }$

Again, this estimator can be extended to Poisson data, as the right-hand side of (25) is well-defined also for ${g}^{{\rm{obs}}}={G}_{t}$ . The function ${u}_{j}\in {{ \mathcal X }}_{j}$ satisfying (25) with ${g}^{{\rm{obs}}}={g}^{\dagger }$ is also called the Galerkin approximation of ${u}^{\dagger }$ .

Next we aim for a direct description of the wavelet coefficients of ${u}_{j}^{{\rm{LS}}}$ . Let us introduce the Galerkin wavelets ${g}_{\lambda }^{j}\in {{ \mathcal X }}_{j}$ by

$\begin{eqnarray*}&&\langle {{Tg}}_{\lambda }^{j},v\rangle =\langle {\psi }_{\lambda },v\rangle \qquad \mathrm{for}\ \mathrm{all}\qquad v\in {{ \mathcal X }}_{j}.\end{eqnarray*}$

By assumption, ${T}_{{| }_{{{ \mathcal X }}_{j}}}$ is bounded invertible for all j as ${{ \mathcal X }}_{j}$ is finite-dimensional where ${T}_{{| }_{{{ \mathcal X }}_{j}}}$ denotes the restriction of T to the subspace ${{ \mathcal X }}_{j}$ . Thus the functions ${g}_{\lambda }^{j}$ can be written as ${g}_{\lambda }^{j}={(T{| }_{{{ \mathcal X }}_{j}})}^{-1}{\psi }_{\lambda }$ . Nevertheless the norm of ${(T{| }_{{{ \mathcal X }}_{j}})}^{-1}$ will tend to $\infty$ as $j\to \infty$ if the underlying problem is ill-posed. So the Galerkin wavelets ${g}_{\lambda }^{j}$ show an unfavorable behavior for large j and amplify the noise in the right-hand side.

A simple calculation shows that with this definition one obtains

$\begin{eqnarray*}&&\langle {g}_{\lambda }^{j},{g}^{\dagger }\rangle =\langle {u}_{j},{\psi }_{\lambda }\rangle \end{eqnarray*}$

for all $| \lambda | \lt j$ . So the wavelets coefficients of u_j are given by $\langle {g}_{\lambda }^{j},{g}^{\dagger }\rangle$ if $| \lambda | \lt j$ and 0 else. This formulation does not explicitly use the forward operator T and gives rise to the standard linear projection estimator ${\hat{u}}_{j}$ for ${u}^{\dagger }$ w.r.t. the prescribed wavelet basis defined by

$\begin{eqnarray}&&{\hat{u}}_{j}\in {{ \mathcal X }}_{j}:\qquad \langle {\hat{u}}_{j},{\psi }_{\lambda }\rangle ={\int }_{{\mathbb{M}}}{g}_{\lambda }^{j}{\rm{d}}{G}_{t}\qquad \mathrm{for}\ \mathrm{all}\qquad | \lambda | \lt j.\end{eqnarray} \tag{ 26 }$

In other words, the wavelet coefficients of ${\hat{u}}_{j}$ are given by ${\int }_{{\mathbb{M}}}{g}_{\lambda }^{j}{\rm{d}}{G}_{t}$ whenever $| \lambda | \lt j$ and by 0 else, i.e. ${\hat{u}}_{j}={\sum }_{| \lambda | \lt j}\left[{\int }_{{\mathbb{M}}}{g}_{\lambda }^{j}{\rm{d}}{G}_{t}\right]{\psi }_{\lambda }$ . This is a natural generalization of the Galerkin approximation as we simply replace the exact quantities $\langle {g}_{\lambda }^{j},{g}^{\dagger }\rangle$ by the measured approximations. Obviously, taking into account the definition of the Galerkin wavelets ${g}_{\lambda }^{j}$ and the functions u_λ in the definition of the WVD, (26) is an analog of (21). Nevertheless one should note that here the basis ${\psi }_{\lambda }$ is chosen by the user, which is not the case in the WVD. As the Galerkin wavelets ${g}_{\lambda }^{j}$ amplify the noise, j plays the role of the regularization parameter here.

A main disadvantage of the linear estimator (26) is that ${\hat{u}}_{j}$ is not necessarily non-negative. Therefore, Antoniadis and Bigot [4] proposed to project ${u}^{\dagger }$ onto a function space of the form

$\begin{eqnarray}&&{E}_{j}=\left\{{u}_{j,\theta }=\mathrm{exp}\left(\;\displaystyle \sum _{| \lambda | \lt j}{\theta }_{\lambda }{\psi }_{\lambda }\right)\;| \;\theta ={({\theta }_{\lambda })}_{| \lambda | \lt j}\in {{\mathbb{R}}}^{{2}^{{jd}}}\right\}.\end{eqnarray} \tag{ 27 }$

The choice of this approximation set ensures that ${u}_{j,\theta }\gt 0$ a.e. By results of Csiszár [48] we have

$\begin{eqnarray*}&&{\mathbb{KL}}({u}^{\dagger };{u}_{j,\theta })=\underset{u\in {E}_{j}}{\mathrm{inf}}{\mathbb{KL}}({u}^{\dagger };u)\quad \iff \quad \forall | \lambda | \lt j\ \langle {u}_{j,\theta },{\psi }_{\lambda }\rangle =\langle {u}^{\dagger },{\psi }_{\lambda }\rangle .\end{eqnarray*}$

This suggests to define the estimator ${u}_{j,\hat{\theta }}\in {E}_{j}$ such that

$\begin{eqnarray}&&\langle {u}_{j,\hat{\theta }},{\psi }_{\lambda }\rangle ={\int }_{{\mathbb{M}}}{g}_{\lambda }^{j}{\rm{d}}{G}_{t}\qquad \mathrm{for}\ \mathrm{all}\qquad | \lambda | \lt j.\end{eqnarray} \tag{ 28 }$

Obviously ${u}_{j,\hat{\theta }}$ depends linearly on G_t, but the corresponding parameter $\hat{\theta }\in {{\mathbb{R}}}^{{2}^{{jd}}}$ does not depend linearly on G_t due to the exponential function in (27). In fact, $\hat{\theta }$ is computed by a Newton scheme, and the existence of a solution to (28) can only be guaranteed with increasing probability as $t\to \infty$ .

As the wavelet coefficients $\langle {u}^{\dagger },{\psi }_{\lambda }\rangle$ are approximated by ${\int }_{{\mathbb{M}}}{g}_{\lambda }^{j}{\rm{d}}{G}_{t}$ as seen above, (28) can be seen as the Galerkin information projection estimate of ${u}^{\dagger }$ at scale j. Obviously, a similar interpretation is not valid for the standard linear estimator (26) as ${\mathbb{KL}}({u}^{\dagger };u)=\infty$ whenever $u\lt 0$ on a set of positive measure.

Note that this is a result concerning projection w.r.t. the Kullback–Leibler divergence in ${ \mathcal X }$ and not in ${ \mathcal Y }$ where the Kullback–Leibler divergence appears in connection with the log-likelihood of the Poisson distribution. The methods (26) and (28) are not connected to a likelihood method at all.

For the asymptotic behavior of the estimator (28), the following result holds true:

Theorem 3.2 (See [4, theorem 4.2]). Let ${\mathbb{M}}={({\mathbb{R}}/{\mathbb{Z}})}^{d}$ and assume that the linear operator $T\;:{L}^{2}({\mathbb{M}})\to {L}^{2}({\mathbb{M}})$ is self-adjoint, positive definite, and an isomorphism between ${H}^{\tau }({\mathbb{M}})$ and ${H}^{\tau +\nu }({\mathbb{M}})$ for any $\tau \geqslant 0$ for some $\nu \gt 0$ and that ${u}^{\dagger }=\mathrm{exp}({v}^{\dagger })$ with some ${v}^{\dagger }\in {H}^{s}({\mathbb{M}})={B}_{2,2}^{s}({\mathbb{M}})$ where $s\gt d/2$ . Suppose moreover that assumption 3.1 is satisfied and ψ having compact support is chosen such that $\psi \in {H}^{\tau +d/2-\nu }$ with $\tau \gt \nu -d/2$ and the first $r\gt \tau +d/2$ moments of ψ vanish. Then for the a priori parameter choice

$\begin{eqnarray}&&{2}^{-j(t)}={\left(\displaystyle \frac{1}{t}\right)}^{\tfrac{1}{2s+2\nu +d}}\end{eqnarray} \tag{ 29 }$

with probability tending to 1 as $t\to \infty$ the Galerkin information projection estimate (28) exists and satisfies

$\begin{eqnarray}&&{\bf{E}}[{\mathbb{KL}}({u}^{\dagger };{u}_{j,\hat{\theta }})]={ \mathcal O }\left({\left(\displaystyle \frac{1}{t}\right)}^{\tfrac{2s}{2s+2\nu +d}}\right),\qquad t\to \infty .\end{eqnarray} \tag{ 30 }$

In case that ${u}^{\dagger }\in {B}_{p,p}^{s}({\mathbb{M}})$ with $1\leqslant p\lt 2$ , the linear estimator will again not be of optimal order. The assumption especially implies that ${u}^{\dagger }$ and its weak derivatives are in some sense sparse w.r.t. the wavelet basis, and hence our estimator should also be sparse. Recall that ${\hat{u}}_{j}={\sum }_{| \lambda | \lt j}\left[\;{\int }_{{\mathbb{M}}}{g}_{\lambda }^{j}{\rm{d}}{G}_{t}\right]{\psi }_{\lambda }$ for the standard linear estimator ${\hat{u}}_{j}$ . This leads to the natural nonlinear estimator

$\begin{eqnarray}&&{\hat{u}}_{j}^{{\rm{NL}}}=\displaystyle \sum _{| \lambda | \lt j}{\zeta }_{\epsilon }\left(\;{\int }_{{\mathbb{M}}}{g}_{\lambda }^{j}{\rm{d}}{G}_{t}\right){\psi }_{\lambda },\end{eqnarray} \tag{ 31 }$

which is again an analog to (23) but with user-chosen ${\psi }_{\lambda }$ . As in the linear case, ${\hat{u}}_{j}^{{\rm{NL}}}$ is not necessarily non-negative. Thus Antoniadis and Bigot [4] consider the corresponding Galerkin information projection onto the exponential family E_j of functions (27), i.e.

$\begin{eqnarray}&&\langle {u}_{j,\hat{\theta }}^{{\rm{NL}}},{\psi }_{\lambda }\rangle ={C}_{\lambda }^{\psi \to g}{\left({\zeta }_{\epsilon }\left({\int }_{{\mathbb{M}}}{\psi }_{\kappa }{\rm{d}}{G}_{t}\right)\right)}_{| \kappa | \lt j}\qquad \mathrm{for}\ \mathrm{all}\qquad | \lambda | \lt j,\end{eqnarray} \tag{ 32 }$

where ${C}_{j,\lambda }^{\psi \to g}$ denotes the linear transformation mapping the coefficients w.r.t. $\{{\psi }_{\lambda }\}$ to the λth coefficient w.r.t $\{{g}_{\lambda }^{j}\}$ . Thus thresholding is not performed on the coefficients of the data w.r.t. $\{{g}_{\lambda }^{j}\}$ but w.r.t. $\{{\psi }_{\lambda }\}$ itself. For an adaptive version of the estimator (32) the following asymptotic behavior was shown:

Theorem 3.3 (See [4, theorem 5.1]). Assume that the linear operator $T$ is self-adjoint, positive definite, an isomorphism between ${{\bf{L}}}^{2}({\mathbb{M}})$ and ${H}^{\nu }({\mathbb{M}})$ and maps ${B}_{p,p}^{\tau }({\mathbb{M}})$ bounded into ${B}_{p,p}^{\tau +\nu }({\mathbb{M}})$ for any $\tau \geqslant 0$ . Let ${u}^{\dagger }$ satisfy ${u}^{\dagger }=\mathrm{exp}({v}^{\dagger })$ with some ${v}^{\dagger }\in {B}_{p,p}^{s}({\mathbb{M}})$ where $s\gt 0$ and $1/p=1/2+s/(2\nu +d)$ . Suppose moreover that assumption 3.1 is satisfied and ψ having compact support is chosen such that $\psi \in {H}^{\tau +d/2-\nu }$ with $\tau \gt \nu -d/2$ and the first $r\gt \tau +d/2$ moments of ψ vanish. Then for the a posteriori choices

$\begin{eqnarray*}&&{2}^{-j(t)}\leqslant {\left(\displaystyle \frac{1}{t}\right)}^{\tfrac{1}{2\nu }},\qquad \epsilon (t,\lambda )=\displaystyle \frac{\sqrt{| \mathrm{ln}(t)| }}{\sqrt{t}}{2}^{\nu | \lambda | }\end{eqnarray*}$

with probability tending to 1 as $t\to \infty$ the Galerkin information projection estimate (32) exists and satisfies

$\begin{eqnarray}&&{\bf{E}}[{\mathbb{KL}}({u}^{\dagger };{u}_{j,\hat{\theta }}^{{\rm{NL}}})]={ \mathcal O }\left({\left(\displaystyle \frac{\sqrt{| \mathrm{ln}(t)| }}{t}\right)}^{\tfrac{2s}{2s+2\nu +d}}\right),\qquad t\to \infty .\end{eqnarray} \tag{ 33 }$

4. Variational regularization of inverse problems with Poisson data

This section deals with variational regularization methods of the form

$\begin{eqnarray}&&{\hat{u}}_{\alpha }\in \underset{u\in {\mathfrak{B}}}{\mathrm{argmin}}\left[\displaystyle \frac{1}{\alpha }{{ \mathcal S }}_{\sigma }({G}_{t};F(u))+{ \mathcal R }(u)\right],\end{eqnarray} \tag{ 34 }$

also known as generalized Tikhonov regularization and its variants. Here ${({G}_{t})}_{t\gt 0}$ is a temporally normalized Poisson process as in definition 2.5, ${{ \mathcal S }}_{\sigma }$ is the data fidelity term defined in (10), ${ \mathcal R }$ is a convex penalty term, and $\alpha \gt 0$ a regularization parameter.

From a frequentist statistical perspective, (34) with $\sigma =0$ corresponds to a penalized maximum likelihood approach, and from a Bayesian point of view, (34) describes a maximum a posteriori (MAP) estimator for ${u}^{\dagger }$ if (in a discrete setting) the prior has a density of the form $\propto \mathrm{exp}(-\alpha { \mathcal R }(u))$ .

Note that the functional $u\mapsto \tfrac{1}{\alpha }{{ \mathcal S }}_{\sigma }({G}_{t};F(u))+{ \mathcal R }(u)$ is convex if $F$ is linear. This makes a large number of algorithms available, some of which will be discussed in section 6. If $F$ is nonlinear, one possibility is to iteratively replace $F$ by linear approximations, see section 4.5.

The choice of ${ \mathcal R }$ should be driven by potential a priori information on the solution in the sense that ${ \mathcal R }(u)$ should be small for physically plausible solutions $u$ . For example, if it is known that the true solution ${u}^{\dagger }$ is smooth, then a Sobolev norm ${ \mathcal R }(u)={\parallel u\parallel }_{{H}^{s}}^{2}$ with some index $s\gt 0$ is reasonable. Other functionals which have received considerable attention in the literature include entropy-like functionals of the form

$\begin{eqnarray}&&{{ \mathcal R }}_{{\rm{ME}}}(u)={\int }_{{\rm{\Omega }}}u\mathrm{ln}(u){\rm{d}}x,\end{eqnarray} \tag{ 35 }$

where $u\in { \mathcal X }={{\bf{L}}}^{1}({\rm{\Omega }})$ , ${\rm{\Omega }}\subset {{\mathbb{R}}}^{n}$ , and the total variation seminorm

$\begin{eqnarray}&&{{ \mathcal R }}_{{\rm{TV}}}(u)=\mathrm{sup}\left\{{\int }_{{\rm{\Omega }}}u\mathrm{div}\varphi {\rm{d}}x\;| \;\varphi \in {C}_{0}^{1}{({\rm{\Omega }})}^{n},{\parallel \varphi \parallel }_{\infty }\leqslant 1\right\}.\end{eqnarray} \tag{ 36 }$

If $u\in {W}^{\mathrm{1,1}}({\rm{\Omega }})$ , then ${{ \mathcal R }}_{{\rm{TV}}}(u)={\int }_{{\rm{\Omega }}}| {\rm{\nabla }}u| {\rm{d}}x$ . We further mention sparsity promoting functionals given by a weighted ℓ¹-norm

$\begin{eqnarray}&&{{ \mathcal R }}_{{{\ell }}^{1}}(u)=\displaystyle \sum _{i\in I}{\beta }_{i}| {u}_{i}| .\end{eqnarray} \tag{ 37 }$

Of the coefficients ${({u}_{i})}_{i\in I}$ of u w.r.t. to a basis or frame and weights ${\beta }_{i}\gt 0$ . Common examples are the pixel basis, wavelet frames (as introduced in section 3.1), shearlet and curvelet frames.

4.1. Existence of minimizers

First we have to show that minimizers of the Tikhonov functional in (34) exist. For this end as well as for the discussion of regularizing properties in the next subsection we introduce the following assumptions:

Assumption 4.1. Let ${ \mathcal X }$ be a Banach space, ${\mathbb{M}}\subset {{\mathbb{R}}}^{d}$ is a bounded Lipschitz domain, and $F\;:{\mathfrak{B}}\subset { \mathcal X }\to {{\bf{L}}}^{1}({\mathbb{M}})$ the forward operator.

There exists a topology ${\tau }_{{ \mathcal X }}$ weaker than the ${ \mathcal X }$ norm topology and a topology ${\tau }_{{ \mathcal Y }}$ such that $({ \mathcal X },{\tau }_{{ \mathcal X }})$ and $(\mathrm{span}(F({\mathfrak{B}})),{\tau }_{{ \mathcal Y }})$ are a locally convex vector space.
${ \mathcal R }\;:{ \mathcal X }\to (-\infty ,\infty ]$ is convex, proper, sequentially lower semicontinuous w.r.t. ${\tau }_{{ \mathcal X }}$ , and the sub-levelsets $\{u\in { \mathcal X }\;| \;{ \mathcal R }(u)\leqslant M\}$ are sequentially pre-compact w.r.t. ${\tau }_{{ \mathcal X }}$ for all $M\gt 0$ .
${\mathfrak{B}}$ is sequentially closed w.r.t. ${\tau }_{{ \mathcal X }}$ .
${\mathfrak{B}}\cap \mathrm{dom}({ \mathcal R })\ne \varnothing$ .
$F:{\mathfrak{B}}\cap \mathrm{dom}({ \mathcal R })\to F({\mathfrak{B}})$ is sequentially continuous w.r.t. ${\tau }_{{ \mathcal X }}$ and ${\tau }_{{ \mathcal Y }}$ .
$F(u)\geqslant 0$ for all $u\in {\mathfrak{B}}$ .
${\tau }_{{ \mathcal Y }}$ is a norm topology, and $F({\mathfrak{B}})$ is a bounded w.r.t. ${\tau }_{{ \mathcal Y }}$ .

Let us discuss these assumptions, starting with the choice of ${\tau }_{{ \mathcal Y }}$ in (H1): for data ${G}_{t}\in {{\bf{L}}}^{1}({\mathbb{M}})$ as used in deterministic noise models, the weak ${{\bf{L}}}^{1}$ -topology can be used. If G_t is a measure, we will use the strong ${{\bf{L}}}^{\infty }$ -topology or even the strong H^s-topology to apply the concentration inequality in theorem 2.9. Note that the stronger ${\tau }_{{ \mathcal Y }}$ , the more smoothing F has to be to fulfil (H5) and (H7).

For the choice of ${\tau }_{{ \mathcal X }}$ the compactness assumption in (H2) is most critical. If ${ \mathcal R }(u)=\parallel u-{u}_{0}{\parallel }_{{ \mathcal X }}^{p}$ , $p\geqslant 1$ where ${ \mathcal X }$ is a reflexive Banach space or a Banach space with a predual, then (H2) holds true for the weak topology ${\tau }_{{ \mathcal X }}$ by the Banach–Alaoglu theorem. For the entropy functional ${{ \mathcal R }}_{{\rm{ME}}}$ it has been shown in [56, lemma 2.1] using the Dunford–Pettis theorem that sublevel sets are relatively compact w.r.t. the weak ${{\bf{L}}}^{1}$ -topology. Note that by the Eberlein–Šmulian theorem weak pre-compactness and weak sequential pre-compactness are equivalent in general Banach spaces. The case ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ will be discussed after proposition 4.2. For ${ \mathcal R }$ given by weighted l¹ norms, we may choose ${\tau }_{{ \mathcal X }}$ as strong topology of an l¹-norm with weights increasing slower than $({\beta }_{i})$ .

For linear operators F (H6) is typically verified by showing that F preserves non-negativity and choosing ${\mathfrak{B}}$ as a subset of the cone of non-negative functions.

For linear operators F assumption (H7) is obviously satisfied if ${\mathfrak{B}}$ is bounded w.r.t. some norm for which F is ${\tau }_{{ \mathcal Y }}$ continuous. Many forward operators F in applications preserve the ${{\bf{L}}}^{1}$ norm, and then ${\parallel {u}^{\dagger }\parallel }_{{{\bf{L}}}^{1}}$ can be estimated from the number of counts.

Proposition 4.2 (Existence and uniqueness of ${\hat{u}}_{\alpha }$ ).

(i)
If ${G}_{t}={g}^{{\rm{obs}}}\in {{\bf{L}}}^{1}({\mathbb{M}})\subset {\mathfrak{M}}({\mathbb{M}})$ with ${g}^{{\rm{obs}}}\geqslant 0$ and ${\mathbb{KL}}({g}^{{\rm{obs}}}+\sigma ;F({u}^{\dagger })+\sigma )\lt \infty$ and if (H1)–(H6) in assumption 4.1 hold true with the weak ${{\bf{L}}}^{1}$ topology ${\tau }_{{ \mathcal Y }}$ , then the Tikhonov functional in (34) has a minimizer ${\hat{u}}_{\alpha }$ for all $\alpha \gt 0$ and $\sigma \geqslant 0$ .
(ii)
If G_t is a finite, positive Borel measure and assumption 4.1 holds true with the strong norm topology ${\tau }_{{ \mathcal Y }}$ of ${{\bf{L}}}^{\infty }({\mathbb{M}})$ , then the Tikhonov functional in (34) has a minimizer ${\hat{u}}_{\alpha }$ for all $\alpha \gt 0$ and $\sigma \gt 0$ .
(iii)
${\hat{u}}_{\alpha }$ is unique if F is linear and ${ \mathcal R }$ is strictly convex.

Proof.

(i)
This is a standard argument using
$\begin{eqnarray}&&{{ \mathcal S }}_{\sigma }({G}_{t};g)-{\int }_{{\mathbb{M}}}(({g}^{{\rm{obs}}}+\sigma )\mathrm{ln}({g}^{{\rm{obs}}}+\sigma )+{g}^{{\rm{obs}}}+\sigma ){\rm{d}}x\\ &&\quad ={\mathbb{KL}}({g}^{{\rm{obs}}}+\sigma ;g+\sigma )\geqslant 0\end{eqnarray}$
and the weak sequential lower semicontinuity of ${\mathbb{KL}}({g}^{{\rm{obs}}}+\sigma ;\cdot )$ (see e.g. [120]).
(ii)
By (H6) and (H7) we have ${{ \mathcal S }}_{\sigma }({G}_{t};F(u))\lt \infty$ for all $u\in {\mathfrak{B}}$ , so by (H4) the Tikhonov functional is proper. Let $({u}_{n})$ be a minimizing sequence. Note from (10) that
$\begin{eqnarray}&&{{ \mathcal S }}_{\sigma }({G}_{t};F({u}_{n}))\geqslant -\mathrm{ln}({C}_{{\mathfrak{B}}}+\sigma )({G}_{t}({\mathbb{M}})+\sigma | {\mathbb{M}}| )\end{eqnarray} \tag{ 38 }$
for all n with ${C}_{{\mathfrak{B}}}:= {\mathrm{sup}}_{u\in {\mathfrak{B}}}{\parallel F(u)\parallel }_{\infty }$ (which is finite by (H7)). Therefore, $M:= {\mathrm{sup}}_{n}{ \mathcal R }({u}_{n})\lt \infty$ . Hence, due to (H2) there exists a subsequence $({u}_{n(k)})$ which converges to some ${\hat{u}}_{\alpha }\in { \mathcal X }$ w.r.t. ${\tau }_{{ \mathcal X }}$ . (H3) implies ${\hat{u}}_{\alpha }\in {\mathfrak{B}}$ , and (H5) implies that ${\mathrm{lim}}_{k\to \infty }{\parallel F({u}_{n(k)})-F({\hat{u}}_{\alpha })\parallel }_{\infty }=0$ . From the lower semi-continuity of ${ \mathcal R }$ w.r.t. ${\tau }_{{ \mathcal X }}$ and the continuity of ${{ \mathcal S }}_{\sigma }({G}_{t};\cdot )$ w.r.t. the supremum norm it follows that ${\hat{u}}_{\alpha }$ is a minimum.
(iii)
This follows from the convexity of ${{ \mathcal S }}_{\sigma }({G}_{t};\cdot )$ .

■

For the total variation penalty ${{ \mathcal R }}_{{\rm{TV}}}$ defined in (36) assumption (H2) is violated since ${{ \mathcal R }}_{{\rm{TV}}}(u+c)={{ \mathcal R }}_{{\rm{TV}}}(u)$ for any constant c, so sublevel sets cannot be compact w.r.t. any useful topology ${\tau }_{{ \mathcal X }}$ . However, if ${{ \mathcal R }}_{{\rm{TV}}}(u)$ is replaced by ${\parallel u\parallel }_{\mathrm{BV}}:= {{ \mathcal R }}_{{\rm{TV}}}(u)+{\parallel u\parallel }_{{{\bf{L}}}^{1}}$ , then (H2) is satisfied for the strong ${{\bf{L}}}^{p}$ topology as ${\tau }_{{ \mathcal X }}$ for any $1\leqslant p\lt \tfrac{m}{m-1}$ where m is the dimension of Ω (see [1, theorem 2.5]). (H2) is also satisfied for the penalty

$\begin{eqnarray*}&&{{ \mathcal R }}_{{\rm{BV}}}(u):= {{ \mathcal R }}_{{\rm{TV}}}(u)+| \bar{u}| \qquad {\rm{with}}\qquad \bar{u}:= \displaystyle \frac{1}{| {\rm{\Omega }}| }{\int }_{{\rm{\Omega }}}u\;{\rm{d}}x\end{eqnarray*}$

since ${{ \mathcal R }}_{{\rm{BV}}}$ defines an equivalent norm on $\mathrm{BV}({\rm{\Omega }})$ : obviously ${{ \mathcal R }}_{{\rm{BV}}}(u)\leqslant \mathrm{max}\{1,{| {\rm{\Omega }}| }^{-1}\}{\parallel u\parallel }_{{\rm{BV}}}$ . Moreover, it follows from the Poincaré–Wirtinger inequality ${\parallel u-\bar{u}\parallel }_{{{\bf{L}}}^{1}}\leqslant {C}_{{\rm{PW}}}{{ \mathcal R }}_{{\rm{TV}}}(u)$ ([1, equation (2.11)]) that ${\parallel u\parallel }_{{\rm{BV}}}\leqslant \mathrm{max}\{| {\rm{\Omega }}| ,{C}_{{\rm{PW}}}+1\}{{ \mathcal R }}_{{\rm{TV}}}(u)$ .

However, ${{ \mathcal R }}_{{\rm{BV}}}$ and ${\parallel \cdot \parallel }_{{\rm{BV}}}$ are rarely used as penalties since they lead to additional algorithmic difficulties and typically to worse results. Therefore, we state a separate existence result for the case ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ . Similar results have been shown in [1] for quadratic data fidelities and in [12, 132] for (34) with ${G}_{t}={g}^{{\rm{obs}}}\in {{\bf{L}}}^{1}({\mathbb{M}})\subset {\mathfrak{M}}({\mathbb{M}})$ .

Proposition 4.3 (Existence of minimizers for ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ ). Suppose that (H1) and (H3)–(H7) in assumption 4.1 hold true with ${ \mathcal X }=\mathrm{BV}({\rm{\Omega }})$ for a bounded Lipschitz domain ${\rm{\Omega }}\subset {{\mathbb{R}}}^{m}$ , with ${\tau }_{{ \mathcal X }}$ the strong ${{\bf{L}}}^{p}$ -topology for some $p\in [1,\tfrac{m}{m-1})$ , ${\tau }_{{ \mathcal Y }}$ the strong ${{\bf{L}}}^{\infty }$ -topology, ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ , a set ${\mathfrak{B}}$ such that ${1}_{{\rm{\Omega }}}\in {\mathfrak{B}}$ with the the constant function ${1}_{{\rm{\Omega }}}\;:{\rm{\Omega }}\to {\mathbb{R}}$ , $x\mapsto 1$ , and a linear operator $F$ satisfying

$\begin{eqnarray}&&\underset{x\in {\mathbb{M}}}{\mathrm{inf}}(F{1}_{{\rm{\Omega }}})(x)\gt 0.\end{eqnarray} \tag{ 39 }$

Then the Tikhonov functional in (34) has a minimizer ${\hat{u}}_{\alpha }$ for all finite Borel measures ${G}_{t}$ and all $\alpha \gt 0$ and $\sigma \gt 0$ .

Proof. As in the proof of proposition 4.2 we can conclude that for a minimizing sequence ${({u}_{n})}_{n\in {\mathbb{N}}}$ the supremum $M:= {\mathrm{sup}}_{n}{{ \mathcal R }}_{{\rm{TV}}}({u}_{n})$ is finite. The main difficulty is to show that also

$\begin{eqnarray}&&\underset{n}{\mathrm{sup}}\bar{{u}_{n}}\lt \infty .\end{eqnarray} \tag{ 40 }$

As sublevel sets of ${{ \mathcal R }}_{{\rm{BV}}}$ are precompact in ${{\bf{L}}}^{p}({\rm{\Omega }})$ as discussed above, we can then extract a convergent subsequence of $({u}_{n})$ in ${{\bf{L}}}^{p}({\rm{\Omega }})$ , and the remainder of the proof coincides with that of proposition 4.2.

To show (40), assume on the contrary that ${\mathrm{sup}}_{n}\;\bar{{u}_{n}}=\infty$ . By the continuity of the embedding ${{\bf{L}}}^{p}({\rm{\Omega }})\;\hookrightarrow \;\mathrm{BV}({\rm{\Omega }})$ we have ${\parallel {u}_{n}-\bar{{u}_{n}}{1}_{{\rm{\Omega }}}\parallel }_{{{\bf{L}}}^{p}}\leqslant {C}_{p}{{ \mathcal R }}_{{\rm{TV}}}({u}_{n})\leqslant {C}_{p}M$ for some ${C}_{p}\gt 0$ and all n. Hence

$\begin{eqnarray*}&&{\parallel F({u}_{n}-\bar{{u}_{n}}{1}_{{\rm{\Omega }}})\parallel }_{\infty }\leqslant C\end{eqnarray*}$

with $C:= {\parallel F\parallel }_{{{\bf{L}}}^{p}\to {{\bf{L}}}^{\infty }}{C}_{p}M$ . By the assumptions (39) and (H7) we have $\gamma \leqslant (F{1}_{{\rm{\Omega }}})(x)\leqslant {\rm{\Gamma }}$ for all $x\in {\mathbb{M}}$ and $0\lt \gamma \leqslant {\rm{\Gamma }}$ , so by the triangle inequality

$\begin{eqnarray*}&&\bar{{u}_{n}}\gamma -C\leqslant {{Fu}}_{n}\leqslant \bar{{u}_{n}}{\rm{\Gamma }}+C.\end{eqnarray*}$

Now we can conclude from our assumption ${\mathrm{sup}}_{n}\;\bar{{u}_{n}}=\infty$ and the definition (10) of ${{ \mathcal S }}_{\sigma }$ that ${\mathrm{sup}}_{n}{{ \mathcal S }}_{\sigma }({G}_{t};{{Fu}}_{n})=\infty$ , in contradiction to the minimizing property of $({u}_{n})$ . This shows (40) and completes the proof.■

4.2. Regularizing properties

Once existence of minimizers of the Tikhonov functional has been established, the next natural question is in which sense they converge to the true solution.

4.2.1. Definitions

To concisely describe such results from the literature below, let us introduce the notion of a regularization scheme for operator equations (1) in Banach spaces. Generalizing the standard definition of regularization schemes in Hilbert spaces (see e.g. [59]) we introduce a loss function ${d}_{{ \mathcal X }}$ in the pre-image space ${ \mathcal X }$ and a noise measure ${d}_{{ \mathcal Y }}$ in the image space. In the standard definition, these are given by the norms of ${ \mathcal X }$ and ${ \mathcal Y }$ , i.e. ${d}_{{ \mathcal X }}({u}_{1},{u}_{2})={\parallel {u}_{1}-{u}_{2}\parallel }_{{ \mathcal X }}$ and ${d}_{{ \mathcal Y }}({g}_{1},{g}_{2})={\parallel {g}_{1}-{g}_{2}\parallel }_{{ \mathcal Y }}$ . However, in the context of variational regularization with general penalty function ${ \mathcal R }$ , measuring reconstruction errors in the norm of ${ \mathcal X }$ often turns out to be too restrictive. Moreover, in the context of deterministic models for Poisson data ${d}_{{ \mathcal Y }}={\mathbb{KL}}$ has often been used.

Definition 4.4 (Deterministic regularization scheme). Let ${\mathfrak{B}}$ and ${ \mathcal Y }$ be sets, $F:{\mathfrak{B}}\to { \mathcal Y }$ a forward operator, ${d}_{{ \mathcal X }}\;:{\mathfrak{B}}\times {\mathfrak{B}}\to [0,\infty ]$ a loss function, and ${d}_{{ \mathcal Y }}\;:{ \mathcal Y }\times { \mathcal Y }\to [0,\infty ]$ a noise measure fulfilling ${d}_{{ \mathcal X }}(u,u)=0$ and ${d}_{{ \mathcal Y }}(g,g)=0$ for all $u\in {\mathfrak{B}}$ and $g\in { \mathcal Y }$ . A family of mappings ${R}_{\alpha }\;:{ \mathcal Y }\to {\mathfrak{B}}$ together with a parameter choice rule $\bar{\alpha }\;:(0,\infty )\times { \mathcal Y }\to (0,\infty )$ is called a (deterministic) regularization scheme for the problem (1) w.r.t. the loss ${d}_{{ \mathcal X }}$ and the noise measure ${d}_{{ \mathcal Y }}$ if for all ${g}^{\dagger }\in F({\mathfrak{B}})$ there exists ${u}^{\dagger }\in B$ with $F({u}^{\dagger })={g}^{\dagger }$ such that

$\begin{eqnarray}&&\underset{\delta \searrow 0}{\mathrm{lim}}\mathrm{sup}\{{d}_{{ \mathcal X }}({R}_{\bar{\alpha }(\delta ,{g}^{{\rm{obs}}})}({g}^{{\rm{obs}}}),{u}^{\dagger })\;| \;{g}^{{\rm{obs}}}\in { \mathcal Y },{d}_{{ \mathcal Y }}({g}^{{\rm{obs}}},F({u}^{\dagger }))\leqslant \delta \}=0.\end{eqnarray} \tag{ 41 }$

If we consider the situation of random data, (41) needs to be replaced by some type of stochastic convergence.

Definition 4.5 (Statistical regularization scheme). Let ${ \mathcal X }$ be a locally convex vector space, $F:{\mathfrak{B}}\subset { \mathcal X }\to {{\bf{L}}}^{1}({\mathbb{M}})$ a forward operator and ${d}_{{ \mathcal X }}\;:{\mathfrak{B}}\times {\mathfrak{B}}\to [0,\infty )$ a Borel measurable loss function fulfilling ${d}_{{ \mathcal X }}(u,u)=0$ for all $u\in {\mathfrak{B}}\subset { \mathcal X }$ . A Borel measurable mapping $R:{\mathfrak{M}}({\mathbb{M}})\times (0,\infty )\to {\mathfrak{B}}$ , $(G,\alpha )\mapsto {R}_{\alpha }(G)$ together with a measurable parameter choice rule $\bar{\alpha }\;:(0,\infty )\times {\mathfrak{M}}({\mathbb{M}})\to (0,\infty )$ is called a statistical regularization scheme or consistent estimator w.r.t. the loss ${d}_{{ \mathcal X }}$ for the problem (1) under Poisson data if for all ${g}^{\dagger }\in F({\mathfrak{B}})$ there exist ${u}^{\dagger }\in {\mathfrak{B}}$ with $F({u}^{\dagger })={g}^{\dagger }$ such that

$\begin{eqnarray}&&\forall \;\varepsilon \gt 0\;:\qquad \underset{t\to \infty }{\mathrm{lim}}{\bf{P}}[{d}_{{ \mathcal X }}({R}_{\bar{\alpha }(t,{G}_{t})}({G}_{t}),{u}^{\dagger })\gt \varepsilon ]=0,\end{eqnarray} \tag{ 42 }$

where G_t is a temporally normalized Poisson process with intensity ${g}^{\dagger }=F({u}^{\dagger })$ .

In other words we require that ${d}_{{ \mathcal X }}({R}_{\bar{\alpha }(t,{G}_{t})}({G}_{t}),{u}^{\dagger })\to 0$ in probability. Alternatively, we could require convergence in expectation, ${\mathrm{lim}}_{t\to \infty }{\bf{E}}[{d}_{{ \mathcal X }}({R}_{\bar{\alpha }(t,{G}_{t})}({G}_{t}),{u}^{\dagger })]=0$ . Recall that this is a stronger condition due to Markov's inequality ${\bf{P}}[| X| \gt \varepsilon ]\leqslant {\bf{E}}[| X| ]/\varepsilon$ .

Often the strongest loss function with respect to which the regularizing property can be shown is given by the Bregman distance ${{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}(\cdot ,{u}^{\dagger })$ of ${ \mathcal R }$ for a proper, convex, lower-semicontinuous functional ${ \mathcal R }\;:{ \mathcal X }\to (-\infty ,\infty ]$ defined by

$\begin{eqnarray*}&&{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}(\hat{u},{u}^{\dagger }):= { \mathcal R }(\hat{u})-{ \mathcal R }({u}^{\dagger })-\langle {u}^{* },\hat{u}-{u}^{\dagger }\rangle .\end{eqnarray*}$

Here ${u}^{* }\in \partial { \mathcal R }({u}^{\dagger })$ is a subgradient. In the context of inverse problems the Bregman distance was first used by Eggermont [56] for maximum-entropy regularization ${ \mathcal R }={{ \mathcal R }}_{{\rm{ME}}}$ and later for general ${ \mathcal R }$ by Burger and Osher [36]. It has become a very frequently used loss function.

By convexity the Bregman distance is non-negative, and obviously ${{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({u}^{\dagger },{u}^{\dagger })=0$ . For strictly convex ${ \mathcal R }$ the equation ${{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}(u,{u}^{\dagger })=0$ holds true if and only if $u={u}^{\dagger }$ . Actually the Bregman distance is in general not a distance since it is neither symmetric nor does it satisfy a triangle inequality. If ${ \mathcal X }$ is a Hilbert space and ${ \mathcal R }(u)=\tfrac{1}{2}{\parallel u-{u}_{0}\parallel }_{{ \mathcal X }}^{2}$ , we have ${u}^{* }=u-{u}_{0}$ and ${{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}(u,{u}^{\dagger })=\tfrac{1}{2}{\parallel u-{u}^{\dagger }\parallel }_{{ \mathcal X }}^{2}$ . Moreover, if ${ \mathcal X }$ is a q-convex Banach space and ${ \mathcal R }(u)=\tfrac{1}{q}{\parallel u\parallel }_{{ \mathcal X }}^{q}$ , then there exists ${C}_{{\rm{bd}}}\gt 0$ such that

$\begin{eqnarray}&&{\parallel u-{u}^{\dagger }\parallel }_{{ \mathcal X }}^{q}\leqslant {C}_{{\rm{bd}}}{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}(u,{u}^{\dagger })\qquad {\rm{for}}\;{\rm{all}}\qquad u\in {\mathfrak{B}}\end{eqnarray} \tag{ 43 }$

(see Xu and Roach [157]). This property (43) is particularly helpful since then convergence and convergence rates w.r.t. the Bregman distance imply convergence and convergence rates w.r.t. the norm. Note that (43) can hold true also for different penalty terms ${ \mathcal R }$ than norms of a q-convex Banach space. As an example consider the entropy functional ${{ \mathcal R }}_{{\rm{ME}}}$ where it turns out that ${{ \mathcal D }}_{{{ \mathcal R }}_{{\rm{ME}}}}^{{u}^{* }}(u,{u}^{\dagger })={\mathbb{KL}}(u;{u}^{\dagger })$ . Thus it follows from (12a) with $\sigma =0$ that (43) can be achieved.

At this point we also mention that it is not a coincidence that the derived distance measure for the Poisson distribution is a Bregman divergence. There is a general connection between exponential families and Bregman divergences, for which we refer to [7].

If the solution of $F({u}^{\dagger })={g}^{\dagger }$ is not uniquely determined, limiting elements of Tikhonov minimizers will have the following property:

Definition 4.6 ( ${ \mathcal R }$ -minimizing solution). Let ${ \mathcal X }$ be a topological vector space, $F\;:{\mathfrak{B}}\subset { \mathcal X }\to {{\bf{L}}}^{1}({\mathbb{M}})$ a forward operator, ${g}^{\dagger }\in F({\mathfrak{B}})$ , and ${ \mathcal R }\;:{ \mathcal X }\to (-\infty ,\infty ]$ a convex penalty term. An element ${u}^{\dagger }\in {\mathfrak{B}}$ is called ${ \mathcal R }$ -minimizing solution of (1), if

$\begin{eqnarray}&&{ \mathcal R }({u}^{\dagger })=\mathrm{min}\{{ \mathcal R }(u)\;| \;u\in {\mathfrak{B}},F(u)={g}^{\dagger }\}\lt \infty .\end{eqnarray} \tag{ 44 }$

In the following we will always have to assume that such ${ \mathcal R }$ -minimizing solutions are unique.

4.2.2. Deterministic regularizing properties

Tikhonov-type regularization with Poisson likelihood fidelity (34) under a deterministic error assumption has been studied by several authors over the last years covering a wide range of aspects. Note that if the observed data are not described by a sum of point measures, but by some function ${G}_{t}={g}^{{\rm{obs}}}\in {{\bf{L}}}^{1}({\mathbb{M}})$ , it is no longer necessary to distinguish between ${{ \mathcal S }}_{\sigma }$ and ${{ \mathcal T }}_{\sigma }$ , i.e. the Kullback–Leibler divergence may be used as data fidelity term (see remark 2.7).

A first reference is Resmerita and Anderssen [122] where the special case that ${ \mathcal R }$ is also given by the Kullback–Leibler-divergence w.r.t. some initial guess and linear operators F is considered.

The paper by Eggermont and LaRiccia [57] also deals with a minimization problem of the form (34), but in a slightly different setup.

Results for general ${ \mathcal R }$ and general ${ \mathcal S }$ , which cover the previous result, were obtained by Pöschl [120] and Flemming [64, 66]. They state a general set of assumptions on the data fidelity term, one of which is

$\begin{eqnarray*}&&{\rm{If}}\;{\rm{}}\underset{n\to \infty }{\mathrm{lim}}{\mathbb{KL}}({g}_{n};g)=0\;{\rm{then}}\underset{n\to \infty }{\mathrm{lim}}{\mathbb{KL}}({g}_{n};\tilde{g})=0\;{\rm{for}}\;{\rm{all}}\;\tilde{g}\;{\rm{with}}\;{\mathbb{KL}}(\tilde{g};g)\lt \infty .\end{eqnarray*}$

This assumption can only be verified if $F(u)\geqslant a\gt 0$ for all $u\in {\mathfrak{B}}$ , which is a quite restrictive assumption in some applications. However, this property is only used in the proof of stability (i.e. continuous dependence of the minimizers on the data), but not for the proof of the regularizing property. Hence, the result can be formulated as follows:

Theorem 4.7 (Deterministic regularizing property). Suppose (H1)–(H6) in assumption 4.1 holds true with ${ \mathcal Y }={{\bf{L}}}^{1}({\mathbb{M}})$ and the weak topology ${\tau }_{{ \mathcal Y }}$ . Moreover, assume that for each ${g}^{\dagger }\in F({\mathfrak{B}})$ there exists a unique ${ \mathcal R }$ -minimizing solution ${u}^{\dagger }$ satisfying $\partial { \mathcal R }({u}^{\dagger })\ne \varnothing$ . Then the Tikhonov regularizaton (34) defines a deterministic regularization method for ${d}_{{ \mathcal X }}(u,{u}^{\dagger }):= {{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}(u,{u}^{\dagger })$ , ${d}_{{ \mathcal Y }}({g}_{1},{g}_{2}):= {\mathbb{KL}}({g}_{1}+\sigma ;{g}_{2}+\sigma )$ , $\sigma \geqslant 0$ , and any parameter choice rule satisfying

$\begin{eqnarray*}&&\bar{\alpha }(\delta ,{g}^{{\rm{obs}}})\searrow 0,\qquad \displaystyle \frac{\delta }{\bar{\alpha }(\delta ,{g}^{{\rm{obs}}})}\searrow 0\qquad \mathrm{as}\qquad \delta \searrow 0.\end{eqnarray*}$

Proof. The proof proceeds along the same lines as for Tikhonov regularization in Hilbert spaces with quadratic ${ \mathcal R }$ and ${ \mathcal S }$ [59]. It can be found in [120, theorem 1.10] or [66, theorem 3.4]. ■

In a series of papers Bardsley and coworkers studied generalized Tikhonov regularization (34) with error measure ${d}_{{ \mathcal Y }}({g}_{1},{g}_{2})={\parallel {g}_{1}-{g}_{2}\parallel }_{{{\bf{L}}}^{\infty }}$ instead of the Kullback–Leibler divergence. Moreover, they also consider perturbations of the operator, not only the data. The operator F is assumed to be affine linear where F(0) is interpreted as background noise, which is assumed to be non-negative, and the linear part $F-F(0)$ is assumed to be a non-negativity preserving integral operator. In particular, they show that (34) defines a regularization scheme with an a priori parameter choice rule in the following cases:

${\mathfrak{B}}=\{u\in {{\bf{L}}}^{2}({\rm{\Omega }})\quad | \quad u\geqslant 0\;{\rm{a.e.}}\}$ , ${ \mathcal R }(u)={\parallel u\parallel }_{{{\bf{L}}}^{2}}^{2}$ and ${d}_{{ \mathcal X }}({u}_{1},{u}_{2})={\parallel {u}_{1}-{u}_{2}\parallel }_{{{\bf{L}}}^{2}}$ (see [9]);
${\mathfrak{B}}=\{u\in {H}^{1}({\rm{\Omega }})\quad | \quad u\geqslant 0\;{\rm{a.e.}}\}$ , ${ \mathcal R }(u)={\parallel {\rm{\Lambda }}\;\mathrm{grad}u\parallel }_{{{\bf{L}}}^{2}}^{2}$ with a positive definite matrix function Λ and ${d}_{{ \mathcal X }}({u}_{1},{u}_{2})={\parallel {u}_{1}-{u}_{2}\parallel }_{{{\bf{L}}}^{2}}$ (see [11]);
${\mathfrak{B}}=\{u\in \mathrm{BV}({\rm{\Omega }})\quad | \quad u\geqslant 0\;{\rm{a.e.}}\}$ , ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ and ${d}_{{ \mathcal X }}({u}_{1},{u}_{2})={\parallel {u}_{1}-{u}_{2}\parallel }_{{{\bf{L}}}^{p}}$ with $1\leqslant p\lt \tfrac{m}{m-1}$ (see [12]).

4.2.3. Statistical regularizing property/consistency

Now let us come back to Tikhonov-type regularization (34) of inverse problems with Poisson data. Although we have seen in section 2 and especially in theorem 2.9 that a natural analog to the noise level δ in the previous subsection is given by $\tfrac{\rho }{\sqrt{t}}$ we emphasize that the noise ${\bf{err}}({\mathfrak{G}})$ may be arbitrarily large for any ${\mathfrak{G}}$ with small, but positive probability. Hence, the results of the previous subsection are not applicable.

Theorem 4.8 (Stochastic regularizing property). Suppose assumption 4.1 holds true true with strong ${H}^{s}({\mathbb{M}})$ topology ${\tau }_{{ \mathcal Y }}$ with $s\gt d/2$ . Moreover, assume that for each ${g}^{\dagger }\in F({\mathfrak{B}})$ there exists a unique ${ \mathcal R }$ -minimizing solution ${u}^{\dagger }\in {\mathfrak{B}}$ satisfying $\partial { \mathcal R }({u}^{\dagger })\ne \varnothing$ . Set ${d}_{{ \mathcal X }}(u,{u}^{\dagger }):= {{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}(u,{u}^{\dagger })$ for a selection ${u}^{* }\in \partial { \mathcal R }({u}^{\dagger })$ . Then the family of mappings ${R}_{\alpha }\;:{\mathfrak{M}}({\mathbb{M}})\to \mathrm{dom}({ \mathcal R })$ defined by ${R}_{\alpha }({G}_{t}):= {\hat{u}}_{\alpha }$ with any minimizer ${\hat{u}}_{\alpha }$ of (34) with $\sigma \gt 0$ equipped with a parameter choice $\bar{\alpha }\;:(0,\infty )\times {\mathfrak{M}}({\mathbb{M}})\to (0,\infty )$ fulfilling

$\begin{eqnarray}&&\underset{t\to \infty }{\mathrm{lim}}\bar{\alpha }(t,{G}_{t})=0,\end{eqnarray} \tag{ 45a }$

$\begin{eqnarray}&&\underset{t\to \infty }{\mathrm{lim}}\displaystyle \frac{\mathrm{ln}(t)}{\sqrt{t}\bar{\alpha }(t,{G}_{t})}=0\end{eqnarray} \tag{ 45b }$

defines a statistical regularization scheme for problem (1) under Poisson data w.r.t. ${d}_{{ \mathcal X }}$ .

Proof. Suppose the assertion is wrong. Then there exist $\varepsilon ,\delta \gt 0$ and a sequence ${({t}_{n})}_{n\in {\mathbb{N}}}\subset {{\mathbb{R}}}_{\gt 0}$ , ${t}_{n}\to \infty$ such that

$\begin{eqnarray}&&{\bf{P}}[{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({R}_{\bar{\alpha }({t}_{n},{G}_{{t}_{n}})}({G}_{{t}_{n}}),{u}^{\dagger })\gt \varepsilon ]\gt \delta \end{eqnarray} \tag{ 46 }$

for all $n\in {\mathbb{N}}$ . W.l.o.g. we can assume that ${t}_{n}\geqslant n$ for all $n\in {\mathbb{N}}$ . Let us define ${\alpha }_{n}:= \bar{\alpha }({t}_{n},{G}_{{t}_{n}})$ and ${u}_{n}:= {R}_{{\alpha }_{n}}({G}_{{t}_{n}})$ , $n\in {\mathbb{N}}$ . It follows from the definition and by some rearrangements that

$\begin{eqnarray*}&&{ \mathcal R }({u}_{n})\leqslant \displaystyle \frac{1}{{\alpha }_{n}}({{ \mathcal S }}_{\sigma }({G}_{{t}_{n}},F({u}^{\dagger }))-{{ \mathcal S }}_{\sigma }({G}_{{t}_{n}},F({u}_{n})))+{ \mathcal R }({u}^{\dagger })\end{eqnarray*}$

for all $n\in {\mathbb{N}}$ . Now define $\rho (t):= \mathrm{ln}({t}^{2{C}_{{\rm{conc}}}})=2{C}_{{\rm{conc}}}\mathrm{ln}(t)$ and consider the events

$\begin{eqnarray*}&&{E}_{N}:= \bigcap _{n\geqslant N}\left\{{{ \mathcal S }}_{\sigma }({G}_{{t}_{n}},F({u}^{\dagger }))-{{ \mathcal S }}_{\sigma }({G}_{{t}_{n}},F({u}_{n}))\leqslant \displaystyle \frac{\rho ({t}_{n})}{\sqrt{{t}_{n}}}-{{ \mathcal T }}_{\sigma }({g}^{\dagger };F({u}_{n}))\right\}\end{eqnarray*}$

for $N\in {\mathbb{N}}$ . On E_N we find by (45b) that

$\begin{eqnarray}&&{ \mathcal R }({u}_{n})+\displaystyle \frac{{{ \mathcal T }}_{\sigma }({g}^{\dagger };F({u}_{n}))}{{\alpha }_{n}}\leqslant { \mathcal R }({u}^{\dagger })+\displaystyle \frac{2{C}_{{\rm{conc}}}\mathrm{ln}({t}_{n})}{\sqrt{{t}_{n}}{\alpha }_{n}}\to { \mathcal R }({u}^{\dagger })\end{eqnarray} \tag{ 47 }$

as $n\to \infty$ . By the non-negativity of ${{ \mathcal T }}_{\sigma }$ and assumption (H2) there exists a subsequence ${({u}_{{n}_{k}})}_{k\in {\mathbb{N}}}$ such that ${u}_{{n}_{k}}\overset{{\tau }_{{ \mathcal X }}}{\to }\tilde{u}$ for some $\tilde{u}\in { \mathcal X }$ , and by (H3) $\tilde{u}\in {\mathfrak{B}}$ . Moreover, using (H5) and the lower semicontinuity of ${{ \mathcal T }}_{\sigma }({g}^{\dagger };\cdot )$ , (47) implies together with (45a) that

$\begin{eqnarray*}&&{{ \mathcal T }}_{\sigma }({g}^{\dagger };F(\tilde{u}))\leqslant \underset{n\to \infty }{\mathrm{lim}\;\mathrm{inf}}{{ \mathcal T }}_{\sigma }({g}^{\dagger };F({u}_{n}))\leqslant {\alpha }_{n}\underset{n\to \infty }{\mathrm{lim}\;\mathrm{inf}}[{ \mathcal R }({u}^{\dagger })-{ \mathcal R }({u}_{n})]\leqslant C{\alpha }_{n}\to 0\end{eqnarray*}$

as $n\to \infty$ on any E_N. Since ${{ \mathcal T }}_{\sigma }({g}^{\dagger };g)=0$ if and only if $g={g}^{\dagger }$ it follows from (H3) that $F(\tilde{u})={g}^{\dagger }$ . In view of (47) this means $\tilde{u}={u}^{\dagger }$ . As ${u}^{\dagger }$ is unique, the whole sequence ${u}_{n}$ must converge to $\tilde{u}$ as $n\to \infty$ . In particular, we find especially that $\langle {u}^{* },{u}_{n}-{u}^{\dagger }\rangle \to 0$ as ${u}^{* }\in {{ \mathcal X }}^{* }$ , and thus

$\begin{eqnarray*}&&{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({u}_{n},{u}^{\dagger })={ \mathcal R }({u}_{n})-{ \mathcal R }({u}^{\dagger })-\langle {u}^{* },{u}_{n}-{u}^{\dagger }\rangle \to 0\end{eqnarray*}$

as $n\to \infty$ on any E_N.

Note that ${\bf{P}}[{E}_{N}^{c}]\leqslant {\sum }_{n=N}^{\infty }\mathrm{exp}(-{C}_{{\rm{conc}}}^{-1}\rho ({t}_{n}))\leqslant {\sum }_{n=N}^{\infty }{n}^{-2}\to 0$ as $N\to \infty$ due to (15). Thus we can fix ${N}_{0}\in {\mathbb{N}}$ such that ${{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({u}_{n},{u}^{\dagger })\leqslant \varepsilon$ on any E_N and ${\bf{P}}[{E}_{n}^{c}]\lt \delta$ for all $n\geqslant {N}_{0}$ . This yields

$\begin{eqnarray*}&&{\bf{P}}[{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({u}_{n},{u}^{\dagger })\gt \varepsilon ]=1-{\bf{P}}[{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({u}_{n},{u}^{\dagger })\leqslant \varepsilon ]\leqslant 1-{\bf{P}}[{E}_{n}]={\bf{P}}[{E}_{n}^{c}]\leqslant \delta \end{eqnarray*}$

for all $n\geqslant {N}_{0}$ which contradicts (46). Thus our assumption is wrong, and the assertion follows.■

We also state a corresponding theorem for the case ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ :

Theorem 4.9 (Stochastic regularizing property for ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ ). Suppose that (H1) and (H3)–(H7) in assumption 4.1 hold true with ${ \mathcal X }=\mathrm{BV}({\rm{\Omega }})$ for a bounded Lipschitz domain ${\rm{\Omega }}\subset {{\mathbb{R}}}^{m}$ , with ${\tau }_{{ \mathcal X }}$ the strong ${{\bf{L}}}^{p}$ -topology for some $p\in [1,\tfrac{m}{m-1})$ , ${\tau }_{{ \mathcal Y }}$ the strong ${H}^{s}({\mathbb{M}})$ -topology with $s\gt d/2$ , ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ , and a linear operator F satisfying (39). Then (34) with $\sigma \gt 0$ equipped with a parameter choice $\bar{\alpha }\;:(0,\infty )\times {\mathfrak{M}}({\mathbb{M}})\to (0,\infty )$ fulfilling (45a) and (45b) defines a statistical regularization scheme for problem (1) under Poisson data w.r.t. ${d}_{{ \mathcal X }}$ .

Proof. (Sketch) The proof is similar to that of theorem 4.8. After showing boundedness of ${{ \mathcal R }}_{{\rm{TV}}}({u}_{n})$ as in (47) we additionally have to show boundedness of the means $\bar{{u}_{n}}$ to use pre-compactness of level sets of ${{ \mathcal R }}_{{\rm{BV}}}$ . This can be done by the same argument as in the proof of proposition 4.3 using that ${\mathrm{sup}}_{n}{G}_{{t}_{n}}({\mathbb{M}})\lt \infty$ a.s.. This follows from the strong law of large numbers, as ${\mathrm{lim}}_{n\to \infty }{G}_{{t}_{n}}({\mathbb{M}})\to {\int }_{{\mathbb{M}}}{g}^{\dagger }{\rm{d}}x$ a.s., and consequently the sequence ${({G}_{{t}_{n}}({\mathbb{M}}))}_{n\in {\mathbb{N}}}$ is a.s. bounded. ■

4.3. A priori rates of convergence

In theorem 4.8 we have seen that Tikhonov-type regularization (34) defines a statistical regularization scheme w.r.t. the Bregman distance, i.e. the Bregman distance ${{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({\hat{u}}_{\alpha },{u}^{\dagger })$ converges to 0 in probability as $t\to \infty$ . But as usual for ill-posed problems, this convergence can be arbitrarily slow (see [59, proposition 3.11] for a general result). To specify the speed of convergence we need additional assumptions on the exact solution ${u}^{\dagger }$ . Then we will be able to prove rates of convergence even in expectation.

As smoothness assumption we will use the concept of source conditions in the form of variational inequalities. These conditions were introduced by Hofmann et al [82] and have become a widely used tool in inverse problems in Banach spaces [25, 35, 64, 67, 77, 83, 88, 92, 155].

For detailed information about the cross-connection between different smoothness conditions in Hilbert spaces we refer to Flemming [65, 66].

Assumption 4.10 (Variational source condition). There exists ${u}^{* }\in \partial { \mathcal R }({u}^{\dagger })$ , $\beta \gt 0$ and a function φ such that

$\begin{eqnarray}&&\beta {{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}(u,{u}^{\dagger })\leqslant { \mathcal R }(u)-{ \mathcal R }({u}^{\dagger })+\varphi ({{ \mathcal T }}_{\sigma }({g}^{\dagger };F(u)))\end{eqnarray} \tag{ 48 }$

for all $u\in {\mathfrak{B}}$ . Moreover, the function $\varphi \;:[0,\infty )\to [0,\infty )$ has the following properties:

$\varphi (0)=0$ ,
φ is monotonically increasing and
φ is concave.

Note that the assumptions on φ already imply continuity, which means that φ is a so-called index function. If ${{ \mathcal T }}_{\sigma }$ is replaced by the ${{\bf{L}}}^{2}$ -distance between $F(u)$ and ${g}^{\dagger }$ , then (48) in the benchmark case $\varphi (\tau )=c\cdot \tau$ coincides with the original assumption in [82]. Afterwards, the analysis has been generalized to the power case $\varphi (\tau )=c\cdot {\tau }^{\kappa }$ with $\kappa \in (0,1)$ in [83] and to general concave φ in [25, 77]. The connection to spectral source conditions in the Hilbert space case is described in [133] for the benchmark case, in [83] for the power case and by Flemming in [65, 66] for the general case including the previous ones.

Variational source conditions are related to the well-studied field of stability estimates in inverse problems: in fact if a variational source condition with fixed index function φ and ${ \mathcal R }$ and ${ \mathcal T }$ given by squared Hilbert space norms holds true for all ${u}^{\dagger }$ in some smoothness class ${ \mathcal K }\subset { \mathcal X }$ , then it is easy to see that a stability estimate

$\begin{eqnarray*}&&\parallel {u}_{1}-{u}_{2}{\parallel }_{{ \mathcal X }}^{2}\leqslant \varphi ({\parallel F({u}_{1})-F({u}_{2})\parallel }_{{ \mathcal Y }}^{2})\end{eqnarray*}$

holds true for all ${u}_{1},{u}_{2}\in { \mathcal K }$ . Unfortunately, the reverse implication is not obvious. However, recently it has been shown [87] that stability estimates for an inverse medium scattering problem based on complex geometrical optics solutions can be sharpened to variational source conditions. Here the index function is of the form $\varphi (\tau )=A\mathrm{ln}{(3+{\tau }^{-1})}^{-p(s)}$ for all ${u}^{\dagger }$ in a Sobolev ball with index s, some constant $A\gt 0$ and an explicit function p(s). At the moment it is unclear to which extent other stability estimates can be sharpened to variational source conditions.

Classical spectral source conditions for linear operators T in Hilbert spaces

$\begin{eqnarray}&&{u}^{\dagger }=\psi ({T}^{* }T)\omega \end{eqnarray} \tag{ 49 }$

imply the variational source condition in assumption 4.10 with $F=T$ , ${ \mathcal R }(u)={\parallel u\parallel }_{{ \mathcal X }}^{2}$ , and ${{ \mathcal T }}_{\sigma }({g}^{\dagger };g)={\parallel g-{g}^{\dagger }\parallel }_{{ \mathcal Y }}^{2}$ for some index function $\varphi ={\varphi }_{\psi }$ [65, 66]. More specifically, if $\psi (\tau )={\tau }^{\nu }$ for some $\nu \in (0,1/2]$ , then $\varphi (\tau )=\tilde{\beta }{\tau }^{\tfrac{2\nu }{2\nu +1}}$ with a constant $\tilde{\beta }\gt 0$ depending on $\parallel \omega \parallel$ [83, proposition 6.6], and if $\psi (\tau )={(-\mathrm{ln}(\tau ))}^{-p}$ for some $p\gt 0$ , then $\varphi (\tau )=\tilde{\beta }{(-\mathrm{ln}(\tau ))}^{-2p}$ [66].

Using lemma 2.6 and (H7) in assumption 4.1 with ${\tau }_{{ \mathcal Y }}$ the ${{\bf{L}}}^{\infty }$ norm topology, we see that the variational source conditions (48) with ${{ \mathcal T }}_{\sigma }({g}^{\dagger };F(u))$ replaced by ${\parallel F(u)-F({u}^{\dagger })\parallel }_{{{\bf{L}}}^{2}({\mathbb{M}})}^{2}$ implies (48). However, note that (48) might be weaker than the corresponding variational source condition with ${{ \mathcal T }}_{\sigma }$ replaced by a squared ${{\bf{L}}}^{2}$ distance as ${{ \mathcal T }}_{\sigma }({g}^{\dagger };F(u))$ might be $\infty$ . Thus we might expect (48) to hold with an asymptotically better index function φ than ${\varphi }_{\psi }$ .

In principle we could also replace the Bregman distance in (48) by some other loss function ${d}_{{ \mathcal X }}$ and would then obtain convergence rates w.r.t. ${d}_{{ \mathcal X }}$ . This is for example useful in regularization with sparsity constraints where ${ \mathcal R }$ is the ℓ¹-norm of the coefficients w.r.t. some basis. Unfortunately the Bregman-distance is then not an appropriate error measure (see e.g. [104]), but it has been shown recently [35] that a variational source condition w.r.t. the ℓ¹-norm distance between $u$ and ${u}^{\dagger }$ can be fulfilled, which then implies convergence rates in the ℓ¹-norm. Nevertheless we decided to focus on the Bregman distance here to avoid further notational difficulties.

It has first been shown by Grasmair [77] that the approximation error of Tikhonov-type regularization (34) under the variational source condition (48) is bounded by the Fenchel conjugate of $(-\varphi )$ (see [58] for details on Fenchel duality). Therefore we introduce the abbreviation

$\begin{eqnarray}&&\psi (s):= {(-\varphi )}^{* }\left(-\displaystyle \frac{1}{s}\right)=\underset{\tau \geqslant 0}{\mathrm{sup}}\left[\varphi (\tau )-\displaystyle \frac{\tau }{s}\right],\qquad s\gt 0.\end{eqnarray} \tag{ 50 }$

Now we are in a position to prove rates of convergence for Tikhonov-type regularization (34) under an a priori parameter choice.

Theorem 4.11 (A priori rates of convergence). If (H6) and (H7) in assumption 4.1 with ${\tau }_{{ \mathcal Y }}$ the H^s-norm topology for $s\gt d/2$ and assumption 4.10 are satisfied and if there exists a global minimizer ${\hat{u}}_{\alpha }$ in (34), then the following holds true:

(i)
For all $\alpha \gt 0$ we have
$\begin{eqnarray}&&\beta {{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({\hat{u}}_{\alpha },{u}^{\dagger })\leqslant \displaystyle \frac{{\bf{err}}(\{F({\hat{u}}_{\alpha })\})}{\alpha }+\psi (\alpha ).\end{eqnarray} \tag{ 51 }$
(ii)
If $\alpha =\alpha (t)$ is chosen such that
$\begin{eqnarray}&&\displaystyle \frac{1}{\alpha }\in -\partial (-\varphi )\left(\displaystyle \frac{1}{\sqrt{t}}\right)\qquad {\rm{as}}\qquad t\to \infty ,\end{eqnarray} \tag{ 52 }$
we have the convergence rate
$\begin{eqnarray}&&{\bf{E}}[{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({\hat{u}}_{\alpha },{u}^{\dagger })]={ \mathcal O }\left(\varphi \left(\displaystyle \frac{1}{\sqrt{t}}\right)\right)\qquad {\rm{as}}\qquad t\to \infty .\end{eqnarray} \tag{ 53a }$

Proof. (i) By definition the minimizer ${\hat{u}}_{\alpha }$ satisfies

$\begin{eqnarray*}&&{{ \mathcal S }}_{\sigma }({G}_{t};F({\hat{u}}_{\alpha }))+\alpha { \mathcal R }({\hat{u}}_{\alpha })\leqslant {{ \mathcal S }}_{\sigma }({G}_{t};{g}^{\dagger })+\alpha { \mathcal R }({u}^{\dagger }).\end{eqnarray*}$

With the help of (11) this implies

$\begin{eqnarray*}{ \mathcal R }({\hat{u}}_{\alpha })-{ \mathcal R }({u}^{\dagger }) & \leqslant & \displaystyle \frac{1}{\alpha }[{{ \mathcal S }}_{\sigma }({G}_{t};{g}^{\dagger })-{{ \mathcal S }}_{\sigma }({G}_{t};F({\hat{u}}_{\alpha }))]\\ & \leqslant & \displaystyle \frac{1}{\alpha }[{\bf{err}}(\{F({\hat{u}}_{\alpha })\})-{{ \mathcal T }}_{\sigma }({g}^{\dagger };F({\hat{u}}_{\alpha }))].\end{eqnarray*}$

Inserting this into the variational source condition (48) yields

$\begin{eqnarray*}\beta {{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({\hat{u}}_{\alpha },{u}^{\dagger }) & \leqslant & \displaystyle \frac{{\bf{err}}(\{F({\hat{u}}_{\alpha })\})}{\alpha }-\displaystyle \frac{1}{\alpha }{{ \mathcal T }}_{\sigma }({g}^{\dagger };F({\hat{u}}_{\alpha }))+\varphi ({{ \mathcal T }}_{\sigma }({g}^{\dagger };F({\hat{u}}_{\alpha })))\\ & \leqslant & \displaystyle \frac{{\bf{err}}(\{F({\hat{u}}_{\alpha })\})}{\alpha }-\displaystyle \frac{\lambda }{\alpha }{{ \mathcal T }}_{\sigma }({g}^{\dagger };F({\hat{u}}_{\alpha }))+\underset{\tau \geqslant 0}{\mathrm{sup}}\left[\varphi (\tau )-\displaystyle \frac{\tau (1-\lambda )}{\alpha }\right]\\ & = & \displaystyle \frac{{\bf{err}}(\{F({\hat{u}}_{\alpha })\})}{\alpha }-\displaystyle \frac{\lambda }{\alpha }{{ \mathcal T }}_{\sigma }({g}^{\dagger };F({\hat{u}}_{\alpha }))+\psi \left(\displaystyle \frac{\alpha }{1-\lambda }\right)\end{eqnarray*}$

for any $\lambda \in [0,1)$ where we substituted $\tau ={{ \mathcal T }}_{\sigma }({g}^{\dagger };F({\hat{u}}_{\alpha }))$ and used ${{ \mathcal T }}_{\sigma }\geqslant 0$ . Now (51) follows from $\lambda =0$ and $\lambda =1/2$ respectively and the non-negativity of the Bregman distance.

(ii) Consider the events ${E}_{\rho }:= \left\{{\bf{err}}({\mathfrak{G}})\leqslant \tfrac{\rho }{\sqrt{t}}\right\}$ , $\rho \geqslant 1$ . By (H7) we have $F({\hat{u}}_{\alpha })\in {\mathfrak{G}}$ and hence ${\bf{err}}(\{F({\hat{u}}_{\alpha })\})\leqslant \rho /\sqrt{t}$ on E_ρ. Thus the error decomposition (51) implies

$\begin{eqnarray}&&\underset{{E}_{\rho }}{\mathrm{max}}{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({\hat{u}}_{\alpha },{u}^{\dagger })\leqslant \displaystyle \frac{\rho }{\beta }\left[\displaystyle \frac{1}{\sqrt{t}\alpha }+\psi (\alpha )\right].\end{eqnarray} \tag{ 54 }$

It follows from the definition of the Fenchel conjugate and from ${(-\varphi )}^{* * }=-\varphi$ (see [58]) that the infimum of the right-hand side is given by

$\begin{eqnarray}&&\underset{\alpha \gt 0}{\mathrm{inf}}\left[\displaystyle \frac{1}{\sqrt{t}\alpha }+\psi (\alpha )\right]=-\underset{s\lt 0}{\mathrm{sup}}\left[\displaystyle \frac{s}{\sqrt{t}}-{(-\varphi )}^{* }(s)\right]=\varphi \left(\displaystyle \frac{1}{\sqrt{t}}\right),\end{eqnarray} \tag{ 55 }$

and moreover from Young's inequality (see [58, proposition 4.1]) that the infimum is attained at α as given in (52). Hence by (54) and (55) we have

$\begin{eqnarray*}&&\underset{{E}_{\rho }}{\mathrm{max}}{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({\hat{u}}_{\alpha },{u}^{\dagger })\leqslant \displaystyle \frac{\rho }{\beta }\left[\displaystyle \frac{1}{\sqrt{t}\alpha }+\psi (\alpha )\right]\leqslant \displaystyle \frac{\rho }{\beta }\varphi \left(\displaystyle \frac{1}{\sqrt{t}}\right).\end{eqnarray*}$

From the concentration inequality (15) it follows that ${\bf{P}}[{E}_{\rho }]\geqslant 1-\mathrm{exp}(-{C}_{{\rm{conc}}}^{-1}\rho )$ with the constant ${C}_{{\rm{conc}}}$ specified in theorem 2.9 whenever t and ρ are sufficiently large. Thus with ${\rho }_{k}:= {C}_{{\rm{conc}}}k$ , $k\in {\mathbb{N}}$ we obtain with ${E}_{0}:= \varnothing$ that

$\begin{eqnarray}{\bf{E}}[{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({\hat{u}}_{\alpha },{u}^{\dagger })] & = & \displaystyle \sum _{k=1}^{\infty }{\bf{P}}[{E}_{k}\setminus {E}_{k-1}]{\bf{E}}[{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({\hat{u}}_{\alpha },{u}^{\dagger })\;| \;{E}_{k}\setminus {E}_{k-1}]\\ & \leqslant & {\bf{P}}[{E}_{1}]\underset{{E}_{1}}{\mathrm{max}}{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({\hat{u}}_{\alpha },{u}^{\dagger })+\displaystyle \sum _{k=2}^{\infty }{\bf{P}}[{E}_{k-1}^{c}]\underset{{E}_{k}}{\mathrm{max}}{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({\hat{u}}_{\alpha },{u}^{\dagger })\\ & \leqslant & \displaystyle \frac{{C}_{{\rm{conc}}}}{\beta }\left({\bf{P}}[{E}_{1}]+\displaystyle \sum _{k=2}^{\infty }k\mathrm{exp}(-k)\right)\varphi \left(\displaystyle \frac{1}{\sqrt{t}}\right),\end{eqnarray} \tag{ 56 }$

which proves (53a) as the sum converges.■

Besides the result above for the case of Poisson data, which has first been proven in [153, 155], methods of the form (34) have also been studied under deterministic noise assumptions. Resmerita and Anderssen [122] show rates of convergence for linear operators K under the benchmark source condition ${K}^{* }\omega \in \partial { \mathcal R }({u}^{\dagger })$ for ${ \mathcal R }={{ \mathcal R }}_{{\rm{ME}}}$ , ${d}_{{ \mathcal X }}({u}_{1},{u}_{2})={\parallel {u}_{1}-{u}_{2}\parallel }_{{{\bf{L}}}^{1}}$ , and ${d}_{{ \mathcal Y }}({g}^{{\rm{obs}}},{g}^{\dagger })={\mathbb{KL}}({g}^{{\rm{obs}}};{g}^{\dagger })$ . Benning and Burger [19] also show convergence rates for the benchmark source condition, but for general convex ${ \mathcal R }$ , ${d}_{{ \mathcal X }}({u}_{1},{u}_{2})={{ \mathcal D }}_{{ \mathcal R }}(u,{u}^{\dagger })+{{ \mathcal D }}_{{ \mathcal R }}({u}^{\dagger },u)$ and ${d}_{{ \mathcal Y }}({g}_{1},{g}_{2})={\parallel {g}_{1}-{g}_{2}\parallel }_{{{\bf{L}}}^{2}({\mathbb{M}})}$ . Flemming [64, 66] establishes rates of convergence under a variational source condition and for ${d}_{{ \mathcal Y }}({g}_{1},{g}_{2})={\parallel {g}_{1}-{g}_{2}\parallel }_{{{\bf{L}}}^{1}({\mathbb{M}})}$ . He assumes a stronger variational source condition than (48) to overcome the lack of a triangle-type inequality.

4.4. A posteriori parameter choice rules

In the previous subsection we have derived rates of convergence as the exposure time t tends to $\infty$ . To obtain these rates of convergence the reguarlization parameter α had to be chosen in a way requiring knowledge of the index function φ in (48) characterizing (abstract) smoothness of the unknown solution ${u}^{\dagger }$ (see (52)). As such knowledge is not realistic, we will now discuss a posteriori parameter choice rules, which do not require knowledge of φ.

A widely used parameter choice rule in inverse problems is the discrepancy principle where α is chosen such that the residuals ${{ \mathcal T }}_{\sigma }({g}^{\dagger };F({\hat{u}}_{\alpha }))$ are of the same order as the noise, e.g. in our specific setup we would require

$\begin{eqnarray}&&{{ \mathcal T }}_{\sigma }({g}^{\dagger };F({\hat{u}}_{\alpha }))\leqslant \displaystyle \frac{\tau }{\sqrt{t}}\end{eqnarray} \tag{ 57 }$

for a given tuning parameter $\tau \geqslant 1$ . Unfortunately, ${g}^{\dagger }$ is unknown and hence ${{ \mathcal T }}_{\sigma }({g}^{\dagger };F({\hat{u}}_{\alpha }))$ cannot be evaluated. The usual solution is to replace ${g}^{\dagger }$ by the measurements G_t and require the residuals w.r.t. G_t to be of the same order as the noise level. We point out that this procedure does not apply in our situation, as the only evaluable quantity is ${{ \mathcal S }}_{\sigma }({G}_{t};F({\hat{u}}_{\alpha }))$ , which does not need to be small if $F({\hat{u}}_{\alpha })$ is close to ${g}^{\dagger }$ as ${\bf{E}}[{{ \mathcal S }}_{\sigma }({G}_{t};{g}^{\dagger })]\ne 0$ in general. Hence, the common discrepancy principle (57) cannot be applied.

Nevertheless different discrepancy-like principles have been proposed in the literature [10, 22] for the discretized setting of denoising and deblurring applications. Bertero et al [22] propose to choose α in (34) such that

$\begin{eqnarray}&&{\mathbb{KL}}({{\bf{S}}}_{J}\tilde{{G}_{t}};{{\bf{S}}}_{J}F({\hat{u}}_{\alpha }))=\displaystyle \frac{J}{2},\end{eqnarray} \tag{ 58 }$

where ${{\bf{S}}}_{J}\;:{ \mathcal Y }\to {{\mathbb{R}}}^{J}$ is the binning operator introduced in section 2.5. This is motivated by the fact that for a Poisson distributed random variable X_t with mean $t\lambda$ the expectation ${\bf{E}}\left[{X}_{t}\mathrm{ln}\left(\tfrac{{X}_{t}}{t\lambda }\right)+t\lambda -{X}_{t}\right]$ tends to $\tfrac{1}{2}$ as $t\to \infty$ (see [159, lemma 1]). To ensure (58) to be well-defined, uniqueness of ${\hat{u}}_{\alpha }$ is an important issue, see proposition 4.2. The numerical implementation of such a parameter choice rule and a related rule defined by a constrained minimization problems is discussed in [160]. A similar criterion has been treated in [10] where the data fidelity term ${\mathbb{KL}}$ in (58) has been replaced by a second order Taylor expansion, which leads to a weighted least-squares discrepancy. Moreover, the authors in [10] also investigate generalized cross-validation (which basically consists in reconstructing ${\hat{u}}_{\alpha }^{k}$ by leaving the kth data point out and then minimizing the weighted least-squares discrepancy between ${{\bf{S}}}_{J}{G}_{t}$ and ${{\bf{S}}}_{J}F({\hat{u}}_{\alpha }^{k})$ overα). Furthermore, an unbiased predictive risk estimate has been proposed in [10]. Santos and De Pierro [131] propose to choose α as the minimizer of $\tfrac{1}{J}{\bf{E}}[{\mathbb{KL}}({{\bf{S}}}_{J}{g}^{\dagger };{{\bf{S}}}_{J}F({\hat{u}}_{\alpha }))]$ , which is obviously impossible as ${g}^{\dagger }$ is unknown. Therefore, they derive some stochastically justified approximation of this term which is then used for the minimization. Finally Chen and Cheng [44] investigate a spatially adapted choice of λ for the case of deblurring and ${ \mathcal R }$ given by the total variation seminorm (36) which is similarly based on (58).

All the parameter choice rules described above lack two basic properties: they do not have an obvious extension to our continuous setup involving a Poisson process (basically because (58) is not meaningful then) and no rates of convergence are known for these strategies. In the following we will describe a parameter choice rule for the continuous setup which still yields convergence rates and can also be implemented quite easily. It is based on an idea of Lepskiĭ [100] for regression problems and has been adapted to inverse problems under the name balancing principle (see [76, 111] and references therein). Suppose in the following that ${ \mathcal R }$ is chosen such that its Bregman distance obeys an inequality of the form (43) and that β in (48) is in $\left[\tfrac{1}{2},\infty \right)$ . Recall the error decomposition (51) and note that an upper bound for the function $(\alpha ,t)\mapsto \tfrac{{\bf{err}}(\{F({\hat{u}}_{\alpha })\})}{\alpha }$ is basically known as ${\bf{err}}(F({\mathfrak{B}}))\leqslant \tfrac{\rho }{\sqrt{t}}$ by (15) with overwhelming probability. Consider the discrete set of regularization parameters

$\begin{eqnarray}&&{\alpha }_{1}=\displaystyle \frac{\tau \mathrm{ln}(t)}{\sqrt{t}},\qquad {\alpha }_{j+1}=r{\alpha }_{j},\qquad j\in {\mathbb{N}}\end{eqnarray} \tag{ 59a }$

with some number $r\gt 1$ and a sufficiently large tuning parameter τ. Furthermore it suffices to consider the first m regularization parameters with $m=\mathrm{min}\{j\in {\mathbb{N}}\;| \;{\alpha }_{j}\geqslant 1\}$ . The Lepskiĭ-type balancing principle consists in selecting the regularization parameter ${\alpha }_{{\rm{bal}}}={\alpha }_{{j}_{{\rm{bal}}}}$ defined by

$\begin{eqnarray}&&{j}_{{\rm{bal}}}:= \mathrm{min}\left\{j\in \{1,\ldots ,m\}\;| \;{\parallel {\hat{u}}_{{\alpha }_{i}}-{\hat{u}}_{{\alpha }_{j}}\parallel }_{{ \mathcal X }}\leqslant 4{(4{C}_{{\rm{bd}}})}^{\tfrac{1}{q}}{r}^{\tfrac{1-i}{q}}{\rm{for}}\;{\rm{all}}\;i\lt j\right\}.\end{eqnarray} \tag{ 59b }$

The definition of ${\alpha }_{{\rm{bal}}}$ requires some explanations. First recall that the constants ${C}_{{\rm{bd}}}$ and q are determined by the choice of ${ \mathcal R }$ , see (43). In contrast, the parameters τ and r determining the sampling of possible regularization parameters in (59a) can be freely chosen by the user. Once ${C}_{{\rm{bd}}},q,r,\tau$ are computed or chosen respectively, the implementation of the Lepskiĭ-type balancing principle simply requires the computation of the regularized solutions ${\hat{u}}_{{\alpha }_{i}j}$ for $1\leqslant j\leqslant m$ and ${j}_{{\rm{bal}}}$ by (59b). With this construction, it turns out that the so-called oracle inequality (see [155, equation (31)])

$\begin{eqnarray}&&{\parallel {\hat{u}}_{{\alpha }_{{\rm{bal}}}}-{u}^{\dagger }\parallel }_{{ \mathcal X }}\leqslant 4r{12}^{q}{C}_{{\rm{bd}}}\underset{j=1,\ldots ,m}{\mathrm{min}}\left[\displaystyle \frac{\tau \mathrm{ln}(t)}{\sqrt{t}{\alpha }_{j}}+\psi ({\alpha }_{j})\right]\end{eqnarray} \tag{ 60 }$

holds true on the event ${\bf{err}}(\{F(u)\quad | \quad u\in {\mathfrak{B}}\})\leqslant \tfrac{\tau \mathrm{ln}(t)}{\sqrt{t}}$ , and the complementary event has probability $\leqslant {(1/t)}^{\tau /{C}_{{\rm{conc}}}}$ , see (15). Here the influence of the user-chosen parameters τ and r becomes clear: the finer the sampling (i.e. the closer r is to 1), the sharper the oracle inequality, and τ controls the probability of the complementary event. Consequently, the choice of r requires a trade-off between computation time and accuracy, and τ should be sufficiently large to ensure a fast decay of the probability of the complementary event.

To ensure that the discrete minimum of the right-hand side of (60) is of the same order as the infimum of $\tau \mathrm{ln}(t){t}^{-1/2}{\alpha }^{-1}+\psi (\alpha )$ over all $\alpha \gt 0$ , we need to impose an additional condition on the index function φ. To see this, consider the case of exact penalization, where $\varphi (\tau )=c\tau$ , $c\gt 0$ and ψ as in (50) is given by an indicator function $\psi (s)={\chi }_{[0,c]}(s)$ which is 0 on $[0,c]$ and $\infty$ elsewhere. Then the sampling (59a) is not appropriate to ensure at least one regularization parameter ${\alpha }_{j}\in [0,c]$ . But if furthermore ${\varphi }^{1+\varepsilon }$ is concave with $\varepsilon \gt 0$ , one readily obtains

$\begin{eqnarray*}&&\psi ({Cs})\leqslant \mathrm{max}\{1,{C}^{\tfrac{1}{\varepsilon }}\}\psi (s)\end{eqnarray*}$

for all $C,s\gt 0$ . Thus the sampling (59a) will contain the optimal regularization parameter up to some constant. These considerations lead to the following convergence rate result:

Theorem 4.12 (See [155, theorem 5.1]). If (H6) and (H7) in assumption 4.1 with ${\tau }_{{ \mathcal Y }}$ the ${H}^{s}$ -norm topology for $s\gt d/2$ and assumption 4.10 hold true such that $\beta \in \left[\tfrac{1}{2},\infty \right)$ and ${\varphi }^{1+\varepsilon }$ is concave for $\varepsilon \gt 0$ . Moreover let ${ \mathcal R }$ be chosen such that (43) is fulfilled and let ${\mathfrak{B}}$ be bounded. Then the a posteriori parameter choice (59b) with $\tau \geqslant \tfrac{1}{2}{C}_{{\rm{conc}}}$ with ${C}_{{\rm{conc}}}$ as in theorem 2.9 leads to the convergence rate

$\begin{eqnarray}&&{\bf{E}}[{\parallel {\hat{u}}_{{\alpha }_{{\rm{bal}}}}-{u}^{\dagger }\parallel }_{{ \mathcal X }}^{q}]={ \mathcal O }\left(\varphi \left(\displaystyle \frac{\mathrm{ln}(t)}{\sqrt{t}}\right)\right)\qquad {\rm{as}}\qquad t\to \infty .\end{eqnarray} \tag{ 61 }$

Note that the constant ${C}_{{\rm{conc}}}$ can in principle be calculated if ${g}^{\dagger }$ is properly scaled (e.g. if ${\parallel {g}^{\dagger }\parallel }_{{{\bf{L}}}^{1}({\mathbb{M}})}=1$ ). Comparing the convergence rates in theorems 4.11 and 4.12, we find that we have to pay a logarithmic factor for adaptivity, which is known to be inevitable in some, but not all cases (see [147]).

The performance of the Lepskiĭ-type balancing principle in the context of Newton iterations has been studied in [153] for phase retrieval and phaseless inverse scattering problems.

4.5. Newton-type methods for nonlinear operators

Although the results on the variational regularization (34) in the previous subsections also hold true for nonlinear operators $F$ provided assumptions 4.1 and 4.10 are fulfilled, the functional $u\mapsto \tfrac{1}{\alpha }{{ \mathcal S }}_{\sigma }({G}_{t};F(u))+{ \mathcal R }(u)$ is not convex in general for nonlinear $F$ . Therefore, a global minimizer ${\hat{u}}_{\alpha }$ , which exists by proposition 4.2, may be difficult to compute due to local minima.

Hence, we will now discuss Newton-type method as a frequently used alternative. If the operator $F$ has a derivative $F^{\prime} [{u}_{k}]$ at the current iterate ${u}_{k}$ , we may replace $F(u)$ by its linearization $F({u}_{k})+F^{\prime} [{u}_{k}](u-{u}_{k})$ in the Tikhonov functional (34). This yields the iteration

$\begin{eqnarray}&&{u}_{k+1}\in \underset{u\in {\mathfrak{B}}}{\mathrm{argmin}}\left[\displaystyle \frac{1}{{\alpha }_{k}}{{ \mathcal S }}_{\sigma }({G}_{t};F({u}_{k})+F^{\prime} [{u}_{k}](u-{u}_{k}))+{ \mathcal R }(u)\right].\end{eqnarray} \tag{ 62a }$

The regularization parameters ${\alpha }_{k}$ are chosen such that

$\begin{eqnarray}&&{\alpha }_{0}\leqslant 1,\qquad {\alpha }_{k}\searrow 0,\qquad 1\leqslant \displaystyle \frac{{\alpha }_{k}}{{\alpha }_{k+1}}\leqslant {C}_{{\rm{dec}}}\quad \mathrm{for}\ \mathrm{all}\quad k\in {\mathbb{N}}.\end{eqnarray} \tag{ 62b }$

This method can be seen as a generalization of the well-known iteratively regularized Gauss–Newton method [6, 24, 84, 92] incorporating the non-Gaussian behavior of the noise. Thus it is called an iteratively regularized Newton-type method. Note that each of the iteration steps (62a) consists of a convex minimization problem of Tikhonov-type form (34) which can be solved by one of the algorithms discussed in section 6. This is a substantial advantage compared to (34). As the method is of Newton-type, we expect a fast convergence behavior and thus only justifiable additional computational costs.

Due to the linearization it cannot be guaranteed immediately that the functions $F({u}_{k})+F^{\prime} [{u}_{k}](u-{u}_{k})$ are non-negative. As we are only able to control the effective noise level ${\bf{err}}({\mathfrak{G}})$ on sets for which $g+\sigma \geqslant a\gt 0$ with fixed a independent of g, this causes difficulties. We overcome those by replacing the threshold in (10) and the corresponding ${{ \mathcal T }}_{\sigma }$ by $g\geqslant -\sigma /2$ , which ensures the above condition with $a=\sigma /2$ whenever ${{ \mathcal T }}_{\sigma }({g}^{\dagger };g)\lt \infty$ .

The main difference in the convergence analysis of (34) and (62a) is that for the Newton-type method we additionally need to control the nonlinearity error. As we seek for a convergence analysis under general source conditions, i.e. general concave φ in (48), a strong nonlinearity condition is required. The most commonly used assumption for iterative methods under such source conditions is the tangential cone condition, i.e.

$\begin{eqnarray}&&{\parallel F(u)-F(\tilde{u})-F^{\prime} [u](\tilde{u}-u)\parallel }_{}\leqslant \eta {\parallel F(u)-F(\tilde{u})\parallel }_{}\end{eqnarray} \tag{ 63 }$

for all $u,\tilde{u}\in {\mathfrak{B}}$ with $\eta \gt 0$ sufficiently small, see e.g. [78]. Here ${\parallel \cdot \parallel }_{}$ denotes the norm used as data fidelity. Conditions of this form are known to be fulfilled for several parameter identification problems in PDEs with distributed measurements, see e.g. [53, 92]. Nevertheless, for many important applications in practice it is currently unknown if (63) holds true or not. For an analysis based on a weaker nonlinearity condition provided the variational inequality is strong enough we refer to [154].

As our method does not rely on a norm as data fidelity term, condition (63) is generalized as follows:

Assumption 4.13 (Generalized tangential cone condition). Suppose there exists a constant ${C}_{{\rm{tcc}}}\geqslant 1$ and $\eta \gt 0$ sufficiently small such that

$\begin{eqnarray}\displaystyle \frac{1}{{C}_{{\rm{tcc}}}}{{ \mathcal T }}_{\sigma }({g}^{\dagger };F(\tilde{u}))-\eta {{ \mathcal T }}_{\sigma }({g}^{\dagger };F(u)) & \leqslant & {{ \mathcal T }}_{\sigma }({g}^{\dagger };F(u)+F^{\prime} [u](\tilde{u}-u))\\ & \leqslant & {C}_{{\rm{tcc}}}{{ \mathcal T }}_{\sigma }({g}^{\dagger };F(\tilde{u}))+\eta {{ \mathcal T }}_{\sigma }({g}^{\dagger };F(u))\end{eqnarray} \tag{ 64 }$

for all $u,\tilde{u}\in {\mathfrak{B}}$ .

By lemma 2.6 the condition (64) relates to (63) if the functions $F(\tilde{u})$ , $F(u)$ and $F(u)+F^{\prime} [u](\tilde{u}-u)$ satisfy the required integrability and boundedness conditions. Furthermore, if ${ \mathcal T }$ was given by a power of a norm, (64) reduces to (63) as shown in [88].

Now the following theorem regarding the rates of convergence of (62a) can readily be proven by combining the techniques from [88, 155]:

Theorem 4.14. Suppose assumptions 4.1, 4.10 and 4.13 hold true with ${\tau }_{{ \mathcal Y }}$ the ${H}^{s}({\mathbb{M}})$ norm topology for $s\gt d/2$ . If η and the initial residual ${{ \mathcal T }}_{\sigma }({g}^{\dagger };F({u}_{0}))$ are sufficiently small, the sequence ${({\alpha }_{k})}_{k\in {\mathbb{N}}}$ satisfies (62b) and the iteration is stopped at

$\begin{eqnarray}&&{k}_{* }=\mathrm{min}\left\{k\in {\mathbb{N}}\;| \;\displaystyle \frac{1}{{\alpha }_{k}}\geqslant -\mathrm{inf}\partial (-\varphi )\left(\displaystyle \frac{1}{\sqrt{t}}\right)\right\},\end{eqnarray} \tag{ 65 }$

then we obtain the convergence rate

$\begin{eqnarray*}&&{\bf{E}}[{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({u}_{{k}_{* }},{u}^{\dagger })]={ \mathcal O }\left(\varphi \left(\displaystyle \frac{1}{\sqrt{t}}\right)\right)\qquad {\rm{as}}\qquad t\to \infty .\end{eqnarray*}$

Similar to theorem 4.12 we can also replace (65) by a Lepskiĭ-type a posteriori stopping rule if (43) holds true and we obtain the same convergence rate up to a logarithmic factor.

So far we have not discussed if (62a) defines a statistical regularization scheme in some sense. Local convergence in expectation

$\begin{eqnarray}&&{\bf{E}}[{{ \mathcal D }}_{{ \mathcal R }}^{{u}^{* }}({u}_{{k}_{* }},{u}^{\dagger })]\to 0,\qquad t\to \infty \end{eqnarray} \tag{ 66 }$

under the nonlinearity condition 4.13 would follow immediately from the above convergence result with k_* chosen by the Lepskiĭ-type stopping rule if we knew that every ${u}^{\dagger }$ satisfies a variational source condition as in assumption 4.10 for some index function φ. Such a result has been shown by Mathé and Hofmann [112] for spectral source conditions in Hilbert spaces. However, so far it is unknown if or under which conditions a similar result holds true for variational source conditions in Banach spaces with general ${ \mathcal R }$ and ${ \mathcal T }$ .

However, we may can use the result by Mathé and Hofmann [112] if we confine ourselves to penalties ${ \mathcal R }(u)={\parallel u\parallel }_{{ \mathcal X }}^{2}$ with a Hilbert norm ${ \mathcal X }$ . It yields a spectral source condition (49) with $T=F^{\prime} [{u}^{\dagger }]$ , which implies the variational source condition

$\begin{eqnarray*}&&\beta \parallel u-{u}^{\dagger }{\parallel }_{{ \mathcal X }}^{2}\leqslant { \mathcal R }(u)-{ \mathcal R }({u}^{\dagger })+{\varphi }_{\psi }(\parallel F^{\prime} [{u}^{\dagger }](u-{u}^{\dagger }){\parallel }_{{{\bf{L}}}^{2}({\mathbb{M}})}^{2}).\end{eqnarray*}$

for all $u$ (see [65, 66]). Using the tangential cone condition (63) we get ${\parallel F^{\prime} [{u}^{\dagger }](u-{u}^{\dagger })\parallel }_{{{\bf{L}}}^{2}({\mathbb{M}})}\leqslant (1+\eta ){\parallel F(u)-F({u}^{\dagger })\parallel }_{{{\bf{L}}}^{2}({\mathbb{M}})}$ . With this estimate, we can verify assumption 4.10 as discussed after (49). This yields the consistency result (66).

5. Applications

In this section we discuss some prominent examples of inverse problems with Poisson data. Since the literature for each of the three groups of applications discussed below is huge, we will only give very selective references.

5.1. Positron emission tomography

PET is a medical imaging technique to estimate the three-dimensional density of a radioactive tracer, which is administered to a patient before the PET scan. As the tracer decays, positrons are emitted. They immediately annihilate with a nearby electron and give rise to two gamma photons flying in opposite directions. These pairs of gamma photons are detected in a ring of sensors around the patient, see figure 1. For each pair of detected photons it is known that the emission occurred on the straight line connecting the two detectors, but the location on this straight line is unknown.

**Figure 1.** 2D sketch of a PET scan: the pair of photons caused by annihilation of the emitted positron and a nearby electron is recorded at a detector pair.
Download figure:
Standard image High-resolution image

We sketch the derivation of the mathematical model for this problem as described by Vardi et al [150] in a discrete setting and by Johnstone and Silverman [91] in a continuous setting (see also Cavalier and Koo [39]): assume the density $u$ of the tracer is supported in ${B}_{1}:= \{x\in {{\mathbb{R}}}^{2}\;:| x| \lt 1\}$ . It is well-known that the gamma photon emissions are well-described by a Poisson process with intensity proportional to $u$ . Suppose the straight lines intersecting B₁ are parameterized by ${L}_{{\rm{\Theta }},s}:= \{s{\rm{\Theta }}+\tau {{\rm{\Theta }}}^{\perp }| \tau \in {\mathbb{R}}\}$ with $s\in [-1,1]$ , ${\rm{\Theta }}=({\theta }_{1},{\theta }_{2})\in {{\mathbb{S}}}^{1}:= \{x\in {{\mathbb{R}}}^{2}\;:{| x| }_{2}=1\}$ , and ${{\rm{\Theta }}}^{\perp }:= (-{\theta }_{2},{\theta }_{1})$ . In a discrete setting the number of simultaneous counts of two extended detectors can be shown to be proportional to the number of gamma photon emissions in the area covered by straight lines connecting these detectors. In particular, this number of counts is Poisson distributed. In a continuous setting the simultaneous counts are described by a Poisson process, the intensity of which is proportional to the Radon transform of $u$ :

$\begin{eqnarray*} & & F:{{\bf{L}}}^{1}({B}_{1})\to {{\bf{L}}}^{1}([-1,1]\times {{\mathbb{S}}}^{1}),\\ & & ({Fu})(s,{\rm{\Theta }}):= {\displaystyle \int }_{-\sqrt{1-{s}^{2}}}^{\sqrt{1-{s}^{2}}}u(s{\rm{\Theta }}+\tau {{\rm{\Theta }}}^{\perp }){\rm{d}}\tau .\end{eqnarray*}$

The forward operator is obviously linear. Moreover, as a density the unknown ${u}^{\dagger }$ satisfies the non-negativity constraint ${u}^{\dagger }\geqslant 0$ . Note that the Radon transform preserves non-negativity, so we indeed have ${{Fu}}^{\dagger }\geqslant 0$ . Furthermore, using the mapping properties of the Radon transform it is easy to specify ${\mathfrak{B}}$ such that (H6) and (H7) in assumption 4.1 holds true. A natural choice is ${\mathfrak{B}}=\{u\in {H}^{1/2+\epsilon }({B}_{1})\quad | \quad u\geqslant 0\text{a.e.},\parallel u\parallel \leqslant R\}$ with some $R\gt 0$ , which employs the fact that the Radon transform in two-dimensions is 1/2 smoothing (see [114, theorem 5.1]).

5.2. Fluorescence microscopy

Confocal fluorescence microscopy allows three-dimensional imaging of the density of a fluorescent marker. As opposed to conventional microscopes, at each time only a small part of the specimen is illuminated by laser light focussed by some lense. Fluorescent light is collected by the same lense, and light emitted from out-of-focus planes is blocked by a pinhole. This way only fluorescent photons emitted in a small neighborhood of the focal point are recorded by a detector behind the pinhole, see figure 2. Let $h(y-x)$ denote the probability density that a fluorescent marker at y emits a photon recorded by the detector if the focal point is x, assuming that this density actually depends only on x − y, and let $u$ denote the fluorescent marker density. If the specimen is scanned, the inverse problem to recover $u$ from the observed data can be modeled as an inverse problem with Poisson data where the forward operator is the convolution operator

$\begin{eqnarray*}&&({Fu})(x)=\int h(x-y)u(y){\rm{d}}y.\end{eqnarray*}$

The convolution kernel h is referred to as point spread function. As a density $u$ again satisfies the non-negativity constraint $u\geqslant 0$ . Similar problems arise an astronomical imaging. We refer to the overview by Bertero et al [21] and numerous references therein.

**Figure 2.** Sketch of the confocal principle: the object of interest is excited by laser light which is then recorded at a detector, whereas light from out-of-focus planes does not pass the pinhole.
Download figure:
Standard image High-resolution image

In some applications the point spread function is not spatially invariant. Under sufficiently strong a priori information and possibly with additional measurements, the spatial dependence can sometimes be recovered jointly with $u$ by nonlinear inversions. In particular, this is the case for 4Pi microscopy. This is a type of fluorescence microscopy which uses interference of coherent photons to improve the axial resolution of standard confocal microscopy by a factor of 5–7 (see Hell and Stelzer [80] and Hell [79]). Here the point spread function h depends on the relative phase φ of the interfering photons. Ideally φ should be constant, but often this is experimentally difficult or impossible to achieve, so φ is a slowly varying function of x and has to be recovered jointly with the object u. This is described by the nonlinear forward operator

$\begin{eqnarray*}&&(F(u,\varphi ))(x)=\int h(x-y,\varphi (x))u(y){\rm{d}}y.\end{eqnarray*}$

A simple model for the point spread function is given by $h(x,\varphi )={h}_{{\rm{conf}}}(x){\mathrm{cos}}^{n}({{cx}}_{3}+\varphi /2)$ where ${h}_{{\rm{conf}}}$ is the point spread function of the corresponding confocal microscope and $n\in 2,4$ depends on the type of 4Pi-microscopy. Further data of the form $F(u,\varphi +{\varphi }_{0})$ are experimentally easy to obtain. We refer to Vicidomini et al [151] and Stück et al [142] for further results and algorithms for this problem.

In several applications the fluorescent marker density u does not provide the information one is actually interested in. In fact, $u=n\cdot p$ with the molecular map n and the brightness function p. In a discretized setup, the function n(x) indicates the number of markers in the pixel or voxel x, and hence attains only integer values. The brightness p(x) is given by the probability that a marker at x responds to a single excitation pulse and its response is measured by the detectors; above it has been assumed to be the same for all markers at x. However, p may be non-constant, and hence the knowledge of u does not allow to infer on n, which is the actual object of interest in many applications. To recover both n and p, it was suggested in [145] to exploit the effect of photon antibunching: Under appropriate experimental conditions a single marker cannot emit two photons in a single excitation experiment. Therefore, counting simultaneous photon detections provides additional information on n. This gives in fact rise to another nonlinear problem in fluorescence microscopy where the forward operator is given by

$\begin{eqnarray}&&F(n,p)=\left(\begin{array}{c}({np})\ast h-{\beta }_{1}({(({np})\ast h)}^{2}-({{np}}^{2})\ast {h}^{2})\\ {\beta }_{2}({(({np})\ast h)}^{2}-({{np}}^{2})\ast {h}^{2})\end{array}\right),\end{eqnarray} \tag{ 67 }$

see [144]. Here ∗ denotes convolution, and ${\beta }_{1},{\beta }_{2}\gt 0$ are known factors taking into account the detection geometry. It can readily seen that $({np})\ast h$ and $({{np}}^{2})\ast {h}^{2}$ are uniquely determined by $F(n,p)$ . Thus, if the point spread function h has a positive Fourier transform, it is possible to recover n and p from the measurements. In the typical experiments, h can be approximated quite well by a Gaussian and thus the Fourier transform is a Gaussian as well and hence positive. For results and algorithms for this problem we refer to [144].

Note that measuring simultaneous photon coincidences requires short excitation pulses and sufficient waiting times in between to avoid miscounting. Therefore we are not able to measure continuously, but to repeat the single excitation experiment several times. This naturally leads to a multinomial distribution of the data. However, it is well-known that the multinomial distribution can be well approximated by a Poisson distribution if the number of repetitions is sufficiently large. Thus, it is still reasonable to consider the data as Poissonian.

For other inverse problems related to fluorescence microscopy we also refer to [5] for an overview in a more statistical context.

In all cases considered in this subsection, the smoothing properties of the forward operator are fully determined by the smoothness of h. Suppose that h is at least k times continuously differentiable with $k\gt d/2$ and denote the unknowns by u, i.e. $u=(\tilde{u},\varphi )$ in the 4Pi example with the fluorescence density $\tilde{u}$ and $u=(n,p)$ in case of (67). Then we may choose ${\mathfrak{B}}=\{u\in {{\bf{L}}}^{2}\;| \;u\geqslant 0\;{\rm{a.e.}},\parallel u\parallel \leqslant R\}$ with some $R\gt 0$ and (H7) in assumption 4.1 is satisfied.

5.3. Inverse scattering and phase retrieval problems

The propagation of time harmonic electromagnetic waves in an inhomogeneous isotropic medium with constant magnetic permeability is described by the time-harmonic Maxwell equations

$\begin{eqnarray}&&\;\mathrm{curl}\;\;\mathrm{curl}\;{\bf{E}}-{\kappa }^{2}n(x){\bf{E}}=0\qquad {\rm{in}}\;{\rm{}}{{\mathbb{R}}}^{3}\end{eqnarray} \tag{ 68 }$

for the electric field ${\bf{E}}\;:{{\mathbb{R}}}^{3}\to {{\mathbb{C}}}^{3}$ where n denotes the refractive index of the medium (see e.g. [46]) and $\kappa \gt 0$ the wave number. In scattering theory it is usually assumed that the total field ${\bf{E}}={{\bf{E}}}^{{\rm{i}}}+{{\bf{E}}}^{{\rm{s}}}$ is the sum of a known incident field ${{\bf{E}}}^{{\rm{i}}}$ and a scattered field ${{\bf{E}}}^{{\rm{s}}}$ . The latter satisfies the Silver–Müller radiation condition

$\begin{eqnarray}&&\underset{| x| \to \infty }{\mathrm{lim}}\left(\;\mathrm{curl}\;{{\bf{E}}}^{{\rm{s}}}\times x-\displaystyle \frac{| x| }{{\rm{i}}\kappa }{{\bf{E}}}^{{\rm{s}}}\right)=0.\end{eqnarray} \tag{ 69 }$

Classical inverse scattering problems consist in finding the refractive index n, possibly under further a priori information, given measurements of ${\bf{E}}$ on some measurement surface ${\mathbb{M}}\subset {{\mathbb{R}}}^{3}$ . For numerous references we refer to [46]. Often such a description is rather idealistic since it is typically very difficult or impossible to determine the phase of ${\bf{E}}$ experimentally. Therefore, in a more realistic model the unperturbed data are given by ${| {\bf{E}}| }^{2}$ , which is proportional to the energy density.

The forward operator is then described by

$\begin{eqnarray*}&&F(n):= {| {\bf{E}}{| }_{{\mathbb{M}}}| }^{2}.\end{eqnarray*}$

This operator is nonlinear and defined only implicitly. As discussed in the introduction, at low energy densities the quantization of energy cannot be neglected, and so data are described by a Poisson process with intensity ${| {\bf{E}}{| }_{{\mathbb{M}}}| }^{2}$ , see Ivanyshyn and Kress [90] or [88].

If the wavelength is small compared to characteristic features of the object and n is close to 1 everywhere, the forward problem can be accurately described by a much simpler model (see Born and Wolf [30] or Paganin [118]): suppose the support of $n-1$ is contained in $\{(x,y,z)\in {{\mathbb{R}}}^{3}| -1\lt z\lt 0\}$ and the incident field is a coherent beam, which propagates in positive z-direction and is polarized in x-direction, see figure 3. Then ${\bf{E}}\approx (u,0,0)$ where ${\rm{\Delta }}u+{\kappa }^{2}u=0$ , and in the plane $\{z=0\}$ the function u is approximately related to n by the projection approximation

$\begin{eqnarray*}&&u(x,y,0)\approx {u}^{{\rm{i}}}(x,y,0)\mathrm{exp}\left({\rm{i}}\kappa {\int }_{-1}^{0}(n(x,y,z)-1){\rm{d}}z\right)\end{eqnarray*}$

also used in computerized tomography. Often the so-called exit field ${u}_{0}(x,y):= u(x,y,0)$ is considered as unknown of the inverse problem. Note that u₀ yields a projection of the unknown refractive index n. By a separation of variables the upward propagating solution to the Helmholtz equation ${\rm{\Delta }}u+{\kappa }^{2}u=0$ in $\{z\gt 0\}$ with boundary condition $u(x,y,0)={u}_{0}(x,y)$ is given by

$\begin{eqnarray*}u(x,y,z) & = & {{ \mathcal F }}^{-1}(\mathrm{exp}({\rm{i}}z\sqrt{{\kappa }^{2}-{| \xi | }^{2}})({ \mathcal F }{u}_{0})(\xi ))(x,y)\\ & \approx & {{\rm{e}}}^{{\rm{i}}\kappa z}{{ \mathcal F }}^{-1}\left(\mathrm{exp}(-{\rm{i}}\displaystyle \frac{z}{2\kappa }{| \xi | }^{2})({ \mathcal F }{u}_{0})(\xi )\right)(x,y)=: ({P}_{z}^{{\rm{F}}}{u}_{0})(x,y),\end{eqnarray*}$

where ${ \mathcal F }$ denotes the two-dimensional Fourier transform and where we have used the Taylor approximation $\sqrt{{\kappa }^{2}-{| \xi | }^{2}}\approx \kappa -{| \xi | }^{2}/(2\kappa )$ . The so-called Fresnel propagator ${P}_{z}^{{\rm{F}}}$ is unitary on ${{\bf{L}}}^{2}({{\mathbb{R}}}^{2})$ .

**Figure 3.** Sketch of a phase retrieval problem in coherent diffractive imaging: the incident field is chosen to be a plane wave propagating in z direction, the object of interest is placed close to the plane $\{z=0\}$ , and the intensity of the field is recorded at $\{z={\rm{\Gamma }}\}$ .
Download figure:
Standard image High-resolution image

**Figure 3.** Sketch of a phase retrieval problem in coherent diffractive imaging: the incident field is chosen to be a plane wave propagating in z direction, the object of interest is placed close to the plane $\{z=0\}$ , and the intensity of the field is recorded at $\{z={\rm{\Gamma }}\}$ .
Download figure:
Standard image High-resolution image

The problem to recover the exit field u₀ from photon counts in the plane $\{z={\rm{\Gamma }}\}$ is an inverse problem with Poisson data for the nonlinear forward operator

$\begin{eqnarray*}&&F({u}_{0}):= {| {P}_{{\rm{\Gamma }}}{u}_{0}| }^{2}.\end{eqnarray*}$

Such problems are referred to as phase retrieval problems since they may be reformulated as finding the missing phase function ${P}_{{\rm{\Gamma }}}{u}_{0}/| {P}_{{\rm{\Gamma }}}{u}_{0}|$ in order to recover u₀ by an application of ${P}_{{\rm{\Gamma }}}^{-1}$ . If $({x}^{2}+{y}^{2})\kappa /{\rm{\Gamma }}\ll 1$ the Fraunhofer (or far field) approximation

$\begin{eqnarray*}&&| ({P}_{{\rm{\Gamma }}}{u}_{0})(x,y)| \approx \displaystyle \frac{\kappa }{2\pi {\rm{\Gamma }}}\left|{ \mathcal F }\left(\displaystyle \frac{\kappa x}{{\rm{\Gamma }}},\displaystyle \frac{\kappa y}{{\rm{\Gamma }}}\right)\right|\end{eqnarray*}$

is valid, i.e. the forward operator is essentially the squared absolute value of the Fourier transform.

Obviously, one may only hope for uniqueness if further a priori information on u₀ are given. Typical such information are support constraints, $| {u}_{0}| \equiv 1$ (i.e. n real-valued), u₀ real-valued (i.e. $n-1$ purely imaginary) or a priori known singularities of u₀ or its derivatives. We refer to Klibanov et al [96] for an overview on uniqueness results for phase retrieval problems in one space dimension and to Klibanov [95] for a uniqueness result in two space dimensions in the far field case. A uniqueness result for near field data assuming only boundedness of $\mathrm{supp}(n-1)$ was recently shown by Maretzke [108].

Obviously the operator $F\;:{u}_{0}\mapsto {| {P}_{{\rm{\Gamma }}}{u}_{0}| }^{2}$ always maps to non-negative functions. Even though ${ \mathcal X }$ is often a space of complex-valued functions here, we interpret it as a real Banach space to speak of the Fréchet derivative $F^{\prime} [u]\;:{ \mathcal X }\to { \mathcal Y }={{\bf{L}}}_{{\mathbb{R}}}^{1}({\mathbb{M}})$ as a linear mapping to the real Banach space ${{\bf{L}}}_{{\mathbb{R}}}^{1}({\mathbb{M}})$ . If u₀ has compact support or is a compactly supported perturbation of a constant function, then ${P}_{{\rm{\Gamma }}}{u}_{0}$ and hence ${| {P}_{{\rm{\Gamma }}}{u}_{0}| }^{2}$ are analytic. (H7) in assumption 4.1 is satisfied for any ${{\bf{L}}}^{p}$ -bounded set ${\mathfrak{B}}$ and any Sobolev norm in the image space.

6. Algorithmic approaches

In the following we will review algorithmic approaches for inverse problems with Poisson data. In section 3 we already discussed projection methods, which by design were not connected to the Poissonian structure of the data at all. In the following we will mainly focus on methods which are connected to the minimization of Tikhonov-type functionals (34) with discrete data (see section 2.5). They can be written as

$\begin{eqnarray}&&{\hat{u}}_{\alpha }\in \underset{u\in { \mathcal X }}{\mathrm{argmin}}{ \mathcal J }(u),\qquad { \mathcal J }(u):= { \mathcal S }({\bf{T}}u)+\alpha { \mathcal R }(u),\end{eqnarray} \tag{ 70 }$

where ${ \mathcal R }\;:{ \mathcal X }\to {\mathbb{R}}\cup \{\infty \}$ and ${ \mathcal S }\;:{{\mathbb{R}}}^{J}\to {\mathbb{R}}\cup \{\infty \}$ are proper, convex, and lower semi-continuous, ${\bf{T}}\;:{ \mathcal X }\to {{\mathbb{R}}}^{J}$ is a bounded linear operator and ${ \mathcal S }$ is of the form

$\begin{eqnarray}&&{ \mathcal S }({\bf{h}})=\displaystyle \sum _{j=1}^{J}[({{\bf{h}}}_{j}+{{\bf{g}}}_{j}^{0})-{{\bf{g}}}_{j}^{{\rm{obs}}}\mathrm{ln}({{\bf{h}}}_{j}+{{\bf{g}}}_{j}^{0})].\end{eqnarray} \tag{ 71 }$

Here ${{\bf{g}}}^{{\rm{obs}}}={{\bf{S}}}_{J}{G}_{t}+\sigma {\bf{1}}$ with the notation of section 2.5 (where all entries of the vector ${\bf{1}}$ are 1), and the conventions (6) for the logarithm are used to avoid explicit case distinctions. For linear operators we have ${{\bf{g}}}^{0}=\sigma {\bf{1}}$ and ${\bf{T}}={{\bf{S}}}_{J}F$ . In case of the Newton iteration (62a) for nonlinear operators we have to solve a sequence of problems of the form (70) with ${\bf{T}}={{\bf{S}}}_{J}F^{\prime} [{u}_{k}]$ and ${{\bf{g}}}^{0}={{\bf{S}}}_{J}F({u}_{k})+\sigma$ . For most of the following discussions ${ \mathcal X }$ may be either finite or infinite dimensional. In the latter situation, the problem (70) is semidiscrete. Nevertheless, elements in the solution space ${ \mathcal X }$ or its dual will be denoted by standard letters whereas elements in the data space ${{\mathbb{R}}}^{J}$ or its dual will always be denoted by bold letters, in agreement with the notation above.

In view of the huge number of papers on the subject, our review will necessarily remain incomplete, even with the focus outlined above. We will mainly discuss developments since 2009. For a review of previous numerical approaches we refer to Bertero et al [21]. For numerical comparisons of several of the algorithms discussed below for deconvolution problems with TV-regularization we refer the recent PhD Thesis by Benfenati [17].

6.1. Quadratic approximations

In this subsection we will discuss methods which either replace (70) by some quadratic approximation or which solve (70) by sequential quadratic approximation. In this setting, quadratic only refers to the data fidelity term, not to the penalty ${ \mathcal R }$ . For simplicity, the reader may think of a quadratic penalty ${ \mathcal R }$ (up to constraints) in this subsection.

Note that with our conventions for the logarithm we have ${ \mathcal S }({\bf{h}})=\infty$ if components of ${\bf{T}}u+{{\bf{g}}}^{0}$ are negative, i.e. the data fidelity term ${ \mathcal S }({\bf{h}})$ in (71) introduces the implicit constraint ${\bf{T}}u+{{\bf{g}}}^{0}\geqslant 0$ . Note that the negative logarithm in ${ \mathcal S }$ acts like a barrier function enforcing this constraint. For many approximations of ${ \mathcal S }$ below we may have to add the constraint explicitely. For most linear inverse problems this is not an issue as the forward operator ${\bf{T}}$ is defined on a space of real-valued functions and preserves non-negativity, i.e.

$\begin{eqnarray}&&u\geqslant 0\quad \Rightarrow \quad {\bf{T}}u\geqslant 0,\end{eqnarray} \tag{ 72 }$

and we have ${{\bf{g}}}^{0}=\sigma {\bf{1}}$ and ${u}^{\dagger }\geqslant 0$ . Then the condition ${\bf{T}}u+{{\bf{g}}}^{0}\geqslant 0$ is implied by the much easier constraint $u\geqslant 0$ , which is formally included in the definition of ${ \mathcal R }$ . This is not the in case for Newton-type method where ${{\bf{g}}}^{0}={{\bf{S}}}_{J}F({u}_{k})+\sigma {\bf{1}}$ , so many approximations below will be less useful in this case.

Approximations of the Poisson likelihood. Let Y be a Poisson distributed random variable with parameter $\lambda \gt 0$ . The infinite divisibility property of the Poisson distribution and the central limit theorem ensure that in distribution

$\begin{eqnarray}&&\displaystyle \frac{Y-\lambda }{\sqrt{\lambda }}\to { \mathcal N }(0,1),\qquad \displaystyle \frac{Y-\lambda }{\sqrt{Y}}\to { \mathcal N }(0,1),\qquad \mathrm{as}\quad \lambda \to \infty .\end{eqnarray} \tag{ 73 }$

Here the second assertion follows from the first assertion and Slutsky's theorem (see [149, lemma 2.8]). Not surprisingly it can be observed in numerical simulations that the convergence in the second statement of (73) will be slower than in the first statement, as the true denominator $\sqrt{\lambda }$ is replaced by the estimator $\sqrt{Y}$ .

This motivates two approaches based on approximation of the Poisson log-likelihood. The first is the weighted least squares problem

$\begin{eqnarray}&&{\widehat{u}}_{\alpha }^{{\rm{WLS}}}:= \underset{u\in { \mathcal X }}{\mathrm{argmin}}\left[{\parallel \displaystyle \frac{{{\bf{g}}}^{{\rm{obs}}}-{\bf{T}}u-{{\bf{g}}}^{0}}{\sqrt{\mathrm{max}\{{{\bf{g}}}^{{\rm{obs}}},\epsilon \}}}\parallel }_{2}^{2}+\alpha { \mathcal R }(u)\right]\end{eqnarray} \tag{ 74 }$

involving a relaxation parameter $\epsilon \gt 0$ to avoide division by 0. Here and in the following a fraction of vectors denotes pointwise division, i.e. $\tfrac{{\bf{a}}}{{\bf{b}}}:= {\left(\tfrac{{{\bf{a}}}_{j}}{{{\bf{b}}}_{j}}\right)}_{j=1,\ldots ,J}$ . This method has been used in many different applications, e.g. [8, 10, 88, 141, 142]. It must be said, however, that it is quite sensitive to the choice of in case of small components of $t{{\bf{g}}}^{\dagger }$ .

The second method is based on the first statement in (73), which leads to the nonlinear weighted least squares method

$\begin{eqnarray}&&{\widehat{u}}_{\alpha }^{{\rm{NWLS}}}:= \underset{u\in { \mathcal X }}{\mathrm{argmin}}\left[{\parallel \displaystyle \frac{{{\bf{g}}}^{{\rm{obs}}}-{\bf{T}}u-{{\bf{g}}}^{0}}{\sqrt{{\bf{T}}u+{{\bf{g}}}^{0}}}\parallel }_{2}^{2}+\alpha { \mathcal R }(u)\right].\end{eqnarray} \tag{ 75 }$

The implicit side condition ${\bf{T}}{\widehat{u}}_{\alpha }^{{\rm{NWLS}}}+{{\bf{g}}}^{0}\geqslant 0$ is still required. Note that (75) is non-convex and hence computationally challenging. It has been approached by split-gradient methods (somewhat similar to (97)) in [139]. More often it is replaced by the iteratively re-weighted least-squares method defined by

$\begin{eqnarray}&&{u}_{{\rm{IRLS}}}^{k+1}:= \underset{u\in { \mathcal X }}{\mathrm{argmin}}\left[{\parallel \displaystyle \frac{{{\bf{g}}}^{{\rm{obs}}}-{\bf{T}}u-{{\bf{g}}}^{0}}{\sqrt{\mathrm{max}\{{\bf{T}}{u}_{{\rm{IRLS}}}^{k}+{{\bf{g}}}^{0},\epsilon \}}}\parallel }_{2}^{2}+\alpha { \mathcal R }(u)\right],\end{eqnarray} \tag{ 76 }$

see [69, 101]. Note that this method again involves a relaxation parameter $\epsilon \gt 0$ .

For quadratic penalities ${ \mathcal R }$ including a non-negativity constraint $u\geqslant 0$ the constrained quadratic minimization problems (74) and (76) may be solved by a variety of methods, see [152, section 9]. In particular, semismooth Newton methods are very efficient [142].

Another approximation makes use of the Anscombe transform (see [2]) $f(y):= 2\sqrt{y+\tfrac{3}{8}}$ , which is well-known to be a variance stabilizing transformation for Poisson data in statistics. This is shown by the following lemma (see [32, lemma 1]):

Lemma 6.1 (Anscombe's transform). Suppose $\lambda \gt 0$ and let $Y$ be a Poisson distributed random variable with parameter λ. Then for all $c\geqslant 0$ it holds

$\begin{eqnarray*}{\bf{E}}[2\sqrt{Y+c}] & = & 2\sqrt{\lambda }+\displaystyle \frac{4c-1}{4\sqrt{\lambda }}+{ \mathcal O }\left({\lambda }^{-\tfrac{3}{2}}\right),\\ {\bf{Var}}[2\sqrt{Y+c}] & = & 1+\displaystyle \frac{3-8c}{8\lambda }+{ \mathcal O }({\lambda }^{-2}).\end{eqnarray*}$

Consequently, the choice $c=3/8$ in Anscombe's transform ensures that the variance of $2\sqrt{Y+c}$ does no longer depend on λ up to second order. Even more, it follows from properties of the Poisson distribution that $2\sqrt{Y+\tfrac{3}{8}}-2\sqrt{\lambda }\to { \mathcal N }(0,1)$ in distribution as $\lambda \to \infty$ .

From this, it can readily be seen that another reasonable Tikhonov-type regularization is given by

$\begin{eqnarray}&&{\widehat{u}}_{\alpha }^{{\rm{ansc}}}:= \underset{u\in { \mathcal X }}{\mathrm{argmin}}\left[{\parallel \sqrt{{{\bf{g}}}^{{\rm{obs}}}+{{\bf{g}}}^{0}+\displaystyle \frac{3}{8}}-\sqrt{{\bf{T}}u+{{\bf{g}}}^{0}}\parallel }_{2}^{2}+\alpha { \mathcal R }(u)\right].\end{eqnarray} \tag{ 77 }$

Note that the implicit side condition ${\bf{T}}{\widehat{u}}_{\alpha }^{{\rm{ansc}}}+{{\bf{g}}}^{0}\geqslant 0$ is still included. Furthermore, the objective function is convex in $u$ , as $y\mapsto {(\sqrt{y+c}-a)}^{2}$ is a convex mapping for any $a\gt 0,c\in {\mathbb{R}}$ . Thus, (77) is again of the form (70) with different ${ \mathcal S }$ , which makes general splitting methods applicable. We will discuss the forward–backward splitting applied in [54] in section 6.3. Alternatively, (77) can be approached by considering $u\mapsto \sqrt{{\bf{T}}u}$ as a nonlinear operator and applying iterative linearization, i.e. $\sqrt{{\bf{T}}u+{{\bf{g}}}^{0}}\approx \sqrt{{\bf{T}}{u}^{k}+{{\bf{g}}}^{0}}+\tfrac{{\bf{T}}(u-{u}^{k})}{2\sqrt{{\bf{T}}{u}^{k}+{{\bf{g}}}^{0}}}$ and iterating over k. This approach has been used in [70].

Sequential quadratic approximation of ${ \mathcal S }$ . Now let us come back to the original problem (70). We consider the approximation of the data fidelity term (71) by its second order Taylor expansion, which is given by

$\begin{eqnarray*}&&{ \mathcal S }({\bf{g}}+{\bf{h}})={ \mathcal S }({\bf{g}})+\displaystyle \sum _{j=1}^{J}\left(1-\displaystyle \frac{{{\bf{g}}}_{j}^{{\rm{obs}}}}{{{\bf{g}}}_{j}+{{\bf{g}}}_{j}^{0}}\right){{\bf{h}}}_{j}+\displaystyle \frac{1}{2}\displaystyle \sum _{j=1}^{J}\displaystyle \frac{{{\bf{g}}}_{j}^{{\rm{obs}}}}{{({{\bf{g}}}_{j}+{{\bf{g}}}_{j}^{0})}^{2}}{{\bf{h}}}_{j}^{2}+O({\parallel {\bf{h}}\parallel }^{3})\end{eqnarray*}$

if ${\bf{g}}+{{\bf{g}}}^{0}\gt 0$ . Replacing ${ \mathcal S }$ in (70) by this quadratic approximation and completing the squares, we obtain the following iterative scheme to approximate the solution of (70):

$\begin{eqnarray}&&{u}^{k+1}:= \underset{u\in { \mathcal X }}{\mathrm{argmin}}\left[{\parallel \displaystyle \frac{\sqrt{{{\bf{g}}}^{{\rm{obs}}}}}{{\bf{T}}{u}^{k}+{{\bf{g}}}^{0}}{\bf{T}}(u-{u}^{k})-\displaystyle \frac{{{\bf{g}}}^{{\rm{obs}}}-{\bf{T}}{u}^{k}-{{\bf{g}}}^{0}}{\sqrt{{{\bf{g}}}^{{\rm{obs}}}}}\parallel }_{2}^{2}+\alpha { \mathcal R }(u)\right].\end{eqnarray} \tag{ 78 }$

Note that ${u}^{k}$ is a minimum of (70) if ${u}^{k+1}={u}^{k}$ and ${\bf{T}}{u}^{k}+{{\bf{g}}}^{0}\geqslant 0$ . In [88] a side condition ${\bf{T}}{u}^{k+1}+{{\bf{g}}}^{0}\geqslant \tfrac{\sigma }{2}{\bf{1}}$ for a regularized Newton method was enforced via step-size control.

Discussion. The quality of the approximation using properties of the Poisson distribution strongly depends on the order of magnitude of $t{{\bf{g}}}^{\dagger }$ . Obviously, the methods (74)–(76) will yield good results only for large $t{{\bf{g}}}^{\dagger }$ , as they are based on the asymptotic statement (73). The situation is slightly better when using (77), as it is known that Anscombe's transform also performs well for small values of λ, see [32].

In all cases the quality of the approximations can be improved by increasing the binsize in a preprocessing step, which leads to larger values of ${{\bf{g}}}^{\dagger }$ and ${{\bf{g}}}^{{\rm{obs}}}$ . Obviously, this increases the discretization error as discussed in section 2.5, so this should only be done as long as the discretization error is not dominant.

In many applications, including most of the ones discussed in section 5, the quantity ${{\bf{g}}}^{\dagger }$ is not bounded away from 0, and hence using the true data fidelity ${ \mathcal S }$ as in (71) will pay off as shown in simulations [53, 88].

6.2. Elements of convex analysis

For the following discussions we have to recall some basic facts from convex analysis, which can be found in most textbooks and monographs on this topic (see e.g. [14, 58]). By the sum and the chain rule for subdifferentials, the optimality conditions for (70) can be written as

$\begin{eqnarray}&&0\in \partial { \mathcal J }({\hat{u}}_{\alpha })={{\bf{T}}}^{* }\partial { \mathcal S }({\bf{T}}{\hat{u}}_{\alpha })+\alpha { \mathcal R }({\hat{u}}_{\alpha })\end{eqnarray} \tag{ 79 }$

under mild assumptions. Note that (79) is equivalent to the existence of some ${\hat{{\bf{p}}}}_{\alpha }\in -\partial { \mathcal S }({\bf{T}}{\hat{u}}_{\alpha })$ such that ${{\bf{T}}}^{* }{\hat{{\bf{p}}}}_{\alpha }\in \partial { \mathcal R }({\hat{u}}_{\alpha })$ . As $-{\hat{{\bf{p}}}}_{\alpha }\in \partial { \mathcal S }({\bf{T}}{\hat{u}}_{\alpha })$ is equivalent to ${\bf{T}}{\hat{u}}_{\alpha }\in \partial {{ \mathcal S }}^{* }(-{\hat{{\bf{p}}}}_{\alpha })$ with the Legendre–Fenchel conjugate functional ${{ \mathcal S }}^{* }\;:{{\mathbb{R}}}^{J}\to {\mathbb{R}}\cup \{\infty \}$ ,

$\begin{eqnarray*}&&{{ \mathcal S }}^{* }({\bf{q}}):= \underset{{\bf{h}}}{\mathrm{sup}}[\langle {\bf{q}},{\bf{h}}\rangle -{ \mathcal S }({\bf{h}})],\end{eqnarray*}$

the optimality condition (79) can equivalently be rewritten as the pair of conditions

$\begin{eqnarray}&&{{\bf{T}}}^{* }{\hat{{\bf{p}}}}_{\alpha }\in \alpha \partial { \mathcal R }({\hat{u}}_{\alpha }),\qquad {\bf{T}}{\hat{u}}_{\alpha }\in \partial {{ \mathcal S }}^{* }(-{\hat{{\bf{p}}}}_{\alpha }),\end{eqnarray} \tag{ 80 }$

which are known as extremal relations. They are obviously necessary for $({\hat{u}}_{\alpha },{\hat{{\bf{p}}}}_{\alpha })$ to be a saddle point of the Lagrange function

$\begin{eqnarray}&&{ \mathcal L }(u,{\bf{p}}):= \alpha { \mathcal R }(u)-\langle {\bf{p}},{\bf{T}}u\rangle -{{ \mathcal S }}^{* }(-{\bf{p}}),\end{eqnarray} \tag{ 81 }$

and they can be shown to be also sufficient. The dual problem consists in maximizing the objective functional ${\bf{p}}\mapsto {\mathrm{inf}}_{u\in { \mathcal X }}L(u,{\bf{p}})$ and has the form

$\begin{eqnarray}&&{\hat{{\bf{p}}}}_{\alpha }\in {\mathrm{argmax}}_{{\bf{p}}\in {{\mathbb{R}}}^{J}}[-\alpha {{ \mathcal R }}^{* }({{\bf{T}}}^{* }{\bf{p}})-{{ \mathcal S }}^{* }(-{\bf{p}})].\end{eqnarray} \tag{ 82 }$

Straightforward calculations show that

$\begin{eqnarray}&&{{ \mathcal S }}^{* }({\bf{p}})=\displaystyle \sum _{j\in \mathrm{supp}({{\bf{g}}}^{{\rm{obs}}})}-{{\bf{g}}}_{j}^{{\rm{obs}}}\mathrm{ln}(1-{{\bf{p}}}_{j})-\langle {\bf{p}},{{\bf{g}}}^{0}\rangle +\displaystyle \sum _{j\in \mathrm{supp}({{\bf{g}}}_{j}^{{\rm{obs}}})}{{\bf{g}}}_{j}^{{\rm{obs}}}(\mathrm{ln}{{\bf{g}}}_{j}^{{\rm{obs}}}-1)\end{eqnarray} \tag{ 83 }$

if ${\bf{p}}\leqslant 1$ and ${{\bf{p}}}_{j}\lt 1$ for all $j\in \mathrm{supp}({{\bf{g}}}^{{\rm{obs}}})$ , otherwise ${{ \mathcal S }}^{* }({\bf{p}})=\infty$ . Moreover, $\partial { \mathcal S }({\bf{h}})$ is non-empty if and only if ${\bf{h}}\geqslant -{{\bf{g}}}^{0}$ and ${{\bf{h}}}_{j}\gt -{{\bf{g}}}_{j}^{0}$ for all $j\in \mathrm{supp}({{\bf{g}}}^{{\rm{obs}}})$ . In this case

$\begin{eqnarray}&&\partial { \mathcal S }({\bf{h}})={\bf{1}}-\displaystyle \frac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{h}}+{{\bf{g}}}^{0}}+\partial {\chi }_{{{\mathbb{R}}}_{+}^{J}}({\bf{h}}+{{\bf{g}}}^{0})\end{eqnarray} \tag{ 84 }$

Here ${\chi }_{C}$ denotes the indicator function of a set C, i.e. ${\chi }_{C}(x):= 0$ for $x\in C$ and ${\chi }_{C}(x):= \infty$ else, and we set $\tfrac{{{\bf{g}}}_{j}^{{\rm{obs}}}}{{{\bf{h}}}_{j}+{{\bf{g}}}_{j}^{0}}:= 0$ if ${{\bf{h}}}_{j}+{{\bf{g}}}_{j}^{0}=0={{\bf{g}}}_{j}^{{\rm{obs}}}$ .

Resolvent operators. Let ${ \mathcal G }\;:{\mathbb{V}}\to {\mathbb{R}}\cup \{\infty \}$ be a proper, convex, lower semicontinuous functional on a Hilbert space ${\mathbb{V}}$ and $\lambda \gt 0$ . Since the functional $x\mapsto { \mathcal G }(x)+\tfrac{1}{2\lambda }{\parallel x-z\parallel }^{2}$ has a unique minimum $\bar{x}\in {\mathbb{V}}$ for all $z\in {\mathbb{V}}$ characterized by $z-\bar{x}\in \lambda \partial { \mathcal G }(\bar{x})$ , the operator ${(I+\lambda \partial { \mathcal G })}^{-1}\;:{\mathbb{V}}\to {\mathbb{V}}$ ,

$\begin{eqnarray}&&{(I+\lambda \partial { \mathcal G })}^{-1}(z):= \underset{x\in {\mathbb{V}}}{\mathrm{argmin}}\left[{ \mathcal G }(x)+\displaystyle \frac{1}{2\lambda }{\parallel x-z\parallel }_{{\mathbb{V}}}^{2}\right],\end{eqnarray} \tag{ 85 }$

called resolvent or proximity operator is well defined. Let us discuss some relevant examples in our context:

If ${\chi }_{A}$ is the indicator function of a closed, convex set $A\subset {\mathbb{V}}$ , then the resolvent
$\begin{eqnarray}&&{(I+\lambda \partial {\chi }_{A})}^{-1}(u)=\underset{v\in A}{\mathrm{argmin}}{\parallel v-u\parallel }_{{\mathbb{V}}}^{2}=: {{ \mathcal P }}_{A}(u)\end{eqnarray} \tag{ 86 }$
is the metric projection ${{ \mathcal P }}_{A}$ onto A, which is independent of λ.
For ${\mathbb{V}}={{\ell }}^{2}({\mathbb{N}})$ and the ℓ¹ penalty given in (37), we obtain the soft thresholding or shrinkage operator as introduced in (22),
$\begin{eqnarray}&&{(I+\lambda \partial {{ \mathcal R }}_{{{\ell }}^{1}})}^{-1}(u)={({\zeta }_{\lambda }({u}_{n}))}_{n\in {\mathbb{N}}}.\end{eqnarray} \tag{ 87 }$
For the $\mathrm{TV}$ penalty ${{ \mathcal R }}_{{\rm{TV}}}$ in (36) and ${ \mathcal X }={{\bf{L}}}^{2}({{\mathbb{R}}}^{2})$ , the evaluation of the resolvent ${(I+\lambda \partial {{ \mathcal R }}_{{\rm{TV}}})}^{-1}(u)$ corresponds to the solution of an image denoising problem using the Rudin–Osher–Fatemi model [130]. For this well-studied task no explicit solution formula is known, but efficient iterative algorithms exist, e.g. by Chambolle [40].
The computation of the resolvent ${(I+\lambda \partial { \mathcal S })}^{-1}$ with ${ \mathcal S }$ given by (71) separates componentwise. Straightforward computations distinguishing the cases ${{\bf{g}}}_{j}^{{\rm{obs}}}\gt 0$ and ${{\bf{g}}}_{j}^{{\rm{obs}}}=0$ yield
$\begin{eqnarray}&&{(I+\lambda \partial { \mathcal S })}^{-1}({\bf{h}})=\displaystyle \frac{1}{2}({\bf{h}}-\lambda {\bf{1}}-{{\bf{g}}}^{0}+\sqrt{{({\bf{h}}-\lambda {\bf{1}}+{{\bf{g}}}^{0})}^{2}+4\lambda {{\bf{g}}}^{{\rm{obs}}}}),\end{eqnarray} \tag{ 88 }$
where the square and the square root are understood componentwise.
The resolvent of $\partial {{ \mathcal S }}^{* }$ can either be evaluated by direct computations using (83) or from (88) using Moreau's identity ${(I+\lambda \partial {{ \mathcal S }}^{* })}^{-1}({\bf{h}})={\bf{h}}-\lambda {\left(I+\tfrac{1}{\lambda }\partial { \mathcal S }\right)}^{-1}\left(\tfrac{1}{\lambda }{\bf{h}}\right)$ ([14, theorem 14.3]):
$\begin{eqnarray}&&{(I+\lambda \partial {{ \mathcal S }}^{* })}^{-1}({\bf{h}})\\ &&\;\;\;=\displaystyle \frac{1}{2}({\bf{h}}+{\bf{1}}+\lambda {{\bf{g}}}^{0}-\sqrt{{({\bf{h}}-{\bf{1}}+\lambda {{\bf{g}}}^{0})}^{2}+4\lambda {{\bf{g}}}^{{\rm{obs}}}}).\end{eqnarray} \tag{ 89 }$

Gradient and proximal point methods. Two fundamental algorithms for the solution of convex minimization problems are the method of steepest descent

$\begin{eqnarray}&&{u}^{k+1}\in (I-{\lambda }_{k}\partial { \mathcal J })({u}^{k})\end{eqnarray} \tag{ 90 }$

and the proximal point algorithm

$\begin{eqnarray}&&{u}^{k+1}:= {(I+{\lambda }_{k}\partial { \mathcal J })}^{-1}({u}^{k}),\end{eqnarray} \tag{ 91 }$

with step sizes ${\lambda }_{k}\gt 0$ . For smooth ${ \mathcal J }$ they can be interpreted as the forward and backward Euler method applied to the evolution equation $\partial u(t)\in -\partial { \mathcal J }(u(t))$ , $t\gt 0$ . Note that solutions to (70) correspond to stationary points of this evolution equation. Therefore, the operators $I-{\lambda }_{k}\partial { \mathcal J }$ and and ${(I+{\lambda }_{k}\partial { \mathcal J })}^{-1}$ are often referred to as forward and backward operator, rsp. which of course must not be confused with the forward operator of the inverse problem. Operators of this type will be the basic ingredients of most of the algorithms discussed below.

For non-smooth ${ \mathcal J }$ the method (90) is not uniquely defined. Moreover, convergence can be very slow in general.

The proximal point algorithm (91) applied to ${ \mathcal J }$ given by (70) is not useful since in general each iteration step is as difficult as the original problem (70). However, a number of important iterative algorithms can be interpreted as proximal point methods applied to reformulations of the original problem [14]. It has been shown by Rockafellar [128] that the sequence generated by the proximal point algorithm converges at least weakly to a minimizer. Therefore, the proximal point method can serve as a unifying principle and a means to prove convergence.

6.3. Splitting methods

Whereas the evaluation of the resolvent of ${ \mathcal J }$ is usually as difficult as the minimization of ${ \mathcal J }$ , often the resolvent of ${ \mathcal R }$ is much easier to compute. The same holds true for the resolvents of ${ \mathcal S }$ and ${{ \mathcal S }}^{* }$ (see (89)) as well as the applications of ${\bf{T}}$ and ${{\bf{T}}}^{* }$ . For quadratic ${ \mathcal S }$ (but non-quadratic ${ \mathcal R }$ ) the resolvent of

$\begin{eqnarray*}&&{ \mathcal Q }:= { \mathcal S }\;\circ \;{\bf{T}}\end{eqnarray*}$

leads to the solution of a linear system of equations, which may also be much easier than the evaluation of the resolvent of ${ \mathcal J }$ . This explains the interest in splitting algorithms, which consist only of the above simpler operations.

Forward–backward splittings. The simplest splitting methods are forward–backward and backward–backward splittings defined by

$\begin{eqnarray}&&{u}^{k+1}:= {(I+{\lambda }_{k}\alpha \partial { \mathcal R })}^{-1}(I-{\lambda }_{k}\partial { \mathcal Q })({u}^{k}),\end{eqnarray} \tag{ 92 }$

$\begin{eqnarray}&&{u}^{k+1}:= {(I+{\lambda }_{k}\alpha \partial { \mathcal R })}^{-1}{(I+{\lambda }_{k}\partial { \mathcal Q })}^{-1}({u}^{k}).\end{eqnarray} \tag{ 93 }$

For ${ \mathcal S }$ given by (71) and nontrivial ${\bf{T}}$ , typically no sufficiently efficient method for the evaluation of the resolvent ${(I+{\lambda }_{k}\partial { \mathcal Q })}^{-1}$ in (93) is available. However, for denoising problems, i.e. ${\bf{T}}=I$ , the backward–backward splitting would be an option in view of (88). Actually we mainly mentioned (93) here for later reference in section 6.5.

In the case of a quadratic data fidelity term ${ \mathcal S }$ and ${ \mathcal R }={{ \mathcal R }}_{{{\ell }}^{1}}$ , (92) corresponds to a step of Landweber iteration followed by a soft thresholding step (see (87)). This method, also referred to as iterative shrinkage thresholding algorithm (ISTA), was derived and analyzed independently by several authors in different ways (see Daubechies et al [49], Figueiredo and Nowak [63], Starck et al [140]).

For ${ \mathcal S }$ given by (71) we only discuss the case ${{\bf{g}}}^{0}=\sigma {\bf{1}}$ which occurs for linear inverse problems. Then the forward operator of ${ \mathcal Q }={ \mathcal S }\;\circ \;{\bf{T}}$ in (92) can be calculated explicitly using (84) as

$\begin{eqnarray}&&(I-\lambda \partial { \mathcal Q })(u)=\left(u-\lambda {{\bf{T}}}^{* }\;\left({\bf{1}}-\displaystyle \frac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}u+\sigma {\bf{1}}}\right)\right).\end{eqnarray} \tag{ 94 }$

Consequently, (92) is given by algorithm 1 and has been discussed for PET by Anthoine et al [3].

Algorithm 1. Forward–backward splitting for (70) and (71).

Choose starting point ${u}^{0}\in { \mathcal X }$ ${u}^{0}\in { \mathcal X }$

for

$k=0,1,2,...$ do

Choose step length ${\lambda }_{k}\gt 0$ ${\lambda }_{k}\gt 0$

${u}^{k+1}={(I+{\lambda }_{k}\alpha \partial { \mathcal R })}^{-1}\left({u}^{k}-{\lambda }_{k}{{\bf{T}}}^{* }\left({\bf{1}}-\tfrac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}{u}^{k}+\sigma {\bf{1}}}\right)\right)$ ${u}^{k+1}={(I+{\lambda }_{k}\alpha \partial { \mathcal R })}^{-1}\left({u}^{k}-{\lambda }_{k}{{\bf{T}}}^{* }\left({\bf{1}}-\tfrac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}{u}^{k}+\sigma {\bf{1}}}\right)\right)$

To make this iteration well-defined, the authors of [3] chose $\sigma \gt 0$ , used the fact that ${\bf{T}}$ for PET preserves non-negativity (see (72)), and ensured that ${u}^{k}\geqslant 0$ for all k by incorporating an indicator function of the cone of non-negative signals into ${ \mathcal R }$ . This required an iterative evaluation of the backward step ${(I+{\lambda }_{k}\alpha \partial { \mathcal R })}^{-1}$ even for ℓ¹ penalization instead of soft thresholding (87), which was performed by fast iterative shrinkage thresholding algorithm (FISTA) (see algorithm 3). Note that unless ${\bf{T}}u$ is bounded away from 0, the Lipschitz constant of ${\rm{\nabla }}{ \mathcal Q }(u)={{\bf{T}}}^{* }\left({\bf{1}}-\tfrac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}u+\sigma {\bf{1}}}\right)$ deteriorates as $\sigma \to 0$ . Since the step-lengths for which the iteration (92) is stable are inverse proportional to this Lipschitz constant, convergence will be slow. To control the Lipschitz constant, Dupé et al [54] suggested to use the Anscombe transform (see lemma 6.1) in a forward–backward splitting method.

Even for moderate Lipschitz constants the convergence of forward–backward splitting methods is often quite slow. In fact, if ${ \mathcal Q }$ has a gradient which is Lipschitz continuous with constant L and the step size is chosen as ${\lambda }_{k}=1/L$ , then the values of the objective function (not $\parallel {u}_{k}-{\hat{u}}_{\alpha }\parallel$ !) are bounded by ${ \mathcal J }({u}^{k})-{ \mathcal J }({\hat{u}}_{\alpha })\leqslant \tfrac{L}{2k}{\parallel {u}^{k}-{\hat{u}}_{\alpha }\parallel }^{2}$ (see [16]). Several strategies have been developed to accelerate the convergence of forward–backward splitting.

The scaled gradient projection (SGP) method, discussed by Bonettini et al [29] for deconvolution problems, can be interpreted as a version of forward–backward splitting with ${ \mathcal R }={\chi }_{C}$ where C is a closed convex cone, typically that of non-negative signals. Note that this choice of ${ \mathcal R }$ is not regularizing. To speed up convergence, the (discrete) space ${ \mathcal X }$ is equipped with a problem adapted norm ${\parallel u\parallel }_{k}^{2}={u}^{\top }{({\lambda }_{k}{D}_{k})}^{-1}u$ given by a diagonal Gram matrix ${D}_{k}^{-1}$ and a step length parameter ${\lambda }_{k}$ in each iteration step, and relaxation parameters ${\gamma }_{k}\in (0,1]$ are introduced. Using ${{ \mathcal P }}_{C,{\lambda }_{k}{D}_{k}}(u)={(I+{\lambda }_{k}\partial { \mathcal R })}^{-1}(u)\in {\mathrm{argmin}}_{v\in C}{v}^{\top }{({\lambda }_{k}{D}_{k})}^{-1}v$ , this leads to the iteration formula in algorithm 2.

Algorithm 2. SGP for solving $u\in {\mathrm{argmin}}_{u\in C}{ \mathcal S }({\bf{T}}v)$ .

Choose initial guess ${u}^{0}\in { \mathcal X }$ ${u}^{0}\in { \mathcal X }$

for

$k=0,1,2,...$ do

Choose (diagonal) scaling matrix D_k

Choose step-length parameter ${\lambda }_{k}\gt 0$ ${\lambda }_{k}\gt 0$ and relaxation parameter ${\gamma }_{k}\in (0,1]$ ${\gamma }_{k}\in (0,1]$

${u}^{k+1}={\gamma }_{k}{{ \mathcal P }}_{C,{\lambda }_{k}{D}_{k}}\left({u}^{k}-{\lambda }_{k}{D}_{k}{{\bf{T}}}^{* }\left({\bf{1}}-\tfrac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}{u}^{k}+\sigma {\bf{1}}}\right)\right)+(1-{\gamma }_{k}){u}^{k}$ ${u}^{k+1}={\gamma }_{k}{{ \mathcal P }}_{C,{\lambda }_{k}{D}_{k}}\left({u}^{k}-{\lambda }_{k}{D}_{k}{{\bf{T}}}^{* }\left({\bf{1}}-\tfrac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}{u}^{k}+\sigma {\bf{1}}}\right)\right)+(1-{\gamma }_{k}){u}^{k}$

▷ ${{ \mathcal P }}_{C,{\lambda }_{k}{D}_{k}}(u)\in {\mathrm{argmin}}_{v\in C}{v}^{\top }{({\lambda }_{k}{D}_{k})}^{-1}v$ ${{ \mathcal P }}_{C,{\lambda }_{k}{D}_{k}}(u)\in {\mathrm{argmin}}_{v\in C}{v}^{\top }{({\lambda }_{k}{D}_{k})}^{-1}v$

In [29] three choices of D_k are compared: (i) ${D}_{k}=I;$ (ii) D_k as inverse of the diagonal part of the Hessian of ${ \mathcal Q };$ (iii) D_k is chosen such that scaled gradient step becomes an expectation maximization (EM) step, see section 6.4. In numerical experiments the last choice yields the best results in most cases. Moreover, the authors of [29] highlight the importance of a good selection of the step-size parameters ${\lambda }_{k}$ and suggest a special Barzilai–Borwein type rule. Later Porta et al [119] demonstrated further improvements by a Ritz-type rule.

For SGP as formulated in algorithm 2 one must hope that regularization can be achieved by early stopping. One option to obtain convergence to a penalized maximum likelihood solution ${\hat{u}}_{\alpha }$ is to incorporate the penalty into ${ \mathcal Q }$ , as done for a TV-penalty in Zanella et al [158]. Here a smoothed version of the TV term had to be used, and convergence deteriorates as the smoothing parameter tends to 0. An alternative, which allows the use of the unsmoothed TV-norm, is to replace ${{ \mathcal P }}_{C,{D}_{k}}={(I+{\lambda }_{k}{D}_{k}\partial {\chi }_{C})}^{-1}$ by ${(I+{\lambda }_{k}\alpha \partial {{ \mathcal R }}_{{\rm{TV}}})}^{-1}$ . This leads to the EM-TV algorithm discussed in section 6.4.

Another type of acceleration of forward–backward splitting (92) was suggested by Beck and Teboulle [16] based on Nesterov's algorithm [115] to accelerate gradient methods. This so-called FISTA is formulated in algorithm 3.

Algorithm 3. FISTA for computing $u\in {\mathrm{argmin}}_{v\in { \mathcal X }}{ \mathcal Q }(v)+{ \mathcal R }(v)$ .

Choose ${u}^{0}\in { \mathcal X };$ ${v}^{1}={u}_{0};\ \ {t}_{1}=1$
for $k=1,2,...$ do
Choose step length ${\lambda }_{k}\gt 0$
${u}^{k}={(I+{\lambda }_{k}\partial { \mathcal R })}^{-1}(I-\lambda \partial { \mathcal Q })({v}^{k})$	▷If ${ \mathcal Q }={ \mathcal S }\;\circ \;{\bf{T}}$ with ${ \mathcal S }$ from (71), see (94)
${t}_{k+1}=\tfrac{1}{2}(1+\sqrt{1+4{t}_{k}^{2}})$
${v}^{k+1}={u}^{k}+\tfrac{{t}_{k}-1}{{t}_{k+1}}({u}^{k}-{u}^{k-1})$

If ${ \mathcal Q }$ has a gradient which is Lipschitz continuous with constant L and the step size is chosen as ${\lambda }_{k}=1/L$ , then the values of the objective function are bounded by ${ \mathcal J }({u}_{k})-{ \mathcal J }({\hat{u}}_{\alpha })\leqslant \tfrac{2L}{{(k+1)}^{2}}{\parallel {u}_{0}-{\hat{u}}_{\alpha }\parallel }^{2}$ as opposed to the bound $\tfrac{L}{2k}{\parallel {u}_{0}-{\hat{u}}_{\alpha }\parallel }^{2}$ for ISTA (see [16]). Recently Chambolle and Dossal [41] showed weak convergence of the iterates ${u}_{k}$ for a slight modification of FISTA, which has the same convergence properties of the objective function as FISTA. In the modified FISTA the variables t_k are chosen as ${t}_{k}:= (k+a-1)/a$ for some $a\gt 2$ .

Douglas–Rachford and alternating direction method of multipliers (ADMM) methods. Another classical splitting algorithm is the Douglas–Rachford method

$\begin{eqnarray}&&{u}^{k+1}={(I+\lambda \partial { \mathcal Q })}^{-1}[{(I+\lambda \partial { \mathcal R })}^{-1}(I-\lambda \partial { \mathcal Q })+\lambda \partial { \mathcal Q }]({u}^{k}),\end{eqnarray} \tag{ 95 }$

which was first introduced for the solution of parabolic differential equations in [52] and then generalized to maximal monotone operators in [102]. As opposed to the forward–backward splitting it is stable for all $\lambda \gt 0$ (see [102]). In [55] it was shown that the Douglas–Rachford algorithm can be interpreted as a proximal point algorithm. In particular, it converges weakly, as already shown in [102]. Using the transformation of variables ${u}^{k}={(I+\lambda \partial { \mathcal Q })}^{-1}{v}^{k}$ it can be rewritten only in terms of the resolvent of ${ \mathcal Q }$ , which is necessary in cases where ${ \mathcal Q }$ is not smooth. This reformulation gives algorithm 4.

Algorithm 4. Douglas–Rachford algorithm for computing $u\in {\mathrm{argmin}}_{v\in { \mathcal X }}{ \mathcal Q }(v)+{ \mathcal R }(v)$ .

Choose ${v}^{0}\in { \mathcal X }$ ${v}^{0}\in { \mathcal X }$ and step-length parameter $\lambda \gt 0;$ $\lambda \gt 0;$ k = 0;

while stopping criterion not satisfied do

${v}^{k+1}={(I+\lambda \partial { \mathcal R })}^{-1}[2{(I+\lambda \partial { \mathcal Q })}^{-1}-I]({v}^{k})+[I-{(I+\lambda \partial { \mathcal Q })}^{-1}]({v}^{k})$ ${v}^{k+1}={(I+\lambda \partial { \mathcal R })}^{-1}[2{(I+\lambda \partial { \mathcal Q })}^{-1}-I]({v}^{k})+[I-{(I+\lambda \partial { \mathcal Q })}^{-1}]({v}^{k})$

$k\leftarrow k+1$ $k\leftarrow k+1$

${u}^{k}={(I+\lambda { \mathcal Q })}^{-1}({v}^{k})$ ${u}^{k}={(I+\lambda { \mathcal Q })}^{-1}({v}^{k})$

As discussed for the backward–backward splitting (93), ${(I+\lambda \partial { \mathcal Q })}^{-1}$ cannot efficiently be implemented for ${ \mathcal S }$ in (71) and nontrivial ${\bf{T}}$ . However, algorithm 4 has been applied to a Poisson denoising problem in [47].

As shown in [72] (see also [134]), the Douglas–Rachford algorithm applied to the dual problem (82) yields the so-called ADMM. Instead of reproducing this more lengthy derivation of ADMM as a Douglas–Rachford method here, we will derive ADMM applied to (70) as a variant of the well-known method of multipliers or augmented Lagrangian method [81, 121] (see also [71] for the usage in inverse problems). The starting point is the following reformulation of (70) as a constrained minimization problem:

$\begin{eqnarray*}&&{\rm{minimize}}\quad { \mathcal S }({{\bf{v}}}_{1})+\alpha { \mathcal R }({v}_{2})\quad {\rm{subject}}\;{\rm{to}}\;{\rm{}}\quad {\bf{T}}u={{\bf{v}}}_{1},\ u={v}_{2}.\end{eqnarray*}$

Following Figueiredo and Bioucas-Dias [62] we have introduced here the additional variable $u$ as opposed to a standard ADMM reformulation of (70). The augmented Lagrangian associated to this problem is

$\begin{eqnarray*}{L}_{\beta }(u,{{\bf{v}}}_{1},{v}_{2},{{\bf{p}}}_{1},{p}_{2}) & = & { \mathcal S }({{\bf{v}}}_{1})+\alpha { \mathcal R }({v}_{2})-\langle \beta {{\bf{p}}}_{1},{\bf{T}}u-{{\bf{v}}}_{1}\rangle -\langle \beta {p}_{2},u-{v}_{2}\rangle \\ & & +\displaystyle \frac{\beta }{2}{\parallel {\bf{T}}u-{{\bf{v}}}_{1}\parallel }^{2}+\displaystyle \frac{\beta }{2}{\parallel u-{v}_{2}\parallel }^{2}\end{eqnarray*}$

with primal variables $u,{v}_{2}\in { \mathcal X }$ and ${{\bf{v}}}_{1}\in {{\mathbb{R}}}^{J}$ , a parameter $\beta \gt 0$ , and (scaled) Lagrange multipliers $\beta {{\bf{p}}}_{1}\in {{\mathbb{R}}}^{J}$ and $\beta {p}_{2}\in {{ \mathcal X }}^{* }$ . Rather than computing $({u}^{k+1},{{\bf{v}}}_{1}^{k+1},{v}_{2}^{k+1})\in {\mathrm{argmin}}_{(u,{{\bf{v}}}_{1},{v}_{2})}{L}_{\beta }(u,{{\bf{v}}}_{1},{v}_{2},{{\bf{p}}}_{1}^{k},{p}_{2}^{k})$ as in the method of multipliers, in ADMM one minimizes over the primal variables one after the other in a Gauß–Seidel fashion, which is much easier. The second step in the method of multipliers is a gradient ascent step for the dual objective functional ${ \mathcal H }({{\bf{p}}}_{1},{p}_{2}):= {\mathrm{inf}}_{(u,{{\bf{v}}}_{1},{v}_{2})}{L}_{\beta }(u,{{\bf{v}}}_{1},{v}_{2},{{\bf{p}}}_{1},{p}_{2})$ , which has a simple explicit formula. The same formula is also used in ADMM. This yields algorithm 5.

Algorithm 5. ADMM for (70) and (71).

Choose $\beta \gt 0,{u}^{0}\in { \mathcal X },{{\bf{v}}}_{1}^{0}\in {{\mathbb{R}}}^{J},{v}_{2}^{0}\in { \mathcal X },{{\bf{p}}}_{1}^{0}\in {{\mathbb{R}}}^{J},{p}_{2}^{0}\in {{ \mathcal X }}^{* }$
for $k=0,1,2,...$ do
${u}^{k+1}={({{\bf{T}}}^{* }{\bf{T}}+I)}^{-1}({{\bf{T}}}^{* }({{\bf{v}}}_{1}^{k}+{{\bf{p}}}_{1}^{k})+{v}_{2}^{k}+{p}_{2}^{k})$
${{\bf{v}}}_{1}^{k+1}={(I+{\beta }^{-1}\partial { \mathcal S })}^{-1}({\bf{T}}{u}^{k+1}-{{\bf{p}}}_{1}^{k})$	▷see (88)
${v}_{2}^{k+1}={(I+\alpha /\beta \partial { \mathcal R })}^{-1}({u}^{k+1}-{p}_{2}^{k})$
${{\bf{p}}}_{1}^{k+1}={{\bf{p}}}_{1}^{k}-({\bf{T}}{u}^{k+1}-{{\bf{v}}}_{1}^{k+1})$
${p}_{2}^{k+1}={p}_{2}^{k}-({u}^{k+1}-{v}_{2}^{k+1})$

In the standard version of ADMM without the additional variables ${v}_{2}$ and p₂, ${u}^{k+1}$ would be the minimum of ${ \mathcal R }(u)-\langle \beta {{\bf{p}}}_{1}^{k},{\bf{T}}u-{{\bf{v}}}_{1}^{k}\rangle +\tfrac{\beta }{2}\parallel {\bf{T}}u-{{\bf{v}}}_{1}^{k}\parallel$ , which can be much harder to compute if ${ \mathcal R }$ is not quadratic. In fact, in algorithm 5 the update for ${u}^{k}$ can be computed explicitly for a convolution operator, and by the conjugate gradient method in other cases. The method in [62], which is called Poisson image deconvolution by augmented Lagrangian, is actually more general and can handle arbitrary objective functionals of the form $u\mapsto {\sum }_{l=1}^{L}{{ \mathcal Q }}_{j}({{\bf{T}}}_{j}u)$ with convex functionals ${{ \mathcal Q }}_{j}$ and linear operators ${{\bf{T}}}_{j}$ , which allows the separation, and hence the efficient treatment of various penalty terms ${ \mathcal R }$ . For the case ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ the splitting idea was taken even further by Setzer et al [135], and the suggested algorithms PIDSplit and PIDSplit + show significantly improved performance.

Primal-dual methods. The Chambolle–Pock algorithm [42] can be motivated by reformulating the optimality conditions (80) as fixed point equations: Replacing ${\hat{{\bf{p}}}}_{\alpha }$ by $-{\hat{{\bf{p}}}}_{\alpha }$ , multiplying the second condition ${\bf{T}}{\hat{u}}_{\alpha }\in \partial {{ \mathcal S }}^{* }({\hat{{\bf{p}}}}_{\alpha })$ by a parameter ${\tau }_{{ \mathcal Y }}\gt 0$ , and adding ${\hat{{\bf{p}}}}_{\alpha }$ on both sides yields ${\hat{{\bf{p}}}}_{\alpha }+{\tau }_{{ \mathcal Y }}{\bf{T}}{\hat{u}}_{\alpha }\in {\hat{{\bf{p}}}}_{\alpha }+{\tau }_{{ \mathcal Y }}\partial {{ \mathcal S }}^{* }({\hat{{\bf{p}}}}_{\alpha })$ , and similarly we obtain ${\hat{u}}_{\alpha }-{\tau }_{{ \mathcal X }}{{\bf{T}}}^{* }{\hat{{\bf{p}}}}_{\alpha }\in {\hat{u}}_{\alpha }+{\tau }_{{ \mathcal X }}\partial { \mathcal R }({\hat{u}}_{\alpha })$ . Therefore, the optimality conditions (80) can equivalently be reformulated as

$\begin{eqnarray*}&&{\hat{{\bf{p}}}}_{\alpha }={(I+{\tau }_{{ \mathcal Y }}\partial {{ \mathcal S }}^{* })}^{-1}({\hat{{\bf{p}}}}_{\alpha }+{\tau }_{{ \mathcal Y }}{\bf{T}}{\hat{u}}_{\alpha }),\qquad {\hat{u}}_{\alpha }={(I+{\tau }_{{ \mathcal X }}\partial { \mathcal R })}^{-1}({\hat{u}}_{\alpha }-{\tau }_{{ \mathcal X }}{{\bf{T}}}^{* }{\hat{{\bf{p}}}}_{\alpha }).\end{eqnarray*}$

Iterating these equations and performing an overrelaxation step ${\hat{u}}^{k+1}={u}^{k+1}+{\theta }_{k}({u}^{k+1}-{u}^{k})$ yields algorithm 6.

Algorithm 6. CP primal-dual algorithm for solving (70) and (71).

Choose initial point $({u}^{0},{{\bf{p}}}^{0})\in { \mathcal X }\times {{\mathbb{R}}}^{J}$ and set ${\widehat{u}}^{0}={u}^{0}$
for $k=0,1,2,...$ do
Choose parameters ${\tau }_{{ \mathcal X }}^{k},{\tau }_{{ \mathcal Y }}^{k}\gt 0$ and ${\theta }^{k}\in [0,1]$
${{\bf{p}}}^{k+1}={(I+{\tau }_{{ \mathcal Y }}^{k}\partial {{ \mathcal S }}^{* })}^{-1}({{\bf{p}}}^{k}+{\tau }_{{ \mathcal Y }}^{k}{\bf{T}}{\bar{u}}^{k})$	▷see (89)
${u}^{k+1}={(I+{\tau }_{{ \mathcal X }}^{k}\alpha \partial { \mathcal R })}^{-1}({u}^{k}-{\tau }_{{ \mathcal X }}^{k}{{\bf{T}}}^{* }{{\bf{p}}}^{k+1})$
${\bar{u}}^{k+1}={u}^{k+1}+{\theta }^{k}({u}^{k+1}-{u}^{k})$

For ${\theta }^{k}\equiv 1$ and constant parameters ${\tau }_{{ \mathcal X }}^{k}$ and ${\tau }_{{ \mathcal Y }}^{k}$ it has been shown in [42] that the iterates of algorithm 6 converge to a saddle point of (81). If one of the functionals ${{ \mathcal S }}^{* }$ or ${ \mathcal R }$ is uniformly convex, then ${\theta }^{k},{\tau }_{{ \mathcal X }}^{k}$ and ${\tau }_{{ \mathcal Y }}^{k}$ can be chosen such that one obtains convergence rates of order $1/{k}^{2}$ for the iterates, and if both are uniformly convex, then one even has linear convergence.

The performance of algorithm 6 depends on the computational effort caused by the resolvents. We have seen in (89) that ${(I+{\tau }_{{ \mathcal Y }}^{k}\partial {{ \mathcal S }}^{* })}^{-1}$ separates componentwise and can hence be computed quite fast. Depending on ${ \mathcal R }$ , this might not be the case for ${(I+{\tau }_{{ \mathcal X }}^{k}\alpha \partial { \mathcal R })}^{-1}$ . This restriction can be overcome for certain Banach space norms by formulating the algorithm in a Banach space setting and replacing the second step in algorithm 6 by the update formula

$\begin{eqnarray*}&&{u}^{k+1}={({J}_{X}+{\tau }_{{ \mathcal X }}^{k}\alpha \partial { \mathcal R })}^{-1}({J}_{X}({u}^{k})-{\tau }_{{ \mathcal X }}^{k}{{\bf{T}}}^{* }{{\bf{p}}}^{k+1}),\end{eqnarray*}$

which involves the duality mapping ${J}_{{ \mathcal X }}\;:{ \mathcal X }\to { \mathcal X }^{\prime}$ and a generalized resolvent. The update step for ${{\bf{p}}}^{k}$ can be generalized in an analogous way. This yields an analog to algorithm 6 exhibiting improved performance in several situations [43, 86].

Other primal-dual algorithms for minimizing (70) with ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ have been suggested in [26–28]. Here, as opposed to the framework described in section 6.2, dualization is performed via the gradient in the TV-norm whereas the data-fidelity functional ${ \mathcal Q }:= { \mathcal S }\;\circ \;{\bf{T}}$ is not splitted and treated explicitely. Reference [27] also proposes an extragradient step.

6.4. EM algorithms for linear non-negativity preserving operators

We only briefly touch upon EM algorithms and refer to the review by Bertero et al [21] for further information.

We will only consider linear problems (i.e. ${{\bf{g}}}^{0}=\sigma {\bf{1}}$ ) and forward operators ${\bf{T}}$ which preserve non-negativity (see (72)). We first consider the minimization of the unpenalized negative log-likelihood functional $u\mapsto { \mathcal S }({\bf{T}}u)$ , i.e. the case $\alpha =0$ in (70). Multiplying the optimality condition $0\in {{\bf{T}}}^{* }\partial { \mathcal S }({\bf{T}}u)$ by $u/{{\bf{T}}}^{* }{\bf{1}}$ yields the fixed point equation $u=\tfrac{u}{{{\bf{T}}}^{* }{\bf{1}}}{{\bf{T}}}^{* }\left(\tfrac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}u+\sigma {\bf{1}}}\right)$ for the optimal solution (see (84)). This gives rise to the fixed point iteration in algorithm 7. In the context of deconvolution problems it is often referred to as Richardson–Lucy algorithm since it was first suggested by these authors in [105, 127]. Alternatively, this iteration can be derived as an EM method. We refer to [37, 50, 136] for more information on this class of methods.

Algorithm 7. EM algorithm for computing $u\in {\mathrm{argmin}}_{v\geqslant 0}{ \mathcal S }({\bf{T}}v)$ .

Choose ${u}^{0}\in { \mathcal X }$ ${u}^{0}\in { \mathcal X }$ , ${u}^{0}\gt 0$ ${u}^{0}\gt 0$

for

$k=0,1,2,...$ do

${u}^{k+1}=\tfrac{{u}^{k}}{{{\bf{T}}}^{* }{\bf{1}}}{{\bf{T}}}^{* }\left(\tfrac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}{u}^{k}+\sigma {\bf{1}}}\right)$ ${u}^{k+1}=\tfrac{{u}^{k}}{{{\bf{T}}}^{* }{\bf{1}}}{{\bf{T}}}^{* }\left(\tfrac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}{u}^{k}+\sigma {\bf{1}}}\right)$

Since algorithm 7 is designed to converge to the maximum likelihood estimator, one has to expect that the variance of the iterates will explode in an infinite dimensional setting as $k\to \infty$ . Indeed, one can experimentally observe the semi-convergence behavior typical for iterative regularization methods. So far it is unknown, if the EM algorithm 7 equipped with some appropriate stopping rule has the regularizing property as shown for Tikhonov regularization in theorem 4.8. For attempts in this direction we refer to [113, 123, 124].

To obtain a similar iteration scheme for minimizing penalized log-likelihood functionals (70), Sawatzky et al [132] again multiply the optimality condition (79) by $u/{{\bf{T}}}^{* }{\bf{1}}$ and derive from this equation the following iteration formula:

$\begin{eqnarray*}&&\displaystyle \frac{{u}^{k}}{{{\bf{T}}}^{* }{\bf{1}}}{{\bf{T}}}^{* }\left(\displaystyle \frac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}{u}^{k}+\sigma {\bf{1}}}\right)\in {u}^{k+1}+\alpha \displaystyle \frac{{u}^{k}}{{{\bf{T}}}^{* }{\bf{1}}}\partial { \mathcal R }({u}^{k+1}).\end{eqnarray*}$

Note that $\partial { \mathcal R }$ is evaluated at the new iterate ${u}^{k+1}$ rather than the old iterate ${u}^{k}$ . This leads to a forward–backward scheme, which is formulated in algorithm 8 for the case ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ as discussed in [132].

Algorithm 8. EM-TV algorithm for computing $u\in {\mathrm{argmin}}_{v\geqslant 0}{ \mathcal S }({\bf{T}}v)+\alpha {{ \mathcal R }}_{{\rm{TV}}}(v)$ .

Choose ${u}^{0}\in { \mathcal X }$ , ${u}^{0}\gt 0$
for $k=0,1,2,...$ do
${\tilde{u}}^{k+1/2}=\tfrac{{u}^{k}}{{{\bf{T}}}^{* }{\bf{1}}}{{\bf{T}}}^{* }\left(\tfrac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}{u}^{k}+\sigma {\bf{1}}}\right)$	▷EM step
${u}^{k+1/2}=(1-{\lambda }_{k}){u}^{k}+{\lambda }_{k}{\tilde{u}}^{k+1/2}$ for some ${\lambda }_{k}\in (0,1]$
${u}^{k+1}\in {\mathrm{argmin}}_{u\in \mathrm{BV}({\rm{\Omega }})}\left[\tfrac{1}{2}{\int }_{{\rm{\Omega }}}\tfrac{{{\bf{T}}}^{* }{\bf{1}}}{{u}^{k}}{(u-{u}^{k+1/2})}^{2}\;{\rm{d}}x+\alpha {\| u\| }_{{\rm{TV}}}\right]$	▷TV step

Note that if ${\bf{T}}$ preserves non-negativity (see (72)) the EM step is well-defined and preserves non-negativity. It was shown in [132] that also the TV-step in algorithm 8 preserves non-negativity, and hence the method is well defined without the need to enforce non-negativity explicitly as a constraint in the second step. Straightforward calculations show that the EM-TV iteration can be formulated in the following way as a forward–backward scheme:

$\begin{eqnarray}&&{u}^{k+1}:= {\left(I+{\lambda }_{k}\displaystyle \frac{{u}^{k}}{{{\bf{T}}}^{* }{\bf{1}}}\alpha \partial {{ \mathcal R }}_{{\rm{TV}}}\right)}^{-1}\left(I-{\lambda }_{k}\displaystyle \frac{{u}^{k}}{{{\bf{T}}}^{* }{\bf{1}}}\partial { \mathcal Q }\right){u}^{k}.\end{eqnarray} \tag{ 96 }$

Using this reformulation, the authors of [132] proved convergence of subsequences with respect to the strong ${{\bf{L}}}^{1}$ topology and the weak $\mathrm{BV}$ topology under some assumptions. Equation (96) also reveals the close connection of EM-TV and SGP with ${\gamma }_{k}=1$ and the scaling matrix ${\lambda }_{k}{D}_{k}$ chosen as multiplication by $\tfrac{{u}^{k}}{{{\bf{T}}}^{* }{\bf{1}}}$ as recommended in [29]: as ${{ \mathcal P }}_{C,{\lambda }_{k}{D}_{k}}={(I+\partial {\chi }_{C})}^{-1}$ , the only difference is that ${ \mathcal R }={\chi }_{C}$ is replaced by ${ \mathcal R }=\alpha {{ \mathcal R }}_{{\rm{TV}}}$ .

Nowak and Kolaczyk [116] also study an EM-type method for the computation of a MAP estimator. The prior determining the functional ${ \mathcal R }$ in (70) is chosen in a multiscale fashion by putting a gamma density prior on the total intensity ${\int }_{{\mathbb{M}}}{{tg}}^{\dagger }{\rm{d}}x$ and independent beta priors on the ratios between larger and smaller scales. This reflects the conjugacy between the Poisson and the gamma distribution as well as those between the binomial and the beta distribution, taking into account proposition 2.3.

Following [99], a fully explicit, non-negativity preserving EM-type method for the penalized likelihood functional (70) with general differentiable penalty ${ \mathcal R }$ can be obtained by an arbitrary splitting ${ \mathcal R }^{\prime} [u]=U(u)-V(u)$ with $U(u),V(u)\geqslant 0$ . Multiplying the optimality conditions by $u/{{\bf{T}}}^{* }{\bf{1}}$ yields $\left(1+\tfrac{\alpha }{{{\bf{T}}}^{* }{\bf{1}}}V(u)\right)u=\tfrac{u}{{{\bf{T}}}^{* }{\bf{1}}}\left({{\bf{T}}}^{* }\tfrac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}u+\sigma {\bf{1}}}+\alpha U(u)\right)$ , which leads to the iteration formula

$\begin{eqnarray}&&{u}^{k+1}:= \displaystyle \frac{{u}^{k}}{{{\bf{T}}}^{* }{\bf{1}}+\alpha V({u}^{k})}\left({{\bf{T}}}^{* }\displaystyle \frac{{{\bf{g}}}^{{\rm{obs}}}}{{\bf{T}}{u}^{k}+\sigma {\bf{1}}}+\alpha U({u}^{k})\right).\end{eqnarray} \tag{ 97 }$

A similar technique has been used in [151].

In [34] Bregman iterations

$\begin{eqnarray*}&&{\widehat{u}}_{\alpha }^{l+1}\in \mathrm{argmin}[{ \mathcal S }({\bf{T}}u)+\alpha {{ \mathcal D }}_{{ \mathcal R }}(u,{\widehat{u}}_{\alpha }^{l})]\end{eqnarray*}$

and variants of this formula were studied for ${ \mathcal R }={{ \mathcal R }}_{{\rm{TV}}}$ , which can reduce the loss of contrast. Inexact Bregman iterations were studied in [18].

6.5. Projection methods for phase retrieval problems

Phase retrieval problems as discussed at the end of section 5.3 are often considered as feasibility problems: find a point in the intersection of a set $A\subset { \mathcal X }$ describing a priori information and a set B describing information from the measurements. Often A describes support constraints for $u$ , non-negativity, or $| u| \equiv 1$ for the complex-valued function $u$ in case of a real-valued refractive indices. For exact data ${g}^{\dagger }$ the set B is given by

$\begin{eqnarray*}&&B:= \{u\in { \mathcal X }\;:{| {P}_{{\rm{\Gamma }}}u| }^{2}={g}^{\dagger }\}.\end{eqnarray*}$

Using the indicator functions ${\chi }_{A}$ and ${\chi }_{B}$ of the sets A and B (i.e. ${\chi }_{A}(u):= 0$ if $u\in A$ and ${\chi }_{A}(u):= \infty$ else) the intersection $A\cap B$ can be expressed as the solution set of the minimization problem

$\begin{eqnarray}&&\mathrm{minimize}\ {\chi }_{A}(u)+{\chi }_{B}(u),\qquad u\in { \mathcal X }={{\bf{L}}}^{2}({{\mathbb{R}}}^{2}).\end{eqnarray} \tag{ 98 }$

The set A is often convex (at least for the first two examples given above), however, B is not convex. Recall that the resolvent ${{ \mathcal P }}_{A}:= {(I+\lambda \partial {\chi }_{A})}^{-1}$ is the metric projection onto A independent of $\lambda \gt 0$ . For example, if $A=\{u\in {{\bf{L}}}^{2}({{\mathbb{R}}}^{2})\;:u\geqslant 0,\mathrm{supp}\;u\subset {\rm{\Omega }}\}$ , then $({{ \mathcal P }}_{A}(u))(x)=\mathrm{max}({\mathfrak{R}}(u(x)),0)$ if $x\in {\rm{\Omega }}$ and $({{ \mathcal P }}_{A}(u))(x)=0$ else.

Let ${{ \mathcal P }}_{B}$ be a selection from the (in general multi-valued) ${{\bf{L}}}^{2}$ -metric projection onto B. Since ${P}_{{\rm{\Gamma }}}$ is unitary in ${{\bf{L}}}^{2}({{\mathbb{R}}}^{2})$ , ${{ \mathcal P }}_{B}$ is also straightforward to compute despite of the non-convexity of B:

$\begin{eqnarray*}{{ \mathcal P }}_{B}(u):= \left\{\begin{array}{ll}{P}_{{\rm{\Gamma }}}^{* }\left({g}^{\dagger }\displaystyle \frac{{P}_{{\rm{\Gamma }}}u}{| {P}_{{\rm{\Gamma }}}u| }\right) & {\rm{on}}\;{\rm{}}\{x\in {{\mathbb{R}}}^{2}\;:({P}_{{\rm{\Gamma }}}u)(x)\ne 0\}\\ {P}_{{\rm{\Gamma }}}^{* }{g}^{\dagger } & {\rm{on}}\;{\rm{}}\{x\in {{\mathbb{R}}}^{2}\;:({P}_{{\rm{\Gamma }}}u)(x)=0\}.\end{array}\right.\end{eqnarray*}$

The backward–backward splitting method then corresponds to the alternating projections method in algorithm 9.

Algorithm 9. Alternating projections for finding $u\in A\cap B$ .

Choose ${u}^{0}\in { \mathcal X }$ ${u}^{0}\in { \mathcal X }$

for

$k=0,1,2,...$ do

${u}^{k+1}:= {{ \mathcal P }}_{A}{{ \mathcal P }}_{B}{u}^{k}$ ${u}^{k+1}:= {{ \mathcal P }}_{A}{{ \mathcal P }}_{B}{u}^{k}$

It is also known as error reduction method since both the distance of the iterates to A and the distance to B does not increase in any step [60]. Note, however, that this does not imply that the distance to $A\cap B$ does not increase! algorithm 9 is a variant of the Gerchberg–Saxton algorithm [75].

Alternatively, we may adapt the Douglas–Rachford algorithm 4 (without the final step).

Algorithm 10. HIO for finding $u\in A\cap B$ .

Choose ${u}^{0}\in { \mathcal X }$ ${u}^{0}\in { \mathcal X }$

for

$k=0,1,2,...$ do

${u}^{k+1}=[{{ \mathcal P }}_{A}(2{{ \mathcal P }}_{B}-I)+(I-{{ \mathcal P }}_{B})]{u}^{k}$ ${u}^{k+1}=[{{ \mathcal P }}_{A}(2{{ \mathcal P }}_{B}-I)+(I-{{ \mathcal P }}_{B})]{u}^{k}$

which yields the so-called Hybrid input–output (HIO) algorithm suggested by Fienup [60], see algorithm 10. Both algorithms are very simple to implement and have become standard tools in optics [61]. The interpretations above are due to Bauschke et al [15]. While HIO often converges much faster than error reduction the convergence behavior can be erratic, and the iterates may jump away from a close neighborhood of the solution. A frequently used more stable projection method is relaxed averaged alternating reflections (RAAR) with a parameter $\beta \in (0,1]$ $\beta \in (0,1]$ (see [106]) formulated as algorithm 11.

Algorithm 11. RAAR for finding $u\in A\cap B$ .

Choose starting point ${u}^{0}\in { \mathcal X }$ ${u}^{0}\in { \mathcal X }$ , $\beta \in (0,1]$ $\beta \in (0,1]$

for

$k=0,1,2,...$ do

${u}^{k+1}=\left[\tfrac{\beta }{2}((2{{ \mathcal P }}_{A}-I)(2{{ \mathcal P }}_{B}-I)+I)+(1-\beta ){{ \mathcal P }}_{B}\right]({u}^{k})$ ${u}^{k+1}=\left[\tfrac{\beta }{2}((2{{ \mathcal P }}_{A}-I)(2{{ \mathcal P }}_{B}-I)+I)+(1-\beta ){{ \mathcal P }}_{B}\right]({u}^{k})$

So far the convergence behavior of all of these projection methods is not completely understood even for exact data.

Instead of looking for any point in the intersection of A and B, one may replace (98) by a minimization problem

$\begin{eqnarray*}&&\mathrm{minimize}\ ({\chi }_{A}+{ \mathcal R })(u)+{\chi }_{B}(u),\qquad u\in { \mathcal X }\end{eqnarray*}$

with some convex functional ${ \mathcal R }$ . Then in all the formulas above ${{ \mathcal P }}_{A}$ has to be replaced by the resolvent ${{ \mathcal P }}_{A}^{\lambda }:= {(I+\lambda \partial ({\chi }_{A}+{ \mathcal R }))}^{-1}$ which does depend on λ. E.g. if the signal $u$ is expected to be sparse in some wavelet, curvelet, shearlet or other frame, ${ \mathcal R }$ may be chosen as ℓ¹-norm of the corresponding coefficients, and in this case ${{ \mathcal P }}_{A}^{\lambda }$ involves a shrinkage operator (see (87)). The method proposed in [103] can be interpreted this way.

To take into account the Poissonian distribution of the data, projections onto fattened versions of the set B have been considered [107].

An interesting new approach called phase lift proposed by Candès et al [38] is based on a convex relaxations of a reformulation of the non-convex problem (98) in a much larger space, the dimension of which is the square of the dimension of the original problem. Although this approach has found applications in other areas, the huge increase of the dimension remains a severe restriction for many applications in optics.

As an alternative to the more commonly used projection methods, phase retrieval problems can also be solved by regularized Newton methods [85, 88, 89, 98, 109]. Although more difficult to implement than algorithms 9–11, these methods offer more flexibility in treating different experimental setups and different types of a priori information. E.g. they allow the joint recovery of both modulus and phase of an exit field from a single diffraction pattern and the all-at-once reconstruction of three-dimensional refractive indices from tomographic measurements [109]. Moreover, regularized Newton methods can more easily take advantage of the known Poissonian distribution of the data than projection methods and are more appropriate in case of severe ill-posedness or ill-conditioning. On the other hand, in case of far-field data without a sufficiently strong reference object, they seem to be more prone of being trapped in local minima.

7. Concluding remarks

As experimental scientists strive for higher and higher spatial and temporal resolution in photonic imaging, photon count rates become smaller and smaller, and it becomes increasingly important to take advantages of the a priori known Poissonian distribution of the data to achieve competitive reconstructions. For this reason, inverse problems for Poisson data has become a very active field of research over the last years. Such inverse problems have been approached by different groups of researchers partly independently from different perspectives: development of efficient reconstruction algorithms, statistical and deterministic regularization theory, as well as different fields of applications. This review attempts to provide an overview of recent developments in these fields, and hopefully allows researchers from one field to take advantage of results from other fields.

We believe that the field of inverse problems for Poisson data will continue to attract a lot of research activities. Let us close this review by mentioning a selection of possible directions for future research: often not only an estimator, but also confidence bands are of interest for practitioners. In this connection the study of inverse problems for Poisson data from a Bayesian perspective should be extended, which has not been part of this review. In particular, one may study posterior contraction rates as a Bayesian analog to the frequentist rates of convergence discussed here. In a frequentist setting, one may look not only for upper, but also for lower bounds on the rate of convergence in classes of functions $u$ for which $F(u)$ is not bounded away from 0 (otherwise asymptotic equivalence to the Gaussian case has to be expected, see [33, 74, 117] for such asymptotic equivalence results in the sense of Le Cam for density estimation). Finally, we expect further progress on the characterization of variational source conditions based on techniques from conditional stability analysis of inverse problems. Besides these theoretical questions there is a need for efficient, problem-adapted algorithms for three- and four-dimensional inversions from Poisson data in different settings.

Acknowledgments

Financial support by the German Research Foundation DFG through CRC 755, projects A07, C02 and C09 is gratefully acknowledged.

Inverse problems with Poisson data: statistical regularization theory, applications and algorithms

Article metrics

Submit

Permissions

Share this article

Author e-mails

Author affiliations

Author notes

Dates

Peer review information

Abstract

1. Introduction

2. Poisson processes

2.1. Fundamental properties

2.2. Log-likelihood

2.3. Notions of a noise level and data fidelity terms

2.4. Estimating the density of a Poisson process

2.5. Binning, discretization errors, and CCD modeling

2.6. Concentration and deviation inequalities

3. Projection methods

3.1. Wavelet bases

3.2. Projection-type estimators using a WVD

3.3. Projection-type estimators using a wavelet-Galerkin approximation

4. Variational regularization of inverse problems with Poisson data

4.1. Existence of minimizers

4.2. Regularizing properties

4.2.1. Definitions

4.2.2. Deterministic regularizing properties

4.2.3. Statistical regularizing property/consistency

4.3. A priori rates of convergence

4.4. A posteriori parameter choice rules

4.5. Newton-type methods for nonlinear operators

5. Applications

5.1. Positron emission tomography

5.2. Fluorescence microscopy

5.3. Inverse scattering and phase retrieval problems

6. Algorithmic approaches

6.1. Quadratic approximations

6.2. Elements of convex analysis

6.3. Splitting methods

6.4. EM algorithms for linear non-negativity preserving operators

6.5. Projection methods for phase retrieval problems

7. Concluding remarks

Acknowledgments