In the main article, we have presented two methods for the minimization of sparsity functionals of the form
$$\displaystyle{ \mathcal{T}_{\alpha,v^{\delta }}(u) =\Vert \mathbf{A}u - v^{\delta }\Vert ^{2} +\alpha \mathcal{R}(u) }$$
(24)
with
$$\displaystyle{\mathcal{R}(u) =\sum _{\lambda \in \Lambda }w_{\lambda }\vert \langle \phi _{\lambda },u\rangle \vert,}$$
namely, iterative thresholding and second-order cone programs. In this appendix, we will discuss some additional methods.
Numerical methods for sparsity minimization can be divided into two categories: first, methods that attempt to minimize the (non-smooth) functional (24) directly and, second, methods that approximate the functional (24) by a differentiable one and then find a minimum of the approximation. In contrast to the non-smooth original problem, the corresponding optimality condition is then a single-valued equation with a unique solution. Both methods introduced in the main article treat the direct problem. We now discuss first two approximation approaches and then two direct methods.
Usually, the approximating functionals can be written in the form
$$\displaystyle{ \mathcal{T}_{\alpha,\epsilon,v^{\delta }}(u) =\Vert \mathbf{A}u - v^{\delta }\Vert ^{2} +\alpha \mathcal{R}_{\epsilon }(u) }$$
(25)
with
$$\displaystyle{\mathcal{R}_{\epsilon }(u) =\sum _{\lambda \in \Lambda }w_{\lambda }\psi (\langle \phi _{\lambda },u\rangle,\epsilon ).}$$
Here we assume that the function \(\psi \,: \,\mathbb{R} \times \mathbb{R}_{\geq 0} \rightarrow \mathbb{R}\) has the following properties:
- The function ψ satisfies \(\psi (s,\epsilon ) \rightarrow \vert s\vert\) as \(\epsilon \rightarrow 0\), for every \(s \in \mathbb{R}\).
- For every ε > 0 the function ψ( ⋅ , ε) is convex and differentiable.
Typical examples of functions ψ are
1. \(\psi (s,\epsilon ) = \sqrt{s^{2 } +\epsilon ^{2}}\),
2. \(\psi (s,\epsilon ) = s^{2}/(2\epsilon )\) for \(\vert s\vert \leq \epsilon\) and \(\psi (s,\epsilon ) =\vert s\vert -\epsilon /2\) for \(\vert s\vert \geq \epsilon\),
3. \(\psi (s,\epsilon ) =\vert s\vert ^{1+\epsilon }\),
to name but a few. The advantage of these approaches is that the subdifferential of
\(\mathcal{T}_{\alpha,\epsilon,v^{\delta }}\) is at most single-valued, and therefore the minimization can be performed with gradient-based algorithms.
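To make the three examples above concrete, the following short Python sketch implements them and checks numerically that they approach |s| as ε → 0. The code and the function names psi_sqrt, psi_huber, and psi_power are ours, chosen only for illustration.

```python
import numpy as np

def psi_sqrt(s, eps):
    # example 1: sqrt(s^2 + eps^2)
    return np.sqrt(s**2 + eps**2)

def psi_huber(s, eps):
    # example 2: quadratic for |s| <= eps, linear with offset eps/2 otherwise
    return np.where(np.abs(s) <= eps, s**2 / (2 * eps), np.abs(s) - eps / 2)

def psi_power(s, eps):
    # example 3: |s|^(1 + eps)
    return np.abs(s) ** (1 + eps)

# all three tend to |s| as eps -> 0
s = np.linspace(-2.0, 2.0, 9)
for eps in (1.0, 0.1, 0.001):
    print(eps, np.max(np.abs(psi_sqrt(s, eps) - np.abs(s))))
```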
Consider now
ε > 0 fixed. Minimization of the functional
\(\mathcal{T}_{\alpha,\epsilon,v^{\delta }}\) can be performed with gradient-type methods, such as Landweber iteration, or (quasi-)Newton methods. In the case of
\(\psi (s,\epsilon ) = \sqrt{s^{2 } +\epsilon ^{2}}\), the gradient-based method reads as follows:
$$\displaystyle{ u^{(n+1)} = u^{(n)} -\tau _{ n}{\Bigl [2\mathbf{A}^{{\ast}}(\mathbf{A}u^{(n)} - v^{\delta }) +\alpha \sum _{\lambda \in \Lambda }w_{\lambda } \frac{\langle \phi _{\lambda },u^{(n)}\rangle } {\sqrt{\langle \phi _{\lambda }, u^{(n) } \rangle ^{2 } +\epsilon ^{2}}}\phi _{\lambda }\Bigr ]}\;. }$$
(26)
Here \(\tau _{n}\) is some positive step size that can, for instance, be defined by a line search.
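A possible realization of the iteration (26) in the finite-dimensional case is sketched below. For brevity we assume that the coefficients ⟨φ_λ, u⟩ are simply the components of u, and we replace the line search by a fixed, conservative step size; the function name and these choices are ours, not part of the text.

```python
import numpy as np

def smoothed_gradient_descent(A, v, alpha, w, eps, n_iter=500):
    """Gradient iteration (26) for psi(s, eps) = sqrt(s^2 + eps^2),
    assuming <phi_lambda, u> = u_lambda."""
    n = A.shape[1]
    u = np.zeros(n)
    # conservative constant step size from a Lipschitz bound of the gradient
    tau = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2 + alpha * np.max(w) / eps)
    for _ in range(n_iter):
        grad_data = 2.0 * A.T @ (A @ u - v)
        grad_reg = alpha * w * u / np.sqrt(u**2 + eps**2)
        u = u - tau * (grad_data + grad_reg)
    return u
```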
A second approximation approach, iteratively reweighted least squares, is based on the identity
$$\displaystyle{\mathcal{R}(u) =\sum _{\lambda \in \Lambda } \frac{w_{\lambda }} {\vert \langle \phi _{\lambda },u\rangle \vert }\langle \phi _{\lambda },u\rangle ^{2}.}$$
Thus, \(\ell ^{1}\)-regularization can be considered as quadratic regularization with weights \(w_{\lambda }/\vert \langle \phi _{\lambda },u\rangle \vert\) depending on the minimizer \(u\). We now define
$$\displaystyle{\mathcal{S}(u,\tilde{u},\epsilon ):=\sum _{\lambda \in \Lambda }w_{\lambda }\rho _{\epsilon }(\vert \langle \phi _{\lambda },\tilde{u}\rangle \vert )\langle \phi _{\lambda },u\rangle ^{2},}$$
where the functions \(\rho _{\epsilon }\) satisfy \(\rho _{\epsilon }(t) \rightarrow 1/t\) for \(\epsilon \rightarrow 0\). A typical choice is
$$\displaystyle{\rho _{\epsilon }(t) = \frac{1} {\sqrt{t^{2 } +\epsilon ^{2}}}.}$$
Starting with some initial guess \(u^{(0)}\) of the minimizer of \(\mathcal{T}_{\alpha,v^{\delta }}\), one then chooses some sequence of positive numbers \(\epsilon _{n} \rightarrow 0\) and defines inductively
$$\displaystyle{u^{(n+1)}:=\mathop{ \mathrm{arg\,min}}\limits _{ u}{\Bigl [\Vert \mathbf{A}u - v^{\delta }\Vert ^{2} +\alpha \mathcal{S}(u,u^{(n)},\epsilon _{ n})\Bigr ]}.}$$
Note that in each iteration one has to minimize a quadratic functional, which amounts to solving a linear equation.
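In the finite-dimensional case with ⟨φ_λ, u⟩ = u_λ, each quadratic subproblem reduces to a weighted normal equation. The sketch below (our own illustrative code, not taken from the cited references) solves it directly and uses the typical choice of ρ_ε given above; the geometric decay of ε_n is an assumption made only for the example.

```python
import numpy as np

def irls_sparse(A, v, alpha, w, n_iter=30, eps0=1.0, eps_decay=0.7):
    """Iteratively reweighted quadratic minimization sketched above,
    assuming <phi_lambda, u> = u_lambda and rho_eps(t) = 1/sqrt(t^2 + eps^2)."""
    u = np.linalg.lstsq(A, v, rcond=None)[0]    # initial guess u^(0)
    AtA, Atv = A.T @ A, A.T @ v
    eps = eps0
    for _ in range(n_iter):
        rho = 1.0 / np.sqrt(u**2 + eps**2)       # weights rho_eps(|<phi, u>|)
        # quadratic subproblem: minimize ||A u - v||^2 + alpha * sum(w * rho * u^2),
        # i.e. solve the normal equations (A^T A + alpha * diag(w * rho)) u = A^T v
        u = np.linalg.solve(AtA + alpha * np.diag(w * rho), Atv)
        eps *= eps_decay                          # eps_n -> 0
    return u
```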
This and related methods have been considered by several authors. A theoretical analysis can, for instance, be found in Daubechies et al. (2010).
Turning now to the direct minimization of the non-smooth functional (24), we first recall the iterative thresholding algorithm defined by
$$\displaystyle{u_{n+1}:= \mathbf{S}_{\mu \alpha \mathbf{w},1}{\bigl (u_{n} +\mu \mathbf{A}^{{\ast}}(v^{\delta } -\mathbf{A}u_{ n})\bigr )},}$$
where the thresholding operator \(\mathbf{S}\) is as in Sect. 4.1. In Beck and Teboulle (2009), an improvement of the iterative thresholding algorithm has been proposed, where the iterates are defined as linear combinations of consecutive thresholded iterates. More precisely, given some starting point \(u_{0} \in U\), the iteration is defined by
$$\displaystyle{\begin{array}{rcl} y_{n}& =&\mathbf{S}_{\mu \alpha \mathbf{w},1}{\bigl (u_{n} +\mu \mathbf{A}^{{\ast}}(v^{\delta } -\mathbf{A}u_{n})\bigr )}, \\ t_{n+1} & =&\frac{1+\sqrt{1+4t_{n }^{2}}} {2} \,, \\ u_{n+1} & =&y_{n} + \frac{t_{n}-1} {t_{n+1}} (y_{n} - y_{n-1}), \end{array} }$$
with the initialization \(t_{0} = 1\). We note that the first step of this algorithm is an ordinary thresholding step, and thus an initialization of \(y_{-1}\) is not necessary.
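A compact implementation of this accelerated scheme for the finite-dimensional case ⟨φ_λ, u⟩ = u_λ is sketched below. The code is ours; in particular, it assumes that \(\mathbf{S}_{\mathbf{w},1}(z)\) is the minimizer of \(\Vert u - z\Vert ^{2} +\sum _{\lambda }w_{\lambda }\vert u_{\lambda }\vert\), i.e., componentwise soft thresholding with thresholds \(w_{\lambda }/2\); if Sect. 4.1 uses a different normalization, the threshold in the code has to be adjusted accordingly.

```python
import numpy as np

def soft_threshold(x, t):
    # componentwise soft thresholding, the proximal mapping of t * |.|
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista(A, v, alpha, w, n_iter=200):
    """Accelerated iterative thresholding in the form given above,
    assuming <phi_lambda, u> = u_lambda and S_{w,1} = soft thresholding by w/2."""
    n = A.shape[1]
    mu = 0.99 / np.linalg.norm(A, 2) ** 2     # ensures mu * ||A^* A|| < 1
    u = np.zeros(n)
    y_old = u.copy()
    t = 1.0                                    # t_0 = 1
    for _ in range(n_iter):
        y = soft_threshold(u + mu * A.T @ (v - A @ u), 0.5 * mu * alpha * w)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        u = y + (t - 1.0) / t_next * (y - y_old)
        y_old, t = y, t_next
    return y_old
```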
It has been shown in Beck and Teboulle (2009) that this algorithm converges provided that the shrinkage parameter μ is chosen in such a way that \(\mu \Vert \mathbf{A}^{{\ast}}\mathbf{A}\Vert <1\). This is precisely the same constraint as in the classical iterative thresholding algorithm. In addition, the estimate
$$\displaystyle{\mathcal{T}_{\alpha,v^{\delta }}(u_{k}) -\mathcal{T}_{\alpha,v^{\delta }}(u_{\alpha }^{\delta }) \leq \frac{2\Vert u_{0} - u_{\alpha }^{\delta }\Vert ^{2}} {\mu (k + 1)^{2}} }$$
holds, which means that the accuracy of the iterate \(u_{k}\), measured in terms of the energy \(\mathcal{T}_{\alpha,v^{\delta }}\), is of order \(O(1/k^{2})\). In contrast, the classical iterative thresholding algorithm only yields an accuracy of order \(O(1/k)\).
In order to define augmented Lagrangian methods for the minimization of
\(\mathcal{T}_{\alpha,v^{\delta }}\), we consider the equivalent formulation as the constrained optimization problem
$$\displaystyle{\Vert r\Vert ^{2} +\alpha \mathcal{R}(u) \rightarrow \min \qquad \text{ subject to }\mathbf{A}u + r = v^{\delta }.}$$
The augmented Lagrangian of this optimization problem is the mapping
\(\mathcal{L}_{\beta }\,: \,U \times V \times V \rightarrow (-\infty,+\infty ]\),
$$\displaystyle{\mathcal{L}_{\beta }(u,r,\xi ) =\Vert r\Vert ^{2} +\alpha \mathcal{R}(u) -\langle \xi,\mathbf{A}u + r - v^{\delta }\rangle +\beta \Vert \mathbf{A}u + r - v^{\delta }\Vert ^{2}\,,}$$
depending on the parameter
β > 0.
In order to find a saddle point of \(\mathcal{L}_{\beta }\), we perform the iteration (see Yang and Zhang 2011)
$$\displaystyle{\begin{array}{rcl} u_{n+1} & =&\mathbf{S}_{\mu \alpha \mathbf{w}/(2\beta ),1}{\bigl (u_{n} +\mu \mathbf{A}^{{\ast}}(v^{\delta } + r_{n} -\mathbf{A}u_{n} -\xi _{n}/(2\beta ))\bigr )}, \\ r_{n+1} & =& \frac{1} {1+\beta }{\bigl (\xi _{n}/2 -\beta (\mathbf{A}u_{n+1} - v^{\delta })\bigr )}, \\ \xi _{n+1} & =&\xi _{n} -\gamma \beta {\bigl (\mathbf{A}u_{n+1} + r_{n+1} - v^{\delta }\bigr )}.\end{array} }$$
We note that the update step for r is simply an explicit computation of the minimizer of \(\mathcal{L}_{\beta }(u_{n+1},\cdot,\xi _{n})\). Similarly, the update step for u can be interpreted as an approximate minimization of \(\mathcal{L}_{\beta }(\cdot,r_{n},\xi _{n})\). Therefore, the iteration can be regarded as an inexact alternating direction method. It has been shown in Yang and Zhang (2011) that this algorithm converges if the parameters μ, γ > 0 satisfy the relation \(\mu \Vert \mathbf{A}^{{\ast}}\mathbf{A}\Vert + 2\gamma <2\).
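The sketch below illustrates such an inexact alternating direction iteration for the finite-dimensional case ⟨φ_λ, u⟩ = u_λ. It is our own code, derived directly from \(\mathcal{L}_{\beta }\) rather than transcribed from the displayed update formulas: the r- and ξ-updates are as above, while the u-step is written as one proximal gradient step on \(\mathcal{L}_{\beta }(\cdot,r_{n},\xi _{n})\), so the constants and sign conventions follow our own derivation and not necessarily the precise form used in Yang and Zhang (2011).

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def augmented_lagrangian_l1(A, v, alpha, w, beta=1.0, gamma=0.5, n_iter=300):
    """Inexact alternating direction iteration for
    min ||r||^2 + alpha * sum(w * |u|)  subject to  A u + r = v,
    assuming <phi_lambda, u> = u_lambda."""
    m, n = A.shape
    u, r, xi = np.zeros(n), v.copy(), np.zeros(m)   # feasible start: A*0 + v = v
    mu = 0.9 / np.linalg.norm(A, 2) ** 2    # with gamma = 0.5: mu*||A^*A|| + 2*gamma < 2
    t = mu / (2.0 * beta)                    # step size of the proximal u-step
    for _ in range(n_iter):
        # one proximal gradient step on the smooth part of L_beta(., r, xi)
        grad_u = -A.T @ xi + 2.0 * beta * A.T @ (A @ u + r - v)
        u = soft_threshold(u - t * grad_u, t * alpha * w)
        # exact minimization of L_beta(u, ., xi) over r
        r = (xi / 2.0 - beta * (A @ u - v)) / (1.0 + beta)
        # multiplier update
        xi = xi - gamma * beta * (A @ u + r - v)
    return u
```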
In addition to this algorithm, augmented Lagrangian methods based on a dual problem have also been proposed in Yang and Zhang (2011). We refer to that paper for further information.