5.1 Proof Sketch
The proof of Theorem
1 is based on the design and analysis of a
generalized approximate message passing (GAMP) algorithm. GAMP is a class of iterative algorithms proposed by Rangan [
44] for estimation in generalized linear models. A GAMP algorithm is defined in terms of a sequence of Lipschitz functions
\(f_t:\mathbb R\rightarrow \mathbb R\) and
\(g_t:\mathbb R \times \mathbb {R}\rightarrow \mathbb R\), for
\(t \ge 0\). For
\(t\ge 0\), the GAMP iteration computes:
$$\begin{aligned} \begin{aligned} {\varvec{u}}^t&= \frac{1}{\sqrt{\delta }}{\varvec{A}}f_t({\varvec{v}}^t)-\mathsf{b}_t g_{t-1}({\varvec{u}}^{t-1}; {\varvec{y}}),\\ {\varvec{v}}^{t+1}&= \frac{1}{\sqrt{\delta }}{\varvec{A}}^\mathsf{T}g_t({\varvec{u}}^t; {\varvec{y}}) - \mathsf{c}_t f_t({\varvec{v}}^t). \end{aligned} \end{aligned}$$
(48)
Here, the functions
\(f_t\) and
\(g_t\) are understood to be applied component-wise, i.e.,
\(f_t({\varvec{v}}^t)=(f_t(v^t_1), \ldots , f_t(v^t_d))\) and
\(g_t({\varvec{u}}^t; {\varvec{y}})=(g_t(u^t_1; y_1), \ldots , g_t(u^t_n; y_n))\). The scalars
\(\mathsf{b}_t, \mathsf{c}_t\) are defined as
$$\begin{aligned} \mathsf{b}_t =\frac{1}{n}\sum _{i=1}^d f_t'(v_i^t), \qquad \mathsf{c}_t = \frac{1}{n}\sum _{i=1}^n g_t'(u_i^t; y_i), \end{aligned}$$
(49)
where
\(g_t'(\cdot , \cdot )\) denotes the derivative with respect to the first argument. The iteration (
48) is initialized with
$$\begin{aligned} {\varvec{u}}^{0}= c {\varvec{1}}_n, \qquad {\varvec{v}}^1 = \frac{1}{\sqrt{\delta }} {\varvec{A}}^{\mathsf{T}} g_{0}({\varvec{u}}^{0} ; {\varvec{y}}), \end{aligned}$$
(50)
for some constant
\(c>0\). Here,
\({\varvec{1}}_n \in \mathbb {R}^n\) denotes the all-ones vector.
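To make the structure of the iteration concrete, here is a minimal NumPy sketch of (48)–(50). The observation model, the choice \(f_t = \tanh\), and the toy \(g_t\) below are illustrative placeholders only, not the functions used in the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 100
delta = n / d  # aspect ratio delta = n/d

A = rng.standard_normal((n, d)) / np.sqrt(n)  # assumed normalization of A
y = rng.standard_normal(n)                    # placeholder observations

# Toy Lipschitz denoisers (placeholders, not the proof's choices):
f = np.tanh                                   # f_t, applied component-wise
df = lambda v: 1.0 - np.tanh(v) ** 2          # f_t'
g = lambda u, y: np.tanh(u) + 0.1 * y         # g_t(u; y)
dg = lambda u, y: 1.0 - np.tanh(u) ** 2       # derivative in the first argument

# Initialization (50): u^0 = c 1_n, v^1 = A^T g_0(u^0; y) / sqrt(delta), with c = 1
u = np.ones(n)
v = A.T @ g(u, y) / np.sqrt(delta)

for t in range(10):
    b = df(v).sum() / n                                # Onsager coefficient b_t in (49)
    u_new = A @ f(v) / np.sqrt(delta) - b * g(u, y)    # memory term uses g_{t-1}(u^{t-1}; y)
    c_t = dg(u_new, y).sum() / n                       # coefficient c_t in (49)
    v = A.T @ g(u_new, y) / np.sqrt(delta) - c_t * f(v)
    u = u_new
```

The two Onsager correction terms, weighted by \(\mathsf{b}_t\) and \(\mathsf{c}_t\), are what distinguish GAMP from a naive alternating iteration.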
A key feature of the GAMP algorithm (
48) is that the asymptotic empirical distribution of its iterates can be succinctly characterized via a deterministic recursion, called
state evolution. Hence, the behavior of the high-dimensional iterates
\({\varvec{u}}^t, {\varvec{v}}^t\) is captured by a scalar recursion. Specifically, this result gives that, for
\(t \ge 1\), the empirical distributions of
\({\varvec{u}}^t\) and
\({\varvec{v}}^t\) converge in
\(W_k\) distance to the laws of the random variables
\(U_t\) and
\(V_t\), respectively, with
$$\begin{aligned} U_t&\equiv \mu _{U,t} G + \sigma _{U,t} W_{U,t}, \end{aligned}$$
(51)
$$\begin{aligned} V_t&\equiv \mu _{V,t} X + \sigma _{V,t} W_{V,t}, \end{aligned}$$
(52)
where
\(G\) and \(W_{U,t}\) are i.i.d. \(\mathsf{N}(0,1)\), while
\(X \sim P_X\) and
\(W_{V,t} \sim \mathsf{N}(0,1)\) are independent. The deterministic parameters
\((\mu _{U,t}, \sigma _{U,t}, \mu _{V,t}, \sigma _{V,t})\) are computed via the recursion (
56) detailed in Sect.
5.2, and the formal statement of the state evolution result is contained in Proposition
1 (again in Sect.
5.2).
Next, in Sect.
5.3, we show that a suitable choice of the functions
\(f_t, g_t\) leads to a GAMP algorithm such that
(i) \({\varvec{v}}^1 = \sqrt{d} \, \hat{\varvec{x}}^\mathrm{L}\), and
(ii) \({\varvec{v}}^t\) is aligned with
\(\sqrt{d} \, \hat{\varvec{x}}^\mathrm{s}\) as
t grows large. Specifically, choosing
\(g_0(u; y)=\mathcal T_L(y)/\sqrt{\delta }\) in (
50) immediately gives
\({\varvec{v}}^1 = \sqrt{d} \, \hat{\varvec{x}}^\mathrm{L}\). In order to approach the spectral estimator as
\(t\rightarrow \infty \), we pick
\(f_t, g_t\) to be linear functions; see (
68). The idea is that, with this choice of
\(f_t\) and
\(g_t\), the GAMP iteration is effectively a power method. Let us now briefly discuss why this is the case. With the choice of
\(f_t, g_t\) in (
68), the GAMP iteration can be expressed as
$$\begin{aligned} \begin{aligned} {\varvec{u}}^t&= \frac{1}{\sqrt{\delta } \, \beta _t} \big [ {\varvec{A}}{\varvec{v}}^t \, - \, {\varvec{Z}}{\varvec{u}}^{t-1} \big ], \\ {\varvec{v}}^{t+1}&= {\varvec{A}}^\mathsf{T}{\varvec{Z}}{\varvec{u}}^t - \frac{\sqrt{\delta }}{\beta _t} \, {\mathbb E}\{ Z \} \, {\varvec{v}}^t, \end{aligned} \end{aligned}$$
(53)
where
\({\varvec{Z}}= \mathrm{diag}({\mathcal {T}}(y_1), \ldots , {\mathcal {T}}(y_n))\),
\(Z={\mathcal {T}}(Y)\), and the function
\(\mathcal T:\mathbb R\rightarrow \mathbb R\) is defined later in terms of the spectral preprocessing function
\({\mathcal {T}}_s\); see (
82). Then, Lemma
4 analyzes the fixed points of the state evolution of the GAMP algorithm (
53), and Lemma
6 proves that in the high-dimensional limit, the vector
\({\varvec{v}}^t\) tends to align with the principal eigenvector of the matrix
\({\varvec{M}}_n= {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}\) as
\(t \rightarrow \infty \). Here, we provide a heuristic sanity check for this last claim. Assume that the iterates
\({\varvec{v}}^t\) and
\({\varvec{u}}^t\) converge to the limits
\({\varvec{v}}^\infty \) and
\({\varvec{u}}^\infty \), respectively, in the sense that
\(\lim _{t\rightarrow \infty }\frac{1}{d}\Vert {\varvec{v}}^t-{\varvec{v}}^\infty \Vert ^2=0\) and
\(\lim _{t\rightarrow \infty }\frac{1}{n}\Vert {\varvec{u}}^t-{\varvec{u}}^\infty \Vert ^2=0\). Then, from (
53), the limits
\({\varvec{v}}^\infty \) and
\({\varvec{u}}^\infty \) satisfy
$$\begin{aligned} \begin{aligned} {\varvec{u}}^\infty&= \frac{1}{\sqrt{\delta } \, \beta _\infty } \big [ {\varvec{A}}{\varvec{v}}^\infty \, - \, {\varvec{Z}}{\varvec{u}}^{\infty } \big ], \\ {\varvec{v}}^{\infty }&= {\varvec{A}}^\mathsf{T}{\varvec{Z}}{\varvec{u}}^\infty - \frac{\sqrt{\delta }}{\beta _\infty } \, {\mathbb E}\{ Z \} \, {\varvec{v}}^\infty . \end{aligned} \end{aligned}$$
(54)
Furthermore, from the analysis of the fixed points of state evolution of Lemma
4, we obtain that
\(\beta _\infty =\sqrt{\delta }\mathbb E\{Z(G^2-1)\}\). Thus, after some manipulations, (
54) can be rewritten as
$$\begin{aligned} \begin{aligned} {\varvec{u}}^\infty&= ({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{v}}^\infty \\ \left( 1+\frac{ {\mathbb E}\{ Z \}}{\mathbb E\{Z(G^2-1)\}} \,\right) {\varvec{v}}^{\infty }&= {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{v}}^\infty . \end{aligned} \end{aligned}$$
(55)
Hence,
\({\varvec{v}}^\infty \) is an eigenvector of the matrix
\({\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}\), and the GAMP algorithm is effectively performing a power method. Finally, we choose
\(\mathcal T\) (and consequently
\({\varvec{Z}}\)) so that
\({\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1}={\varvec{Z}}_s\), with
\({\varvec{Z}}_s\) given by (
12). In this way, the matrix
\({\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}\) coincides with the spectral matrix
\({\varvec{D}}_n\), as defined in (
13), and therefore
\({\varvec{v}}^t\) approaches the spectral estimator
\(\hat{\varvec{x}}^\mathrm{s}\).
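The heuristic above says that, at its fixed point, the linear-\(f_t, g_t\) GAMP behaves like a power method for a matrix of the form \({\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+c\,{\varvec{I}}_n)^{-1}{\varvec{A}}\), as in (55). The following sketch, with synthetic stand-ins for \({\varvec{Z}}\) and for the constant \(\delta \mathbb E\{Z(G^2-1)\}\), checks the basic mechanism: power iteration on such a matrix converges to its principal eigenvector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 300, 30
A = rng.standard_normal((n, d)) / np.sqrt(n)
z = 1.0 + rng.random(n)       # stand-in for the diagonal entries T(y_i) of Z
c = 0.5                       # stand-in for the constant delta * E{Z(G^2 - 1)}

D = z / (z + c)               # diagonal of Z (Z + c I)^{-1}
M = A.T @ (D[:, None] * A)    # A^T Z (Z + c I)^{-1} A: symmetric positive definite

v = rng.standard_normal(d)
for _ in range(5000):         # power iteration
    v = M @ v
    v /= np.linalg.norm(v)

lam = v @ M @ v               # Rayleigh quotient: approximates the top eigenvalue
resid = np.linalg.norm(M @ v - lam * v)   # small iff v is (nearly) an eigenvector
```

Since \({\varvec{Z}}\) and \(({\varvec{Z}}+c{\varvec{I}}_n)^{-1}\) are diagonal, the matrix \({\varvec{M}}\) above is symmetric, so the power iteration converges to the principal eigenvector whenever the spectrum has a gap at the top.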
In conclusion, as
\({\varvec{v}}^1 = \sqrt{d} \hat{\varvec{x}}^\mathrm{L}\) and
\({\varvec{v}}^t\) tends to be aligned with
\(\sqrt{d} \hat{\varvec{x}}^\mathrm{s}\) for large
t, we can use the state evolution result of Proposition
1 to analyze the joint empirical distribution of
\(({\varvec{x}}, \sqrt{d} \hat{\varvec{x}}^\mathrm{L}, \sqrt{d} \hat{\varvec{x}}^\mathrm{s})\). At this point, the proof of Theorem
1 becomes a straightforward application of Lemma
6 and is presented at the end of Sect.
5.3. The proof of Lemma
6 is quite long, so it is provided separately in Sect.
5.4.
5.3 GAMP as a Power Method to Compute the Spectral Estimator
Consider the GAMP iteration in (
48) initialized with
\({\varvec{u}}^0 = \frac{1}{\delta } {\varvec{1}}_n\), and the function
\(g_0:\mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) chosen as
$$\begin{aligned} g_0(u; \, y) = \frac{{\mathcal {T}}_L(y)}{\sqrt{\delta }}, \end{aligned}$$
(66)
so that
$$\begin{aligned} {\varvec{v}}^1 = \frac{1}{\delta }{\varvec{A}}^{\mathsf{T}} {\mathcal {T}}_L({\varvec{y}}). \end{aligned}$$
(67)
(The function
\(f_0\) is assumed to be zero.) From (
10), we note that
\({\varvec{v}}^1 = \sqrt{d} \, \hat{\varvec{x}}^\mathrm{L}\).
For
\(t \ge 1\), let the functions
\(g_t: \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) and
\(f_t:\mathbb {R}\rightarrow \mathbb {R}\) be chosen as
$$\begin{aligned} g_t(u; \, y) = \sqrt{\delta } \, u {\mathcal {T}}(y), \qquad f_{t}(v) = \frac{v}{\beta _{t}}, \end{aligned}$$
(68)
where the function
\({\mathcal {T}}: \mathbb {R}\rightarrow \mathbb {R}\) is bounded and Lipschitz, and
\(\beta _t\) is a constant, defined iteratively for
\(t \ge 1\) via the state evolution equations below (Eqs. (
72)–(
74)). To prove Theorem
1, we will choose
\({\mathcal {T}}\) as a suitable function of
\({\mathcal {T}}_s\) (see (
82)). Note that the functions
\(g_t\) and
\(f_t\) are required to be Lipschitz for
\(t\ge 0\), and this will be ensured by choosing
\({\mathcal {T}}\) to be bounded and Lipschitz (see Lemma
6).
With this choice of
\(f_t, g_t\), for
\(t\ge 1\), the scalars in (
49) are given by
$$\begin{aligned} \mathsf{b}_t =\frac{1}{\delta \beta _t}, \qquad \mathsf{c}_t = \sqrt{\delta } \cdot \frac{1}{n}\sum _{i=1}^n {\mathcal {T}}(y_i). \end{aligned}$$
(69)
In the GAMP iteration below, we will replace the parameter
\(\mathsf{c}_t\) by its almost sure limit
\(\bar{\mathsf{c}}_t = \sqrt{\delta } \, {\mathbb E}\{Z\}\), where
\(Z \triangleq {\mathcal {T}}(Y)\). The state evolution result in Proposition
1 still holds when
\(\mathsf{c}_t\) is replaced with
\(\bar{\mathsf{c}}_t\) in the GAMP iteration [
27,
44]. This can be shown using the pseudo-Lipschitz property of the test functions
\(\psi \) in Eqs. (
61)–(
62) and the fact that
\( \lim _{n \rightarrow \infty } \frac{1}{n}\sum _{i=1}^n {\mathcal {T}}(y_i) = {\mathbb E}\{ Z \}\) almost surely, due to the strong law of large numbers.
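As a quick numerical illustration of this replacement (with an assumed toy law for \(Y\) and \({\mathcal {T}}=\tanh\), neither of which is the proof's choice), the empirical average \(\frac{1}{n}\sum _{i=1}^n {\mathcal {T}}(y_i)\) concentrates around \({\mathbb E}\{Z\}\):

```python
import numpy as np

rng = np.random.default_rng(2)

T = np.tanh                     # stand-in for the bounded Lipschitz function T

# Toy observation law (illustrative only): Y = G^2 + small noise.
def sample_y(m):
    g = rng.standard_normal(m)
    return g**2 + 0.1 * rng.standard_normal(m)

EZ = T(sample_y(2_000_000)).mean()   # Monte Carlo proxy for E{Z} = E{T(Y)}

n = 200_000
c_emp = T(sample_y(n)).mean()        # empirical (1/n) sum_i T(y_i)
gap = abs(c_emp - EZ)                # shrinks as n grows, by the SLLN
```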
With these choices, the GAMP iteration in (
48) is as follows. Initializing with
$$\begin{aligned} {\varvec{u}}^0 = \frac{1}{\delta } {\varvec{1}}_n, \qquad {\varvec{v}}^1 = \frac{1}{\delta } {\varvec{A}}^{\mathsf{T}} {\mathcal {T}}_L({\varvec{y}}), \end{aligned}$$
(70)
we have for
\(t \ge 1\):
$$\begin{aligned} \begin{aligned} {\varvec{u}}^t&= {\left\{ \begin{array}{ll} \frac{1}{\sqrt{\delta } \, \beta _t} \big [ {\varvec{A}}{\varvec{v}}^t \, - \, {\varvec{Z}}_L {\varvec{u}}^{t-1} \big ], &{} \text { for } t=1, \\ \frac{1}{\sqrt{\delta } \, \beta _t} \big [ {\varvec{A}}{\varvec{v}}^t \, - \, {\varvec{Z}}{\varvec{u}}^{t-1} \big ], &{} \text { for } t >1, \end{array}\right. }\\ {\varvec{v}}^{t+1}&= {\varvec{A}}^\mathsf{T}{\varvec{Z}}{\varvec{u}}^t - \frac{\sqrt{\delta }}{\beta _t} \, {\mathbb E}\{ Z \} \, {\varvec{v}}^t, \end{aligned} \end{aligned}$$
(71)
where
\({\varvec{Z}}_L = \mathrm{diag}({\mathcal {T}}_L(y_1), \ldots , {\mathcal {T}}_L(y_n))\) and
\({\varvec{Z}}= \mathrm{diag}({\mathcal {T}}(y_1), \ldots , {\mathcal {T}}(y_n))\). The state evolution equations in (
56) become:
$$\begin{aligned}&\mu _{U,t} = \frac{\mu _{V,t}}{\sqrt{\delta } \beta _t}, \qquad \sigma _{U, t}^2 = \frac{\sigma _{V,t}^2}{ \delta \, \beta ^2_t}, \end{aligned}$$
(72)
$$\begin{aligned}&\mu _{V,t+1} = \sqrt{\delta } \,\frac{\mu _{V,t}}{\beta _t} \big [ {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\} \big ], \qquad \sigma _{V,t+1}^2 = \frac{1}{\beta _t^2}\big [ \mu _{V,t}^2 {\mathbb E}\{Z^2G^2\} + \sigma _{V,t}^2 {\mathbb E}\{ Z^2\} \big ], \end{aligned}$$
(73)
$$\begin{aligned}&\beta _{t+1} = \sqrt{\mu _{V,{t+1}}^2 + \sigma _{V,{t+1}}^2}. \end{aligned}$$
(74)
Here, we recall that
\(G \sim \mathsf{N}(0,1)\) and
\(Z={\mathcal {T}}(Y)\), for
\(Y \sim p_{Y|G}(\cdot \mid G)\). From (
57), the state evolution iteration is initialized with the following:
$$\begin{aligned} \mu _{V,1} = {\mathbb E}\{ {\mathcal {T}}_L(Y) G \}, \qquad \sigma _{V,1}^2 = \frac{1}{\delta }{\mathbb E}\{ {\mathcal {T}}_L(Y)^2 \}, \qquad \beta _1 = \sqrt{\mu _{V,1}^2 + \sigma _{V,1}^2}. \end{aligned}$$
(75)
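The recursion (72)–(75) is a simple scalar iteration once the moments of \(Z\) are available. The sketch below runs it under assumed toy choices (observation law \(Y = G + \text{noise}\), \({\mathcal {T}}_L(y)=\mathrm{clip}(y,-2,2)\), \({\mathcal {T}}(y)=\tanh (y^2)\), and \(\delta = 5\), none of which come from the paper), with expectations estimated by Monte Carlo. For these choices one can check that \({\mathbb E}\{Z(G^2-1)\}>0\), and the fixed point satisfies \(\beta _\infty = \sqrt{\delta }\, \mathbb E\{Z(G^2-1)\}\), consistent with the analysis of Lemma 4 quoted in the proof sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
delta = 5.0

# Toy model and preprocessing (illustrative assumptions, not the proof's choices).
G = rng.standard_normal(1_000_000)
Y = G + 0.1 * rng.standard_normal(G.size)
TL = np.clip(Y, -2.0, 2.0)      # stand-in for T_L
Z = np.tanh(Y**2)               # stand-in for Z = T(Y), bounded and Lipschitz

EZ, EZG2 = Z.mean(), (Z * G**2).mean()
EZ2, EZ2G2 = (Z**2).mean(), (Z**2 * G**2).mean()

# Initialization (75).
mu = (TL * G).mean()
sig2 = (TL**2).mean() / delta
beta = np.sqrt(mu**2 + sig2)

# State evolution recursion (73)-(74); both updates use the previous iterates.
for _ in range(200):
    mu, sig2 = (np.sqrt(delta) * mu / beta * (EZG2 - EZ),
                (mu**2 * EZ2G2 + sig2 * EZ2) / beta**2)
    beta = np.sqrt(mu**2 + sig2)
```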
We will show in Lemma
6 that in the high-dimensional limit, the vector
\({\varvec{v}}^t\) in (
71) tends to align with the principal eigenvector of the matrix
\({\varvec{M}}_n= {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}\) as
\(t \rightarrow \infty \). In other words, the GAMP iteration is equivalent to a power iteration for
\({\varvec{M}}_n\). This equivalence, together with Proposition
1, allows us to precisely characterize the limiting empirical distribution of the eigenvector of
\({\varvec{M}}_n\) in Lemma
6.
We begin with a result characterizing the fixed points of the state evolution recursion in (
72)–(
73).
The lemma is proved in Appendix
E. We note that Lemma
4 (and the subsequent Lemmas
5–
6 as well) assumes that
\({\mathbb E}\{ Z(G^2-1)\} >0\) and
\(\delta > \frac{{\mathbb E}\{ Z^2\}}{ ({\mathbb E}\{ Z G^2 \} \, - \,{\mathbb E}\{Z\})^2}\). These conditions concern the auxiliary random variable
Z (or, equivalently, the auxiliary function
\({\mathcal {T}}\)). In the proof of Theorem
1, which appears at the end of this section, we will provide a choice of
Z (depending on
\(Z_s\)) that fulfills these requirements; see (
82). Let us also point out that the condition
\(\delta > \frac{{\mathbb E}\{ Z^2\}}{ ({\mathbb E}\{ Z G^2 \} \, - \,{\mathbb E}\{Z\})^2}\) follows from
\(\psi _{\delta }'(\lambda _{\delta }^*)> 0\), which in turn ensures that
\(\rho _s>0\). For a discussion of the case
\(\rho _s=0\), see Remark
1.
The next lemma shows that the mean-squared difference between successive GAMP iterates vanishes as \(t \rightarrow \infty \) in the high-dimensional limit.
The proof of the lemma is given in Appendix
F. The next result is the main technical lemma needed to prove Theorem
1. It shows that, as
t grows large,
\({\varvec{v}}^t\) tends to be aligned with the principal eigenvector of the matrix
\({\varvec{M}}_n\) defined in (
79). Theorem
1 is then obtained from Lemma
6 by using a suitable choice for
\({\mathcal {T}}(\cdot )\) in the GAMP iteration (
71), which ensures that
\({\varvec{M}}_n\) is a scaled version of
\({\varvec{D}}_n\) in (
13).
We first show how Theorem
1 is obtained from Lemma
6, and then prove the lemma in the following subsection.
5.4 Proof of Lemma 6
Fix
\(c>0\), and let
\(\tilde{{\varvec{Z}}}=c {\varvec{Z}}\) and
\(\tilde{Z}=c Z\). Then,
$$\begin{aligned} \begin{aligned} {\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1}&= \tilde{{\varvec{Z}}}(\tilde{{\varvec{Z}}}+\delta \mathbb E\{\tilde{Z}(G^2-1)\}{\varvec{I}}_n)^{-1}. \end{aligned} \end{aligned}$$
(88)
Thus, by inspecting the definition (
79), one immediately obtains that
\(\tilde{{\varvec{\varphi }}}^{(1)}\) does not change if we rescale
\({\varvec{Z}}\) and
Z by the multiplicative factor
c. Furthermore, by using the definitions (
76)–(
77), we have that
$$\begin{aligned} \tilde{\beta }^{-1}\tilde{\mu }_{V}&=\sqrt{\frac{\delta \, ( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})^2 - {\mathbb E}\{Z^2\}}{\delta \, ( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})^2 + {\mathbb E}\{Z^2 G^2\} - {\mathbb E}\{Z^2\}}} \nonumber \\&= \sqrt{\frac{\delta \, ( {\mathbb E}\{\tilde{Z}G^2\} - {\mathbb E}\{\tilde{Z}\})^2 - {\mathbb E}\{\tilde{Z}^2\}}{\delta \, ( {\mathbb E}\{\tilde{Z}G^2\} - {\mathbb E}\{\tilde{Z}\})^2 + {\mathbb E}\{\tilde{Z}^2 G^2\} - {\mathbb E}\{\tilde{Z}^2\}}},\end{aligned}$$
(89)
$$\begin{aligned} \tilde{\beta }^{-1}\tilde{\sigma }_{V}&=\sqrt{\frac{{\mathbb E}\{Z^2 G^2\}}{ \delta \, ( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})^2 + {\mathbb E}\{ Z^2 G^2\} - {\mathbb E}\{Z^2\}}}\nonumber \\&=\sqrt{\frac{{\mathbb E}\{\tilde{Z}^2 G^2\}}{ \delta \, ( {\mathbb E}\{\tilde{Z}G^2\} - {\mathbb E}\{\tilde{Z}\})^2 + {\mathbb E}\{ \tilde{Z}^2 G^2\} - {\mathbb E}\{\tilde{Z}^2\}}}, \end{aligned}$$
(90)
and that
$$\begin{aligned} \begin{aligned} \frac{ \tilde{\mu }_{V} \, {\mathbb E}\{ {\mathcal {T}}_L(Y) Z G \}}{ \tilde{\beta } \tilde{\sigma }_{V} \sqrt{ {\mathbb E}\{ {\mathcal {T}}_L(Y)^2\} } }&=\frac{{\mathbb E}\{ {\mathcal {T}}_L(Y) Z G \}}{\sqrt{\delta }( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})}\sqrt{\frac{\delta \, ( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})^2 - {\mathbb E}\{Z^2\}}{{\mathbb E}\{ Z^2 G^2 \}{\mathbb E}\{ {\mathcal {T}}_L(Y)^2\}}}\\&=\frac{{\mathbb E}\{ {\mathcal {T}}_L(Y) \tilde{Z} G \}}{\sqrt{\delta }( {\mathbb E}\{\tilde{Z}G^2\} - {\mathbb E}\{\tilde{Z}\})}\sqrt{\frac{\delta \, ( {\mathbb E}\{\tilde{Z}G^2\} - {\mathbb E}\{\tilde{Z}\})^2 - {\mathbb E}\{\tilde{Z}^2\}}{{\mathbb E}\{ \tilde{Z}^2 G^2 \}{\mathbb E}\{ {\mathcal {T}}_L(Y)^2\}}}. \end{aligned} \end{aligned}$$
(91)
Consequently, both the LHS and the RHS of (
80) are unchanged when we rescale
\({\varvec{Z}}\) and
Z by the multiplicative factor
c.
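The rescaling invariance used above (e.g., in (88)) is easy to check numerically; the diagonal entries and constants below are arbitrary stand-ins for \({\mathcal {T}}(y_i)\), \(\mathbb E\{Z(G^2-1)\}\), and the factor \(c\).

```python
import numpy as np

rng = np.random.default_rng(4)
n, delta = 50, 5.0

z = rng.random(n) + 0.5     # stand-in for the diagonal entries T(y_i) of Z
e = 0.3                     # stand-in for E{Z(G^2 - 1)}
c = 7.0                     # arbitrary rescaling factor

# Diagonal of Z (Z + delta E{Z(G^2-1)} I)^{-1}, before and after Z -> cZ
# (under which E{Z(G^2-1)} -> c E{Z(G^2-1)}): the factor c cancels.
orig = z / (z + delta * e)
scaled = (c * z) / (c * z + delta * (c * e))
```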
The argument above proves that, without loss of generality, we can rescale
\({\varvec{Z}}\) and
Z by any multiplicative factor
\(c>0\). To simplify the rest of the argument, it is convenient to assume the normalization condition
$$\begin{aligned} \mathbb E\{Z(G^2-1)\}=\frac{1}{\delta }, \end{aligned}$$
(92)
under which
\({\varvec{M}}_n = {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}\).
Consider the iteration (
71) with the initialization in (
70). Since by hypothesis,
\(\mu _{V,1}^2 = ({\mathbb E}\{ {\mathcal {T}}_L(Y) G \})^2 >0\), Lemma
4 guarantees that the state evolution recursion (
73) converges to either fixed point
\(\mathsf{FP}_1\) or
\(\mathsf{FP}_2\) as
\(t \rightarrow \infty \). That is,
$$\begin{aligned} \lim _{t \rightarrow \infty } \mu _{V,t}^2 = \tilde{\mu }_V^2, \quad \lim _{t \rightarrow \infty } \sigma ^2_{V,t} = \tilde{\sigma }^2_V, \quad \lim _{t \rightarrow \infty } \beta _t^2 = \tilde{\mu }_V^2 + \tilde{\sigma }^2_V = \tilde{\beta }^2 = \frac{1}{\delta }. \end{aligned}$$
(93)
The last equality above follows by combining (
77) and (
92).
By substituting the expression for
\({\varvec{u}}^t\) in (
71) in the
\({\varvec{v}}^{t+1}\) update, the iteration can be rewritten as follows, for
\(t \ge 2\):
$$\begin{aligned} {\varvec{u}}^t&= \frac{1}{\sqrt{\delta } \, \beta _t} \big [ {\varvec{A}}{\varvec{v}}^t \, - \, {\varvec{Z}}{\varvec{u}}^{t-1} \big ], \end{aligned}$$
(94)
$$\begin{aligned} {\varvec{v}}^{t+1}&= \frac{1}{\sqrt{\delta } \beta _t}\Big [ \big ({\varvec{A}}^\mathsf{T}{\varvec{Z}}{\varvec{A}}- \delta {\mathbb E}\{Z\} \, {\varvec{I}}_d \big ) {\varvec{v}}^t \, - \, {\varvec{A}}^\mathsf{T}{\varvec{Z}}^2 {\varvec{u}}^{t-1} \Big ]. \end{aligned}$$
(95)
In the remainder of the proof, we will assume that
\(t \ge 2\). Define
$$\begin{aligned} {\varvec{e}}_1^t&= {\varvec{u}}^{t}-{\varvec{u}}^{t-1}, \end{aligned}$$
(96)
$$\begin{aligned} {\varvec{e}}_2^t&= {\varvec{v}}^{t+1}-{\varvec{v}}^{t}. \end{aligned}$$
(97)
By combining (
96) with (
94), we have that
$$\begin{aligned} {\varvec{u}}^{t-1}=({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{A}}{\varvec{v}}^t-\sqrt{\delta }\beta _t({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{e}}_1^t. \end{aligned}$$
(98)
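The algebra behind (98) — substituting \({\varvec{u}}^t = {\varvec{u}}^{t-1} + {\varvec{e}}_1^t\) into (94) and solving for \({\varvec{u}}^{t-1}\) — can be verified numerically with arbitrary stand-ins for the quantities involved:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, delta, beta = 40, 20, 2.0, 0.7

A = rng.standard_normal((n, d)) / np.sqrt(n)
z = rng.random(n) + 0.5          # diagonal of Z (stand-in for T(y_i))
v = rng.standard_normal(d)       # plays the role of v^t
u_prev = rng.standard_normal(n)  # plays the role of u^{t-1}

sb = np.sqrt(delta) * beta
u = (A @ v - z * u_prev) / sb    # (94): u^t
e1 = u - u_prev                  # (96): e_1^t

# (98): u^{t-1} = (Z + sqrt(delta) beta I)^{-1} [A v^t - sqrt(delta) beta e_1^t]
u_prev_rec = (A @ v - sb * e1) / (z + sb)
```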
By substituting the expression for
\({\varvec{u}}^{t-1}\) obtained in (
98) into (
95), we have
$$\begin{aligned} \begin{aligned} {\varvec{v}}^{t+1}&= \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}- \frac{\sqrt{\delta } {\mathbb E}\{Z\}}{\beta _t} \, {\varvec{I}}_d \right) {\varvec{v}}^t +{\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{e}}_1^t\\&= \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-\delta {\mathbb E}\{Z\} \, {\varvec{I}}_d \right) {\varvec{v}}^t \\&\quad +(1-\sqrt{\delta }\beta _t){\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{v}}^t \\&\quad + \delta {\mathbb E}\{Z\}\left( 1-\frac{1}{\sqrt{\delta }\beta _t}\right) {\varvec{v}}^t+{\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{e}}_1^t. \end{aligned} \end{aligned}$$
(99)
Let
$$\begin{aligned} {\varvec{e}}_3^t = \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d \right) {\varvec{v}}^t. \end{aligned}$$
(100)
From (
97) and (
99), we obtain
$$\begin{aligned} \begin{aligned} {\varvec{e}}_3^t&={\varvec{e}}_2^t-(1-\sqrt{\delta }\beta _t){\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{v}}^t \\&\quad - \delta {\mathbb E}\{Z\}\left( 1-\frac{1}{\sqrt{\delta }\beta _t}\right) {\varvec{v}}^t-{\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{e}}_1^t. \end{aligned} \end{aligned}$$
(101)
Let us decompose
\({\varvec{v}}^t\) into a component in the direction of
\(\hat{{\varvec{\varphi }}}_1\) plus an orthogonal component
\({\varvec{r}}^t\):
$$\begin{aligned} {\varvec{v}}^t = \xi _t \hat{{\varvec{\varphi }}}_1+{\varvec{r}}^t, \end{aligned}$$
(102)
where
\(\xi _t=\langle {\varvec{v}}^t, \hat{{\varvec{\varphi }}}_1\rangle \).
At this point, the idea is to show that, when
t is large,
\({\varvec{r}}^t\) is small, and thus
\({\varvec{v}}^t\) tends to be aligned with
\(\hat{{\varvec{\varphi }}}_1\). To do so, we prove that, as
\(n\rightarrow \infty \), the largest eigenvalue of the matrix
\({\varvec{M}}_n\) defined in (
79) converges almost surely to
\(\delta {\mathbb E}\{Z\}+1\). Furthermore, we prove that the matrix
\({\varvec{M}}_n\) exhibits a spectral gap, in the sense that the second largest eigenvalue of
\({\varvec{M}}_n\) converges almost surely to a value strictly smaller than
\(\delta {\mathbb E}\{Z\}+1\). Since
\({\varvec{r}}^t\) is orthogonal to
\(\hat{{\varvec{\varphi }}}_1\) and
\({\varvec{M}}_n\) has a spectral gap, the norm of
\({\varvec{e}}_3^t\) in (
100) can be lower bounded by a strictly positive constant times the norm of
\({\varvec{r}}^t\). Next, using the expression in (
101), we show that the norm of
\({\varvec{e}}_3^t\) can be made arbitrarily small by taking
n and
t sufficiently large. From this, we conclude that
\({\varvec{r}}^t\) must be small.
We begin by proving that \({\varvec{M}}_n\) has a spectral gap.
Let us now go back to (
100) and combine it with (
102). Then,
$$\begin{aligned} \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d \right) {\varvec{r}}^t ={\varvec{e}}_3^t+\xi _t(\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n})\hat{{\varvec{\varphi }}}_1. \end{aligned}$$
(114)
We now prove that, almost surely, for all sufficiently large
n, the following lower bound on the norm of the LHS of (
114) holds:
$$\begin{aligned} \left\| \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d \right) {\varvec{r}}^t\right\| _2\ge c_2\Vert {\varvec{r}}^t\Vert _2, \end{aligned}$$
(115)
where
\(c_2>0\) is a numerical constant independent of
n,
t.
As the matrix
\({\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d\) is symmetric, it can be written in the form
\({\varvec{Q}}{\varvec{\Lambda }}{\varvec{Q}}^\mathsf{T}\), with
\({\varvec{Q}}\) orthogonal and
\({\varvec{\Lambda }}\) diagonal. Furthermore, the columns of
\({\varvec{Q}}\) are the eigenvectors of
\({\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d\) and the diagonal entries of
\({\varvec{\Lambda }}\) are the corresponding eigenvalues. As
\({\varvec{r}}^t\) is orthogonal to
\(\hat{{\varvec{\varphi }}}_1\), we can write
$$\begin{aligned} \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d \right) {\varvec{r}}^t={\varvec{Q}}{\varvec{\Lambda }}'{\varvec{Q}}^\mathsf{T}{\varvec{r}}^t, \end{aligned}$$
(116)
where
\({\varvec{\Lambda }}'\) is obtained from
\({\varvec{\Lambda }}\) by changing the entry corresponding to
\(\lambda _1^{{\varvec{M}}_n}-(\delta {\mathbb E}\{Z\}+1)\) to any other value. For our purposes, it suffices to substitute
\(\lambda _1^{{\varvec{M}}_n}-(\delta {\mathbb E}\{Z\}+1)\) with
\(\lambda _2^{{\varvec{M}}_n}-(\delta {\mathbb E}\{Z\}+1)\). Note that
$$\begin{aligned} \begin{aligned} \Vert {\varvec{Q}}{\varvec{\Lambda }}'{\varvec{Q}}^\mathsf{T}{\varvec{r}}^t\Vert _2^2&\ge \Vert {\varvec{r}}^t\Vert _2^2 \min _{\varvec{s}:\Vert \varvec{s}\Vert =1}\Vert {\varvec{Q}}{\varvec{\Lambda }}'{\varvec{Q}}^\mathsf{T}\varvec{s}\Vert _2^2\\&=\Vert {\varvec{r}}^t\Vert _2^2\min _{\varvec{s}:\Vert \varvec{s}\Vert =1}\langle \varvec{s}, {\varvec{Q}}\left( {\varvec{\Lambda }}'\right) ^2{\varvec{Q}}^\mathsf{T}\varvec{s}\rangle \\&=\Vert {\varvec{r}}^t\Vert _2^2\,\lambda _\mathrm{min}({\varvec{Q}}\left( {\varvec{\Lambda }}'\right) ^2{\varvec{Q}}^\mathsf{T}), \end{aligned} \end{aligned}$$
(117)
where
\(\lambda _\mathrm{min}({\varvec{Q}}\left( {\varvec{\Lambda }}'\right) ^2{\varvec{Q}}^\mathsf{T})\) denotes the smallest eigenvalue of
\({\varvec{Q}}\left( {\varvec{\Lambda }}'\right) ^2{\varvec{Q}}^\mathsf{T}\) and the last equality follows from the variational characterization of the smallest eigenvalue of a symmetric matrix. Note that
$$\begin{aligned} \lambda _\mathrm{min}({\varvec{Q}}\left( {\varvec{\Lambda }}'\right) ^2{\varvec{Q}}^\mathsf{T})=\lambda _\mathrm{min}\left( ({\varvec{\Lambda }}')^2\right) =\min _{i\in \{2, \ldots , d\}}\left( ((\delta {\mathbb E}\{Z\}+1)-\lambda _i^{{\varvec{M}}_n})^2\right) . \end{aligned}$$
(118)
By using (
104), we obtain that, almost surely, for all sufficiently large
n,
$$\begin{aligned} \min _{i\in \{2, \ldots , d\}}\left( ((\delta {\mathbb E}\{Z\}+1)-\lambda _i^{{\varvec{M}}_n})^2\right) \ge \left( \frac{c_1}{2}\right) ^2. \end{aligned}$$
(119)
By combining (
116), (
117), (
118) and (
119), we conclude that (
115) holds.
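The chain (116)–(118) combines two elementary facts: the variational characterization of the smallest eigenvalue, and the fact that conjugation by an orthogonal matrix preserves eigenvalues, so \(\lambda _\mathrm{min}({\varvec{Q}}({\varvec{\Lambda }}')^2{\varvec{Q}}^\mathsf{T})\) is the smallest squared diagonal entry of \({\varvec{\Lambda }}'\). Both are easy to check numerically on arbitrary data (the matrices below are random stand-ins, unrelated to \({\varvec{M}}_n\)):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 30

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal Q
lam = rng.standard_normal(d)                      # diagonal entries of Lambda'

S = Q @ np.diag(lam**2) @ Q.T                     # Q (Lambda')^2 Q^T
lam_min = np.linalg.eigvalsh(S)[0]                # eigvalsh returns ascending order

# The bound (117): for any unit vector s, ||Q Lambda' Q^T s||^2 >= lam_min(S).
s = rng.standard_normal(d)
s /= np.linalg.norm(s)
bound_ok = np.linalg.norm(Q @ np.diag(lam) @ Q.T @ s) ** 2 >= lam_min - 1e-10
```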
Recalling that
\({\varvec{r}}^t\) satisfies (
114), we will next show that, almost surely,
$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty }\frac{1}{\sqrt{d}}\left\| {\varvec{e}}_3^t+\xi _t(\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n})\hat{{\varvec{\varphi }}}_1\right\| _2=0. \end{aligned}$$
(120)
Combined with (
114) and (
115), this implies that
\( \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty } \frac{\Vert {\varvec{r}}^t \Vert _2}{\sqrt{d}} = 0\) almost surely.
By using the triangle inequality, we have
$$\begin{aligned} \begin{aligned} \left\| {\varvec{e}}_3^t+\xi _t(\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n})\hat{{\varvec{\varphi }}}_1\right\| _2&\le \left\| {\varvec{e}}_3^t\right\| _2+|\xi _t|\cdot |\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n}|\cdot \left\| \hat{{\varvec{\varphi }}}_1\right\| _2\\&\le \left\| {\varvec{e}}_3^t\right\| _2+\Vert {\varvec{v}}^t\Vert _2\cdot |\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n}|, \end{aligned} \end{aligned}$$
(121)
where the second inequality uses
\(\left\| \hat{{\varvec{\varphi }}}_1\right\| _2=1\) and that
\(|\xi _t|= |\langle {\varvec{v}}^t, \hat{{\varvec{\varphi }}}_1 \rangle | \le \Vert {\varvec{v}}^t\Vert _2\), by Cauchy–Schwarz.
We can bound the second term on the RHS of (
121) using the result in Proposition
1, applied with the PL(2) test function
\(\psi (v) = v^2\). Then, almost surely,
$$\begin{aligned} \lim _{d\rightarrow \infty } \, \frac{1}{d}\Vert {\varvec{v}}^t\Vert ^2_2 = {\mathbb E}\{V_t^2\}=\beta _t^2. \end{aligned}$$
(122)
Here we used the definitions of
\(V_t\) and
\(\beta _t^2\) from (
52) and (
74). Recalling from (
93) that
\(\lim _{t\rightarrow \infty } \beta _t^2 =\frac{1}{\delta }\), the limit in (
122) combined with Remark
8 and the continuous mapping theorem implies that, almost surely,
$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty } \frac{1}{\sqrt{d}}\Vert {\varvec{v}}^t\Vert _2=\frac{1}{\sqrt{\delta }}. \end{aligned}$$
(123)
Thus, by using (
103), we conclude that, almost surely,
$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{n\rightarrow \infty }\frac{1}{\sqrt{d}}\Vert {\varvec{v}}^t\Vert _2\cdot |\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n}|=0. \end{aligned}$$
(124)
We now bound the first term on the RHS of (
121). Recalling the definition of
\({\varvec{e}}_3^t\) in (
101), an application of the triangle inequality gives
$$\begin{aligned} \begin{aligned} \left\| {\varvec{e}}_3^t\right\| _2&\le \left\| {\varvec{e}}_2^t\right\| _2+|1-\sqrt{\delta }\beta _t|\cdot \left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{v}}^t\right\| _2 \\&\quad + \delta |{\mathbb E}\{Z\}|\cdot \left| 1-\frac{1}{\sqrt{\delta }\beta _t}\right| \cdot \left\| {\varvec{v}}^t\right\| _2+\left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{e}}_1^t\right\| _2\\&\le \left\| {\varvec{e}}_2^t\right\| _2+|1-\sqrt{\delta }\beta _t|\cdot \left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}\right\| _\mathrm{op}\Vert {\varvec{v}}^t\Vert _2 \\&\quad + \delta |{\mathbb E}\{Z\}|\cdot \left| 1-\frac{1}{\sqrt{\delta }\beta _t}\right| \cdot \left\| {\varvec{v}}^t\right\| _2+\left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\right\| _\mathrm{op}\Vert {\varvec{e}}_1^t\Vert _2, \end{aligned} \end{aligned}$$
(125)
where the second inequality follows from the fact that, given a matrix
\({\varvec{M}}\) and a vector
\({\varvec{v}}\),
\(\Vert {\varvec{M}}{\varvec{v}}\Vert _2\le \Vert {\varvec{M}}\Vert _\mathrm{op} \Vert {\varvec{v}}\Vert _2\).
Let us bound the operator norm of the two matrices appearing in the RHS of (
125). As the operator norm is sub-multiplicative, we have
$$\begin{aligned} \begin{aligned}&\left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}\right\| _\mathrm{op}\\&\quad \le \left\| {\varvec{Z}}\right\| _\mathrm{op}\left\| ({\varvec{Z}}+{\varvec{I}}_n)^{-1}\right\| _\mathrm{op}\left\| ({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\right\| _\mathrm{op} \left\| {\varvec{A}}\right\| _\mathrm{op}^2,\\&\left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\right\| _\mathrm{op}\\&\quad \le \left\| {\varvec{Z}}\right\| _\mathrm{op}^2\left\| ({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\right\| _\mathrm{op}\left\| {\varvec{A}}\right\| _\mathrm{op}^2. \end{aligned} \end{aligned}$$
(126)
As
Z is bounded, the operator norm of
\({\varvec{Z}}\) is upper bounded by a numerical constant (independent of
n,
t). The operator norm of
\(({\varvec{Z}}+{\varvec{I}}_n)^{-1}\) and
\(({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\) is also upper bounded by a numerical constant (independent of
n,
t). Indeed, from (
93),
\(\beta _t \rightarrow 1/\sqrt{\delta }\) as
\(t \rightarrow \infty \), and the support of
Z does not contain points arbitrarily close to
\(-1\). We also have that, almost surely, for all sufficiently large
n, the operator norm of
\({\varvec{A}}\) is upper bounded by a constant (independent of
n,
t). As a result, we deduce that, almost surely, for all sufficiently large
n,
t,
$$\begin{aligned} \begin{aligned} \left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}\right\| _\mathrm{op}&\le C,\\ \left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\right\| _\mathrm{op}&\le C, \end{aligned} \end{aligned}$$
(127)
where
C is a numerical constant (independent of
n,
t). Furthermore, by Lemma
5, the following limits hold almost surely:
$$\begin{aligned} \begin{aligned} \lim _{t\rightarrow \infty }\lim _{n\rightarrow \infty }\frac{1}{\sqrt{n}}\left\| {\varvec{e}}_1^t\right\| _2&=0,\\ \lim _{t\rightarrow \infty }\lim _{n\rightarrow \infty }\frac{1}{\sqrt{d}}\left\| {\varvec{e}}_2^t\right\| _2&=0. \end{aligned} \end{aligned}$$
(128)
By combining (
93), (
123), (
127) and (
128), we obtain that, almost surely, each of the terms in the RHS of (
125) vanishes when scaled by the factor
\(1/\sqrt{n}\), as
\(t, n\rightarrow \infty \). As a result, almost surely,
$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty }\frac{1}{\sqrt{d}}\left\| {\varvec{e}}_3^t\right\| _2=0. \end{aligned}$$
(129)
By combining (
121), (
124) and (
129), we conclude that, almost surely, (
120) holds.
Recall that
\({\varvec{r}}^t\) satisfies (
114). Thus, by combining the lower bound in (
115) with the almost sure limit in (
120), we obtain that, almost surely,
$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d \rightarrow \infty }\frac{1}{\sqrt{d}}\left\| {\varvec{r}}^t\right\| _2=0. \end{aligned}$$
(130)
Recalling from (
102) that
\({\varvec{r}}^t\) is the component of
\({\varvec{v}}^t\) orthogonal to
\(\hat{{\varvec{\varphi }}}_1\), the result in (
130) implies that
\({\varvec{v}}^t\) tends to be aligned with
\(\hat{{\varvec{\varphi }}}_1\) in the high-dimensional limit. In formulas, by combining (
102) with (
130), we have that, almost surely,
$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty }\frac{1}{\sqrt{d}}\left\| {\varvec{v}}^t - \xi _t \hat{{\varvec{\varphi }}}_1\right\| _2=0. \end{aligned}$$
(131)
Note that
$$\begin{aligned} \left\| {\varvec{v}}^t - \xi _t \hat{{\varvec{\varphi }}}_1\right\| _2^2=\left\| {\varvec{v}}^t\right\| _2^2-\xi _t^2. \end{aligned}$$
(132)
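The identity (132) is a consequence of the orthogonal decomposition (102): since \({\varvec{r}}^t\) is orthogonal to \(\hat{{\varvec{\varphi }}}_1\) and \(\hat{{\varvec{\varphi }}}_1\) has unit norm, we have \(\xi _t = \langle {\varvec{v}}^t, \hat{{\varvec{\varphi }}}_1\rangle \), and expanding the square gives
$$\begin{aligned} \left\| {\varvec{v}}^t - \xi _t \hat{{\varvec{\varphi }}}_1\right\| _2^2 = \left\| {\varvec{v}}^t\right\| _2^2 - 2\xi _t \langle {\varvec{v}}^t, \hat{{\varvec{\varphi }}}_1\rangle + \xi _t^2 \left\| \hat{{\varvec{\varphi }}}_1\right\| _2^2 = \left\| {\varvec{v}}^t\right\| _2^2 - \xi _t^2. \end{aligned}$$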
Thus, by using (
123), we obtain that, almost surely,
$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty } \frac{1}{\sqrt{d}}|\xi _t|=\frac{1}{\sqrt{\delta }}. \end{aligned}$$
(133)
To obtain the sign of
\(\xi _t\), we first observe that, by Proposition
1, almost surely,
$$\begin{aligned} \lim _{d \rightarrow \infty }\frac{1}{d}\langle {\varvec{v}}^t, {\varvec{x}}\rangle =\mu _{V, t}. \end{aligned}$$
(134)
As
\(\mu _{V, 0} = \alpha > 0\) and
\({\mathbb E}\{ Z(G^2-1)\} = 1/\delta \), the state evolution iteration (
73) implies that
\(\mu _{V, t} >0\) for all
\(t \ge 0\). Using (
102), we can write
$$\begin{aligned} \frac{1}{d} \langle {\varvec{v}}^t, {\varvec{x}}\rangle = \frac{\xi _t}{\sqrt{d}} \frac{\langle \hat{{\varvec{\varphi }}}_1, {\varvec{x}}\rangle }{\sqrt{d}} \, + \, \frac{\langle {\varvec{r}}^t, {\varvec{x}}\rangle }{d}. \end{aligned}$$
(135)
Recall that by hypothesis,
\(\langle \hat{{\varvec{\varphi }}}_1, {\varvec{x}}\rangle \ge 0\). Moreover, using (
130) and Cauchy–Schwarz, we have that, almost surely,
$$\begin{aligned} \lim _{t \rightarrow \infty } \lim _{d \rightarrow \infty }\frac{\langle {\varvec{r}}^t, {\varvec{x}}\rangle }{d} = 0. \end{aligned}$$
Thus we deduce that, almost surely,
$$\begin{aligned} \lim _{t \rightarrow \infty } \lim _{d \rightarrow \infty } \frac{\xi _t}{\sqrt{d}} = + \frac{1}{\sqrt{\delta }}.\end{aligned}$$
Therefore,
$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty }\frac{1}{\sqrt{d}}\left\| \sqrt{\delta }{\varvec{v}}^t - \tilde{{\varvec{\varphi }}}^{(1)}\right\| _2=0 \quad \text {a.s.}, \end{aligned}$$
(136)
with
\(\tilde{{\varvec{\varphi }}}^{(1)}=\sqrt{d}\hat{{\varvec{\varphi }}}_1\).
At this point, we are ready to prove (
80). For any
\(\mathrm{PL}(k)\) function
\(\psi : \mathbb {R}^3 \rightarrow \mathbb {R}\), we have that
$$\begin{aligned} \begin{aligned}&\left| \frac{1}{d}\sum _{i=1}^d \psi (x_i, \tilde{x}^\mathrm{L}_i, \tilde{\varphi }_i^{(1)}) - \frac{1}{d}\sum _{i=1}^d \psi (x_i, \tilde{x}^\mathrm{L}_i, \sqrt{\delta }v_i^t) \right| \\&\quad \le \frac{1}{d}\sum _{i=1}^d \left| \psi (x_i, \tilde{x}^\mathrm{L}_i, \tilde{\varphi }_i^{(1)}) - \psi (x_i, \tilde{x}^\mathrm{L}_i, \sqrt{\delta }v_i^t) \right| \\&\ \le \frac{C}{d}\sum _{i=1}^d |\tilde{\varphi }_i^{(1)}-\sqrt{\delta }v_i^t| \\&\quad \left[ 1+\, \left( \big (\tilde{\varphi }_i^{(1)}\big )^2+ (\tilde{x}^\mathrm{L}_i)^2 +x_i^2 \right) ^{(k-1)/2}+ \, \left( \delta (v_i^t)^2 + (\tilde{x}^\mathrm{L}_i)^2 +x_i^2\right) ^{(k-1)/2} \right] \\&\ \le \frac{C}{d}\sum _{i=1}^d |\tilde{\varphi }_i^{(1)}-\sqrt{\delta }v_i^t|\\&\quad \left[ 1+\, 3^{\frac{k-1}{2}} \Big ( \left|{\tilde{\varphi }_i^{(1)}}\right|^{k-1} + \left|{\tilde{x}^\mathrm{L}_i}\right|^{k-1} + \left|{x_i}\right|^{k-1} + \left|{\sqrt{\delta } v_i^t}\right|^{k-1} + \left|{\tilde{x}^\mathrm{L}_i}\right|^{k-1} + \left|{x_i}\right|^{k-1} \Big ) \right] \\&\ \le C'\frac{\Vert \tilde{{\varvec{\varphi }}}^{(1)} - \sqrt{\delta }{\varvec{v}}^t\Vert _2}{\sqrt{d}} \left[ 1+ \sum _{i=1}^d \left( \frac{ |\tilde{\varphi }_i^{(1)}|^{2(k-1)} + |\tilde{x}^\mathrm{L}_i|^{2(k-1)} + |{x}_i|^{2(k-1)} + |\sqrt{\delta } v_i^t|^{2(k-1)}}{d} \right) \right] ^{1/2}, \end{aligned} \end{aligned}$$
(137)
where
\(C,C'\) are universal constants (which may depend on
k but not on
d,
n). The inequality in the second line above uses
\(\psi \in \mathrm{PL}(k)\); the third line uses the elementary consequence of Hölder's inequality \((a+b+c)^{p}\le 3^{p}\,(a^{p}+b^{p}+c^{p})\), valid for \(a,b,c\ge 0\) and \(p>0\) and applied here with \(p=(k-1)/2\); and the fourth line follows from the Cauchy–Schwarz inequality. We now claim that, almost surely,
$$\begin{aligned} \limsup _{d\rightarrow \infty }\sum _{i=1}^d \left( \frac{ |\tilde{\varphi }_i^{(1)}|^{2(k-1)} + |\tilde{x}^\mathrm{L}_i|^{2(k-1)} + |{x}_i|^{2(k-1)} + |\sqrt{\delta } v_i^t|^{2(k-1)}}{d} \right) \le C, \end{aligned}$$
(138)
where, from now on, we will use
C to denote a generic positive constant that does not depend on
d,
n. If (
138) holds, then by using (
136) and (
137), we deduce that, almost surely,
$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty } \left| \frac{1}{d}\sum _{i=1}^d \psi (x_i, \tilde{x}^\mathrm{L}_i, \tilde{\varphi }_i^{(1)}) - \frac{1}{d}\sum _{i=1}^d \psi (x_i, \tilde{x}^\mathrm{L}_i, \sqrt{\delta }v_i^t) \right| = 0. \end{aligned}$$
(139)
Let us now prove (
138). First, by assumption
(B1), we have that
$$\begin{aligned} \lim _{d \rightarrow \infty } \frac{1}{d} \sum _{i} |{x}_i|^{2(k-1)} \le C. \end{aligned}$$
(140)
Next, the main technical lemma [
27, Lemma 2] leading to the state evolution result in Proposition
1 implies that, almost surely, for
\(t\ge 1\),
$$\begin{aligned} \limsup _{d \rightarrow \infty } \frac{1}{d} \sum _i | v_i^t|^{2(k-1)} \le C. \end{aligned}$$
(141)
In particular, this follows from [
27, Lemma 2(e)] (see also [
4, Lemma 1(e)]). Since
\(\tilde{{\varvec{x}}}^\mathrm{L} = {\varvec{v}}^1\), we also have that, almost surely,
$$\begin{aligned} \limsup _{d \rightarrow \infty } \frac{1}{d} \sum _i | \tilde{x}_i^\mathrm{L}|^{2(k-1)} \le C. \end{aligned}$$
(142)
It remains to show that, almost surely,
$$\begin{aligned} \limsup _{d \rightarrow \infty } \frac{1}{d} \sum _i (\tilde{\varphi }_i^{(1)})^{2(k-1)}\le C. \end{aligned}$$
(143)
To do so, we use a rotational invariance argument. Let
\({\varvec{R}}\in \mathbb R^{d \times d}\) be an orthogonal matrix such that
\({\varvec{R}}{\varvec{x}}= {\varvec{x}}\). Then,
$$\begin{aligned} \langle {\varvec{x}}, {\varvec{a}}_i\rangle =\langle {\varvec{R}}{\varvec{x}}, {\varvec{R}}{\varvec{a}}_i \rangle = \langle {\varvec{x}}, {\varvec{R}}{\varvec{a}}_i\rangle . \end{aligned}$$
(144)
Consequently, we have that
$$\begin{aligned} {\varvec{R}}{\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{R}}^{\mathsf{T}} {\mathop {=}\limits ^{{\text {d}}}}{\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}, \end{aligned}$$
(145)
which immediately implies that
$$\begin{aligned} {\varvec{R}}\tilde{{\varvec{\varphi }}}^{(1)} {\mathop {=}\limits ^{{\text {d}}}}\tilde{{\varvec{\varphi }}}^{(1)}. \end{aligned}$$
(146)
Then, we can decompose
\(\tilde{{\varvec{\varphi }}}^{(1)}\) as
$$\begin{aligned} \tilde{{\varvec{\varphi }}}^{(1)} = a_d \, {\varvec{x}}+ \sqrt{1-a_d^2}\, {\varvec{\varphi }}^{\perp }, \end{aligned}$$
(147)
where
\({\varvec{\varphi }}^{\perp }\) is uniformly distributed over the set of vectors orthogonal to
\({\varvec{x}}\) with norm
\(\sqrt{d}\) and
$$\begin{aligned} \frac{1}{d} \langle {\varvec{x}}, \tilde{{\varvec{\varphi }}}^{(1)} \rangle = a_d. \end{aligned}$$
(148)
Relating the uniform distribution on the sphere to the normal distribution [
54, Sec. 3.3.3], we can express
\({\varvec{\varphi }}^\perp \) as follows:
$$\begin{aligned} {\varvec{\varphi }}^{\perp } =\sqrt{d} \, \frac{{\varvec{u}}- \displaystyle \frac{1}{d}\langle {\varvec{u}}, {\varvec{x}}\rangle {\varvec{x}}}{\Big \Vert {\varvec{u}}- \displaystyle \frac{1}{d}\langle {\varvec{u}}, {\varvec{x}}\rangle {\varvec{x}}\Big \Vert _2}, \end{aligned}$$
(149)
where
\({\varvec{u}}\sim \mathsf{N}({\varvec{0}}_d, {\varvec{I}}_d)\) and independent of
\({\varvec{x}}\). By the law of large numbers, we have the almost sure limits
$$\begin{aligned} \begin{aligned} \lim _{d\rightarrow \infty }\frac{1}{d}\langle {\varvec{u}}, {\varvec{x}}\rangle = 0, \qquad \quad \lim _{d\rightarrow \infty }\frac{1}{d}\Big \Vert {\varvec{u}}- \displaystyle \frac{1}{d}\langle {\varvec{u}}, {\varvec{x}}\rangle {\varvec{x}}\Big \Vert _2^2 = 1. \end{aligned} \end{aligned}$$
(150)
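As an illustrative numerical sketch of the representation (149) (not part of the proof; the dimension \(d\) and the draw of \({\varvec{x}}\) below are arbitrary choices), one can sample \({\varvec{\varphi }}^\perp \) and verify both the orthogonality and norm constraints and the law-of-large-numbers limits in (150):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100_000

# x: any fixed vector with ||x||_2 = sqrt(d); here drawn at random and rescaled
x = rng.standard_normal(d)
x *= np.sqrt(d) / np.linalg.norm(x)

# u ~ N(0, I_d), independent of x
u = rng.standard_normal(d)

# Representation (149): project out the direction of x, rescale to norm sqrt(d)
w = u - (u @ x / d) * x
phi_perp = np.sqrt(d) * w / np.linalg.norm(w)

# Orthogonality and norm of phi_perp (exact up to floating point)
print(phi_perp @ x / d)          # ~ 0
print(phi_perp @ phi_perp / d)   # = 1

# LLN limits (150): <u, x>/d ~ 0 and ||w||_2^2 / d ~ 1 for large d
print(u @ x / d, w @ w / d)
```

The projection coefficient \(\langle {\varvec{u}}, {\varvec{x}}\rangle /d\) matches (149) because \(\Vert {\varvec{x}}\Vert _2^2 = d\).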
Thus, by combining (
147) and (
149), we conclude that
$$\begin{aligned} \tilde{{\varvec{\varphi }}}^{(1)} = c_1 {\varvec{x}}+c_2{\varvec{u}}, \end{aligned}$$
(151)
where the coefficients
\(c_1\) and
\(c_2\) can be bounded by universal constants (independent of
n,
d) using (
150). As a result,
$$\begin{aligned} \frac{1}{d} \sum _i (\tilde{\varphi }_i^{(1)})^{2(k-1)}&\le 2^{2(k-1)} |c_1|^{2(k-1)} \frac{1}{d}\sum _i |x_i|^{2(k-1)} \\&\quad + 2^{2(k-1)} |c_2|^{2(k-1)} \frac{1}{d} \sum _i |u_i|^{2(k-1)}. \end{aligned}$$
(152)
Note that, almost surely,
$$\begin{aligned} \lim _{d\rightarrow \infty } \frac{1}{d} \sum _i |u_i|^{2(k-1)}={\mathbb E}\{U^{2(k-1)}\}\le C, \end{aligned}$$
(153)
where
\(U\sim \mathsf{N}(0, 1)\). By combining (
140), (
152) and (
153), (
143) immediately follows. Finally, by combining (
140), (
141), (
142) and (
143), we deduce that (
138) holds.
We now use Proposition
1 which guarantees that, almost surely,
$$\begin{aligned} \lim _{d \rightarrow \infty } \frac{1}{d} \sum _{i=1}^d \psi (x_i, \tilde{x}^\mathrm{L}_i, \sqrt{\delta }v_i^t) = {\mathbb E}\{ \psi (X, \, \mu _{V, 1} X + \sigma _{V,1} W_{V,1} , \, \sqrt{\delta }(\mu _{V, t} X + \sigma _{V,t} W_{V,t})) \}, \quad t \ge 1. \end{aligned}$$
(154)
To conclude the proof of (
80), we take the limit
\(t \rightarrow \infty \) and use Remark
8. For this, we will show that
$$\begin{aligned} \begin{aligned}&\lim _{t \rightarrow \infty } {\mathbb E}\{ \psi (X, \, \mu _{V, 1} X + \sigma _{V,1} W_{V,1} , \, \sqrt{\delta }(\mu _{V, t} X + \sigma _{V,t} W_{V,t})) \} \\&\quad = {\mathbb E}\{ \psi (X, \, \mu _{V, 1} X + \sigma _{V,1} W_{V,1} , \, \sqrt{\delta }(\tilde{\mu }_{V} X + \tilde{\sigma }_{V} W_{V,\infty })) \}, \end{aligned} \end{aligned}$$
(155)
where
\((W_{V,1}, W_{V,\infty })\) are zero mean jointly Gaussian random variables with covariance given by (
81). Using (
58) and the formulas for
\(g_0\) and
\(g_t\) from (
66) and (
68), we have
$$\begin{aligned} \begin{aligned} {\mathbb E}\{ W_{V,1} W_{V,t} \}&= \frac{1}{\sigma _{V,1} \sigma _{V,t}}{\mathbb E}\{ {\mathcal {T}}_L(Y) Z (\mu _{U,t-1} G + \sigma _{U,t-1}W_{U,t-1}) \} \\&= \frac{\mu _{V,t-1}}{\sqrt{\delta } \beta _{t-1} \sigma _{V,1} \sigma _{V,t}} {\mathbb E}\{ {\mathcal {T}}_L(Y) Z G \}. \end{aligned} \end{aligned}$$
(156)
In the second equality above, we used (
72). Using the expression for
\(\sigma _{V,1}\) from (
75) and letting
\(t \rightarrow \infty \), we have
$$\begin{aligned} \lim _{t \rightarrow \infty } {\mathbb E}\{ W_{V,1} W_{V,t} \} = \frac{ \tilde{\mu }_{V} {\mathbb E}\{ {\mathcal {T}}_L(Y) Z G \}}{ \tilde{\beta } \tilde{\sigma }_{V} \sqrt{ {\mathbb E}\{ {\mathcal {T}}_L(Y)^2\} } } = {\mathbb E}\{ W_{V,1} W_{V, \infty } \}. \end{aligned}$$
(157)
Therefore, the sequence of zero mean jointly Gaussian pairs
\((W_{V,1}, W_{V,t})_{t \ge 1}\) converges in distribution to the jointly Gaussian pair
\((W_{V,1}, W_{V,\infty })\), whose covariance is given by (
81).
To show (
155), we use Lemma
9 in Appendix
G. We apply this result taking
\(Q_t\) to be the distribution of
$$(X, \, \mu _{V,1}X + \sigma _{V,1} W_{V,1}, \, {\mu }_{V,t}X + {\sigma }_{V,t} W_{V,t} ).$$
Since
\(\mu _{V,t} \rightarrow \tilde{\mu }_V\),
\(\sigma _{V,t} \rightarrow \tilde{\sigma }_V\), the sequence
\((Q_t)_{t \ge 2}\) converges weakly to
Q, which is the distribution of
$$(X, \, \mu _{V,1}X + \sigma _{V,1} W_{V,1}, \, \tilde{\mu }_{V}X + \tilde{\sigma }_{V} W_{V,\infty } ).$$
In our case,
\(\psi : \mathbb {R}^3 \rightarrow \mathbb {R}\) is PL(
k), and therefore
\(|\psi (a,b,c)| \le C'(1 + \left|{a}\right|^k + \left|{b}\right|^k + \left|{c}\right|^k)\), for all
\((a,b,c) \in \mathbb {R}^3\) for some constant
\(C'\). Choosing
\(h(a,b,c) = \left|{a}\right|^k + \left|{b}\right|^k + \left|{c}\right|^k\), we have
\(\frac{\left|{\psi }\right|}{1 + h} \le C'\). Furthermore,
\(\int h \, \mathrm{d}Q_t\) is a linear combination of
\(\{\mu _{V,t}, \mu _{V,t}^2, \ldots , \mu _{V,t}^k, \sigma _{V,t}, \sigma _{V,t}^2, \ldots , \sigma _{V,t}^k \}\), with coefficients that do not depend on
t. The integral
\(\int h \, \mathrm{d}Q\) has the same form, except that
\(\mu _{V,t}, \sigma _{V,t}\) are replaced by
\(\tilde{\mu }_{V}, \tilde{\sigma }_{V}\), respectively. Since
\(\mu _{V,t} \rightarrow \tilde{\mu }_{V}\) and
\(\sigma _{V,t} \rightarrow \tilde{\sigma }_{V}\), we have that
$$\lim _{t \rightarrow \infty } \int h \, \mathrm{d}Q_t = \int h \, \mathrm{d}Q.$$
Therefore, by applying Lemma
9 in Appendix
G, we have that
$$\begin{aligned} \lim _{t \rightarrow \infty } \int \psi \, \mathrm{d}Q_t = \int \psi \, \mathrm{d}Q, \end{aligned}$$
which is equivalent to (
155). This completes the proof of the lemma.