1 Introduction

Classification ensembles that use some form of randomization in the design of the base classifiers have a long and successful history in machine learning, especially in the case when there are fewer training observations than data dimensions. Common approaches include: Bagging (Breiman 1996); random subspaces (Ho 1998); random forests (Breiman 2001).

Surprisingly, despite the well-known theoretical properties of random projections as dimension-reducing approximate isometries (Dasgupta and Gupta 2002; Achlioptas 2003) and empirical and theoretical studies demonstrating their usefulness when learning a single classifier (e.g. Fradkin and Madigan 2003; Durrant and Kabán 2010), results in the literature employing random projections to create weak learners for ensemble classification are sparse compared to results for other approaches such as bagging and random subspaces. On the other hand, given their appealing properties and tractability to analysis, random projections seem like a rather natural choice in this setting. Those empirical studies we could find on randomly-projected ensembles in the literature (Goel et al. 2005; Folgieri 2008; Schclar and Rokach 2009) all report good empirical performance from the ensemble, but none attempt a theoretical analysis. Indeed for all of the randomizing approaches mentioned above, despite a wealth of empirical evidence demonstrating the effectiveness of these ensembles, there are very few theoretical studies.

An important paper by Fumera et al. (2008) gives an approximate analytical model, as a function of the ensemble size and applicable to linear combiners, which explains the variance-reducing property of bagging. However, besides the inherent difficulties with the approach of bias-variance decomposition for classification problems (Schapire et al. 1998), such analysis only serves to relate the performance of an ensemble to that of its members, and Fumera et al. (2008) correctly point out that even for bagging, the simplest such approach and one in use since at least 1996, there is ‘no clear understanding yet of the conditions under which bagging outperforms an individual classifier [trained on the full original data set]’. They further state that, even with specific assumptions on the data distribution, such an analytical comparison would be a complex task. In other words, there is no clear understanding yet about when to use an ensemble vs. when to use a single classifier.

Here we take a completely different approach to address this last open issue for a specific classifier ensemble: Focusing on an ensemble of randomly projected Fisher linear discriminant (RP-FLD) classifiers as our base learners, we leverage recent random matrix theoretic results to link the performance of the linearly combined ensemble to the corresponding classifier trained on the original data. In particular, we extend and simplify the work of Marzetta et al. (2011) specifically for this classification setting, and one of our main contributions is to derive theoretical guarantees that directly link the performance of the randomly projected ensemble to the performance of Fisher linear discriminant (FLD) learned in the full data space. This theory is, however, not simply of abstract interest: We also show experimentally that the algorithm we analyze here is highly competitive with the state-of-the-art. Furthermore our algorithm has several practically desirable properties, amongst which are: Firstly, the individual ensemble members are learned in a very low-dimensional space from randomly-projected data, and so training data can be collected, stored and processed entirely in this form. Secondly, our approach is fast—training on a single core typically has lower time complexity than learning a regularized FLD in the data space, while for classification the time complexity is the same as the data space FLD. Thirdly, parallel implementation of our approach is straightforward since, both for training and classification, the individual ensemble members can be run on separate cores. Finally, our approach returns an inverse covariance matrix estimate for the full \(d\)-dimensional data space, the entries of which are interpretable as conditional correlations; this may be useful in a wide range of settings.

Our randomly projected ensemble approach can be viewed as a generalization of bagged ensembles, in the sense that here we generate multiple instances of training data by projecting a training set of size \(N\) onto a subspace drawn uniformly at random with replacement from the data space, whereas in bagging one generates instances of training data by drawing \(N'\) training examples uniformly with replacement from a training set of size \(N \geqslant N'\). However, in this setting, an obvious advantage of our approach over bagging is that it is able to repair the rank deficiency of the sample covariance matrix we need to invert in order to build the classifier. In particular, we show that when there are fewer observations than dimensions our ensemble implements a data space FLD with a sophisticated covariance regularization scheme (parametrized by an integer parameter) that subsumes a combination of several previous regularization schemes. In order to see the clear structural links between our ensemble and its data space counterpart we develop our theory in a random matrix theoretic setting. We avoid a bias-variance decomposition approach since, in common with the analysis of Schapire et al. (1998), a key property of our ensemble is that its effect is not simply to reduce the variance of a biased classifier.

A preliminary version of this work (Durrant and Kabán 2013) won the best paper award at the 5th Asian Conference on Machine Learning (ACML 2013). Here we extend that work in several directions, as well as including material omitted there due to space constraints: The high probability lower bound on the generalization error of pseudo-inverted FLD (Theorem 3.3), full proofs of all theorems in the place of the sketches in Durrant and Kabán (2013), and most of Sect. 5 are new. Moreover we have greatly extended the experimental section, adding a 100,000-dimensional dataset, additional carefully-tuned comparison methods, and comparisons of our approach with an ensemble of random subspace FLDs.

The structure of the remainder of the paper is as follows: We give some brief background and describe the randomly projected FLD classifier ensemble. Next, we present theoretical findings that give insight into how this ensemble behaves. We continue by presenting extensive experiments on real datasets from the bioinformatics domain, where FLD (and variants) is a popular classifier choice even though it is often restricted to a diagonal covariance model because of high dimensionality and data scarcity (Guo et al. 2007; Dudoit et al. 2002). We further present experimental results on a 100,000-dimensional drug discovery dataset, which comes from another problem domain where the small sample size problem typically arises. Our experiments suggest that in practice, when the number of training examples is less than the number of data dimensions, our ensemble approach outperforms the traditional FLD in the data space both in terms of prediction performance and computation time. Finally, we summarize and discuss possible future directions for this and similar approaches.

2 Preliminaries

We consider a binary classification problem in which we observe \(N\) i.i.d examples of labelled training data \(\mathcal {T}_{N}=\{( x_{i},y_{i}): x_i \in \mathbb {R}^d, y_{i} \in \{0,1\} \}_{i=1}^{N}\) where \((x_{i},y_{i}) \overset{i.i.d}{\sim } \mathcal {D}_{x,y}\). We are interested in comparing the performance of a randomly-projected ensemble classifier working in the projected space \(\mathbb {R}^k, k \ll d\), to the performance achieved by the corresponding classifier working in the data space \(\mathbb {R}^d\). We will consider Fisher’s linear discriminant classifier working in both of these settings since FLD is a popular and widely used linear classifier (in the data space setting) and yet it is simple enough to analyse in detail.

The decision rule for FLD learned from training data is given by:

$$\begin{aligned} \hat{h}(x_{q}){:=}\mathbf {1} \left\{ (\hat{\mu }_{1}- \hat{\mu }_{0})^{T} \hat{\Sigma }^{-1}\left( x_{q}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2}\right) > 0 \right\} \end{aligned}$$

where \(\hat{\mu }_{0}, \hat{\mu }_{1}\), and \(\hat{\Sigma }\) are maximum likelihood (ML) estimates of the class-conditional means and (shared) covariance matrix respectively, and \(\mathbf {1}(\cdot )\) is the indicator function which returns 1 if its argument is true and 0 otherwise. In the setting considered here we assume that \(N \ll d\). Hence, \(\hat{\Sigma }\) will be singular and so one can either pseudo-invert or regularize \(\hat{\Sigma }\) to obtain a working decision rule; both approaches are used in practice (Raudys and Duin 1998).
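To make the decision rule concrete, the following minimal sketch (in Python with NumPy; the function name and interface are ours, not from the paper) fits the data space FLD from a labelled sample, using either pseudo-inversion or a simple ridge regularizer for the singular ML covariance estimate:

```python
import numpy as np

def fit_fld_dataspace(X, y, reg=None):
    """Data space FLD: either pseudo-invert the (singular, when N << d) ML
    covariance estimate, or ridge-regularize it with reg * I."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Xc = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])
    Sigma_hat = Xc.T @ Xc / Xc.shape[0]           # shared ML covariance estimate
    if reg is None:
        w = np.linalg.pinv(Sigma_hat) @ (mu1 - mu0)
    else:
        w = np.linalg.solve(Sigma_hat + reg * np.eye(X.shape[1]), mu1 - mu0)
    b = -w @ (mu0 + mu1) / 2
    return w, b                                   # predict class 1 iff w @ x_q + b > 0
```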

To construct the randomly projected ensemble, we choose the number of ensemble members \(M\) and the projection dimensionality \(k\), and generate \(M\) random matrices \(R \in \mathcal {M}_{k \times d}\) with i.i.d entries \(r_{ij} \sim \mathcal {N}(0,\sigma ^2)\) each. We can take \(\sigma ^{2}=1\) without loss of generality. Such matrices are called random projection matrices in the literature (Arriaga and Vempala 1999; Achlioptas 2003). Pre-multiplying the data with one of the matrices \(R\) maps the training examples to a random \(k\)-dimensional subspace of the data space \(\mathbb {R}^d\) and for each instance of \(R\) we learn a single FLD classifier in this subspace. By linearity of expectation and of the projection operator, the decision rule for a single randomly projected classifier is then given by:

$$\begin{aligned} \hat{h}_{R}(x_{q}) {:=} \mathbf {1} \left\{ \!(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T}R^\mathrm{T} \!\left( \! R\hat{\Sigma }R^\mathrm{T} \!\right) ^{-1} \!\!\!R\left( \! x_{q}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2}\!\right) \! >\! 0 \right\} \end{aligned}$$

For an ensemble, various different combination rules can be applied. The most common choices include majority voting (when there is an odd number of classifiers in the ensemble) and linear combination (Brown 2009). We want to make the most of the weak learners’ confidence estimates so we choose to employ the averaged linear decisions of \(M\) base learners as our combination rule which gives the following ensemble decision:

$$\begin{aligned} \hat{h}_{ens}(x_{q}){:=} \mathbf {1} \left\{ \frac{1}{M}\sum _{i=1}^M (\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T}R_{i}^\mathrm{T} \left( R_{i} \hat{\Sigma }R_{i}^\mathrm{T} \right) ^{-1} R_{i} \left( x_{q}- \frac{\hat{\mu }_{1}+ \hat{\mu }_{0}}{2} \right) > 0 \right\} \end{aligned}$$

Our algorithm is therefore very simple: we learn \(M\) FLD classifiers from \(M\) different instances of randomly projected data, average their outputs and take the sign of this average as the ensemble decision. This combination rule is called ‘voting’ in the ensemble literature but, to avoid any possible confusion with majority voting, we shall refer to it as ‘RP averaging’; it does not require the number of classifiers in the ensemble to be odd for good generalization and, as we shall see, it also has the advantage of analytical tractability.
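The following is a minimal sketch of this training and prediction procedure, assuming NumPy and the notation above; the function names and the choice of pseudo-random generator are ours and purely illustrative:

```python
import numpy as np

def fit_rp_fld_ensemble(X, y, k, M, rng=np.random.default_rng(0)):
    """Fit M randomly projected FLD classifiers sharing one ML covariance estimate.
    X: (N, d) data, y: (N,) labels in {0, 1}, k: projection dimension."""
    d = X.shape[1]
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Xc = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])
    Sigma_hat = Xc.T @ Xc / Xc.shape[0]            # singular when N << d
    members = []
    for _ in range(M):
        R = rng.standard_normal((k, d))            # random projection matrix
        S_k = R @ Sigma_hat @ R.T                  # k x k, full rank almost surely
        w_k = np.linalg.solve(S_k, R @ (mu1 - mu0))
        members.append((R, w_k))
    return mu0, mu1, members

def predict_rp_fld_ensemble(Xq, model):
    """RP averaging: average the M linear discriminant scores, then threshold."""
    mu0, mu1, members = model
    centred = Xq - (mu0 + mu1) / 2
    scores = np.mean([centred @ R.T @ w_k for R, w_k in members], axis=0)
    return (scores > 0).astype(int)
```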

We commence our theoretical analysis of this algorithm by examining the expected performance of the RP-FLD ensemble when the training set is fixed, which is central to linking the ensemble and data space classifiers, and then later in Theorem 3.2 we will consider random instantiations of the training set.

To begin, observe that by the law of large numbers the LHS of the argument of the decision rule of the ensemble converges to the following:

$$\begin{aligned}&\underset{{M}\rightarrow {\infty }}{\lim }\frac{1}{M} \sum _{i=1}^M \!(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T}R_{i}^\mathrm{T} \!\left( \! R_{i}\hat{\Sigma }R_{i}^\mathrm{T} \!\right) ^{-1} \!\!\!R_{i}\left( \! x_{q}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2}\!\right) \! \nonumber \\&\quad =(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} \text {E}\left[ R^\mathrm{T} \left( \!R \hat{\Sigma }R^\mathrm{T}\right) ^{-1} \!\!R \right]\!\left( x_{q}- \frac{\hat{\mu }_{1}+ \hat{\mu }_{0}}{2} \right) \end{aligned}$$
(2.1)

provided that this limit exists. If \(\rho \) is the rank of \(\hat{\Sigma }\), then it will turn out that for \(R \in \mathcal {M}_{k \times d}\) having i.i.d zero-mean Gaussian entries \(r_{ij} \sim \mathcal {N}(0,1)\), if \( k \in \{1,\ldots ,\rho -2\}\) then this expectation is indeed defined for each entry. From Eq. (2.1) we see that, for a fixed training set, in order to quantify the error of the ensemble it is enough to consider the expectation (w.r.t random matrices \(R\)):

$$\begin{aligned} \text {E}\left[ R^\mathrm{T} \left( R \hat{\Sigma }R^\mathrm{T} \right) ^{-1} R \right] \end{aligned}$$
(2.2)

Before continuing, we should note that for the case \( k \in \{1,\ldots ,\rho -2\}\) Marzetta et al. (2011) provide a procedure to compute this expectation exactly. However, we are more interested in how this expectation relates to characteristics of the maximum likelihood estimate of the sample covariance \(\hat{\Sigma }\), since we shall see in Theorem 3.2 that improving the conditioning of this matrix has a direct impact on the generalization error of the FLD classifier. Our approach and proof techniques are therefore very different from those followed by Marzetta et al. (2011): specifically, we bound this expectation from both sides in the positive semi-definite ordering in order to provide an estimate of the extreme eigenvalues of the inverse covariance matrix implemented by our ensemble.

3 Theory

Our main theoretical results are the following three theorems: the first characterizes the regularization effect of our ensemble, while the second bounds the generalization error of the ensemble for an arbitrary training set of size \(N\) in the case of multivariate Gaussian class-conditional distributions with shared covariance. The third is a finite sample generalization of the negative result of Bickel and Levina (2004) showing that when the data dimension \(d\) is large compared to the rank of \(\hat{\Sigma }\) (which is a function of the sample size) then, with high probability, pseudoinverted FLD performs poorly.

Theorem 3.1

(Regularization) Let \(\hat{\Sigma } \in \mathcal {M}_{d \times d}\) be a symmetric positive semi-definite matrix with rank \(\rho \in \{3,\ldots ,d-1\}\), and denote by \(\lambda _{\max }(\hat{\Sigma }), \lambda _{\min \ne 0}(\hat{\Sigma }) > 0\) its greatest and least non-zero eigenvalues. Let \(k < \rho -1\) be a positive integer, and let \(R \in \mathcal {M}_{k \times d}\) be a random matrix with i.i.d \(\mathcal {N}(0,1)\) entries. Let \(S^{-1} {:=} \text {E}\left[ R^\mathrm{T} \left( R \hat{\Sigma } R^\mathrm{T} \right) ^{-1} R \right]\), and denote by \(\kappa (S^{-1})\) its condition number, \(\kappa (S^{-1}) = \lambda _{\max }(S^{-1})/\lambda _{\min }(S^{-1})\). Then:

$$\begin{aligned} \kappa (S^{-1}) \leqslant \frac{\rho }{\rho -k-1} \cdot \frac{\lambda _{\max }(\hat{\Sigma })}{\lambda _{\min \ne 0}(\hat{\Sigma })} \end{aligned}$$

This theorem implies that for a large enough ensemble the condition number of the average of random matrices \(\frac{1}{M} \sum _{i=1}^{M}R_{i}^\mathrm{T} \left( R_{i} \hat{\Sigma } R_{i}^\mathrm{T} \right) ^{-1} R_{i}\) is bounded. Of course, any one of these summands \(R_{i}^\mathrm{T} \left( R_{i} \hat{\Sigma } R_{i}^\mathrm{T} \right) ^{-1} R_{i}\) is singular by construction. On the other hand, if we look at the decision rule of a single randomly projected classifier in the \(k\)-dimensional space,

$$\begin{aligned} \hat{h}_{R}(x_{q}){:=}\mathbf {1} \left\{ \!(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T}R^\mathrm{T}\left( R\hat{\Sigma }R^\mathrm{T}\right) ^{-1}R\left( \!x_{q}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2}\right) \!\!>\! 0 \right\} \end{aligned}$$
(3.1)

we have that, for all \(z \ne 0\), \(Rz \ne 0\) almost surely, and \(R\hat{\Sigma }R^\mathrm{T}\) is full rank almost surely; therefore with probability 1 the \(k\)-dimensional system in (3.1) is well-posed.

The significance of this theorem from a generalization error analysis point of view stems from the fact that the rank deficient maximum-likelihood covariance estimate has unbounded condition number and, as we see below in Theorem 3.2, (an upper bound on) the generalization error of FLD increases as a function of the condition number of the covariance estimate employed. In turn, the bound given in our Theorem 3.1 depends on the extreme non-zero eigenvalues of \(\hat{\Sigma }\), its rank \(\rho \), and the subspace dimension \(k\), which are all finite for any particular training set instance. We should also note that the subspace dimension \(k\) is a parameter that we can choose, and in what follows \(k\) therefore acts as the integer regularization parameter in our setting.
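As an informal numerical illustration of Theorem 3.1 (not part of the proof), one can compare the condition number of a Monte Carlo approximation of \(S^{-1}\), obtained by averaging over many draws of \(R\), against the stated bound; the data-generating choices and function name below are ours and arbitrary:

```python
import numpy as np
rng = np.random.default_rng(2)

def mc_expectation(Sigma_hat, k, n_samples=5000):
    """Monte Carlo approximation of E[R^T (R Sigma_hat R^T)^{-1} R], cf. Eq. (2.2)."""
    d = Sigma_hat.shape[0]
    acc = np.zeros((d, d))
    for _ in range(n_samples):
        R = rng.standard_normal((k, d))
        acc += R.T @ np.linalg.solve(R @ Sigma_hat @ R.T, R)
    return acc / n_samples

d, N, k = 100, 30, 10
X = rng.standard_normal((N, d)) * np.linspace(0.5, 3.0, d)   # anisotropic toy data
Sigma_hat = np.cov(X, rowvar=False, bias=True)                # ML estimate, rank <= N - 1
rho = np.linalg.matrix_rank(Sigma_hat)

eigs = np.linalg.eigvalsh(mc_expectation(Sigma_hat, k))
nz = np.linalg.eigvalsh(Sigma_hat)
nz = nz[nz > 1e-10]                                  # non-zero eigenvalues of Sigma_hat
print(eigs.max() / eigs.min())                       # condition number of the MC estimate
print(rho / (rho - k - 1) * nz.max() / nz.min())     # the bound of Theorem 3.1
```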

Theorem 3.2

(Tail bound on generalization error of the converged ensemble) Let \(\mathcal {T}= \{(x_i,y_i)\}_{i=1}^{N}\) be a set of training data of size \(N = N_0 + N_1\), subject to \(N<d\) and \(N_y > 1\) \(\forall y\). Let \(x_{q}\) be a query point with Gaussian class-conditionals \(x_{q}|(y_{q}= y) \sim \mathcal {N}(\mu _y, \Sigma )\), with \(\Sigma \) full rank and let \(\Pr \{y_{q}= y\}=\pi _y\). Let \(\rho \) be the rank of the maximum likelihood estimate of the covariance matrix and let \(k < \rho -1\) be a positive integer. Then for any \(\delta \in (0,1)\) and any training set of size \(N\), the generalization error of the converged ensemble of randomly projected FLD classifiers is upper-bounded with probability at least \(1-\delta \) by the following:

$$\begin{aligned} \Pr _{x_{q},y_{q}}(\hat{h}_{ens}(x_{q}) \ne y_{q}) \leqslant \sum _{y=0}^{1} \pi _y \Phi \left( - \left[ g \left( \bar{\kappa } \left( \sqrt{2\log \frac{5}{\delta }} \right) \right) \times \left[ \sqrt{\Vert \Sigma ^{-\frac{1}{2}}(\mu _{1}-\mu _{0}) \Vert ^{2}+ \frac{dN}{N_{0}N_{1}}} - \sqrt{\frac{2N}{N_0 N_1}\log \frac{5}{\delta }} \right] _{+} - \sqrt{\frac{d}{N_y}} \left( 1+\sqrt{\frac{2}{d}\log \frac{5}{\delta }}\right) \right] \right) \end{aligned}$$

where \(\Phi \) is the c.d.f of the standard Gaussian, \(\bar{\kappa }(\epsilon )\) is a high probability (w.r.t. the random draw of training set) upper bound on the condition number of \(\Sigma \hat{S}^{-1}\) given by Eq. (4.17) and \(g(\cdot )\) is the function \(g(a) {:=} \frac{\sqrt{a}}{1 + a}\).

The principal terms in this bound are: (i) The function \(g:[1,\infty ) \rightarrow (0,\frac{1}{2}]\) which is a decreasing function of its argument and here captures the effect of the mismatch between the estimated model covariance matrix \(\hat{S}^{-1}\) and the true class-conditional covariance \(\Sigma \), via a high-probability upper bound on the condition number of \(\hat{S}^{-1}\Sigma \); (ii) The Mahalanobis distance between the two class centres which captures the fact that the better separated the classes are the smaller the generalization error should be; and (iii) antagonistic terms involving the sample size (\(N\)) and the number of training examples in each class (\(N_{0},N_{1}\)), which capture the effect of class (im)balance—the more evenly the training data are split, the tighter the bound.

We note that a bound on generalization error with similar behaviour can be obtained for the much larger family of sub-Gaussian distributions, or when the true class-conditional covariance matrices are taken to be different (see e.g. Durrant and Kabán 2010; Durrant 2013). Therefore the distributional assumptions of Theorem 3.2 are not crucial.
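To make the interplay of these terms concrete, the following sketch evaluates the bound of Theorem 3.2 for user-supplied problem characteristics. It assumes balanced class priors \(\pi _0=\pi _1=1/2\), takes \(\kappa (\Sigma )\) and the Mahalanobis distance \(\Vert \Sigma ^{-1/2}(\mu _1-\mu _0)\Vert \) as inputs, and uses \(\bar{\kappa }\) as defined in Eq. (4.17); all function names and the illustrative parameter values are ours, and the bound is numerically loose unless the classes are well separated:

```python
import math

def Phi(x):
    """Standard Gaussian c.d.f."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def g(a):
    return math.sqrt(a) / (1.0 + a)

def kappa_bar(eps, N, d, k, rho, kappa_Sigma):
    """The condition number bound of Eq. (4.17)."""
    num = (math.sqrt(N - 2) + math.sqrt(d) + eps) ** 2 * (1 + rho / k * kappa_Sigma)
    den = (math.sqrt(N - 2) - math.sqrt(k) - eps) ** 2
    return num / den

def theorem_3_2_bound(delta, N0, N1, d, k, rho, kappa_Sigma, mahalanobis, pi0=0.5):
    """Evaluate the upper bound of Theorem 3.2; mahalanobis is
    ||Sigma^{-1/2}(mu_1 - mu_0)|| and kappa_Sigma is kappa(Sigma)."""
    N = N0 + N1
    log_term = math.log(5.0 / delta)
    kb = kappa_bar(math.sqrt(2 * log_term), N, d, k, rho, kappa_Sigma)
    bracket = max(math.sqrt(mahalanobis ** 2 + d * N / (N0 * N1))
                  - math.sqrt(2 * N / (N0 * N1) * log_term), 0.0)
    bound = 0.0
    for pi_y, Ny in ((pi0, N0), (1 - pi0, N1)):
        arg = g(kb) * bracket - math.sqrt(d / Ny) * (1 + math.sqrt(2 / d * log_term))
        bound += pi_y * Phi(-arg)
    return bound

# Hypothetical, illustrative parameters (N < d, well-separated classes):
print(theorem_3_2_bound(delta=0.05, N0=400, N1=400, d=1000, k=200,
                        rho=798, kappa_Sigma=1.5, mahalanobis=40.0))  # approx. 0.18
```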

Theorem 3.3

(High probability lower bound on generalization error of pseudoinverted FLD) For any \(\delta \in (0,1)\), and any data set of size \(N_0+N_1=N\), assuming Gaussian classes with shared covariance and \(\kappa (\Sigma )<\infty \) , the generalization error of pseudo-inverted FLD is lower-bounded with probability at least \(1-\delta \) over the random draws of training set by:

$$\begin{aligned} \Pr (\hat{h}_{+}(x_{q}) \ne y_q) \geqslant \Phi \left( -\frac{1}{2} \sqrt{1+\sqrt{\frac{8}{N}\log \frac{2}{\delta }}} \times \left( 1+\sqrt{\frac{2\lambda _{\max }(\Sigma )\log (2/\delta )}{\text {Tr}(\Sigma )+\Vert \mu _1-\mu _0 \Vert ^2 \frac{N_0N_1}{N}}}\right) \times \sqrt{\frac{\rho }{d} \cdot \frac{\Vert \mu _1-\mu _0\Vert ^2 +\text {Tr}(\Sigma )\frac{N}{N_0N_1}}{\lambda _{\min }(\Sigma )}}\right) \end{aligned}$$

where \(\Phi \) is the c.d.f of the standard Gaussian.

It is interesting to notice that this lower bound depends on the rank of the covariance estimate, not on its fit to the true covariance \(\Sigma \). In particular, when \(N \ll d\) the bound explains the poor performance of pseudo-inverted FLD: \(\rho \), the rank of \(\hat{\Sigma }\), is at most \(\min \{N-2,d\}\), and the lower bound of Theorem 3.3 increases as \(\rho /d\) decreases. Allowing the dimensionality \(d\) to grow, as in Bickel and Levina (2004), so that \(\rho /d \rightarrow 0\), the lower bound of Theorem 3.3 converges to \(\Phi (0) = 1/2\), that is, to random guessing.
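For intuition, the lower bound of Theorem 3.3 is straightforward to evaluate numerically; the sketch below (the function name and the illustrative parameters are ours) does so for a hypothetical small-sample problem with identity covariance:

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def theorem_3_3_lower_bound(delta, N0, N1, d, rho, mu_gap_sq,
                            tr_Sigma, lam_max, lam_min):
    """Evaluate the lower bound of Theorem 3.3 on the error of pseudo-inverted
    FLD; mu_gap_sq is ||mu_1 - mu_0||^2 and the remaining covariance quantities
    refer to the true class-conditional covariance Sigma."""
    N = N0 + N1
    log_term = math.log(2.0 / delta)
    t1 = math.sqrt(1 + math.sqrt(8.0 / N * log_term))
    t2 = 1 + math.sqrt(2 * lam_max * log_term
                       / (tr_Sigma + mu_gap_sq * N0 * N1 / N))
    t3 = math.sqrt(rho / d * (mu_gap_sq + tr_Sigma * N / (N0 * N1)) / lam_min)
    return Phi(-0.5 * t1 * t2 * t3)

# Hypothetical setting: N = 50 points in d = 1000 dimensions, Sigma = I
# (so Tr(Sigma) = d and both extreme eigenvalues equal 1), rho = N - 2.
print(theorem_3_3_lower_bound(delta=0.05, N0=25, N1=25, d=1000, rho=48,
                              mu_gap_sq=1.0, tr_Sigma=1000.0,
                              lam_max=1.0, lam_min=1.0))  # approx. 0.08
```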

4 Proofs

4.1 Proof of Theorem 3.1

Estimating the condition number of \(\text {E}\left[ R^\mathrm{T} \left( R \hat{\Lambda } R^\mathrm{T} \right) ^{-1} R \right]\) is the key step underpinning our generalization error results. We will make use of the following two easy, but useful, lemmas:

Lemma 4.1

(Unitary invariance) Let \(R \in \mathcal {M}_{k \times d}\) with \(r_{ij} \overset{\text {i.i.d}}{\sim }\mathcal {N}(0,\sigma ^2)\). Let \(\hat{\Sigma }\) be any symmetric positive semi-definite matrix, and let \(\hat{U}\) be a unitary matrix such that \(\hat{\Sigma }= \hat{U}\hat{\Lambda } \hat{U}^\mathrm{T}\), where \(\hat{\Lambda }\) is a diagonal matrix with the eigenvalues of \(\hat{\Sigma }\) in descending order along the diagonal. Then:

$$\begin{aligned} \text {E}\left[ R^\mathrm{T} \left( R \hat{\Sigma }R^\mathrm{T} \right) ^{-1} R \right] =\hat{U}\text {E}\left[ R^\mathrm{T} \left( R\hat{\Lambda } R^\mathrm{T} \right) ^{-1} R \right]\hat{U}^\mathrm{T} \end{aligned}$$

Lemma 4.2

(Expected preservation of eigenvectors) Let \(\hat{\Lambda }\) be a diagonal matrix, then \(\text {E}\left[ R^\mathrm{T} \left( R\hat{\Lambda } R^\mathrm{T} \right) ^{-1} R \right]\) is a diagonal matrix.

Furthermore, if \(\hat{U}\) diagonalizes \(\hat{\Sigma }\) as \(\hat{\Sigma }=\hat{U}\hat{\Lambda } \hat{U}^\mathrm{T}\), then \(\hat{U}\) also diagonalizes \(\text {E}\left[ R^\mathrm{T} \left( R \hat{\Sigma }R^\mathrm{T} \right) ^{-1} R \right]\).

We omit the proofs which are straightforward and can be found in Marzetta et al. (2011).

Now, it follows from Lemmas 4.1 and 4.2 that at convergence our ensemble preserves the eigenvectors of \(\hat{\Sigma }\), and so we only need to consider the diagonal entries (i.e. the eigenvalues) of \(\text {E}\left[ R^\mathrm{T} \left( R\hat{\Lambda } R^\mathrm{T} \right) ^{-1} R \right]\), which we now do. To fix ideas we will look first at the case \(k=1\), when we are projecting the high dimensional data on to a single line for each classifier in the ensemble. In this case the ith diagonal element of \(\text {E}\left[ R^\mathrm{T} \left( R \hat{\Lambda } R^\mathrm{T} \right) ^{-1} R \right] \) is \(\text {E}\left[\frac{r_{i}^2}{\sum _{j=1}^{\rho } \lambda _{j} r_{j}^2} \right]\), where \(r_i\) is the ith entry of the single row matrix \(R\). This can be upper and lower bounded as:

$$\begin{aligned} \frac{1}{\lambda _{\max }} \text {E}\left[\frac{r_{i}^2}{\sum _{j=1}^\rho r_{j}^2} \right]\leqslant \text {E}\left[\frac{r_{i}^2}{\sum _{j=1}^\rho \lambda _j r_{j}^2} \right] \leqslant \frac{1}{\lambda _{\min \ne 0}}\text {E}\left[\frac{r_{i}^2}{\sum _{j=1}^\rho r_{j}^2} \right] \end{aligned}$$

where \(\lambda _{\min \ne 0}\) denotes the smallest nonzero eigenvalue of \(\hat{\Lambda }\) (and of \(\hat{\Sigma }\)), and \(\lambda _{\max }\) its largest eigenvalue.

Recall that as a result of Lemmas 4.1 and 4.2 we only need consider the diagonal entries of this expectation as the off-diagonal terms are known to be zero.

Now, we evaluate the remaining expectation. There are two cases. If \(i>\rho \) then \(r_i\) is independent of the denominator and we have \(\text {E}\left[\frac{r_{i}^2}{\sum _{j=1}^\rho r_{j}^2} \right] =\text {E}\left[r_{i}^2 \right]\text {E}\left[1/\sum _{j=1}^\rho r_{j}^2 \right] = \frac{1}{\rho -2}\), where we used the expectation of the inverse-\(\chi ^2\) distribution with \(\rho \) degrees of freedom and the fact that \(\text {E}\left[r_{i}^2 \right]=1\). When \(i\leqslant \rho \), by symmetry of the \(\rho \) exchangeable terms that sum to one, we have \(\text {E}\left[\frac{r_{i}^2}{\sum _{j=1}^{\rho } r_{j}^2 } \right] = \text {E}\left[\frac{r_{i}^2}{\Vert r\Vert ^2} \right] = \frac{1}{\rho }\). That is,

$$\begin{aligned} \text {E}\left[\text {diag}\left( \frac{r_{i}^2}{\sum _{j=1}^\rho r_{j}^2} \right) \right] = \left[ \begin{array}{c|c} \frac{1}{\rho } I_\rho &{} 0 \\ \hline 0 &{} \frac{1}{\rho -2} I_{d-\rho } \end{array}\right] \end{aligned}$$

and so \(\text {E}\left[ R^\mathrm{T} \left( R \hat{\Lambda } R^\mathrm{T} \right) ^{-1} R \right]\) is full rank, hence invertible. Its inverse may be seen as a regularized covariance estimate in the data space, and its condition number, \(\kappa \), is upper bounded by:

$$\begin{aligned} \kappa \leqslant \frac{\rho }{\rho -2} \cdot \frac{\lambda _{\max }}{\lambda _{\min \ne 0}} \end{aligned}$$

whereas in the setting \(N < d\) the ML covariance estimate has unbounded condition number.
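The block structure just derived for the \(k=1\) case is easy to confirm numerically; a small Monte Carlo sketch of the expectation \(\text {E}[r_i^2/\sum _{j=1}^{\rho } r_j^2]\) (illustrative only, with arbitrary \(\rho \) and \(d\)):

```python
import numpy as np
rng = np.random.default_rng(3)

d, rho, n_mc = 8, 5, 200000
r = rng.standard_normal((n_mc, d))
denom = (r[:, :rho] ** 2).sum(axis=1)           # sum over the rho non-zero directions
diag_est = (r ** 2 / denom[:, None]).mean(axis=0)
print(np.round(diag_est, 3))
# Expect 1/rho = 0.2 in the first rho positions and 1/(rho - 2) = 0.333 in the rest.
```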

For the general \(k < \rho -1\) case we write \(R\) as a concatenation of two matrices \(R=[P, S]\) where \(P\) is \(k \times \rho \) and \(S\) is \(k \times (d-\rho )\), so that \(\text {E}\left[ R^\mathrm{T} \left( R \hat{\Lambda } R^\mathrm{T} \right) ^{-1} R \right]\) can be decomposed as two diagonal blocks:

$$\begin{aligned} \left[ \begin{array}{c|c} \text {E}[ P^\mathrm{T} \left( P \hat{\underline{\Lambda }} P^\mathrm{T} \right) ^{-1} P ] &{} 0 \\ \hline 0 &{} \text {E}[ S^\mathrm{T} \left( P \hat{\underline{\Lambda }} P^\mathrm{T} \right) ^{-1} S ] \end{array}\right] \end{aligned}$$

where here in \(P \hat{\underline{\Lambda }}P^\mathrm{T}\) we use \( \hat{\underline{\Lambda }}\) to denote the \(\rho \times \rho \) positive definite upper block of the positive semi-definite matrix \(\hat{\Lambda }\). Now, rewrite the upper block to orthonormalize \(P\) as the following: \(\text {E}[ P^\mathrm{T} \left( P \hat{\underline{\Lambda }} P^\mathrm{T} \right) ^{-1} P ]=\)

$$\begin{aligned} \text {E}[ P^\mathrm{T} \!(PP^\mathrm{T})^{- \frac{1}{2}} \! \left( \!(PP^\mathrm{T})^{- \frac{1}{2}} \!\! P \hat{\underline{\Lambda }} P^\mathrm{T}\!(PP^\mathrm{T})^{- \frac{1}{2}} \right) ^{-1} \!\!\!(PP^\mathrm{T})^{- \frac{1}{2}}P ] \end{aligned}$$

Denoting by \(P_i\) the ith column of \(P\), we can write and bound the ith diagonal element as:

$$\begin{aligned}&\text {E}[ P_i^\mathrm{T} (PP^\mathrm{T})^{- \frac{1}{2}} \left( (PP^\mathrm{T})^{- \frac{1}{2}}P \hat{\underline{\Lambda }} P^\mathrm{T}(PP^\mathrm{T})^{- \frac{1}{2}} \right) ^{-1} (PP^\mathrm{T})^{- \frac{1}{2}}P_i ]\\&\leqslant \text {E}\left[ \frac{ P_i^\mathrm{T} (PP^\mathrm{T})^{-1}P_i }{\lambda _{\min }((PP^\mathrm{T})^{- \frac{1}{2}}P \hat{\underline{\Lambda }} P^\mathrm{T}(PP^\mathrm{T})^{- \frac{1}{2}} ) } \right]\\&\leqslant \text {E}\left[ \frac{ P_i^\mathrm{T} (PP^\mathrm{T})^{-1}P_i }{\lambda _{\min \ne 0}} \right] \end{aligned}$$

with \(\lambda _{\min \ne 0}\) the smallest non-zero eigenvalue of \(\hat{\Lambda }\) as before, and where we used the Rayleigh quotient and the Poincaré separation theorem respectively (e.g. Horn and Johnson 1985, Thm 4.2.2, Corr 4.3.16). This holds for all \(i\), so collecting these diagonal bounds (the off-diagonal entries being zero) we have:

$$\begin{aligned} \text {E}[P^\mathrm{T} (PP^\mathrm{T})^{-1}P]/\lambda _{\min \ne 0} \succcurlyeq \text {E}\left[ P^\mathrm{T} ( P \hat{\underline{\Lambda }} P^\mathrm{T})^{-1} P \right] \end{aligned}$$
(4.1)

where \(A\succcurlyeq B\) denotes \(A-B\) is positive semi-definite. Now the remaining expectation can be evaluated using the expectation of the \(\rho \)-dimensional Wishart matrix \(P^\mathrm{T}P\) with \(k\) degrees of freedom:

$$\begin{aligned} \text {E}[ P^\mathrm{T} (PP^\mathrm{T})^{-1}P] = \text {E}[P^\mathrm{T}P]/\rho = \frac{k}{\rho }\cdot I_{\rho } \end{aligned}$$

Similarly to equation (4.1) we can also show that:

$$\begin{aligned} \text {E}\left[P^\mathrm{T} (P \hat{\underline{\Lambda }} P^\mathrm{T})^{-1}P \right] \succcurlyeq \text {E}[ P^\mathrm{T} \left( PP^\mathrm{T} \right) ^{-1} P ]/\lambda _{\max } \end{aligned}$$
(4.2)

in much the same way. Put together, the diagonal elements in the upper block are all in the interval:

$$\begin{aligned} \left[\frac{1}{\lambda _{\max }} \frac{k}{\rho }, \frac{1}{\lambda _{\min \ne 0}} \frac{k}{\rho } \right] \end{aligned}$$

Hence, we see that in this upper block the condition number is reduced in comparison to that of \(\hat{\Lambda }\) restricted to its column space:

$$\begin{aligned} \frac{\lambda _{\max }(\text {E}[P^\mathrm{T}(P \hat{\underline{\Lambda }} P^\mathrm{T})^{-1}P])}{\lambda _{\min }(\text {E}[P^\mathrm{T}(P \hat{\underline{\Lambda }} P^\mathrm{T})^{-1}P])} \leqslant \frac{\lambda _{\max }(\hat{\Lambda })}{\lambda _{\min \ne 0}(\hat{\Lambda })} \end{aligned}$$

That is, in the range of \(\hat{\Sigma }\), the ensemble has the effect of a shrinkage regularizer (Ledoit and Wolf 2004). Next, we consider its effect in the null space of \(\hat{\Sigma }\).

The lower block is \(\text {E}\left[ S^\mathrm{T} ( P \hat{\underline{\Lambda }} P^\mathrm{T} )^{-1} S \right] = \text {Tr}\left( \text {E}\left[( P \hat{\underline{\Lambda }} P^\mathrm{T} )^{-1} \right] \right) \cdot I_{d-\rho }\) since \(S\) is independent of \(P\). We again rewrite this to orthonormalize \(P\). Going through similar steps, we obtain: \(\text {Tr}\left( \text {E}\left[( P \hat{\underline{\Lambda }} P^\mathrm{T} )^{-1} \right] \right) =\)

$$\begin{aligned}&\text {Tr}\left( \! \text {E}\left[\! \left( PP^\mathrm{T} \right) ^{- \frac{1}{2}}\! \left( \! \left( PP^\mathrm{T} \right) ^{- \frac{1}{2}}\!\! P \hat{\underline{\Lambda }} P^\mathrm{T} \!\! \left( PP^\mathrm{T} \right) ^{- \frac{1}{2}}\! \right) ^{-1}\!\! \left( PP^\mathrm{T} \right) ^{- \frac{1}{2}} \right] \right) \\&\quad \leqslant \frac{\text {Tr}\left( \text {E}\left[ \left( PP^\mathrm{T} \right) ^{-1} \right] \right) }{\lambda _{\min \ne 0}} = \frac{k}{\rho - k -1}\cdot \frac{1}{\lambda _{\min \ne 0}} \end{aligned}$$

where we used the expectation of the inverse Wishart. Likewise,

$$\begin{aligned} \text {Tr}\left( \text {E}\left[ \left( P \hat{\underline{\Lambda }} P^\mathrm{T} \right) ^{-1} \right] \right) \geqslant \frac{k}{\rho - k -1} \cdot \frac{1}{\lambda _{\max }} \end{aligned}$$
(4.3)

Hence, the lower block is a multiple of \(I_{d-\rho }\) with the coefficient in the interval:

$$\begin{aligned} \left[\frac{k}{\rho - k -1}\frac{1}{\lambda _{\max }},\frac{k}{\rho - k -1}\frac{1}{\lambda _{\min \ne 0}} \right] \end{aligned}$$

That is, in the null space of \(\hat{\Sigma }\) the ensemble acts as a ridge regularizer (Hastie et al. 2001).

Putting everything together, the condition number of the covariance (or inverse covariance) estimate is upper bounded by:

$$\begin{aligned} \kappa \leqslant \frac{\rho }{\rho -k-1} \cdot \frac{\lambda _{\max }}{\lambda _{\min \ne 0}} \end{aligned}$$
(4.4)

which we see reduces to the bound obtained above for the case \(k=1\). \(\square \)
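As an informal check of the block structure and intervals derived above (not part of the proof), one can Monte Carlo average \(R^\mathrm{T}(R\hat{\Lambda }R^\mathrm{T})^{-1}R\) for a small diagonal \(\hat{\Lambda }\) and compare the diagonal entries of the two blocks with their claimed intervals; the eigenvalues and sizes chosen below are arbitrary:

```python
import numpy as np
rng = np.random.default_rng(4)

rho, d, k, n_mc = 8, 12, 3, 50000
lam = np.linspace(4.0, 0.5, rho)                  # non-zero eigenvalues of Lambda_hat
Lam = np.diag(np.concatenate([lam, np.zeros(d - rho)]))

acc = np.zeros((d, d))
for _ in range(n_mc):
    R = rng.standard_normal((k, d))
    acc += R.T @ np.linalg.solve(R @ Lam @ R.T, R)
E = acc / n_mc                                    # approx. E[R^T (R Lambda_hat R^T)^{-1} R]

upper, lower = np.diag(E)[:rho], np.diag(E)[rho:]
print(upper.min(), upper.max())                   # shrinkage block: within [k/(rho*lam_max), k/(rho*lam_min)]
print(k / (rho * lam.max()), k / (rho * lam.min()))
print(lower.min(), lower.max())                   # ridge block: approx. constant, within the interval below
print(k / ((rho - k - 1) * lam.max()), k / ((rho - k - 1) * lam.min()))
```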

4.2 Proof of Theorem 3.2

Traditionally ensemble methods are regarded as ‘meta-learning’ approaches and although bounds exist (e.g. Koltchinskii and Panchenko 2002) there are, to the best of our knowledge, no results giving the exact analytical form of the generalization error of any particular ensemble. Indeed, in general it is not analytically tractable to evaluate the generalization error exactly, so one can only derive bounds. Because we deal with a particular ensemble of FLD classifiers we are able to derive the exact generalization error of the ensemble in the case of Gaussian classes with shared covariance \(\Sigma \), the setting in which FLD is Bayes’ optimal. This allows us to explicitly connect the performance of the ensemble to its data space analogue. As noted earlier, an upper bound on generalization error with similar behaviour can be derived for the much larger class of sub-Gaussian distributions (see e.g. Durrant and Kabán 2010; Durrant 2013), therefore this Gaussianity assumption is not crucial.

We proceed in two steps: (1) Obtain the generalization error of the ensemble conditional on a fixed training set; (2) Bound the deviation of this error caused by a random draw of a training set.

4.2.1 Generalization error of the ensemble for a fixed training set

For a fixed training set, the generalization error is given by the following lemma:

Lemma 4.3

(Exact generalization error with Gaussian classes) Let \(x_{q}|(y_q=y) \sim \mathcal {N}(\mu _y,\Sigma )\), where \(\Sigma \in \mathcal {M}_{d \times d}\) is a full rank covariance matrix, and let \(\pi _y {:=} \text {Pr}\{y_{q}= y\}\). Let \(R \in \mathcal {M}_{k \times d}\) be a random projection matrix with i.i.d. zero-mean Gaussian entries and denote \(\hat{S}^{-1} {:=} \text {E}_R \left[ R^\mathrm{T} \left( R \hat{\Sigma }R^\mathrm{T} \right) ^{-1} R \right] \). Then the exact generalization error of the converged randomly projected ensemble classifier (2.1) is given by \(\Pr _{(x_{q},y_q)}\{\hat{h}_{ens}(x_{q}) \ne y_q\}=\)

$$\begin{aligned} \sum _{y=0}^{1} \pi _y \Phi \left( - \frac{1}{2} \frac{ (\hat{\mu }_{\lnot y} - \hat{\mu }_y)^\mathrm{T} \hat{S}^{-1} (\hat{\mu }_{0}+ \hat{\mu }_{1}- 2\mu _{y}) }{\sqrt{(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} \hat{S}^{-1} \Sigma \hat{S}^{-1}(\hat{\mu }_{1}- \hat{\mu }_{0})}}\right) \end{aligned}$$
(4.5)

where \(\Phi \) is the c.d.f of the standard Gaussian.

The proof of this lemma is similar in spirit to the one for a single FLD in Pattison and Gossink (1999). For completeness we give it below.
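Since Lemma 4.3 is in closed form, it can be evaluated directly once estimates of the class means and of \(\hat{S}^{-1}\) are available (for instance a Monte Carlo average over random projections, as sketched in Sect. 3). A minimal sketch, with names of our choosing:

```python
import math
import numpy as np

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exact_ensemble_error(mu0_hat, mu1_hat, S_inv, mu0, mu1, Sigma, pi0=0.5):
    """Evaluate Eq. (4.5) for given estimates and true class parameters.
    S_inv plays the role of E[R^T (R Sigma_hat R^T)^{-1} R]."""
    diff = mu1_hat - mu0_hat
    denom = math.sqrt(diff @ S_inv @ Sigma @ S_inv @ diff)
    err = 0.0
    for pi_y, mu_y, sign in ((pi0, mu0, +1.0), (1 - pi0, mu1, -1.0)):
        num = sign * (diff @ S_inv @ (mu0_hat + mu1_hat - 2 * mu_y))
        err += pi_y * Phi(-0.5 * num / denom)
    return err
```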

4.3 Proof of Lemma 4.3

Without loss of generality let \(x_{q}\) have label \(0\). By assumption the classes have Gaussian distribution \(\mathcal {N}(\mu _y,\Sigma )\) so then the probability that \(x_{q}\) is misclassified by the converged ensemble is given by:

$$\begin{aligned} \text {Pr}_{x_{q}|y_q=0} \left\{ (\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} \hat{S}^{-1} \left( x_{q}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2} \right) > 0 \right\} \end{aligned}$$
(4.6)

Define \(a^\mathrm{T}{:=} (\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T}\hat{S}^{-1}\) and observe that if \(x_{q}\sim \mathcal {N}(\mu _{0}, \Sigma )\) then:

$$\begin{aligned} \left( x_{q}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2} \right) \sim \mathcal {N}\left( \left( \mu _{0}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2} \right) , \Sigma \right) \end{aligned}$$

and so:

$$\begin{aligned} a^\mathrm{T} \left( x_{q}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2} \right) \sim \mathcal {N}\left( a^\mathrm{T} \left( \mu _{0}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2} \right) , a^\mathrm{T} \Sigma a \right) \end{aligned}$$

which is a univariate Gaussian. Therefore:

$$\begin{aligned} \frac{a^\mathrm{T} \left( x_{q}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2} \right) - a^\mathrm{T} \left( \mu _{0}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2} \right) }{\sqrt{a^\mathrm{T} \Sigma a}} \sim \mathcal {N}(0,1) \end{aligned}$$

Hence, for the query point \(x_{q}\) we have the probability (4.6) is given by:

$$\begin{aligned}&\Phi \left( \frac{a^\mathrm{T} \left( \mu _{0}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2}\right) }{\sqrt{a^\mathrm{T} \Sigma a}} \right) = \Phi \left( - \frac{1}{2} \frac{(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} \hat{S}^{-1} (\hat{\mu }_{0}+ \hat{\mu }_{1}- 2\mu _{0}) }{\sqrt{(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} \hat{S}^{-1} \Sigma \hat{S}^{-1} (\hat{\mu }_{1}- \hat{\mu }_{0})}}\right) \end{aligned}$$
(4.7)

where \(\Phi \) is the c.d.f of the standard Gaussian.

A similar argument deals with the case when \(x_{q}\) belongs to class \(1\), and applying the law of total probability completes the proof. \(\square \)

Indeed equation (4.5) has the same form as the error of the data space FLD (see e.g. Bickel and Levina 2004; Pattison and Gossink 1999), and the converged ensemble, inspected in the original data space, produces exactly the same mean estimates and covariance eigenvector estimates as FLD working on the original data set. However, it has different eigenvalue estimates, which result from the sophisticated regularization scheme that we analyzed in Sect. 4.1.

4.3.1 Tail bound on the generalization error of the ensemble

The previous section gave the exact generalization error of our ensemble conditional on a given training set. In this section our goal is to derive an upper bound with high probability on the ensemble generalization error w.r.t. random draws of the training set.

We will use the following concentration lemma:

Lemma 4.4

(Concentration bound on exponential random variables) Let \(X\) be a Gaussian random vector in \(\mathbb {R}^d\) with mean vector \(\text {E}[X] = \mu \) and covariance matrix \(\Sigma \). Let \(\epsilon > 0\). Then:

$$\begin{aligned}&\text {Pr}\left\{ \Vert X \Vert ^{2} \geqslant \left( 1+\epsilon \right) \left( \text {Tr}\left( \Sigma \right) +\Vert \mu \Vert ^{2}\right) \right\} \leqslant \exp \left( - \frac{\text {Tr}(\Sigma )+\Vert \mu \Vert ^{2}}{2 \lambda _{\max }(\Sigma )}\left( \sqrt{1+\epsilon }-1 \right) ^2 \right) \nonumber \\ \end{aligned}$$
(4.8)

Furthermore, if \(\epsilon \in (0,1)\):

$$\begin{aligned} \text {Pr}\left\{ \Vert X \Vert ^{2} \leqslant \left( 1-\epsilon \right) \left( \text {Tr}\left( \Sigma \right) +\Vert \mu \Vert ^{2}\right) \right\} \leqslant \exp \left( - \frac{\text {Tr}(\Sigma )+\Vert \mu \Vert ^{2}}{2 \lambda _{\max }(\Sigma )}\left( \sqrt{1-\epsilon }-1 \right) ^2 \right) \qquad \end{aligned}$$
(4.9)

The proof, which follows immediately from the more general result we give in Durrant and Kabán (2012), is given in Appendix 1 for completeness. Now we can bound the generalization error of the RP-FLD ensemble. We begin by decomposing the numerator of the generalization error term (for a single class) obtained in Lemma 4.3 as follows:

$$\begin{aligned}&\left( \hat{\mu }_{1}+\hat{\mu }_{0}- 2 \mu _{0}\right) ^\mathrm{T}\hat{S}^{-1} \left( \hat{\mu }_{1}- \hat{\mu }_{0}\right) = \left( \hat{\mu }_{1}-\hat{\mu }_{0}\right) ^\mathrm{T}\!\hat{S}^{-1}\! \left( \hat{\mu }_{1}- \hat{\mu }_{0}\right) \\&\quad + 2 \left( \hat{\mu }_{0}- \mu _{0}\right) ^\mathrm{T}\!\hat{S}^{-1}\! \left( \hat{\mu }_{1}- \hat{\mu }_{0}\right) \end{aligned}$$

Using this decomposition we can rewrite the argument of the first term in Lemma 4.3 in the following form:

$$\begin{aligned} \Phi \left( - \frac{1}{2}[A-B] \right) \end{aligned}$$

where

$$\begin{aligned} A=\frac{\left( \hat{\mu }_{1}-\hat{\mu }_{0}\right) ^\mathrm{T}\hat{S}^{-1} \left( \hat{\mu }_{1}- \hat{\mu }_{0}\right) }{\sqrt{(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} \hat{S}^{-1} \Sigma \hat{S}^{-1}(\hat{\mu }_{1}- \hat{\mu }_{0})}} \end{aligned}$$

and

$$\begin{aligned} B=\frac{2\left( \mu _{0}- \hat{\mu }_{0}\right) ^\mathrm{T}\hat{S}^{-1} \left( \hat{\mu }_{1}- \hat{\mu }_{0}\right) }{\sqrt{(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} \hat{S}^{-1} \Sigma \hat{S}^{-1}(\hat{\mu }_{1}- \hat{\mu }_{0})}} \end{aligned}$$

We will lower bound \(A\) and upper bound \(B\) with high probability over the random draw of training set in order to bound the whole term from above with high probability and, since \(\Phi \) is monotonic increasing in its argument, this will give the upper bound on generalization error.

4.3.2 Lower-bounding the term \(A\)

Applying the Kantorovich inequality (e.g. Horn and Johnson 1985, Thm 7.4.41), \(A\) is lower bounded by:

$$\begin{aligned} \Vert \Sigma ^{-\frac{1}{2}} \left( \hat{\mu }_{1}- \hat{\mu }_{0}\right) \Vert \cdot \frac{2\sqrt{\kappa (\hat{S}^{-\frac{1}{2}} \Sigma \hat{S}^{-\frac{1}{2}})}}{1 + \kappa (\hat{S}^{-\frac{1}{2}} \Sigma \hat{S}^{-\frac{1}{2}})} \end{aligned}$$
(4.10)

where \(\kappa (H) {:=} \frac{\lambda _{\max }(H)}{\lambda _{\min }(H)}\) denotes the condition number of the matrix \(H\).

Next, since \(\Sigma ^{-\frac{1}{2}}\hat{\mu }_{1}\) and \(\Sigma ^{-\frac{1}{2}}\hat{\mu }_{0}\) are independent with \(\Sigma ^{-\frac{1}{2}}\hat{\mu }_{y} \sim \mathcal {N}(\Sigma ^{-\frac{1}{2}}\mu _{y}, I_d/N_{y})\), we have \(\Sigma ^{-\frac{1}{2}}(\hat{\mu }_{1}-\hat{\mu }_{0}) \sim \mathcal {N}(\Sigma ^{-\frac{1}{2}}(\mu _{1}- \mu _{0}), N/(N_{0}N_{1})\cdot I_d )\).

Applying the second concentration bound of Lemma 4.4, (4.9), we have:

$$\begin{aligned} \Vert \Sigma ^{-\frac{1}{2}}(\hat{\mu }_{1}- \hat{\mu }_{0}) \Vert \geqslant \sqrt{\left( 1-\epsilon \right) \left( \frac{d\cdot N}{N_{0}N_{1}} +\Vert \Sigma ^{-\frac{1}{2}}(\mu _{1}-\mu _{0}) \Vert ^{2}\right) } \end{aligned}$$
(4.11)

with probability at least:

$$\begin{aligned} 1 - \exp \left( - \frac{d+\Vert \Sigma ^{-\frac{1}{2}}(\mu _{1}- \mu _{0}) \Vert ^{2}N_0N_1/N }{2} \left( \sqrt{1-\epsilon }-1 \right) ^2 \right) \end{aligned}$$
(4.12)

To complete the bounding of the term \(A\), we denote \(g(a) {:=} \frac{\sqrt{a}}{1 + a}\), and observe that this is a monotonic decreasing function on \([1, \infty )\). So, replacing \(a\) with the condition number \(\kappa (\hat{S}^{-\frac{1}{2}} \Sigma \hat{S}^{-\frac{1}{2}}) \in [1,\infty )\) we need to upper bound this condition number in order to lower bound \(g\). Denoting this upper bound by \(\bar{\kappa }\), which will be quantified in Lemma 4.5, then the term \(A\) is lower bounded with high probability by:

$$\begin{aligned} A \geqslant 2 g(\bar{\kappa })\sqrt{\left( 1-\epsilon \right) \left( \Vert \Sigma ^{-\frac{1}{2}}(\mu _{1}-\mu _{0}) \Vert ^{2} + \frac{d\cdot N}{N_{0}N_{1}} \right) } \end{aligned}$$
(4.13)

4.3.3 Upper-bounding the term \(B\)

We can rewrite \(B\) by inserting \(\Sigma ^{-\frac{1}{2}}\Sigma ^{\frac{1}{2}} = I_d\), and using Cauchy-Schwarz in the numerator to give:

$$\begin{aligned} B \leqslant \frac{2 \Vert \Sigma ^{-\frac{1}{2}}(\mu _{0}- \hat{\mu }_{0})\Vert \cdot \Vert \Sigma ^{\frac{1}{2}}\hat{S}^{-1}(\hat{\mu }_{1}- \hat{\mu }_{0})\Vert }{\sqrt{(\hat{\mu }_{1}- \hat{\mu }_{0})^{T}\hat{S}^{-1} \Sigma \hat{S}^{-1}(\hat{\mu }_{1}- \hat{\mu }_{0})}} \end{aligned}$$
(4.14)

After cancellation, this simplifies to:

$$\begin{aligned} =2 \Vert \Sigma ^{- \frac{1}{2}}(\mu _{0}- \hat{\mu }_{0})\Vert \end{aligned}$$
(4.15)

and so by Lemma 4.4, (4.8), we have:

$$\begin{aligned} B&\leqslant 2\sqrt{(1+\epsilon )d/N_{0}} \end{aligned}$$
(4.16)

with probability at least \(1-\exp (-\frac{d}{2} (\sqrt{1+\epsilon }-1)^2)\).

To bound the condition number \(\kappa (\hat{S}^{- \frac{1}{2}} \Sigma \hat{S}^{- \frac{1}{2}})\) with high probability we need the following additional lemma:

Lemma 4.5

(Upper bound on \(\kappa (\hat{S}^{- \frac{1}{2}} \Sigma \hat{S}^{- \frac{1}{2}})\)) Under the conditions of Theorem 3.2 we have, \(\forall \epsilon > 0\):

$$\begin{aligned} \kappa (\hat{S}^{- \frac{1}{2}} \Sigma \hat{S}^{- \frac{1}{2}})&= \frac{\lambda _{\max }(\Sigma ^{\frac{1}{2}} \cdot \text {E}_R[R^\mathrm{T} (R\hat{\Sigma }R^\mathrm{T})^{-1} R]\cdot \Sigma ^{\frac{1}{2}})}{\lambda _{\min }(\Sigma ^{\frac{1}{2}} \cdot \text {E}_R[R^\mathrm{T} (R\hat{\Sigma }R^\mathrm{T})^{-1} R]\cdot \Sigma ^{\frac{1}{2}})}\nonumber \\&\leqslant \frac{(\sqrt{N-2}+\sqrt{d}+\epsilon )^2(1+\rho /k \cdot \kappa (\Sigma ))}{(\sqrt{N-2}-\sqrt{k}-\epsilon )^2}\nonumber \\&{=:}\bar{\kappa }(\epsilon ) \end{aligned}$$
(4.17)

with probability at least \(1-2\exp (-\epsilon ^2/2)\).

4.3.4 Proof of Lemma 4.5

To upper bound the condition number \(\kappa (\hat{S}^{- \frac{1}{2}} \Sigma \hat{S}^{- \frac{1}{2}})\) with high probability, we derive high-probability upper and lower bounds on (respectively) the greatest and least eigenvalues of its argument. We will make use of the following result, Eq. (2.3) from Vershynin (2012):

Lemma 4.6

(Singular Values of Wishart Matrices (Vershynin 2012)) Let \(R\) be a \(k \times d\) matrix with i.i.d \(\mathcal {N}(0,1)\) entries. Then for all \(\epsilon > 0\) with probability at least \(1- 2\exp (-\epsilon ^2/2)\) we have:

$$\begin{aligned} \sqrt{d} - \sqrt{k} - \epsilon \leqslant s_{\min }(R) \leqslant s_{\max }(R) \leqslant \sqrt{d} + \sqrt{k} + \epsilon \end{aligned}$$
(4.18)
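A quick empirical sanity check of Lemma 4.6 (purely illustrative; the matrix size, \(\epsilon \) and number of trials are arbitrary) compares the observed violation rate of (4.18) with the stated probability bound:

```python
import numpy as np
rng = np.random.default_rng(5)

k, d, eps, trials = 20, 500, 2.0, 2000
lo, hi = np.sqrt(d) - np.sqrt(k) - eps, np.sqrt(d) + np.sqrt(k) + eps
violations = 0
for _ in range(trials):
    s = np.linalg.svd(rng.standard_normal((k, d)), compute_uv=False)
    violations += (s.min() < lo) or (s.max() > hi)
# Observed rate (typically 0) vs. the bound 2*exp(-eps^2/2), approx. 0.27 here:
print(violations / trials, 2 * np.exp(-eps ** 2 / 2))
```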

4.3.5 Upper-bound on largest eigenvalue

By Jensen’s inequality, and noting that \(\lambda _{\max }(\cdot )\) is a convex function, we have:

$$\begin{aligned}&\lambda _{\max }(\hat{S}^{- \frac{1}{2}} \Sigma \hat{S}^{- \frac{1}{2}})\\&\quad =\lambda _{\max }(\Sigma ^{\frac{1}{2}} \text {E}_R[R^\mathrm{T} (R\hat{\Sigma }R^\mathrm{T})^{-1} R] \Sigma ^{\frac{1}{2}})\\&\quad \leqslant \text {E}_R[\lambda _{\max }(\Sigma ^{\frac{1}{2}} R^\mathrm{T} (R\hat{\Sigma }R^\mathrm{T})^{-1} R\Sigma ^{\frac{1}{2}})]\\&\quad = \text {E}_R[\lambda _{\max }((R\hat{\Sigma }R^\mathrm{T})^{-1} R\Sigma R^\mathrm{T})]\\&\quad = \text {E}_R[\lambda _{\max }( (R\Sigma R^\mathrm{T})^{\frac{1}{2}}(R\hat{\Sigma }R^\mathrm{T})^{-1} (R\Sigma R^\mathrm{T})^{\frac{1}{2}})]\\&\quad =\text {E}_R\left[ \frac{1}{\lambda _{\min }((R\Sigma R^\mathrm{T})^{- \frac{1}{2}}R\hat{\Sigma }R^\mathrm{T}(R\Sigma R^\mathrm{T})^{- \frac{1}{2}})} \right] \\&\quad \leqslant \frac{N}{(\sqrt{N-2}-\sqrt{k}-\epsilon )^2} \end{aligned}$$

with probability at least \(1-\exp (-\epsilon ^2/2)\), for all \(\epsilon >0\). Throughout we used the fact that the non-zero eigenvalues of \(AB\) are the same as the non-zero eigenvalues of \(BA\); in the second to last step we used the fact that for invertible matrices \(A\) we have \(\lambda _{\max }(A) = 1/\lambda _{\min }(A^{-1})\); and in the last step we used that, for any particular full row-rank matrix \(R\), the matrix \(N(R\Sigma R^\mathrm{T})^{- \frac{1}{2}}R\hat{\Sigma }R^\mathrm{T}(R\Sigma R^\mathrm{T})^{- \frac{1}{2}}\) (regarded as a function of the training set, so that \(\hat{\Sigma }\) is the random variable; the factor \(N\) appears because \(\hat{\Sigma }\) is the ML estimate) is distributed as a \(k\)-dimensional Wishart with \(N-2\) degrees of freedom and scale matrix \(I_k\) (e.g. Mardia et al. 1979, Corr. 3.4.1.2), together with the high probability lower-bound for the smallest eigenvalue of such a matrix, Lemma 4.6.

4.3.6 Lower-bound on smallest eigenvalue

Dealing with the smallest eigenvalue is less straightforward. Although \(\lambda _{\min }(\cdot )\) is a concave function, Jensen’s inequality does not help with lower bounding the smallest eigenvalue of the expectation since the matrix \(\hat{\Sigma }\) in the argument of this expectation is singular. We therefore take a different route and start by rewriting as follows:

$$\begin{aligned}&\lambda _{\min }(\Sigma ^{\frac{1}{2}} \text {E}_R[R^\mathrm{T} (R\hat{\Sigma }R^\mathrm{T})^{-1} R] \Sigma ^{\frac{1}{2}})\nonumber \\&\quad = \frac{1}{\lambda _{\max }(\Sigma ^{- \frac{1}{2}} (\text {E}_R[R^\mathrm{T} (R\hat{\Sigma }R^\mathrm{T})^{-1} R])^{-1} \Sigma ^{- \frac{1}{2}})}\nonumber \\&\quad = \frac{1}{\lambda _{\max }(\Sigma ^{- \frac{1}{2}} \{ \hat{\Sigma }+ (\text {E}_R[R^\mathrm{T} (R\hat{\Sigma }R^\mathrm{T})^{-1} R])^{-1}-\hat{\Sigma }\}\Sigma ^{- \frac{1}{2}})} \end{aligned}$$
(4.19)

Now, using Weyl’s inequality, and the SVD decomposition \(\hat{\Sigma }=\hat{U}\hat{\Lambda }\hat{U}^\mathrm{T}\) combined with Lemma 4.1, the denominator in (4.19) is upper-bounded by:

$$\begin{aligned}&\lambda _{\max }(\Sigma ^{- \frac{1}{2}} \hat{\Sigma }\Sigma ^{- \frac{1}{2}}) + \lambda _{\max }(\Sigma ^{- \frac{1}{2}}\hat{U}\left( (\text {E}_R[R^\mathrm{T} (R\hat{\Lambda } R^\mathrm{T})^{-1} R])^{-1}-\hat{\Lambda }\right) \hat{U}^\mathrm{T}\Sigma ^{- \frac{1}{2}})\nonumber \\&\quad \leqslant \lambda _{\max }(\Sigma ^{- \frac{1}{2}} \hat{\Sigma }\Sigma ^{- \frac{1}{2}}) + \lambda _{\max }( (\text {E}_R[R^\mathrm{T} (R\hat{\Lambda } R^\mathrm{T})^{-1} R])^{-1}-\hat{\Lambda })/\lambda _{\min }(\Sigma ) \end{aligned}$$
(4.20)

Now observe from Lemma 4.2 that the matrix \((\text {E}_R[R^\mathrm{T} (R\hat{\Lambda } R^\mathrm{T})^{-1} R])^{-1}-\hat{\Lambda }\) is diagonal and, from our analysis in Sect. 4.1, it has the upper \(\rho \) diagonal entries in the interval:

$$\begin{aligned} \left[ \left( \frac{\rho }{k}-1\right) \lambda _{\min \ne 0}(\hat{\Lambda }), \left( \frac{\rho }{k}-1\right) \lambda _{\max }(\hat{\Lambda })\right] \end{aligned}$$

and the lower \(d-\rho \) diagonal entries in the interval:

$$\begin{aligned} \left[ \frac{\rho -k-1}{k}\lambda _{\min \ne 0}(\hat{\Lambda }), \frac{\rho -k-1}{k}\lambda _{\max }(\hat{\Lambda })\right] \end{aligned}$$

Hence, \(\lambda _{\max }( (\text {E}_R[R^\mathrm{T} (R\hat{\Lambda } R^\mathrm{T})^{-1} R])^{-1}-\hat{\Lambda })\leqslant \frac{\rho }{k}\lambda _{\max }(\hat{\Lambda })\) and so the lower-bounding of (4.19) continues as:

$$\begin{aligned} \geqslant \frac{1}{\lambda _{\max }(\Sigma ^{- \frac{1}{2}} \hat{\Sigma }\Sigma ^{- \frac{1}{2}}) + \frac{\rho }{k}\frac{\lambda _{\max }(\hat{\Lambda })}{\lambda _{\min }(\Sigma )}} \end{aligned}$$
(4.21)

Now observe that \(N\Sigma ^{- \frac{1}{2}} \hat{\Sigma }\Sigma ^{- \frac{1}{2}}\) is a \(d\)-dimensional standard Wishart with \(N-2\) degrees of freedom and scale matrix \(I_d\) (e.g. Mardia et al. 1979, Corr. 3.4.1.2), the factor \(N\) again accounting for the ML normalization of \(\hat{\Sigma }\), and using the upper bound in Lemma 4.6 for the largest eigenvalue of standard Wishart matrices we get (4.21) further lower-bounded as

$$\begin{aligned} \geqslant \frac{1}{(\sqrt{N-2}+\sqrt{d}+\epsilon )^2/N + \frac{\rho }{k}\frac{\lambda _{\max }(\hat{\Lambda })}{\lambda _{\min }(\Sigma )}} \end{aligned}$$
(4.22)

with probability at least \(1-\exp (-\epsilon ^2/2)\).

Finally, we bound \(\lambda _{\max }(\hat{\Lambda })\) as:

$$\begin{aligned} \lambda _{\max }(\hat{\Lambda })&= \lambda _{\max }(\hat{\Sigma })=\lambda _{\max }(\Sigma \Sigma ^{-1}\hat{\Sigma })\\&\leqslant \lambda _{\max }(\Sigma )\lambda _{\max }(\Sigma ^{-1}\hat{\Sigma }) = \lambda _{\max }(\Sigma )\lambda _{\max }(\Sigma ^{-\frac{1}{2}}\hat{\Sigma }\Sigma ^{-\frac{1}{2}})\\&\leqslant \lambda _{\max }(\Sigma )(\sqrt{N-2}+\sqrt{d}+\epsilon )^2/N \end{aligned}$$

To complete the bound on the condition number we apply the union bound and combine the eigenvalue estimates to obtain, after simple algebra, Lemma 4.5. \(\square \)

Returning to the proof of Theorem 3.2: substituting into Lemma 4.3 the high probability bounds for \(A\) and \(B\), rearranging, setting each of the failure probabilities to \(\delta /5\) so that the overall probability of failure remains below \(\delta \), and then solving for \(\epsilon \), we obtain Theorem 3.2 after some algebra. For completeness we give these last few straightforward details in Appendix 2. \(\square \)

4.4 Proof of Theorem 3.3

We start from the exact form of the error of FLD in the data space with a fixed training set. Using a similar approach to that employed in proving Lemma 4.3, this is easily shown to be:

$$\begin{aligned} \Pr (\hat{h}_{+}(x_{q}) \ne y_q)= \sum _{y=0}^{1} \pi _y \Phi \left( - \frac{1}{2} \frac{ (\hat{\mu }_{\lnot y} - \hat{\mu }_y)^\mathrm{T} \hat{\Sigma }^{+} (\hat{\mu }_{0}+ \hat{\mu }_{1}- 2\mu _{y})}{\sqrt{(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} \hat{\Sigma }^{+} \Sigma \hat{\Sigma }^{+}(\hat{\mu }_{1}- \hat{\mu }_{0})}}\right) \end{aligned}$$

where \(\hat{\Sigma }^{+}\) is the pseudo-inverse of the maximum likelihood covariance estimate.

Make the rank \(\rho \) SVD decomposition \(\hat{\Sigma } = \hat{\underline{U}}\hat{\underline{\Lambda }}\hat{\underline{U}}^\mathrm{T}\), where \(\hat{\underline{U}}\) is the \(d\times \rho \) matrix of eigenvectors associated with the non-zero eigenvalues, \(\hat{\underline{U}}^\mathrm{T}\hat{\underline{U}}=I_{\rho }\), and as before \(\hat{\underline{\Lambda }}\) is the diagonal \(\rho \times \rho \) matrix of non-zero eigenvalues. Then we have:

$$\begin{aligned}&\frac{(\hat{\mu }_{1}+ \hat{\mu }_{0}- 2\mu _{0})^\mathrm{T} \hat{\underline{U}}\hat{\underline{\Lambda }}^{-1}\hat{\underline{U}}^\mathrm{T}(\hat{\mu }_{1}- \hat{\mu }_{0})}{\sqrt{(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} \hat{\underline{U}}\hat{\underline{\Lambda }}^{-1}\hat{\underline{U}}^\mathrm{T} \Sigma \hat{\underline{U}}\hat{\underline{\Lambda }}^{-1}\hat{\underline{U}}^\mathrm{T}(\hat{\mu }_{1}- \hat{\mu }_{0})}}\\&\quad \leqslant \frac{(\hat{\mu }_{1}+ \hat{\mu }_{0}- 2\mu _{0})^\mathrm{T} \hat{\underline{U}}\hat{\underline{\Lambda }}^{-1}\hat{\underline{U}}^\mathrm{T}(\hat{\mu }_{1}- \hat{\mu }_{0})}{\sqrt{\lambda _{\min }(\Sigma )}\sqrt{(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} \hat{\underline{U}}\hat{\underline{\Lambda }}^{-2}\hat{\underline{U}}^\mathrm{T}(\hat{\mu }_{1}- \hat{\mu }_{0})}}\\&\quad \leqslant \frac{\Vert \hat{\underline{U}}^\mathrm{T}(\hat{\mu }_{1}+ \hat{\mu }_{0}- 2\mu _{0})\Vert \cdot \Vert \hat{\underline{\Lambda }}^{-1}\hat{\underline{U}}^\mathrm{T}(\hat{\mu }_{1}- \hat{\mu }_{0})\Vert }{\sqrt{\lambda _{\min }(\Sigma )}\Vert \hat{\underline{\Lambda }}^{-1}\hat{\underline{U}}^T(\hat{\mu }_{1}- \hat{\mu }_{0})\Vert }\\&\quad =\frac{\Vert \hat{\underline{U}}^\mathrm{T}(\hat{\mu }_{1}+ \hat{\mu }_{0}- 2\mu _{0})\Vert }{\sqrt{\lambda _{\min }(\Sigma )}} \end{aligned}$$

where we used minorization by the Rayleigh quotient and the Cauchy-Schwarz inequality. We will use the well-known fact that \(\hat{\Sigma }\) and \(\hat{\mu }_{1}+ \hat{\mu }_{0}\) are independent (Mardia et al. 1979). Observe that \(\hat{\underline{U}}^\mathrm{T}\) is a random matrix with orthonormal rows representing the eigenvectors of the sample covariance of a Gaussian sample. Using the rotational invariance of the multivariate Gaussian distribution, by the Johnson-Lindenstrauss lemma (JLL) this matrix acts, with high probability, as an approximate isometry that projects a \(d\)-dimensional vector onto a random subspace of dimension \(\rho \) (Dasgupta and Gupta 2002). Conditioning on \(\hat{\mu }_{1}+ \hat{\mu }_{0}\) to hold this quantity fixed, and using the independence of \(\hat{\underline{U}}\) and \(\hat{\mu }_{1}+ \hat{\mu }_{0}\) (Tulino and Verdú 2004), we have with probability at least \(1-\exp (-N\epsilon ^2/8)\) that:

$$\begin{aligned} \frac{\Vert \hat{\underline{U}}^\mathrm{T}(\hat{\mu }_{1}+ \hat{\mu }_{0}- 2\mu _{0})\Vert }{\sqrt{\lambda _{\min }(\Sigma )}} \leqslant \sqrt{1+\epsilon }\sqrt{\frac{\rho }{d}}\frac{ \Vert \hat{\mu }_{1}+ \hat{\mu }_{0}-2\mu _0\Vert }{\sqrt{\lambda _{\min }(\Sigma )}} \end{aligned}$$

Further, applying Lemma 4.4, Eq. (4.8), to the norm on the r.h.s. and substituting into the generalization error expression, we have the following lower bound:

$$\begin{aligned} \Phi \left( -\frac{1}{2} \sqrt{(1+\epsilon _1)(1+\epsilon _2)}\sqrt{\frac{\rho }{d} \frac{\Vert \mu _1-\mu _0\Vert ^2 +\text {Tr}(\Sigma )\frac{N}{N_0N_1}}{\lambda _{\min }(\Sigma )}}\right) \end{aligned}$$

with probability at least \(1-[\exp (-N\epsilon _1^2/8)+\exp (-\frac{\text {Tr}(\Sigma )+\Vert \mu _1-\mu _0\Vert ^2\frac{N_0N_1}{N}}{2\lambda _{\max }(\Sigma )} (\sqrt{1+\epsilon _2}-1)^2)]\).

Setting both of these exponential risk probabilities to \(\delta /2\) and solving for \(\epsilon _1\) and \(\epsilon _2\), we obtain the lower bound on the generalization error of pseudoinverted FLD, Theorem 3.3. \(\square \)

5 Remarks

5.1 On the effect of eigenvector misestimation

We have seen that the eigenvector estimates are not affected by the regularization scheme implemented by our converged ensemble. One may then wonder, since we are dealing with small sample problems, how misestimation of the eigenvectors of \(\Sigma \) affects the classification performance.

It is known that the quality of eigenvector estimates depends on the eigengaps (differences between ordered eigenvalues) of \(\Sigma \) as well as on the data dimension and number of training examples (Vu and Lei 2012; Johnstone and Lu 2009; Paul and Johnstone 2012; Vu 2011). Although the sensitivity of eigenvectors to perturbations of matrix entries is well known, the following simple but powerful example from Horn and Johnson (1985) shows clearly both the problem and the importance of eigenvalue separation. Let:

$$\begin{aligned} \Sigma = \left[ \begin{array}{c@{\quad }c} 1- \epsilon &{} 0 \\ 0 &{} 1+ \epsilon \end{array}\right] \end{aligned}$$

so that \(\Sigma \) has eigenvalues \(1 \pm \epsilon \) and eigenvectors \((1,0)^\mathrm{T}, (0,1)^\mathrm{T}\). On the other hand consider the following perturbed matrix (where the perturbation could arise from, say, estimation error or noise):

$$\begin{aligned} \Sigma + E = \left[ \begin{array}{c@{\quad }c} 1- \epsilon &{} 0 \\ 0 &{} 1+ \epsilon \end{array}\right] + \left[ \begin{array}{c@{\quad }c} \epsilon &{} \epsilon \\ \epsilon &{} - \epsilon \end{array}\right] = \left[ \begin{array}{c@{\quad }c} 1 &{} \epsilon \\ \epsilon &{} 1 \end{array}\right] \end{aligned}$$

This matrix also has eigenvalues \(1 \pm \epsilon \), but has eigenvectors \(\frac{1}{\sqrt{2}}(1,1)^\mathrm{T}, \frac{1}{\sqrt{2}}(1,-1)^\mathrm{T}\), regardless of how small \(\epsilon \) is.

Applying this in the small-sample setting we consider here, if the eigengaps of \(\Sigma \) are too small we can expect bad estimates of its eigenvectors. However, we have seen in Theorem 3.2 that the generalization error of the ensemble can be bounded above by an expression that depends on covariance misestimation only through the condition number of \(\hat{S}^{-1}\Sigma \equiv (\Sigma +E)^{-1}\Sigma \), so even a large misestimation of the eigenvectors need not have a large effect on the classification performance: if all the eigengaps are small, so that all the eigenvalues of \(\Sigma \) are similar, then poor estimates of the eigenvectors will not affect this condition number too much. Conversely, following Johnstone and Lu (2009), if the eigengaps are large (i.e. we have a very elliptical covariance) then better eigenvector estimates are likely from the same sample size, and the condition number of \(\hat{S}^{-1}\Sigma \) should again not grow too much as a result of any eigenvector misestimation. In the case of the toy example above, the eigenvalues of \(\Sigma (\Sigma +E)^{-1}\) are \(\frac{1\pm \epsilon \sqrt{2-\epsilon ^2}}{1-\epsilon ^2}\), so its condition number is \(\frac{1+\epsilon \sqrt{2-\epsilon ^2}}{1-\epsilon \sqrt{2-\epsilon ^2}}\). For small \(\epsilon \) this remains fairly close to one, meaning that the eigenvector misestimation has only a negligible effect on the classification performance.
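
The closed form above is easy to verify numerically. The following is a minimal standalone sketch (our own illustrative Python/numpy code, not part of the paper's analysis) confirming that the eigenvectors of \(\Sigma +E\) are maximally rotated while the condition number of \(\Sigma (\Sigma +E)^{-1}\) stays close to one for small \(\epsilon \):

```python
# Numerical check of the 2x2 toy example: the perturbation E rotates the
# eigenvectors by 45 degrees, yet the condition number (ratio of extreme
# eigenvalues) of Sigma @ inv(Sigma + E) stays near 1 for small epsilon.
import numpy as np

for eps in [0.1, 0.01, 0.001]:
    Sigma = np.diag([1.0 - eps, 1.0 + eps])
    E = np.array([[eps, eps], [eps, -eps]])
    M = Sigma @ np.linalg.inv(Sigma + E)
    eigvals = np.linalg.eigvals(M).real      # real and positive for this example
    cond = eigvals.max() / eigvals.min()
    closed_form = (1 + eps * np.sqrt(2 - eps**2)) / (1 - eps * np.sqrt(2 - eps**2))
    print(f"eps={eps:<6} condition number {cond:.6f}  closed form {closed_form:.6f}")
```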

5.2 On the effect of \(k\)

It is interesting to examine the condition number bound in (4.17) in isolation, and to observe the trade-off in the projection dimension \(k\), which describes very well its role as a regularization parameter in the context of our RP-FLD ensemble. To make the numerator smaller, \(k\) needs to be large, while to make the denominator larger it needs to be small. We also see natural behaviour as a function of \(N\), \(d\), and the conditioning of the true covariance. From Eqs. (4.13) and (4.16) we see that the condition number bounded by Eq. (4.17) is the only term in the generalization error bound affected by the choice of \(k\), so we can also partly answer the question left open in Marzetta et al. (2011) about how the optimal \(k\) depends on the problem characteristics, from the perspective of classification performance, by reading off the most influential dependencies of the optimal \(k\) on these characteristics. The first term in the numerator of (4.17) contains \(d\) but not \(k\), while the remaining terms contain \(k\) but not \(d\), so we infer that in the setting \(k < \rho -1 <d\) the optimal choice of \(k\) is not affected by the dimensionality \(d\). Noting that for \(N < d\) and Gaussian class-conditionals we have \(\rho = N-2\) with probability \(1\), we see also that for small \(N\) or \(\rho \), as well as for small \(\kappa (\Sigma )\), this condition number is minimized by a smaller \(k\) (meaning a stronger regularizer). Conversely, when \(N\), \(\rho \), or \(\kappa (\Sigma )\) is large then \(k\) should also be large to minimize the bound.

It is also interesting to note that the regularization scheme implemented by our ensemble has a particularly pleasing form. Shrinkage regularization is the optimal regularizer (w.r.t. the Frobenius norm) in the setting where there are sufficient samples to make a full-rank estimate of the covariance matrix (Ledoit and Wolf 2004), and therefore one would also expect it to be a good choice for regularization in the range space of \(\hat{\Sigma }\). Furthermore, ridge regularization in the null space of \(\hat{\Sigma }\) can also be considered optimal in the following sense: its effect is to ensure that any query point lying entirely in the null space of \(\hat{\Sigma }\) is assigned the maximum likelihood estimate of its class label (i.e. the label of the class with the nearest mean).

5.3 Bias of the ensemble

By letting \(N \rightarrow \infty \) (and so \(\rho \rightarrow d\)) while enforcing \(k < d = \rho \) we see that our ensemble implements a biased estimate of the true covariance matrix \(\Sigma \). In particular, plugging in the true parameters \(\mu _y\) and \(\Sigma \) in the exact error (4.5) we find that the Bayes’ risk for FLD in the data space is \(\sum _{y=0}^{1} \pi _y \Phi \left( - \frac{1}{2}\Vert \Sigma ^{-\frac{1}{2}}(\mu _{1}- \mu _{0})\Vert \right) \) but the expression in Theorem 3.2 converges to:

$$\begin{aligned} \sum _{y=0}^{1} \pi _y \Phi \left( - g\left( 1 + \frac{d}{k} \kappa (\Sigma )\right) \Vert \Sigma ^{-\frac{1}{2}}(\mu _{1}- \mu _{0})\Vert \right) \end{aligned}$$

where we recall that \(g(1) = \frac{1}{2}\). When \(N < d\), however, we see that the generalization error of our RP-FLD ensemble remains upper bounded for any training sample containing at least two points for each class, whereas our Theorem 3.3 (and the asymptotic results in Bickel and Levina 2004) demonstrates that this is not the case in the data space setting if we regularize by pseudoinverting.

Note that plugging the expectation examined above into the classifier ensemble is equivalent to taking an ensemble with infinitely many members. Therefore, for any choice of \(k < \rho -1\), although we can underfit (with a poor choice of \(k\)), the bounded loss of our ensemble implies that we cannot overfit any worse than the pseudo-inverse FLD data space classifier, regardless of the ensemble size, since we do not learn any combination weights from the data. This is quite unlike adaptive ensemble approaches such as AdaBoost, for which it is well known that increasing the ensemble size can indeed lead to overfitting. Furthermore, we shall see from the experiments in Sect. 6 that this guarantee relative to the performance of pseudo-inversion appears to be a conservative prediction of the performance achievable by our randomly-projected ensemble.

5.4 Time complexities for the RP-FLD ensemble

We noted in the Introduction that our ensemble, although simple to implement, is also fast. Here we briefly compare the time complexity of our ensemble approach (for a finite ensemble) with that for regularized FLD learnt in the data space.

The time complexity of training a regularized FLD in the data space is dominated by the cost of inverting the estimated covariance matrix \(\hat{\Sigma }\) (Duda et al. 2000), which is \(\mathcal {O}(d^3)\), or \(\mathcal {O}(d^{\log _{2} 7}) \simeq \mathcal {O}(d^{2.807})\) using Strassen's algorithm (Golub and Van Loan 2012). On the other hand, obtaining a full-rank inverse covariance estimate in the data space using our ensemble requires only \(M \in \mathcal {O}\left( \lceil d/k \rceil \right) \) members, and our experimental results in Sect. 6 suggest that \(M\) of this order is indeed enough to obtain good classification performance. Using this, and taking into account the \(M\) multiplications by \(k\times d\) matrices required to construct the randomly-projected training sets, the time complexity of training our algorithm is \(\mathcal {O}(\frac{d}{k}(Nkd+k^3)) = \mathcal {O}(Nd^{2}+k^{2}d)\) overall, where the \(k^3\) term comes from inverting the full-rank covariance matrix estimate in the projected space. Since we have \(k < \rho -1 < N \ll d\), this is generally considerably faster than learning regularized FLD in the original data space; furthermore, by using sparse random projection matrices such as those described in Achlioptas (2003), Ailon and Chazelle (2006), or Matoušek (2008), one can considerably improve the constant terms hidden by the \(\mathcal {O}\).
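
To illustrate where these terms come from, the following minimal training sketch (our own illustrative Python/numpy code, not the authors' implementation; all function and variable names are ours) shows that each ensemble member costs one \(\mathcal {O}(Nkd)\) projection of the centred training set plus one \(\mathcal {O}(k^3)\) inversion of the \(k\times k\) projected covariance:

```python
# Minimal training sketch of the RP-FLD ensemble: each of the M members pairs a
# Gaussian random projection R_i (k x d) with the inverse of the k x k projected
# covariance R_i Sigma_hat R_i^T.
import numpy as np

def train_rp_fld_ensemble(X, y, k, M, seed=0):
    """X: (N, d) training data; y: (N,) labels in {0, 1}; k < rho - 1 assumed."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Xc = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])   # class-centred data
    members = []
    for _ in range(M):
        R = rng.standard_normal((k, d))          # random projection matrix
        Z = Xc @ R.T                             # project the training set: O(Nkd)
        S_k = Z.T @ Z / N                        # equals R Sigma_hat R^T, a k x k matrix
        members.append((R, np.linalg.inv(S_k)))  # O(k^3); invertible w.h.p. since k < rho - 1
    return members, mu0, mu1
```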

For classification on a single core, one can avoid randomly projecting the query point \(M\) times by averaging the individual classifiers comprising the ensemble. That is, by bracketing the argument to the ensemble decision rule as:

$$\begin{aligned} \left( (\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} \frac{1}{M}\sum _{i=1}^M R_{i}^\mathrm{T} \left( R_{i} \hat{\Sigma }R_{i}^\mathrm{T} \right) ^{-1} R_{i}\right) \left( x_{q}- \frac{\hat{\mu }_{1}+ \hat{\mu }_{0}}{2}\right) \end{aligned}$$

we obtain a single linear classifier \(\hat{h}(x_{q}) = w x_{q} + b\), with row vector \(w \in \mathbb {R}^{1\times d}\) and offset \(b \in \mathbb {R}\), where:

$$\begin{aligned} w {:=} \frac{1}{M}\sum _{i=1}^M(\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} R_{i}^\mathrm{T} \left( R_{i} \hat{\Sigma }R_{i}^\mathrm{T} \right) ^{-1} R_{i} = \frac{1}{M}\sum _{i=1}^M w_i \end{aligned}$$

and

$$\begin{aligned} b {:=} -\frac{1}{M}\sum _{i=1}^M (\hat{\mu }_{1}- \hat{\mu }_{0})^\mathrm{T} R_{i}^\mathrm{T} \left( R_{i} \hat{\Sigma }R_{i}^\mathrm{T} \right) ^{-1} R_{i}\left( \frac{\hat{\mu }_{1}+ \hat{\mu }_{0}}{2} \right) = \frac{1}{M}\sum _{i=1}^M b_i \end{aligned}$$

Classification of new points using our ensemble then has the same time complexity as classification using the data space FLD, namely \(\mathcal {O}(d)\).
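
Continuing the illustrative sketch above (again our own hypothetical code, not the authors' implementation), the averaging into a single \((w, b)\) pair and the resulting \(\mathcal {O}(d)\) classification of a query point might look as follows:

```python
# Collapse the ensemble into a single linear classifier (w, b), matching the two
# displayed equations above, so that each query costs a single O(d) dot product.
import numpy as np

def collapse_ensemble(members, mu0, mu1):
    d = members[0][0].shape[1]
    w = np.zeros(d)
    for R, S_inv in members:              # w_i = (mu1 - mu0)^T R^T (R Sigma_hat R^T)^{-1} R
        w += (mu1 - mu0) @ R.T @ S_inv @ R
    w /= len(members)
    b = -w @ (mu0 + mu1) / 2.0            # equals the average of the b_i
    return w, b

def predict(w, b, x_q):
    # Threshold the linear score at zero, as in the displayed decision rule
    # (equal class priors assumed; a log-prior offset could be added otherwise).
    return int(w @ x_q + b > 0)
```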

6 Experiments

We now present experimental results which show that our ensemble approach is competitive with the state of the art in terms of prediction performance. We do not, of course, claim that the choice of FLD as a classifier is optimal for these data sets; rather, we demonstrate that the various practical advantages of our RP-FLD approach listed in the Introduction and Sect. 5.4, and most importantly its analytical tractability, do not come at a cost in terms of prediction performance.

6.1 Datasets

We used six publicly available high dimensional datasets: Five from the bioinformatics domain (colon, two versions of leukaemia, prostate, and duke breast cancer), and one drug discovery dataset from the 2003 NIPS Feature Selection Challenge (dorothea). The characteristics of these datasets are described in Table 1. Our smallest datasets (colon and leukaemia) were the highest dimensional ones used in the empirical RP-classifier study of Fradkin and Madigan (2003) (although that paper focuses on a single randomly projected classifier vs. the data space equivalent). The 7,129 dimensional leukaemia-large was also the dataset of choice in evaluating a technique for ultrahigh dimensional data in Fan and Lv (2008). The 100,000 dimensional dorothea dataset is currently the highest dimensional publicly available dataset in the UCI repository from a problem domain where \(N \ll d\) is the norm.

Table 1 Datasets

6.2 Protocol

We standardized each data set so that each feature has mean 0 and variance 1. For dorothea we removed features with zero variance; there were 8,402 such features, leaving a working dimensionality of 91,598. We did not carry out any further feature selection or filtering, to avoid any external effects in our comparison. For the first five datasets we ran experiments on 100 independent splits; in each split we took 12 points for testing and used the remainder for training. For dorothea we used the same data split as in the NIPS challenge, taking the provided 800-point training set for training and the 350-point validation set for testing. We ran 10 instances for each combination of projection dimension, projection method, and ensemble size, that is, 1,120 experiments in total.

For our data space experiments on colon and leukaemia we used FLD with ridge regularization and fitted the regularization parameter using 5-fold cross-validation independently on each training set, following Cawley and Talbot (2010), with the search carried out over the set \(\{2^{-11}, 2^{-10},\ldots ,2\}\). However, on these data this provided no statistically significant improvement over employing a diagonal covariance in the data space, most likely because of the data scarcity. Therefore for the remaining three bioinformatics datasets (which are even higher dimensional) we used diagonal FLD in the data space. Indeed, since diagonal FLD is in use for gene array data sets (Dudoit et al. 2002) despite the features being known to be correlated (the diagonality constraint acting as a form of regularization), one useful benefit of our ensemble is that such a constraint is no longer necessary.
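
For concreteness, a sketch of this kind of cross-validated fit is given below (our own illustrative Python/numpy code, not the authors' scripts; we assume the common \(\hat{\Sigma }+\alpha I\) parameterization of ridge regularization, which the text does not specify, and the helper names are ours):

```python
# Illustrative sketch of fitting the ridge parameter of the data-space FLD
# baseline by 5-fold cross-validation over the grid {2^-11, ..., 2}.
import numpy as np

def fld_ridge_fit(X, y, alpha):
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Xc = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])
    Sigma_hat = Xc.T @ Xc / X.shape[0]
    w = np.linalg.solve(Sigma_hat + alpha * np.eye(X.shape[1]), mu1 - mu0)
    return w, -w @ (mu0 + mu1) / 2.0

def cv_error(X, y, alpha, n_folds=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        w, b = fld_ridge_fit(X[train], y[train], alpha)
        errors.append(np.mean((X[fold] @ w + b > 0).astype(int) != y[fold]))
    return np.mean(errors)

# Grid search over the set {2^-11, 2^-10, ..., 2} described in the text:
# best_alpha = min((2.0 ** p for p in range(-11, 2)), key=lambda a: cv_error(X, y, a))
```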

To satisfy ourselves that building on FLD was a reasonable choice of classifier, we also ran experiments in the data space using a classical SVM (the MATLAB implementation of Cawley (2000) on the first five datasets, and the 'liblinear' toolbox (Fan et al. 2008), which is specialised for very large datasets, for dorothea) and an \(\ell _1\)-regularized SVM (Fan et al. 2008) with linear kernel. In all SVMs the \(C\) parameter was fitted by 5-fold cross-validation as above, with the search carried out over the set \(\{2^{-10},2^{-9},\ldots ,2^{10}\}\).

For the dorothea dataset it was impractical to construct the full FLD in the data space, since the covariance matrix would not fit in memory on the authors' machines. We linearized diagonal FLD to get around this issue, but the performance of diagonal FLD was extremely poor (accuracy of 0.2686). Since the classical linear SVM is also known to perform poorly on this dataset (Guyon et al. 2004, 2006), for dorothea we baselined against Bernoulli Naïve Bayes (without preprocessing the binary data), following the advice given to students by the challenge organiser in Guyon et al. (2006). We also ran the \(\ell _1\)-regularised SVM (Fan et al. 2008), which performed well on this data set.

For all experiments carried out in the projected space, the randomly projected base learners are FLDs with full covariance and no regularization (since we choose \(k < \rho -1\) and so the projected sample covariances are invertible).

6.3 Results

For the five bioinformatics datasets, in each case we compare the performance of the RP ensembles with (regularized) FLD in the data space, vanilla and \(\ell _1\)-regularized SVM, and (as suggested by one of the anonymous referees) with an ensemble of Random Subspace (RS) FLD classifiers. For dorothea we also compare our RP-FLD ensemble with Bernoulli Naïve Bayes.

Summary results for the rule-of-thumb choice \(k=\rho /2\) are listed in Table 2, together with end-to-end running times for each data split. The bioinformatics experiments were run on a Linux machine with an Intel \(\circledR \) \(\hbox {Core}^{\mathrm{TM}}\) i5-3570 CPU @ 3.40 GHz and 7 GB memory. The dorothea experiments were run on a Microsoft \(\circledR \) Windows \(7^{\mathrm{TM}}\) machine with the same CPU specification and 8 GB memory; because of the extremely high dimensionality of the dorothea dataset, we refactored our code for these experiments to avoid randomly projecting the test set, using the approach described in Sect. 5.4, and therefore the running times for dorothea are not directly comparable with those for the biomedical datasets. We note, however, that the main computational overhead for the dorothea dataset comes from the random preprocessing of the data (either random projection or random subspace), so for these experiments we give the running times for the preprocessing step, which still give a good indication of the overall running time.

Table 2 Mean error rates \(\pm \) 1 standard error, and CPU times estimated from 100 independent splits (10 instances of the fixed split for dorothea) for random projection ensembles with 100 (RP-Ens M=100) or 1000 (RP-Ens M=1000) members, and competing methods (see text for details). For both RP-ensembles and RS-ensembles \(k=\rho /2\) was used

We see from Table 2 that, with M = 1,000 members in our ensemble, the SVM outperforms us on two datasets (colon and duke), we outperform it on two datasets (leukaemia-large and dorothea), and no statistically significant difference is found on the remaining two datasets. On one dataset (dorothea) the \(\ell _1\)-regularised SVM does better than us; we outperform it on three data sets (colon, leukaemia, and leukaemia-large), and there is no statistically significant difference on the remaining two data sets. We outperform random subspaces with 1,000 ensemble members on two datasets (duke and prostate), and no statistically significant difference is found on the remaining four datasets.

The picture is not much different for our method with M = 100 ensemble members, except that there is no significant difference from the \(\ell _1\)-regularised SVM on leukaemia-large, and random subspaces with 100 members beats us on colon and leukaemia while showing no difference on duke. In fact it turns out that our ensemble with 1,000 members differs from that with 100 members on only one data set (duke).

The random subspace FLD ensemble wins over our RP-FLD ensemble with respect to computation time, although this difference is of course confined to the training time, since the time complexity for classification is still \(\mathcal {O}(d)\). Interestingly, the overall error performance of the random subspace ensembles is only slightly behind that of the random projection ensembles. Since trading off a small amount of accuracy for a speed-up may be desirable in some applications, an interesting research question is whether theoretical guarantees similar to those we obtained for our RP-FLD ensemble can be proved in the random subspace case. Nevertheless, the computation time of our RP-FLD ensemble is comparable with that of the sophisticated liblinear implementation of the \(\ell _1\)-regularised SVM, as is its performance. In fact, on three of the six data sets tested, none of the competing methods outperformed our RP-FLD ensemble at the 95 % confidence level.

In Fig. 1 we plot the results for the regularized data space FLD (Bernoulli Naïve Bayes for dorothea), for a single RP-FLD, and for ensembles of 10, 100, and 3,000 RP-FLD classifiers (1,000 for dorothea). We see in all cases that our theoretical analysis is well supported: the RP-FLD ensemble outperforms traditional FLD for a range of choices of \(k\) and \(M\), and the rule-of-thumb choice \(k=\rho /2\) is not far from the optimal performance (on these data sets \(\rho =N-2\)). It is interesting to see that, despite the statistically insignificant difference in performance between full and diagonal covariance models that we found for the two lower-dimensional data sets in the data space, for the three higher-dimensional data sets (where we used a diagonality constraint for computational tractability) the gap in generalization performance between the data space FLD and the competing approaches is very large, whereas the gap between the RP-FLD ensembles and the competing approaches is small. Empirically we see, as we might reasonably expect, that capturing the feature covariances via our ensemble approach produces better classification results than working in the data space with a diagonal covariance model.

Fig. 1 Effect of \(k\). Plots show test error rate versus \(k\); error bars mark 1 standard error estimated from 100 runs (10 repeated runs on the same split for dorothea). In these experiments we used Gaussian random matrices with i.i.d. \(\mathcal {N}(0,1)\) entries. In each case the projection dimension runs along the \(x\)-axis from 1 through to \(\rho -2\) (Color figure online)

We ran further experiments on the colon and leukaemia-large data sets to compare the performance of the fast random projections from Achlioptas (2003) with that of Gaussian random projection matrices, and to compare our decision rule with majority vote. Quite interestingly, the picture is very similar, and we find no statistically significant difference in the empirical results compared with the ensemble that we have presented and analyzed in detail here. The results of these experiments are plotted in Fig. 2. The performance match between the different choices of random matrix is unsurprising, but the agreement with majority vote is both striking and rather unexpected; we do not yet have an explanation for this behaviour, although it does not appear to arise from the unsigned confidences of the individual ensemble members being concentrated around a particular value.

Fig. 2 Effect of different random projection matrices and comparison with majority vote. Left-hand column shows results on the colon dataset, right-hand column shows results on leukaemia-large. Row 1 RP Majority Vote using Gaussian random matrices with i.i.d. \(\mathcal {N}(0,1)\) entries; Row 2 RP Averaging using Gaussian random matrices with i.i.d. \(\mathcal {N}(0,1)\) entries; Row 3 RP Averaging using \(\pm 1\) random matrices with i.i.d. entries; Row 4 RP Averaging using the sparse \(\{-1,0,+1\}\) random matrices from Achlioptas (2003) (Color figure online)

7 Discussion and future work

We considered a randomly projected (RP) ensemble of FLD classifiers and gave theory which, for a fixed training set, explicitly links this ensemble classifier to its data space analogue. We have shown that the RP ensemble implements an implicit regularization of the corresponding FLD classifier in the data space. We demonstrated experimentally that the ensemble can recover or exceed the performance of a carefully-fitted ridge-regularized data space equivalent, but with generally lower computational cost. Our theory guarantees that, for most choices of projection dimension \(k\), the error of a large ensemble remains bounded even when the number of training examples is far lower than the number of data dimensions, and we gained a good understanding of the effect of our discrete regularization parameter \(k\). In particular, we argued that the regularization parameter \(k\) allows us to finesse the known issue of poor eigenvector estimates in this setting. We also demonstrated empirically that with an appropriate choice of \(k\) we can obtain good generalization performance even with few training examples, and that the rule-of-thumb choice \(k=\rho /2\) appears to work well.

We showed that, for classification, the optimal choice of \(k\) depends on the true data parameters (which are unknown), thereby partly answering, in the negative, the question posed in Marzetta et al. (2011) concerning whether a simple formula for the optimal \(k\) exists.

It would be interesting to extend this work to obtain similar guarantees for ensembles of generic randomly-projected linear classifiers in convex combination, and for an ensemble of random subspace FLDs; we are working on ways to do this. Furthermore, it would be interesting to derive a concentration inequality for matrices in the p.s.d. ordering in order to quantify with what probability a finite ensemble is far from its expectation; this, however, appears to be far from straightforward, the rank deficiency of \(\hat{\Sigma }\) being the main technical issue to tackle.