A memory-based method to select the number of relevant components in principal component analysis

Anshul Verma; Pierpaolo Vivo; Tiziana Di Matteo

doi:10.1088/1742-5468/ab3bc4

1. Introduction

With the arrival of sophisticated new technologies and the advent of the big data era, the amount of digital information that can be produced, processed and stored has increased at an unprecedented pace in recent years. The need of sophisticated post-processing tools—able to identify and discern the essential driving features of a given high-dimensional system—has thus become of paramount importance. Principal component analysis (PCA), which aims to reduce the dimensionality of the correlation matrix between data [1, 2], is continuing to prove a highly valuable method in this respect. PCA has been shown to have applications spanning from neuroscience to finance. In image processing, for instance, this technique has proven useful to identify key mixtures of colours of an image for use in compression [3]. In molecular dynamics, the increasing computational power available to researchers makes it possible to simulate more complex systems, with PCA helping to detect important chemical drivers [4]. The brain's neurons produce different responses to a variety of stimuli, hence PCA can be used in neuroscience to find common binding features that determine such responses [5]. In finance, the amount of digital storage and the length of available historical time series have dramatically increased. It has therefore become possible to probe the multivariate structure of changes in prices, but with the large universe of stocks that usually make up markets, PCA has become a valuable technique in identifying essential factors governing price evolution [6–8].

Within the class of dimensionality reduction methods, whose goal is to produce a faithful but smaller representation of the original correlation matrix [9], PCA plays a very important role. Other known methods include information filtering techniques [10–15], autoencoders [16, 17] and independent component analysis (ICA) [18, 19]. PCA accomplishes this task using a subset of the orthogonal basis of the correlation matrix of the system. Successive principal components—namely the eigenvectors corresponding to the largest eigenvalues—provide the orthogonal directions along which data are maximally spread out. Since the dimension of empirical correlation matrices can be as large as ∼ $10^{2}-10^3$ , a highly important parameter is the number $m^\star$ of principal components one should retain, which should strike the optimal balance between providing a faithful representation of the original data and avoiding the inclusion of irrelevant details.

Unfortunately, there is no natural prescription on how to select the optimal value $m^\star$ , and many heuristic procedures and so-called stopping criteria have been proposed in the literature [1, 2]. The most popular methods—about which more details are given in section 7—are (i) scree plots [20], (ii) cumulative explained variance [21, 22], (iii) distribution-based methods [23, 24], and (iv) cross-validation [25, 26]. However, they all suffer from different, but serious drawbacks: (i) and (ii) are essentially rules of thumb with little data-driven justification, (iii) do not allow the user to control the overall significance level of the final result and are thus impractical for large data sets, and finally (iv), whilst being more objective and relying on fewer assumptions, is often computationally cumbersome [1]. Efforts to improve each subclass—for instance the more 'subjective' methods [20–22]—have been undertaken, but they usually resulted in adding more assumptions or were anyway unable to fully solve the issues [1].

Unlike most other methods available in the literature, in this paper we propose to take advantage of long memory effects that are present in many empirical time series [27] to select the optimal number $m^\star$ of principal components to retain in PCA. We shall leverage on the natural factor model implied by PCA (see section 5.2 below) to assess the statistical contribution of each principal component to the overall 'total memory' of the time series, using a recently introduced proxy for memory strength [15]. We test the validity of our proposal on synthetic data, namely two fractional Gaussian noise processes with different Hurst exponents (see section 6.1), and also on an empirical dataset whose details are reported in appendix A. Comparing our memory-based method with other heuristic criteria in the literature, we find that our procedure does not include any subjective evaluation, makes a very minimal and justifiable set of initial assumptions, and is computationally far less intensive than cross-validation.

Our methodology is generally applicable to any (however large) correlation matrix of a long-memory dataset. A typical example is provided by financial time-series, which are well-known to display long-memory effects [28]. The volatility of such time-series indeed constitutes an important input for risk estimation and dynamical models of price changes [29–31]. However, the multivariate extensions of common volatility models, such as multivariate Generalised Autogressive Conditional Heteroskedastic (GARCH) [32], stochastic covariance [33] and realised covariance [34], suffer from the curse of dimensionality, hindering their application in practice. A popular solution to this issue is to first apply PCA to the correlation matrix between volatilities, and then use the reduced form of the correlation matrix to fit a univariate volatility model for each component, as in [6]. In climate studies, PCA has been used to create 'climate indices' to identify patterns in climate data from a wide range of measurements including precipitations and temperature [35]. Here, factors such as the surface temperature are known to exhibit long range memory [36]. In neuroscience, PCA can be used to discover amongst the vast number of possible neurons those which correspond to particular responses, for example how an insect brain responds to different odorants [5]. In this case as well, long memory effects are well-known to play an important role [37]. Our framework is therefore highly suited to a wide array of problems.

The paper is organised as follows: in section 2, we introduce and define the PCA procedure and how one selects the most relevant number of principal components. Section 3 describes the relevant quantities and results that are specific to financial data. We detail our proposed method to select the principal components based on memory in section 5, testing the method on synthetic and empirical data in section 6. We explore the advantages that our method offers over existing approaches in literature in section 7, before finally drawing some conclusions in section 8. The appendices are devoted to the description of the empirical dataset and technical details.

2. PCA and the optimal number of principal components to retain

In this section, we give a brief introduction to PCA to make the paper self-contained. Call $\boldsymbol{X}$ the data matrix, which contains N columns—standardised to have zero mean and unit variance—of individual defining features, and T rows recording particular realisations in time of such features. PCA searches for the orthogonal linear basis with unit length $\boldsymbol{w}_{\{i=1,...,N\}}$ that transforms the system to one where the highest variance is captured by the first component, the second highest by the second component and so on [1]. The first component is therefore given by

$\begin{align} \renewcommand{\dag}{\dagger} \newcommand{\e}{{\rm e}} \newcommand{\re}{{\rm Re}} \displaystyle \boldsymbol{w}_{1}=\arg \max_{||\boldsymbol{w}||=1}\left\{||\boldsymbol{X}\boldsymbol{w}||^{2}\right\}=\arg \max_{||\boldsymbol{w}||=1}\left\{\boldsymbol{w}^{\dagger}\boldsymbol{E} \boldsymbol{w}\right\} , \nonumber \end{align} \tag{ 1 }$

where $\newcommand{\re}{{\rm Re}} \renewcommand{\dag}{\dagger}\dagger$ represents the transpose, and $\boldsymbol{E}$ is the sample correlation matrix of $\boldsymbol{X}$ , defined as

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle E_{ij}=\frac{1}{T}\sum_{t=1}^{T}X_{ti}X_{tj} \ . \label{EEqn} \nonumber \end{align} \tag{ 2 }$

The search for $\boldsymbol w_1$ can be formulated as a constrained optimisation problem, i.e. we must maximise

$\begin{align} \renewcommand{\dag}{\dagger} \newcommand{\e}{{\rm e}} \newcommand{\re}{{\rm Re}} \displaystyle \boldsymbol{w}^{\dagger}\boldsymbol{E} \boldsymbol{w}-\lambda(\boldsymbol{w}^{\dagger}\boldsymbol{w}-1) , \label{Lagrange} \nonumber \end{align} \tag{ 3 }$

where $\lambda$ is the Lagrange multiplier enforcing normalisation of the eigenvectors. Differentiating equation (3) w.r.t. to $\boldsymbol{w}$ we get

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \boldsymbol{E}\boldsymbol{w}-\lambda\boldsymbol{w}=0 \ . \nonumber \end{align} \tag{ 4 }$

This means that the Lagrange multiplier must be an eigenvalue of $\boldsymbol{E}$ . Also note that the variance of data along the direction $\boldsymbol w$ is given by

$\begin{align} \renewcommand{\dag}{\dagger} \newcommand{\e}{{\rm e}} \newcommand{\re}{{\rm Re}} \displaystyle \boldsymbol{w}^{\dagger}\boldsymbol{E}\boldsymbol{w}=\lambda\boldsymbol{w}^{\dagger}\boldsymbol{w}=\lambda , \nonumber \end{align} \tag{ 5 }$

and hence the largest variance is realised by the top eigenvalue. It follows that the first principal component—i.e. the direction along which the data are maximally spread out—is nothing but the top eigenvector $\boldsymbol{w}_{1}$ corresponding to the top eigenvalue $\lambda_1$ . A similar argument holds for the subsequent principal components.

The aim of PCA is to reduce $\boldsymbol{E}$ to a $m \times m$ matrix, where $m\ll N$ is the number of principal components that we choose to retain. Is there an optimal value $m^\star$ that one should select? Clearly, this is an important question that must be addressed, since it determines the 'best' size of the reduced correlation matrix that is just enough to describe the main features of the data without including irrelevant details. In this paper, we address this question and we provide a new method to select the optimal value $m^\star$ of the number of principal components that we should retain for long-memory data.

3. Financial data

3.1. Data structure

In this section, we describe the general structure of the data matrix that we use in the context of financial data. We consider a system of N stocks and T records of their daily closing prices. We calculate the time series of log-returns for a given stock i, r_i(t), defined as:

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle r_{i}(t)=\ln p_{i}(t+1) - \ln p_{i}(t) , \nonumber \end{align} \tag{ 6 }$

where p _i(t) is the price of stock i at time t. After standardising r_i(t) so that it has zero mean and unit variance, we define the proxy we shall use for the volatility, i.e. the variability in asset returns (either increasing or decreasing), as $\ln |r_{i}(t)|$ [38]. Most stochastic volatility models—where the volatility is assumed to be random and not constant—assume that the return for the stock i evolve according to [39]

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle r_{i}(t)=\delta(t){\rm e}^{\omega_{i}(t)} , \label{StochVolModel} \nonumber \end{align} \tag{ 7 }$

where $\delta(t)$ is a white noise with finite variance and $\omega_{i}(t)$ are the log volatility terms. The exponential term encodes the structure of the volatility and how it contributes to the overall size of the return. We note that for our purposes, we are able to set the white noise term to be the same for all stocks since it contains no memory by definition [40] (we have checked that changing this assumption to include a stock dependent white noise term does not change our results). Taking the absolute value of (7) and the log of both sides, equation (7) becomes

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \label{LogVolTerms} \ln |r_{i}(t)|=\ln |\delta(t)|+\omega_{i}(t) \ . \nonumber \end{align} \tag{ 8 }$

We see that working with $\ln |r_{i}(t)|$ has the added benefit of making $\omega_{i}(t)$ —the proxy for volatility—additive, which in turn makes the volatility more suitable for factor models. Since $\delta(t)$ is a random scale factor that is applied to all stocks, we can set it to 1, so that $\omega_{i}(t)=\ln |r_{i}(t)|$ . We also standardise $\ln |r_{i}(t)|$ to a mean of 0 and standard deviation 1 as performed in [41]. Finally, the data matrix $\boldsymbol{X}$ that we use as input for our procedure consists of entries $X_{ti}=\omega_{i}(t)$ .

3.2. Market mode and Marčenko–Pastur

For the case of log volatilities in finance [7, 42] (see further details in appendix B), it has been known for some time that the smallest eigenvalues of the empirical correlation matrix $\boldsymbol E$ may be heavily contaminated by noise due to the finiteness of the data samples. In our search for the most relevant $m^\star$ components, it is therefore important to confine ourselves to the sector of the spectrum that is less affected by noise at the outset.

To facilitate this identification, we will resort to a null distribution of eigenvalues, which are produced from a Gaussian white noise process. This is given by the celebrated Marčenko–Pastur (MP) distribution [7, 43, 44, 76]

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle p(\lambda)=\frac{1}{2\pi q\sigma^{2}}\frac{\sqrt{(\lambda_{+}-\lambda)(\lambda-\lambda_{-})}}{\lambda} \label{MPDist} , \nonumber \end{align} \tag{ 9 }$

where $p(\lambda)$ is the probability density of eigenvalues having support in $\lambda_{-}< \lambda < \lambda_{+}$ . The edge points $\lambda_{\pm}=\sigma\left(1\pm \sqrt{q}\right){}^{2}$ , q = T/N and $\sigma$ is the standard deviation over all stocks. By comparing the empirical eigenvalue distribution of $\boldsymbol E$ to the MP law (9), we can therefore see how many eigenvalues, and thus principal components, are likely to be corrupted by noise and should therefore be discarded from the very beginning. More recently, this procedure has received some criticisms [45–47]: it has been argued that eigenvalues carrying genuine information about weakly correlated clusters of stocks could still be buried under the MP sea, and more refined filtering strategies may be needed to bring such correlations to the surface. Other generalisations of the MP law for non-normally distributed random data and applications to financial data can be found for instance in [48] and [49].

In practical terms, we first create the empirical correlation matrix $\newcommand{\re}{{\rm Re}} \renewcommand{\dag}{\dagger}\boldsymbol E=(1/T)\boldsymbol X^\dagger\boldsymbol X$ from the matrix $\boldsymbol X$ (constructed from either synthetic or empirical data), and then we fit the MP law to the empirical distribution of its eigenvalues. This is done by considering q and $\sigma$ in equation (9) as free parameters to take into account finite sample biases [46]. In figure 1(a), we plot the histogram of bulk eigenvalues for the empirical dataset described in appendix A, and in the inset a number of outliers $\lambda > \lambda_{+}$ in semilog scale. It is indeed well-known that some of the eigenvalues of $\boldsymbol E$ extend well beyond the upper edge of the MP law, and that the largest eigenvalue lies even further away (see figure 1(a)). This means that the first principal component accounts for a large proportion of the variability of data, and is in fact a well-known effect of the market mode [41, 50, 51]. We plot the entries of the right eigenvector $\boldsymbol{w}_{1}$ of $\mathbf{E}$ (corresponding to the market mode) and $\boldsymbol{w}_{2}$ in figure 2, with the blue lines giving the length from the origin of the corresponding 2D vector. We see from figure 2 that the entries for $\boldsymbol{w}_{1}$ are all positive, which confirms that indeed the first eigenvector affects all stocks.

**Figure 1.** (a) Histogram of the eigenvalue distribution of $\boldsymbol{E}$ constructed from the empirical dataset (see appendix A), compared to the best fit Marčenko–Pastur distribution in red. The $\lambda$ axis has been split by the forward-slashes to only show the bulk eigenvalues below $\lambda_{+}=2.80$ . The inset shows the 22 isolated eigenvalues for $\lambda>\lambda_{+}$ in semilog scale. The Marčenko–Pastur distribution is fitted with parameters $q=0.38\pm 0.02$ and $\sigma =1.03\pm 0.01$ . (b) Same histogram, but applied to the correlation matrix $\boldsymbol{G}$ (see section 5.1), where the market mode has been de-trended. Here $\lambda_{+}=2.77$ , $q=0.41\pm 0.02$ , $\sigma=1.01\pm 0.01$ , with 35 eigenvalues above $\lambda_{+}$ .
Download figure:
Standard image High-resolution image

**Figure 2.** Each point i in this graph has coordinates $(w_{i1},w_{i2})$ , where w_i1 is the ith entry of the top eigenvector $\boldsymbol{w}_{1}$ , and w_i2 is the ith entry of the second eigenvector $\boldsymbol{w}_{2}$ of the correlation matrix $\boldsymbol{E}$ (defined in equation (2)) for the dataset in appendix A. The length of the corresponding 2D coordinate vector from the origin is given by each blue line. The plot shows that all values of w_i1 are in fact positive.
Download figure:
Standard image High-resolution image

**Figure 2.** Each point i in this graph has coordinates $(w_{i1},w_{i2})$ , where w_i1 is the ith entry of the top eigenvector $\boldsymbol{w}_{1}$ , and w_i2 is the ith entry of the second eigenvector $\boldsymbol{w}_{2}$ of the correlation matrix $\boldsymbol{E}$ (defined in equation (2)) for the dataset in appendix A. The length of the corresponding 2D coordinate vector from the origin is given by each blue line. The plot shows that all values of w_i1 are in fact positive.
Download figure:
Standard image High-resolution image

4. Long memory

We now consider the 'long memory' features of a time series, specialising the discussion to the log volatility in a financial context.

The autocorrelation function (ACF), $\kappa(L)$ , of any time series $x(t)$ is defined as

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \kappa(L)={\rm corr}(x(t+L) ,x(t)) =\frac{\langle\left[x(t+L) x(t)\right]\rangle}{\sigma^{2}} , \label{ACF} \nonumber \end{align} \tag{ 10 }$

where $\langle...\rangle$ denotes the time expectation over $x(t)$ , adjusted to have zero mean. L is the lag and $\sigma^{2}$ is the variance of the process $x(t)$ . If $\kappa(L)$ decays faster than or as fast as an exponential with L, then the time series is said to have short memory [27]. However, in many real world systems ranging from outflows in hydrology to tree ring measurements [27], $\kappa(L)$ has been found to decay much more slowly than an exponential, giving rise to an important effect known as long memory [27]. This means that the process at time t remains heavily influenced by what happened in a rather distant past. In particular for financial data (where $x(t)=|\ln r(t)|$ ), it is an accepted stylised fact (called volatility clustering) that large changes in volatilities are usually followed by other large changes in volatilities, or that the volatilities retain a long memory of previous values [52]. $\kappa(L)$ has also been empirically found to follow a power law decay

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \kappa(L)\sim L^{-\beta^{{\rm vol}}} , \label{PowerLawACF} \nonumber \end{align} \tag{ 11 }$

where $\beta^{{\rm vol}}$ describes the strength of the memory effect—a lower value indicates that a longer memory of past values is retained. However, as shown in [15], to better distinguish between short and long memory it is convenient to consider the non-parametric integrated proxy $\newcommand{\e}{{\rm e}} \eta$ , defined as

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \eta=\int_{L=1}^{L_{\rm cut}}\kappa(L) {\rm d}L ,\label{IntProxy} \nonumber \end{align} \tag{ 12 }$

where $L_{\rm cut}$ is the standard Bartlett cut at the 5% level [40]. The proxy $\newcommand{\e}{{\rm e}} \eta$ is less affected by the noise-dressing of $\kappa(L)$ than $\beta^{{\rm vol}}$ [15], and the larger the value of $\newcommand{\e}{{\rm e}} \eta$ the greater the degree of the memory effect. This observable will constitute an essential ingredient of our method.

5. Methods

In this section, we describe in detail our procedure.

5.1. De-trending the market mode

The first step of our method consists in removing the influence of the market mode, the global factor affecting the data, as we hinted at in section 3.2. To do this, we impose that the standardised log volatility $\omega_{i}(t)=\ln |r_{i}(t)|$ in equation (8) follows a factor model (using the capital asset pricing model, CAPM [53, 54]) with the market mode

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle I_{0}(t)=\sum_{i=1}^{N}w_{i1}\ln |r_{i}(t)| \label{MarketModeWeighted} \nonumber \end{align} \tag{ 13 }$

as a factor. This quantity—essentially a weighted average of $\ln |r_{i}(t)|$ with weights $\boldsymbol{w}_{1}$ , the top eigenvector's components—represents the effect of the market as a whole on all stocks i.e. the common direction taken by all stocks at once.

Hence we define

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \omega_{i}(t)=\beta_{i0}I_{0}(t)+\alpha_{i0}+c_{i}(t) \ .\label{MarketEqn} \nonumber \end{align} \tag{ 14 }$

Here, $\beta_{i0}$ is the responsiveness of stock i to changes in the market mode I₀(t), $\alpha_{i0}$ is the excess volatility compared to the market and c_i(t) are the residual log volatilities.

A standard linear regression of $\omega_{i}(t)$ against I₀(t) brings to the surface the residual volatilities c_i(t) that the market as a whole cannot explain. The matrix of standardised c_i(t) for all stocks is labeled $\boldsymbol{X}^{({\rm market})}$ . We call $\boldsymbol{G}$ the correlation matrix of $\boldsymbol{X}^{({\rm market})}$ , with entries

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle G_{ij}=\frac{1}{T}\sum_{t=1}^{T}c_{i}(t)c_{j}(t) \ . \label{GEqn} \nonumber \end{align} \tag{ 15 }$

By definition, the matrix $\boldsymbol{G}$ will have the influence of the market mode removed through equation (14). This cleaning procedure also makes the correlation structure more stable [55], and therefore we will be working with the matrix $\boldsymbol G$ from now on.

As we did with the matrix $\boldsymbol{E}$ , we again fit the Marčenko–Pastur (MP) distribution—this time to the empirical eigenvalue distribution of $\boldsymbol{G}$ . This is justified even in the presence of autocorrelations since in the bulk the amount of memory is quite low. We can see this empirically by computing the median $L_{\rm cut}$ over principal components that have eigenvalues below $\lambda_{+}$ for the fitted MP distribution, which is 2. The values are quite close to 1, which is the value we would find for white noise. In the presence of weak autocorrelations, [56] showed that the distribution of eigenvalues in the bulk differs slightly to the MP distribution. We clearly see this distortion in our figure 1(b), which bears some similarity in shape with the pdf calculated and plotted in [56] (see figure 1 there). However, the MP distribution is a simpler and very good approximation, especially for the edge points in figure 1(b). We expect that the number of eigenvalues beyond the bulk should increase, since the removal of the market mode makes the true correlation structure more evident and lowly intra-correlated clusters more visible [55]. This is confirmed by the results that are detailed in figure 1(b), where we see that the number of eigenvalues beyond the bulk (shown in the inset plot in figure 1(b)) has indeed increased from 22 to 35. Note that we also see from figure 1 that the best fit q and $\sigma$ for $\boldsymbol{E}$ and $\boldsymbol{G}$ are quite similar, which matches the theoretical result of [57].

With this finding in hand, we can safely disregard all principal components corresponding to eigenvalues within the MP sea. This observation already drastically reduces the maximum value of eligible components—which we call $m_{{\rm max}}$ —from 1202 to 35 for the empirical data in appendix A. We also recall that the eigenvectors of $\boldsymbol{G}$ have an economic interpretation according to the industrial classification benchmark (ICB) supersectors—for more details, see appendix B.

5.2. Regression of principal components

Considering the matrix $\boldsymbol{G}$ in equation (15), where the influence of the market mode has been removed, we must now assess how each stock i's log-volatility is related to the log-volatility of each principal component. We achieve this result by regressing c_i(t) in equation (14) against the average behaviour of the log-volatility for the principal components. The average behaviour I_p(t) is defined as

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle I_{p}(t)=\sum_{i=1}^{N}w_{ip}c_{i}(t) ,\qquad p=1,...,m_{{\rm max}},\label{PortfolioMode} \nonumber \end{align} \tag{ 16 }$

i.e. it is the weighted average log-volatility of the p th principal component, where w_ip is the ith entry of the p th eigenvector of $\boldsymbol G$ . Equation (16) is therefore the projection of the residue c_i(t) onto the $m_{{\rm max}}$ principal components.

The principal components are an orthogonal basis for the correlation matrix $\boldsymbol{G}$ , and represent important features that determine fluctuations in the c_i's. Therefore, it makes sense to define a factor model—which we call the ' $m_{{\rm max}}$ -based PCA factor model'—where the explanatory variables are the $m_{{\rm max}}$ principal components [1, 58]

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle c_{i}(t)=\sum_{p=1}^{m_{{\rm max}}}\beta_{ip}I_{p}(t)+\epsilon_{i}(t) , \label{FactorModel} \nonumber \end{align} \tag{ 17 }$

where $\newcommand{\e}{{\rm e}} \epsilon_{i}(t)$ is a white noise term with zero mean and finite variance, and c_i(t) are the residual volatilities defined in equation (14). Here, $\beta_{ip}$ is the responsiveness of c_i(t) to changes in I_p(t), indicating whether the log-volatility of stock i is higher ( $\beta_{ip}>1$ ) or lower than I_p(t) ( $\beta_{ip}<1$ ).

We can now find $\beta_{ip}$ by regressing the previously obtained input c_i(t) against all the I_p's. This will separate the signal explained by the principal components from the residual noise present in the system. The regression will be performed using a lasso method (see appendix C for details).

5.3. Assessing memory contribution

The next step of our methodology consists in estimating the memory contribution of the $m=1,2,...,m_{{\rm max}}$ components.

Fixing m, we compute for each stock the quantity

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle d_{i}^{(m)}(t)=c_{i}(t)-\sum_{p=1}^{m}\beta_{ip}I_{p}(t) , \ \ i=1,...,N\ . \label{ResidueDfn} \nonumber \end{align} \tag{ 18 }$

Here, the $\beta_{ip}$ are the coefficients obtained with the regression in equation (17). The $d_i^{(m)}(t)$ are the residues after the removal of the first m components.

Using the $d_{i}^{(m)}(t)$ , we can first compute their temporal autocorrelation $\kappa_i^{(m)}(L)$ in equation (10) for different values of the lag L between L = 1 and L = T − 1. We generically find that the $\kappa_{i}^{(m)}(L)$ follow a power law decay as a function of L—see examples in figure 3 depicting the $\kappa_{i}^{(m)}(L)$ for $m=1,11$ for ALJ Regional Holdings (ALJJ), a stock included in our empirical dataset in appendix A. As more components are removed (i.e. as m is increased), the exponent $\beta^{{\rm vol}}$ defined in equation (11) for ALJJ and plotted in figure 3 increases from 0.277 to 0.322. This result is what one would expect since the amount of memory accounted for will decrease as more components are removed.

**Figure 3.** Plots in log–log scale of $\kappa^{(m)}(L)$ in blue, which is given in equation (10) with $x(t)=d_{i}^{(m)}(t)$ , for the stock ALJ Regional Holdings (ALJJ). Here, m = 1 on the left and m = 11 on the right. In red the lines of best fit (using the Theil Sen estimator [59]), which gives the power law decay exponent $\beta^{{\rm vol}}$ as 0.277 and 0.322 for the left and right plot respectively. (a) $\kappa^{(1)}(L)$ . (b) $\kappa^{(11)}(L)$ .
Download figure:
Standard image High-resolution image

**Figure 3.** Plots in log–log scale of $\kappa^{(m)}(L)$ in blue, which is given in equation (10) with $x(t)=d_{i}^{(m)}(t)$ , for the stock ALJ Regional Holdings (ALJJ). Here, m = 1 on the left and m = 11 on the right. In red the lines of best fit (using the Theil Sen estimator [59]), which gives the power law decay exponent $\beta^{{\rm vol}}$ as 0.277 and 0.322 for the left and right plot respectively. (a) $\kappa^{(1)}(L)$ . (b) $\kappa^{(11)}(L)$ .
Download figure:
Standard image High-resolution image

Numerically integrating the $\kappa_{i}^{(m)}(L)$ , we obtain a set of integrated memory proxies $\newcommand{\e}{{\rm e}} \eta_{i}^{(m)}$ (see equation (12)), one for each asset i and for each number m of removed components. In general, the $\newcommand{\e}{{\rm e}} \eta_{i}^{(m)}$ are non-increasing functions of m, since the further removal of subsequent components is bound to decrease the residual memory level present in the system.

We now define

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \zeta(m)={\rm median}\left(\frac{\eta_{i}^{(m)}}{\eta_{i}^{(0)}}\right) , \label{MemoryReductionEqn} \nonumber \end{align} \tag{ 19 }$

where $\newcommand{\e}{{\rm e}} \eta_{i}^{(m)}$ are the integrated proxies, and $\newcommand{\e}{{\rm e}} \eta_{i}^{(0)}$ are just the integrated proxies of the residual volatilities c_i(t) defined in equation (14). $\zeta(m)$ thus represents the 'average' behaviour over all stocks of how much each of the principal components contributes to the memory. It is again a non-increasing function of m, and by definition $\zeta(m)<1$ for all m.

In figure 4(c), we plot $\zeta(m)$ in log–log scale for both the empirical and synthetic datasets in appendix A and section 6.1 respectively. We observe a striking change in concavity at some value $\theta$ , which we interpret as follows: to the right of $\theta$ , the amount of memory left unexplained in the system changes very slowly when more and more components are progressively included. This clearly signals that we have reached the 'optimal stopping' point $m^\star$ beyond which the inclusion of further components would not add more information.

**Figure 4.** (Top left) Plot of $\ln(\zeta(m))$ versus $\ln(m)$ for the homogeneous synthetic system defined in section 6.1. The blue line is the value of $\zeta(m)$ across all assets, with the dashed red line indicating $\hat{\theta}=20$ , the point at which the concavity changes. (Top right) Same plot but for the heterogeneous simulated system described in section 6.1, where $\hat{\theta}=13$ . (Bottom) Same plot but for the empirical dataset described in appendix A, yielding $\hat{\theta}=16$ . These values of $\hat{\theta}$ imply that the number $m^\star$ of principal components to retain should be $m^\star=19,12,15$ respectively. (a) Homogeneous. (b) Heterogeneous. (c) Empirical.
Download figure:
Standard image High-resolution image

Beyond $\theta$ , the behaviour of $\zeta(m)$ is power-law

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \zeta(m) \sim m^{-\gamma} \qquad m \geqslant \theta , \label{ThetaPowerLaw} \nonumber \end{align} \tag{ 20 }$

where $\gamma$ is the exponent. Using the fitting procedure for $\theta$ described in appendix D produces the optimal integer estimator $\hat{\theta}$ . Since the value of $\hat{\theta}$ indicates that for $m<\hat{\theta}-1$ , $\zeta(m)$ decreases more rapidly than a power law, we can safely set

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle m^{\star}=\hat{\theta}-1 \ . \nonumber \end{align} \tag{ 21 }$

5.4. Summary of the procedure

The procedure to select the optimal number $m^{\star}$ of principal components to retain is summarised here for a general, standardised data-matrix $\boldsymbol{X}$ containing long memory effects (justifications for the steps can be found in the sections labelled in brackets after each step):

(i)
Remove any global effect from $\boldsymbol{X}$ to form $\boldsymbol{X}^{({\rm market})}$ , whose entries are the residues c_i(t) defined in equation (14) (section 5.1).
(ii)
Compute the correlation matrix $\boldsymbol{G}$ of $\boldsymbol{X}^{({\rm market})}$ and find the empirical probability density of its eigenvalues. Find the number of eigenvalues $m_{{\rm max}}$ exceeding $\lambda_{+}$ , the upper edge of the Marčenko–Pastur distribution in equation (9) (section 3.2).
(iii)
Forming the $m_{{\rm max}}$ -based PCA factor model from equation (17), use lasso regression (see appendix C) to find the set of parameters $\beta_{ip}$ . This is achieved by regressing the residues c_i(t) against the average behaviour of principal components I_p(t) $p=1,...,m_{{\rm max}}$ (section 5.2).
(iv)
Using these $\beta_{ip}$ 's, determine for each $m=1,...,m_{{\rm max}}$ and stock i the residue $d_{i}^{(m)}(t)$ given in equation (18) (section 5.3).
(v)
From the $d_{i}^{(m)}(t)$ , compute the temporal autocorrelations $\kappa_i^{(m)}(L)$ for different values of L, and by integration determine the proxies $\newcommand{\e}{{\rm e}} \eta_i^{(m)}$ . Construct $\zeta(m)$ from equation (19) (section 5.3).
(vi)
Use the fitting procedure in appendix D to find $\hat{\theta}$ , the best estimator of $\theta$ —the point at which the concavity of $\zeta(m)$ changes—defined in equation (20). Finally, the optimal number of principal components to retain is $m^{\star}=\hat{\theta}-1$ (section 5.3).

6. Applying our method to synthetic and empirical data

In this section, we test our method on synthetically generated data and on an empirical data set defined in appendix A.

6.1. Synthetic system setup

A paradigmatic example of stochastic process with long memory is the fractional Gaussian noise (FGN). The FGN with Hurst exponent H is the process $Y(t)$ with an autocorrelation function [27] given by

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \kappa_{\rm FGN}(L)=\frac{1}{2}\left(|L-1|^{2H}-2|L|^{2H}+|L+1|^{2H}\right) \sim H(2H-1)L^{2H-2} \ . \label{fGn_ACF} \nonumber \end{align} \tag{ 22 }$

Equation (22) indeed shows that the FGN has long memory since its autocorrelation function follows a power law decay as described in section 4. In particular, for 1/2 < H < 1 (H = 1/2 corresponds to the standard white noise) we have a process with positive autocorrelation, a feature that is shared by financial data [28]. Increasing H will enhance the strength of the memory present in the FGN since $\kappa_{\rm FGN}(L)$ will decay more slowly in this case. We shall use the method detailed in [60] to generate realisations of FGN.

For our synthetic setting, we consider a fictitious market made of N stocks, and simulate the log-volatility $\omega_i(t)$ of each stock over a time-window T. To this end, we make use of the widely recognised fact that empirical data from finance are often organised into clusters [61–64]. We therefore impose that the stocks are organised into K disjoint clusters, each containing N_k stocks.

Next, we generate a fictitious market mode I₀(t) that affects all stocks [41, 50, 51]. This is simply a FGN process with Hurst exponent H₀, which we will set to 0.9 in our simulations. We also fix the variance of I₀(t) to be 1.

Our simulated log-volatility processes will thus read

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \omega_i(t)=\beta_0 I_0(t) +\beta_{k(i)}I_{k(i)}(t)+\epsilon_i(t),\label{volproc} \nonumber \end{align} \tag{ 23 }$

where $k(i)$ denotes the index of the cluster the asset i belongs to, and the I_k(t)'s are FGN processes with Hurst exponents H_k and fixed variance of 1. The $\newcommand{\e}{{\rm e}} \epsilon_{i}$ 's are white noise terms with zero mean and variance $\phi$ .

Typical values we use in our simulations are N = 1200, T = 4000 and K = 30. We simulate two markets with different internal arrangements of the clusters: the first one is homogeneous, where the size of each cluster is exactly 40, and the second is heterogeneous, meaning that each cluster has a different number of stocks. The latter case is particularly significant since the cluster sizes present in financial data as well as in many other systems are known to be heterogeneous [61, 62]. To generate the heterogeneous system, we use the procedure described in [65], which yields power-law distributed cluster sizes, a key property of real world data [66]. The particular realisation of this method that we use to generate cluster sizes for N = 1200 has a mean number of stocks in each cluster of 40 and a standard deviation of 26.2. We also set $\beta_{0}=1.3$ , while $\beta_{k}$ are values between 0.14 and 1, and H_k is an equally spaced sequence between 0.7 and 0.9. This choice ensures that clusters with a higher $\beta_{k}$ will also have a higher H_k, to make contact with the empirical result of [67] that stocks with higher volatility cross-correlation have a longer memory. Finally, $\phi$ is fixed to be 1, the same as the variance of the time series I₀(t) and I_k(t). Note that we also simulate the same system using instead an Autoregressive process of 1 lag (AR(1)) [40] in appendix E, where we show that our method can be applied in this case too, but is less accurate. This supports our reasoning that that slow decrease to the right of $\hat{\theta}$ in figure 4(c) is more applicable to long memory processes versus short ones.

Arranging the log-volatilities $\omega_i(t)$ in a rectangular data-matrix $\boldsymbol X$ , we can then feed $\boldsymbol X$ into our algorithmic procedure and check how many significant components $m^\star$ it retrieves. A desirable feature of our synthetic model is that it is rather easy to estimate a priori how many eigenvalues of $\newcommand{\re}{{\rm Re}} \renewcommand{\dag}{\dagger}\boldsymbol E=(1/T)\boldsymbol X^\dagger \boldsymbol X$ (or rather of its de-trended counterpart $\boldsymbol G$ ) contain information that can be separated from pure noise. This occurs because each cluster corresponds to a principal component and hence the number of eigenvalues beyond the bulk is just K. This makes the a posteriori comparison all the more interesting.

6.2. Results for synthetic and empirical data

We simulate 100 independent samples of our synthetic market, after checking that the statistics was sufficient to be confident on the stability of our results, and we follow the procedure set out at the end of section 5.3 to select $m^{\star}$ . First, we checked the eigenvalue distributions of the correlation matrix obtained from the simulated $\boldsymbol X$ . We see from figure 5, which are histograms of the bulk eigenvalues of $\boldsymbol G$ for all samples for the homogeneous (left) and heterogeneous (right) systems respectively, that the bulk of the eigenvalues is well fitted by the Marčenko–Pastur distribution in red. There are $m_{{\rm max}}=30$ (homogeneous) and $m_{{\rm max}}=28$ (heterogeneous) eigenvalues beyond the bulk (depicted in the insets) that carry genuine information. This again shows that in the synthetic case the autocorrelations are also weak. We see this again by calculating the median $L_{\rm cut}$ of equation (12) for the synthetic systems to be 2—again close to 1, which is what we would expect for white noise, hence we can still use the MP distribution as an approximation. We also remark that the MP fits in figures 5(a) and (b) are better than that of figures 1(a) and (b) because we can tune the white noise in our synthetic data so that the bulk in this region behaves more similar to white noise. This is achieved by changing the value of $\phi$ .

**Figure 5.** Histograms of eigenvalues of the matrix $\boldsymbol G$ for 100 samples of the synthetic market with N = 1200, T = 4000, and K = 30 clusters. The values of the $\beta$ coefficients and of the Hurst exponents are as in the main text. (Left) Homogeneous system with 40 stocks in each cluster. In red the best fit Marčenko–Pastur distribution of equation (9) with parameters $q=0.284\pm 0.002$ and $\sigma=0.939\pm 0.001$ with upper edge $\lambda_{+}=2.0756$ . The inset includes the $m_{{\rm max}}=30$ eigenvalues beyond $\lambda_{+}$ . (Right) Same plot but for a heterogeneous system with the same parameters, but a different cluster structure defined in section 6.1. Here $\lambda_{+}=1.9322$ , $q=0.299\pm 0.004$ , and $\sigma=0.898\pm 0.002$ . Finally, in this case there are $m_{{\rm max}}=28$ eigenvalues beyond $\lambda_{+}$ . (a) Homogeneous. (b) Heterogeneous.
Download figure:
Standard image High-resolution image

**Figure 5.** Histograms of eigenvalues of the matrix $\boldsymbol G$ for 100 samples of the synthetic market with N = 1200, T = 4000, and K = 30 clusters. The values of the $\beta$ coefficients and of the Hurst exponents are as in the main text. (Left) Homogeneous system with 40 stocks in each cluster. In red the best fit Marčenko–Pastur distribution of equation (9) with parameters $q=0.284\pm 0.002$ and $\sigma=0.939\pm 0.001$ with upper edge $\lambda_{+}=2.0756$ . The inset includes the $m_{{\rm max}}=30$ eigenvalues beyond $\lambda_{+}$ . (Right) Same plot but for a heterogeneous system with the same parameters, but a different cluster structure defined in section 6.1. Here $\lambda_{+}=1.9322$ , $q=0.299\pm 0.004$ , and $\sigma=0.898\pm 0.002$ . Finally, in this case there are $m_{{\rm max}}=28$ eigenvalues beyond $\lambda_{+}$ . (a) Homogeneous. (b) Heterogeneous.
Download figure:
Standard image High-resolution image

For each sample we find the median $\zeta(m)$ , plotting it in log–log scale in figure 4(a) for the homogenous system and in figure 4(b) for the heterogeneous one.

As already described, the optimal value $m^{\star}$ turns out to be 19 and 12 for the homogenous and heterogeneous systems respectively. The fact that $m^{\star}$ is lower for the heterogeneous system makes sense since its broad, power law distributed values of N_k mean that more of the memory of the system is contained in earlier principal components, whose N_k are larger. Since more of the memory is concentrated in fewer principal components, it is natural that the corresponding values of $m^{\star}$ will be lower for the heterogenous system. On the other hand for the homogenous system, we have that the N_k are equal for all k, so we can expect that the memory is more evenly distributed across the principal components i.e. that $m^{\star}$ will be larger. We also apply the method in section 5 to the data matrix $\boldsymbol{X}$ corresponding to the empirical dataset described in appendix A, for which $m_{{\rm max}}=35$ (see caption of figure 4(c) for details).

7. Comparison with other heuristic methods to select $\boldsymbol{m^{\star}}$

In this section, we shall compare our new method with available 'stopping rules' in the literature. Many heuristic methods have been proposed in order to determine $m^{\star}$ , generally falling into three categories: subjective methods, distribution-based methods and computational procedures [1, 2]. We describe here the most common ones in each category.

In the class of subjective methods, we find two similar procedures, the cumulative percentage of variation [21, 22] and scree plots [20]. The former is based on selecting the minimum value of m such that the cumulative percentage of variation explained by the m principal components exceeds some threshold $\alpha$ :

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle m^{\star}=\min_{m}\left\{\Lambda(m) > \alpha\right\} , \nonumber \end{align} \tag{ 24 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \Lambda(m)=100\frac{\sum\nolimits_{p=1}^{m}\lambda_{p}}{N} , \label{CumVarEqn} \nonumber \end{align} \tag{ 25 }$

where $\Lambda(m)$ is the % cutoff, $\alpha$ is the percentage cutoff threshold and $\{\lambda_{p}\}_{p=1}^{m}$ are the first m eigenvalues of $\boldsymbol{G}$ . Common cutoff ranges lie somewhere between $70\%$ to $90\%$ , with a preference towards larger values when it is known or obvious that the first few principal components will explain most of the variability in the data [1]. An obvious disadvantage of this method is that it relies on the choice of some arbitrary value for the tolerance $\alpha$ .

Scree plots involve plotting a 'score' representing the amount of variability in the data explained by individual principal components, and then choosing the point at which the plot develops an 'elbow', beyond which picking further principal components does not significantly enhance the level of memory already accounted for. This procedure again has the obvious drawback of relying on graphical inspection and therefore being even more subjective than the cumulative percentage of variation.

Among the class of distribution-based methods, the most commonly used procedure is the Bartlett Test [23]. This involves testing the null hypothesis [1]

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle H_{0,m}=\lambda_{m+1}=\lambda_{m+2}=...=\lambda_{N} , \nonumber \end{align} \tag{ 26 }$

that is whether the last N − m eigenvalues are identical, against the alternative that at least two of the last N − m eigenvalues are not identical, and repeating this test for various values of m. One then selects the maximum value of m for which the outcome of the hypothesis test is significant. Intuitively, this procedure tests whether the last N − m eigenvalues explain roughly the same amount of variability in the data so that they can be regarded as noise, and then takes $m^{\star}$ to be the maximum number of 'significant' eigenvalues. According to this procedure, one first tests H_0,N−2 i.e. whether $\lambda_{N-1}=\lambda_{N}$ . If this hypothesis is not rejected, then one tests H_0,N−3, and if this is not rejected the exact same test is performed for H_0,N−4 and so on. The procedure carries on testing each individual H_0,m until the first time ( $m=m^{\star}-1$ ) the hypothesis gets rejected at the required confidence level. Since several tests need to be conducted sequentially, the overall significance of the procedure will not be the same as the one imposed for each individual test, with no way of correcting for this bias as the number of tests to be performed is a priori unknown. This drawback makes distribution-based methods very impractical with real data [1].

The last category (computational procedures) involves the use of cross-validation. Cross-validation requires that some chunks of the original dataset $\boldsymbol{X}$ be initially removed. The remaining data matrix entries are used in conjunction with equation (17) to cast predictions on the removed entries using m principal components. We focus on so called 10-fold contiguous block cross-validation, which has been argued to be optimal in the sense that it most accurately captures the true structure of the correlation matrix (either $\boldsymbol E$ or $\boldsymbol G$ ) [68]. According to this procedure, we divide the data matrix $\boldsymbol{X}$ into 10 rectangular blocks row-wise, which we call $\boldsymbol{X}^{(g)}$ for $g=1,...,10$ . For each group g, we calculate the correlation matrix $\boldsymbol{G}^{(g)}$ associated with the matrix $\boldsymbol X$ but with the block $\boldsymbol{X}^{(g)}$ removed. Next, we take m principal components of $\boldsymbol{G}^{(g)}$ and use them in a factor model like in equation (17) but with m as the upper limit for the sum to predict the values of $\boldsymbol{X}^{(g)}$ , which we call $\hat{\boldsymbol{X}}^{(g,m)}$ . We then repeat this procedure for every m and g.

After doing so, we can calculate the prediction residual error square sum, or ${\rm PRESS}(m)$ , as a function of m. This is the total (un-normalised) squared prediction error for each value and over all blocks

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm PRESS}(m)=\sum_{i=1}^{N}\sum_{g=1}^{10}\sum_{t\in \mathcal{G}_{g}}\left(\hat{\boldsymbol{X}}^{(g,m)}_{ti}-\boldsymbol{X}^{(g)}_{ti}\right)^{2} , \label{PRESS_eqn} \nonumber \end{align} \tag{ 27 }$

with $\hat{\boldsymbol{X}}^{(g,m)}$ being the matrix of predicted values for block g using m principal components, and $\mathcal{G}_{g}$ indicating the row indices belonging to block g. Equation (27) represents the out-of-sample error in predicting the entries of $\boldsymbol{X}$ , which implies that ${\rm PRESS}(m)$ should initially decrease as m increases. However, beyond a certain threshold, ${\rm PRESS}(m)$ might start to increase instead, indicating that we are beginning to overfit the data. The optimal $m^\star$ should therefore be chosen to be the value which minimises ${\rm PRESS}(m)$ , thus striking the optimal balance between increasing the model complexity and overfitting the data. This procedure has an obvious advantage over the previous two categories as it is parameter-free and not subjective. However, one significant drawback for practical purposes is that the procedure becomes computationally very expensive for large datasets due to the typically ∼ $\mathcal{O}(N m_{{\rm max}})$ regressions that need to be performed from the dataset.

We compare our memory-based method, the cumulative variance method with $70\%$ and $90\%$ cutoffs and the 10-fold cross-validation method for 100 samples of the synthetic system described in section 6.1 and for the empirical dataset described in appendix A, where the numerical outputs of $m^{\star}$ for these methods is detailed in the columns of table 1.

Table 1. This table summarises the $m^{\star}$ values obtained for the synthetic data described in section 6.1 and empirical dataset described in appendix A. Results from our memory-based method from section 6.2 are included in the first row. In the second row, we have the cumulative variance rule for the cutoffs $70\%$ and $90\%$ . The final row includes the ${\rm PRESS}(m)$ (see equation (27)), using 10-fold cross-validation.

	Synthetic
	Homogeneous	Heterogeneous	Empirical
Memory-based	19	12	15
Cumulative variance	12–22	7–17	13–27
Cross-validation	29	25	28
$m_{\rm max}$	30	28	35

In figure 6 (top panel), we plot for the homogeneous and heterogeneous synthetic data the median of $\Lambda(m)$ (see equation (25)) over all samples, indicating the $70\%$ and $90\%$ cutoffs in dashed red lines. The $70\%$ and $90\%$ cutoffs yield an optimal number of 12 and 22 components for the homogeneous case and 7 and 17 for the heterogeneous case, respectively. It makes sense that fewer components are needed in the heterogeneous case as more of the total variance is accounted for by the first principal components, which correspond by construction to the larger clusters. We recall that our memory-based method predicts $m^{\star}=19$ and $m^{\star}=12$ for the homogeneous and heterogeneous cases respectively, and these values fall squarely between the prescribed $70\%$ and $90\%$ cutoffs [1]. However, our method is superior in that it gives a unique value for $m^\star$ and not a range of values, and does not use subjective criteria or rules of thumb.

Figure 6 (bottom panel) depicts the median of ${\rm PRESS}(m)$ across all samples, from which we see that the minimum occurs at $m^{\star}=29$ for the homogenous system and $m^{\star}=25$ for the heterogenous one. Hence, the cross-validation method would induce us to keep the majority of components in both systems. This is to be expected since cross-validation is based on minimising the out-of-sample prediction error (see equation (17)), hence performing the linear regression many times necessarily leads to a higher likelihood of including a larger number of principal components. This comes of course at the price of computational speed. Another interesting observation is that the minima occurring in both systems are not sharply defined, which indicates that the out-of-sample error made by including a larger number of components than the optimum $m^\star$ does not actually increase by a significant amount.

Compared to cross-validation, our methodology leads to keeping fewer components. Our procedure, however, is less computationally expensive since it performs far fewer regressions to find $m^{\star}$ (see table 2). Another advantage of our method over cross-validation can be spotted in the top panel of figure 4, which highlights that only $9\%$ and $6\%$ of the total memory for the homogenous and heterogeneous systems after removing the market mode is unaccounted for to the right of $\hat\theta$ . From the perspective of explaining the memory in the time series, therefore, our method does on average a very good job while requiring very limited computational resources.

Table 2. Computational times in seconds for our proposed memory-based method (first row) and cross-validation using 10 contiguous blocks (second row). The first two columns refer to the homogeneous and heterogeneous synthetic systems in section 6.1. The final column is for the empirical dataset described in appendix A. These performance times were calculated on a Windows 10, CPU Intel i7-6700 3.4 GHz, RAM 16GB PC using MATLAB 2017b.

	Synthetic
	Homogeneous	Heterogeneous	Empirical
Memory-based	138.6	137.6	209.7
Cross-validation	1136.8	1146.3	1462.3

Now that we have compared the methods for a fixed $\phi$ , the variance of the noise term for our synthetic data (see equation (23)), we can check how robust each of the methods is to changes in $\phi$ . We note that fixing $\phi=1$ constitutes already a hard regime to analyse since it implies that the fluctuations due to I_k(t) are of the same magnitude as the white noise, so we can see already that our method stands well compared to others with this high value of $\phi$ . In figure 7, we compare—using 100 samples of the synthetic systems for the homogeneous and the heterogeneous cases—the optimal value $m^\star$ predicted by the cumulative variance method with $70\%$ and $90\%$ cutoffs, the 10-folds cross-validation method and our own memory-based procedure as we vary $\phi$ .

**Figure 7.** A comparison of the different methods for selecting $m^{\star}$ by varying $\phi$ , the noise level in the simulation of synthetic data (see equation (23)). For each value of $\phi$ , 100 samples of the process are generated, with the results for the homogeneous system plotted on the left, and for the heterogeneous system on the right. The blue and red lines represent the results for the $70\%$ and $90\%$ cumulative variance procedure. The orange line corresponds to our method. Finally the purple line represents results from 10-fold cross-validation.
Download figure:
Standard image High-resolution image

The $70\%$ and $90\%$ cutoffs for the cumulative variance rule remain relatively stable for most values of $\phi$ , before slowly decreasing for higher values of $\phi$ . This decrease occurs because the increased level of noise lowers the contribution to the variance from higher components, with the consequence that the cutoff is reached sooner for higher values of $\phi$ . Within our memory-based method, the value of $m^\star$ decreases for increasing $\phi$ . This decrease occurs because a higher amount of white noise increasingly masks the long-memory properties of the underlying signal, and will affect the deeper principal components more since they have a lower memory strength (lower H_k) anyway. This is a desirable property since it means that lowering the noise level will lead us to retain more principal components. Whilst the decrease in the number of components occurs earlier than for the cumulative variance method, it still remains between the $70\%$ and $90\%$ cutoffs, and even closer to the $90\%$ cutoff for lower values of $\phi$ .

For the empirical dataset, described in appendix A, we plot in figure 8 (left) the plot of $\Lambda(m)$ , the cumulative percentage of variation explained by the m principal components. We see that if we set our target between $70\%$ and $90\%$ of the cumulative variance as prescribed in [1], this will correspond to retaining between 13 and 27 components, but again it is not clear a priori what exact value within this range we should pick. In figure 8 (right), we plot ${\rm PRESS}(m)$ obtained via 10-fold cross-validation, in which the minimum occurs at $m^{\star}=28$ , close to the $90\%$ cutoff for the cumulative variance. Again—compared to cross-validation—our method picks out fewer principal components, but we obtain our result in far less computational time (see table 2), and with $m^{\star}=15$ we can already account for $80\%$ of the memory.

8. Conclusion

In this paper, we have proposed a novel, data-driven method to select the optimal number $m^\star$ of principal components to retain in the principal component analysis of data with long memory. The main steps are detailed in section 5. We used the crucial fact that subsequent components contribute a decreasing amount to the total memory of the system. This allows us to identify a unique, non-subjective and computationally inexpensive stopping criterion, which compares very well with other available heuristic procedures such as cumulative variance and cross-validation (see tables 1 and 2). We tested our method on two synthetic systems: a homogeneous and heterogeneous version 6.1, and also on an empirical dataset of financial log-volatilities, described in appendix A. Our results could be applied to any large dataset endowed with long-memory properties, for example in climate science [35, 36] and neuroscience [5, 37]. A potential direction for future work could be using a null hypothesis for the bulk eigenvalues which takes into account the presence of autocorrelations rather than the MP distribution used here. A comparison with the cluster driven method presented in [15] or extending the method for example to nonlinear PCA [69] could also be explored.

Acknowledgments

We thank Bloomberg for providing the data used in this paper. We also wish to thank the ESRC Network Plus project 'Rebuilding macroeconomics'. We acknowledge support from Economic and Political Science Research Council (EPSRC) Grant EP/P031730/1. We are grateful to the NVIDIA corporation for supporting our research in this area with the donation of a GPU.

Appendix A. Empirical dataset

The empirical dataset we shall use consists of the daily closing prices of 1270 stocks in the New York Stock Exchange (NYSE), National Association of Securities Dealers Automated Quotations (NASDAQ) and American Stock Exchange (AMEX) from 1 January 2000 to 12 May 2017, which amounts to 4635 entries for each price time series. We make sure that the stocks are 'aligned' through the data cleaning procedure described here. A typical source of misalignment is the fact that some stocks have not been traded on certain days. To ensure we keep as many entries as possible, we fill the gaps dragging the last available price ahead and assuming that a gap in the price time-series corresponds to a zero log-return. At the same time, we do not wish to drag ahead too many prices as doing so would compromise the statistical significance of the time-series. The detailed procedure goes as follows:

(i)
Remove from the dataset the price time-series with length smaller than p times the longest one;
(ii)
Find the common earliest day among the remaining time-series;
(iii)
Create a reference time-series of dates when at least one of the stocks has been traded starting from the earliest common date found in the previous step;
(iv)
Compare the reference time-series of dates with the time-series of dates of each stock and fill the gaps dragging ahead the last available price.

In this paper, we chose p = 0.90 to ensure that we keep the time-series as unmodified as possible. For example, the common earliest day for our dataset is 3 of January 2000. In this period, the stock Ameris Bancorp (ABCB), was not traded on 35 d in the time period and therefore the last available price was used to fill these particular days. Another example is the stock Allied Healthcare Products (AHPI), which was not traded for 508 d in the time period we study, and is removed since its length is less than p times the longest time series. However, the results do not change if we pick a higher value of p . Applying this procedure leaves our dataset with N = 1202 stocks. Hence $\boldsymbol{X}$ and $\boldsymbol{X}^{({\rm market})}$ are $4364\times 1202$ matrices.

Appendix B. Financial interpretation of the eigenvectors and portfolio optimisation

Another motivation for the application of PCA to financial correlation matrices is the financial interpretation of the first principal components, which we explain here. First, we recall that the empirical correlation matrix $\boldsymbol{E}$ between the standardised log volatilities is defined as

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle E_{ij}=\frac{1}{T}\sum_{t=1}^{T}\ln |r_{i}(t)|\ln |r_{j}(t)| \ . \nonumber \end{align} \tag{ B.1 }$

We call $\boldsymbol{w}_{m}$ the eigenvectors of $\boldsymbol{E}$ with $\lambda_m$ its associated eigenvalue. We interpret the entries of $\boldsymbol{w}_{m}$ as the weights of a portfolio, with w_im > 0 indicating a long position where we buy the stock in the expectation that its value will rise, and w_im < 0 denoting a short position where we expect the stock's value to fall and hence sell it [70].

The covariance between the log volatilities of the portfolios m and $m'$ is:

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \frac{1}{T}\sum_{t=1}^{T}\left(\sum_{i=1}^{N}w_{im}\ln |r_{i}(t)|\right)\left(\sum_{j=1}^{N}w_{jm'}\ln |r_{j}(t)|\right)=\sum_{m'}\lambda_{m}\delta_{m,m'} , \label{PortfolioCovariance} \nonumber \end{align} \tag{ B.2 }$

where w_im and $w_{jm'}$ are the entries of the eigenvector $\boldsymbol{w}_{m}$ and $\boldsymbol{w}_{m'}$ respectively. Hence the returns defined by the portfolio $\boldsymbol{w}_{m}$ and another eigenvector $\boldsymbol{w}_{m'}$ are uncorrelated. Another consequence of equation (B.2) is that the variance of the returns, which is used to measure the risk of a portfolio, is the eigenvalue $\lambda_{m}$ . Hence larger eigenvalues of the portfolio defined by $\boldsymbol{w}_{m}$ have a higher risk. Knowing this information about the eigenvalues and their corresponding eigenvectors can therefore inform an investment manager in deciding how to pick portfolios both individually and to reduce a set of portfolios' overall risk by using orthogonal portfolios defined by $\boldsymbol{w}_{m}$ .

For a given level $\Delta$ of tolerable risk, we can also find the optimal investment weights $\boldsymbol{w}_{{\rm opt}}$ by solving the minimisation problem

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \min_{\boldsymbol{w}}\boldsymbol{w}^{T}\boldsymbol{E}\boldsymbol{w} \nonumber \end{align} \tag{ B.3 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm such~ that}~ \boldsymbol{X}\boldsymbol{w}=\Delta \ . \nonumber \end{align} \tag{ B.4 }$

This is known as Markowitz portfolio optimisation theory [71], and can be solved via Lagrange multipliers to give

$\begin{align} \renewcommand{\dag}{\dagger} \newcommand{\e}{{\rm e}} \newcommand{\re}{{\rm Re}} \displaystyle \boldsymbol{w}_{{\rm opt}}=\Delta\frac{\boldsymbol{E}^{-1}\boldsymbol{R}}{\boldsymbol{R}^{\dagger}\boldsymbol{E}^{-1}\boldsymbol{R}} , \label{MarkowitzSoln} \nonumber \end{align} \tag{ B.5 }$

with $\boldsymbol{w}_{{\rm opt}}$ indicating the optimal portfolio weight. We see that the distribution of the eigenvalues enters the portfolio optimisation through the inverse matrix $\boldsymbol{E}^{-1}$ in equation (B.5). Normally, equation (B.5) is applied directly by simply using the sample estimator $\boldsymbol{E}$ . However, since $\boldsymbol{E}$ is empirical, it is subject to noise inherent in the data which means it is vulnerable to the noisy distribution of the eigenvalues, in turn causing the $\boldsymbol{w}_{opt}$ found to underestimate risk [7].

We also note that in line with [51], the eigenvectors corresponding to these eigenvalues beyond the MP bulk for $\boldsymbol{G}$ for the empirical data in appendix A can be identified as belonging to particular or a mixture of 19 economic Industrial Classification Benchmark (ICB) supersectors [72]. We can quantify this for the eigenvectors of $\boldsymbol{G}$ given in equation (15), $\boldsymbol{v}_{i}$ , by defining a 19-dimensional vector $\rho_{i}$ , with entries $\rho_{g,i}$ , $g=1,...,19$ . Specifically, we define a projection matrix $\boldsymbol{P}$ with entries

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle P_{ig} = \left\{\begin{array}{@{}ll@{}} 1/N_{g} & {\rm if}~ i~{\rm is~ in~ supersector}~ g \nonumber \\ 0 & {\rm else}, \end{array}\right. \nonumber \end{align*}$

where N_g is the number of stocks that are part of supersector g. From this we can define $\rho_{i}$ as

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \rho_{i}=\gamma_{i}\boldsymbol{P}\boldsymbol{v}_{i} , \nonumber \end{align} \tag{ B.6 }$

where $\gamma_{i}$ is the normalisation constant $\sum\nolimits_{g=1}^{19}\rho_{g,i}$ . Each $\rho_{g,i}$ gives the contribution of the gth ICB supersector to the ith eigenvector. We plot $\rho_{g}$ for the first three eigenvectors in figure B1. We can see that each eigenvector is dominated by the Real Estate (colour 14), Oil and Gas (colour 1) and Financial Services (colour 6) respectively for the first, second and third principal components.

**Figure B1.** Plots the $\rho_{g}$ defined in appendix B, which is the projection of the eigenvector onto the ICB supersector groups, for the eigenvectors of the first three principal components of $\boldsymbol{G}$ for the data detailed in appendix A. The legend corresponds to each of the ICB supersector groups.
Download figure:
Standard image High-resolution image

Appendix C. Lasso regression

Lasso regression is used to find the values of the coefficients $\beta_{ip}$ using equation (17). Further details of the use of this method is provided in this appendix. Lasso regression [73] provides a way of dealing with overfitting explanatory variables (in our case I_k(t)) and also of performing feature selection, which takes into account a stock i's log-volatility not being affected by changes of I_k(t). Lasso regression solves the constrained minimisation problem

$\begin{align} \renewcommand{\dag}{\dagger} \newcommand{\e}{{\rm e}} \newcommand{\re}{{\rm Re}} \displaystyle \min_{\boldsymbol{\beta}_{i}} \frac{1}{T}\sum_{t=1}^{T}\left(c_{i}(t)-\boldsymbol{I}(t)^{\dagger}\boldsymbol{\beta}_{i}\right)^{2}+\Upsilon P_{a}(\boldsymbol{\beta}_{i}) , \nonumber \end{align} \tag{ C.1 }$

where $\boldsymbol{\beta}_{i}$ is the vector of loadings given by $\newcommand{\re}{{\rm Re}} \renewcommand{\dag}{\dagger}(\beta_{i1}, \beta_{i2}, \dots,\beta_{im_{{\rm max}}}){}^{\dagger}$ , $\boldsymbol{I}(t)$ is the matrix whose columns are $(I_{1}(t),I_{2}(t), \dots, I_{m_{{\rm max}}}(t))$ and $\Upsilon$ is a hyperparameter. $P_{a}(\boldsymbol{\beta}_{i})$ is defined as

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle P_{a}(\boldsymbol{\beta}_{i})=\sum_{j=1}^{m_{{\rm max}}}|\beta_{ij}| \ . \label{LassoPenalty} \nonumber \end{align} \tag{ C.2 }$

The sum in equation (C.2) is the $\mathcal{L}_{1}$ penalty for the lasso regression. The $\Upsilon$ controls the amount of regularisation: the higher it is, the more loadings are zero. To find $\Upsilon$ , we set its scale according to [74] and use 10 cross-validated fits [73], picking the $\Upsilon$ that gives the minimum prediction error. We have also investigated the stability of the results with respect to changes in $\Upsilon$ , and altering the penalty in (C.2) to a L2 penalty. In either case there is little difference to the calculated $m^{\star}$ values.

Appendix D. Fitting procedure for θ

We can estimate $\theta$ by assessing what region of $\zeta(m)$ is most linear in log–log scale, which is done by assessing on each interval $m=\tilde{\theta},...,m_{{\rm max}}$ , where $\tilde{\theta}=2,...,m_{{\rm max}}$ , the quality of a linear fit in log–log scale of $\zeta(m)$ in this interval. The estimate of $\theta$ , $\hat{\theta}$ is then the value of $\tilde{\theta}$ that gives the best-quality linear fit. To assess the quality of the fit, we use the adjusted $R_{{\rm adj}}^{2}$ value [59]:

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle R_{{\rm adj}}^{2}=1-(1-R^{2})\frac{n-1}{n-2} , \label{AdjR} \nonumber \end{align} \tag{ D.1 }$

where R² is the normal coefficient of determination [75], and n is the size of the interval. Note we have written the formula for our specific case where the number of explanatory variables is 1. If $R_{{\rm adj}}^{2}$ is higher, then the interval $m=\tilde{\theta},...,m_{{\rm max}}$ is better described by a linear trend. The difference between $R_{{\rm adj}}^{2}$ and R² is that the former can take into account the different sample sizes induced by the differently sized intervals by reducing the value obtained through R² for smaller values of n. $\hat{\theta}$ is then given by

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \hat{\theta}=\max_{\tilde{\theta}}R_{{\rm adj}}^{2}(\tilde{\theta}) , \label{ThetaHat} \nonumber \end{align} \tag{ D.2 }$

which is the value of $\tilde{\theta}$ which maximises $R_{adj}^{2}$ and gives the region of best-quality linear fit.

Appendix E. Exponentially decaying autocorrelations

The Autoregressive process of order 1 (AR(1)) is given by [40]

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle X(t)=\epsilon(t)+ \psi X(t-1) , \label{AREqn} \nonumber \end{align} \tag{ E.1 }$

where $\newcommand{\e}{{\rm e}} \epsilon(t),\epsilon(t-1),...$ are all white noise terms, $\psi$ is the autoregressive parameter. To enforce stationarity and positive autocorrelations note that we must have that $0<\psi<1$ [40]. The presence of the second term in equation (E.1) introduces memory into the process. The autocorrelation function of $X(t)$ is known to be exponential [40], with increasing $\psi$ increasing the strength of the memory, in contrast to the FBM we used in section 6.1. By using AR(1) to generate I₀ and the set of I_k(i)(t) with parameters $\psi_{0}$ and $\psi_{k}$ respectively, we can investigate whether the method proposed here is still valid when the autocorrelation decays exponentially. For I₀(t), we fix $\psi_{0}=0.95$ . Each I_k(i)(t) is generated using an equally spaced vector from 0.65 to 0.95 for $\psi_{k}$ set in a similar way described in section 6.1 to reflect the empirical result of [67]. We then repeat the steps given in section 5.4 for the same homogenous and heterogenous synthetic systems described in section 6.1. The log–log plots of $\zeta(m)$ versus m are detailed in figure E1. We see that for both systems whilst we do see a decrease, it is not as accurately described by a straight line in log–log scales in this case, as compared to figures 4(a) and (b). Therefore we can conclude that whilst our method can be applied also in the case of faster, exponentially decaying autocorrelation, it is less precise.