1 Introduction

In this paper we consider linear regression. Using a linear map \(\varPi \in \mathbb {R}^{k\times n}\), whose choice is specified later, we transform the original data set \([X,Y]\in \mathbb {R}^{n\times (d+1)}\) into a sketch, i.e., a substitute data set \([\varPi X,\varPi Y]\in \mathbb {R}^{k\times (d+1)}\), that is considerably smaller. Consequently, the likelihood function can be evaluated faster than on the original data, and we will show that the likelihood on the sketch is very close to the original one. In the context of Bayesian regression we have the likelihood \(\mathscr {L}(\beta | X, Y)\) and additional prior information \(p_\mathrm{pre}(\beta )\), given in the form of a prior distribution over the parameters \(\beta \in \mathbb {R}^d\) which we would like to estimate. Our main result shows that the resulting posterior distribution

$$\begin{aligned} p_\mathrm{post}(\beta |X, Y)&\propto f(Y | \beta , X)\cdot p_\mathrm{pre}(\beta ) \end{aligned}$$
(1)

will also be well approximated within a small error. Note that Bayes' Theorem (1) contains the probability density function \(f(Y | \beta , X)\). From now on we concentrate on its interpretation as the likelihood function \(\mathscr {L}(\beta | X, Y)\), viewed as a function of the unknown parameter vector \(\beta \), as given in (2)

$$\begin{aligned} p_\mathrm{post}(\beta |X, Y)&\propto \mathscr {L}(\beta | X,Y)\cdot p_\mathrm{pre}(\beta ). \end{aligned}$$
(2)

The main idea of our approach is given in the following scheme:

$$\begin{aligned} \begin{array}{rcl} [X,Y] &{} \xrightarrow {\quad \varPi \quad } &{} \,\,[\varPi X,\varPi Y]\\ \downarrow \quad &{} &{}\quad \quad \downarrow \\ p_\mathrm{post}(\beta | X, Y) &{} \approx _\varepsilon &{} p_\mathrm{post}(\beta | \varPi X,\varPi Y). \end{array} \end{aligned}$$

More specifically, we can reduce the number of observations from the number of input points n to a target dimension \(k\in O({\text {poly}}(d/\varepsilon ))\), which, in particular, is independent of n. Thus, the running time of all subsequent calculations does not further depend on n. For instance, a Markov Chain Monte Carlo (MCMC) sampling algorithm may be used to obtain samples from the unknown distribution. Using the reduced data set speeds up the computations considerably, while the samples remain sufficiently accurate to resemble the original distribution and to make statistical predictions that are nearly indistinguishable from the predictions that would have been made based on the original sample. Note that, mathematically, a similar reduction can be achieved by setting \(\varPi =X^T\) without incurring any error. From the resulting matrix \([X^TX, X^TY]\) it would be possible to compute or evaluate the exact likelihood and posterior, respectively, in some analytically tractable cases, which is the standard textbook approach for classical linear regression and Bayesian linear regression with Gaussian prior and independent normal errors (cf. Bishop 2006; Hastie et al. 2009). However, this approach is not numerically stable due to ill-conditioned covariance matrices, which is known from the literature (cf. Golub 1965; Lawson and Hanson 1995; Golub and van Loan 2013) and evident from our experiments. For that reason, the exact calculation is not an option in general. Instead, approximation algorithms, where the error can be controlled, are an alternative.
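To make this scheme concrete, the following minimal sketch (in Python/NumPy; the data sizes and the choice of a Rademacher sketching matrix are illustrative assumptions, not the setup of our experiments) compares the quadratic form \(\Vert {X\beta -Y}\Vert _{2}^2\), which determines the Gaussian likelihood, with its sketched counterpart \(\Vert {\varPi X\beta -\varPi Y}\Vert _{2}^2\) for a few parameter vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative sizes).
n, d, k = 10_000, 10, 500
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
Y = X @ beta_true + 0.5 * rng.standard_normal(n)

# A Rademacher sketching matrix, rescaled by 1/sqrt(k).
Pi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)
PX, PY = Pi @ X, Pi @ Y          # the sketch [Pi X, Pi Y] of size k x (d+1)

# The Gaussian likelihood depends on beta only through ||X beta - Y||^2,
# which the sketch preserves up to a small relative error.
for _ in range(3):
    beta = beta_true + 0.1 * rng.standard_normal(d)
    full = np.sum((X @ beta - Y) ** 2)
    sketched = np.sum((PX @ beta - PY) ** 2)
    print(f"relative error: {abs(sketched - full) / full:.4f}")
```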

Using no reduction technique at all is also not an option, since then even single likelihood evaluations take time at least linear in n. This does not pose a problem for small data sets. For larger n it is still possible, but employing reduction techniques can already be beneficial to reduce the running time. For data sets that do not fit into the working memory, intelligent solutions are needed to avoid frequent swapping to slower secondary memory. For truly massive data or infinite data streams, however, even the much simpler problem of computing the exact ordinary least squares estimator in one pass over the data requires at least \(\varOmega (n)\) space (Clarkson and Woodruff 2009). This makes the task impossible on finite memory machines when n grows large enough.

There are different computational models to deal with massive data sets in streaming and distributed environments. We focus on the streaming model, formally introduced in Muthukrishnan (2005). A data stream algorithm is given an input stream of items, like numerical values, points in \(\mathbb {R}^d\) or edges of a graph, at a high rate. As the items arrive one by one, it maintains some sort of summary of the data observed so far. This can be a subsample or a linear sketch as described above. The linearity allows for flexible dynamic updates of the sketch, as we will discuss later. At any time, the memory used by the algorithm is restricted to be sublinear, usually at most polylogarithmic in the number of items. For geometric problems the dependence on the dimension d is often restricted to be at most a small polynomial. Also, the algorithm is allowed to make only a single pass over the data. With these restrictions in mind, it is clear that the sketching matrix \(\varPi \in \mathbb {R}^{k \times n}\) cannot be stored explicitly in memory. In fact, it has to fulfill the following criteria to allow for a streaming algorithm:

  1.

    \(\varPi [X,Y]\) approximates \([X,Y]\) well in the above sense.

  2.

    \(\varPi \) can be stored succinctly.

  3.

    We can efficiently generate the entries of \(\varPi \).

We will see that the structured randomized constructions of \(\varPi \) provably work using random bits of only limited independence. This means that the entries of \(\varPi \) need not be fully independent: any small number of entries behaves as if they were fully independent. In particular, the entries can be stored implicitly, meeting the independence requirements by using hash functions. These can be evaluated very efficiently and reduce the memory dependency on n to only logarithmic.
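As an illustration of such an implicit, seed-based representation (a minimal sketch only: we use a degree-3 polynomial hash over a prime field, which is another standard four-wise independent family, rather than the BCH scheme of Rusu and Dobra (2007) employed in our implementation; the mapping to \(\{-1,+1\}\) below has a negligible bias of order 1/p), every entry of a row of \(\varPi \) can be generated on demand from a seed of four numbers:

```python
import numpy as np

# Four-wise independent hash family: a random degree-3 polynomial over Z_p.
# A whole (virtual) row of Pi is determined by its four coefficients,
# i.e., by a seed of size O(log n); entries are generated on demand.
P = (1 << 61) - 1  # a Mersenne prime


def make_sign_hash(rng):
    coeffs = rng.integers(0, P, size=4)  # the "seed" of this row

    def sign(j):
        # Evaluate a0 + a1*j + a2*j^2 + a3*j^3 mod p (Horner scheme) ...
        h = 0
        for a in coeffs[::-1]:
            h = (h * j + int(a)) % P
        # ... and map to {-1, +1}; the bias of this step is O(1/p).
        return 1.0 if h % 2 == 0 else -1.0

    return sign


rng = np.random.default_rng(1)
row_signs = [make_sign_hash(rng) for _ in range(8)]   # 8 virtual rows of Pi
# Entry Pi[r, j] (up to the 1/sqrt(k) scaling) without ever storing Pi:
print(row_signs[3](123456))
```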

Although the techniques presented in the following are employed to make the computations possible in a streaming setting, the results are also of interest in the non-streaming setting, whenever large data sets have to be reduced to meet time and memory constraints.

2 Background and related work

Dimensionality reduction techniques like principal component analysis (PCA) (Jolliffe 2002) and random projections have been widely used in Statistics and Computer Science. However, their focus is usually on reducing the number of variables. Our method aims to reduce the number of observations while keeping the algebraic structure of the data. This leads to a speed-up in the subsequent (frequentist or Bayesian) regression analysis, because the running times of commonly used algorithms heavily depend on n. Basic techniques based on PCA include principal component regression and partial least squares (Hastie et al. 2009). More recent results using PCA stem from the theory of core-sets for the k-means clustering problem and address the problem of computing a small set of data that approximates the original point set with respect to the given objective up to little, say \((1\pm \varepsilon )\), error (Feldman et al. 2013). The concept of core-sets is related to the early works of Madigan et al. (2002) and DuMouchel et al. (1999) on data-squashing, which seeks data compression based on the likelihood of the observations. One of the more recent contributions is Quiroz et al. (2015). They suggest inclusion probabilities proportional to the observations' contribution to the likelihood, which is approximated by a Gaussian process or a thin-plate spline approximation. Data-squashing can lead to a considerable reduction in the necessary number of observations; however, approximation guarantees are lacking. These references show that, with the advent of massive data sets, besides the effort in reducing dimensionality there is also a need to reduce the number of observations without losing too much statistical information.

Random projections have been studied in the context of low-rank approximation (Cohen et al. 2015), least squares regression (Sarlós 2006; Clarkson and Woodruff 2009), Gaussian process regression (Banerjee et al. 2013), clustering problems (Boutsidis et al. 2010; Kerber and Raghvendra 2014; Cohen et al. 2015), classification tasks (Paul et al. 2014) and compressed sensing (Candès et al. 2006; Donoho 2006). Similarly to our work, random projections have been used to approximate a collection of subspaces consisting only of sparse vectors (Baraniuk et al. 2007). Also, Bayesian inference has been proposed for efficient computation in compressed sensing (Ji and Carin 2007).

Recently there has been a series of works studying the statistical aspects of randomized linear algebra algorithms. In Raskutti and Mahoney (2015) and Ma et al. (2014), the statistical properties of subsampling approaches based on the statistical leverage scores of the data are investigated in detail. Deviating from the worst-case algorithmic perspective, it was shown in Ma et al. (2014) that on average the leverage scores behave quite uniformly if the data is generated following a standard linear regression model. In Yang et al. (2015), several sketching and subsampling methods are used for fast preconditioning before solving the ordinary least squares (OLS) estimators on the subsampled data using state-of-the-art OLS solvers. Moreover, they give parallel and distributed algorithms and extensive empirical evaluations on large scale data for this task. Our work, while providing worst-case guarantees, extends the discussion of statistical properties to the Bayesian setting.

Bayesian regression analysis for large scale data sets has been considered before. Guhaniyogi and Dunson (2014) proposed reducing the number of variables via random projections as a preprocessing step in the large d, small n scenario. They show that under several assumptions the approximation converges to the desired posterior distribution, which is not possible in general, since it was shown in Boutsidis and Magdon-Ismail (2014) that such dimensionality reduction oblivious to the target variable causes additive error in the worst case.

For the large n, small d case, Balakrishnan and Madigan (2006) proposed a one-pass algorithm that reads the data block-wise and performs a certain number of MCMC steps. When the next block is read, the algorithm keeps or replaces some of the data points based on weights that keep track of the importance of the data. The selection rule is justified empirically but lacks theoretical guarantees. Theoretical support is only given in the univariate case based on central limit theorems for Sequential Monte Carlo methods.

When allowing more passes over the data, Tall Skinny QR (TSQR) (Demmel et al. 2012) is a QR decomposition which works especially well in the large n, small d case and can easily be parallelized in the MapReduce setting, as studied by Constantine and Gleich (2011) and Benson et al. (2013). TSQR provides a numerically stable decomposition, which can be used as a preprocessing step prior to MCMC inference, which can then be conducted with high accuracy. This method, however, depending on the computational setting, has a number of limitations. It only works when the data is given row-by-row, and the expensive decomposition has to be carried out a linear number of times, resulting in a total lower bound on the running time of \(\varOmega (n d^2)\) (cf. Demmel et al. 2012). The method is restricted to \(\ell _2\) regression. Our method of random projections is capable of going beyond these limitations. In particular, it can be extended to \(\ell _p\) regression and more generally to robust M-estimators (Clarkson and Woodruff 2015). This flexibility comes at the price of a loss in accuracy; however, this loss is controlled by the parameters of our bounds and does not lead to invalid inference.

Another approach by Bardenet et al. (2014) tries to subsample the data to approximate the acceptance rule in each iteration of an MCMC sampler. The decision is shown to be similar to the original with high probability in each step. The number of samples is highly dependent on the variance of the logarithm of likelihood ratios. The method may be useful for interesting and intractable cases when the variance can be bounded.

While frequentist linear regression can be solved straightforwardly by computing the projection of the target variable to the subspace spanned by the data, Bayesian regression is typically computationally more demanding. In some cases, it is possible to calculate the posterior distribution analytically, but in general this is not true. For that reason, an approximation of the posterior distribution is needed. MCMC methods are one possibility and standard in Bayesian analysis. They are reliable, but can take considerable time before they converge and sample from the desired posterior distribution. Moreover, the running time grows with the number of observations in the data set.

The main bottleneck of many Bayesian analysis methods, including MCMC, is the repeated evaluation of the likelihood. The running time of each evaluation grows linearly with the number of observations in the data set. There are several approaches trying to reduce the computational effort for Bayesian regression analysis by employing different algorithms that may perform more efficiently in certain settings. Approximate Bayesian computing (ABC) and integrated nested Laplace approximations (INLA) both fall into this category. The main idea behind ABC is to avoid the exact evaluations by approximating the likelihood function using simulations (Csillery et al. 2010). INLA (Rue et al. 2009; Martins et al. 2013), on the other hand, is an approximation of the posterior distribution that is applicable to models that fall into the class of so-called latent Gaussian models. Both methods can lead to a considerable speed-up compared to standard MCMC algorithms.

Note, however, that this speed-up is achieved by changing the algorithm which is used to conduct the analysis. This is different in our approach, which reduces the number of observations in the data set while approximately retaining its statistical properties. The running times of many algorithms, including MCMC algorithms, highly depend on the number of observations, which means that our proposed method also results in a speed-up of the analysis. In this article, we focus on MCMC methods for the analysis, but in principle, as our method provably approximates the posterior, all algorithms that assess the posterior can be employed. We did not use ABC since it is only suitable for summary statistics of very low dimension (\(d <10\)) (Beaumont et al. 2002; Csillery et al. 2010). However, we have tried INLA on a small scale, achieving results comparable to MCMC while making the running time of the analysis independent of n. Likewise, one could consider calculating the exact formulae for analytically tractable cases of the posterior. However, we concentrate on MCMC methods because of their general applicability and reliability.

New directions in Bayesian data analysis in the context of Big Data are surveyed in Welling et al. (2014). Our work directly meets the criteria proposed in that reference for the large n case in streaming as well as distributed computation environments.

3 Preliminaries

3.1 General notation

For the sake of a brief presentation, we introduce some notation. We denote by \([n]=\{1,\ldots ,n\}\) the set of all positive integers up to n. For a probability measure \(\lambda \), let \(\mathbb {E}_{\lambda }\left[ X\right] \) be the expected value of X with respect to \(\lambda \). We skip the subscript in \(\mathbb {E}_{}\left[ X\right] \) if the probability measure is clear from the context. For a matrix \(M\in \mathbb {R}^{n\times d}\) we let \(M=U\Sigma V^T\) denote its singular value decomposition (SVD), where \(U\in \mathbb {R}^{n\times d}\) and \(V\in \mathbb {R}^{d\times d}\) are unitary matrices spanning the columnspace and rowspace of M, respectively. \(\Sigma \in \mathbb {R}^{d\times d}\) is a diagonal matrix, whose elements \(\sigma _1\ge \cdots \ge \sigma _d\) are the singular values of M. We denote by \(\sigma _\mathrm{max}=\sigma _1\) the largest and by \(\sigma _\mathrm{min}=\sigma _d\) the smallest singular value of M and write \(\sigma _i(M)\) to make clear that the \(\sigma _i\) belong to M. The trace of \(M^TM\) equals the sum of the squared singular values, i.e., \({\text {tr}}\left( M^TM\right) =\sum _{i=1}^d \sigma _i^2(M)\). We assume w.l.o.g. that all matrices have full rank and stress that all our proofs carry over similarly to the case where the matrices are of lower rank. One might even use knowledge about a lower rank to reduce the space and time complexities to bounds that only depend on the rank rather than on the number of variables.

3.2 Bayesian regression

A linear regression model is given in the following equation:

$$\begin{aligned} Y = X\beta + \xi . \end{aligned}$$

\(Y \in \mathbb {R}^n\) is a random variable containing the values of the response, where n is the number of observations in the data set. \(X \in \mathbb {R}^{n\times d}\) is a matrix containing the values of the d independent variables. We denote by \(\xi \sim N(0, \varsigma ^2 I_n)\) an n-dimensional random vector that models the unobservable error term. The dependent variable Y then follows a Gaussian distribution, \(Y \sim N(X\beta , \varsigma ^2 I_n)\). The corresponding probability density function is

$$\begin{aligned} f(y | X\beta , \Sigma )= & {} (2\pi )^{-\frac{n}{2}} \left| \Sigma \right| ^{-\frac{1}{2}} \exp {\left( -\frac{1}{2\varsigma ^2}\Vert {X\beta -y}\Vert _{2}^2\right) }, \end{aligned}$$

where \(\Sigma = \varsigma ^2 I_n\).

In a Bayesian setting, \(\beta \in \mathbb {R}^d\) is the unknown parameter vector which is assumed to follow an unknown distribution \(p(\beta |X,Y)\) called the posterior distribution. Prior knowledge about \(\beta \) can be modeled using the prior distribution \(p(\beta )\). The posterior distribution is a compromise between the prior distribution and the observed data.

In general, the posterior distribution cannot be calculated analytically. In this paper, we determine the posterior distribution by employing Markov Chain Monte Carlo (MCMC) methods, even though, for the models considered here, the posterior is explicitly known. Regardless of the computational problems related to these explicit formulae (cf. Sect. 1), we focus on MCMC because this work forms the basis for further research on more complex models where analytical solutions are not obtainable. Possible extensions are hierarchical models and mixtures of normal distributions (cf. Sect. 7). Furthermore, we follow this strategy to rule out possible interaction effects between sketching and MCMC that might occur even in these basic cases. However, our empirical evaluation indicates that there are none.
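To illustrate the MCMC workflow on a sketched data set, the following minimal random-walk Metropolis sampler (a simplified stand-in for the samplers used in our experiments; the flat prior, step size and chain length are illustrative assumptions) targets the density proportional to \(\mathscr {L}(\beta | \varPi X,\varPi Y)\):

```python
import numpy as np

def rw_metropolis(PX, PY, sigma2=1.0, n_iter=5000, step=0.05, seed=0):
    """Random-walk Metropolis for beta under a flat prior: the target
    density is proportional to exp(-||PX beta - PY||^2 / (2 sigma2))."""
    rng = np.random.default_rng(seed)
    d = PX.shape[1]
    beta = np.zeros(d)
    log_post = -np.sum((PX @ beta - PY) ** 2) / (2.0 * sigma2)
    chain = np.empty((n_iter, d))
    for t in range(n_iter):
        prop = beta + step * rng.standard_normal(d)
        log_post_prop = -np.sum((PX @ prop - PY) ** 2) / (2.0 * sigma2)
        if np.log(rng.random()) < log_post_prop - log_post:
            beta, log_post = prop, log_post_prop
        chain[t] = beta
    return chain

# Usage: run the same sampler on [X, Y] and on the much smaller sketch
# [Pi X, Pi Y]; only the per-iteration cost changes, not the interface.
```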

3.3 Norms and metrics

Before going into details about random projections and subspace embeddings, let us first define the matrix and vector norms used in this paper as well as the metric that we are going to use for quantifying the distance between distributions.

Definition 1

(spectral norm) The spectral or operator norm of a matrix \(A \in \mathbb {R}^{n \times d}\) is defined as

$$\begin{aligned}\Vert {A}\Vert _{2}=\sup \limits _{x\in \mathbb {R}^d\setminus \{0\}} \frac{\Vert {Ax}\Vert _{2}}{\Vert {x}\Vert _{2}},\end{aligned}$$

where \(\Vert {y}\Vert _{2}={(\sum _{i=1}^m y_i^2)}^\frac{1}{2}\) denotes the Euclidean vector norm for any \(y\in \mathbb {R}^m\).

A useful fact that is straightforward from Definition 1 is that the spectral norm of a matrix M equals its largest singular value, i.e., we have \(\Vert {M}\Vert _{2}=\sigma _\mathrm{max}(M)\) (cf. Horn and Johnson 1990).

In order to quantify the distance between probability measures, and in particular between the original posterior and its approximated counterpart, we will need some further definitions. To this end, given two probability measures \(\gamma ,\nu \) over \(\mathbb {R}^d\), let \(\Lambda (\gamma ,\nu )\) denote the set of all joint probability measures on \(\mathbb {R}^{d}\times \mathbb {R}^{d}\) with marginals \(\gamma \) and \(\nu \), respectively.

Definition 2

(Wasserstein distance, cf. Villani (2009)) Given two probability measures \(\gamma ,\nu \) on \(\mathbb {R}^d\) the \(\ell _2\) Wasserstein distance between \(\gamma \) and \(\nu \) is defined as

$$\begin{aligned} \begin{aligned} \mathscr {W}_2(\gamma ,\nu )&=\left( \inf \limits _{\lambda \in \Lambda (\gamma ,\nu )} \int _{\mathbb {R}^{d}\times \mathbb {R}^{d}} \Vert {x-y}\Vert _{2}^2 \;\;d \lambda (x,y) \right) ^{\frac{1}{2}}\\&=\inf \limits _{\lambda \in \Lambda (\gamma ,\nu )} \mathbb {E}_{\lambda }\left[ \Vert {x-y}\Vert _{2}^2\right] ^{\frac{1}{2}} \end{aligned} \end{aligned}$$

From the definition of the Wasserstein distance we can derive a measure of how much points drawn from a given distribution will spread from the origin. The Wasserstein weight can be thought of as a norm of a probability measure.

Definition 3

(Wasserstein weight) We define the \(\ell _2\) Wasserstein weight of a probability measure \(\gamma \) as

$$\begin{aligned} \begin{aligned} \mathscr {W}_2(\gamma )&=\mathscr {W}_2(\gamma ,\delta )\\&=\left( \int _{\mathbb {R}^{d}} \Vert {x}\Vert _{2}^2 \;\;d \gamma \right) ^{\frac{1}{2}}=\mathbb {E}_{\gamma }\left[ \Vert {x}\Vert _{2}^2\right] ^{\frac{1}{2}} \end{aligned} \end{aligned}$$

where \(\delta \) denotes the Dirac measure concentrated at the origin, i.e., the distribution that puts all probability mass at 0.
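For a Gaussian measure with mean m and covariance matrix \(\Sigma \), the squared Wasserstein weight is simply the second moment \(\Vert {m}\Vert _{2}^2+{\text {tr}}\left( \Sigma \right) \), a fact used again in the proof of Corollary 1. The following minimal sketch (illustrative dimension and parameters) compares this closed form with a plain Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
m = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)          # a valid covariance matrix

# Closed form: W_2(gamma)^2 = E[||x||^2] = ||m||^2 + tr(Sigma).
w2_closed = np.sqrt(m @ m + np.trace(Sigma))

# Monte Carlo estimate of E[||x||^2]^{1/2}.
samples = rng.multivariate_normal(m, Sigma, size=200_000)
w2_mc = np.sqrt(np.mean(np.sum(samples ** 2, axis=1)))

print(w2_closed, w2_mc)  # the two values should nearly coincide
```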

3.4 Random projections and \(\varepsilon \)-subspace embeddings

The following definition of so called \(\varepsilon \)-subspace embeddings will be central to our work. Such an embedding can be used to reduce the size of a given data matrix while preserving the algebraic structure of its spanned subspace up to \((1\pm \varepsilon )\) distortion. Before we summarize several methods to construct a subspace embedding for a given input matrix, we give a formal definition. Here and in the rest of the paper we assume \(0<\varepsilon \le 1/2\).

Definition 4

(\(\varepsilon \)-subspace embedding) Given a matrix \(U\in \mathbb {R}^{n \times d}\) with orthonormal columns, an integer \(k\le n\) and an approximation parameter \(0<\varepsilon \le 1/2\), an \(\varepsilon \)-subspace embedding for U is a map \(\varPi : \mathbb {R}^n \rightarrow \mathbb {R}^k\) such that

$$\begin{aligned} (1-\varepsilon )\Vert {Ux}\Vert _{2}^2\le \Vert {\varPi Ux}\Vert _{2}^2\le (1+\varepsilon )\Vert {Ux}\Vert _{2}^2 \end{aligned}$$
(3)

holds for all \(x\in \mathbb {R}^d\), or, equivalently

$$\begin{aligned} {\Vert {U^T\varPi ^T\varPi U - I_d}\Vert _{2}\le \varepsilon .} \end{aligned}$$
(4)

Inequality (3) is the form mainly used in this paper, but Inequality (4) is more instructive in the sense that it makes clear that \(\varPi \) acts on the embedded subspace nearly like an isometry, introducing only little scaling or rotation.
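A small numerical check of Definition 4 (with a Rademacher matrix as one possible choice of \(\varPi \) and illustrative sizes) measures the achieved distortion directly via Inequality (4):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5_000, 8, 400

# Orthonormal basis U of a random d-dimensional subspace of R^n.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))

# A Rademacher sketching matrix as one possible choice of Pi.
Pi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)

# Inequality (4): the achieved distortion is the spectral norm below.
PU = Pi @ U
eps_achieved = np.linalg.norm(PU.T @ PU - np.eye(d), ord=2)
print(f"achieved distortion epsilon: {eps_achieved:.3f}")

# Equivalently, all squared singular values of Pi U lie in [1-eps, 1+eps].
print(np.linalg.svd(PU, compute_uv=False) ** 2)
```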

Note that an \(\varepsilon \)-subspace embedding \(\varPi \) for the columnspace of a matrix M preserves its squared singular values up to \((1\pm \varepsilon )\) distortion, which in particular means that it also preserves its rank. We prove this claim for completeness.

Observation 1

Let \(\varPi \) be an \(\varepsilon \)-subspace embedding for the columnspace of \(M\in \mathbb {R}^{n\times d}\). Then

$$\begin{aligned} (1-\varepsilon )\,\sigma _i^2(M) \le \sigma _i^2(\varPi M) \le (1+\varepsilon )\, \sigma _i^2(M) \end{aligned}$$

and

$$\begin{aligned} (1-2\varepsilon )\,\sigma _i^{-2}(M) \le \sigma _i^{-2}(\varPi M) \le (1+2\varepsilon )\,\sigma _i^{-2}(M). \end{aligned}$$

Proof

For the first claim, we make use of a min-max representation of the singular values that is known as the Courant-Fischer theorem (cf. Horn and Johnson 1990). In the following derivation we choose \(x^*\) to be the maximizer of (5) and \(S^*\) the minimizer of (6).

$$\begin{aligned} \sigma _i^2(\varPi M)= & {} \min _{S\in \mathbb {R}^{(i-1)\times d}}\max _{Sx=0,\Vert {x}\Vert _{2}=1}\Vert {\varPi Mx}\Vert _{2}^2\nonumber \\\le & {} \max _{S^*x=0,\Vert {x}\Vert _{2}=1}\Vert {\varPi Mx}\Vert _{2}^2\end{aligned}$$
(5)
$$\begin{aligned}= & {} \Vert {\varPi Mx^*}\Vert _{2}^2\nonumber \\\le & {} (1+\varepsilon )\,\Vert {Mx^*}\Vert _{2}^2\nonumber \\\le & {} (1+\varepsilon )\,\max _{S^*x=0,\Vert {x}\Vert _{2}=1}\Vert {Mx}\Vert _{2}^2\nonumber \\= & {} (1+\varepsilon )\,\min _{S\in \mathbb {R}^{(i-1)\times d}}\max _{Sx=0,\Vert {x}\Vert _{2}=1}\Vert {Mx}\Vert _{2}^2\nonumber \\= & {} (1+\varepsilon )\,\sigma _i^2(M). \end{aligned}$$
(6)

The lower bound can be derived analogously using the lower bound of (3).

Now we use the first claim to prove the second. To this end, we bound the difference

$$\begin{aligned} \left| \frac{1}{\sigma _i^{2}(M)} - \frac{1}{\sigma _i^{2}(\varPi M)}\right|= & {} \frac{|\sigma _i^{2}(M) - \sigma _i^{2}(\varPi M)|}{\sigma _i^{2}(M)\sigma _i^{2}(\varPi M)}\\\le & {} \frac{\varepsilon \sigma _i^{2}(M) }{(1-\varepsilon )\,\sigma _i^{4}(M)}\\= & {} \frac{\varepsilon }{(1-\varepsilon )}\,\sigma _i^{-2}(M)\\\le & {} 2\varepsilon \,\sigma _i^{-2}(M). \end{aligned}$$

\(\square \)

There are several ways to construct an \(\varepsilon \)-subspace embedding. One of the more recent methods is using a so called graph-sparsifier, which was initially introduced for the efficient construction of sparse sub-graphs with good expansion properties (Batson et al. 2012). The work of Boutsidis et al. (2013) adapted the technique to work for ordinary least-squares regression. While the initial construction was deterministic, they also gave alternative constructions combining the deterministic decision rules with non-uniform random sampling techniques. Another approach is subspace preserving sampling of rows from the data matrix. This technique was introduced by Drineas et al. (2006) for \(\ell _2\) regression and generalized to more general subspace sampling for the p-norm (Dasgupta et al. 2009). The sampling is done proportional to the so called statistical leverage scores. These techniques have recently been analyzed and extended in a statistical setting as opposed to the algorithmic worst case setting (Raskutti and Mahoney 2015; Ma et al. 2014). All the aforementioned methods are in principle applicable whenever it is possible to read the input multiple times. For instance, one needs two passes over the data to perform the subspace sampling procedures, one for preprocessing the input matrix and another for computing the probabilities and for the actual sampling. In this way one might achieve a stronger reduction or better statistical properties, since the (possibly random) construction depends on the input itself and therefore uses more information.
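As an illustration of the sampling-based approach (a minimal sketch only, not the specific algorithms of Drineas et al. 2006 or Dasgupta et al. 2009; the data and the sample size are illustrative assumptions), the following code computes the statistical leverage scores from an SVD and samples rows of [X, Y] proportionally to them, with the usual rescaling:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20_000, 10, 1_000
X = rng.standard_normal((n, d)) * rng.gamma(1.0, 1.0, size=(n, 1))
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)
A = np.column_stack([X, Y])                    # the matrix [X, Y]

# Leverage score of row i: squared norm of the i-th row of U, where A = U S V^T.
U, _, _ = np.linalg.svd(A, full_matrices=False)
lev = np.sum(U ** 2, axis=1)
p = lev / lev.sum()                            # sampling probabilities

# Sample k rows i.i.d. proportional to p and rescale by 1/sqrt(k * p_i),
# so that the sampled-and-rescaled matrix is unbiased for A^T A.
idx = rng.choice(n, size=k, p=p)
A_sub = A[idx] / np.sqrt(k * p[idx])[:, None]

print(np.linalg.norm(A_sub.T @ A_sub - A.T @ A) / np.linalg.norm(A.T @ A))
```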

In principle our approximation results are independent of the actual method used to calculate the embedding as long as the property given in Definition 4 is fulfilled. However, when the number of observations grows really massive or we deal with an infinite stream, then the data can only be read once, given time and space constraints. In order to use \(\varepsilon \)-subspace embeddings in a single-pass streaming algorithm, we consider the approach of so called oblivious subspace embeddings in this paper. These can be viewed as distributions over appropriately structured \(k\times n\) matrices from which we can draw a realization \(\varPi \) independent of the input matrix. It is then guaranteed that for any fixed matrix U as in Definition 4 and failure probability \(0<\alpha \le 1/2\), the matrix \(\varPi \) is an \(\varepsilon \)-subspace embedding with probability at least \(1-\alpha \). The results of our work are always conditioned on the event that the map \(\varPi \) is an \(\varepsilon \)-subspace embedding; we omit further mention of the error probability \(\alpha \). The reader should keep in mind that there is the aforementioned possibility of failure during the phase of sketching the data.

Instead of a subspace embedding, one might consider \(\varPi = X^{T}\) to be a suitable choice of sketching matrix, leading to a sketch of size \(d\times (d+1)\) from which the exact likelihood and posterior can be characterized in some analytically tractable cases. However, it is well known that using the resulting matrix \(X^T [X,Y]\) may result in numerical instabilities due to bad conditioning of the covariance matrix \(X^TX\). This effect can occur, e.g., in the presence of collinearities, is independent of the size of the data, and may lead to highly inaccurate frequentist estimators (cf. Lawson and Hanson 1995). In a Bayesian setting, as we consider in this work, the instabilities result in extremely large variances of the MCMC sample, leading to simulations that do not converge. We will observe this behavior in Sect. 5.

We therefore consider the following approaches for obtaining oblivious \(\varepsilon \)-subspace embeddings:

  1.

    The Rademacher Matrix (RAD): \(\varPi \) is obtained by choosing each entry independently from \(\{-1,1\}\) with equal probability. The matrix is then rescaled by \(\frac{1}{\sqrt{k}}\). This method has been shown by Sarlós (2006) to form an \(\varepsilon \)-subspace embedding with probability at least \(1-\alpha \) when choosing essentially \(k=O(\frac{d\log (d/\alpha )}{\varepsilon ^2})\). This was later improved to \(k=O(\frac{d+\log (1/\alpha )}{\varepsilon ^2})\) in Clarkson and Woodruff (2009), which was recently shown to be optimal by Nelson and Nguyên (2014). While this method yields the best reduction among the different constructions that we consider in the present work, the RAD embedding has the disadvantage that we need \(\Theta (ndk)\) time to apply it to an \(n\times d\) matrix when streaming the input in general. If the input is given row by row or at least block by block, our implementation applies a fast matrix multiplication algorithm to each block. We remark that it is provably sufficient that the \(\{-1,1\}\)-entries in each row of the RAD matrix are essentially four-wise independent, i.e., when considering up to four entries of the same row, these behave as if they were fully independent. Such random numbers can be generated using a hashing scheme that generates BCH codes using a seed of size \(O(\log n)\). This was first noticed by Alon et al. (1999). In our implementation we have used the four-wise independent BCH scheme as described in Rusu and Dobra (2007).

  2.

    The Subsampled Randomized Hadamard Transform (SRHT) (originally from Ailon and Liberty (2009)) is an embedding that is chosen to be \(\varPi =RH_mD\), where D is an \(m\times m\) diagonal matrix whose diagonal entries are independently chosen from \(\{-1,1\}\) with equal probability. The value of m is assumed to be a power of two. It is convenient to choose the smallest such number that is not smaller than n. Here, \(H_m\) is the Hadamard matrix of order m and R is a \(k\times m\) row sampling matrix. That is, each row of R contains exactly one 1-entry and is 0 everywhere else. The index of the 1-entry is chosen uniformly from [m] i.i.d. for every row. The matrix is then rescaled by \(\frac{1}{\sqrt{k}}\). Since m is often larger than n, the input data must be padded with 0-entries to compute the product \(\varPi X\). Of course, it is not necessary to do that explicitly since all multiplications by zero can be omitted. The target dimension needed to form an \(\varepsilon \)-subspace embedding with probability at least \(1-\alpha \) using this family of matrices was shown by Boutsidis and Gittens (2013) to be \(k=O(\frac{(\sqrt{d}+\sqrt{\log n})^2 \log (d/\alpha )}{\varepsilon ^2})\), which improved upon previous results from Drineas et al. (2011). Using this method, we have a small dependency on n, which is negligible whenever \(n=O(\exp (d))\). This is often true in practice when d is reasonably large. Compared to the RAD method, the dependency on the dimension d is worse by essentially a factor of \(O(\log d)\). It is known that \(k=\varOmega (d\log d)\) is necessary due to the sampling based approach. This was shown by reduction from the coupon collector's problem, i.e., solving one problem can be reduced to solving the other. See Halko et al. (2011) for details. The benefit that we get is that due to the inductive structure of the Hadamard matrix, the embedding can be applied in \(O(nd\log k)\) time, which is considerably faster. It has been noticed in the original paper (Ailon and Liberty 2009) that the construction is closely related to four-wise independent BCH codes. To our knowledge, there is no explicit proof that it is sufficient to use random bits of limited independence. However, we again use the four-wise independent BCH scheme for the implicit construction of the matrix D and the linear congruential generator from the standard library of C++ 11 for the uniform subsampling matrix R. We will see in the empirical evaluation that this works well in practice.

  3.

    The Clarkson Woodruff (CW) sketch (Clarkson and Woodruff 2013) is the most recent construction that we consider in this article. In this case the embedding is obtained as \(\varPi =\varPhi D\). The \(n \times n\) matrix D is constructed in the same way as the diagonal matrix in the SRHT case. Given a random map \(h:[n]\rightarrow [k]\) such that for every \(i\in [n]\) its image is chosen to be \(h(i)=t\in [k]\) with probability \(\frac{1}{k}\), \(\varPhi \) is the binary matrix whose 1-entries are defined by \(\varPhi _{h(i),i}=1\). All other entries are 0. This is obviously the fastest embedding, due to its sparse construction. It can be applied to any matrix \(X\in \mathbb {R}^{n\times d}\) in \(O({\text {nnz}}(X))=O(nd)\) time, where \({\text {nnz}}(X)\) denotes the number of non-zero entries in X. This is referred to as input sparsity time and is clearly optimal up to small constants, since this is the time needed to actually read the input from a data stream or external memory, which dominates the sketching phase. However, its disadvantage is that the target dimension is \(k=\varOmega (d^2)\) (Nelson and Nguyen 2013b). Roughly speaking, this is necessary due to the need to obliviously and perfectly hash d of the standard basis vectors spanning \(\mathbb {R}^n\). Improved upper bounds over the original ones of Clarkson and Woodruff (2013) show that \(k=O(\frac{d^2}{\varepsilon ^2\alpha })\) is sufficient to draw an \(\varepsilon \)-subspace embedding from this distribution of matrices with probability at least \(1-\alpha \) (Nelson and Nguyen 2013a). This reference also shows that it is sufficient to use only four-wise independent random bits to generate the diagonal matrix D. Again, in our implementation we use the four-wise independent BCH scheme from Rusu and Dobra (2007). Moreover, \(\varPhi \) can be constructed using only pairwise independent entries. This can be achieved very efficiently using the fast universal hashing scheme introduced by Dietzfelbinger et al. (1997), which we have employed in our implementation. The space requirement is only \(O(\log n)\) for a hash function from this class. For a really fast implementation using bit-wise operations, the actual size parameters of the sketch are chosen to be the smallest powers of two that are larger than the required sizes n and k. Minimal reference implementations of all three constructions are sketched after this list.

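To make the three constructions concrete, the following minimal NumPy/SciPy sketch implements dense, non-streaming versions of RAD, SRHT and CW (using fully independent random bits and explicit matrices rather than the hash-based, implicit representations described above; all sizes are illustrative assumptions):

```python
import numpy as np
from scipy.linalg import hadamard

def rad_sketch(A, k, rng):
    """Rademacher sketch: Pi has i.i.d. +-1 entries, rescaled by 1/sqrt(k)."""
    n = A.shape[0]
    Pi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)
    return Pi @ A

def srht_sketch(A, k, rng):
    """Subsampled randomized Hadamard transform: Pi = R H_m D / sqrt(k)."""
    n = A.shape[0]
    m = 1 << (n - 1).bit_length()              # smallest power of two >= n
    D = rng.choice([-1.0, 1.0], size=n)
    Apad = np.zeros((m, A.shape[1]))
    Apad[:n] = D[:, None] * A                  # apply D, pad with zeros
    HA = hadamard(m) @ Apad                    # dense H_m, for illustration only
    rows = rng.integers(0, m, size=k)          # uniform row sampling, i.i.d.
    return HA[rows] / np.sqrt(k)

def cw_sketch(A, k, rng):
    """Clarkson-Woodruff sketch: each row of A is multiplied by +-1 and
    added to a uniformly random row of the sketch (input sparsity time)."""
    n = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)
    h = rng.integers(0, k, size=n)
    S = np.zeros((k, A.shape[1]))
    np.add.at(S, h, signs[:, None] * A)
    return S

rng = np.random.default_rng(0)
A = rng.standard_normal((2_000, 6))
for f, k in [(rad_sketch, 200), (srht_sketch, 400), (cw_sketch, 1_000)]:
    SA = f(A, k, rng)
    err = np.linalg.norm(SA.T @ SA - A.T @ A) / np.linalg.norm(A.T @ A)
    print(f.__name__, err)
```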
Table 1 summarizes the above discussion, in particular the trade-off between time and space complexity of the presented sketching methods. While in general one is interested in the fastest possible application time, memory constraints might make it impossible to apply the CW sketch due to its quadratic dependency on d. Conversely, for a fixed sketch size, CW gives the weakest approximation guarantee (cf. Yang et al. 2015). For really large d, even the \(O(d\log d)\) factor of SRHT might be too large, so that we have to use the slowest RAD sketching method.

Table 1 Comparison of the three considered \(\varepsilon \)-subspace embeddings; \({\text {nnz}}(X)\) denotes the number of non-zero entries in X; \(\alpha \) denotes the failure probability

3.5 Extension to the streaming model

The presented reduction techniques are of interest whenever we deal with medium to large sized data for reducing time and space requirements. However, when the data grows massive, we need to place more importance on the computational requirements. We therefore want to briefly discuss and give references to some of these technical details. For example, while the dimensions of the resulting sketches do not depend on n, this is not true for the embedding matrices \(\varPi \in \mathbb {R}^{k\times n}\). However, due to the structured constructions that we have surveyed above, we stress that the sketching matrices can be stored implicitly by using the different hash functions of limited independence. The hash functions used in our implementations are the four-wise independent BCH scheme used in the seminal work of Alon et al. (1999) and the universal hashing scheme by Dietzfelbinger et al. (1997). These can be evaluated very efficiently using bit-wise operations and can be stored using a seed whose size is only \(O(\log n)\). Note that even this small dependency on n is only needed in the sketching phase. After the sketch has been computed, the space requirements are independent of n. A survey and evaluation of alternative hashing schemes can be found in Rusu and Dobra (2007).

The linearity of the embeddings allows for efficient application in sequential streaming and in distributed environments, see e.g. Clarkson and Woodruff (2009), Woodruff and Zhang (2013) and Kannan et al. (2014). The sketches can be updated in the most flexible dynamic setting, which is commonly referred to as the turnstile model (Muthukrishnan 2005). In this model, think of an initial matrix of all zero values. The stream consists of updates of the form (i, j, u), meaning that the entry \(X_{ij}\) will be updated to \(X_{ij}+u\). A single entry can be defined by one single update or by a sequence of not necessarily consecutive updates. For example, a stream \(S=\{\ldots ,(i,j,+5),\ldots ,(i,j,-3), \ldots \}\) will result in \(X_{ij}=2\). Even deletions are possible in this setting by using negative updates. Clearly this also allows for additive updates of rows or columns, each consisting of consecutive single updates to all the entries in the same row or column. At first sight, this model might seem very technical and unnatural, but the usual form of storing data in a table is neither appropriate nor performant for massive data sets. The data is rather stored as a sequence of (key, value) pairs in arbitrary order. For dealing with such unstructured data, the design of algorithms working in the turnstile model is of high importance.
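Because the sketch is a linear function of the data, a turnstile update (i, j, u) simply adds u times the i-th column of \(\varPi \) to the j-th column of the sketch. The following minimal sketch keeps an explicit \(\varPi \) and a copy of X only for checking purposes; in a streaming implementation the columns of \(\varPi \) would be generated on demand from hash functions and X would never be stored:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1_000, 4, 64
Pi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)

X = np.zeros((n, d))            # conceptual data matrix, starts all zero
SX = np.zeros((k, d))           # its sketch Pi X, maintained incrementally

def turnstile_update(i, j, u):
    """Process one stream item (i, j, u): X[i, j] += u."""
    X[i, j] += u                # only kept here for the final check
    SX[:, j] += u * Pi[:, i]    # O(k) work per update, independent of n

turnstile_update(7, 2, +5.0)
turnstile_update(7, 2, -3.0)    # net effect: X[7, 2] == 2
turnstile_update(123, 0, 1.5)

print(np.allclose(SX, Pi @ X))  # the incrementally maintained sketch matches
```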

For distributed computations, note that the embedding matrices can be communicated efficiently to every machine in a computing cluster of l machines. This is due to the small implicit representation by hash functions. Now, suppose the data is given as \(X=\sum _{i=1}^l X^{(i)}\) where \(X^{(i)}\) is stored on the machine with index \(i\in [l]\). Note that by the above data representation in form of updates, \(X^{(i)}\) can consist of rows, columns or single entries of X. Again, multiple updates to the same entry are possible and may be distributed to different machines. Every machine \(i\in [l]\) can compute a small sketch on its own share \(X^{(i)}\) of the data and efficiently communicate it to one dedicated central server. A sketch of the entire data set can be obtained by summing up the single sketches since \(\varPi X = \sum _{i=1}^l \varPi X^{(i)}\). More details can be found in Kannan et al. (2014). Recent implementations of similar distributed and parallel approaches for OLS are given by Yang et al. (2015).
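The same linearity yields the distributed variant: each machine sketches only its own share of the data and the central server merges the small sketches by summation. A minimal sketch (with an illustrative additive partition into row blocks):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, l = 3_000, 5, 200, 4
Pi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)

X = rng.standard_normal((n, d))
# Split X into l additive shares X = sum_i X^(i) (here: disjoint row blocks).
shares = []
for i in range(l):
    Xi = np.zeros_like(X)
    Xi[i::l] = X[i::l]
    shares.append(Xi)

# Each machine computes Pi X^(i); the server sums the k x d sketches.
merged = sum(Pi @ Xi for Xi in shares)
print(np.allclose(merged, Pi @ X))
```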

The above discussion makes clear that our methods meet the criteria that need to be satisfied when dealing with Big Data (cf. Welling et al. 2014). Namely, the number of data items that need to be accessed at a time is only a small subset of the whole data set, in particular independent of n, and the algorithms are amenable to distributed computing environments like MapReduce.

4 Theory

In this section we introduce and develop the theoretical foundations of our approach and will combine them with existing results on ordinary least squares regression to bound the Wasserstein distance between the original likelihood function and its counterpart that is defined only on the considerably smaller sketch. Empirical evaluations supporting and complementing our theoretical results will be conducted in the subsequent section.

4.1 Embedding the likelihood

The following observation is standard (cf. Givens and Shortt 1984; Kannan and Vempala 2009) and will be helpful in bounding the \(\ell _2\) Wasserstein distance of two Gaussian measures. It allows us to derive such a bound by inspecting their means and their covariances separately.

Observation 2

Let \(Z_1,Z_2\in \mathbb {R}^d\) be random variables with finite first moments \(m_1,m_2 < \infty \) and let \(Z_1^m=Z_1-m_1\), respectively, \(Z_2^m=Z_2-m_2\) be their mean-centered counterparts. Then it holds that

$$\begin{aligned} \mathbb {E}_{}\left[ \Vert {Z_1-Z_2}\Vert _{2}^2\right] = \Vert {m_1-m_2}\Vert _{2}^2 + \mathbb {E}_{}\left[ \Vert {Z_1^m-Z_2^m}\Vert _{2}^2\right] . \end{aligned}$$

Proof

$$\begin{aligned} \mathbb {E}_{}\left[ \Vert {Z_1-Z_2}\Vert _{2}^2\right]= & {} \mathbb {E}_{}\left[ \Vert {Z_1^m - Z_2^m + m_1 - m_2}\Vert _{2}^2\right] \\= & {} \mathbb {E}_{}\left[ \Vert {Z_1^m - Z_2^m}\Vert _{2}^2 + \Vert {m_1 - m_2}\Vert _{2}^2\right] \\&+\,2\;(m_1-m_2)^T\,\underbrace{\mathbb {E}_{}\left[ Z_1^m-Z_2^m\right] }_{=0}\\= & {} \mathbb {E}_{}\left[ \Vert {Z_1^m - Z_2^m}\Vert _{2}^2\right] + \Vert {m_1 - m_2}\Vert _{2}^2 \end{aligned}$$

\(\square \)

In our first lemma we show that using an \( \varepsilon \)-subspace embedding \(\varPi \) for the columnspace of [XY], we can approximate the least squares regression problem up to a factor of \(1+\varepsilon \). That is, we can find a solution \(\nu \) by projecting \(\varPi Y\) into the columnspace of \(\varPi X\) such that \(\Vert {X\nu -Y}\Vert _{2}\le (1+\varepsilon ) \min _{\beta \in \mathbb {R}^d}\Vert {X\beta -Y}\Vert _{2}\). Similar proofs can be found in Clarkson et al. (2013) and Boutsidis et al. (2013). We repeat the result here for completeness.

Lemma 1

Given \(X\in \mathbb {R}^{n\times d},Y\in \mathbb {R}^{n}\), let \(\varPi \) be an \((\varepsilon /3)\)-subspace embedding for the columnspace of [XY]. Now let \(\gamma ={\text {argmin}}_{\beta \in \mathbb {R}^d} \Vert {X\beta -Y}\Vert _{2}^2\) and similarly define \(\nu ={\text {argmin}}_{\beta \in \mathbb {R}^d} \Vert {\varPi (X\beta -Y)}\Vert _{2}^2\). Then

$$\begin{aligned} \Vert {X\nu -Y}\Vert _{2}^2\le (1+\varepsilon )\,\Vert {X\gamma -Y}\Vert _{2}^2. \end{aligned}$$

Proof

Let \([X,Y]=U\Sigma V^T\) denote the SVD of [XY]. Now define \(\eta _1=\Sigma V^T [\gamma ^T,-1]^T\) and \(\eta _2=\Sigma V^T [\nu ^T,-1]^T\). Using this notation we can rewrite \(U\eta _1=X\gamma -Y\) and similarly \(U\eta _2=X\nu -Y\). We have that

$$\begin{aligned} (1-\varepsilon /3)\,\Vert {U\eta _2}\Vert _{2}^2\le & {} \Vert {\varPi U\eta _2}\Vert _{2}^2\\\le & {} \Vert {\varPi U\eta _1}\Vert _{2}^2\\\le & {} (1+\varepsilon /3)\,\Vert {U\eta _1}\Vert _{2}^2. \end{aligned}$$

The first and the last inequality are direct applications of the subspace embedding property (3), whereas the middle inequality follows from the optimality of \(\nu \) in the embedded subspace. Now, by rearranging and resubstituting terms, this yields

$$\begin{aligned} \Vert {X\nu -Y}\Vert _{2}^2\le & {} \left( \frac{1+\varepsilon /3}{1-\varepsilon /3} \right) \Vert {X\gamma -Y}\Vert _{2}^2\\\le & {} (1+\varepsilon )\,\Vert {X\gamma -Y}\Vert _{2}^2. \end{aligned}$$

\(\square \)
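A quick numerical check of Lemma 1 (with an illustrative Rademacher embedding; the data and target dimension are assumptions for illustration only) compares the residual of the sketched minimizer \(\nu \) with the optimal residual of \(\gamma \):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10_000, 8, 800
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)

Pi = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)

gamma, *_ = np.linalg.lstsq(X, Y, rcond=None)          # exact OLS minimizer
nu, *_ = np.linalg.lstsq(Pi @ X, Pi @ Y, rcond=None)   # minimizer on the sketch

res_gamma = np.sum((X @ gamma - Y) ** 2)
res_nu = np.sum((X @ nu - Y) ** 2)
# Lemma 1: res_nu <= (1 + eps) * res_gamma; in addition, by Lemma 2 below,
# the two minimizers themselves are close.
print(res_nu / res_gamma, np.linalg.norm(gamma - nu))
```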

One can even show that a distortion of order \(\sqrt{\varepsilon }\), i.e., an \(O(\sqrt{\varepsilon })\)-subspace embedding, is already enough to get the result. This was shown using a more involved proof that takes the geometry of the least squares solution into account and exploits the property that the solution is obtained by an orthogonal projection onto the columnspace spanned by the data matrix (cf. Clarkson and Woodruff 2009). Putting it the other way around, by using an \((\varepsilon /3)\)-subspace embedding as in Lemma 1, we even have

$$\begin{aligned} \Vert {X\nu -Y}\Vert _{2}^2\le (1+\varepsilon ^2)\,\Vert {X\gamma -Y}\Vert _{2}^2. \end{aligned}$$
(7)

In the following, we investigate the distributions proportional to the likelihood functions \(p\propto \mathscr {L}(\beta | X,Y)\) and \(p'\propto \mathscr {L}(\beta | \varPi X,\varPi Y)\) and bound their Wasserstein distance.

We begin our contribution with a bound on the distance of their means \(\gamma \) and \(\nu \), respectively. We generalize upon previous results for the specific embedding methods to arbitrary \(\varepsilon \)-subspace embeddings.

Lemma 2

Given \(X\in \mathbb {R}^{n\times d},Y\in \mathbb {R}^{n}\), let \(\varPi \) be an \((\varepsilon /3)\)-subspace embedding for the columnspace of [XY]. Now let \(\gamma ={\text {argmin}}_{\beta \in \mathbb {R}^d} \Vert {X\beta -Y}\Vert _{2}^2\) and similarly define \(\nu ={\text {argmin}}_{\beta \in \mathbb {R}^d} \Vert {\varPi (X\beta -Y)}\Vert _{2}^2\). Then

$$\begin{aligned} \Vert {\gamma - \nu }\Vert _{2}^2 \le \frac{\varepsilon ^2}{\sigma _{min }^{2}(X)} \, \Vert {X\gamma -Y}\Vert _{2}^2. \end{aligned}$$
(8)

Proof

Let \(X=U\Sigma V^T\) denote the SVD of X. Let \(\eta =V^T(\gamma -\nu )\). First note that \(\gamma \) and \(\nu \) are both contained in the columnspace of V (cf. Sarlós 2006), which means that multiplying \(\gamma -\nu \) by \(V^T\) preserves its norm. Thus,

$$\begin{aligned} \Vert {X(\gamma -\nu )}\Vert _{2}^2= & {} \Vert {U\Sigma V^T(\gamma -\nu )}\Vert _{2}^2\\= & {} \Vert {\Sigma V^T(\gamma -\nu )}\Vert _{2}^2 \\= & {} \sum \sigma _i^2 \eta _i^2\\\ge & {} \sum \sigma _{\text {min}}^2 \eta _i^2\\= & {} \sigma _{\text {min}}^2\,\Vert {V^T(\gamma -\nu )}\Vert _{2}^2\\= & {} \sigma _{\text {min}}^2\,\Vert {\gamma -\nu }\Vert _{2}^2\,. \end{aligned}$$

Consequently, it remains to bound \(\Vert {X(\gamma -\nu )}\Vert _{2}^2\). This can be done by using the fact that the minimizer \(\gamma \) is obtained by projecting Y orthogonally onto the columnspace of X. Therefore, we have \(X^T(X\gamma -Y)=0\) (cf. Clarkson and Woodruff 2009). Furthermore, by Eq. (7) it holds that \(\Vert {X\nu - Y}\Vert _{2}^2 \le (1+\varepsilon ^2) \Vert {X\gamma -Y}\Vert _{2}^2\). Now by plugging this into the Pythagorean theorem and rearranging we get that

$$\begin{aligned} \Vert {X(\gamma -\nu )}\Vert _{2}^2= & {} \Vert {X\nu -Y}\Vert _{2}^2 - \Vert {X\gamma -Y}\Vert _{2}^2\\\le & {} \varepsilon ^2\,\Vert {X\gamma -Y}\Vert _{2}^2. \end{aligned}$$

Putting all together this yields the proposition

$$\begin{aligned} \Vert {\gamma -\nu }\Vert _{2}^2\le & {} \frac{1}{\sigma _{\text {min}}^{2}(X)}\,\Vert {X(\gamma -\nu )}\Vert _{2}^2\\\le & {} \frac{\varepsilon ^2}{\sigma _{\text {min}}^{2}(X)}\,\Vert {X\gamma -Y}\Vert _{2}^2. \end{aligned}$$

\(\square \)

Now that we have derived a bound on the distance of the different means, recall that by Observation 2, we can assume w.l.o.g. \(\gamma =\nu =0\) when we consider the variances. Namely, it remains to derive a bound on \(\inf \mathbb {E}_{}\left[ \Vert {Z_1^m-Z_2^m}\Vert _{2}^2\right] \), i.e., the least expected squared Euclidean distance of two points drawn from a joint distribution whose marginals are the mean-centered original distribution and its embedded counterpart. Of course we can bound this quantity by explicitly defining a properly chosen joint distribution and bounding the expected squared distance for its particular choice. This is the idea that yields our next lemma.

Lemma 3

Let \(p\propto \mathscr {L}(\beta |X,Y)\) and \(p'\propto \mathscr {L}(\beta | \varPi X,\varPi Y)\). Let \(Z_1^m,Z_2^m\) be the mean-centered versions of the random variables \(Z_1 \sim p\) and \(Z_2 \sim p'\) that are distributed according to p and \(p'\) respectively. Then we have

$$\begin{aligned} \inf \limits _{\lambda \in \Lambda (p,p')} \mathbb {E}_{\lambda }\left[ \Vert {Z_1^m-Z_2^m}\Vert _{2}^2\right] \le \varepsilon ^2 {\text {tr}}\left( (X^TX)^{-1}\right) . \end{aligned}$$

Proof

Our plan is to design a joint distribution that deterministically maps points from one distribution to another in such a way that we can bound the distance of every pair of points. This can be done by utilizing the Dirac delta function \(\delta (\cdot )\), which is a degenerate probability density function that concentrates all probability mass at zero and has zero density otherwise. Given a bijection \(g: \mathbb {R}^d \rightarrow \mathbb {R}^d\) we can define such a joint distribution \(\lambda \in \Lambda (p,p')\) through its conditional distributions \(\lambda (x\,|\,y)=\delta (x-g(y))\) for every \(y\in \mathbb {R}^d\). It therefore remains to define g.

According to Observation 1, when applying the embedding \(\varPi \), the columnspace of a matrix is expanded or contracted, respectively, by a factor of at most \((1\pm \varepsilon )\). We will make use of this fact in the following way. Let \(X=U\Sigma V^T\) and \(\varPi X=\tilde{U}\tilde{\Sigma }\tilde{V}^T\) denote the SVDs of X and \(\varPi X\), respectively. Now, to define the x-y-pairs that will be mapped to each other by g, we consider vectors \(x,x',y,y' \in \mathbb {R}^d\) where \(x'\) and \(y'\) are contained in the columnspaces of V and \(\tilde{V}\), respectively. To obtain the bijection g, let the vectors have the following properties for arbitrary, but fixed radius \(c \ge 0\):

  1. 1.

    \(\Vert {x'}\Vert _{2}=\Vert {y'}\Vert _{2}=c\)

  2. 2.

    \(x=\Sigma V^Tx'\)

  3. 3.

    \(y=\tilde{\Sigma } \tilde{V}^Ty'\)

  4. 4.

    \(\exists \tau > 0: x=\tau y\).

Observe that by the first property \(x'\) and \(y'\) lie on a d-dimensional sphere with radius c centered at 0. Therefore, there exists a rotation matrix \(R\in \mathbb {R}^{d\times d}\) such that \(y'=Rx'\). Note that such a map is bijective by definition. The second item defines a map of such spheres to ellipsoids (also centered at 0) given by \(\Sigma V^T\). Recall that \(x'\) was chosen from the columnspace of V. Thus, this map can be seen as a bijection between the d-dimensional vector space and a d-dimensional subspace contained in n-dimensional space. The third property is defined analogously. The fourth property requires that x and y both lie on a ray emanating from 0. Note that any such ray intersects each ellipsoid exactly once.

Our bijection can be defined accordingly as

$$\begin{aligned} g: \mathbb {R}^d\rightarrow & {} \mathbb {R}^d\\ x \;\;\mapsto & {} \tilde{\Sigma } \tilde{V}^TRV\Sigma ^{-1} x \end{aligned}$$

by composing the map \(\Sigma V^T\), defined in the second item, with the rotation R and finally with \(\tilde{\Sigma } \tilde{V}^T\) from the third property. The map is bijective since it is obtained as the composition of bijections.

Now, in order to bound the distance \(\Vert {Z_1^m-Z_2^m}\Vert _{2}^2\) for any realization of \((Z_1^m,Z_2^m)\) according to their joint distribution defined above, we can derive a bound on the parameter \(\tau \). Substituting the second and third properties into the fourth, we get that

$$\begin{aligned} \Sigma V^Tx' = \tau \tilde{\Sigma } \tilde{V}^Ty' \end{aligned}$$

which can be rearranged to

$$\begin{aligned} y'^Ty'\tau= & {} (y'^T \tilde{V}) \tilde{\Sigma }^{-1} \Sigma (V^T x')\\ ~= & {} \sum (y'^T \tilde{V})_i (V^Tx')_i \frac{\sigma _i}{\tilde{\sigma }_i}\\ ~\le & {} \sum (y'^T \tilde{V})_i (V^Tx')_i \frac{\sigma _i}{\sigma _i\sqrt{1-\varepsilon }}\\ ~\le & {} (1+\varepsilon )\sum (y'^T \tilde{V})_i (V^Tx')_i\\ ~\le & {} (1+\varepsilon )\,c^2. \end{aligned}$$

The first inequality follows from \(\tilde{\sigma }_i \ge \sqrt{1-\varepsilon } \; \sigma _i\), the second from the assumption \(\varepsilon \le 1/2\), and the last one from the Cauchy-Schwarz inequality, since \(\Vert {\tilde{V}^Ty'}\Vert _{2}=\Vert {V^Tx'}\Vert _{2}=c\). This eventually means that \(\tau \le (1+\varepsilon )\) since \(y'^Ty'=c^2\) by the first property.

A lower bound of \(\tau \ge (1-\varepsilon )\) can be derived analogously by using \(\tilde{\sigma }_i \le \sqrt{1+\varepsilon } \; \sigma _i\).

Now we can conclude our proof. It follows that

$$\begin{aligned} \inf \limits _{\lambda '\in \Lambda (p,p')} \mathbb {E}_{\lambda '}\left[ \Vert {Z_1^m-Z_2^m}\Vert _{2}^2\right]\le & {} \mathbb {E}_{\lambda }\left[ \Vert {Z_1^m-Z_2^m}\Vert _{2}^2\right] \\ ~\le & {} \mathbb {E}_{\lambda }\left[ \Vert {\varepsilon Z_1^m}\Vert _{2}^2\right] \\ ~= & {} \varepsilon ^2 \; \mathbb {E}_{\lambda }\left[ \Vert {Z_1^m}\Vert _{2}^2\right] \\ ~= & {} \varepsilon ^2 \, {\text {tr}}\left( (X^TX)^{-1}\right) . \end{aligned}$$

The last equality holds since the expected squared norm of the mean-centered random variable is just the trace of its covariance matrix. \(\square \)

Combining the above results we get the following lemma.

Lemma 4

Let \(\varPi \) be an \((\varepsilon /3)\)-subspace embedding for the columnspace of X. Let \(p\propto \mathscr {L}(\beta |X,Y)\) and \(p'\propto \mathscr {L}(\beta | \varPi X,\varPi Y)\). Then

$$\begin{aligned} \mathscr {W}_2^2(p,p') \le \frac{\varepsilon ^2}{\sigma _{min }^{2}(X)} \, \Vert {X\gamma -Y}\Vert _{2}^2 + \varepsilon ^2 {\text {tr}}\left( (X^TX)^{-1}\right) . \end{aligned}$$

Proof

The lemma follows from Definition 2, Observation 2, Lemma 2 and Lemma 3. \(\square \)

Under mild assumptions we can argue that this leads to a \((1+O(\varepsilon ))\)-approximation of the likelihood with respect to the Wasserstein weight (see Definition 3).

Corollary 1

Let \(\varPi \) be an \((\varepsilon /3)\)-subspace embedding for the columnspace of X. Let \(p\propto \mathscr {L}(\beta |X,Y)\) and similarly let \(p'\propto \mathscr {L}(\beta | \varPi X,\varPi Y)\). Let \(\kappa (X)=\sigma _{\max }(X)/\sigma _{\min }(X)\) be the condition number of X. Assume that for some \(\rho \in (0,1]\) we have \(\Vert {X\gamma }\Vert _{2} \ge \rho \Vert {Y}\Vert _{2}\). Then

$$\begin{aligned} \mathscr {W}_2(p') \le \left( 1+\frac{\kappa (X)}{\rho } \,{\varepsilon } \right) \;\mathscr {W}_2(p). \end{aligned}$$

Proof

By definition, the squared \(\ell _2\) Wasserstein weight of p equals its second moment. Since p is a Gaussian measure with mean \(\gamma \) and covariance matrix \((X^TX)^{-1}\), we thus have

$$\begin{aligned} \mathscr {W}_2^2(p)=\Vert {\gamma }\Vert _{2}^2 + {\text {tr}}\left( (X^TX)^{-1}\right) \end{aligned}$$

and similarly

$$\begin{aligned} \mathscr {W}_2^2(p')=\Vert {\nu }\Vert _{2}^2 + {\text {tr}}\left( (X^T\varPi ^T\varPi X)^{-1}\right) . \end{aligned}$$

Since \(\varPi \) is an \((\varepsilon /3)\)-subspace embedding for the columnspace of X we know from Observation 1, that all the squared singular values of X are approximated up to less than \((1\pm \varepsilon )\) error and so are their inverses. Therefore, we have that

$$\begin{aligned} {\text {tr}}\left( (X^T\varPi ^T\varPi X)^{-1}\right) \le (1+\varepsilon )\, {\text {tr}}\left( (X^TX)^{-1}\right) . \end{aligned}$$
(9)

It remains to bound \(\Vert {\nu }\Vert _{2}^2\). To this end we use the assumption that for some \(\rho \in (0,1]\) we have \(\Vert {X\gamma }\Vert _{2} \ge \rho \Vert {Y}\Vert _{2}\). By \(X^T(X\gamma -Y)=0\) and applying the Pythagorean Theorem this means that

$$\begin{aligned} \Vert {X\gamma -Y}\Vert _{2}^2= & {} \Vert {Y}\Vert _{2}^2 - \Vert {X\gamma }\Vert _{2}^2 \nonumber \\\le & {} \Vert {X\gamma }\Vert _{2}^2 \left( \frac{1}{\rho ^2}-1\right) \nonumber \\\le & {} \frac{\Vert {X\gamma }\Vert _{2}^2}{\rho ^2}. \end{aligned}$$
(10)

Now we can apply the triangle inequality, Lemma 2, Inequality (10) and Definition 1 to get

$$\begin{aligned} \Vert {\nu }\Vert _{2}\le & {} \Vert {\gamma }\Vert _{2} + \Vert {\nu -\gamma }\Vert _{2}\\\le & {} \Vert {\gamma }\Vert _{2} + \frac{\varepsilon }{\sigma _{\text {min}}(X)} \, \Vert {X\gamma -Y}\Vert _{2}\\\le & {} \Vert {\gamma }\Vert _{2} + \frac{\varepsilon }{\rho \sigma _{\text {min}}(X)} \, \Vert {X\gamma }\Vert _{2}\\\le & {} \Vert {\gamma }\Vert _{2} + \frac{\varepsilon }{\rho \sigma _{\text {min}}(X)} \, \Vert {X}\Vert _{2}\Vert {\gamma }\Vert _{2}\\= & {} \Vert {\gamma }\Vert _{2} + \frac{\varepsilon }{\rho } \, \kappa (X)\Vert {\gamma }\Vert _{2}\\= & {} \left( 1+\frac{\kappa (X)}{\rho } \,\varepsilon \right) \Vert {\gamma }\Vert _{2}. \end{aligned}$$

Combining this with Inequality (9), the claim follows since \(\frac{\kappa (X)}{\rho }\ge 1\) and therefore \((1+\varepsilon )\le (1+\frac{\kappa (X)}{\rho }\varepsilon )^2\) and finally taking square roots on both sides. \(\square \)

We stress that the assumption that there exists some constant \(\rho \in (0,1]\) such that \(\Vert {X\gamma }\Vert _{2} \ge \rho \Vert {Y}\Vert _{2}\) is very natural and mild in the setting of linear regression since it means that at least a constant fraction of the dependent variable Y can be explained within the columnspace of the data X (cf. Drineas et al. 2006). If this is not true, then a linear model is not appropriate at all for the given data.

4.2 Bayesian regression

So far we have shown that using subspace embeddings to compress a given data set for regression yields a good approximation to the likelihood. Note that in a Bayesian regression setting Lemma 4 already implies a similar approximation error for the posterior distribution if the priors for \(\beta \) are chosen to be uniform distributions over \(\mathbb {R}\). This is an improper, non-informative choice, \(p_\mathrm{pre}(\beta ) = \mathbbm {1}_{\mathbb {R}^d}\), where \(\mathbbm {1}_{\mathbb {R}^d}\) is the indicator function over the entire \(\mathbb {R}^d\). From this, it follows that

$$\begin{aligned} p_\mathrm{post}(\beta |X, Y)\propto & {} \mathscr {L}(\beta | X,Y)\cdot \mathbbm {1}_{\mathbb {R}^d}\\= & {} \mathscr {L}(\beta | X,Y). \end{aligned}$$

The remaining term is just the Gaussian likelihood which is proper. For regression models, especially on data sets with large n, this covers a considerable amount of the cases of interest (cf. Gelman et al. 2014). We will extend this to arbitrary Gaussian priors \(p_\mathrm{pre}(\beta )\) leading to our main result: an approximation guarantee for Gaussian Bayesian regression in its most general form.

To this end, let m be the mean of the prior distribution and let S be derived from its covariance matrix by \(\Sigma =\varsigma ^2(S^TS)^{-1}\). Now, the posterior distribution is given by

$$\begin{aligned} p_\mathrm{post}(\beta |X, Y)\propto & {} \mathscr {L}(\beta | X,Y)\cdot p_\mathrm{pre}(\beta )\\ ~= & {} \frac{1}{(2\pi \varsigma ^2)^{n/2}} \cdot \exp \left( -\frac{1}{2\varsigma ^2}\Vert {X\beta -Y}\Vert _{2}^2\right) \\&\cdot \, \frac{1}{(2\pi )^{\frac{d}{2}}|\Sigma |^{\frac{1}{2}}} \cdot \exp \left( \!-\!\frac{1}{2\varsigma ^2}\Vert {S(\beta -m)}\Vert _{2}^2\right) . \end{aligned}$$

Thus, we know that up to some constants that are independent of \(\beta \), the exponent of the posterior can be described by

$$\begin{aligned} \Vert {X\beta -Y}\Vert _{2}^2 + \Vert {S(\beta -m)}\Vert _{2}^2 \end{aligned}$$
(11)

which contains all the information to define the mean and covariance structure of the posterior distribution. Now let

$$\begin{aligned} Z=\left[ \begin{matrix}X\\ S\end{matrix}\right] \quad \text {and}\quad z=\left[ \begin{matrix}Y\\ Sm\end{matrix}\right] . \end{aligned}$$

With these definitions we can rewrite Eq. (11) above as \(\Vert {Z\beta -z}\Vert _{2}^2\). This, in turn, can be treated as a (frequentist) regression problem to which we can apply Lemma 4. We just have to use a subspace embedding for the columnspace of [Zz] instead of only embedding [XY]. We will see that it is not necessary to do this explicitly. More precisely, embedding only the data matrix is sufficient to have a subspace embedding for the entire columnspace defined by the data and the prior information and, therefore, to have a proper approximation of the posterior distribution. This is formalized in the following lemma.

Lemma 5

Let \(M=[M_1^T,M_2^T]^T\in \mathbb {R}^{(n_1+n_2)\times d}\) be an arbitrary matrix. Suppose \(\varPi \) is an \(\varepsilon \)-subspace embedding for the columnspace of \(M_1\). Let \(I_{n_2}\in \mathbb {R}^{(n_2\times n_2)}\) be the identity matrix. Then

$$\begin{aligned} P=\left[ \begin{matrix}\varPi &{} 0\\ 0 &{} I_{n_2}\end{matrix}\right] \in \mathbb {R}^{(k+n_2)\times (n_1+n_2)} \end{aligned}$$

is an \(\varepsilon \)-subspace embedding for the columnspace of M.

Proof

Fix an arbitrary \(x\in \mathbb {R}^d\). We have

$$\begin{aligned}&|\Vert {PMx}\Vert _{2}^2-\Vert {Mx}\Vert _{2}^2|\\&\quad = |\Vert {\varPi M_1x}\Vert _{2}^2+\Vert {M_2x}\Vert _{2}^2-\Vert {M_1x}\Vert _{2}^2-\Vert {M_2x}\Vert _{2}^2|\\&\quad = |\Vert {\varPi M_1x}\Vert _{2}^2-\Vert {M_1x}\Vert _{2}^2|\\&\quad \le \varepsilon \Vert {M_1x}\Vert _{2}^2\\&\quad \le \varepsilon \big (\Vert {M_1x}\Vert _{2}^2 + \Vert {M_2x}\Vert _{2}^2\big )\\&\quad = \varepsilon \Vert {Mx}\Vert _{2}^2 \end{aligned}$$

which concludes the proof by singular value decomposition \(M=U\Sigma V^T\) and surjectivity of the linear map \(\Sigma V^T\). \(\square \)

This lemma finally enables us to prove our main theoretical result.

Theorem 1

Let \(\varPi \) be an \((\varepsilon /3)\)-subspace embedding for the columnspace of X. Let \(p_\mathrm{pre}(\beta )\) be an arbitrary Gaussian distribution with mean m and covariance matrix \(\Sigma =\varsigma ^2(S^TS)^{-1}\). Let

$$\begin{aligned} Z=\left[ \begin{matrix}X\\ S\end{matrix}\right] \quad \text {and}\quad z=\left[ \begin{matrix}Y\\ Sm\end{matrix}\right] . \end{aligned}$$

Let \(\mu = {\text {argmin}}_{\beta \in \mathbb {R}^d} \Vert {Z\beta -z}\Vert _{2}\) be the posterior mean. Let \(p\propto \mathscr {L}(\beta |X,Y)\cdot p_\mathrm{pre}(\beta )\) and \(p'\propto \mathscr {L}(\beta | \varPi X,\varPi Y)\cdot p_\mathrm{pre}(\beta )\). Then

$$\begin{aligned} \mathscr {W}_2^2(p,p') \le \frac{\varepsilon ^2}{\sigma _{\text {min}}^{2}(Z)} \, \Vert {Z\mu -z}\Vert _{2}^2 + \varepsilon ^2 {\text {tr}}\left( (Z^TZ)^{-1}\right) . \end{aligned}$$

Proof

From our previous reasoning we know that approximating the posterior distribution can be reduced to approximating a likelihood function that is defined in terms of the data as well as the parameters of the prior distribution. This has been shown by rewriting Eq. (11) as \(\Vert {Z\beta -z}\Vert _{2}^2\). For that reason, we can apply Lemma 4 to get the desired result if we are given an \((\varepsilon /3)\)-subspace embedding for the columnspace of Z. Using Lemma 5 we know that for this, it is sufficient to use an \((\varepsilon /3)\)-subspace embedding for the columnspace of [XY] independent of the covariance and mean that define the prior distribution. \(\square \)

Similar to Corollary 1 we have the following result concerning the posterior distribution.

Corollary 2

Let \(\varPi \) be an \((\varepsilon /3)\)-subspace embedding for the columnspace of X. Let \(p_\mathrm{pre}(\beta )\) be an arbitrary Gaussian distribution with mean m and covariance matrix \(\Sigma =\varsigma ^2(S^TS)^{-1}\). Let

$$\begin{aligned} Z=\left[ \begin{matrix}X\\ S\end{matrix}\right] \quad \text {and}\quad z=\left[ \begin{matrix}Y\\ Sm\end{matrix}\right] . \end{aligned}$$

Let \(\mu = {\text {argmin}}_{\beta \in \mathbb {R}^d} \Vert {Z\beta -z}\Vert _{2}\) be the posterior mean. Let \(p\propto \mathscr {L}(\beta |X,Y)\cdot p_\mathrm{pre}(\beta )\) and \(p'\propto \mathscr {L}(\beta | \varPi X,\varPi Y)\cdot p_\mathrm{pre}(\beta )\). Let \(\kappa (Z)\) be the condition number of Z. Assume that for some \(\rho \in (0,1]\) we have \(\Vert {Z\mu }\Vert _{2} \ge \rho \Vert {z}\Vert _{2}\). Then we have

$$\begin{aligned} \mathscr {W}_2(p') \le \left( 1+\frac{\kappa (Z)}{\rho } \,\varepsilon \right) \;\mathscr {W}_2(p). \end{aligned}$$

Both Theorem 1 and Corollary 2 show that the sketch preserves the expected value and the covariance structure of the posterior distribution very well. Note that for normal distributions, these parameters fully characterize the distribution as they are sufficient statistics. Therefore, one can see the corresponding parameters based on the sketched data set as very accurate approximate sufficient statistics for the posterior distribution.
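To make this concrete, the following minimal R sketch forms the augmented system [Z, z], applies an arbitrary Gaussian projection only to the data block [X, Y] and appends the prior block unchanged, as licensed by Lemma 5. The helper name, the standard normal prior and all dimensions are example choices of ours, not the implementation used in our experiments.

# Augmented least-squares reduction and illustration of Lemma 5 / Theorem 1 (R).
posterior_mean_via_Z <- function(X, Y, S, m) {
  Z <- rbind(X, S)                       # stack data block and prior block
  z <- c(Y, S %*% m)
  solve(crossprod(Z), crossprod(Z, z))   # argmin_beta ||Z beta - z||_2, the posterior mean
}

set.seed(1)
n <- 5000; d <- 10; k <- 500
X <- matrix(rnorm(n * d), n, d); Y <- X %*% rnorm(d) + rnorm(n)
S <- diag(d); m <- rep(0, d)                 # prior N(m, varsigma^2 (S^T S)^{-1})
Pi <- matrix(rnorm(k * n), k, n) / sqrt(k)   # Gaussian stand-in for a subspace embedding

mu_exact  <- posterior_mean_via_Z(X, Y, S, m)                 # full data
mu_sketch <- posterior_mean_via_Z(Pi %*% X, Pi %*% Y, S, m)   # only [X, Y] is sketched
max(abs(mu_exact - mu_sketch))   # small relative to the coefficients, in line with Theorem 1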

5 Simulation study

To validate the proposed method empirically, we conduct a simulation study. For this, we employ MCMC methods to obtain the posterior distributions of the parameters of the Bayesian regressions. Please note that the sketching techniques can also be combined with other methods. We concentrate on MCMC methods, however, as they are very reliable, widely used and allow for easy checking of convergence. The different sketching methods were implemented as described in Sect. 3.4 and in more technical detail in the given references. All code was written in C++11 and compiled using GCC 4.7. For fast matrix multiplications we employed the LAPACK 3.5.0 library where applicable. Our R-package RaProR uses the above implementations. The package is available on its project website (Geppert et al. 2015). All simulations were done using R, version 3.1.2 (R Core Team 2014) and the R-package rstan, version 2.3 (Stan Development Team 2013). The simulations were conducted on an Intel Xeon E5430 quad-core CPU running at 2.66 GHz with 16 GB of DDR2 memory on a Debian GNU/Linux 7.8 distribution. The hard drive used was a Seagate Momentus 7200.4 G-Force 500 GB, SATA 3 Gb/s HDD with 7200 rpm and 16 MB cache.

5.1 Data generation

For the simulation study, we generate a collection of data sets, varying the number of observations n, the number of variables d, and the standard deviation of the error term \(\varsigma \). The variation of n is introduced to monitor whether the running time of the analyses based on sketches is indeed independent of n and also to see how well the proposed method deals with growing data sets. We choose values of \(n \in \{50{,}000, 100{,}000, 500{,}000, 1{,}000{,}000\}\). The size of the sketches depends mainly on the number of variables in the data set. For this reason, we conduct simulations with two values of d, namely \(d=50\) and \(d=100\). The reason for choosing rather small values of n and d is that our aim is to compare the results of the MCMC on the sketched data sets to the results on the respective original data set. The sketching methods presented here can handle larger values of d and arbitrary values of n, but employing MCMC on the original data set then becomes infeasible. The standard deviation of the error term is important, because the goodness of the approximation also depends on the goodness of the model (cf. Lemma 4 and Theorem 1). Here, we choose \(\varsigma \in \{1, 2, 5, 10\}\), thus ranging from very well-fitting models to models with quite high error variance.

The generated true values of \(\beta \) follow a zero-inflated Poisson distribution, where the expected value of the Poisson distribution is 3 and the probability of a component exhibiting an excess zero is 0.5. This means that the components of \(\beta \) have no influence, i.e. are 0, with probability 0.5 and follow a \(\text {Poi}(3)\) distribution with probability 0.5. All components that follow a \(\text {Poi}(3)\) distribution are multiplied by \(-1\) with probability 0.5. The data set X is obtained in two steps. At first, a d-dimensional vector that represents the column means is drawn randomly from a N(0, 25) distribution. In a second step, the actual values of X are drawn from a normal distribution with the respective column mean as expected value and a variance of four. The variance within the columns of X is thus lower than the error variance for the choices \(\varsigma \in \{5, 10\}\), equal to it for \(\varsigma =2\), and larger for \(\varsigma =1\). Y is then generated by multiplying X with \(\beta \) and adding the error term, in accordance with the model.
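The following R sketch reproduces this generation scheme for one exemplary parameter setting; the seed and the particular values of n, d and the error standard deviation are arbitrary.

# Data generation (R) for one setting of the simulation grid described above.
set.seed(123)
n <- 50000; d <- 50; sigma_err <- 5
beta <- rpois(d, lambda = 3) *               # Poi(3) magnitude ...
        rbinom(d, size = 1, prob = 0.5) *    # ... zeroed out with probability 0.5 ...
        sample(c(-1, 1), d, replace = TRUE)  # ... and given a random sign
col_means <- rnorm(d, mean = 0, sd = 5)      # column means drawn from N(0, 25)
X <- sapply(col_means, function(mu) rnorm(n, mean = mu, sd = 2))  # column variance 4
Y <- X %*% beta + rnorm(n, sd = sigma_err)   # response according to the model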

5.2 Regression model

We employ a standard Bayesian linear regression model (cf. Sect. 3.2)

$$\begin{aligned} Y \sim N(X\beta , \varsigma ^2 I_n) \end{aligned}$$

with independent uniform priors over \(\mathbb {R}\) for all components of \(\beta \), which are improper, non-informative prior distributions. For \(\varsigma \), the uniform prior is limited to the positive part of \(\mathbb {R}\). We choose an improper uniform prior rather than an inverse gamma prior with small values for the hyperparameters, as Gelman (2006) indicates that such priors can have a skewing effect on the posterior distribution. When using improper prior distributions, it is necessary to ensure that the posterior distributions are proper. For our choice of uniform distributions, this does not pose a problem, since the uniform prior is represented by an indicator function \(p_\mathrm{pre}(\beta ) = \mathbbm {1}_{\mathbb {R}^d}\), which is constant over \(\mathbb {R}^d\) and therefore leaves the posterior proportional to the likelihood. More precisely, we have

$$\begin{aligned} p_\mathrm{post}(\beta |X, Y)\propto & {} \mathscr {L}(\beta | X,Y)\cdot \mathbbm {1}_{\mathbb {R}^d}\\= & {} \mathscr {L}(\beta | X,Y). \end{aligned}$$

The remaining term is just the Gaussian likelihood, which is proper with respect to both \(\beta \) and \(\varsigma \) (cf. Gelman et al. 2014). Although closed-form expressions are known for this model, we employ MCMC for the reasons motivated in Sects. 2 and 3.2.

Our theoretical guarantees cover the posterior distribution of \(\beta \); since the model additionally treats \(\varsigma \) as unknown, it is more general than the theoretical results. As an alternative, \(\varsigma \) can be fixed to an estimated value obtained using \(\Vert \varPi X \beta - \varPi Y\Vert _2/\sqrt{n}\). The results we present in the following are all based on \(\beta \).
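For illustration, a minimal rstan specification of this model with flat priors could look as follows; the Stan program and the (commented-out) call are our own sketch and not necessarily the exact configuration used in the study.

# Minimal rstan sketch of the regression model with improper uniform priors;
# omitting prior statements in Stan leaves the priors flat, and sigma is restricted
# to the positive reals through its declaration.
library(rstan)
model_code <- "
data { int<lower=0> n; int<lower=0> d; matrix[n, d] X; vector[n] Y; }
parameters { vector[d] beta; real<lower=0> sigma; }
model { Y ~ normal(X * beta, sigma); }
"
# fit <- stan(model_code = model_code, chains = 4,
#             data = list(n = nrow(X), d = ncol(X), X = X, Y = as.vector(Y)))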

5.3 Preliminary simulations

In a first step, we consider the running times necessary to carry out Bayesian regression for (non-embedded) data sets with an increasing number of observations n, employing the No-U-Turn-Sampler (Hoffman and Gelman 2014), which is implemented in the R-package rstan. We use standard settings and four chains, which are sampled in parallel. The resulting running times are plotted in Fig. 1. The running time depends at least linearly on n, with occasional jumps that are probably due to swapping to external memory, which slows down the computation by a large factor. There is an outlier for the case of \(n=50\), which takes considerably more time to compute than \(n=100\) and even more than \(n=200\). Since we have \(d=50\) variables, the total number of parameters (52) exceeds the number of observations for this case only. While the linear dependency on the number of observations does not pose a problem for small to medium-sized data sets, MCMC methods become infeasible in big data settings. This underlines the usefulness of working on embedded data sets.

Fig. 1 Running times for simulated data sets with varying number of observations and \(d=50\) variables

Before we consider our embedding methods, we pick up on the idea of using \(\varPi = X^T\) as sketching matrix. If the data set is given row by row, the sketch \(\varPi X = X^T X\) can be computed efficiently using the outer products of the row vectors, \(X^T X = \sum x_i^T x_i\), which is a sum of \(d\times d\) matrices of rank one. Analogously, we have \(\varPi Y = X^T Y = \sum x_i^T Y_i\). This results in a sketch of size \(d \times (d+1)\), which is the smallest possible sketch for problems of full rank and has no error at all, in the sense that the exact likelihood respectively posterior distribution can be calculated from these matrices. We have argued before that this method is not numerically stable in general. As we focus on MCMC in this work, we show how this effect influences the run of the MCMC sampler. We tried analyzing Bayesian linear regression models based on \(X^T [X, Y]\), using some of the data sets described in Sect. 5.1 and also a smaller data set with \(n=10{,}000\), which was generated in the same way. We found that the models do not converge in practice. Increasing the number of iterations does not seem to be a remedy, as the variance of the MCMC sample grows with more iterations. Figure 2 shows an exemplary traceplot for one parameter, consisting of four chains. The range of the sample is enormous. The variation is high for all of the chains, which exhibit standard deviations on the order of \(10^{9}\). When reducing the number of iterations from the \(10{,}000\) in Fig. 2 to, say, 5000, the range of the MCMC sample decreases markedly. However, there is no sign of convergence, with minimum and maximum around \(-4\cdot 10^{9}\) and \(4\cdot 10^{9}\), respectively.
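For completeness, the row-by-row accumulation can be sketched in R as follows; the simulated rows are only stand-ins, and no attempt is made to stabilize the computation, so the final solve step is exactly where ill-conditioned data causes the problems discussed here.

# Streaming accumulation of X^T X and X^T Y via rank-one updates (R).
set.seed(7)
n <- 10000; d <- 50
X <- matrix(rnorm(n * d), n, d); Y <- rnorm(n)   # stand-ins for rows arriving one by one
XtX <- matrix(0, d, d); XtY <- numeric(d)
for (i in seq_len(n)) {
  xi  <- X[i, ]
  XtX <- XtX + tcrossprod(xi)   # rank-one update x_i^T x_i
  XtY <- XtY + xi * Y[i]
}
beta_hat <- solve(XtX, XtY)     # normal equations; ill-conditioned X breaks this step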

Fig. 2 Traceplot of MCMC sample (with four chains) for one parameter based on data set obtained using \(\varPi = X^T\) as sketching matrix. Original data set contains \(n=10{,}000\) observations and \(d = 50\) variables

A numerically stable alternative, which at the same time removes the dependence of the analysis on n, is therefore needed.

We can deal with both of these issues using subspace embeddings, as our next experiments underline. We conduct a series of simulations that aims at comparing our proposed method to the standard method on the original data. To obtain the subspace embeddings, we employ the three approaches described in Sect. 3. As approximation parameter, we choose \(\varepsilon = 0.1\) and \(\varepsilon =0.2\) for all three methods. We do not recommend using values of \(\varepsilon >0.2\). Table 2 contains the number of observations of the sketches depending on the number of variables and the value of the approximation parameter. The sizes for RAD and SRHT are both set to \(k=\lceil \frac{d\log d}{\varepsilon ^2}\rceil \) to be comparable; they differ by one due to rounding. For the CW sketch we used k equal to the smallest power of two larger than \(\frac{d^2}{20\varepsilon ^2}\). Please note that the CW embeddings generally result in a higher number of observations due to the quadratic dependency on d. However, the opposite is true for 50 variables. This is due to the constants that we used. The constants are set to 1 in the case of RAD and SRHT. For the RAD method this was empirically evaluated by Venkatasubramanian and Wang (2011). For CW the constant may be much smaller, as indicated by a lower bound in Nelson and Nguyen (2013b). Preliminary experiments on small scale led to our choice of \(\frac{1}{20}\).
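As an example of one of the three constructions, a dense RAD sketch with target dimension \(k=\lceil \frac{d\log d}{\varepsilon ^2}\rceil \) (constant 1, as above) can be generated as in the following minimal R sketch. It is meant for moderate n only, since the actual implementations apply \(\varPi \) without ever storing it explicitly.

# Minimal RAD sketch (R): a dense matrix with scaled Rademacher (+1/-1) entries.
rad_sketch <- function(M, eps) {
  n <- nrow(M); d <- ncol(M)
  k <- ceiling(d * log(d) / eps^2)
  Pi <- matrix(sample(c(-1, 1), k * n, replace = TRUE), k, n) / sqrt(k)
  Pi %*% M                      # the sketch, e.g. [Pi X, Pi Y] if M = [X, Y]
}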

Table 2 Number of observations of the sketches for different values of d and \(\varepsilon \)

5.4 Comparison of posterior means

To evaluate the results, we first compare the posterior means of the models based on the embedded data sets with the posterior means of the model based on the original data set. Table 3 contains an overview of the sum of squared distances between the embedded data sets' posterior means and those of the original model. Geometrically, this is the squared Euclidean distance of the posterior mean vectors. As indicated by Theorem 1, the sum of squared distances grows with the standard deviation of the error term. There does not seem to be a systematic difference in performance between the different sketching methods. With larger \(\varepsilon \), we usually, though not always, observe an increase in the distance. Please note that some values are missing, because the original models did not converge within reasonable time bounds.

Table 3 Sum of squared distances between posterior mean values of the original model and models based on the respective sketches

In addition to the comparison to the original models' means, we also compare the posterior means to the true mean values. Table 4 contains the sums of squared distances between the true mean values and the posterior means for \(d=50\) and varying values of \(\varsigma \). The general picture looks very similar to the results in Table 3. The original model often exhibits the smallest sum of squared distances, but sometimes models based on embedded data sets are closer to the true mean. Again, there does not seem to be a systematic difference between the sketching methods. The squared distances do not seem to be influenced by the value of n, with some squared distances even exhibiting smaller values for larger n.

Table 4 Sum of squared distances between true mean values and posterior means of models based on the respective sketches

5.5 Comparison of fitted values

After this comparison on the level of parameters – whose number is not changed by sketching – we compare the models on the level of observations, of which the sketches contain merely a fraction of the original number. We multiply X with the posterior mean vector of \(\beta \), where this posterior mean can be based on the original data set or on the respective sketches. In a frequentist sense, these are fitted values \(\hat{Y}\), but all X values are taken from the original data set, not necessarily from the data set the model is based on. This is done to see how close the approximation is on the level of Y values for both \(\varepsilon =0.1\) and \(\varepsilon =0.2\). Figure 3 is a scatterplot which contains smoothed densities. The fitted values based on the original model are on the x-axis while the fitted values based on the CW sketch (with \(\varepsilon =0.1\)) are on the y-axis. Darker shades of black stand for more observations. Even though the fitted values are based on one of the data sets with the highest standard deviation of the error (\(n=50{,}000, \varsigma =10\)), all values are close or reasonably close to the bisecting line. This means that the fitted values obtained by the two models do not differ by much. To get a better overview, Fig. 4 depicts the distances between the fitted values as boxplots. Here, all three sketching methods with both \(\varepsilon =0.1\) and \(\varepsilon =0.2\) are included. All six sets of distances are centered around zero. The effect of the approximation parameter \(\varepsilon \) is evident from the boxplot: the variation is larger for \(\varepsilon =0.2\) regardless of the sketching method. When fixing \(\varepsilon \), all three sketching methods exhibit similar results, although the RAD sketch seems to introduce slightly more variation into the differences than the other two sketching methods. Overall, prediction for the data used in learning the model is thus highly accurate.
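The computation behind Figs. 3 and 4 can be summarized in a few lines of R; here mu_orig and mu_cw denote hypothetical posterior mean vectors obtained from the original model and from a CW sketch, respectively, and X is always the original design matrix.

fit_orig <- X %*% mu_orig               # fitted values from the original model
fit_cw   <- X %*% mu_cw                 # fitted values from the model on the CW sketch
summary(as.vector(fit_cw - fit_orig))   # differences, centered around zero (cf. Fig. 4)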

Fig. 3 Comparison of fitted values based on the original data set with \(n=50{,}000, d=50, \varsigma =10\) and a CW sketch with \(\varepsilon =0.1\). Darker shades of black stand for more observations

Fig. 4 Difference of fitted values according to models based on the respective sketching methods and fitted values according to model based on original data set with \(n=50{,}000, d=50, \varsigma =10\)

We have also generated additional data that has not been used in learning the model and employed the posterior mean to predict y-values for these data. The results are very similar to those described above and presented in Figs. 3 and 4. Sketching, again, introduces a little more variation, depending on \(\varepsilon \). More formal treatment of prediction accuracy based on similar sketching techniques for the OLS solution is given in Raskutti and Mahoney (2015).

5.6 Comparison of posterior distributions

As we have conducted Bayesian regression, the model consists not only of a mean value, but of a whole posterior distribution for each parameter. Figure 5 contains two exemplary boxplots of MCMC samples representing marginal posterior distributions. The original data set contains \(n=50{,}000\) observations and \(d=50\) variables and has an error standard deviation of \(\varsigma =5\). The medians of the MCMC samples based on the original data set are well represented by the MCMC samples based on sketches. Even though the median based on an embedded data set might be higher or lower for certain parameters, we did not find any systematic biases. The embedding introduces additional variation, which depends on the value of the approximation parameter \(\varepsilon \), but does not seem to be influenced by the choice of sketching method.

Fig. 5 Boxplots of MCMC sample for two parameters based on data set with \(n=50{,}000, d=50, \varsigma =5\) and respective sketches

In regression, a common task is the identification of important variables by means of variable selection. In a Bayesian setting, this can be done via credible intervals, among other approaches. Our results indicate that the identification of important variables is possible quite accurately based on the resulting approximate models. However, one has to take the additional variation into account. For example, when using 95% credible intervals as the criterion, one should not simply compare the endpoints of the credible interval to a fixed reference value \(\mu \). Instead, the extra variation in the posterior distribution and also possible small shifts of the mean and median should be taken into account. We therefore recommend using smaller values of \(\varepsilon \) when aiming at variable selection (see Fig. 5).
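The basic interval criterion itself can be computed directly from the MCMC sample, as in the following R sketch with hypothetical object names; per the discussion above, some slack for the sketch-induced variation should be allowed before declaring a variable unimportant.

# 95% credible intervals from an MCMC sample 'draws' (iterations x parameters),
# e.g. as returned by rstan::extract, and the naive selection criterion.
ci <- apply(draws, 2, quantile, probs = c(0.025, 0.975))
selected <- ci[1, ] > 0 | ci[2, ] < 0   # zero lies outside the interval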

5.7 Comparison of running times

One of our aims is to make Bayesian regression feasible on very large data sets. After ensuring that the results are close to those obtained by the analysis on the original data set, we now consider the running time required. The total running time is composed of the time required for the analysis and the time required for the necessary preliminary steps: reading the data set into memory and calculating the sketch. Table 5 contains the running times required for the Bayesian regression analysis (four rightmost columns headed “Analysis”) and the running times for the preliminary steps (columns headed “Preprocessing”). For the original data sets, the preliminary steps consist of the time required for reading the data set into memory; for all other cases, Table 5 gives the running times required for constructing the sketch. The total running time of an analysis on a sketch is obtained by summing up the reading time, the sketching time, and the time required for the Bayesian linear regression analysis.

Table 5 Running times for data sets with \(d = 50\)

Table 5 suggests that both the reading times and the sketching times are stable for data sets of the same size, with the possible exception of an outlier for \(n=1{,}000{,}000\) and \(\varsigma =2\). Both the reading times and the sketching times grow with the size of the data set. The sketching times grow for smaller values of \(\varepsilon \). For RAD sketches with \(\varepsilon =0.1\), the sketching takes longer than the reading of the data set; for all other combinations the opposite is true. CW sketches require the shortest running time of the three sketching methods.

Although only a few of the original data sets could be analyzed, the “Analysis” values in Table 5 indicate that the running times for the Bayesian analysis increase with the number of observations in the data set. The running times for the sketched data sets, on the other hand, show no systematic increase for growing values of n. There is some variation in the running times, but this appears to be random fluctuation rather than a trend. Larger values of \(\varepsilon \) lead to shorter running times in the analysis, which indicates that the trade-off between time and goodness of the approximation is present both in calculating and in analyzing the sketch. The running time of the analyses is similar for all three sketching methods. However, the running times do seem to depend on the value of \(\varsigma \): for higher standard deviations of the error term, the required running time tends to decrease.

Figure 6 exemplarily shows the total running times in minutes for data sets with \(d=50\) and \(\varsigma =5\). This comprises reading, sketching (if applicable) and analyzing the data set. For the sketches, \(\varepsilon =0.1\) was used. To illustrate the potential improvement with increasing n, Fig. 6 contains the total running times for \(n=50{,}000\) and \(n=100{,}000\) side by side. For the original data sets, we observe a large increase in running time as n doubles, whereas the running times on the sketched data sets hardly change.

Fig. 6 Total running times in minutes for data sets with \(n\in \{50{,}000, 100{,}000\}, d=50, \varsigma =5\) and approximation parameter \(\varepsilon =0.1\). For the sketched data sets, the total running time consists of the time for reading, sketching and analyzing the data set. For the original data set, the sketching time is 0 since this step is not applied

All results presented so far are based on data sets with \(d=50\) variables. We have conducted Bayesian analyses on data sets with \(d=100\) with the same parameter values for n and \(\varsigma \). The findings from these simulations are similar to those for \(d=50\). One exception is that the analysis of CW sketches now takes longer than the analysis of the respective original data sets for \(n=50{,}000\). This changes for larger data sets. One strength of the CW method, however, is its efficiency when dealing with streaming data, provided the number of variables is not too large. Recall that its dependency on d is quadratic, which means that the sketch sizes are already very large for medium-dimensional problems with \(d\approx 500\) or \(d\approx 1000\). Setting the target dimension to a lower value is not a remedy, because this results in a weaker approximation guarantee, as also noticed by Yang et al. (2015). In that case one should consider using one of the slower and denser sketches.

5.8 Streaming example and concluding remarks

The data sets considered so far were chosen to be small enough to allow for Bayesian analysis on the full data set in convenient time. This is far from what might be called Big Data. To show that our approach is suitable for analyzing really large amounts of data, we have generated and simultaneously sketched a data set. The generation of the data followed the same rules as for the other simulated data sets, but with \(\varsigma = 0.1\). The data set's original size is \(10^9\times 100\) double precision values. This corresponds to about 2 TB in CSV format or at least 750 GB in a binary representation. We have used the CW sketching method, resulting in a sketch of size \(65{,}536\times 100\), which requires only around 140 MB of space in CSV format and fits into RAM. Clearly, we cannot compare the results to those on the original data set, but the sum of squared distances between the true mean values of \(\beta \) and the posterior mean of the model amounts to \(3.741\,\times \,10^{-6}\). The Bayesian regression analysis took 2781 minutes.
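A minimal R sketch of such a generate-and-sketch loop is given below; it uses the hashing-and-signs form of the CW embedding (each incoming row is added, with a random sign, to one random row of the sketch), and the block sizes and the stand-in generation step are our own choices.

# Streaming CW-style sketching (R): data are generated blockwise and immediately
# folded into the sketch, so the full data set is never held in memory.
set.seed(99)
k <- 65536; d <- 100                # sketch rows; columns of the data block [X, Y]
sketch <- matrix(0, k, d)
for (b in 1:1000) {                 # blocks standing in for the data stream
  block <- matrix(rnorm(1000 * d), 1000, d)          # replace by the real generation step
  h <- sample.int(k, nrow(block), replace = TRUE)    # target sketch row per observation
  s <- sample(c(-1, 1), nrow(block), replace = TRUE) # random signs
  for (j in seq_len(nrow(block)))
    sketch[h[j], ] <- sketch[h[j], ] + s[j] * block[j, ]
}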

In some cases the algebraic structure might strongly depend on a few observations or variables. In such situations it is important to identify these or to retain their contribution in the reduced data set. So far, our model assumptions did not suffer from such ill-behaved situations, but now we assess the performance of our method in this case. We construct data sets where an important part of the target variable falls into a subspace that is spanned by a small constant number of observations. Uniform random subsampling will pick these only with probability \(O(\frac{1}{n})\). Oblivious subsampling techniques in general will have trouble identifying the important observations. In contrast, oblivious subspace embedding techniques preserve these effects with high probability. This effect is observed in practice even when comparing one sketch against the best of 1000 repetitions of uniform random subsampling.

In conclusion, the simulation study indicates that our proposed method works well for simulated data sets, which are generally well-suited for conducting Bayesian linear regression. Even with a high variance of the error term (and thus a relatively bad model fit), our proposal leads to results similar to those one would obtain on the original data set. The running time of the analysis with the proposed sketches is largely independent of n, giving advantages for very large n. Since the data can be read in sequentially while the embedding is computed, it is not necessary to load the whole data set into memory at once, which reduces the required memory.

For CW embeddings, reading the data in and calculating the sketch takes only marginally longer than reading the data alone. In our experiments, we found that reading in and sketching takes around 1.01 to 1.04 times as long. This factor is typically higher for small data sets and lower for larger data sets (cf. Table 5). However, when the number of variables is large, it may be favorable to use SRHT, whose sketching time is only slightly larger but whose embedding dimension is considerably smaller.

6 Real data example

As a real data example, we consider the bike sharing data set taken from Fanaee-T and Gama (2014), which is available in the UCI Machine Learning Repository (Lichman 2013). This is only meant as an exemplary application of the methods to a real data scenario and should not be mistaken for a complete statistical analysis of the data set. The bike sharing data set contains the number of rental bike users per hour over two years as well as additional information about the day and the weather. See Table 6 for an overview of the variables we use in the model. The data set contains some additional variables we do not employ, because they are highly correlated with the variables present in the model. We also made a change to the factor levels of the variable weathersit. In the original data set, this variable contains 4 levels. The fourth level is present only 3 times out of a total of \(n=17{,}379\) hours in the data set. To avoid any problems with such an underrepresented level, we combine levels 3 and 4 to obtain a factor with 3 levels. The original levels 3 and 4 stand for “light rain” and “heavy rain”. The new level 3 can easily be interpreted as the presence of rain. For a more detailed description of all variables, please refer to the data set's web page on the UCI Machine Learning Repository.

Table 6 Variables from the bike sharing data set used in the model

The variable cnt contains the number of rental bikes used per hour and is thus a count variable. However, there are around 850 distinct values, which makes analyzing cnt as a continuous variable reasonable. When analyzing such variables with a linear regression model, transforming them using the square-root is a common procedure. After the transformation, the values of cnt show some bi-modality, but fit reasonably well to the assumption of a normal distribution.

We use the transformed variable cnt as Y-variable and all other variables in Table 6 as X-variables. To handle the factor variables, we use the R-function model.matrix to create a design matrix, which is then passed on to RaProR and rstan. The resulting design matrix contains \(n=17{,}379\) observations and \(d=39\) variables plus the intercept. Again, we calculate sketches for all three methods and with two different settings of \(\varepsilon \). Because of the size of the data set relative to the number of variables, we choose \(\varepsilon =0.15\) and \(\varepsilon =0.2\) for the RAD and SRHT sketches. For CW, we choose values of k that are closest to the target dimension of the other sketches. This results in the values given in Table 7.
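A condensed version of this preprocessing could look as follows; the data frame and the UCI column names are assumptions on our part, and the categorical variables are taken to be factors with weathersit already recoded to three levels as described above.

# Illustrative design-matrix construction (R) for the bike sharing data.
X <- model.matrix(~ season + yr + hr + holiday + weekday + weathersit +
                    atemp + hum + windspeed, data = bike)
Y <- sqrt(bike$cnt)   # square-root transform of the hourly counts
# X and Y can then be sketched (e.g. via RaProR) and passed on to rstan.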

Table 7 Number of observations of the sketches for the bike sharing example. Different values of \(\varepsilon \) are used for RAD and SRHT sketches; the target dimension of CW sketches is chosen to be the power of two closest to the size of the RAD and SRHT sketches

The Bayesian model based on the original data set suggests that all mentioned variables are important for modeling the number of bikes used per hour. Figure 9 in the appendix gives an overview of the posterior distributions for the model based on the original data set. As one might expect, the weather has a strong influence. More bikes are rented when the apparent temperature is high and, to a lesser extent, when the humidity and the wind speed are low. In clear or partly cloudy weather, the number of rented bikes is highest, but the negative influence of heavier clouds is comparatively small. Rainy weather, however, reduces the number of bike users more substantially. In addition to that, fall seems to be the most popular season for bike sharing. Spring and summer also have positive effects in comparison to winter, but the effect sizes are smaller. This might seem surprising at first, especially as the number of rental bike users is highest in summer. This might partly be an effect of the apparent temperature, which is generally higher in summer.

There is also a distinct hourly effect. During night time, especially between midnight and 5am, the number of rented bikes is greatly reduced. On the other hand, between 7am and 9pm, a lot of bikes are used, with two peaks at 8am and 5pm/6pm. This might indicate that the service is used by people commuting to and from work. Holidays – which only include days that would otherwise be working days, i.e. Monday to Friday – have a negative influence on the number of rental bike users. When taking the days of the week into account, Friday and Saturday have the highest positive effect while Sunday seems to be the least popular day. All of these day-based effects have a small influence compared to the variables mentioned before. Lastly, the variable yr also has a positive influence, which indicates a positive trend for this bike sharing service.

All conclusions can similarly be derived from the models based on the sketches. Following our approach in Sect. 5, we first compare the resulting posterior mean values of \(\beta \). Table 8 shows the sum of squared distances between posterior mean values of the original model and models based on the three sketching methods, using two values of \(\varepsilon \) each. There is a general increase in the sum of squared distances for \(\varepsilon =0.2\) compared to \(\varepsilon =0.15\), but the amount differs depending on the sketching method. This should not be over-interpreted, however. As these values represent only one realization of a random subspace embedding, there is no evidence for systematic differences.

Table 8 Sum of squared distances between posterior mean values of the original model and models based on the respective sketches for the bike sharing data set

To see the effects of the differences in the posterior means of \(\beta \) on the level of the y-variable, which is the number of rental bikes used, we compare the fitted values as in Sect. 5.5. Again, we multiply the original design matrix X with the posterior means of \(\beta \), where the posterior mean values are obtained from the model on the original design matrix and the models on the respective sketches. Figure 7 contains the six resulting boxplots. As in the simulation study, the differences of the fitted values are centered around zero with only small deviations. Further analysis indicates that the higher deviations occur when the number of bikes used is high, which means that the majority of differences in the fitted values are small relative to the observed value for that data point.

Fig. 7 Difference of fitted values according to models based on the respective sketching methods and fitted values according to model based on the original bike sharing data set

Fig. 8 Boxplots of MCMC sample for the two weather situation parameters based on the original data set and all sketches

As a last step of our short analysis of the bike sharing data set, we concentrate on the posterior distributions of the factor that takes the weather situation into account. This factor has three levels in our data set. The first level stands for clear weather, which also includes partly cloudy hours; the second level stands for heavier clouds or mist; the third level models rainy weather, which includes light rain, heavy rain, thunderstorms, and snow. The different levels occur with different relative frequencies: around 66 % of the observed hours fall into level 1, 26 % into level 2 and 8 % into level 3.

Figure 8 shows boxplots of the MCMC samples based on the original design matrix and the sketches. The values represent the marginal posterior distributions of the two dummy variables associated with the variable weathersit. The scales of the boxplots are chosen such that one unit is of the same length on both y-axes. This allows for easy comparison of the variation in the two posterior distributions. Again, we can see that the embedding introduces additional variation, which depends on the value of the approximation parameter \(\varepsilon \), but does not seem to be influenced by the choice of the sketching method. In addition, the posterior distributions for the factor “heavy clouds” show less variation compared to the posterior distributions for “rain”. This is as one might expect, as the number of occurrences is more than three times higher for “heavy clouds” and a larger number of observations tends to reduce the uncertainty. Nonetheless, it is interesting to observe that the variation introduced by the embedding appears to be roughly proportional to the variation present in the original model.

While the effect of the factor “rain” is undoubtedly negative according to the original model and all sketches used here, the effect of factor “heavy clouds” is close to zero. In the original model, “heavy clouds” would also be seen as an influential factor when using the 95 % credible interval as a criterion. The conclusion is the same for all sketching methods when using \(\varepsilon =0.15\). However, when using \(\varepsilon =0.2\) and CW, “heavy clouds” would be seen as not influential. This stresses again that the endpoints of credible intervals based on sketches exhibit some additional variation and inference based on them may change, depending on the variation in the original model and the choice of \(\varepsilon \). If variable selection is a focus of the regression analysis, we recommend choosing reasonably small \(\varepsilon \) (cf. Fig. 5 and its discussion in Sect. 5.6).

This example underlines that our method also works well on real world applications when the original data follows the model assumptions reasonably well.

7 Conclusion

Our paper deals with random projections as a data reduction technique for Bayesian regression. We have shown how projections can be applied to compress the columnspace of a given data matrix with only little distortion. The size of the reduced data set is independent of the number n of observations in the original data set. Therefore, subsequent computations can operate within time and space bounds that are also independent of n, regardless of which algorithm is actually used. While our focus was on MCMC and the No-U-Turn-Sampler in particular, we tried INLA as well and observed a considerable reduction in running time while achieving very similar results. However, our proposed reduction method is not limited to these approaches, making it highly flexible.

The presented embedding techniques allow for fast application to the data set and do not need the embedding matrices to be stored explicitly. Thus, only very little memory is needed while sketching the data. Furthermore, we have surveyed their useful properties when the computations are performed in sequential streaming as well as in distributed environments. These scenarios are highly desirable when dealing with Big Data (cf. Welling et al. 2014).

We consider the situation where the likelihood is modeled using standard linear regression with a Gaussian error term. We show that the likelihood is approximated within small error. Furthermore, if an arbitrary Gaussian distribution is used as prior distribution, the desired posterior distribution is also well approximated within small error. This includes the case of a uniform prior distribution over \(\mathbb {R}^d\), an improper, non-informative choice that is widely used (cf. Gelman et al. 2014). We also show the results to be \((1+O(\varepsilon ))\) approximations of the distributions of interest in the context of Bayesian linear regression. As the structure of both mean and variance is preserved up to small errors, approximate sufficient statistics for the posterior distributions are well-recovered. This gives the user all the information that is needed for Bayesian regression analysis.

In our simulation experiments, we found that the approximation works well for simulated data sets. All three sketching methods we considered lead to results that are very similar to Bayesian regression on the full data set and to the true underlying values. The running time for the MCMC analysis based on the sketches is independent of the number of observations n. The calculation of the embedding does depend on n, but requires little more time than reading the data alone. This is especially true when using the CW method. For larger dimensions, however, a CW embedding can become too large. In such a case, the denser SRHT construction also performs very well and is preferable because of its lower dependency on d. RAD has an even lower dependency on d but takes considerably more time to calculate.

We have applied the methods to a bike sharing data set by Fanaee-T and Gama (2014). The approximation also works well on this real data example, giving very similar results to the original Bayesian regression, while adding little additional variation to the posterior distributions.

For future research, we would like to generalize our results to other classes of distributions for the likelihood and to more general priors. As a first step, we have used hierarchical models involving normal, Gamma and exponential distributions as hyperpriors. For normal and Gamma distributions, the results seem promising, whereas using exponential distributions seems more challenging. The recent results on frequentist \(\ell _p\) regression of Woodruff and Zhang (2013) might give rise to efficient streaming algorithms also in the Bayesian regression setting. Another interesting direction would be to consider Gaussian mixtures, since they allow one to approximate any continuous distribution.

In real-world applications one might exploit domain-specific structure, for instance when the data itself is of low rank or allows for sparse solution vectors, to further reduce the time and space bounds.