We have identified four main approaches used to obtain coresets. This is not meant to be an absolute classification of algorithms, but rather an overview of what techniques are currently available and an illustration of how we may apply them.
3.1 Geometric Decompositions
Assume that we are given a point set
A in Euclidean space. A simple way to reduce the size of
A is to compute a discretization
S of the space, snap each point of
A to its nearest neighbor in
S, and use the (weighted) set
S to approximate the target function. Many of these decompositions are based on packing arguments and range spaces. We will illustrate one approach via the
k-means problem, based on the papers by Har-Peled and Mazumdar [
58] and Fichtenberger et al. [
51].
We first take a closer look at the objective function for Euclidean
k-means. Let
\(\text {OPT}\) be the value of the optimum solution. We first compute a 10-approximation
\(C'\) to the
k-means problem, which can be done in polynomial time [
68].
Consider now a sequence of balls with exponentially increasing radius centered around each point of \(C'\), starting at radius \(\frac{1}{n}\cdot \text {OPT}\) and ending at \(10\cdot \text {OPT}\). Our discretization will consist of a suitably scaled \(\varepsilon\)-ball-cover of each ball.
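To make the construction concrete, the following Python sketch snaps each point to an exponentially scaled grid around its closest approximate center and returns the weighted set of occupied cell midpoints. The function name, the grid-based realization of the ball-covers, and all constants (e.g., the cell width \(\varepsilon \cdot 2^j\cdot \text {OPT}/n\) at level j) are our own illustrative choices, not the exact parameters of [58] or [51].

```python
import numpy as np
from collections import Counter

def kmeans_coreset_snap(A, centers, opt_est, eps):
    """Sketch: snap each point of A to a grid whose cell width scales with
    the point's distance to its closest approximate center; weights count
    how many points landed in each cell (constants are illustrative)."""
    # assign each point to its closest center of the 10-approximation C'
    assign = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    r_min = opt_est / len(A)                       # innermost radius OPT/n
    cells = Counter()
    for x, j in zip(A, assign):
        d = max(np.linalg.norm(x - centers[j]), r_min)
        lvl = int(np.ceil(np.log2(d / r_min)))     # index of the enclosing ball
        width = eps * r_min * 2 ** lvl             # cell width grows with the radius
        g = tuple(np.floor((x - centers[j]) / width).astype(int))
        cells[(j, lvl, g)] += 1                    # weight = number of snapped points
    S, w = [], []
    for (j, lvl, g), cnt in cells.items():
        width = eps * r_min * 2 ** lvl
        S.append(centers[j] + (np.array(g) + 0.5) * width)  # cell midpoint
        w.append(cnt)
    return np.array(S), np.array(w)
```

To prove correctness, we require the following generalization of the triangle inequality.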
We now show that the cost of using the discretization is bounded.
We can reapply this analysis for each center of the 10-approximation
\(C'\). Observe that the cost of points
\(A_c\) assigned to a center
\(c\in C'\) is
\(\sum _{x\in A_c}\Vert x-c\Vert ^2\), which is equal to
\(\sum _{x\in A_c}\Vert x\Vert ^2\) if we treat
c as the origin. Combining this lemma with Lemma
2 and rescaling
\(\varepsilon\) shows that
S is a coreset.
Bibliographic Remarks Algorithms based on a discretization and then snapping points to the closest point of the discretization are particularly common for extent approximation problems such as maintaining
\(\varepsilon\)-kernels, the directional width, the diameter, and the minimum enclosing ball problem. In fact, the notion behind coresets was introduced in the context of these problems in the seminal paper by Agarwal et al. [
4]. Since then, a number of papers have progressively reduced the space bound required to maintain coresets [
8,
23,
24,
99], with Arya and Chan giving an algorithm that stores
\(O(\varepsilon ^{-(d-1)/2})\) points [
14] in
d-dimensional Euclidean space. We will briefly outline in Sect.
5 that this space bound is indeed optimal.
The geometric approach was also popular when coresets were introduced to
k-median and
k-means clustering and their generalizations; see for instance [
44,
51,
52,
57,
58]. Due to the exponential dependency on the dimension inherent in all known constructions, the focus later shifted to sampling. Nevertheless, geometric arguments are necessary in the correctness proofs of almost all algorithms, and are present in all the other examples presented in this survey. For a more extensive and complete overview of purely geometric algorithms for coreset construction, we recommend the survey by Agarwal et al. [
5].
3.2 Gradient Descent
A popular, though often only implicitly used, technique for building coresets derives from convex optimization. We begin with an example where a simple application of the well-known sub-gradient method yields the desired coreset. Consider the problem of finding the smallest enclosing ball of a set of points, which is equivalent to the 1-center problem in Euclidean space.
A standard approach for minimizing the convex function
f is to start at an arbitrary point
\(c_0\) and to perform iterative updates of the form
\(c_{i+1}=c_i - s\cdot g(c_i)\) where
\(s\in \mathbb {R}^{>0}\) is an appropriately chosen step size and
\(g(c_i)\in \partial f(c_i)\) is a sub-gradient of
f at the current point
\(c_i\). We will refer to this as the
sub-gradient method. The method was formally developed by several mathematicians in the 1960s; see [
83,
90,
92] for details and historical remarks. Now, consider the following theorem regarding the convergence behavior of the sub-gradient method.
From Theorem
2 we learn that if our function has a small Lipschitz constant and our starting point is not too far from the optimal solution, then the best center found within a small number of iterations of the sub-gradient algorithm has a radius close to the optimal radius of the smallest enclosing ball. To assess the parameters more closely, we begin by finding a sub-gradient at each step. It is easy to see that for any
\(p_{\max }\in \mathop {\mathrm{argmax}}\nolimits _{p\in P} \Vert c-p\Vert\) attaining the maximum distance
\(g(c) = \frac{c-p_{\max }}{\Vert c-p_{\max }\Vert }\) is a sub-gradient of
f at
c, i.e.
\(g(c)\in \partial f(c)\). Also note that
\(c-p_{\max }\ne 0\) unless
P is a singleton, in which case the problem is trivial since the only input point is the optimal center. Thus, in all interesting cases,
g(
c) is well-defined. Also note that
g(
c) is by definition a normalized vector, i.e.,
\(\Vert g(c)\Vert =1\) which implies that
f is
L-Lipschitz with
\(L=1\), since by definition of a sub-gradient and applying the Cauchy-Schwarz inequality (CSI) we have
$$\begin{aligned} f(c)-f(x)\le g(c)^T(c-x)\overset{\tiny \mathrm{CSI} }{\le } \Vert g(c)\Vert \Vert c-x\Vert \le 1\cdot \Vert c-x\Vert . \end{aligned}$$
This leads to the following corollary.
Now we would like to argue that the set of points S that the sub-gradient algorithm selects in each iteration (the furthest point from the current center) forms a (weak) coreset for the smallest enclosing ball of P. We will need the following well-known fact for our further analysis.
Using this lemma we can argue that any \((1+\varepsilon )\)-approximate center must be within distance \(\sqrt{\varepsilon } f^*\) of the optimal center.
Now we are ready to prove the main result, namely that after running \(O(\frac{1}{\varepsilon ^4})\) iterations, S is indeed a weak \(\varepsilon\)-coreset for the smallest enclosing ball of P.
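For concreteness, here is a minimal Python sketch of the iteration that collects S. The starting point, the diminishing step size, and the exact iteration count are illustrative choices consistent with the analysis above; this is not the tuned algorithm of [15, 16].

```python
import numpy as np

def seb_weak_coreset(P, eps):
    """Sub-gradient method on f(c) = max_p ||c - p||, collecting the
    furthest point selected in each of O(1/eps^4) iterations (a sketch;
    constants are not optimized)."""
    P = np.asarray(P, dtype=float)
    c = P[0].copy()                        # arbitrary starting center
    S, best_c, best_f = [], c.copy(), np.inf
    for i in range(1, int(np.ceil(eps ** -4)) + 1):
        dist = np.linalg.norm(P - c, axis=1)
        j = int(dist.argmax())             # p_max attains f(c)
        if dist[j] == 0:                   # P is a singleton: trivial case
            return P.copy(), c
        S.append(P[j])
        if dist[j] < best_f:               # remember the best center seen so far
            best_f, best_c = dist[j], c.copy()
        g = (c - P[j]) / dist[j]           # unit-norm sub-gradient, so L = 1
        c = c - g / np.sqrt(i)             # one standard diminishing step size
    return np.unique(np.array(S), axis=0), best_c
```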
Bibliographic Remarks The presented result is rather weak compared to the optimal coreset size of
\(\lceil \frac{1}{\varepsilon }\rceil\), cf. [
16]. However, we chose to present the method in this way because it shows that the gradient method in its basic form already yields constant-size coresets that depend neither on the number of input points nor on their dimension. We therefore see this as a generic method for dimensionality reduction and for the design of coresets for computational problems. Putting more effort into geometrically analyzing every single step of the gradient algorithm, i.e., leveraging more problem-specific structure in the proof of Nesterov's theorem (Thm. 2, cf. [
83]), can lead to even better results, which we survey next. Badoiu and Clarkson show in their short and elegant seminal work [
15] that their gradient algorithm converges to within
\(\varepsilon f^*\) error in
\(O(\frac{1}{\varepsilon ^2})\) iterations, similar to Corollary
1. Their analysis is stronger in the sense that it implicitly shows that the points selected by the algorithm form an
\(\varepsilon\)-coreset of size
\(O(\frac{1}{\varepsilon ^2})\). Such a result was obtained before in [
17] by picking a point in each iteration which is
far enough away from the center of the smallest enclosing ball of the current coreset. Taking the
furthest point instead yields the near-optimal bound of
\(\frac{2}{\varepsilon }\) from [
15]. The complexity was settled with matching upper and lower bounds of
\(\lceil \frac{1}{\varepsilon }\rceil\) in [
16]. The smallest enclosing ball problem is far from the only one for which gradient methods can help. For instance, Har-Peled et al. [
60] have used similar construction methods to derive coresets for support vector machine training. Clarkson [
29] has generalized the method by unifying and strengthening the aforementioned results (among others) into one framework for coresets, sparse approximation and convex optimization. A probabilistic sampling argument was already used in [
17] as a dimensionality reduction technique for the 1-median problem, which led to the first linear-time approximation algorithms for the
k-median problem in Euclidean and more general metric spaces [
3,
69]. The method itself is closely related to stochastic sub-gradient methods where uniform sub-sampling is used for obtaining an unbiased estimator for the actual sub-gradient in each step [
92]. Similar approaches are currently under study to reduce the exponential dependency on the dimension for the probabilistic smallest enclosing ball problem [
48].
3.3 Random Sampling
The arguably most straightforward way to reduce the size of a dataset is simply to pick as many points as permissible uniformly at random. Though it is rare for an algorithm to produce a coreset in this straightforward fashion, we will see that for outlier-resistant problems this approach can already provide reasonable results. In the following we will examine the geometric median, which is one of the few problems for which uniform sampling gives any form of guarantee. Thereafter, we will show how to obtain strong coresets using more involved sampling distributions.
We will show that uniform sampling is sufficient to obtain a weak-coreset guarantee. The exposition for uniform sampling follows Thorup [
94] and Ackerman et al. [
3].
We wish to use Lemma
6 to apply a union bound on all candidate points. In a finite metric consisting of
n points, we can set
\(|S|\ge 144\varepsilon ^{-2}\log n\) and are done. In the continuous setting, this is not as straightforward. We will use a suitably scaled ball-cover and argue that this already suffices to prove the claim for all points.
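A minimal sketch of the finite-metric case, using the sample size quoted above; weighting each sample by n/|S| makes the weighted sample cost an estimate of the true cost (function names are ours).

```python
import numpy as np

def median_uniform_sample(P, eps, rng=np.random.default_rng(0)):
    """Uniform sample of size 144 * eps^-2 * log(n), each point weighted
    by n/m so the weighted sample cost estimates sum_x ||x - c||."""
    n = len(P)
    m = min(n, int(np.ceil(144 * eps ** -2 * np.log(n))))
    S = P[rng.choice(n, size=m, replace=True)]
    return S, np.full(m, n / m)

def median_cost(Q, w, c):
    # weighted 1-median cost: sum_i w_i * ||q_i - c||
    return float(np.sum(w * np.linalg.norm(Q - c, axis=1)))
```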
Obviously, the uniform sampling approach cannot yield a strong coreset even for the geometric median problem. The reason is that the cost can depend on a small constant number of input points, even in one dimension. Given any input
P of size
\(n-2\), we can place two additional points
\(p_{\min } < \min P\) and
\(p_{\max }> \max P\) arbitrarily far from the points of
P without affecting the median. Clearly, any strong coreset must contain these points to preserve the cost for all possible centers, but the probability of sampling one of them is only
\(\frac{2}{n}\). This implies that a strong coreset based on uniform sampling must have linear size. To overcome this limitation, Langberg and Schulman [
71] introduced the notion of sensitivity for families of functions. For simplicity of presentation, we consider only the special case of the geometric median problem. For a much more general view on sensitivity sampling for numerous other functions the reader is referred to the original literature [
45,
71].
Informally, the sensitivity of a point measures how important it is for preserving the cost over all possible solutions. Note that from the definition
$$\begin{aligned} \Vert x-c\Vert \le s(x) \sum \limits _{p\in P} \Vert p-c\Vert \end{aligned}$$
(2)
follows for all
\(x\in P\) and all centers
\(c\in \mathbb {R}^d\). The total sensitivity will turn out to be a crucial measure for the complexity of sampling from the distribution given by the sensitivities of the input points. We begin by describing the sampling scheme. The following importance (or weighted) sampling approach yields an unbiased estimator for the cost of any fixed center
\(f(P,c) = \sum _{x\in P} \Vert x-c\Vert\). We sample a point
\(x\in P\) from the distribution
\(q(x)=\frac{s(x)}{S(P)}\) and set
\(T=\frac{\Vert x-c\Vert }{q(x)}\). A straightforward calculation of its expected value
\(\mathbb {E}[T]=\sum _{x\in P} \frac{\Vert x-c\Vert }{q(x)}\cdot q(x)= \sum \limits _{x\in P} \Vert x-c\Vert\) shows that
T is unbiased.
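In code, the sampling scheme and its reweighting look as follows; averaging m independent copies of T, i.e., weighting each sample by \(1/(m\,q(x))\), keeps the estimator unbiased (a sketch, function names are ours).

```python
import numpy as np

def sensitivity_sample(P, s, m, rng=np.random.default_rng(0)):
    """Draw m points from q(x) = s(x)/S(P) and weight each by 1/(m*q(x)),
    so the weighted cost is an unbiased estimator of sum_x ||x - c||
    for every fixed center c (cf. the calculation of E[T] above)."""
    q = np.asarray(s, dtype=float)
    q = q / q.sum()                        # the distribution q(x) = s(x)/S(P)
    idx = rng.choice(len(P), size=m, p=q)
    return P[idx], 1.0 / (m * q[idx])      # sampled points and their weights
```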
Next we bound its variance, which is a crucial parameter in deriving concentration results for our sampling procedure.
In the next lemma we will see why the total sensitivity plays such an important role for random sampling proportional to the sensitivities.
Lemma
8 shows that the number of samples we need depends linearly on the total sensitivity and logarithmically on the number of centers for which we need a guarantee. First we want to bound the total sensitivity for the geometric median problem. If for every input point there is some center such that the point's contribution to the cost is large, say bounded below by a constant, then we are lost, since in that case the total sensitivity sums up to
\(\varOmega (n)\). The next lemma shows that this cannot happen. Actually it turns out that the total sensitivity can be bounded by a constant.
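A hedged sketch of how such upper bounds are typically computed from an (approximately) optimal center \(c^*\); the exact constants of the lemma in the text may differ.

```python
import numpy as np

def median_sensitivity_bounds(P, c_star):
    """Upper bounds on the 1-median sensitivities: by the triangle
    inequality one can show s(x) <= ||x - c*|| / f(P, c*) + 2/n, which
    sums to at most 3, so the total sensitivity is a constant."""
    n = len(P)
    dist = np.linalg.norm(P - np.asarray(c_star), axis=1)
    return dist / dist.sum() + 2.0 / n     # the term dist.sum() is f(P, c*)
```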
Now we know that the total sensitivity contributes only a constant factor to our sampling complexity. Before we move to the main result of this section, we will need another technical lemma. It is not immediately clear that we can use the triangle inequality for our samples, due to reweighting. We will therefore establish a relaxation of the triangle inequality for the entire weighted subsample.
It remains to bound the number of centers for which we will need a guarantee. This seems impossible at first glance, since the strong coreset property is required to hold for every center
\(c\in \mathbb {R}^d\), a set of infinite cardinality. In the following we will see that, again, it is sufficient to consider a ball of appropriate radius centered at the optimum and decompose it by an
\(\varepsilon\)-ball-cover. Lemma
8 is applied to these points only. Now, for every center inside the ball, there is a point of the cover close to it for which the coreset guarantee holds. For the other centers we can argue that they are so far from the optimum center that their distance dominates the cost for both the original point set and the coreset. This will establish the coreset property for every possible center.
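Combining the two sketches above gives the following hedged end-to-end outline; the precise sample bound of Lemma 8 is not reproduced here, so the formula for m is only indicative (total sensitivity times \(\varepsilon ^{-2}\) times the logarithm of the cover size).

```python
import numpy as np

def median_strong_coreset(P, c_star, eps):
    """End-to-end outline building on the two sketches above (a sketch;
    the sample size formula is indicative, not the exact Lemma 8 bound)."""
    n, d = P.shape
    s = median_sensitivity_bounds(P, c_star)   # sensitivity upper bounds
    log_cover = d * np.log(1 + 2 / eps)        # log of the eps-ball-cover size
    m = int(np.ceil(s.sum() * eps ** -2 * log_cover))
    return sensitivity_sample(P, s, m)         # weighted coreset candidate
```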
Bibliographic Remarks The arguably first paper to use random sampling for coreset construction is due to Chen [
27]. Like the result on
k-means described in Sect.
3.1, he considered balls of exponentially increasing radii. Instead of using a ball-cover, he showed that a uniform random sample suffices to approximate the cost and thereby gave the first coresets for
k-means and
k-median with polynomial dependency on
k,
d,
\(\varepsilon\), and
\(\log n\). The sensitivity sampling approach was introduced by Langberg and Schulman [
71] and further improved by Feldman and Langberg [
45] and Braverman, Feldman and Lang [
22].
The technique is now one of the crown jewels of coreset construction, giving the best currently available parameters for a wide range of problems including
\(\ell _p\) regression,
k-means and
k-median clustering, and low-rank subspace approximation. It has provided a unified framework for several importance sampling approaches from the early literature on sampling-based sublinear approximation algorithms and coresets. One of the first such works is due to Clarkson [
28] on
\(\ell _1\) regression. After a series of studies on sampling based randomized linear algebra [
36‐
38], the approach could also be adapted for
\(\ell _2\) regression [
40,
41] and was later generalized in [
34] to
\(\ell _p\) regression for
\(p\in [1,\infty )\). In accordance with the statistical meaning of the sampling weights, they were later called (statistical) leverage scores. It was an open problem proposed in [
40] to even approximate the
\(\ell _2\) leverage scores in less time than needed to solve the regression problem exactly. This problem was resolved in [
39] via the oblivious random projection approach we are going to detail in the next section. Low rank subspace approximation was often treated implicitly using the same approaches leading to weak coresets, cf. [
91] for a technical review. The first strong coresets of polynomial size for
\(\ell _p\) subspace approximation are due to Feldman et al. [
47] again achieved via weighted sampling techniques related to the notion of sensitivity. More recently, the sampling based methods from randomized linear algebra [
41] were leveraged to develop coresets in [
79] for dependency networks [
61] and Poisson dependency networks [
55].
3.4 Sketches and Projections
It is often useful to view a dataset as a matrix, where the ith row corresponds to the ith point. In the following, let us assume that we have n points in d-dimensional Euclidean space, i.e. we are given a matrix \(A\in \mathbb {R}^{n\times d}\).
A sketch of A is a linear projection obtained by multiplying A with a sketching matrix \(S\in \mathbb {R}^{m\times n}\) for some \(m\ll n\). Our goal is to design a sketching matrix such that SA retains the key properties of A. Many of the previous coreset constructions can also be viewed in terms of sketching matrices. For instance, a sampling algorithm can be viewed as choosing a diagonal matrix \(S\in \mathbb {R}^{n\times n}\) whose diagonal entries are 1 (or some weight) if the point was picked and 0 otherwise, in which case the row may as well be deleted.
We are more interested in oblivious sketching matrices, that is, the sketching matrix S can be constructed ahead of time without viewing the data. Though it might seem surprising that this yields any results, sketching is now regarded as one of the central tools for streaming. We will illustrate the power of oblivious sketching in the context of subspace embeddings and then show how it may be applied to linear regression.
The central ingredient is based around the seminal Johnson-Lindenstrauss lemma [
64].
The classic variant of the distributional Johnson-Lindenstrauss Lemma further states that an adequate distribution draws each entry of the matrix independently as a Gaussian with mean 0 and variance 1. Any distribution with mean 0 and variance 1 may be used, and nowadays it is more common to draw entries from the Rademacher distribution, where each entry is either 1 or \(-1\), each with probability 1/2. We will briefly discuss the advantages of various sketching matrices at the end of this section.
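A minimal sketch of such an oblivious sketching matrix with Rademacher entries; the scaling by \(1/\sqrt{m}\) gives \(\mathbb {E}[\Vert Sx\Vert ^2]=\Vert x\Vert ^2\) for every fixed vector x (function name ours).

```python
import numpy as np

def rademacher_sketch(A, m, rng=np.random.default_rng(0)):
    """Oblivious sketch S with i.i.d. +1/-1 entries scaled by 1/sqrt(m).
    S is drawn without looking at A; m = O(eps^-2 log(1/delta)) rows
    suffice to preserve the norm of any fixed vector up to (1 +/- eps)."""
    S = rng.choice([-1.0, 1.0], size=(m, A.shape[0])) / np.sqrt(m)
    return S @ A
```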
Lemma
11 now gives us a powerful dimension reduction tool for Euclidean spaces. Consider, for instance, the case where we are given
\(\eta\) points in an arbitrary number of dimensions. Each distance between two points can be represented by a vector and there are at most
\(\eta ^2\) such vectors. Using Lemma
11, we can achieve a
\((1\pm \varepsilon )\)-approximate embedding into
\(O(\varepsilon ^{-2}\log \eta )\) dimensions with probability at least 2/3 by setting
\(m\ge c\cdot \varepsilon ^{-2}\log (\eta ^2)\), where
c is some absolute constant.
For subspaces, the application of the union bound is not as straightforward as there are infinitely many vectors, even in a 1-dimensional subspace. We first observe that any vector
x may be rewritten as
\(\Vert x\Vert \cdot \frac{x}{\Vert x\Vert }\) and hence it is sufficient to consider vectors with unit norm. This solves the subspace approximation problem in a single dimension, but even in 2 dimensions we have infinitely many unit vectors. Instead, we show that applying the union bound on ball-covers is sufficient. Recall that there exists an
\(\varepsilon\)-ball-cover of size
\((1+2/\varepsilon )^d\) (cf. Lemma
1). Applying the union bound now gives us
\(m\ge \frac{c\log (1+2/\varepsilon )^d}{\varepsilon ^2} = O(d\varepsilon ^{-2}\log \varepsilon ^{-1})\). A slightly more detailed analysis will allow us to remove the
\(\log \varepsilon ^{-1}\) factor. The main argument is that we can write every vector as a linear combination of the vectors of a 1/2-cover. To see this, consider a 1/2-cover
\(\{x_0,\ldots ,x_{5^d}\}\) and an arbitrary unit vector
y. There exists some
\(x_i\) such that
\(\Vert y-x_i\Vert \le 1/2\). We then consider the vector
\(\Vert y-x_i\Vert x_j\) closest to
\((y-x_i)\). Since
\(\Vert \frac{1}{\Vert y-x_i\Vert }(y-x_i)-x_j\Vert \le 1/2\), we then have
\(\Vert (y-x_i)-\Vert y-x_i\Vert x_j\Vert \le 1/4.\) In each subsequent step we at least halve the length of the residual vector, so iterating this process indefinitely drives the length of the remaining vector \(y-\sum _t \alpha _t x_{i_t}\) to 0.
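In our notation, the residuals \(r_t\) of this process obey the following recursion, which makes the geometric series explicit:

$$\begin{aligned} r_0 = y - x_{i_0}, \qquad r_{t+1} = r_t - \Vert r_t\Vert \, x_{i_{t+1}}, \qquad \Vert r_{t+1}\Vert \le \tfrac{1}{2}\Vert r_t\Vert \le 2^{-(t+2)}, \end{aligned}$$

so that

$$\begin{aligned} y = \sum _{t\ge 0} \alpha _t x_{i_t} \quad \text {with } \alpha _0 = 1,\ \alpha _t = \Vert r_{t-1}\Vert \le 2^{-t}, \qquad \sum _{t\ge 0} \alpha _t \le 2. \end{aligned}$$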
Hence, there exists an oblivious subspace embedding with target dimension \(O(d/\varepsilon ^2)\). Now let us consider linear regression, where we aim to minimize \(\Vert Ax-b\Vert\) with \(A\in \mathbb {R}^{n\times d}\) and \(b\in \mathbb {R}^n\). Since all possible vectors \(Ax-b\) lie in a \((d+1)\)-dimensional subspace spanned by the columns of A and b, an oblivious subspace embedding is a coreset for linear regression.
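Put together, sketched linear regression is a few lines of code. The sketch below assumes a plain Rademacher matrix with \(m=O(d/\varepsilon ^2)\) rows; faster sketching matrices are discussed in the bibliographic remarks.

```python
import numpy as np

def sketched_regression(A, b, m, rng=np.random.default_rng(0)):
    """Solve min_x ||S(Ax - b)|| instead of min_x ||Ax - b||. If S is a
    subspace embedding for span(A, b), the sketched minimizer is a
    (1 + eps)-approximate solution of the original problem."""
    S = rng.choice([-1.0, 1.0], size=(m, A.shape[0])) / np.sqrt(m)
    x_tilde, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x_tilde
```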
Bibliographic Remarks Oblivious subspace embeddings were first introduced by Sarlos [
91] both as a means for solving regression and as means for computing low rank approximations. The upper bound of
\(O(d/\varepsilon ^2)\) on the target dimension for oblivious sketches was (implicitly) given in the work by Clarkson and Woodruff [
31], and this bound was later proved to be optimal by Nelson and Nguyen [
82], though even smaller bounds are possible if the sketch is not oblivious, see for instance [
33,
49]. It is worth noting that if we are only interested in approximating the optimum using the sketch for linear regression, somewhat akin to a weak coreset guarantee, a target dimension of
\(O(d/\varepsilon )\) is sufficient and necessary [
31].
Projections and sketches also play an important role for coresets in Euclidean
k-means [
33,
49]. Cohen et al. [
33] showed that an oblivious sketch of target dimension
\(O(k/\varepsilon ^2)\) is cost-preserving. As a corollary, this implies that for any coreset construction, the dependency on
d may be replaced by a dependency on
\(k/\varepsilon ^2\).
The computation time for all sketching approaches can be generally regarded as acceptable. The smallest target dimension is achievable by computing the SVD or a sufficiently good low rank approximation [
33,
49]. While this can be done in polynomial time, it is more expensive than the oblivious sketching methods we describe in the following. Conceptually, there are two basic approaches to speeding up the multiplication with sketching matrices. The first one is to reduce the target dimension. This, however, is known to be impossible in various regimes, see [
12,
63,
65,
72] and very recently for any embedding method by Larsen and Nelson [
73]. The other direction is to make the sketching matrix sparse.
Sparse matrices tend to distort sparse vectors. Clearly, the Johnson-Lindenstrauss guarantee cannot hold in such cases. To remedy this problem, Ailon and Chazelle [
9] proposed to first increase the density of the input vectors by performing a randomized Fourier transform before multiplying with a very sparse embedding matrix. This approach, called the
Fast Johnson Lindenstrauss Transform, was later improved and refined, see for instance [
10,
11,
21,
95] and references therein.
The second approach for sparse Johnson Lindenstrauss transforms is based around sophisticated ways of combining Count Sketch estimators. The Count Sketch estimator [
25], originally proposed for finding heavy hitters, is essentially a single row of a Rademacher matrix. Instead of naively repeating the Count Sketch and averaging the results as in a standard Rademacher matrix, we aim to partition the entries of the vectors that are to be embedded, apply a Count Sketch for each partition, and thereafter aggregate the results. The degree of sparsity that may be achieved using various partitioning schemes has been studied in [
2,
32,
35,
66,
81]. We want to highlight the work by Clarkson and Woodruff [
32], who achieve an embedding in essentially input-sparsity time, albeit at the cost of a slightly larger target dimension
\(\frac{d^2}{\varepsilon ^2}\). The squared dependency on
d was also shown to be necessary by Nelson and Nguyen [
80].
Another related direction is generalizing to
\(\ell _p\) subspace embeddings. A first step was taken by Sohler and Woodruff [
93] who designed the first subspace embedding for
\(\ell _1\) via Cauchy random variables. The method is in principle generalizable to using
p-stable distributions and was improved in [
30,
77]. The idea is that the sum of such random variables forms again a random variable from the same type of distribution leading to concentration results for the
\(\ell _p\) norm under study. At the same time it is inherently limited to
\(1\le p\le 2\) as in [
77], since no such distributions exist for
\(p>2\), cf. [
96]. The first attempts to generalize to
\(p>2\) [
30] had nearly linear size, namely
\(n/{\text {poly}}(d)\), which clearly was not satisfying. A remedy came with a manuscript of Andoni [
13], who discovered the
max stability of inverse exponential random variables as a means to embed
\(\ell _p, p>2\) with little distortion into
\(\ell _\infty\). Combining this with the work on
\(\ell _2\) embeddings of Clarkson and Woodruff [
32] culminated in oblivious subspace embeddings for all
p (into
\(\ell _2\) resp.
\(\ell _\infty\)) and only
\({\text {poly}}(d)\) distortion [
96]. Note that the embedding dimension for
\(p>2\) is
\(n^{1-2/p}{\text {poly}}(d)\) which improved upon the previous
\(n/{\text {poly}}(d)\) and is close to optimal given the lower bound of
\(\varOmega (n^{1-2/p})\) [
86]. The desirable
\((1\pm \varepsilon )\) distortion can be achieved using the embeddings for preconditioning and sampling proportional to the
\(\ell _p\) leverage scores [
30,
34,
96].
Current research has moved beyond strictly algorithmic and optimization-related characteristics to the study of statistical properties [
76,
88]. In particular, it is not only maximum likelihood estimators that are approximated under random projections. Geppert et al. [
54] showed that in important classes of Bayesian regression models, the whole structure of the posterior distribution is preserved. This yields much faster algorithms for the widely applicable and flexible, yet computationally demanding, Bayesian machinery.
Recently, a series of optimization software packages based on random projections and sketches has appeared. The parallel least squares regression solver LSRN was developed in [
78,
97]. An implementation of some of the presented sketching techniques named RaProR was made available for the statistics programming language R [
53,
54,
87].