1 Introduction

This paper treats the homology of simplicial complexes built via deterministic rules from a random set of vertices. In particular, it shows that, depending on the randomness that generates the vertices, the homology of these complexes can either become trivial as the sample size grows, or can contain more and more complex structures.

The motivation for these results comes from applications of topological tools to pattern analysis, object identification, and especially the analysis of data sets. Typically, one starts with a collection of points, forms some simplicial complexes associated with them, and then takes their homology. For example, the \(0\)-dimensional homology of such complexes can be interpreted as a version of clustering. The basic philosophy behind this approach is that topology has an essentially qualitative nature and should therefore be robust with respect to small perturbations. Some recent references are [2, 3, 9, 15, 19], with two reviews, from different perspectives, in [1] and [12]. Many of these papers find their raison d’être in essentially statistical problems, in which data generate the structures.

An important example occurs in the following manifold learning problem. Let \(\mathcal {M}\) be an unknown manifold embedded in a Euclidean space, and suppose that we are given a set of independent and identically distributed \((\mathrm {i.i.d.})\) random samples \(\mathcal {X}_n = \big \{X_1,\ldots ,X_n\big \}\) from the manifold. In order to recover the homology of \(\mathcal {M}\), we consider the homology of

$$\begin{aligned} U = \bigcup _{k=1}^n B_{\varepsilon }(X_k), \end{aligned}$$
(1.1)

where \(B_\varepsilon (X)\) is a closed ball of radius \(\varepsilon \) about the point \(X\). The belief, or hope, is that for large enough \(n\) the homology of \(U\) will be equivalent to that of \(\mathcal {M}\). A confounding issue arises when the sample points do not necessarily lie on the manifold, but rather are perturbed from it by a random amount. When this happens, it will follow from our results that the precise distribution behind the randomness plays a qualitatively important role. It is known that if the perturbations come from a bounded or strongly concentrated distribution, then they do not lead to much spurious homology, and the above line of attack, appropriately applied, works. For example, it was shown in [17] that for Gaussian noise it is possible to clean the data and recover the underlying topology of \(\mathcal {M}\) in a way that is essentially independent of the ambient dimension. Both [16, 17] contain results of the form that, given a nice enough \(\mathcal {M}\), and any \(\delta >0\), there are explicit conditions on \(n\) and \(\varepsilon \) such that the homology of \(U\) is equal to the homology of \(\mathcal {M}\) with a probability of at least \((1-\delta )\). However, for other distributions no such results exist, nor, in view of the results of this paper, are they to be expected.

Figure 1 provides an illustrative example of what happens when sampling points from an annulus and perturbing them with additional noise before reconstructing the annulus as in (1.1). In particular, it shows that if the additional noise is, in some sense, large, then sample points can appear essentially anywhere, introducing extraneous homology elements.

Fig. 1

(a) The original space \(\mathcal {M}\) (an annulus) that we wish to recover from random samples. (b) With the appropriate choice of radius, we can easily recover the homology of the original space from random samples from \(\mathcal {M}\). (c) In the presence of bounded noise, homology recovery is unharmed. (d) In the presence of unbounded noise, many extraneous homology elements appear and significantly interfere with homology recovery
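The qualitative difference between the noise models in Fig. 1 is easy to reproduce numerically. The following sketch (ours, not from the paper; the samplers and parameters are illustrative stand-ins) perturbs annulus samples with Gaussian noise and with heavy-tailed noise, and reports how far the points stray:

```python
import numpy as np

rng = np.random.default_rng(42)

def annulus_sample(n, r_in=1.0, r_out=2.0):
    """n points sampled uniformly from an annulus in the plane."""
    r = np.sqrt(rng.uniform(r_in**2, r_out**2, n))   # area-uniform radii
    theta = rng.uniform(0, 2 * np.pi, n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

n, sigma = 1000, 0.1
clean = annulus_sample(n)

# bounded-variance Gaussian perturbation
gaussian = clean + sigma * rng.standard_normal((n, 2))

# heavy-tailed perturbation: Pareto radii with uniform directions,
# an illustrative stand-in for the power-law noise of (1.2)
r = rng.pareto(1.0, n)
phi = rng.uniform(0, 2 * np.pi, n)
heavy = clean + sigma * np.column_stack([r * np.cos(phi), r * np.sin(phi)])

for name, pts in [("clean", clean), ("gaussian", gaussian), ("heavy", heavy)]:
    print(f"{name:9s} max |x| = {np.linalg.norm(pts, axis=1).max():8.2f}")
```

The Gaussian sample stays within a small neighbourhood of the annulus, while the heavy-tailed sample typically contains outliers at distances orders of magnitude larger: precisely the points that generate the extraneous cycles in Fig. 1(d).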

In order to be able, eventually, to extend the work in [17] beyond Gaussian noise, and make more concrete statements about the probabilistic features of the homology this extension generates, it is necessary to first focus on the behaviour of samples generated by pure noise, with no underlying manifold. In this case, thinking of the above setup, the manifold \(\mathcal {M}\) is simply the point at the origin, and the homology that we shall be trying to recapture is trivial. Nevertheless, we shall see that differing noise models can make this task extremely delicate, regardless of sample size.

1.1 Summary of Results

To be more concrete, let

$$\begin{aligned} \mathcal {X}_n = \big \{{X_1,\ldots ,X_n}\big \} \end{aligned}$$

be a set of \(n\,\mathrm {i.i.d.}\) random samples in \({\mathbb {R}}^d\), from a common density function \(f\). Recall that the abstract simplicial complex \(\check{C}(\mathcal {X},\varepsilon )\) constructed according to the following rules is called the Čech complex associated to \(\mathcal {X}\) and \(\varepsilon \):

  1. The \(0\)-simplices of \(\check{C}(\mathcal {X},\varepsilon )\) are the points in \(\mathcal {X}\).

  2. An \(n\)-simplex \(\sigma =[x_{i_0},\ldots ,x_{i_n}]\) is in \(\check{C}(\mathcal {X},\varepsilon )\) if \(\bigcap _{k=0}^{n} B_{\varepsilon }(x_{i_k}) \ne \emptyset \).

An important result, known as the “nerve theorem”, links Čech complexes and the neighborhood set \(U\) of (1.1), establishing that they are homotopy equivalent (cf. [7]). In particular, they have the same Betti numbers, the measures of homology on which we shall concentrate in what follows.
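As a concrete illustration, here is a minimal sketch (ours, not from the paper) of the Čech construction. It relies on the standard fact that the closed \(\varepsilon \)-balls around a finite set of points have a common point precisely when the smallest ball enclosing those points has radius at most \(\varepsilon \); below we find that radius by numerically minimizing the maximum distance to a candidate centre.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import minimize

def balls_intersect(points, eps):
    """True if the closed eps-balls around `points` share a common point,
    i.e. if the smallest enclosing ball of `points` has radius <= eps."""
    pts = np.asarray(points, dtype=float)
    radius = lambda c: np.max(np.linalg.norm(pts - c, axis=1))
    res = minimize(radius, pts.mean(axis=0), method="Nelder-Mead")
    return bool(res.fun <= eps + 1e-9)

def cech_complex(X, eps, max_dim=2):
    """All simplices (as vertex index tuples) of the Cech complex of X,
    up to dimension max_dim."""
    X = np.asarray(X, dtype=float)
    simplices = [(i,) for i in range(len(X))]
    for k in range(1, max_dim + 1):
        for idx in combinations(range(len(X)), k + 1):
            if balls_intersect(X[list(idx)], eps):
                simplices.append(idx)
    return simplices
```

For the four corners of the unit square and \(\varepsilon = 0.75\), for instance, the routine returns all six edges and all four triangles, since even the smallest ball enclosing all four corners has radius \(\sqrt{2}/2 \approx 0.71 \le 0.75\).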

If the sample distribution has compact support \(S\), then it is easy to show that, for fixed \(\varepsilon \) and large enough \(n\),

$$\begin{aligned} \check{C}(\mathcal {X},\varepsilon ) \ \simeq \ \bigcup _{k=1}^n B_\varepsilon (X_k) \ \approx \ {{\mathrm{Tube}}}(S,\varepsilon ) \ \triangleq \ \big \{x\in {\mathbb {R}}^d: \min _{y\in S} \Vert x-y\Vert \,\le \, \varepsilon \big \}, \end{aligned}$$

where \(\simeq \) denotes homotopy equivalence, \(\approx \) means that the two sets are close (small perturbations of one another), and \(\Vert \cdot \Vert \) is the standard \(L^2\) norm in \({\mathbb {R}}^d\). Thus, there is not much to study in this case. However, when the support of the distribution is unbounded, interesting phenomena occur.

To study these phenomena, we shall consider three representative examples of probability densities: the power-law, exponential, and standard Gaussian distributions, whose density functions are given, respectively, by

$$\begin{aligned}&f_{\mathrm {p}}(x) \triangleq \frac{c_{\mathrm {p}}}{1+\Vert {x}\Vert ^\alpha },\end{aligned}$$
(1.2)
$$\begin{aligned}&f_{\mathrm {e}}(x) \triangleq c_{\mathrm {e}}\mathrm{e}^{-\Vert {x}\Vert },\end{aligned}$$
(1.3)
$$\begin{aligned}&f_{\mathrm {g}}(x) \triangleq c_{\mathrm {g}}\mathrm{e}^{-\Vert {x}\Vert ^2/2}, \end{aligned}$$
(1.4)

where \(\alpha > d\) and \(c_{\mathrm {p}},c_{\mathrm {e}},c_{\mathrm {g}}\) are appropriate normalization constants that will not be of concern to us.

For large samples from any of these distributions we shall show that there exists a “core”—a region in which the density of points is so high that placing unit balls around them completely covers the region. Consequently, the Čech complex inside the core is contractible. The radius of the core obviously grows to infinity as the sample size \(n\) goes to infinity, but its exact growth rate will depend on the underlying distribution. For the three examples above, we study the size of the core in Sect. 2.1. Denoting a lower bound for the radius of the core by \(R_n^{\mathrm {c}}\), we will show that

$$\begin{aligned} R_n^{\mathrm {c}}\sim {\left\{ \begin{array}{ll} (n/\log n)^{1/\alpha }, &{}\quad f(x) \propto \frac{1}{1+\Vert {x}\Vert ^\alpha } ,\\ \log n, &{}\quad f(x) \propto \mathrm{e}^{-\Vert {x}\Vert }, \\ \sqrt{2\log n}, &{}\quad f(x) \propto \mathrm{e}^{-\Vert {x}\Vert ^2/2}. \end{array}\right. } \end{aligned}$$

Note that in all three cases we have tacitly assumed that the cores are balls, a natural consequence of the spherical symmetry of the probability densities.

Beyond the core, the topology is more varied. For fixed \(n\), there may be additional isolated components, whose points are no longer placed densely enough to connect with one another and form a contractible set. Indeed, we shall show that the individual components will typically have nontrivial homology. Thus, in this region, the topology of the Čech complex is highly nontrivial, and many homology elements of different orders appear. We call this phenomenon “crackling”, akin to the well-known crackling caused by noise interference in audio signals.

As with the core size, the exact crackling behaviour depends on the choice of distribution. It turns out that Gaussian samples do not lead to crackling, but the other two cases do. To describe this, with some imprecision of notation, we shall write \([a,b)\) not only for an interval on the real line, but also for the annulus

$$\begin{aligned}{}[a,b) \ \triangleq \ \big \{ x\in {\mathbb {R}}^d: a\le \Vert x\Vert < b\big \}. \end{aligned}$$

In Sects. 2.2 and 2.3 we shall show that the exterior of the core can be divided into disjoint spherical annuli at radii

$$\begin{aligned} R_n^{\mathrm {c}} \ll R_{d-1,n} \ll R_{d-2,n} \ll \cdots \ll R_{0,n}, \end{aligned}$$

where by \( a_n \ll b_n\) we mean that \((b_n-a_n) \rightarrow \infty \) as \(n\rightarrow \infty \). These radii are defined differently for each of the two crackling distributions, and we will show that there are different types of crackling (i.e. of homology) dominating in different regions.

In \([R_{0,n},\infty )\) there are mostly disconnected points, and no structures with nontrivial homology. In \([R_{1,n},R_{0,n})\) connectivity is a bit higher, and a finite number of non-trivial \(1\)-cycles appear. In \([R_{2,n},R_{1,n})\) we have a finite number of non-trivial \(2\)-cycles, while the number of \(1\)-cycles grows to infinity as \(n\rightarrow \infty \). In general, in \([R_{k,n},R_{k-1,n})\), as \(n\rightarrow \infty \) we have a finite number of non-trivial \(k\)-cycles, infinitely many \(l\)-cycles for \(l<k\), and no cycles of dimension \(l>k\). In other words, the crackle starts with a pure dust at \(R_{0,n}\) and as we get closer to the core, higher dimensional homology gradually appears. See Fig. 2 in the following section for more details.

Fig. 2

The layered behaviour of crackle. Inside the core (\(B_{R_n^{\mathrm {c}}}\)) the complex consists of a single component and no cycles. The exterior of the core is divided into separate annuli. Going from right to left, we see how the Betti numbers grow. In each annulus we present the Betti number that was most recently changed

As we already mentioned, the Gaussian distribution is fundamentally different from the other two, and does not lead to crackling. In Sect. 2.4 we show that, for the Gaussian distribution, there are hardly any points located outside the core. Thus, as \(n\rightarrow \infty \), the union of balls around the sample points becomes a giant contractible ball of radius of order \(\sqrt{2\log n}\).

It is now possible to understand a little better how the results of this paper relate to the noisy manifold learning problem discussed above. For example, if the distribution of the noise is Gaussian, our results imply that if the manifold is well behaved, and the sample size is moderate, noise outliers should not significantly interfere with homology recovery, since Gaussian noise does not introduce artificial homology elements, even for large samples. However, there is a delicate counterbalance here between “moderate” and “large”. Once the sample size is large, the core is also large, and the reconstructed manifold will have the topology of \(\mathcal {M}\oplus B_{O(\sqrt{2\log n})}(0)\), where \(\oplus \) is Minkowski addition. As \(n\) grows, the core will eventually envelop any compact manifold, and thus the homology of \(\mathcal {M}\) will be hidden by that of the core.

On the other hand, if the distribution of the noise is power-law or exponential, then noise outliers will typically generate extraneous homology elements that, for almost any sample size, will complicate the estimation of the original manifold. Furthermore, increasing the sample size in no way solves this problem. Note that this issue is in addition to the fact that increasing the sample size will, as in the Gaussian case, create the problem of a large core concealing the topology of \(\mathcal {M}\).

Thus, from a practical point of view, the message of this paper is that outliers cause problems in manifold estimation when noise is present, a fact well known to all practitioners who have worked in the area. What is qualitatively new here is a quantification of how this happens, and how it relates to the distribution of the noise. We do not attempt to solve the estimation problem here, but it unfortunately follows from the results of this paper that algorithms for handling outliers will probably require knowing at least the tail behaviour of the error distribution, despite the fact that in practical situations such knowledge is generally unavailable a priori.

1.2 On Persistence Intervals

While the above discussion has concentrated on the persistence of noise-induced crackle as sample sizes grow, and the regions in \({\mathbb {R}}^d\) in which different types of homology appear, the proofs below also yield information about the more classical persistence diagrams of topological data analysis (cf. [8, 10–12]).

For example, in the two cases for which crackle persists—the power-law and exponential cases—estimates of the type appearing in Sect. 3 indicate that, with high probability, there exist extremely long bars in the barcode representation of persistent homology. Preliminary calculations show that, up to lower order corrections, bar lengths for the \(k\)-th homology can be as large as \(O(n^{a_k})\) in the power-law case, and \(b_k (\log \log n)\) in the exponential case, for appropriate \(a_k\) and \(b_k\). More detailed studies of these phenomena will appear in a later publication.

1.3 Poisson Processes

Although we have described everything so far in terms of a random sample \(\mathcal {X}_n\) of \(n\) points taken from a density \(f\), there is another way to approach the results of this paper, and that is to replace the points of \(\mathcal {X}_n\) with the points of a \(d\)-dimensional Poisson process \(\mathcal {P}_n\) whose intensity function is given by \(\lambda _n = n f\). In this case the number of points is no longer fixed, but has mean \(n\). As with many phenomena in the theory of random geometric graphs (see [18]), the results of this paper hold without any change if we replace \(\mathcal {X}_n\) by \(\mathcal {P}_n\).
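In simulations, this Poissonization is one line: a sketch (ours, with an arbitrary Gaussian sampler standing in for \(f\)) is

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_process(n, sampler):
    """Points of a Poisson process with intensity n*f: draw a Poisson(n)
    number of i.i.d. points from the density f."""
    return sampler(rng.poisson(n))

# e.g. intensity n * f_g in R^2 with mean number of points n = 1000
pts = poisson_process(1000, lambda N: rng.standard_normal((N, 2)))
print(len(pts))   # random, with mean 1000
```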

1.4 Disclaimers

Before starting the paper in earnest, and so as not to be accused of myopia, we note that the subject of manifold learning is obviously much broader than that described above, and algorithms for “estimating” an underlying manifold from a finite sample abound in the statistics and computer science literatures. Very few of them, however, take the algebraic point of view adopted here and in the literature quoted above. Furthermore, we note that other important results about the homology of Rips and Čech complexes for various distributions can be found in [4–6, 13, 14]. While the methods and emphases of these papers are rather different, they demonstrate phenomena similar to the ones in this paper. The study of random geometric complexes typically concentrates on situations in which the number of points (\(n\)) goes to infinity and the radius (\(r_n\)) involved in defining the complexes goes to zero. Decreasing the radius \(r_n\) plays a similar role to increasing \(R_n\), as treated in this paper; both actions make the complex sparser. For example, if \(r_n \rightarrow 0\) relatively slowly (\(r_n = \Omega ((\log n/n)^{1/d})\)), the entire complex behaves like the “core” discussed earlier. On the other hand, if \(r_n\rightarrow 0\) fast enough (\(r_n = o(n^{-1/d})\)), then the entire complex behaves like “crackle”. For more details see [13].

2 Results

In this section we shall present all our main results, along with some discussion, more technical than that of the Introduction. Recall from Sect. 1.3 that although we present all results for the point set \(\mathcal {X}\), they also hold if we replace the points of \(\mathcal {X}\) by the points of an appropriate Poisson process. All proofs are deferred to Sect. 3.

2.1 The Core of Distributions with Unbounded Support

We start by examining the core of the power-law, exponential and Gaussian distributions. These distributions are spherically symmetric and the samples are concentrated near the origin. By “core” we refer to a centered ball \(B_{R_n}\triangleq B_{R_n}(0) \subset {\mathbb {R}}^d\) containing a very large number of points from the sample \(\mathcal {X}_n\), such that

$$\begin{aligned} B_{R_n}\subset \bigcup _{X\in \mathcal {X}_n \cap B_{R_n}} B_1(X), \end{aligned}$$

i.e. the unit balls around the sample points completely cover \(B_{R_n}\). In this case the homology of \(\bigcup _{X\in \mathcal {X}_n\cap B_{R_n}} B_1(X)\), or equivalently, of \(\check{C}(\mathcal {X}_n \cap B_{R_n},1)\), is trivial. Obviously, as \(n\rightarrow \infty \), the radius \(R_n\) grows as well.

Let \(\big \{R_n\big \}_{n=1}^\infty \) be an increasing sequence of positive numbers. Denote by \(C_n\) the event that \(B_{R_n}\) is covered, i.e.

$$\begin{aligned} C_n \triangleq \big \{{B_{R_n}\subset \bigcup _{X\in \mathcal {X}_n \cap B_{R_n}} B_1(X)}\big \}. \end{aligned}$$

We wish to find the largest possible value of \(R_n\) such that \(\mathbb {P}\big (C_n\big ) \rightarrow 1\). The following theorem presents lower bounds for this value.

Theorem 1

Let \(\varepsilon >0\), and define

$$\begin{aligned} R_n^{\mathrm {c}}\triangleq {\left\{ \begin{array}{ll} \left( {\frac{\delta _{\mathrm {p}}n}{{\log n - \mathrm{e}^{-\varepsilon } \log \log n}}-1}\right) ^{1/\alpha }, &{} \quad f = f_{\mathrm {p}}, \\ \log n - \log \log \log n -\delta _{\mathrm {e}}-\varepsilon , &{}\quad f = f_{\mathrm {e}}, \\ \sqrt{2\big ({\log n -\log \log \log n -\delta _{\mathrm {g}}-\varepsilon }\big )}, &{}\quad f = f_{\mathrm {g}}, \end{array}\right. } \end{aligned}$$

where the three distributions are given by (1.2)–(1.4), and

$$\begin{aligned} \delta _{\mathrm {p}}&= c_{\mathrm {p}}\alpha 2^{-d} d^{-(1+d/2)}, \\ \delta _{\mathrm {e}}&= (1+d/2)\log d +d\log 2-\log c_{\mathrm {e}},\\ \delta _{\mathrm {g}}&= (1+d/2)\log d + (d-1)\log 2 -\log c_{\mathrm {g}}. \end{aligned}$$

If \(R_n \le R_n^{\mathrm {c}}\), then

$$\begin{aligned} \mathbb {P}\big (C_n\big ) \rightarrow 1. \end{aligned}$$

Theorem 1 implies that the core size has a completely different order of magnitude for each of the three distributions. The heavy-tailed power-law distribution has the largest core, while the core of the Gaussian distribution is the smallest. While Theorem 1 provides a lower bound for the size of the core, the results in Theorems 2, 3 and 4 indicate the existence of an upper bound of the same order. In fact, we believe that the upper bound differs from the lower bound in Theorem 1 only by a constant, but this will not be pursued in this paper. In the following sections we shall study the behaviour of the Čech complex outside the core.
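The coverage event \(C_n\) is straightforward to probe by simulation. The sketch below (ours; the distribution, parameters, and the dropped constants are purely illustrative) estimates \(\mathbb {P}(C_n)\) for exponential samples in the plane by testing whether every point of a fine grid in \(B_{R_n}\) lies within distance 1 of a sample point, in the spirit of the grid argument used in the proof in Sect. 3.1:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)

def exp_sample(n, d=2):
    """n i.i.d. points with density c_e * exp(-|x|) in R^d."""
    dirs = rng.standard_normal((n, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    r = rng.gamma(d, 1.0, n)            # radial density ~ r^{d-1} e^{-r}
    return dirs * r[:, None]

def covered(X, R, grid_step=0.05):
    """True if every grid point of B_R is within distance 1 of X."""
    ax = np.arange(-R, R + grid_step, grid_step)
    gx, gy = np.meshgrid(ax, ax)
    grid = np.column_stack([gx.ravel(), gy.ravel()])
    grid = grid[np.linalg.norm(grid, axis=1) <= R]
    dist, _ = cKDTree(X).query(grid)
    return bool(dist.max() <= 1.0)

n = 10_000
R = np.log(n) - np.log(np.log(np.log(n)))   # roughly R_n^c for f_e
print(np.mean([covered(exp_sample(n), R) for _ in range(20)]))
```

Checking a grid is only a proxy for covering all of \(B_R\), but for grid spacing much smaller than the ball radius it gives a good indication of where the coverage threshold lies.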

2.2 How Power-Law Noise Crackles

In this section we explore the crackling phenomenon in the power-law distribution with a density function given by

$$\begin{aligned} f_{\mathrm {p}}(x) \triangleq \frac{c_{\mathrm {p}}}{1+\Vert {x}\Vert ^\alpha }, \end{aligned}$$

where \(\alpha > d\). Let \(B_{R_n}\subset {\mathbb {R}}^d\) be the centered ball with radius \(R_n\), and let

$$\begin{aligned} \check{C}_n \triangleq \check{C}(\mathcal {X}_n \cap (B_{R_n})^c,1) \end{aligned}$$

be the Čech complex constructed from sample points outside \(B_{R_n}\). We wish to study

$$\begin{aligned} \beta _{k,n}\triangleq \beta _k(\check{C}_n), \end{aligned}$$

the \(k\)-th Betti number of \(\check{C}_n\).

Note that the minimum number of points required to form a non-trivial \(k\)-dimensional cycle (\(k\ge 1\)) is \(k+2\). In this case, the \(k\)-cycle is the boundary of the \((k+1)\)-dimensional simplex spanned by these points. For \(k\ge 1\) and \(\mathcal {Y}\subset {\mathbb {R}}^d\), denote

$$\begin{aligned} T_k(\mathcal {Y}) \triangleq 1\!\!1\big \{| {\mathcal {Y}}| = k+2,\ \beta _k(\check{C}(\mathcal {Y},1)) = 1\big \}, \end{aligned}$$

i.e. \(T_k\) takes the value \(1\) if \(\check{C}(\mathcal {Y},1)\) is a minimal \(k\)-dimensional cycle, and \(0\) otherwise. This indicator function will be used to define the limits of the Betti numbers.
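Computing \(T_k\) reduces to computing the Betti numbers of a complex on \(k+2\) vertices, which can be read off the ranks of the mod-2 boundary maps via \(\beta _k = \dim C_k - \mathrm{rank}\,\partial _k - \mathrm{rank}\,\partial _{k+1}\). The following sketch (ours; it reuses the cech_complex routine sketched in the Introduction) makes this concrete:

```python
import numpy as np
from itertools import combinations

def betti_numbers(simplices, max_dim):
    """Betti numbers over Z/2 of a complex given as sorted vertex tuples
    (the list must be closed under taking faces)."""
    by_dim = {d: sorted(s for s in simplices if len(s) == d + 1)
              for d in range(max_dim + 2)}
    index = {d: {s: i for i, s in enumerate(by_dim[d])} for d in by_dim}

    def boundary_rank(d):
        """Rank (mod 2) of the boundary map on d-simplices, d >= 1."""
        rows, cols = by_dim[d - 1], by_dim[d]
        if not rows or not cols:
            return 0
        M = np.zeros((len(rows), len(cols)), dtype=np.uint8)
        for j, s in enumerate(cols):
            for face in combinations(s, d):     # the (d-1)-faces of s
                M[index[d - 1][face], j] = 1
        rank = 0                                # Gaussian elimination mod 2
        for c in range(M.shape[1]):
            piv = next((i for i in range(rank, M.shape[0]) if M[i, c]), None)
            if piv is None:
                continue
            M[[rank, piv]] = M[[piv, rank]]
            for i in range(M.shape[0]):
                if i != rank and M[i, c]:
                    M[i] ^= M[rank]
            rank += 1
        return rank

    # beta_k = dim C_k - rank(boundary_k) - rank(boundary_{k+1})
    return [len(by_dim[k]) - (boundary_rank(k) if k else 0) - boundary_rank(k + 1)
            for k in range(max_dim + 1)]

def T_k(Y, k, eps=1.0):
    """1 if the k+2 points Y form a minimal non-trivial k-cycle, else 0."""
    if len(Y) != k + 2:
        return 0
    cech = cech_complex(np.asarray(Y), eps, max_dim=k + 1)
    return int(betti_numbers(cech, max_dim=k)[k] == 1)
```

For example, in the plane, three points forming an equilateral triangle with side length slightly greater than \(\sqrt{3}\) give \(T_1 = 1\): the three unit balls intersect pairwise but have no common point, so the complex is a hollow triangle.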

Theorem 2

If \(\lim _{n\rightarrow \infty }n R_n^{-\alpha } = 0\), then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big ( n R_n^{d-\alpha } \big )^{-1}{\mathbb {E}}\big \{{\beta _{0,n}}\big \}&= \mu _{\mathrm {p},0}, \\ \lim _{n\rightarrow \infty }\big (n^{k+2} R_n^{d- \alpha (k+2)}\big )^{-1}{\mathbb {E}}\big \{{\beta _{k,n}}\big \}&= \mu _{\mathrm {p},k},\quad 1 \le k \le d-1 \end{aligned}$$

where

$$\begin{aligned}&\mu _{\mathrm {p},0}\triangleq \frac{s_{d-1} c_{\mathrm {p}}}{\alpha -d}, \end{aligned}$$
(2.1)
$$\begin{aligned}&\mu _{\mathrm {p},k}\triangleq \frac{s_{d-1}c_{\mathrm {p}}^{k+2}}{(\alpha (k+2)-d)(k+2)!}\int _{({\mathbb {R}}^d)^{k+1}} T_k(0,\mathbf {y}){d}\mathbf {y}, \quad 1\le k\le d-1, \end{aligned}$$
(2.2)

and where \(s_{d-1}\) is the surface area of the \((d-1)\)-dimensional unit sphere in \({\mathbb {R}}^d\).

Next, we define the following values, which will serve as critical radii for the crackle,

$$\begin{aligned}&R_{0,n}^\varepsilon \triangleq n^{\big (\frac{1}{\alpha -d}+\varepsilon \big )}, \\&R_{0,n}\triangleq R_{0,n}^0 ,\\&R_{k,n}^\varepsilon \triangleq n^{\big (\frac{1}{\alpha -d/(k+2)}+\varepsilon \big )} \quad (k\ge 1), \\&R_{k,n}\triangleq R_{k,n}^0. \end{aligned}$$

The following is a straightforward corollary of Theorem 2, and summarizes the behaviour of \({\mathbb {E}}\big \{{\beta _{k,n}}\big \}\) in the power-law case.

Corollary 1

For \(k\ge 0\) and \(\varepsilon >0\),

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb {E}}\big \{{\beta _{k,n}}\big \} = \left\{ \begin{array}{ll} 0, &{}\quad R_n = R_{k,n}^\varepsilon , \\ \mu _{\mathrm {p},k}, &{}\quad R_n = R_{k,n},\\ \infty , &{}\quad R_n = R_{k,n}^{-\varepsilon }. \end{array}\right. \end{aligned}$$

Theorem 2 and Corollary 1 reveal that the crackling behaviour is organized into separate “layers”; see Fig. 2. Dividing \({\mathbb {R}}^d\) into a sequence of annuli at radii

$$\begin{aligned} R_{0,n}^\varepsilon \gg R_{0,n}\gg R_{1,n}^\varepsilon \gg R_{1,n} \gg \cdots \gg R_{d-1,n}^\varepsilon \gg R_{d-1,n} \gg R_n^{\mathrm {c}}, \end{aligned}$$

we observe a different behaviour of the Betti numbers in each annulus. We now briefly review this behaviour, in decreasing order of radius. The following description is mainly qualitative, and refers to expected values only.

  • \([R_{0,n}^\varepsilon ,\infty )\)—there are hardly any points (\(\beta _k\sim 0\), \(0\le k \le d-1\)).

  • \([R_{0,n},R_{0,n}^\varepsilon )\)—points start to appear, and \(\beta _0\sim \mu _{\mathrm {p},0}\). The points are very few and scattered, so no cycles are generated (\(\beta _k \sim 0\), \(1\le k \le d-1\)).

  • \([R_{1,n}^\varepsilon ,R_{0,n})\)—the number of components grows to infinity, but no cycles are formed yet (\(\beta _0 \sim \infty \), and \(\beta _k = 0\), \(1 \le k \le d-1\)).

  • \([R_{1,n},R_{1,n}^\varepsilon )\)—a finite number of \(1\)-dimensional cycles show up, among the infinite number of components (\(\beta _0 \sim \infty \), \(\beta _1\sim \mu _{\mathrm {p},1}\), and \(\beta _k = 0\), \(2 \le k \le d-1\)).

  • \([R_{2,n}^\varepsilon ,R_{1,n})\)—we have \(\beta _0\sim \infty \), \(\beta _1\sim \infty \), and \(\beta _k\sim 0\) for \(k\ge 2\).

This process goes on until the \((d-1)\)-dimensional cycles appear:

  • \([R_{d-1,n},R_{d-1,n}^\varepsilon )\)—we have \(\beta _{d-1}\sim \mu _{\mathrm {p},d-1}\) and \(\beta _k\sim \infty \) for \(0\le k \le d-2\).

  • \([R_n^{\mathrm {c}},R_{d-1,n})\)—just before we reach the core, the complex exhibits its most intricate structure, with \(\beta _k \sim \infty \) for \(0\le k \le d-1\).

Note that there is a very fast phase transition as we move from the contractible core to the first crackle layer. At this point we do not know exactly where and how this phase transition takes place. A reasonable conjecture would be that the transition occurs at \(R_n = n^{1/\alpha }\) (since at this radius the term \(n R_n^{-\alpha }\) that appears in Theorem 2 changes its limit, affecting the limiting Betti numbers). However, this remains for future work.
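To get a feeling for the scales involved, here is a small numeric illustration of the layer radii in the power-law case (the parameter choices are ours):

```python
import numpy as np

d, alpha, n = 2, 3.0, 1e6
R_core = (n / np.log(n)) ** (1 / alpha)       # core radius scale
R = {0: n ** (1 / (alpha - d))}               # where the dust ends
for k in range(1, d):
    R[k] = n ** (1 / (alpha - d / (k + 2)))   # where k-cycles appear
for k in sorted(R):
    print(f"R_{k},n ~ {R[k]:,.1f}")
print(f"R_core ~ {R_core:,.1f}")
```

With \(d=2\), \(\alpha =3\), and \(n=10^6\), the dust extends to radius \(\sim 10^6\), the \(1\)-cycles appear around radius \(\sim 370\), and the core ends near radius \(\sim 40\): the crackle occupies an enormous region compared to the core.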

2.3 How Exponential Noise Crackles

In this section we focus on the exponential density function \(f=f_{\mathrm {e}}\). The results in this section are very similar to those for the power-law distribution, and we shall describe them briefly. The differences lie in the specific values of the \(R_{k,n}\) and in the terms in the limit formulae.

Theorem 3

If \(\lim _{n\rightarrow \infty }n\mathrm{e}^{-R_n} = 0\), then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n R_n^{d-1} \mathrm{e}^{-R_n}\big )^{-1}{\mathbb {E}}\big \{{\beta _{0,n}}\big \}&= \mu _{\mathrm {e},0}, \\ \lim _{n\rightarrow \infty }\big (n^{k+2} R_n^{d-1} \mathrm{e}^{-(k+2)R_n}\big )^{-1} {\mathbb {E}}\big \{{\beta _{k,n}}\big \}&= \mu _{\mathrm {e},k},\quad k\ge 1, \end{aligned}$$

where

$$\begin{aligned}&\mu _{\mathrm {e},0}\triangleq s_{d-1}c_{\mathrm {e}},\end{aligned}$$
(2.3)
$$\begin{aligned}&\mu _{\mathrm {e},k}\triangleq \frac{s_{d-1}c_{\mathrm {e}}^{k+2}}{(k+2)!} \int _0^\infty \int _{({\mathbb {R}}^d)^{k+1}} T_k(0,\mathbf {y}) \mathrm{e}^{-\big ((k+2)\rho + \sum _{i=1}^{k+1}y_i^1\big )} \prod _{i=1}^{k+1} 1\!\!1\big \{y_i^1 > -\rho \big \} {d}\mathbf {y}\, {d}\rho , \end{aligned}$$
(2.4)

and where \(y_i^1\) is the first coordinate of \(y_i\in {\mathbb {R}}^d\).

Next, define

$$\begin{aligned}&R_{0,n}^\varepsilon \triangleq \log n + \big (d-1+\varepsilon \big )\log \log n, \\&R_{0,n}\triangleq R_{0,n}^0 ,\\&R_{k,n}^\varepsilon \triangleq \log n + \big (\frac{d-1}{k+2}+\varepsilon \big )\log \log n \quad (k\ge 1),\\&R_{k,n}\triangleq R_{k,n}^0. \end{aligned}$$

From Theorem 3 we can conclude the following.

Corollary 2

For \(k\ge 0\) and \(\varepsilon >0\),

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb {E}}\big \{{\beta _{k,n}}\big \} = {\left\{ \begin{array}{ll} 0, &{}\quad R_n = R_{k,n}^\varepsilon , \\ \mu _{\mathrm {e},k}, &{}\quad R_n = R_{k,n},\\ \infty , &{} \quad R_n = R_{k,n}^{-\varepsilon }. \end{array}\right. } \end{aligned}$$

As in the power-law case, Theorem 3 implies the same “layered” behaviour, the only difference being in the values of \(R_{k,n}\). From examining the values of \(R_n^{\mathrm {c}}\) and \(R_{k,n}\), it is reasonable to guess that the phase transition in the exponential case occurs at \(R_n = \log n\).
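Numerically, the exponential layers are far thinner than the power-law ones, being separated only by multiples of \(\log \log n\) (a small illustration of ours, with \(d=2\)):

```python
import numpy as np

d, n = 2, 1e6
log_n, loglog_n = np.log(n), np.log(np.log(n))
print("R_0,n  ~", log_n + (d - 1) * loglog_n)            # dust appears
print("R_1,n  ~", log_n + (d - 1) / 3 * loglog_n)        # 1-cycles appear
print("R_core ~", log_n - np.log(np.log(np.log(n))))     # core scale
```

For \(n=10^6\) in the plane, the entire crackle region is squeezed into an annulus only a few units wide around radius \(\log n \approx 13.8\), in sharp contrast to the power-law case.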

2.4 Gaussian Noise Does Not Crackle

Simplicial complexes built over vertices sampled from the standard Gaussian distribution exhibit completely different behaviour from that seen in the power-law and exponential cases. Define

$$\begin{aligned} R_{0,n}^\varepsilon \triangleq \sqrt{2\log n + (d-2+\varepsilon )\log \log n}, \end{aligned}$$

Then we have the following.

Theorem 4

If \(f=f_{\mathrm {g}}\), \(\varepsilon > 0\), and \(R_n = R_{0,n}^\varepsilon \), then for \(0\le k \le d-1\)

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb {E}}\big \{{\beta _{k,n}}\big \} = 0. \end{aligned}$$

Note that in the Gaussian case \(\lim _{n\rightarrow \infty }\big ( R_{0,n}^\varepsilon - R_n^{\mathrm {c}}\big ) = 0\). This implies that, as \(n\rightarrow \infty \), there is a contractible core with hardly anything outside it. In other words, the ball placed around every new point we add to the sample immediately connects to the core, and thus Gaussian noise does not crackle.
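One can see the vanishing gap numerically (a quick check of ours, with \(d=2\), \(\varepsilon =0.1\), and the constant \(\delta _{\mathrm {g}}\) dropped for simplicity):

```python
import numpy as np

d, eps = 2, 0.1
for n in [1e3, 1e6, 1e12, 1e24]:
    log_n = np.log(n)
    R0 = np.sqrt(2 * log_n + (d - 2 + eps) * np.log(log_n))
    Rc = np.sqrt(2 * (log_n - np.log(np.log(log_n)) - eps))
    print(f"n = {n:.0e}:  R_0 - R_core = {R0 - Rc:.3f}")
```

The gap shrinks to zero, although very slowly, which is the content of the remark above.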

3 Proofs

We now turn to proofs, starting with the proof of the main result of Sect. 2.1.

3.1 The Core

Proof of Theorem 1

The proof covers all three distributions, except for specific calculations near the end. Take a grid on \({\mathbb {R}}^d\) of size \(g = \frac{1}{2\sqrt{d}}\). Let \(\mathcal{{Q}}_n\) be the collection of cubes in this grid that are contained in \(B_{R_n}\). Let \(\tilde{C}_n\) be the following event

$$\begin{aligned} \tilde{C}_n \triangleq \big \{{\forall } Q\in \mathcal{{Q}}_n : Q\cap \mathcal {X}_n \ne \emptyset \big \}, \end{aligned}$$

i.e. \(\tilde{C}_n\) is the event that every cube in \(\mathcal{{Q}}_n\) contains at least one point from \(\mathcal {X}_n\). Recall the definition of \(C_n\),

$$\begin{aligned} C_n \triangleq \big \{B_{R_n}\subset \bigcup _{X\in \mathcal {X}_n \cap B_{R_n}} B_1(X)\big \}. \end{aligned}$$

Then it is easy to show that \(\tilde{C}_n \subset C_n\), since, on \(\tilde{C}_n\), every point of \(B_{R_n}\) is within distance \(2g\sqrt{d} = 1\) of a sample point lying in the same or a neighbouring cube of \(\mathcal{{Q}}_n\). The complementary event \(\tilde{C}_n^c\) is the event that at least one cube is empty. Thus,

$$\begin{aligned} \mathbb {P}(\tilde{C}_n^c) \le \sum _{Q\in \mathcal{{Q}}_n} \mathbb {P}\big (Q \cap \mathcal {X}_n = \emptyset \big ) = \sum _{Q\in \mathcal{{Q}}_n}(1-p(Q))^n \le \sum _{Q\in \mathcal{{Q}}_n}\mathrm{e}^{-np(Q)} \end{aligned}$$

where

$$\begin{aligned} p(Q) = \int _Q f(z)\mathrm{d}z \ge g^d f(R_n). \end{aligned}$$

In addition, the number of cubes that are contained in \(B_{R_n}\) is less than \(\big (2{{R_n}/{g}}\big )^d\). Therefore,

$$\begin{aligned} \mathbb {P}(\tilde{C}_n^c) \le (2 g^{-1})^d R_n^d \mathrm{e}^{-n g^d f(R_n) }. \end{aligned}$$
(3.1)

Now, choose any \(\varepsilon > 0\) and set

$$\begin{aligned} R_n = R_n^{\mathrm {c}}\triangleq {\left\{ \begin{array}{ll} \left( {\frac{\delta _{\mathrm {p}}n}{{\log n - \mathrm{e}^{-\varepsilon } \log \log n}}-1}\right) ^{1/\alpha }, &{}\quad f = f_{\mathrm {p}}, \\ \log n - \log \log \log n -\delta _{\mathrm {e}}-\varepsilon , &{}\quad f= f_{\mathrm {e}}, \\ \sqrt{2\big (\log n -\log \log \log n -\delta _{\mathrm {g}}-\varepsilon \big )}, &{}\quad f = f_{\mathrm {g}}, \end{array}\right. } \end{aligned}$$

where

$$\begin{aligned} \delta _{\mathrm {p}}&= c_{\mathrm {p}}\alpha 2^{-d} d^{-(1+d/2)}, \\ \delta _{\mathrm {e}}&= \log d -\log c_{\mathrm {e}}- \log g^d, \\ \delta _{\mathrm {g}}&= \log (d/2) -\log c_{\mathrm {g}}- \log g^d. \end{aligned}$$

It is easy to verify that in all cases we have

$$\begin{aligned} R_n^d \mathrm{e}^{-n g^d f(R_n) } \rightarrow 0. \end{aligned}$$

Thus, from (3.1) we conclude that \(\mathbb {P}(\tilde{C}_n) \rightarrow 1\). Since \(\mathbb {P}\big (C_n\big ) \ge \mathbb {P}(\tilde{C}_n)\) we now have that for \(R_n = R_n^{\mathrm {c}}\), in each of the distributions,

$$\begin{aligned} \mathbb {P}\big (C_n\big ) \rightarrow 1, \end{aligned}$$

which completes the proof.\(\square \)

3.2 Crackle: Notation and General Lemmas

For \(R_n > 0\), set

$$\begin{aligned} \mathcal {X}_{n,R_n}\triangleq \mathcal {X}_n \cap (B_{R_n})^c, \end{aligned}$$

i.e. \(\mathcal {X}_{n,R_n}\) consists of the points of \(\mathcal {X}_n\) located outside the ball \(B_{R_n}\). Next, recall the definition of \(T_k\),

$$\begin{aligned} T_k(\mathcal {Y}) \triangleq 1\!\!1\big \{| {\mathcal {Y}}| = k+2,\ \beta _k(\check{C}(\mathcal {Y},1)) = 1\big \}, \end{aligned}$$

for \(\mathcal {Y}\subset {\mathbb {R}}^d\), and write

$$\begin{aligned} S_{0,n}&\triangleq | {\mathcal {X}_{n,R_n}}|, \\ \hat{S}_{0,n}&\triangleq \#\big \{X \in \mathcal {X}_{n,R_n}: X \hbox { is a connected component of } \check{C}(\mathcal {X}_n,1)\big \},\\ S_{k,n}&\triangleq \sum _{\mathcal {Y}\subset \mathcal {X}_{n,R_n}} T_k(\mathcal {Y}),\\ \widehat{S}_{k,n}&\triangleq \sum _{\mathcal {Y}\subset \mathcal {X}_{n,R_n}} T_k(\mathcal {Y})1\!\!1\big \{\check{C}(\mathcal {Y},1) \hbox { is a connected component of } \check{C}(\mathcal {X}_n,1)\big \},\\ L_{k,n}&\triangleq \sum _{\mathcal {Y}\subset \mathcal {X}_{n,R_n}} 1\!\!1\big \{| {\mathcal {Y}}| = k+3,\ \check{C}(\mathcal {Y},1) \hbox { is connected}\big \}, \end{aligned}$$

where \(k\ge 1\). Observe that

$$\begin{aligned} \hat{S}_{0,n}&\le \beta _{0,n} \le S_{0,n}\end{aligned}$$
(3.2)
$$\begin{aligned} \widehat{S}_{k,n}&\le \beta _{k,n}\le \widehat{S}_{k,n}+ L_{k,n},\quad k\ge 1 \end{aligned}$$
(3.3)

We will evaluate the limits of \({\mathbb {E}}\big \{{S_{k,n}}\big \},\,{\mathbb {E}}\{{\widehat{S}_{k,n}}\}\) and \({\mathbb {E}}\big \{{L_{k,n}}\big \}\) and deduce from these the limit of \({\mathbb {E}}\big \{{\beta _{k,n}}\big \}\).

In addition, set

$$\begin{aligned} \mathrm {e}_1&\triangleq (1,0,\ldots ,0) \in {\mathbb {R}}^d ,\\ f(r)&\triangleq f(r \mathrm {e}_1), \ \ r \in {\mathbb {R}}, \\ U(\mathbf {x})&\triangleq \bigcup _{i=1}^{k} B_2(x_i), \ \ \mathbf {x}\in ({\mathbb {R}}^d)^k, \\ p(\mathbf {x})&\triangleq \int _{U(\mathbf {x})}f(z)\mathrm{d}z,\ \ \mathbf {x}\in ({\mathbb {R}}^d)^k. \end{aligned}$$

The following two lemmas are purely technical, but will considerably simplify our computations later.

Lemma 1

Let \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) be a spherically symmetric probability density. Then,

$$\begin{aligned} {\mathbb {E}}\big \{{S_{0,n}}\big \}&= s_{d-1}n \int _{R_n}^\infty r^{d-1}f(r){d}r ,\\ {\mathbb {E}}\{{\hat{S}_{0,n}}\}&= s_{d-1}n \int _{R_n}^\infty r^{d-1}f(r)(1-np(r\mathrm {e}_1))^{n-1}{d}r, \end{aligned}$$

where \(s_{d-1}\) is the surface area of the \((d-1)\)-dimensional unit sphere.

Proof

\(S_{0,n}\) is simply a sum of Bernoulli variables, therefore

$$\begin{aligned} {\mathbb {E}}\big \{{S_{0,n}}\big \} = n \mathbb {P}\big (\Vert X\Vert > R_n\big ) = n\int _{{\mathbb {R}}^d} f(x)1\!\!1\big \{\Vert x\Vert >R_n\big \}{d}x. \end{aligned}$$

Writing the integral in polar coordinates yields

$$\begin{aligned} {\mathbb {E}}\big \{{S_{0,n}}\big \} = n \int _{R_n}^\infty \int _{S^{d-1}} f(r\theta )r^{d-1}J(\theta ){d}\theta \, {d}r, \end{aligned}$$

where \(J(\theta ) = | {\frac{\partial x}{\partial \theta }}|\). Since \(f\) is spherically symmetric, \(f(r\theta ) = f(r)\), and therefore

$$\begin{aligned} {\mathbb {E}}\big \{{S_{0,n}}\big \} = s_{d-1} n \int _{R_n}^\infty r^{d-1} f(r) {d}r. \end{aligned}$$

The proof for \(\hat{S}_{0,n}\) is similar, using the fact that the probability that a point \(x\in {\mathbb {R}}^d\) is disconnected from the rest of the complex \(\check{C}(\mathcal {X}_n,1)\) is \((1-p(x))^{n-1}\).\(\square \)
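As a quick numerical sanity check of the formula for \({\mathbb {E}}\big \{{S_{0,n}}\big \}\) (ours; it uses the fact that for \(f_{\mathrm {e}}\) in \({\mathbb {R}}^d\) the radial part of a sample has a \(\mathrm{Gamma}(d,1)\) density):

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(7)
d, n, R = 2, 100_000, 10.0
c_e = 1 / (2 * np.pi)                      # normalizes f_e in R^2

# S_{0,n}: number of sample points outside B_R; only the radii matter,
# and for f_e the radial density is r^{d-1} e^{-r} / (d-1)!  (Gamma(d,1))
emp = np.mean([np.sum(rng.gamma(d, 1.0, n) > R) for _ in range(20)])

s_1 = 2 * np.pi                            # surface area of S^1
theory, _ = quad(lambda r: r**(d - 1) * c_e * np.exp(-r), R, np.inf)
print(f"empirical {emp:.1f}  vs  formula {s_1 * n * theory:.1f}")
```

Both numbers agree with the closed form \(n(1+R)\mathrm{e}^{-R} \approx 49.9\) for these parameters.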

Lemma 2

Let \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) be a spherically symmetric probability density. Then, for \(k\ge 1\),

$$\begin{aligned} {\mathbb {E}}\big \{{S_{k,n}}\big \}&= s_{d-1} \left( {\begin{array}{c}n\\ k+2\end{array}}\right) \int _{R_n}^\infty r^{d-1}f(r)G_k(r){d}r ,\\ {\mathbb {E}}\{{\widehat{S}_{k,n}}\}&= s_{d-1} \left( {\begin{array}{c}n\\ k+2\end{array}}\right) \int _{R_n}^\infty r^{d-1}f(r)\hat{G}_k(r){d}r, \end{aligned}$$

where \(s_{d-1}\) is the surface area of the \((d-1)\)-dimensional unit sphere, and where

$$\begin{aligned} G_k(r)&\triangleq \int _{({\mathbb {R}}^d)^{k+1}} f(\Vert r\mathrm {e}_1+\mathbf {y}\Vert ) T_k(0,\mathbf {y}) \prod _{i=1}^{k+1} 1\!\!1\big \{\Vert r\mathrm {e}_1+y_i\Vert > R_n\big \} {d}\mathbf {y}, \\ \hat{G}_k(r)&\triangleq \int _{({\mathbb {R}}^d)^{k+1}} f(\Vert r\mathrm {e}_1+\mathbf {y}\Vert ) T_k(0,\mathbf {y}) \prod _{i=1}^{k+1} 1\!\!1\big \{\Vert r\mathrm {e}_1+y_i\Vert > R_n\big \} \\&\quad \times (1-p(r\mathrm {e}_1, r\mathrm {e}_1+\mathbf {y}))^{n-k-2}{d}\mathbf {y}. \end{aligned}$$

Proof

The proof is in the same spirit as the proof of Lemma 1, but technically more complicated. Thinking of \(S_{k,n}\) as a sum of Bernoulli variables, we have that

$$\begin{aligned} {\mathbb {E}}\big \{{S_{k,n}}\big \} = \left( {\begin{array}{c}n\\ k+2\end{array}}\right) \int _{({\mathbb {R}}^d)^{k+2}}f(\mathbf {x}) T_k(\mathbf {x}) \prod _{i=1}^{k+2} 1\!\!1\big \{\Vert x_i\Vert > R_n\big \} d\mathbf {x}. \end{aligned}$$

Let \(I_k\) denote the integral above. Then, using the change of variables

$$\begin{aligned} x_1&\rightarrow x, \qquad x_i \rightarrow x+ y_{i-1} \ \ (i>1), \end{aligned}$$

yields

$$\begin{aligned} I_k&= \int _{\Vert x\Vert \ge R_n} \int _{({\mathbb {R}}^d)^{k+1}} f(x)f(x+\mathbf {y}) T_k(x,x+\mathbf {y}) \prod _{i=1}^{k+1} 1\!\!1\big \{\Vert x+y_i\Vert > R_n\big \}{d}\mathbf {y}\, {d}x \\&=\int _{\Vert x\Vert \ge R_n} \int _{({\mathbb {R}}^d)^{k+1}} f(x)f(x+\mathbf {y}) T_k(0,\mathbf {y}) \prod _{i=1}^{k+1} 1\!\!1\big \{\Vert x+y_i\Vert > R_n\big \}{d}\mathbf {y}\,{d}x. \end{aligned}$$

Moving to polar coordinates yields

$$\begin{aligned} I_k&= \int _{R_n}^\infty \int _{S^{d-1}} \int _{({\mathbb {R}}^d)^{k+1}} f(r\theta )f(r\theta +\mathbf {y}) T_k(0,\mathbf {y}) \\&\quad \times \prod _{i=1}^{k+1} 1\!\!1\big \{\Vert r\theta +y_i\Vert > R_n\big \} r^{d-1}J(\theta ){ d}\mathbf {y}\, {d}\theta \, {d}r\\&= \int _{R_n}^\infty r^{d-1}f(r) \int _{S^{d-1}} J(\theta )\int _{({\mathbb {R}}^d)^{k+1}} f(\Vert r\theta +\mathbf {y}\Vert ) T_k(0,\mathbf {y}) \\&\quad \,\times \prod _{i=1}^{k+1} 1\!\!1\big \{\Vert r\theta +y_i\Vert > R_n\big \} {d}\mathbf {y}\,{d}\theta \, {d}r, \end{aligned}$$

where \(J(\theta ) = | {\frac{\partial x}{\partial \theta }}|\), and \(f(x) = f(\Vert x\Vert )\) by the spherical symmetry assumption. Set

$$\begin{aligned} G_k(r,\theta ) \triangleq \int _{({\mathbb {R}}^d)^{k+1}} f(\Vert r\theta +\mathbf {y}\Vert ) T_k(0,\mathbf {y}) \prod _{i=1}^{k+1} 1\!\!1\big \{\Vert r\theta +y_i\Vert > R_n\big \} {d}\mathbf {y}. \end{aligned}$$

Since \(T_k\) is rotation invariant, it is easy to show that for every \(\theta \in S^{d-1}\)

$$\begin{aligned} G_k(r,\theta ) = G_k(r,\mathrm {e}_1) \triangleq G_k(r). \end{aligned}$$

Thus,

$$\begin{aligned} I_k = s_{d-1}\int _{R_n}^\infty r^{d-1}f(r)G_k(r){d}r. \end{aligned}$$
(3.4)

This completes the proof for \(S_{k,n}\). The proof for \(\widehat{S}_{k,n}\) is similar.\(\square \)

In what follows, we shall use the following elementary limits:

  1. For every \(k > 0\),

    $$\begin{aligned} \lim _{n\rightarrow \infty }n^{-k} \left( {\begin{array}{c}n\\ k\end{array}}\right) = \frac{1}{k!} \end{aligned}$$
    (3.5)

  2. For every sequence \(a_n\rightarrow 0\) and \(k\ge 0\),

    $$\begin{aligned} \lim _{n\rightarrow \infty }\frac{(1-a_n)^{n-k}}{\mathrm{e}^{-na_n}} = 1 \end{aligned}$$
    (3.6)

3.3 Crackle: The Power Law Distribution

In this section we prove the results in Sect. 2.2. First, we need a few lemmas.

Lemma 3

If \(f=f_{\mathrm {p}}\), and \(R_n\rightarrow \infty \), then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n R_n^{d-\alpha }\big )^{-1} {\mathbb {E}}\big \{{S_{0,n}}\big \} = \mu _{\mathrm {p},0}, \end{aligned}$$

where \(\mu _{\mathrm {p},0}\) is defined in (2.1).

If, in addition, \(nR_n^{-\alpha }\rightarrow 0\), then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big ( n R_n^{d-\alpha } \big )^{-1}{\mathbb {E}}\{{\hat{S}_{0,n}}\} = \mu _{\mathrm {p},0}. \end{aligned}$$

Proof

From Lemma 1 we have that

$$\begin{aligned} {\mathbb {E}}\big \{{S_{0,n}}\big \} = s_{d-1}n \int _{R_n}^\infty r^{d-1}f(r){d}r. \end{aligned}$$

Making the change of variables \(r\rightarrow R_n \rho \) yields

$$\begin{aligned} {\mathbb {E}}\big \{{S_{0,n}}\big \}&= s_{d-1}n \int _1^\infty \frac{c_{\mathrm {p}}(R_n\rho )^{d-1}}{1+ (R_n\rho )^\alpha }R_n {d}\rho \\&= s_{d-1}c_{\mathrm {p}}n R_n^{d-\alpha } \int _1^\infty \frac{\rho ^{d-1}}{R_n^{-\alpha }+ \rho ^\alpha } {d}\rho . \end{aligned}$$

Applying the dominated convergence theorem to the previous integral gives

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (nR_n^{d-\alpha }\big )^{-1}{\mathbb {E}}\big \{{S_{0,n}}\big \} = s_{d-1}c_{\mathrm {p}}\int _1^\infty \rho ^{d-1-\alpha }\mathrm{d}\rho = \frac{s_{d-1}c_{\mathrm {p}}}{\alpha -d} = \mu _{\mathrm {p},0}. \end{aligned}$$

This proves the first part of the lemma.
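The dominated-convergence step is easy to check numerically; a quick sanity check of ours (with \(d=2\), \(\alpha =3\), so the limit is \(1/(\alpha -d)=1\)):

```python
import numpy as np
from scipy.integrate import quad

d, alpha = 2, 3.0
for R in [10.0, 100.0, 1000.0]:
    val, _ = quad(lambda r: r**(d - 1) / (R**(-alpha) + r**alpha), 1, np.inf)
    print(f"R = {R:6.0f}:  integral = {val:.6f}  (limit = {1/(alpha - d):.6f})")
```

As \(R\) grows the integral approaches \(1/(\alpha -d)\), as used in the display above.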

Next, from Lemma 1 we have that

$$\begin{aligned} {\mathbb {E}}\{{\hat{S}_{0,n}}\} = s_{d-1}n \int _{R_n}^\infty r^{d-1}f(r)(1-p(r\mathrm {e}_1))^{n-1}{d}r. \end{aligned}$$

The power term is bounded by \(1\) and therefore will not affect the conditions needed for dominated convergence. Thus, using (3.6), we only need to evaluate its limit.

$$\begin{aligned} p(r\mathrm {e}_1) = \int _{B_2(r\mathrm {e}_1)}f(z){d}z = \int _{B_2(0)}\frac{c_{\mathrm {p}}}{1+\Vert r\mathrm {e}_1+z\Vert ^\alpha }{d}z, \end{aligned}$$

and after the change of variables \(r\rightarrow R_n\rho \) we have

$$\begin{aligned} p(R_n\rho \mathrm {e}_1) =c_{\mathrm {p}}R_n^{-\alpha }\int _{B_2(0)} \frac{1}{R_n^{-\alpha }+\Vert \rho \mathrm {e}_1+R_n^{-1}z\Vert ^\alpha }{d}z. \end{aligned}$$

If \(nR_n^{-\alpha }\rightarrow 0\), then, by dominated convergence, we have

$$\begin{aligned} \lim _{n\rightarrow \infty }np(R_n\rho \mathrm {e}_1) =0. \end{aligned}$$

Thus,

$$\begin{aligned} \lim _{n\rightarrow \infty }(1-p(R_n\rho \mathrm {e}_1))^{n-1} = \lim _{n\rightarrow \infty }\mathrm{e}^{-np(R_n\rho \mathrm {e}_1)} =1, \end{aligned}$$

and therefore we have

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (nR_n^{d-\alpha }\big )^{-1}{\mathbb {E}}\{{\hat{S}_{0,n}}\} = \lim _{n\rightarrow \infty }\big (nR_n^{d-\alpha }\big )^{-1}{\mathbb {E}}\big \{{S_{0,n}}\big \} = \mu _{\mathrm {p},0}. \end{aligned}$$

This completes the proof of the second part of the lemma.\(\square \)

Lemma 4

If \(f=f_{\mathrm {p}}\), and \(R_n\rightarrow \infty \) then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n^{k+2} R_n^{d- \alpha (k+2)}\big )^{-1} {\mathbb {E}}\big \{{S_{k,n}}\big \} = \mu _{\mathrm {p},k}, \end{aligned}$$

where \(\mu _{\mathrm {p},k}\) is defined in (2.2). If, in addition, \(n R_n^{-\alpha } \rightarrow 0\), then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n^{k+2} R_n^{d-\alpha (k+2)}\big )^{-1} {\mathbb {E}}\{{\widehat{S}_{k,n}}\} = \mu _{\mathrm {p},k}. \end{aligned}$$

Proof

The proof is in the spirit of the proof of Lemma 3, but technically more complicated. From Lemma 2 we have that

$$\begin{aligned} {\mathbb {E}}\big \{{S_{k,n}}\big \} = \left( {\begin{array}{c}n\\ k+2\end{array}}\right) I_k, \end{aligned}$$

where

$$\begin{aligned} I_k = s_{d-1}\int _{R_n}^\infty r^{d-1}f(r)G_k(r){d}r. \end{aligned}$$

Making the change of variables \(r \rightarrow R_n\rho \) yields

$$\begin{aligned} I_k&= s_{d-1}R_n\int _1^\infty (R_n\rho )^{d-1} f(R_n\rho ) G_k(R_n\rho ) {d}\rho \\&= s_{d-1}c_{\mathrm {p}}^{k+2}(R_n)^{d-\alpha (k+2)}\int _1^\infty \int _{({\mathbb {R}}^d)^{k+1}} \frac{\rho ^{d-1}}{R_n^{-\alpha } + \rho ^\alpha }\prod _{i=1}^{k+1}\frac{1}{R_n^{-\alpha } +\Vert \rho \mathrm {e}_1+ R_n^{-1}y_i\Vert ^\alpha } \\&\quad \times T_k(0,\mathbf {y}) \prod _{i=1}^{k+1} 1\!\!1\big \{\Vert \rho \mathrm {e}_1+ R_n^{-1}y_i\Vert > 1\big \} {d}\mathbf {y}\, {d}\rho . \end{aligned}$$

Thus, using (3.5),

$$\begin{aligned}&(n^{k+2} R_n^{d-\alpha (k+2)})^{-1}{\mathbb {E}}\big \{{S_{k,n}}\big \} \\&\quad = \frac{s_{d-1}c_{\mathrm {p}}^{k+2}}{(k+2)!}\int _1^\infty \int _{({\mathbb {R}}^d)^{k+1}} \frac{\rho ^{d-1}}{R_n^{-\alpha } + \rho ^\alpha }\\&\quad \quad \times T_k(0,\mathbf {y}) \prod _{i=1}^{k+1} \frac{1}{R_n^{-\alpha } +\Vert \rho \mathrm {e}_1+ R_n^{-1}y_i\Vert ^\alpha }1\!\!1\big \{\Vert \rho \mathrm {e}_1+ R_n^{-1}y_i\Vert > 1\big \} {d}\mathbf {y}\, {d}\rho . \end{aligned}$$

It is easy to show that the integrand is bounded by an integrable term, so the dominated convergence theorem applies, yielding

$$\begin{aligned}&\lim _{n\rightarrow \infty }(n^{k+2} R_n^{d-\alpha (k+2)})^{-1}{\mathbb {E}}\big \{{S_{k,n}}\big \}\\&\quad = \frac{s_{d-1}c_{\mathrm {p}}^{k+2}}{(k+2)!}\int _1^{\infty }\rho ^{d-1-\alpha (k+2)}\mathrm{d}\rho \int _{({\mathbb {R}}^d)^{k+1}} T_k(0,\mathbf {y}){d}\mathbf {y}\\&\quad = \frac{s_{d-1}c_{\mathrm {p}}^{k+2}}{(\alpha (k+2)-d)(k+2)!}\int _{({\mathbb {R}}^d)^{k+1}} T_k(0,\mathbf {y}){d}\mathbf {y}\\&\quad =\mu _{\mathrm {p},k}. \end{aligned}$$

This proves the first part of the lemma.

Next, the terms \(G_k(r)\) and \(\hat{G}_k(r)\) in Lemma 2 differ only by the term \((1-p(r\mathrm {e}_1, r\mathrm {e}_1+\mathbf {y}))^{n-k-2}\), so dominated convergence still applies. Now,

$$\begin{aligned} p(r\mathrm {e}_1, r\mathrm {e}_1+\mathbf {y}) = \int _{U(r\mathrm {e}_1,r\mathrm {e}_1+\mathbf {y})} f(z){d}z = \int _{U(0,\mathbf {y})}f(r\mathrm {e}_1+z){d}z, \end{aligned}$$

and substituting \(r\rightarrow R_n\rho \) yields

$$\begin{aligned} p(R_n\rho \mathrm {e}_1, R_n\rho \mathrm {e}_1+\mathbf {y}) = c_{\mathrm {p}}R_n^{-\alpha }\int _{U(0,\mathbf {y})} \frac{1}{R_n^{-\alpha }+\Vert \rho \mathrm {e}_1+R_n^{-1}z\Vert ^\alpha }{d}z. \end{aligned}$$

If \(nR_n^{-\alpha }\rightarrow 0\), then using the dominated convergence we have

$$\begin{aligned} \lim _{n\rightarrow \infty }np(R_n\rho \mathrm {e}_1,R_n\rho \mathrm {e}_1+ \mathbf {y}) = 0. \end{aligned}$$

Thus,

$$\begin{aligned} \lim _{n\rightarrow \infty }\mathrm{e}^{-np(R_n\rho \mathrm {e}_1, R_n\rho \mathrm {e}_1+ \mathbf {y})} =1, \end{aligned}$$

and therefore, using (3.6),

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n^{k+2} R_n^{d-\alpha (k+2)}\big )^{-1}{\mathbb {E}}\{{\widehat{S}_{k,n}}\}&= \lim _{n\rightarrow \infty }\big (n^{k+2} R_n^{d-\alpha (k+2)}\big )^{-1}{\mathbb {E}}\{{S_{k,n}}\} \\&= \mu _{\mathrm {p},k}. \end{aligned}$$

This completes the proof of the second part of the lemma.\(\square \)

Lemma 5

If \(f=f_{\mathrm {p}}\), and \(R_n\rightarrow \infty \) then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n^{k+3} R_n^{d-\alpha (k+3)}\big )^{-1} {\mathbb {E}}\big \{{L_{k,n}}\big \} = \hat{\mu }_{\mathrm {p},k}, \end{aligned}$$

for some \(\hat{\mu }_{\mathrm {p},k}> 0\).

Proof

The proof is very similar to the proof of Lemma 4. We need only replace \(T_k\) with an indicator function that tests whether a sub-complex generated by \(k+3\) points is connected. The exact value of \(\hat{\mu }_{\mathrm {p},k}\) will not be needed anywhere.\(\square \)

We can now prove Theorem 2.

Proof of Theorem 2

To prove the limit for \(\beta _{0,n}\), simply combine Lemma 3 with the inequality (3.2). To prove the limit for \(\beta _{k,n}\), \(k\ge 1\), combine Lemmas 4 and 5 with the inequality (3.3); the \(L_{k,n}\) term is negligible in the limit, since \(\big (n^{k+3} R_n^{d-\alpha (k+3)}\big )\big /\big (n^{k+2} R_n^{d-\alpha (k+2)}\big ) = nR_n^{-\alpha }\rightarrow 0\).\(\square \)

3.4 Crackle: The Exponential Distribution

In this section we wish to prove Theorem 3. We start with the following lemmas.

Lemma 6

If \(f=f_{\mathrm {e}}\), and \(R_n\rightarrow \infty \) then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n R_n^{d-1} \mathrm{e}^{-R_n}\big )^{-1} {\mathbb {E}}\big \{{S_{0,n}}\big \} = \mu _{\mathrm {e},0}, \end{aligned}$$

where \(\mu _{\mathrm {e},0}\) is defined in (2.3).

If, in addition, \(n\mathrm{e}^{-R_n}\rightarrow 0\) then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n R_n^{d-1} \mathrm{e}^{-R_n}\big )^{-1} {\mathbb {E}}\{{\hat{S}_{0,n}}\} = \mu _{\mathrm {e},0}. \end{aligned}$$

Proof

From Lemma 1 we have that

$$\begin{aligned} {\mathbb {E}}\big \{{S_{0,n}}\big \} = s_{d-1}n \int _{R_n}^\infty r^{d-1}f(r){d}r. \end{aligned}$$

Using the change of variables \(r\rightarrow \rho + R_n\) yields

$$\begin{aligned} {\mathbb {E}}\big \{{S_{0,n}}\big \}&= s_{d-1}n \int _0^\infty (\rho +R_n)^{d-1}c_{\mathrm {e}}\mathrm{e}^{-(\rho +R_n)} {d}\rho \\&= s_{d-1}c_{\mathrm {e}}n R_n^{d-1}\mathrm{e}^{-R_n} \int _0^\infty \big (\frac{\rho }{R_n}+1\big )^{d-1}\mathrm{e}^{-\rho } {d}\rho . \end{aligned}$$

Applying dominated convergence to the last integral yields

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n R_n^{d-1} \mathrm{e}^{-R_n}\big )^{-1} {\mathbb {E}}\big \{{S_{0,n}}\big \} = s_{d-1}c_{\mathrm {e}}\int _0^\infty \mathrm{e}^{-\rho }\mathrm{d}\rho = s_{d-1}c_{\mathrm {e}}= \mu _{\mathrm {e},0}. \end{aligned}$$

This proves the first part of the lemma.

Next, from Lemma 1 we have that

$$\begin{aligned} {\mathbb {E}}\{{\hat{S}_{0,n}}\} = s_{d-1}n \int _{R_n}^\infty r^{d-1}f(r)(1-p(r\mathrm {e}_1))^{n-1}{d}r. \end{aligned}$$

The power term will not affect the dominated convergence conditions. Thus, we only need to evaluate its limit.

$$\begin{aligned} p(r\mathrm {e}_1) = \int _{B_2(r\mathrm {e}_1)}f(z){d}z = \int _{B_2(0)}c_{\mathrm {e}}\mathrm{e}^{-\Vert r\mathrm {e}_1+z\Vert }{d}z, \end{aligned}$$

and after the change of variables \(r\rightarrow \rho +R_n\) we have

$$\begin{aligned} p((\rho +R_n)\mathrm {e}_1) = \int _{B_2(0)} c_{\mathrm {e}}\mathrm{e}^{-\Vert (\rho +R_n)\mathrm {e}_1+ z\Vert }{ d}z \le \mathrm{e}^{-(R_n+\rho )} \int _{B_2(0)} c_{\mathrm {e}}\mathrm{e}^{\Vert z\Vert }{d}z. \end{aligned}$$

If \(n\mathrm{e}^{-R_n}\rightarrow 0\), then

$$\begin{aligned} \lim _{n\rightarrow \infty }np((\rho +R_n)\mathrm {e}_1) =0. \end{aligned}$$

Thus,

$$\begin{aligned} \lim _{n\rightarrow \infty }\mathrm{e}^{-np((\rho +R_n)\mathrm {e}_1)} =1, \end{aligned}$$

and therefore, using (3.6), we have

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n R_n^{d-1} \mathrm{e}^{-R_n}\big )^{-1} {\mathbb {E}}\{{\hat{S}_{0,n}}\} = \lim _{n\rightarrow \infty }\big (n R_n^{d-1} \mathrm{e}^{-R_n}\big )^{-1} {\mathbb {E}}\{{S_{0,n}}\} = \mu _{\mathrm {e},0}. \end{aligned}$$

This completes the proof of the second part of the lemma.\(\square \)

Lemma 7

If \(f=f_{\mathrm {e}}\), and \(R_n\rightarrow \infty \) then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n^{k+2} R_n^{d-1} \mathrm{e}^{-(k+2)R_n}\big )^{-1} {\mathbb {E}}\big \{{S_{k,n}}\big \} = \mu _{\mathrm {e},k}, \end{aligned}$$

where \(\mu _{\mathrm {e},k}\) is defined in (2.4).

If, in addition, \(n\mathrm{e}^{-R_n}\rightarrow 0\) then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n^{k+2} R_n^{d-1} \mathrm{e}^{-(k+2)R_n}\big )^{-1} {\mathbb {E}}\{{\widehat{S}_{k,n}}\} = \mu _{\mathrm {e},k}. \end{aligned}$$

Proof

From Lemma 2 we have that

$$\begin{aligned} {\mathbb {E}}\big \{{S_{k,n}}\big \} = \left( {\begin{array}{c}n\\ k+2\end{array}}\right) I_k, \end{aligned}$$

where

$$\begin{aligned} I_k = s_{d-1}\int _{R_n}^\infty r^{d-1}f(r)G_k(r){d}r. \end{aligned}$$

Making the change of variables \(r\rightarrow \rho + R_n\) yields

$$\begin{aligned} I_k&= s_{d-1}\int _0^\infty (\rho +R_n)^{d-1}f(\rho +R_n)G_k(\rho +R_n){d}\rho \\&= s_{d-1}c_{\mathrm {e}}^{k+2} \int _0^\infty \int _{({\mathbb {R}}^d)^{k+1}} (\rho +R_n)^{d-1} \mathrm{e}^{-(\rho +R_n)} \prod _{i=1}^{k+1} \mathrm{e}^{-\Vert (\rho +R_n)\mathrm {e}_1+ y_i\Vert } \\&\quad \times T_k(0,\mathbf {y}) \prod _{i=1}^{k+1} 1\!\!1\big \{\Vert (\rho +R_n)\mathrm {e}_1+y_i\Vert > R_n\big \} \mathrm{d}\mathbf {y}\, {d}\rho \\&= s_{d-1}c_{\mathrm {e}}^{k+2}\mathrm{e}^{-(k+2)R_n}R_n^{d-1} \int _0^\infty \int _{({\mathbb {R}}^d)^{k+1}} \big (\frac{\rho }{R_n}+1\big )^{d-1} \mathrm{e}^{-\rho } \\&\quad \times T_k(0,\mathbf {y}) \prod _{i=1}^{k+1} \mathrm{e}^{-\Vert (\rho +R_n)\mathrm {e}_1+ y_i\Vert } \mathrm{e}^{R_n}\, 1\!\!1\big \{\Vert (\rho +R_n)\mathrm {e}_1+y_i\Vert > R_n\big \} { d}\mathbf {y}\, {d}\rho . \end{aligned}$$

The last integral can be easily shown to satisfy the conditions of the dominated convergence theorem. In addition, it is easy to show that

$$\begin{aligned} \lim _{n\rightarrow \infty }\mathrm{e}^{-\Vert (\rho +R_n)\mathrm {e}_1+y_i\Vert }\mathrm{e}^{R_n} = \mathrm{e}^{-\big (\rho +\langle {\mathrm {e}_1,y_i}\rangle \big )} = \mathrm{e}^{-(\rho + y_i^1)}, \end{aligned}$$

where \(y_i^1\) is the first coordinate of \(y_i \in {\mathbb {R}}^d\), and also that

$$\begin{aligned} \lim _{n\rightarrow \infty }1\!\!1\big \{\Vert (\rho +R_n)\mathrm {e}_1+y_i\Vert > R_n\big \} = 1\!\!1\big \{y_i^1 \ge -\rho \big \}. \end{aligned}$$

Altogether, we have that

$$\begin{aligned}&\lim _{n\rightarrow \infty }\big (n^{k+2} R_n^{{d-1}} \mathrm{e}^{-(k+2)R_n}\big )^{-1} {\mathbb {E}}\big \{{S_{k,n}}\big \} \\&\quad = \frac{s_{d-1}c_{\mathrm {e}}^{k+2}}{(k+2)!} \int _0^\infty \int _{({\mathbb {R}}^d)^{k+1}} T_k(0,\mathbf {y}) \mathrm{e}^{-\big ((k+2)\rho + \sum _{i=1}^{k+1}y_i^1\big )} \prod _{i=1}^{k+1} 1\!\!1\big \{y_i^1 \ge -\rho \big \} {d}\mathbf {y}\,{d}\rho , \end{aligned}$$

proving the first part of the lemma.

Next, as in the proof of Lemma 4, we need to evaluate the term \(p(r\mathrm {e}_1, r\mathrm {e}_1+\mathbf {y})\).

$$\begin{aligned} p(r\mathrm {e}_1, r\mathrm {e}_1+\mathbf {y}) = \int _{U(0,\mathbf {y})}c_{\mathrm {e}}\mathrm{e}^{-\Vert r\mathrm {e}_1+z\Vert }{d}z \le \int _{U(0,\mathbf {y})}c_{\mathrm {e}}\mathrm{e}^{-(r-\Vert z\Vert )}{d}z. \end{aligned}$$

The change of variables \(r\rightarrow \rho +R_n\) yields

$$\begin{aligned} p((\rho +R_n)\mathrm {e}_1,(\rho +R_n)\mathrm {e}_1+\mathbf {y}) \le \mathrm{e}^{-R_n}\mathrm{e}^{-\rho } \int _{U(0,\mathbf {y})}c_{\mathrm {e}}\mathrm{e}^{\Vert z\Vert }{d}z. \end{aligned}$$

If \(n\mathrm{e}^{-R_n}\rightarrow 0\), then

$$\begin{aligned} \lim _{n\rightarrow \infty }n p((\rho +R_n)\mathrm {e}_1,(\rho +R_n)\mathrm {e}_1+\mathbf {y}) = 0. \end{aligned}$$

Thus,

$$\begin{aligned} \lim _{n\rightarrow \infty }\mathrm{e}^{-np((\rho +R_n)\mathrm {e}_1, (\rho +R_n)\mathrm {e}_1+ \mathbf {y})} =1, \end{aligned}$$

and therefore,

$$\begin{aligned}&\lim _{n\rightarrow \infty }\big (n^{k+2} R_n^{{d-1}} \mathrm{e}^{-(k+2)R_n}\big )^{-1} {\mathbb {E}}\{{\widehat{S}_{k,n}}\} \\&\quad \qquad = \lim _{n\rightarrow \infty }\big (n^{k+2} R_n^{{d-1}} \mathrm{e}^{-(k+2)R_n}\big )^{-1} {\mathbb {E}}\big \{{S_{k,n}}\big \}= \mu _{\mathrm {e},k}. \end{aligned}$$

This completes the proof.\(\square \)

Lemma 8

If \(f=f_{\mathrm {e}}\), and \(R_n\rightarrow \infty \) then

$$\begin{aligned} \lim _{n\rightarrow \infty }\big (n^{k+3} R_n^{d-1} \mathrm{e}^{-(k+3)R_n}\big )^{-1} {\mathbb {E}}\big \{{L_{k,n}}\big \} = \hat{\mu }_{\mathrm {e},k}, \end{aligned}$$

where \(\hat{\mu }_{\mathrm {e},k}> 0\).

Proof

As in the proof of Lemma 5, we mimic the proof of Lemma 7, replacing \(T_k\) with an indicator function that tests whether a sub-complex generated by \(k+3\) points is connected.\(\square \)

Proof of Theorem 3

The proof follows the same steps as the proof of Theorem 2.\(\square \)

3.5 Crackle: The Gaussian Distribution

In this section we prove Theorem 4.

Proof of Theorem 4

From Lemma 1 we have that

$$\begin{aligned} {\mathbb {E}}\big \{{S_{0,n}}\big \} = s_{d-1}n \int _{R_n}^\infty r^{d-1}f(r){d}r. \end{aligned}$$

Making the change of variables \(r \rightarrow (\rho ^2 + R_n^2)^{1/2}\), which implies \({d}r = \frac{\rho }{(\rho ^2+R_n^2)^{1/2}}\,{d}\rho \), we have

$$\begin{aligned} {\mathbb {E}}\big \{{S_{0,n}}\big \}&= {s_{d-1}c_{\mathrm {g}}n}\mathrm{e}^{-R_n^2/2}\int _{0}^\infty (\rho ^2+R_n^2)^{(d-2)/2}\rho \mathrm{e}^{-\rho ^2/2}{d}\rho \\&= s_{d-1}c_{\mathrm {g}}n \mathrm{e}^{-R_n^2/2}R_n^{d-2}\int _{0}^\infty \big (\big ({\rho }/{R_n}\big )^2+1\big )^{(d-2)/2}\rho \mathrm{e}^{-\rho ^2/2}{d}\rho . \end{aligned}$$

The integrand is bounded by an integrable function of \(\rho \), and applying dominated convergence we have

$$\begin{aligned} \lim _{n\rightarrow \infty }\big ({n \mathrm{e}^{-R_n^2/2} R_n^{d-2}}\big )^{-1}{\mathbb {E}}\big \{{S_{0,n}}\big \} = s_{d-1}c_{\mathrm {g}}. \end{aligned}$$

Taking \(R_n = R_{0,n}^\varepsilon \triangleq \sqrt{2 \log n + \big ({d-2}+\varepsilon \big ) \log \log n}\), we have

$$\begin{aligned} \mathrm{e}^{-R_n^2/2} = n^{-1}(\log n)^{-(d-2+\varepsilon )/2} \end{aligned}$$

and so

$$\begin{aligned} \lim _{n\rightarrow \infty }{n \mathrm{e}^{-R_n^2/2} R_n^{d-2}} = 0 \end{aligned}$$

which implies that

$$\begin{aligned} {\mathbb {E}}\big \{{S_{0,n}}\big \} \rightarrow 0. \end{aligned}$$

Finally, for every \(0 \le k \le d-1\),

$$\begin{aligned} \beta _{k,n}\le S_{0,n}. \end{aligned}$$

Therefore,

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb {E}}\big \{{\beta _{k,n}}\big \} = 0, \end{aligned}$$

completing the proof.\(\square \)