1 Introduction
2 Related work
3 Homology
3.1 Simplicial complexes and their homology
3.2 Building simplicial complexes
4 Persistent homology
4.1 Filtered complexes and homology
5 Computation of PH for data
5.1 Data
5.1.1 Networks
5.1.2 Digital images
5.1.3 Finite metric spaces
5.2 Filtered simplicial complexes
Complex
K
|
Size of
K
|
Theoretical guarantee
|
---|---|---|
Čech |
\(2^{\mathcal {O}(N)}\)
| Nerve theorem |
Vietoris–Rips (VR) |
\(2^{\mathcal {O}(N)}\)
| Approximates Čech complex |
Alpha |
\(N^{\mathcal {O}(\lceil d/2 \rceil)}\) (N points in \(\mathbb {R}^{d}\)) | Nerve theorem |
Witness |
\(2^{\mathcal {O}(|L|)}\)
| For curves and surfaces in Euclidean space |
Graph-induced complex |
\(2^{\mathcal {O}(|Q|)}\)
| Approximates VR complex |
Sparsified Čech |
\(\mathcal {O}(N)\)
| Approximates Čech complex |
Sparsified VR |
\(\mathcal {O}(N)\)
| Approximates VR complex |
5.2.1 Vietoris–Rips complex
5.2.2 The Delaunay complex
5.2.3 Alpha complex
5.2.4 Witness complexes
5.2.5 Additional complexes
Software
|
javaPlex
|
Perseus
|
jHoles
|
Dionysus
|
PHAT
|
DIPHA
|
Gudhi
|
SimpPers
|
Ripser
|
---|---|---|---|---|---|---|---|---|---|
(a) Language |
Java
|
C++
|
Java
|
C++
|
C++
|
C++
|
C++
|
C++
|
C++
|
(b) Algorithms for PH | standard, dual, zigzag | Morse reductions, standard | standard (uses javaPlex) | standard, dual, zigzag | standard, dual, twist, chunk, spectral sequence | twist, dual, distributed | dual, multifield | simplicial map | twist, dual |
(c) Coefficient field |
\(\mathbb{Q}\), \(\mathbb{F}_{p}\)
|
\(\mathbb{F}_{2}\)
|
\(\mathbb{F}_{2}\)
|
\(\mathbb{F}_{2}\) (standard, zigzag), \(\mathbb{F}_{p}\) (dual) |
\(\mathbb{F}_{2}\)
|
\(\mathbb{F}_{2}\)
|
\(\mathbb{F}_{p}\)
|
\(\mathbb{F}_{2}\)
|
\(\mathbb{F}_{p}\)
|
(d) Homology | simplicial, cellular | simplicial, cubical | simplicial | simplicial | simplicial, cubical | simplicial, cubical | simplicial, cubical | simplicial | simplicial |
(e) Filtrations computed | VR, W, \(\mathrm{W}_{\nu}\)
| VR, lower star of cubical complex | WRCF | VR, α, C̆ | - | VR, lower star of cubical complex | VR, α, W, lower star of cubical complex | - | VR |
(f) Filtrations as input | simplicial complex, zigzag, CW | simplicial complex, cubical complex | - | simplicial complex, zigzag | boundary matrix of simplicial complex | boundary matrix of simplicial complex | - | map of simplicial complexes | - |
(g) Additional features | Computes some homological algebra constructions, homology generators | weighted points for VR | - | vineyards, circle-valued functions, homology generators | - | - | - | - | - |
(h) Visualization | barcodes | persistence diagram | - | - | - | persistence diagram | - | - | - |
5.2.6 Reduction techniques
5.3 From a filtered simplicial complex to barcodes
-
a face of a simplex precedes the simplex;
-
a simplex in the ith complex \(K_{i}\) precedes simplices in \(K_{j}\) for \(j>i\), which are not in \(K_{i}\).
5.3.1 Standard algorithm
standard algorithm
for the computation of PH was introduced for the field \(\mathbb{F}_{2}\) in [60] and for general fields in [61]. For every \(j \in\{1,\dots, n\}\), we define \(\operatorname {low}(j)\) to be the largest index value i such that \(\delta (i,j)\) is different from 0.7 If column j only contains 0 entries, then the value of \(\operatorname {low}(j)\) is undefined. We say that the boundary matrix is reduced if the map low is injective on its domain of definition. In Algorithm 1, we illustrate the standard algorithm for reducing the boundary matrix. Because this algorithm operates on columns of the matrix from left to right, it is also sometimes called the ‘column algorithm.’ In the worst case, the complexity of the standard algorithm is cubic in the number of simplices.
5.3.2 Reading off the intervals
-
If \(\operatorname {low}(j)=i\), then the simplex \(\sigma_{j}\) is paired with \(\sigma_{i}\), and the entrance of \(\sigma_{i}\) in the filtration causes the birth of a feature that dies with the entrance of \(\sigma_{j}\).
-
If \(\operatorname {low}(j)\) is undefined, then the entrance of the simplex \(\sigma_{j}\) in the filtration causes the birth of a feature. It there exists k such that \(\operatorname {low}(k)=j\), then \(\sigma_{j}\) is paired with the simplex \(\sigma_{k}\), whose entrance in the filtration causes the death of the feature. If no such k exists, then \(\sigma_{j}\) is unpaired.
5.3.3 Other algorithms
twist algorithm
[126] and the dual algorithm
[72, 127]. (Note that the dual algorithm is known to give a speed-up when one computes PH with the VR complex, but not necessarily for other types of complexes (see also the results of our benchmarking for the vertebra data set in Additional file 1 of the SI).) Parallel algorithms in a shared setting include the spectral-sequence algorithm
(see Section VII.4 of [80]) and the chunk algorithm
[128]; parallel algorithms in a distributed setting include the distributed algorithm
[74]. The multifield algorithm
is a sequential algorithm that allows the simultaneous computation of PH over several fields [129].5.4 Statistical interpretation of topological summaries
5.5 Stability
6 Excursus: generalized persistence
zigzag algorithm
, for the computation of PH for inclusion maps that do not all go in the same direction, as, e.g., in the diagram
simplicial map algorithm
[156].7 Software
dual algorithm
[72, 127]. PHAT [62] is a library that implements several algorithms and data structures for the fast computation of barcodes, takes a boundary matrix as input, and is the first software to implement a matrix-reduction algorithm that can be executed in parallel. DIPHA [63], a spin-off of PHAT, implements a distributed computation of the matrix-reduction algorithm. Gudhi [67] implements new data structures for simplicial complexes and the boundary matrix. It also implements the multi-field algorithm
, which allows simultaneous computation of PH over several fields [129]. This library is currently under intense development, and a Python interface was just released in the most recent version of the library (namely, Version 2.0.0, whereas the version that we study in our tests is Version 1.3.1). The library ripser [68], the most recently developed software of the set that we examine, uses several optimizations and shortcuts to speed up the computation of PH with the VR complex. This library does not have an accompanying peer-reviewed publication. However, because it is currently the best-performing (both in terms of memory usage and in terms of wall-time seconds8) library for the computation of PH with the VR complex, we include it in our study. The library SimpPers [163] implements the simplicial map algorithm
. Libraries that implement techniques for the statistical interpretation of barcodes include the TDA Package [140] and the Persistence Landscape Toolbox [150]. (See Section 5.4 for additional libraries for the interpretation of barcodes.) RIVET, a package for visualizing 2-parameter persistent homology, is slated to be released soon [158]. We summarize the properties of the libraries for the computation of PH that we mentioned in this paragraph in Table 2, and we discuss the performance for a selection of them in Section 7.1.3 and in Additional file 1 of the SI. For a list of programs, see https://github.com/n-otter/PH-roadmap.7.1 Benchmarking
7.1.1 Data sets
gag
, pol
, and env
— of the HIV genome. We take \(1\mbox{,}088\) different genomic sequences and compute distances between them by using the Hamming distance. We use the aligned sequences studied using PH in [23]. (The authors of that paper retrieved the sequences from [169].) We denote this data set by HIV.7.1.2 Machines and compilers
Data set
|
(a) Computations on cluster: wall-time seconds
| |||||
---|---|---|---|---|---|---|
eleg
|
Klein
|
HIV
|
drag 2
|
random
|
genome
| |
Size of complex | 4.4 × 106
| 1.1 × 107
| 2.1 × 108
| 1.3 × 109
| 3.1 × 109
| 4.5 × 108
|
Max. dim. | 2 | 2 | 2 | 2 | 8 | 2 |
javaPlex (st) | 84 | 747 | - | - | - | - |
Dionysus (st) | 474 | 1,830 | - | - | - | - |
DIPHA (st) | 6 | 90 | 1,631 | 142,559 | - | 9,110 |
Perseus
| 543 | 1,978 | - | - | - | - |
Dionysus (d) | 513 | 145 | - | - | - | - |
DIPHA (d) | 4 | 6 | 81 | 2,358 | 5,096 | 232 |
Gudhi
| 36 | 89 | 1,798 | 14,368 | - | 4,753 |
Ripser
| 1 | 1 | 2 | 6 | 349 | 3 |
Data set
|
(b) Computations on cluster: CPU seconds
| |||||
---|---|---|---|---|---|---|
eleg
|
Klein
|
HIV
|
drag 2
|
random
|
genome
| |
Size of complex | 4.4 × 106
| 1.1 × 107
| 2.1 × 108
| 1.3 × 109
| 3.1 × 109
| 4.5 × 108
|
Max. dim. | 2 | 2 | 2 | 2 | 8 | 2 |
javaPlex (st) | 284 | 1,031 | - | - | - | - |
Dionysus (st) | 473 | 1,824 | - | - | - | - |
DIPHA (st) | 68 | 1,360 | 25,950 | 1,489,615 | - | 130,972 |
Perseus
| 542 | 1,974 | - | - | - | - |
Dionysus (d) | 513 | 145 | - | - | - | - |
DIPHA (d) | 39 | 73 | 1,276 | 37,572 | 79,691 | 3,622 |
Gudhi
| 36 | 88 | 1,794 | 14,351 | - | 4,764 |
Ripser
| 1 | 1 | 2 | 5 | 348 | 2 |
Data set
|
(c) Computations on shared-memory system: wall-time seconds
| |||||
---|---|---|---|---|---|---|
eleg
|
Klein
|
HIV
|
drag 2
|
genome
|
fract r
| |
Size of complex | 3.2 × 108
| 1.1 × 107
| 2.1 × 108
| 1.3 × 109
| 4.5 × 108
| 2.8 × 109
|
Max. dim. | 3 | 2 | 2 | 2 | 2 | 3 |
javaPlex (st) | 13,607 | 1,358 | 43,861 | - | 28,064 | - |
Perseus
| - | 1,271 | - | - | - | - |
Dionysus (d) | - | 100 | 142,055 | 35,366 | - | 572,764 |
DIPHA (d) | 926 | 13 | 773 | 4,482 | 1,775 | 3,923 |
Gudhi
| 381 | 6 | 177 | 1,518 | 442 | 4,590 |
Ripser
| 2 | 1 | 2 | 5 | 3 | 1,517 |
Data set
|
(a) Computations on cluster
| |||||
---|---|---|---|---|---|---|
eleg
|
Klein
|
HIV
|
drag 2
|
random
|
genome
| |
Size of complex | 4.4 × 106
| 1.1 × 107
| 2.1 × 108
| 1.3 × 109
| 3.1 × 109
| 4.5 × 108
|
Max. dim. | 2 | 2 | 2 | 2 | 8 | 2 |
javaPlex (st) | <5 | <15 | >64 | >64 | >64 | >64 |
Dionysus (st) | 1.3 | 11.6 | - | - | - | - |
DIPHA (st) | 0.1 | 0.2 | 2.7 | 4.9 | - | 4.8 |
Perseus
| 5.1 | 12.7 | - | - | - | - |
Dionysus (d) | 0.5 | 1.1 | - | - | - | - |
DIPHA (d) | 0.1 | 0.2 | 1.8 | 13.8 | 9.6 | 6.3 |
Gudhi
| 0.2 | 0.5 | 8.5 | 62.8 | - | 21.5 |
Ripser
| 0.007 | 0.02 | 0.06 | 0.2 | 24.7 | 0.07 |
Data set
|
(b) Computations on shared-memory system
| |||||
---|---|---|---|---|---|---|
eleg
|
Klein
|
HIV
|
drag 2
|
genome
|
fract r
| |
Size of complex | 3.2 × 108
| 1.1 × 107
| 2.1 × 108
| 1.3 × 109
| 4.5 × 108
| 2.8 × 109
|
Max. dim. | 3 | 2 | 2 | 2 | 2 | 3 |
javaPlex (st) | <600 | <15 | <700 | >700 | <700 | >700 |
Perseus
| - | 11.7 | - | - | - | - |
Dionysus (d) | - | 1.1 | 16.8 | 134.2 | - | 268.5 |
DIPHA (d) | 31.2 | 0.9 | 17.7 | 109.5 | 37.3 | 276.1 |
Gudhi
| 15.4 | 0.5 | 10.2 | 62.8 | 21.4 | 134.8 |
Ripser
| 0.2 | 0.03 | 0.07 | 0.2 | 0.07 | 155 |
7.1.3 Tests and results
time
function in Linux. We report computation times with a precision of one second; if a computation took only fractions of a second, we report ‘one second’ as the computation time. For space reasons, we report results for a subset of the computations. (In Additional file 1 of the SI, we tabulate the rest of our computations.) In Table 4, we report the memory used by the processes in terms of maximum resident set size (RSS); in other words, we give the maximum amount of real RAM a program has used during its execution. We measure the maximum RSS using the getrusage
function in Linux. The header file that we use to measure memory is available at https://github.com/n-otter/PH-roadmap. In DIPHA, the measurement of memory is already implemented by the authors of the software. They also use the getrusage
function in Linux. The package javaPlex is written in Java, and we thus cannot measure its memory as we do for the other packages. However, one can infer memory requirements for this software package using the value of the maximal heap size necessary to perform the computations; we report this value in Table 4. In Table 5, we give the maximum size of the simplicial complex for which we were able to compute PH with each software package in our benchmarkings.
(a)
|
javaPlex
|
Dionysus
|
DIPHA
|
Perseus
|
Gudhi
|
Ripser
| ||
---|---|---|---|---|---|---|---|---|
st
|
st
|
d
|
st
|
d
|
st
|
d
|
d
| |
(b) | 4.5⋅108
| 1.1⋅107
| 2.8 × 109
| 1.3⋅109
| 3.4⋅109
| 1⋅107
| 3.4⋅109
| 3.4⋅109
|
7.2 Conclusions from our benchmarking
Data type
|
Complex
|
Suggested software
|
---|---|---|
networks | WRCF |
jHoles
|
image data | cubical |
Gudhi or DIPHA (st) |
distance matrix | VR |
Ripser
|
distance matrix | W |
javaPlex
|
points in Euclidean space | VR |
Gudhi
|
points in Euclidean space | Č |
Dionysus
|
points in Euclidean space |
α (only in dim 2 and 3) |
Dionysus ((st) in dim 2, (d) in dim 3) or Gudhi
|