Introduction
Background
Symbol | Definition |
---|---|
π1(X,p) | The fundamental group of X with basepoint p |
[c] | Homotopy class of loop c |
\(\mathcal{L}, \mathcal{L}'\) | Length spectrum function, relaxed length spectrum |
Conv(X) | If X is a graph, Conv(X) is its 2-core |
G=(V,E) | An undirected graph with node set V and edge set E |
n, m | Number of nodes and number of edges of a graph G |
e, e−1 | A directed edge e=u→v and its inverse e−1=v→u |
NBC | Non-backtracking cycle, in which no edge is followed by its inverse |
B | 2m×2m non-backtracking matrix of a graph |
B′ | 2n×2n matrix whose eigenvalues are the same as those of B, save for ±1 |
λk=ak+ibk | k-th largest eigenvalue, in magnitude, of B |
Re(λ), Im(λ) | Real and imaginary parts of the complex number λ |
P, Q | n×2m directed incidence matrices of a graph |
pk | The fraction of a graph's nodes with degree k |
γ | The degree exponent in the case when pk∼k−γ |
〈ki〉 | i-th moment of the degree distribution pk |
deg(u) | Degree of node u |
nnz(A) | Number of non-zero elements of binary matrix A |
NBD(G,H) | Non-backtracking distance between graphs G and H |
d({λk},{μk}) | Distance between eigenvalues λk and eigenvalues μk |
r | Number of eigenvalues computed from one graph |
\(\rho = \sqrt{\lambda_{1}}\) | Magnitude threshold for eigenvalue computation |
r0 | Number of eigenvalues whose magnitude is greater than ρ |
σ | Spread parameter of the RBF kernel |
Length Spectrum
Non-Backtracking Cycles
Related work
Operationalizing the length spectrum
Relaxed Length Spectrum
Computing B
Spectral properties of B
Eigenvalues: visualization
data set | network | n | 〈k〉 | \(\bar{c}\) |
---|---|---|---|---|
Online social (social) | Facebook | 4k | 43.69 | 0.61 |
 | sdot8 | 77k | 14.13 | 0.06 |
 | sdot9 | 82k | 14.18 | 0.06 |
 | Epinions | 76k | 10.69 | 0.14 |
 | Twitter | 81k | 33.02 | 0.57 |
 | Wiki | 7k | 28.32 | 0.14 |
Coauthorship (CA) | AstroPh | 19k | 21.11 | 0.63 |
 | CondMat | 23k | 8.08 | 0.63 |
 | GrQc | 5k | 5.53 | 0.53 |
 | HepPh | 12k | 19.74 | 0.61 |
 | HepTh | 10k | 5.26 | 0.47 |
Peer-to-peer file-sharing (P2P) | P2P04 | 11k | 7.35 | 0.01 |
 | P2P05 | 9k | 7.20 | 0.01 |
 | P2P06 | 9k | 7.23 | 0.01 |
 | P2P08 | 6k | 6.59 | 0.01 |
 | P2P09 | 8k | 6.41 | 0.01 |
 | P2P24 | 27k | 4.93 | 0.01 |
 | P2P25 | 23k | 4.82 | 0.01 |
 | P2P30 | 36k | 4.82 | 0.01 |
 | P2P31 | 63k | 4.73 | 0.01 |
Autonomous systems (AS) | AS331 | 11k | 4.12 | 0.30 |
 | AS407 | 11k | 4.10 | 0.29 |
 | AS414 | 11k | 4.16 | 0.30 |
 | AS421 | 11k | 4.19 | 0.30 |
 | AS428 | 11k | 4.13 | 0.29 |
 | AS505 | 11k | 4.13 | 0.29 |
 | AS512 | 11k | 4.12 | 0.29 |
 | AS519 | 11k | 4.11 | 0.29 |
 | AS526 | 11k | 4.19 | 0.30 |
 | BA-1k | 1k | 13.90 | 0.05 |
 | CM-1k | 1k | 10.20 | 0.17 |
 | ER-1k | 1k | 14.99 | 0.15 |
 | HG-1k | 997±2 | 14.72 | 0.82 |
 | KR-1k | 1021±1 | 14.54 | 0.03 |
 | WS-1k | 1k | 14.00 | 0.67 |
 | BA-5k | 5k | 13.98 | 0.02 |
 | CM-5k | 5k | 11.40 | 0.13 |
 | ER-5k | 5k | 15.00 | 0.00 |
 | HG-5k | 4982±8 | 14.67 | 0.82 |
 | KR-5k | 4089±3 | 18.27 | 0.01 |
 | WS-5k | 5k | 14.00 | 0.67 |
 | BA-10k | 10k | 13.99 | 0.01 |
 | CM-10k | 10k | 11.97 | 0.12 |
 | ER-10k | 10k | 14.99 | 0.00 |
 | HG-10k | 9958±15 | 14.96 | 0.83 |
 | KR-10k | 8170±11 | 18.32 | 0.01 |
 | WS-10k | 10k | 14.00 | 0.67 |
NBD: Non-backtracking distance
Computing NBD
Choice of distance and dependence on r
- The Euclidean metric computes the Euclidean distance between the vectors whose elements are the eigenvalues sorted in order of magnitude: \(\text{Euclidean}\left(\{\lambda_{i}\}, \{\mu_{i}\}\right) = \sqrt{\sum_{i} |\lambda_{i} - \mu_{i}|^{2}}\), with |λi|≥|λi+1| and |μi|≥|μi+1| for all i. If two eigenvalues have the same magnitude, we use their real parts to break ties. Computing this distance takes O(r) operations, where r is the number of eigenvalues.
- The EMD metric computes the Earth Mover Distance (a.k.a. Wasserstein distance) (Rubner et al. 1998), which is best understood as follows. Write λk=ak+ibk and μk=ck+idk for all k. Consider the sets \(\left\{(a_{k}, b_{k})\right\}_{k=1}^{r}\) and \(\left\{(c_{k}, d_{k})\right\}_{k=1}^{r}\) as subsets of \(\mathbb{R}^{2}\), and imagine a point mass with weight 1/r at each of these points. Intuitively, EMD measures the minimum amount of work needed to move all point masses at the points (ak,bk) to the points (ck,dk). This distance requires the computation of the (Euclidean) distance between every pair of eigenvalues and thus takes O(r²) runtime.
- The Hausdorff metric is a standard distance between compact sets in a metric space (Munkres 2000); it is defined as
$$ \text{Hausdorff}(\{\lambda_{i}\}, \{\mu_{i}\}) = \max\left(\max_{i} \min_{j} |\lambda_{i} - \mu_{j}|, \max_{j} \min_{i} |\lambda_{i} - \mu_{j}| \right). $$ (5)
As above, this computation takes O(r²) runtime. (A sketch of all three distances follows this list.)
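To make these definitions concrete, here is a minimal sketch (not the authors' implementation) of the three distances, assuming each spectrum is stored as a one-dimensional complex NumPy array of length r:

```python
# Hedged sketch of the three eigenvalue distances described above.
# `lam` and `mu` are assumed to be 1-D complex arrays of equal length r.
import numpy as np
from scipy.optimize import linear_sum_assignment

def euclidean_distance(lam, mu):
    """Sort by magnitude (real part breaks ties), then take the 2-norm of the difference."""
    order = lambda z: np.array(sorted(z, key=lambda x: (abs(x), x.real), reverse=True))
    lam, mu = order(lam), order(mu)
    return np.sqrt(np.sum(np.abs(lam - mu) ** 2))

def emd_distance(lam, mu):
    """EMD between point masses of weight 1/r placed at each eigenvalue in the plane.
    With equally many, equally weighted points, EMD reduces to an assignment problem."""
    lam, mu = np.asarray(lam), np.asarray(mu)
    cost = np.abs(lam[:, None] - mu[None, :])   # pairwise distances, O(r^2)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

def hausdorff_distance(lam, mu):
    """Symmetric Hausdorff distance between the two finite sets of eigenvalues (Eq. 5)."""
    lam, mu = np.asarray(lam), np.asarray(mu)
    cost = np.abs(lam[:, None] - mu[None, :])
    return max(cost.min(axis=1).max(), cost.min(axis=0).max())
```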
We compare the behavior of these three distances on the following data sets: HG, KR, P2P, AS, and CA (see Table 2 and the "Data sets and base lines" section for a full description of the data sets). We report the results in Fig. 5, where we plot the average of the three distances for each data set across several repetitions of the experiment (10 repetitions for random graphs and CA, 36 for P2P and AS), as well as the median value of r after which the eigenvalues have magnitude less than ρ; we call this value r0.

Hausdorff has high variance and a somewhat erratic behavior on random graphs. Euclidean has the least variance and the most predictable behavior, continuing to increase as the value of r grows. In the KR data sets, Hausdorff presents an "elbow" near r0, after which it levels off without much change for increasing values of r. Interestingly, the plots suggest that, in the case of EMD, the steepest decrease usually occurs before r0. Also of note is the fact that on HG, all distances have a slight increase for large values of r. Since Hausdorff does not appear to have a predictable behavior, and Euclidean continues to increase with no discernible inflection point even when using the largest r=1000 eigenvalues, we favor EMD when comparing non-backtracking eigenvalues of two graphs. Furthermore, EMD has a predictable "elbow" behavior that happens near r0, so we conclude that r0 is an appropriate choice for the number of eigenvalues to use. In the following, we use EMD as the distance d and r0 as the value of r for all our experiments.

NBD: Experiments
Data sets and base lines
We use random graphs generated from six models: Erdős-Rényi (ER) (Erdös and Rényi 1960; Bollobás 2001), Barabási-Albert (BA) (Barabási and Albert 1999), Stochastic Kronecker Graphs (KR) (Leskovec et al. 2010; Seshadhri et al. 2013), Configuration Model with power-law degree distribution (CM) (Newman 2003), Watts-Strogatz (WS) (Watts and Strogatz 1998), and Hyperbolic Graphs (HG) (Krioukov et al. 2010; Aldecoa et al. 2015). We generated CM and HG with degree distribution pk∼k−γ with exponent γ=2.3, and WS with rewiring probability p=0.01. All graphs have approximate average degree 〈k〉=15. From each model, we generate three batches, each with 50 graphs: the first with n=1,000 nodes, the second with n=5,000, and the third with n=10,000. This yields 6×3 data sets that we refer to by model and graph size (see Table 2).
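For the models that have standard NetworkX generators, the sketch below shows how such batches could be produced. The parameter values (e.g., m=7 for BA to approximate 〈k〉=15) are illustrative assumptions, and the CM, HG, and KR graphs require dedicated generators not shown here.

```python
# Hedged sketch of generating some of the random graph batches with networkx.
import networkx as nx

def generate_batch(model, n, n_graphs=50, seed=0):
    graphs = []
    for i in range(n_graphs):
        s = seed + i
        if model == "ER":                              # G(n, p) with expected degree ~15
            g = nx.erdos_renyi_graph(n, 15 / (n - 1), seed=s)
        elif model == "BA":                            # average degree ~2m
            g = nx.barabasi_albert_graph(n, 7, seed=s)
        elif model == "WS":                            # ring of degree 14, rewiring prob 0.01
            g = nx.watts_strogatz_graph(n, 14, 0.01, seed=s)
        else:
            raise ValueError(f"no generator shown for model {model}")
        graphs.append(g)
    return graphs
```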
We also use real networks, divided into four different data sets: social, CA, AS, and P2P. social contains six networks obtained from human online social networks (facebook, twitter, sdot8, sdot9, epinions, and wiki) (McAuley and Leskovec 2012; Leskovec et al. 2010; Leskovec et al. 2009; Richardson et al. 2003), CA contains five co-authorship networks (AstroPh, CondMat, GrQc, HepPh, HepTh) obtained from the arXiv pre-print server (Leskovec et al. 2007), P2P contains nine snapshots of the peer-to-peer connections of a file-sharing network (Leskovec et al. 2007), and AS contains nine snapshots of the Internet autonomous systems infrastructure (Leskovec et al. 2005). Note that the networks within P2P and within AS are temporal snapshots of the same underlying network, and thus we may assume that they have been generated using the same growth process. The same is not true for social and CA: even though they contain similar kinds of networks, their generating mechanisms may be different. The real networks were obtained through SNAP (Leskovec and Krevl 2014) and ICON (Clauset et al.). In all, we have 929 different networks (900 random, 29 real), comprising 22 different data sets (6×3 random, 4 real).

NBD groups the graphs within P2P and within AS close together, and it detects less of a difference between data sets social and CA, which points toward the homogeneity of connectivity patterns in human-generated networks.
We compare NBD against three other graph-comparison methods. The first is ESCAPE (Pinar et al. 2017), which, given a graph G, computes a 98-dimensional vector v of motif counts, i.e., vi is the number of times that the i-th motif appears in G for i=1,...,98. To compare two graphs we then compute the cosine distance between their ESCAPE feature vectors. The second method is Graphlet Correlation Distance (Yaveroğlu et al. 2014), or GCD for short, which characterizes a graph with a so-called graphlet correlation matrix, an 11×11 matrix whose entries are the correlations between the number of per-node occurrences of 11 small motifs. The distance between two graphs is then the Frobenius norm of the difference of their graphlet correlation matrices. The third method is taken from Schieber et al. (Schieber et al. 2017), and will be referred to here as S. Distance measure S depends on the distribution of shortest-path distances between each pair of nodes in a graph. In this way, we are comparing NBD, which depends on the number and length of non-backtracking cycles, to other distance methods that also depend on the occurrence of a particular type of subgraph: ESCAPE and GCD depend on motif counts, while distance S depends on lengths of shortest paths. As a baseline, we use the 1-dimensional EMD between the r=300 largest Laplacian eigenvalues of each graph; we call this baseline Lap. We report the performance of two versions of NBD: NBD-300 uses the largest r=300 non-backtracking eigenvalues of each graph, while NBD-ρ uses r=min{300, r0}, where r0 is the number of eigenvalues whose magnitude is greater than \(\rho = \sqrt{\lambda_{1}}\), as explained before. We use a maximum of 300 because most of our networks have r0<300; see Fig. 7 for values of r0 for all our data sets. This means that NBD-ρ sometimes uses vastly different numbers of eigenvalues to compare two networks. For example, when comparing a BA-10k graph to twitter, we are comparing approximately 15 eigenvalues versus min{300, r0=1393}=300. Note that EMD handles this difference naturally.
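A minimal sketch (not the authors' exact code) of how NBD-ρ could choose the number of eigenvalues r=min{300, r0}:

```python
# Hedged sketch of selecting r for NBD-rho from the non-backtracking eigenvalues.
import numpy as np

def choose_r(eigs, r_max=300):
    """eigs: complex non-backtracking eigenvalues of one graph."""
    mags = np.abs(eigs)
    rho = np.sqrt(mags.max())       # threshold rho = sqrt(lambda_1)
    r0 = int(np.sum(mags > rho))    # eigenvalues with magnitude above rho
    return min(r_max, r0)
```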
Clustering networks
To set the spread parameter σ of the RBF kernel, we compute the average distance among all pairs of ER graphs, the average distance among all pairs of BA graphs, and so on for each data set of random graphs, and average all of these together (von Luxburg 2007). Note that this yields a different σ for each distance method, and that we do not use graphs from real data sets to tune this parameter. We program the algorithm to find the exact number of ground-truth clusters. We report the results of this experiment using all graphs (22 clusters), only real graphs (4 clusters), and only random graphs (18 clusters).

Distances S (v-measure 0.83) and GCD perform well across data sets; however, numerical methods for computing eigenvalues are orders of magnitude faster than the state-of-the-art methods for computing all shortest-path distances (needed to compute S) or per-node motif counts (needed for GCD); see the "Runtime comparison" section. ESCAPE performs well in completeness but not in homogeneity on random graphs, and extremely poorly on real graphs (v-measure 0.27).
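The RBF kernel and the von Luxburg (2007) reference suggest a spectral-clustering pipeline; a hedged sketch of how a precomputed distance matrix could be turned into cluster labels and a v-measure score follows (the exact kernel form and the use of scikit-learn are assumptions, not the authors' code):

```python
# Hedged sketch of clustering graphs from a precomputed pairwise distance matrix D.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import v_measure_score

def cluster_and_score(D, sigma, n_clusters, true_labels):
    affinity = np.exp(-D ** 2 / (2 * sigma ** 2))   # RBF kernel on distances (assumed form)
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                random_state=0).fit_predict(affinity)
    return v_measure_score(true_labels, labels)
```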
From these results we conclude (i) that GCD performs better than ESCAPE across data sets, (ii) that comparing length distributions of paths or cycles seems to yield better performance than motif counting, because NBD and S perform better than the others in general, and (iii) that non-backtracking eigenvalues provide slightly better performance at distinguishing between real and synthetic graphs than other combinatorial methods, as well as other spectral methods such as Lap. We highlight that NBD is in general also more efficient than the combinatorial methods, while being only slightly more time-consuming than the spectral method Lap; see the next section.

Runtime comparison
Each method comprises two steps: computing a set of statistics from each graph, and then comparing those statistics. For NBD and Lap, the first step computes eigenvalues of the corresponding matrices, while GCD and ESCAPE compute motif counts, and S computes the shortest-path distance between each pair of nodes in the graph. The bulk of the computational time of all methods is spent in the computation of these statistics, rather than in the comparison. Therefore, in this section we compare the runtime of this first step only. Another reason to focus on the runtime of this first step is that it depends on one single graph (in all cases, the computation of the corresponding statistics is done on each graph in parallel), and thus we can focus on investigating how the structure of a single graph affects computation time. All experiments were conducted on a computer with 16 cores and 100 GB of RAM with Intel Skylake chip architecture, with no other jobs running in the background. Note that all distance algorithms are implemented in Python, except for ESCAPE, which is implemented in C++. Thus we focus on comparing the four algorithms that are implemented in Python. For big-O complexity analyses we refer the interested reader to the corresponding references for each algorithm.

The left column shows how varying the number of eigenvalues r computed for NBD and Lap affects their runtime on graphs with n=1000 and 〈k〉=15. The center column shows how varying the number of nodes n affects the runtime of all five distances while holding r=10 and 〈k〉=15 fixed. The right column shows how varying 〈k〉 affects the runtime while keeping r=10 and n=1000 fixed. Note that all panels are presented on a log-linear scale.
In the left column we see that NBD is slower than Lap, as claimed in the "Computing B" section, because the former uses the non-backtracking matrix, which is always four times the size of the Laplacian matrix. However, their runtimes do not differ by only a constant factor in the cases of ER and HG. We hypothesize that this is because the non-backtracking matrix has a more complicated structure than the Laplacian, depending on the structure of the underlying graph (see the "Spectral properties of B" section and especially Lemma 3). In all other panels we see that NBD and Lap present similar scaling behavior, but NBD is usually slower. The middle column presents the effect of varying the number of nodes in the graph on all five distance methods. NBD is consistently the third fastest method for HG and BA graphs, after ESCAPE and Lap, while GCD is consistently the slowest, by two or three orders of magnitude in some cases. Note that the scaling behavior of all methods is fairly similar between BA and HG graphs, while for ER it is markedly different. When varying the number of nodes of ER graphs, for example, NBD and GCD seem to scale at the same rate and take about the same runtime, while S scales more poorly than all others. In all other panels, GCD seems to scale more poorly than the other methods.

Nearest neighbors
In the resulting dendrogram (Fig. 10), the networks in P2P and AS are perfectly clustered with each other in one cluster, while five networks of the social data set (all except twitter) are clustered with each other, although they are broken up into two clusters. Data set CA is also broken up into two different clusters: one with GrQc, CondMat, and HepTh, while HepPh and AstroPh lie in their own cluster. To test the robustness of these clusters, we ran the same experiment (i) leaving one data set out each time, and (ii) randomly choosing the ordering of the input data sets. These variations are made to address the shortcomings of hierarchical clustering, as some implementations may break ties in arbitrary ways. Neither of these variations affected the clusters reported in Fig. 10.
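A hedged sketch of how such a dendrogram could be built from the pairwise NBD matrix with SciPy; the linkage method is an assumption, as it is not stated above:

```python
# Illustrative hierarchical clustering of the real networks from pairwise NBD values.
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def build_dendrogram(D, labels):
    """D: symmetric matrix of pairwise NBD values; labels: network names."""
    condensed = squareform(D, checks=False)    # condensed (upper-triangular) form
    Z = linkage(condensed, method="average")   # assumed linkage criterion
    return dendrogram(Z, labels=labels, no_plot=True)
```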
We now examine the nearest neighbors of three networks: facebook, AstroPh, and twitter. We show the nearest networks among all networks, including random graphs, as well as the nearest real networks only. In the case of facebook, we see that, even when including random networks, epinions is the network closest to it. This is perhaps surprising since these two networks have substantially different values for the number of nodes, average degree, and average clustering coefficient (see Table 2). The networks closest to facebook, after epinions, all come from the HG random graph model, which is expected because this model has been shown to produce heavy-tailed degree distributions with high clustering, which both facebook and epinions present as well. When considering only real networks, all but one of the social networks are within the 10 nearest networks to facebook. In the case of AstroPh, the closest network to it is another coauthorship network, CondMat. As in the previous case, it is close to many random graphs of the HG model. Among the real networks, all other coauthorship graphs are in the nearest 10. Finally, twitter is the network that lies farthest from all others. This may be due to the fact that it has substantially more edges than all the others: it has 1.3M edges, while the second largest, epinions, has only 748k. We see that the closest random networks to twitter all come from the WS model. This may be due to the fact that both twitter and WS have many large eigenvalues with zero imaginary part, so the earth mover distance between them will be small. We also observe that, even though twitter is far from all the networks in the dendrogram, all of the social graphs are within the 10 nearest networks to it.
For each of facebook, AstroPh, and twitter, we show the 10 networks closest to it, as determined by NBD-ρ, and report the distance score in parentheses.

rank | facebook | AstroPh | twitter |
---|---|---|---|
1 | epinions (4.08) | CondMat (11.33) | AstroPh (31.38) |
2 | HepPh (5.26) | epinions (13.72) | CondMat (40.11) |
3 | sdot9 (6.05) | facebook (14.18) | epinions (41.49) |
4 | sdot8 (6.24) | GrQc (14.71) | facebook (42.45) |
5 | wiki (6.56) | HepTh (14.84) | HepTh (42.60) |
6 | AS414 (8.98) | sdot9 (16.96) | HepPh (44.30) |
7 | AS505 (9.03) | HepPh (16.97) | GrQc (44.54) |
8 | AS512 (9.05) | sdot8 (17.37) | sdot9 (44.66) |
9 | AS421 (9.06) | AS407 (17.62) | wiki (45.64) |
10 | AS407 (9.16) | AS414 (17.68) | sdot8 (45.88) |
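Rankings like those in this table can be read off the pairwise distance matrix directly; a minimal sketch with a hypothetical helper (not the authors' code):

```python
# Hedged sketch: return the k networks closest to a query network under a
# precomputed distance matrix D whose rows/columns are ordered as in `names`.
import numpy as np

def nearest_networks(D, names, query, k=10):
    i = names.index(query)
    order = np.argsort(D[i])                     # ascending distance
    order = [j for j in order if j != i][:k]     # drop the query itself
    return [(names[j], float(D[i, j])) for j in order]
```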
In sum, the two data sets of temporal snapshots – the Internet autonomous systems (AS) and automatically generated peer-to-peer file-sharing networks (P2P) – are perfectly clustered together, and all online social graphs are ranked near each other, even when some of these have vastly different elementary statistics such as number of nodes, number of edges, and average clustering coefficient.

Rewiring edges
For this experiment, we rewire an increasing percentage of the edges of four networks – AS331, P2P09, facebook, and HepPh – and compute the NBD between each network and its rewired versions. We show our results in Fig. 11.
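A hedged sketch of the rewiring procedure, assuming degree-preserving double edge swaps (consistent with the observation below that heavily rewired graphs resemble configuration model graphs with the same degree distribution):

```python
# Illustrative degree-preserving rewiring of roughly a fraction p of the edges of G.
import networkx as nx

def rewire_fraction(G, p, seed=0):
    H = G.copy()
    n_swaps = int(p * H.number_of_edges() / 2)   # each swap rewires two edges
    nx.double_edge_swap(H, nswap=n_swaps, max_tries=100 * n_swaps + 1, seed=seed)
    return H
```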
The NBD curves for AS331 and P2P09 seem to plateau when the percentage of rewired edges is 65% or larger, while facebook and HepPh have a slight jump close to 70%. Observe too that the variance of the NBD is kept fairly constant, which indicates that the distributions of non-backtracking cycle lengths in all rewired versions of each graph are very similar to each other. This may be due to the fact that edge rewiring destroys short cycles more quickly than long cycles. Furthermore, observe that for high percentages of rewired edges, the rewired graph is essentially a configuration model graph with the same degree distribution as the original. Therefore, by comparing the NBD values from Figs. 10 and 11, we can conclude that the NBD between these four networks and a random graph with the same degree distribution is larger than the average distance to other networks in the same data set (cf. Fig. 10 and the discussion in the previous section). We conclude that NBD is detecting graph structural properties – stored in the distribution of non-backtracking cycle lengths and encoded in the non-backtracking eigenvalues – that are not solely determined by the degree distribution of the graph.

NBED: Non-backtracking embedding dimensions
Visualization
In every plot except that of WS-5k, we see small sets of dots (red circles) that are markedly separated from the bulk of the dots. Every one of these sets is made up of the embeddings of the directed edges k→l for a fixed l, and the horizontal position of this set corresponds to the degree of l. That is, the larger the degree of a fixed node l, the more likely all the embeddings of edges of the form k→l are to cluster together, separate from the rest, and the further right this cluster will be. The sets in the plots of BA-5k and KR-5k are much more vertically spread than those of CM-5k or HG-5k. We conclude that each of these sets has a particular internal structure that correlates with global graph structure, as it is repeated across sets. In ER-5k, the internal structure of each set is less well marked, owing to the fact that there are no global structural patterns in the ER model.

In the plots of BA-5k, ER-5k, HG-5k, and KR-5k, we find that the dots inhabit a certain region of the plane that is bounded by a roughly parabolic boundary (gray dashed lines). Furthermore, the cusp of this boundary always lies precisely at the origin. That is, there are never dots to the left of the origin, and the vertical spread of the dots grows at roughly the same rate in both the upper and lower half-planes. Since the vertical axis corresponds to the entries of the second eigenvector v2, which has been shown to describe community structure, we conjecture that the significance of this boundary is related to the existence of communities in the network. (See the "Case Study: Enron emails" section for more discussion on this point.)

The plot for ER-5k looks like a noisy point cloud because the degree distribution is Poisson centered around 〈k〉=15, and the plot for WS-5k is very homogeneous in color because all nodes start with the same degree k=14 and only a few of them are rewired. Furthermore, if one compares the color distributions of the plots for BA-5k, CM-5k, and HG-5k, one can tell that even though all have heavy-tailed degree distributions, their exponents must be different. Indeed, for BA-5k we have γ=3 while the other two were generated with γ=2.3.

We now take KR-5k as a case study. First, it presents small sets of dots that are markedly separate from the bulk, and whose structure is similar to those of BA-5k. Second, its color distribution is closer to that of ER-5k than to any of the others. Third, it exhibits the boundary effect that we see in both BA-5k and ER-5k. From these observations we conclude that the stochastic Kronecker graph model is a "mix" between the Barabási-Albert and Erdős-Rényi models, in the sense that its plot has elements from both of the plots for BA and ER. Consider this in light of Fig. 1, where the graphs of these three data sets are clustered together, even when using the largest r=200 non-backtracking eigenvalues of each graph. This means that the eigendecomposition of the non-backtracking matrix is detecting the similarities and differences between these three random graph models, which we can identify through a visual analysis of Figs. 1 and 12.

Finally, consider WS-5k in Fig. 12. This plot shows none of the characteristics of the others, and in fact some of its points clearly fall on a path determined by a continuous curve. We hypothesize that this is due to the strong symmetry of the ring lattice from which it is generated. Studying the properties of the NBED of highly symmetric graphs and their randomized versions is a task we leave for future work.

Case Study: Enron emails
Discussion and conclusions
We compared the performance of NBD against ESCAPE, S, GCD, and Lap; see the "NBD: Experiments" section. We chose to compare against these methods because the first three depend on motif counts in one way or another, as NBD depends on non-backtracking cycle counts, and Lap depends on the spectral decomposition of a matrix representation of a graph, as NBD depends on the non-backtracking eigenvalues.

Appendix 1: UMAP parameter settings
- n_neighbors: A weighted k-nearest-neighbor graph is constructed from the initial data. The number of neighbors is set with this parameter. We used a value of 75.
- metric: Different metrics can be specified for the high-dimensional space in which the data are embedded. We found good overall results with the Canberra metric (shown in Fig. 1). The Chebyshev and Euclidean metrics cluster ER graphs together for some specific values of the other parameters; however, they make the HG and CM clusters overlap. A clearly separated projection for the HG and CM graphs can be found using the correlation metric.
- n_epochs: Training optimization epochs. More epochs can provide better results, with the usual computational drawbacks. We use 1000 epochs.
- min_dist: Alters the minimum distance between embedded points. Smaller values increase clustering; larger values produce a more uniformly distributed visualization. We use a distance of 0.01.
- repulsion_strength: Values above 1 increase the weight of negative samples, i.e., the importance of the distances between far-apart data points. We use a value of 10.
- negative_sample_rate: Sets the number of negative samples to be selected per each positive sample in the optimization process. We use 50 negative samples per positive sample. (All settings are collected in the sketch after this list.)
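Collected into a single call, the settings above look roughly as follows (umap-learn package); the input X is assumed to be the per-graph feature matrix being projected:

```python
# Hedged sketch of the UMAP configuration described above.
import umap

def project(X):
    """X: one row of features per graph (assumed input; cf. Fig. 1)."""
    reducer = umap.UMAP(
        n_neighbors=75,
        metric="canberra",
        n_epochs=1000,
        min_dist=0.01,
        repulsion_strength=10,
        negative_sample_rate=50,
    )
    return reducer.fit_transform(X)
```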
We suspect that it is the interplay between min_dist and repulsion_strength that causes ER graphs to split into two different clusters.

Appendix 2: Asymmetric NBED embeddings
Endnotes
In this work we use d=EMD and r=r0 based on the experimental evidence of Fig. 5. An early version of NBD used Euclidean and a value of r set to a constant (Torres et al. 2018). Following that version, other authors have independently proposed using EMD (Mellor and Grusovin 2018).