Skip to main content
Erschienen in: EPJ Data Science 1/2024

Open Access 01.12.2024 | Research

The simpliciality of higher-order networks

verfasst von: Nicholas W. Landry, Jean-Gabriel Young, Nicole Eikmeier

Erschienen in: EPJ Data Science | Ausgabe 1/2024

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Higher-order networks are widely used to describe complex systems in which interactions can involve more than two entities at once. In this paper, we focus on inclusion within higher-order networks, referring to situations where specific entities participate in an interaction, and subsets of those entities also interact with each other. Traditional modeling approaches to higher-order networks tend to either not consider inclusion at all (e.g., hypergraph models) or explicitly assume perfect and complete inclusion (e.g., simplicial complex models). To allow for a more nuanced assessment of inclusion in higher-order networks, we introduce the concept of “simpliciality” and several corresponding measures. Contrary to current modeling practice, we show that empirically observed systems rarely lie at either end of the simpliciality spectrum. In addition, we show that generative models fitted to these datasets struggle to capture their inclusion structure. These findings suggest new modeling directions for the field of higher-order network science.
Hinweise

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

A wide range of complex systems are shaped by interactions involving several entities at once: social networks are driven by group behavior [1], emails often have multiple recipients [24], molecular pathways in cells involve multi-protein interactions [5], and scientific articles involve groups of co-authors [6]. Higher-order networks are a natural extension to networks explicitly designed to model such multi-way relationships [7].
Two mathematical representations are most commonly used to model higher-order networks: hypergraphs and simplicial complexes [8]. A hypergraph is a collection of entities (nodes) connected by interactions (hyperedges) between any number of these entities. A simplicial complex can be considered a hypergraph with an additional requirement known as downward closure, which states that when an interaction exists between m entities, every possible sub-interaction also exists. This mathematical construction originates in algebraic topology and is motivated by theoretical applications; for example, forming operators such as boundary matrices to identify cycles in a dataset or the Hodge Laplacian to describe dynamical processes in higher-order networks [9, 10].
Recent work has grappled with the problem and consequences of choosing the proper representation—simplicial complex, hypergraph, or other—for a given complex system. Ref. [11], for instance, shows that synchronization can differ drastically in systems modeled with simplicial complexes and hypergraphs due to synchrony driven by the included edges of simplicial complexes, and three recent studies investigate the impact of inclusions on contagion [1214]. Additionally, Ref. [8] discusses how each representation corresponds to different modeling assumptions and, thus, different analysis pipelines.
Missing from these studies are analyses of the suitability of higher-order representations for describing empirically observed interactions. When a set of interactions is given to a data scientist or modeler, the choice of representation is essentially empirical. Do the data satisfy downward closure? (In which case, a simplicial complex may best represent it.) Or do the data violate downward closure? (In which case, it should be modeled as a hypergraph.)
In this paper, we introduce the concept of simpliciality to describe the extent to which a set of interactions satisfies the downward closure requirement. We implement this concept with three measures of the overall simpliciality of a dataset and describe how to define local versions of these global metrics. Using these measures, we investigate the simpliciality of empirical datasets and show that commonly analyzed higher-order interaction datasets populate the full spectrum of simpliciality. We find that there may be large variations in the simpliciality depending on the chosen empirical dataset and the measure of simpliciality. Additionally, we show that the level of simpliciality displayed by existing models is typically not captured by existing generative models for higher-order networks. Hence, this paper identifies an essential gap in the current set of higher-order measures and models.
These new simpliciality measures complement other higher-order structural measures such as community structure [1517], centrality [1820], clustering [21, 22], assortativity [23, 24], and degree heterogeneity [25]. While these measures are helpful in understanding how higher-order data is organized, they do not address how hypergraphs relate to simplicial complexes. Closest to our work is the recent Ref. [13], in which the authors describe the encapsulation graph, a structural description of the inclusion patterns of any given dataset, as well as a dynamical process based on inclusion that spreads from included hyperedges to larger containing hyperedges. Also relevant is Ref. [26], which defines a global metric of inclusions. There is a large body of literature surrounding the concept of nestedness [27], which measures the inclusion structure of unipartite and bipartite networks, particularly in ecological contexts [28]. Measures of nestedness, however, do not use simplicial complexes as a reference point against which to compare. In contrast, our approach describes downward inclusions succinctly with simple global measures, offering a new perspective on how higher-order data is organized.

1.1 Mathematical definitions

We encode interactions as hypergraphs, defined as a pair \(\mathbf{H}= (V, E)\) where V is a set of \(N=|V|\) entities known as nodes, and where E is a set of subsets of V encoding relationship between nodes. We refer to a set \(e\in E\) as a “hyperedge” or just “edge,” and define the size of an edge as \(|e|\). In general, E could be a multiset of multisets, but in this study, we solely consider simple hypergraphs, where each edge is only present once (no multi-hyperedges), and each edge can only contain unique entities (no self-relations).
Our analysis will focus on the prevalence of inclusions in interaction data, so we describe this relationship formally. We say that an edge f is included in e, or \(f \subset e\), if every node of f is also a node of e. Drawing from the nomenclature of algebraic topology, we also say that f is a subface of e [29].
Inclusions naturally lead to the concept of a maximal hyperedge, i.e., a hyperedge that is not included in any other hyperedge. We denote the set of all the maximal hyperedges of H as \(\widetilde{E}(\mathbf{H})\).
A simplex s is then a collection of hyperedges that contains a single maximal hyperedge and satisfies \(s \equiv \mathcal{P}(\widetilde{e})\), where \(\mathcal{P}(X)\) is the power set of set X. In other words, a simplex is a maximal edge with an associated collection of subfaces in which every possible subface of the maximal edge exists. A collection of simplices is a simplicial complex \(\mathbf{S}=(V, E)\), and we say that a hypergraph where every maximal edge is a simplex satisfies downward closure.
Finally, we will use the notion of an induced simplicial complex, which is the simplicial complex constructed from the maximal edges of a hypergraph by adding all hyperedges needed to satisfy downward closure.

2 Measuring simpliciality

This paper introduces the concept of simpliciality. Simpliciality, broadly defined, measures the inclusion structure of a hypergraph and how similar a higher-order dataset is to the structure of a simplicial complex; see Fig. 1A. There are many ways to measure this, which we outline in Sect. 2.1. Before we get there, however, we must first introduce relevant terminology.

2.1 Measures

This section introduces measures of simpliciality. We follow a few guiding principles to design these measures. One, a simplicial complex must be maximally simplicial with respect to any measure of simpliciality. Two, to facilitate easier comparison between datasets, measures of simpliciality should be normalized so that they map a hypergraph to a value between 0 and 1. Three, if a subface is added to a hypergraph, the simpliciality must increase. Four, as a dataset becomes qualitatively more like a simplicial complex, the simpliciality should increase. And five, we stipulate that the simpliciality of an empty hypergraph is undefined.
There are many ways to define a measure of simpliciality while maintaining these guiding principles. To highlight different structural elements contributing to the inclusion structure, we define three measures: the simplicial fraction, the edit simpliciality, and the mean face simpliciality. These measures are all illustrated in Fig. 1 B-D.
Simplicial fraction
In a simplicial complex, every subface is itself a simplex, so when a hypergraph is a simplicial complex, it contains all subsets of each of its hyperedges. The simplicial fraction (SF) measures the degree to which this is true, defined as the fraction of hyperedges which are also simplices.
Formally, we let \(\mathbf{H}= (V, E)\) be a hypergraph and let \(S = \{ e\in E \ | \ \mathcal{P}(e)\subseteq E\}\) be the set of hyperedges which are also simplices. Then, the simplicial fraction is defined as
$$ \sigma _{\mathrm{SF}}= \frac{|S|}{|E|} $$
(1)
and it takes values in the range \(\sigma _{\mathrm{SF}}\in [0,1]\); see Fig. 1B.
The simplicial fraction directly measures the number of simplices in the dataset and is, therefore, highly interpretable. However, one potential downside is that edges which almost achieve downward closure do not count at all toward the overall simpliciality. Furthermore, this definition weighs smaller simplices heavily, as small simplices contribute to the simpliciality of all hyperedges that include them.
Edit simpliciality
The edit simpliciality (ES) is defined as the minimal number (or fraction, in the normalized case) of additional edges needed to make a hypergraph a simplicial complex.
Our formal definition uses the notion of an induced simplicial complex defined in Sect. 1.1. Given a hypergraph \(\mathbf{H}= (V, E)\) for which we want to measure the ES, we find its maximal edges and construct the simplicial complex \(\mathbf{S}= (V, C)\) induced on H, with \(C=\bigcup_{e \in \widetilde{E}} \mathcal{P}(e)\). The edit simpliciality is then
$$ \sigma _{\mathrm{ES}}= \frac{|E|}{|C|}, $$
(2)
again satisfying \(\sigma _{\mathrm{ES}}\in [0,1]\); see Fig. 1C. (We note that one can use the induced simplicial complex to define variants of the ES, e.g., a simplicial edit distance \(d_{\mathrm{ES}}= |C| - |E|\) or a normalized distance \(d_{\mathrm{NES}}= (|C| - |E|)/|C| = 1 - |E|/|C| = 1 - \sigma _{ \mathrm{ES}}\).)
The ES answers a slightly different question than the SF does—it counts missing hyperedges that would make the dataset into a simplicial complex, rather than the edges that already satisfy downward closure. It thus offers a complementary, equally interpretable measure of simpliciality. However, the ES has the disadvantage of being sensitive to outliers, as a handful of large hyperedges with few inclusions will rapidly drive \(\sigma _{\mathrm{ES}}\) towards 0. Indeed, a hyperedge of size m without any inclusion contributes one edge to \(|E|\) but \(2^{m}\) edges to \(|C|\) in the denominator of Eq. (2).
Face edit simpliciality
Finally, building upon the idea of edit simpliciality, we define a more localized notion of simpliciality, using the number of subfaces that must be added to the hypergraph to make a particular face a simplex.
Given a hyperedge e, the number of edges one must add to the hypergraph to make e a simplex is
$$ d_{\mathrm{FES}}(e) = |\mathcal{P}(e)| - |c|, $$
where \(c = \{ f \in E \mid f \subseteq e\}\). We can think of this quantity as an edit distance, or face edit distance. We use this quantity to define an average
$$ \bar{d}_{\mathrm{FES}} = \frac{1}{ \vert F \vert }\sum _{e\in F} d_{ \mathrm{FES}}(e), $$
where F is a set of edges—most commonly, \(F = \widetilde{E}\) or E. We exclusively use \(F = \widetilde{E}\) in this study. These quantities are on the scale of counts, and to define quantities analogous to previous simpliciality measures, we thus introduce a per-face normalization, either on a distance scale (meaning that the quantity grows as the dataset becomes less simplicial):
$$ \bar{d}_{\mathrm{NFES}} = \frac{1}{ \vert F \vert }\sum _{e\in F} \frac{d_{\mathrm{FES}}(e)}{|\mathcal{P}(e)|}, $$
or, similarly to previous definitions, on a simpliciality scale:
$$ \sigma _{\mathrm{FES}}= \frac{1}{ \vert F \vert }\sum _{e\in F} \biggl(1 - \frac{d_{\mathrm{FES}}(e)}{\mathcal{P}(e)} \biggr). $$
(3)
We call this last measure the face edit simpliciality (FES).
The FES normalizes the face edit distance as a fraction of its maximal simpliciality. This normalization removes the dominance of large edges in the calculation of \(\sigma _{\mathrm{ES}}\) and, in fact, exponentially down-weights the contribution of these edges. In addition, because this metric is computed on faces, this is an averaged local metric.

2.2 Important considerations when measuring simpliciality

Before we turn to applications in Sect. 3, let us discuss three design choices that may impact the conclusion we reach about the simpliciality of a dataset.
First, we note that the formal definition of a simplicial complex can be unnecessarily strict when used to represent perfect inclusion structures. By definition, a simplex always contains singletons (edges comprising a single node) and the empty set. Several datasets will not include such interactions by construction. One example is proximity datasets, where edges encode proximity events in which two or more nodes become in close contact during the observation period. Because of their spatial nature, these datasets are often very dense and contain many inclusions [7]. Yet, according to the standard definition, these will never be simplicial complexes due to the absence of singletons. Another example is email datasets, which also do not contain singletons unless one includes emails that individuals send to themselves. Because we define our notion of inclusion in terms of simplicial complexes, our measures will label these datasets as having no inclusion structure whatsoever.
To circumvent this issue, we use a relaxed definition of downward closure that excludes singletons wherever it makes sense. The relaxation uses the notion of a size-restricted power set \(\mathcal{P}_{K}(X)\), where K is a set of integers, defined as
$$ \mathcal{P}_{K}(X) = \bigl\{ x \in \mathcal{P}(X) \ | \ |x| \in K \bigr\} . $$
(4)
For example, given an edge e of size n, \(\mathcal{P}_{\{2,\dots ,n-1\}}(e)\) is the set of \(2^{|e|} - |e| - 2\) subfaces of e excluding the empty set, all singletons (sets of size one), and the edge e itself. Relaxed measures of simpliciality follow by substituting \(\mathcal{P}(X)\) for \(\mathcal{P}_{K}(X)\) in all the measures of Sect. 2.1. Hence, for example, we obtain a relaxation of \(\sigma _{\mathrm{SF}}\) by replacing the definition of S, the set of the hyperedges of H that are also simplices, by \(S = \{e\subseteq E\mid \mathcal{P}_{K}(e)\subseteq E\} \)), where \(K= \{2,\ldots,|e| \}\).
The results shown in Sect. 3 are all calculated using size restrictions to exclude singletons and the empty set. However, we note that this technique can be used more generally to exclude any interaction sizes deemed unimportant, anomalous, or problematic [30]; or, conversely, to be more strict and to include singletons (say, when analyzing academic co-authorship networks, where single-author papers can meaningfully impact the inclusion structure of the dataset).
Second, we observe that special hyperedges we call “minimal faces” may significantly skew the simpliciality of a dataset. The minimal faces of a hypergraph H are the edges of the minimal size, i.e., \(|e| = \min (K)\), where K is the set of sizes in the size-restricted powerset (In a traditional simplicial complex, the minimal faces are singletons). With the size restrictions in place, the minimal faces of a hypergraph are always simplices because, by definition, there are no smaller edges for these edges to include. We argue that when measuring the simpliciality of a dataset, it is most meaningful to focus on the faces for which inclusion is possible, and so we exclude these minimal faces when counting potential simplices.
Note that this design choice operates differently from the size restriction imposed by the modified power set introduced in Eq. (4); in that context, we argued for ignoring edges that can prevent other edges from being simplices, while here we suggest that counting minimal faces as potential simplices will strongly affect the value of simpliciality. Our strategy is as follows. For SF, this means that both S and E exclude the minimal-sized edges. For ES, we exclude maximal faces that are also minimal faces when constructing the minimal simplicial complex. And for FES, we only average over maximal edges that are not minimal faces.
Third and finally, since the number of potential subfaces of a hyperedge grows exponentially with its size, computational issues prevent us from applying our measures to large hyperedges. For this reason, we select a maximum face size k (we use \(k=11\) throughout), again using the size restriction to define our metrics. This drops information about large hyperedges but speeds up computation drastically in practical applications.

2.3 Local simpliciality

Simpliciality, up to this point a global metric, can also be localized on a smaller subset of the higher-order network to yield information about its local structure. The various face-centric measures used in our construction of the FES provide this information at the level of faces. But for more flexibility, we also use subhypergraphs to define nodal simpliciality measures of our global measures. More specifically, given a hypergraph \(\mathbf{H}= (V,E)\) and a node \(v\in V\), we define the neighborhood of v as \(n(v) = \{u \in V \mid u, v \in e\in E\}\) and the associated subsets \(\widehat{V}=v\,\cup \, n(v)\) and \(\widehat{E} = \{e \in E \mid e\subseteq \widehat{V}\}\). Then the simpliciality of node v is the simpliciality defined on the subhypergraph \(\widehat{\mathbf{H}} =(\widehat{V}, \widehat{E})\) induced on the neighborhood of v. Note that when v is an isolated node or when Ê only contains minimal faces and we do not consider these potential simplices, the nodal simpliciality will be undefined.

3 Results

3.1 Empirical datasets

As the first demonstration of the simpliciality measures, we analyze empirical higher-order datasets from several general domains. All datasets are obtained from the xgi-data repository [31] and are openly available. Following the considerations highlighted in Sect. 2.2, we preprocess these datasets to remove singletons, multiedges, and isolated nodes. In addition, for computational feasibility, we only consider hyperedges of size 11 (order, defined as the size minus one, of 10) and smaller. Basic structural properties of the pre-processed datasets are shown in Table 1.
Table 1
Properties of empirical datasets and their simpliciality. \(|V|\), \(|E|\), \(\langle k\rangle \), \(\langle s\rangle \), \(\sigma _{\mathrm{SF}}\), \(\sigma _{\mathrm{ES}}\), and \(\sigma _{\mathrm{FES}}\) denote the number of nodes, the number of hyperedges, the mean degree, the mean edge size, the simplicial fraction (SF), edit simpliciality (ES), and the face edit simpliciality (FES), respectively
Dataset
|V|
|E|
k
s
\(\sigma _{\mathrm{SF}}\)
\(\sigma _{\mathrm{ES}}\)
\(\sigma _{\mathrm{FES}}\)
Proximity datasets
contact-primary-school
242
12,704
52.50
2.42
0.85
0.92
0.94
contact-high-school
327
7,818
23.91
2.33
0.81
0.93
0.92
hospital-lyon
75
1,824
24.32
2.43
0.91
0.95
0.97
Email datasets
email-enron
143
1,442
10.08
2.97
0.31
0.05
0.50
email-eu
967
23,729
24.54
3.12
0.32
0.05
0.52
Biological datasets
diseasome
516
314
0.61
3.00
0.00
0.05
0.04
disgenenet
1,982
760
0.38
5.14
0.00
0.00
0.01
ndc-substances
2,740
4,754
1.74
5.16
0.02
0.01
0.07
Other
congress-bills
1,715
58,788
34.28
4.95
0.03
0.01
0.10
tags-ask-ubuntu
3,021
145,053
48.01
3.43
0.15
0.25
0.46
Our sample of datasets contains various types of complex systems. We analyze three proximity datasets [1, 2, 3133] (contact-primary-school, contact-high-school, and hospital-lyon), which are collected via proximity sensors with a range of roughly 1 meter. Nodes are individuals, and an edge is created from a proximity event, where two individuals are closer than 1 meter apart. To create a higher-order dataset, each maximal clique is converted into a hyperedge at each time step. Unique to proximity datasets are their geometrical constraints, and because of the proximity sensor range, 5-hyperedges are the largest edges present in these datasets. We also include two datasets of email interactions [24, 34]: email-enron and email-eu. In both cases, the nodes are email addresses, and the hyperedges are emails, at the defunct company Enron in the former case and a large European research institution in the latter. Three datasets are loosely associated with biological processes [2, 35, 36]: diseasome, disgenenet, and ndc-substances. In these datasets, nodes are compounds, diseases, or genes, while hyperedges are interactions amongst these to represent pharmaceuticals, symptoms, and diseases. Finally, we include two miscellaneous datasets as well: tags-ask-ubuntu [2] higher-order dataset in which a node is a tag on Stack Overflow, and an edge is a question to which the tags are associated; and the congress-bills dataset [2, 37, 38] where nodes are congresspeople and edges represent the sponsoring and co-sponsoring congresspeople for a particular bill.
Numerical values of the simpliciality measures are shown in Table 1 for all of these datasets. We find that values for simpliciality fill the spectrum from 0 to 1, depending on the data. The proximity datasets have large simpliciality for all three measures, while the biological datasets have low simpliciality for all three measures. The email datasets have a very small ES simpliciality, with moderate simpliciality for the other two measures. (And since we use size restrictions to exclude singletons, the lack or absence of emails sent to oneself does not affect this assessment.) Similarly, the tags-ask-ubuntu dataset has a range of simpliciality values depending on which measure we consider. This shows that the measures we have defined in Sect. 2.1 capture different features of the inclusion structure.
While the measures give differing perspectives on the simpliciality of each dataset, we verify that they broadly agree with a correlation analysis. The Pearson correlation coefficient is \(\rho =0.97\) between the SF and ES, \(\rho =0.95\) between the SF and FES, and \(\rho =0.90\) between the ES and FES (all significant at the \(p=0.001\) level). Hence, the values are closely and linearly related in our sample. They also order datasets similarly, from the least to most simplicial, since the Spearman rank-order correlation coefficient is \(\rho =0.89\) between SF and ES, \(\rho =0.997\) between SF and FES, and \(\rho =0.90\) between ES and FES (all significant at the same level).
Although our correlation analysis confirms that these measures roughly capture the same concepts, the datasets where they depart from one another highlight their key differences. In our experiments, these differences are due to features such as large edges with few included edges, many edges that are mostly closed downward, and different edge size distributions. Networks of organizational email messages are examples of the first case, where very large organization- or department-wide emails may be sent with no guarantee that emails are also sent between every possible subgroup of individuals. In this case, we would expect ES to be extremely low while the SF need not be low. Proximity networks are examples of the second case, i.e., dense downward-closed datasets. We see this by noting that the SF is not 1, while both ES and FES are close to 1 due to the SF penalizing almost-simplicial edges. Lastly, the edge size distribution has strong implications on all measures; for the same average edge size and number of edges, increasing both the number of small and large edges will affect the SF and ES measures. For ES, the large edges will exponentially increase the number of subfaces needed to create a simplicial complex, driving the simpliciality to zero. In contrast, increasing the number of small edges can create more small simplices, increasing SF.

3.2 Generative models of higher-order networks

To complement our analysis of empirical data, we also examine the simpliciality of synthetic data generated with generative models for higher-order networks.
We focus on models of hypergraphs designed to describe and analyze arbitrary higher-order structures. There are several random hypergraph models, including, among many classes of models, preferential attachment models [3942], models with community structure [15, 4346], models with specified degree and size sequences [23, 43], Erdös-Rényi models [47, 48], models with latent nodal variables governing edge formation [49], and geometric models [42, 50, 51]. Higher-order random models that are commonly fit to empirical datasets include the configuration model [23], the bipartite Chung-Lu model [23, 43], and the bipartite degree-corrected stochastic block model [52]. See Ref. [7] for an extensive overview. Overwhelmingly, generative hypergraph models lack explicit control over the inclusion structure of hyperedges, so there are often relatively few simplices.
We focus our analysis on three models: the configuration model [23], the bipartite Chung-Lu model [43], and the bipartite degree-corrected stochastic block model (biSBM) [52].
We fit each model to the empirical datasets of Table 1, use the fitted models to generate a distribution of higher-order networks (the posterior predictive distribution in Bayesian parlance), and analyze the resulting distribution of simpliciality values.
In all cases, when sampling synthetic higher-order networks from the three generative models, we generate 103 realizations of each model for each empirical dataset. We use the double edge-swap algorithm presented in Ref. [23] to sample from the configuration model and performed \(10 \times |E|\) edge swaps, roughly in accordance with [53]. For the bipartite Chung-Lu model [43], we extract the degree and edge size sequences and then use a bipartite variation of the algorithm introduced in Ref. [54] and available in XGI [55] to sample from this model. Lastly, when sampling from the biSBM, we used a Markov chain Monte Carlo method with a bipartite prior using the algorithm described in Ref. [52].
All results are reported in Fig. 2. Overwhelmingly, we see that the generative models cannot accurately capture the simpliciality of datasets when they have a non-trivial inclusion structure. While it does not reproduce the correct values, the hypergraph configuration model consistently captures the inclusion structure better than the biSBM and the bipartite Chung-Lu model, irrespective of the simpliciality measure used. This may be due to the exact specification of the degree and edge size sequences; the Chung-Lu model and biSBM only match these sequences in expectation.

3.3 Local measures of simpliciality

As a final demonstration, we apply our local measures of simpliciality to the dataset of emails sent by Enron employees (142 nodes and 1126 hyperedges, filtered to include interactions of sizes 2 and 3). Results are shown in Fig. 3.
Focusing on the histograms first, we find that the SF has the most variability and that the FES covers a similarly large range. In contrast, the ES tells us that nearly every neighborhood is strongly simplicial. This is expected behavior because the ES relies on a simplicial complex induced on the ego-hypergraph; the size of the largest hyperedges in this ego-hypergraph can be substantially smaller than that of the largest hyperedges in the whole hypergraph. As a result, the denominator of Eq. (2) is reduced, increasing the local ES systematically. In contrast, when we take a subset of nodes to form an ego-hypergraph, it is easy to omit a small subface shared by many hyperedges, thus leading to a very small SF (and, similarly, to a small FES).
Turning to the spatial distribution of simpliciality shown in the insets, we see that the SF and FES find a region of high simpliciality at the network’s core with regions of low simpliciality on its edges. In fact, these two measures are largely in agreement, with a Pearson correlation coefficient of \(\rho =0.84\) between the local SF and FES. (The correlation drops to \(\rho =0.69\) when comparing the SF and ES). We also observe several nodes for which simpliciality is undefined. These nodes are only connected via minimal faces in their ego-hypergraphs, and these faces are excluded when calculating both potential and actual simplices.
Finally, inspecting Fig. 3, we notice that, in this case, nodes of similar simpliciality tend to be connected to one another. To quantify this observation, we define the simplicial assortativity as the Pearson correlation coefficient of the simpliciality of pairs of nodes connected by at least one hyperedge. More formally, we use the unweighted adjacency matrix of the hypergraph, A, defined as
$$ A_{ij} = \textstyle\begin{cases} 1, & [\mathbf{BB}^{\mathrm{T}} ]_{ij} > 0\text{ and } i\neq j, \\ 0,& \text{ otherwise,} \end{cases} $$
where B is the incidence matrix of the hypergraph, such that \(B_{ij}=1\) if node edge j is incident on node i. The simplicial assortativity, ρ, can then be defined as
$$ \rho = \sum_{i,j} \frac{A_{ij}(\sigma _{i} -E[\sigma ])(\sigma _{i} -E[\sigma ])}{\text{Var}[\sigma ]}, $$
(5)
where \(\sigma _{i}\) is the local simpliciality of node i according to one of our measures. The simplicial assortativity for SF, ES, and FES are denoted \(\rho _{\mathrm{SF}}\), \(\rho _{\mathrm{ES}}\), and \(\rho _{\mathrm{FES}}\) respectively. This coefficient is equivalent to the assortativity coefficient [56] of the local simpliciality on the unweighted pairwise projection of the hypergraph.
One should expect local simpliciality to be assortative as any given subface contributes to the simpliciality of all their nodes. Table 2 shows that the situation is a bit more nuanced. For tags-ask-ubuntu, FES is weakly assortative, whereas the other two measures are weakly disassortative. It is particularly striking that despite the hospital-lyon dataset being highly simplicial (as seen in Fig. 2), it is also weakly disassortative.
Table 2
The simplicial assortativity of each dataset filtered to only include interactions of sizes two and three for computational tractability
Dataset
\(\rho _{\mathrm{SF}}\)
\(\rho _{\mathrm{ES}}\)
\(\rho _{\mathrm{FES}}\)
Proximity datasets
contact-primary-school
0.15
0.15
0.14
contact-high-school
0.22
0.34
0.24
hospital-lyon
−0.02
−0.02
−0.01
Email datasets
email-enron
0.29
0.29
0.24
email-eu
0.19
0.16
0.16
Biological datasets
ndc-substances
0.57
0.65
0.72
diseasome
N/A
0.46
0.75
disgenenet
N/A
0.55
0.89
Other
congress-bills
0.78
0.48
0.75
tags-ask-ubuntu
−0.03
−0.08
0.04

4 Conclusion

In this paper, we have introduced measures to summarize the inclusion structure—the simpliciality—of hypergraphs. We have presented three measures of simpliciality but recognize that other definitions of simpliciality may also prove useful. We have discussed how the simpliciality of higher-order datasets depends on many factors, including, but not limited to, the manner in which the dataset was collected, its domain, and the measure of simpliciality. When fitting common generative models to several empirical higher-order networks, we found that the simpliciality of the original dataset is often much higher than the simpliciality of the posterior predictive distribution of fitted models by any measure of simpliciality. Measuring the simplicial assortativity indicates that the simpliciality displays different levels of localization.
We hope this study will serve as a starting point for network scientists aiming to characterize higher-order network datasets and look forward to future work developing these methods along a number of dimensions of interest.
First, we presented global- and node-level definitions of simpliciality, but other scales of interaction may yield further insights into the inclusion structure of the data [27]. Future work could explore mesoscale measures of simpliciality that describe how, for example, simpliciality varies between communities. One could also obtain the largest simplicial component or the set of simplicial components in a hypergraph. In addition, we have restricted ourselves to unweighted simplicial complexes, but one might consider extending these notions to weighted simplicial complexes [57].
Second, our approach complements the existing literature on nestedness in bipartite networks [27], which shows that nestedness exists for a wide variety of unipartite and bipartite networks [58]. Existing work shows that nestedness is important for the function of networks in many domains [28, 5961], and comparing these findings from the perspective of simpliciality could offer additional insights from both a structural and mechanistic perspective.
Finally, we have shown a disparity between the simpliciality of artificial datasets and observed ones. Our findings should thus inform new higher-order network models that specify the inclusion structure of the network and can be fit to empirical higher-order datasets.

Acknowledgements

N.W.L. would like to acknowledge the participants of the “Workshop on Modelling and Mining Complex Networks as Hypergraphs” at Toronto Metropolitan University and Tim LaRock for helpful conversations. N.W.L. would also like to thank Tzu-Chi Yen for lending his expertise on the biSBM inference.

Declarations

Competing interests

The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://​creativecommons.​org/​licenses/​by/​4.​0/​.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Anhänge

Appendix A: Convergence of the configuration model MCMC

To sample from the configuration model, we implemented the Markov chain Monte Carlo algorithm proposed in Ref. [23]. At each step, (1) two edges are selected at random, \(e_{1}\) and \(e_{2}\); (2) two nodes are selected at random, \(i \in e_{1}\) and \(j \in e_{2}\); and (3) the memberships of these nodes are swapped to form two new edges, \(\widetilde{e}_{1} = (e_{1} \setminus i) \cup j\) and \(\widetilde{e}_{2} = (e_{2} \setminus j) \cup i\). This operation is accepted if the move does not create loopy hyperedges, i.e., \(|e_{1}| = |\widetilde{e}_{1}|\) and \(|e_{2}| = |\widetilde{e}_{2}|\). In Ref. [23], the criterion for convergence is left for future work, and Ref. [53] presents criteria for the convergence of the configuration model for pairwise networks. In Fig. 4, we show that the number of double-edge swaps chosen for our configuration model algorithm (\(10\times |E|\)) is sufficient for convergence with respect to all measures of simpliciality.
From Fig. 4, we see that the configuration model sampler has roughly achieved the stationary value of simpliciality after roughly 10% of the edge swaps have been completed. Assessing convergence in a statistically robust manner is necessary to ensure uniform sampling from the hypergraph configuration model, but this heuristic is sufficient for our purposes.

Appendix B: Measuring simpliciality efficiently

For all measures, we leverage the trie data structure [62] to efficiently compute the measures described in this paper. The trie structure allows us to efficiently verify whether a subface exists (\(\mathcal{O}(|e|)\), for an edge e). To be compatible with the trie data structure, when adding an edge to the trie and when searching for an edge in the trie, we first sort the nodes in that edge. Below, we present the algorithms employed in generating all results.
Simplicial fraction
For each edge, if a single subface is absent, we can immediately determine that the edge is not a simplex. This can be very efficient for sparse hypergraphs.
Edit simpliciality
There are two ways to compute the edit simpliciality: the exhaustive method and a more memory-efficient version. The exhaustive method is simpler and more computationally efficient. However, the memory requirements are enormous because it stores every missing subface. The memory-efficient version keeps track of the number of missing subfaces, not the missing subfaces themselves. This leverages the fact that the number of hyperedges intersecting with any given hyperedge is typically much smaller than the number of hyperedges. We store the hypergraph of maximal edges for fast retrieval of the neighbors of a given maximal face, which is needed in Algorithm 3.
Face edit simpliciality
This computation is more straightforward than the ES computation because it is a local measure that requires relatively little memory and allows us to compute the number of missing subfaces and then store the FES as a running average.
Literatur
19.
Zurück zum Zitat Feng S, Heath E, Jefferson B, Joslyn C, Kvinge H, Mitchell HD, Praggastis B, Eisfeld AJ, Sims AC, Thackray LB, Fan S, Walters KB, Halfmann PJ, Westhoff-Smith D, Tan Q, Menachery VD, Sheahan TP, Cockrell AS, Kocher JF, Stratton KG, Heller NC, Bramer LM, Diamond MS, Baric RS, Waters KM, Kawaoka Y, McDermott JE, Purvine E (2021) Hypergraph models of biological networks to identify genes critical to pathogenic viral response. BMC Bioinform 22(1):287. https://doi.org/10.1186/s12859-021-04197-2CrossRef Feng S, Heath E, Jefferson B, Joslyn C, Kvinge H, Mitchell HD, Praggastis B, Eisfeld AJ, Sims AC, Thackray LB, Fan S, Walters KB, Halfmann PJ, Westhoff-Smith D, Tan Q, Menachery VD, Sheahan TP, Cockrell AS, Kocher JF, Stratton KG, Heller NC, Bramer LM, Diamond MS, Baric RS, Waters KM, Kawaoka Y, McDermott JE, Purvine E (2021) Hypergraph models of biological networks to identify genes critical to pathogenic viral response. BMC Bioinform 22(1):287. https://​doi.​org/​10.​1186/​s12859-021-04197-2CrossRef
21.
Zurück zum Zitat Gallagher SR, Goldberg DS (2013) Clustering coefficients in protein interaction hypernetworks. In: Proceedings of the international conference on bioinformatics, computational biology and biomedical informatics. BCB’13. Association for Computing Machinery, New York, pp 552–560. https://doi.org/10.1145/2506583.2506635CrossRef Gallagher SR, Goldberg DS (2013) Clustering coefficients in protein interaction hypernetworks. In: Proceedings of the international conference on bioinformatics, computational biology and biomedical informatics. BCB’13. Association for Computing Machinery, New York, pp 552–560. https://​doi.​org/​10.​1145/​2506583.​2506635CrossRef
26.
Zurück zum Zitat Joslyn CA, Aksoy SG, Callahan TJ, Hunter LE, Jefferson B, Praggastis B, Purvine E, Tripodi IJ (2021) Hypernetwork science: from multidimensional networks to computational topology. In: Braha D, de Aguiar MAM, Gershenson C, Morales AJ, Kaufman L, Naumova EN, Minai AA, Bar-Yam Y (eds) Unifying themes in complex systems X. Springer proceedings in complexity. Springer, Cham, pp 377–392. https://doi.org/10.1007/978-3-030-67318-5_25CrossRef Joslyn CA, Aksoy SG, Callahan TJ, Hunter LE, Jefferson B, Praggastis B, Purvine E, Tripodi IJ (2021) Hypernetwork science: from multidimensional networks to computational topology. In: Braha D, de Aguiar MAM, Gershenson C, Morales AJ, Kaufman L, Naumova EN, Minai AA, Bar-Yam Y (eds) Unifying themes in complex systems X. Springer proceedings in complexity. Springer, Cham, pp 377–392. https://​doi.​org/​10.​1007/​978-3-030-67318-5_​25CrossRef
29.
Zurück zum Zitat Hatcher A (2001) Algebraic topology, 1st edn. Cambridge University Press, Cambridge Hatcher A (2001) Algebraic topology, 1st edn. Cambridge University Press, Cambridge
34.
Zurück zum Zitat Yin H, Benson AR, Leskovec J, Gleich DF (2017) Local higher-order graph clustering. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’17. Association for Computing Machinery, New York, pp 555–564. https://doi.org/10.1145/3097983.3098069CrossRef Yin H, Benson AR, Leskovec J, Gleich DF (2017) Local higher-order graph clustering. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’17. Association for Computing Machinery, New York, pp 555–564. https://​doi.​org/​10.​1145/​3097983.​3098069CrossRef
41.
Zurück zum Zitat Do MT, Yoon S-E, Hooi B (2020) Structural patterns and generative models of real-world hypergraphs. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. KDD ’20. Association for Computing Machinery, New York, pp 176–186. https://doi.org/10.1145/3394486.3403060CrossRef Do MT, Yoon S-E, Hooi B (2020) Structural patterns and generative models of real-world hypergraphs. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. KDD ’20. Association for Computing Machinery, New York, pp 176–186. https://​doi.​org/​10.​1145/​3394486.​3403060CrossRef
45.
Zurück zum Zitat Kim C, Bandeira AS, Goemans MX (2018) Stochastic block model for hypergraphs: statistical limits and a semidefinite programming approach. arXiv:1807.02884 Kim C, Bandeira AS, Goemans MX (2018) Stochastic block model for hypergraphs: statistical limits and a semidefinite programming approach. arXiv:1807.​02884
Metadaten
Titel
The simpliciality of higher-order networks
verfasst von
Nicholas W. Landry
Jean-Gabriel Young
Nicole Eikmeier
Publikationsdatum
01.12.2024
Verlag
Springer Berlin Heidelberg
Erschienen in
EPJ Data Science / Ausgabe 1/2024
Elektronische ISSN: 2193-1127
DOI
https://doi.org/10.1140/epjds/s13688-024-00458-1

Weitere Artikel der Ausgabe 1/2024

EPJ Data Science 1/2024 Zur Ausgabe