1 Introduction
-
intra-community interaction prediction;
-
inter-community interaction prediction.
2 Interaction prediction problem
3 Proposed approach
3.1 Step 1: community discovery
3.2 Step 2: features design
3.2.1 Pairwise structural features
Measure | Description |
---|---|
Common Neighbors (Newman 2001) |
\(\hbox{CN}(u,v) = |\varGamma (u) \cap \varGamma (v)|\)
|
Jaccard Coefficient (Salton and McGill 1983) |
\(\hbox{JC}(u,v) = \frac{|\varGamma (u) \cap \varGamma (v)|}{|\varGamma (u) \cup \varGamma (v)|}\)
|
Adamic Adar (Adamic and Adar 2003) |
\(\hbox{AA}(u,v) = \sum _{w \in \varGamma (u) \cap \varGamma (v)} \frac{1}{\log {|\varGamma (w)|}}\)
|
Preferential Attachment (Barabási and Albert 1999) |
\(\hbox{PA}(u,v) = |\varGamma (u)| \times |\varGamma (v)|\)
|
-
Common Neighbor (CN) assigns as likelihood score of a new link the number of neighbors shared by endpoints (Newman 2001).
-
Jaccard Coefficient (JC) measures the likelihood of two nodes to establish a new connection as the ratio among their shared neighbors and the total number of their distinct neighbors (Salton and McGill 1983).
-
Adamic Adar (AA) refines \(\hbox{CN}\) by increasing the importance of nodes which possess less connections (Adamic and Adar 2003).
-
Preferential Attachment (PA) assumes that the probability of a future link between two nodes is proportional to their degree (Barabási and Albert 1999).
3.2.2 Global topological features
Measure | Description |
---|---|
Degree Centrality |
\(\hbox{DC}(u) = |\varGamma (u)|\)
|
Page Rank (Page et al. 1999) |
\(\hbox{PR}(u) =\frac{1 - d}{N} + d\sum _{(u,v) \in E} \frac{\hbox{PR}(v)}{|\varGamma (v)|}\)
|
-
Degree Centrality (DC) relates the centrality of a node to its degree.
-
PageRank (PR) is a link analysis algorithm introduced by Page et al. (1999) and used by the Google Web search engine. It assigns a numerical score to each element of a hyperlinked set of documents with the purpose of measuring its relative importance within the set.
3.2.3 Community features
Measure | Description |
---|---|
Community Size |
\(\textit{CE}(G_C) = |E_C|\)
|
Community Edges |
\(\textit{CE}(G_C) = |E_C|\)
|
Shared Communities |
\(\textit{CS}(u, v, \mathcal {C})= |\{C | u \in V_C \wedge v \in V_C \; \forall C \in \mathcal {C}\}|\)
|
Community Density |
\(D(C) = \frac{|E_C|}{|V_C|\times (|V_C|-1)}\)
|
Transitivity |
\(T=3\frac{|triangles(G_C)|}{|triads(G_C)|}\)
|
Max Degree |
\(\textit{MD}(C)= max\{|\varGamma (u)| : u\in V_C\}\)
|
Average Degree |
\(\textit{AD}(C)= \frac{\sum _{u\in V_C} |\varGamma (u)|}{|V_C|}\)
|
-
Community Size (CS) number of nodes belonging to the community C.
-
Community Edges (CE) number of edges within nodes in C.
-
Shared Communities (SC) identifies the number of communities shared by a couple of nodes. When dealing with network partitions, SC takes value in \(\{0,1\}\), while in case of overlapping communities its domain is [0, \(|\mathcal {C}|\)].
-
Community Density (D) ratio of edges belonging to the community over the number of possible edges among all the nodes within it.
-
Transitivity (T) identifies the ratio of triangles with respect to open “triads” (two edges with a shared vertex).
-
Max Degree (MD) identifies the degree (w.r.t. the community subgraph) of the principal hub for the community.
-
Average Degree (AD) identifies the average degree (w.r.t. the community subgraph) of the nodes within the community.
3.3 Step 3: forecasting models
Measure | Description |
---|---|
Last Value (Lv) |
\(\varTheta _t = Z_{t-1}\)
|
Average (Av) |
\(\varTheta _t = \frac{\sum _{i=1}^{T} Z_i }{\tau }\)
|
Moving Average (Ma) |
\(\varTheta _t = \frac{\sum _{i=\tau -n}^{\tau } Z_i }{n}\)
|
Linear Regression (LR) |
\(\varTheta _{t+h} = \alpha _t + h\beta _t\)
|
-
Last Value (Lv) considers as forecast the last observed value of the time series.
-
Average (Av) is the average of all the observations in \(Z_t\).
-
Moving Average (Ma) predicts the next value by taking the mean of the n most recent observed values of a series \(Z_t\). In our experiments, we have ranged n in the interval \([1,\tau ]\).
-
Linear Regression (LR) fits the time series to a straight line. The level \(\alpha\) and the trend \(\beta\) parameter (used to estimate the slope of the line) were defined by minimizing the sum of squared errors between the observed values of the series and the expected ones estimated by the model.
3.4 Step 4: classifier models
-
Balanced class distribution we adopted class balancing through downsampling [as performed in previous works (Lichtenwalter et al. 2010)], thus obtaining balanced classes and a baseline model having 0.5 accuracy.
-
Unbalanced class distribution in order to provide an estimate of the real predictive power expressed by our methodology, we tested it against the unbalanced class distribution as expressed by the original data.
4 Experiments and results
4.1 Datasets
Network | Nodes | Interactions | #Snapshots |
\(\mu _{\rm CC}\)
|
\(\sigma _{\rm CC}\)
|
\(\mu _{\rm D}\)
|
\(\sigma _{\rm D}\)
|
---|---|---|---|---|---|---|---|
DBLP | 747,700 | 5,319,654 | 10 (years) | 0.665 | 0.018 | 3.113e−05 | 9.602e−06 |
Social | 1899 | 113,145 | 6 (months) | 0.105 | 0.015 | 8.600e−03 | 1.400e−03 |
4.2 Intra-community interaction prediction
Predicted
| ||
---|---|---|
Class 0 | Class 1 | |
Actual
| ||
Class 0 | TN (true neg.) | FP (false pos.) |
Class 1 | FN (false neg.) | TP (true pos.) |
4.2.1 Balanced scenario
-
Accuracy, defined as \(\hbox {ACC}=\frac{\hbox {TP}\,+\,\hbox {TN}}{\hbox {TP}\,+\,\hbox {FN}\,+\,\hbox {TN}\,+\,\hbox {FP}}\), measures the ratio of correct prediction over the total;
-
AUC identifies the area under the receiver operating characteristic (ROC). It illustrates the performances of binary classifiers relating the true-positive rate \(\hbox {TPR}=\frac{\hbox {TP}}{\hbox {TP}\,+\,\hbox {FN}}\) to the false-positive rate \(\hbox {FPR}=\frac{\hbox {FP}}{\hbox {FP}\,+\,\hbox {TN}}\) and providing a visual interpretation useful to compare different models.
Network
|
DBLP
|
Social
| ||
---|---|---|---|---|
Algorithm | AUC | ACC (%) | AUC | ACC (%) |
DEMON Ma | 0.907 | 85.58 |
0.981
| 93.55 |
DEMON LR | 0.901 | 84.35 | 0.970 | 91.87 |
Louvain Ma |
0.930
| 87.72 | 0.880 | 80.27 |
Louvain LR | 0.926 | 87.48 | 0.883 | 81.37 |
Infohiermap Ma | 0.920 | 86.69 | 0.890 | 81.34 |
Infohiermap LR | 0.917 | 86.18 | 0.886 | 80.89 |
Algorithm | AUC | ACC (%) |
---|---|---|
SF Ma | 0.901 | 82.88 |
SF LR | 0.895 | 82.18 |
FSF Ma |
0.956
| 90.10 |
FSF LR | 0.937 | 88.09 |
Algorithm | AUC | ACC (%) |
---|---|---|
DEMON Structural
| 0.957 | 90.59 |
DEMON Topology
| 0.962 | 91.44 |
DEMON Community
| 0.903 | 83.53 |
Louvain Structural
| 0.850 | 78.63 |
Louvain Topology
| 0.875 | 79.38 |
Louvain Community
| 0.724 | 66.64 |
Infohiermap Structural
| 0.876 | 79.85 |
Infohiermap Topology
| 0.887 | 80.81 |
Infohiermap Community
| 0.667 | 62.11 |
Algorithm | AUC | ACC (%) |
---|---|---|
DEMON All
|
0.981
| 93.90 |
Louvain All
| 0.901 | 83.05 |
Infohiermap All
| 0.894 | 81.91 |
FS All
| 0.959 | 90.44 |
Algorithm | Structural | Topology | Community |
---|---|---|---|
DEMON | 0.023 | 0.001 | 0.003 |
Louvain | 0.009 | 0.017 | 0.018 |
Infohiermap | 0.042 | 0.015 | 0.081 |
Algorithm | AUC | ACC (%) |
---|---|---|
DEMON | 0.987 | 95.76 |
Louvain | 0.888 | 81.16 |
Infohiermap | 0.846 | 75.95 |
4.2.2 Unbalanced scenario
Algorithm | AUC | PPV (%) |
---|---|---|
SF Ma | 0.897 | 64.06 |
SF LR | 0.893 | 62.62 |
FSF Ma | 0.918 |
74.71
|
FSF LR |
0.932
| 72.45 |
4.3 Inter-community interaction prediction
-
instead of using the original interaction network, we preprocess our data and build, for each snapshot, an induced graph using the previously extracted communities. In particular, for each snapshot graph \(G_i\) and related set of communities \(C_i\) we perform the transformation described by Algorithm 1;
-
we compute the structural and topological features on the community-node pairs of each new induced graph;
-
we apply the time series forecast and, on the forecasted feature values, we build the prediction model.
-
Among the previously analyzed datasets, DBLP is the bigger one and it is always decomposed in a higher number of communities (ensuring community-graphs of meaningful size);
-
DEMON generates overlapping communities; thus, the community-graph extraction loses some effectiveness (shared nodes generate a densely connected graph);
-
Louvain as all modularity-based approaches suffers from the scale problem: this causes very sparse star-like community-graphs composed by few focal nodes (i.e., the bigger communities) linked to many satellites (i.e., very small communities that are rarely connected by interactions).
4.3.1 Balanced scenario
Algorithm | AUC | ACC (%) |
---|---|---|
Lv | 0.580 | 56.01 |
Avg | 0.650 | 65.10 |
Ma |
0.660
| 66.00 |
LR | 0.581 | 58.10 |
Flat Graph | 0.610 | 59.12 |
Baseline | 0.500 | 50.00 |
4.3.2 Unbalanced scenario
Algorithm | AUC | PPV (%) |
---|---|---|
Lv | 0.594 | 33.33 |
Avg | 0.632 | 07.02 |
Ma |
0.647
| 50.00 |
LR | 0.596 | 50.00 |
Flat Graph | 0.316 | 57.20 |
Baseline | 0.504 | 4.01 |