Introduction
Related Work
Methodology
The Threat Model
Machine Learning Attack
Node Signatures
Random Forest Classification
Metrics
Datasets
Real Network Datasets
-
polblogs
(Adamic and Glance 2005) is an interaction network between political blogs during the lead up to the 2004 US presidential election. This dataset includes ground-truth labels identifying each blog as either conservative or liberal. -
fb-dartmouth
,fb-michigan
, andfb-caltech
(Traud et al. 2012) are Facebook social networks extant at three US universities in 2005. A number of node attributes such as dorm, gender, graduation year, and academic major are available. We chose two such attributes that could be represented as binary attributes: gender and occupation, whereby occupation we could identify the attribute values “student” and “faculty”. From each dataset, we obtained two networks with the same topology but different node attribute distributions. -
pokec-1
(Takac and Zabovsky 2012) is a sample of an online social network in Slovakia. While the Facebook samples are university networks, Pokec is a general social platform whose membership comprises 30% of the Slovakian population.pokec-1
is a one-fortieth sample. This dataset has gender information available as a node attribute. -
amazon-products
(Leskovec et al. 2007) is a bi-modal projection of categories in an Amazon product co-purchase network. Nodes are labeled as “book” or “music”, edges signify that the two items were purchased together.
polblogs
, r=−0.22, where there are more interactions between popular and obscure blogs than expected by chance) to assortative (as expected for social networks). All topologies except for amazon-products
have small average path length.
Network | |N| | |E| |
p
|
τ
|
\(\hspace {-2pt}\bar {d}\)
|
C
|
r
|
κ
| |||
---|---|---|---|---|---|---|---|---|---|---|---|
R(%) | B(%) | R−R(%) | B−B(%) | R−B(%) | |||||||
polblogs | 1224 | 16718 | 0.02 | 0.22 | −0.22 | 2.49 | |||||
(party) | 48 | 52 | 44 | 48 | 8 | 0.48 | 0.84 | ||||
fb-caltech | 769 | 16656 | 0.05 | 0.29 | −0.06 | 1.33 | |||||
(gender) | 91.5 | 8.5 | 92.8 | 0.2 | 7 | 0.08 | 0.52 | ||||
(occupation) | 72 | 28 | 69 | 8 | 23 | 0.28 | 0.42 | ||||
fb-dartmouth | 7694 | 304076 | 0.01 | 0.15 | 0.04 | 2.76 | |||||
(gender) | 86.5 | 13.5 | 83.2 | 0.9 | 15.9 | 0.14 | 0.34 | ||||
(occupation) | 62 | 38 | 58 | 18 | 24 | 0.38 | 0.5 | ||||
fb-michigan | 30147 | 1176516 | 0.0026 | 0.13 | 0.115 | 3.05 | |||||
(gender) | 92.2 | 7.8 | 90.5 | 0.2 | 9.3 | 0.08 | 0.37 | ||||
(occupation) | 77.5 | 22.5 | 72 | 9 | 19 | 0.22 | 0.46 | ||||
pokec-1 | 265388 | 700352 | 0.46 | 0 | 2×10−5 | 0.0068 | −0.044 | 5.66 | |||
(gender) | 46 | 54 | 18.6 | 22.4 | 59 | ||||||
amazon-products | 303551 | 835326 | 0.18 | 0.99 | 1.8×10−5 | 0.21 | −0.06 | 17.42 | |||
(category) | 82 | 18 | 83.4 | 16.4 | 0.2 |
pokec-1
(thus, no higher than chance preference for social ties with people of the same gender in Slovakia) to 0.99 in amazon-products
(books are purchased together with other books much more strongly than given by chance). The diversity metric p varies between the overrepresentation of males in the US academic Facebook networks (8% female representation) to an almost perfect political representation in the polblogs
dataset (where p=0.48). Note that, we only consider p as the minimum proportion of two node groups due to the symmetric nature of attributes in our experiments.Synthetic Graphs
Varying Topology via ERGMs
statnet
suite (Handcock et al. 2014), which contains several packages for network analysis, to produce ERGMs and simulate graphs from our real-world network datasets. In this case, we focused on three structural aspects of the graphs: clustering coefficient, average path length, and degree correlation/assortativity.edges
and triangle
parameters in the statnet package. The edges
parameter measures the probability of linkage or no linkage between nodes, and the triangle
term looks at the number of triangles or triad formations in the original graph. For the average path length model, edges
and twopath
terms were used. The twopath
term measures the number of 2-paths in the original network and produces a probability distribution of their formation for the converged ERGM. Lastly, for the assortativity measure, the terms edges
and degcor
were used to produce the models. The degcor
term considers the degree correlation of all pairs of tied nodes (for more on ERGMs see (Hunter et al. 2008; Morris et al. 2008)). These terms proved to be our best choices for preserving, to a certain extent, the desired structural information. Although the creation of ERGMs is a trial and error process, the selected terms were successful in producing models for each of the original networks.simulate
function in the statnet
suite there is no way of forcibly constraining the aspects of the original we want to control. Thus, we experience variation, in some cases more than others. The difference between the original and the simulated graphs seemed more prominent for smaller networks (see Table 1 and Table 2 for comparison) than models based on the larger networks which came closer to the real values of the original graphs.
Network | ERGM |
d
|
C
|
r
|
κ
| |S| (millions) |
---|---|---|---|---|---|---|
polblogs | dc | 0.02 | 0.03 | .08 | 2.52 | 5.5 |
cc | 0.02 | 0.33 | -0.02 | 2.69 | 13.1 | |
apl | 0.02 | 0.10 | -0.06 | 2.49 | 11.5 | |
fb-caltech | dc | 0.06 | 0.08 | 0.11 | 2.13 | 1.2 |
cc | 0.06 | 0.42 | -0.06 | 2.73 | 4.1 | |
apl | 0.06 | 0.07 | 0.11 | 1.97 | 1.2 | |
fb-dartmouth | dc | 0.01 | 0.17 | 0.07 | 2.66 | 14.5 |
cc | 0.01 | 0.24 | 0.04 | 2.77 | 13.2 | |
apl | 0.01 | 0.20 | 0.04 | 2.70 | 14.2 | |
fb-michigan | dc | 0.003 | 0.02 | 0.12 | 3.28 | 38.4 |
cc | 0.002 | 0.20 | 0.12 | 3.52 | 39.9 | |
apl | 0.002 | 0.20 | 0.12 | 3.64 | 38.2 | |
pokec-1 | dc | 2.02E-5 | 0.06 | -0.04 | 5.60 | 29.5 |
cc | 2.05E-5 | 0.07 | -0.04 | 5.84 | 29.3 | |
apl | 2.04E-5 | 0.06 | -0.04 | 5.63 | 27.3 | |
amazon-products | dc | 1.82E-5 | 0.37 | -0.06 | 11.86 | 43.7 |
cc | 1.82E-5 | 0.40 | -0.06 | 13.52 | 72.5 | |
apl | 1.82E-5 | 0.39 | -0.06 | 13.47 | 74.3 |
Synthetic Labeling
polblogs
topology. The proportion of cross-group ties is proportional to p, while it is inversely proportional to τ. When p reaches its maximum (pmax=0.5 due to the symmetric nature of binary attribute values), the proportion of cross-group ties is larger at minimum inbreeding coefficient τ.
Empirical Results
The Vulnerability Cost of Node Attributes
fb-dartmouth
. More interestingly, however, the same attribute performs differently for the other two Facebook networks considered: for fb-caltech
the occupation label functions as noise, leading to a small decrease in the F1-score. For fb-michigan
, on the other hand, the occupation label significantly improves the attacker’s performance.
fb-michigan
topology, where the difference between the impacts of the gender and the occupation attributes is the largest. We thus formulate a new question: What placement of attributes onto nodes reveal more information?Diversity Matters, Homophily Not
polblogs
dataset, which has a large diversity (p=0.48) (thus, almost equal numbers of conservative and liberal blogs). Note that the effect of p on the added vulnerability remains consistent across all topologies (real and synthetic) tested.amazon-products
and pokec-1
are orders of magnitude sparser than the other datasets considered. This means that the topological information available to the machine learning algorithm is limited. In this situation, the addition of the attribute information turns out to be very significant: the T-statistic values for these datasets are significantly larger than for the other datasets, with values over 400 in some cases.pokec-1
topology with the ERGM-generated ones in Fig. 5e: the node attribute contributes much more to the vulnerability of the original topology compared to the synthetic topologies. The reason for this unusual behavior may lay in the different clustering coefficients of the networks, as seen in Tables 1 and 2: the ERGM-generated topologies have clustering coefficients one order of magnitude higher than the original topology (for the same graph density), which leads to more diverse NDD feature vectors for the networks with higher clustering and thus richer training information. This in turn leads to better accuracy in node re-identification in the unlabeled ERGM topologies (with higher clustering) than in the original topology. For example, the maximum F1-score for the ERGM-dc topology is 0.92 while for the original is 0.76 in pokec-1
. Thus, the relative benefit of the node attribute is significantly higher when the topology features were poorer.Topology Leaks
pokec-1
and amazon-products
with a larger range of node degrees, this behavior is observed.polblogs
or pokec-1
), the topological information contributes less than on datasets with low diversity (such as fb-caltech (gender)
). This is because high diversity correlates to richer NAD feature vectors, and thus the relative importance of the NAD features increases.Epidemic and the Risk of Node Re-identification
fb-caltech
network, the T-statistic value reaches a peak value of 10 in four infection steps for β=0.1, while the T-statistic value reaches a peak value of 50 in two infection steps for β=0.9. Interestingly, the most diverse population in fb-caltech
network is also observed after four infection steps for β=0.1, and two infection steps for β=0.9 (as shown in Fig. 7d). In polblogs
, T-statistic values reach peak values of 31 and 36 for the infection rates of 0.1 and 0.9, respectively (as shown in Fig. 7h). The polblogs
population becomes more diverse in the similar number of infection steps given the respective infection rate.