1 Introduction
1.1 Contributions and outline
Symbol | Meaning | Formula |
---|---|---|
\(\delta \) | Product of the squared total degree of a community structure | \(\sum _{C_i \in \overline{C}}{\delta (C_i)}^2\) |
\(\eta \) | Sum of the intra-edges of a community structure | \(\sum _{C_i \in \overline{C}}{\vert E(C_i)\vert }\) |
m | Number of edges in the network \({G}=({V},{E})\) | \(\vert E \vert \) |
\(\delta (C_i)\) | Total degree of community \(C_i\) | \(\sum _{u\in C_i} {\delta (u)}\) |
\(\delta (u)\) | Degree of node u | \(\vert \{(u,v):(u,v)\in E\}\) |
\({V}^{u}(\mathscr {C})\) | Set of nodes reachable from u passing only via nodes in \(\mathscr {C}\), excluding u itself | |
\(E(C_i)\) | Set of intra-community edges of community \(C_i\) | \(\{(u,v): u,v\in C_i\}\) |
\(\widetilde{E}(C_i)\) | Set of inter-community edges of community \(C_i\) | \(\{(u,v): u\in C_i,v\notin C_i\}\) |
\(E(u,C_i)\) | Set of intra-community edges of node u belonging to the community \(C_i\) | \(\{(u,v): u,v\in C_i\}\) |
\(\widetilde{E}(u,C_i)\) | Set of inter-edges of node u belonging to the community \(C_i\) | \(\{(u,v): u\in C_i,v\notin C_i\}\) |
2 Background
2.1 Problem statement
DICE
Waniek et al. (2018), node safeness (maximization) as for SAFDEC (Fionda and Pirrò 2018), or permanence (maximization) as for NEURAL (Mittal et al. 2021). After applying the modifications suggested by the Deceptor and obtaining a new network \(G'\), the desideratum is that the Detector by analyzing \(G'\) is no more able to discover \(\mathscr {C}\); ideally because \(\mathscr {C}\)’s members are scattered among different communities. In order to quantify the privacy leak caused by the Detector, the Deception Evaluator module leverages some score such as the Deception Score (Fionda and Pirrò 2018) reported in Definition 1.3 Related work
3.1 Modularity-based deception
3.2 Safeness-based deception
3.3 Permanence-based deception
leiden
(Traag et al. 2019) may undermine the overall goal of protecting a community in a directed network.4 Deception in directed networks
Symbol | Meaning | Formula |
---|---|---|
\(\overrightarrow{\delta }\) | Product of the total output degree and total input degree of a community structure | \(\overrightarrow{\delta }=\delta _{o}\cdot \delta _{i}\) |
\(\delta _{o}\) | Total output degree of a community structure \(\overline{C}\) | \(\sum _{C_i \in \overline{C}}{\delta _{o}(C_i)}\) |
\(\delta _{i}\) | Total input degree of a community structure \(\overline{C}\) | \(\sum _{C_i \in \overline{C}}{\delta _{i}(C_i)}\) |
\(\delta _{o}(C_i)\) | Total output degree of community \(C_i\) | \(\sum _{u\in C_i} {\delta _{o}(u)}\) |
\(\delta _{i}(C_i)\) | Total input degree of community \(C_i\) | \(\sum _{u\in C_i} {\delta _{i}(u)}\) |
\(\delta _{o}(u)\) | Output degree of node u | \(\vert \{(u,v):(u,v)\in E\}\vert \) |
\(\delta _{i}(u)\) | Input degree of node u | \(\vert \{(v,u):(v,u)\in E\}\vert \) |
4.1 Directed modularity-based deception
4.1.1 Intra-edge addition
4.1.2 Inter-edge addition
4.1.3 Intra-edge deletion
4.1.4 Inter-edge deletion
4.2 Directed Safeness-Based Deception
Symbol | Meaning | Formula |
---|---|---|
\({V}_o^{u}(\mathscr {C})\) | Nodes in \(\mathscr {C}\) (excluding u itself) reachable from u: | |
(i) passing only via nodes in \(\mathscr {C}\) and | ||
(ii) following directed paths originating from u | ||
\({V}_i^{u}(\mathscr {C})\) | Nodes in \(\mathscr {C}\) (excluding u itself) that can reach u: | |
(i) passing only via nodes in \(\mathscr {C}\) and | ||
(ii) by following directed paths terminating in u | ||
\(E_o(u,\mathscr {C})\) | Outgoing edges of u to members of \(\mathscr {C}\) | \(\{(u,v)\mid v\in \mathscr {C}\}\) |
\(E_i(u,\mathscr {C})\) | Incoming edges of u from members of \(\mathscr {C}\) | \(\{(v,u)\mid v\in \mathscr {C}\}\) |
\(\widetilde{E}_o(u,\mathscr {C})\) | Outgoing edges of u to non members of \(\mathscr {C}\) | \(\{(u,v)\mid v\notin \mathscr {C}\}\) |
\(\widetilde{E}_i(u,\mathscr {C})\) | Incoming edges of u from non members of \(\mathscr {C}\) | \(\{(v,u)\mid v\notin \mathscr {C}\}\) |
\(\delta _{o}(u)\) | Output degree of node u | \(\vert \{(u,v)\}\vert \) |
\(\delta _{i}(u)\) | Input degree of node u | \(\vert \{(v,u)\}\vert \) |
4.2.1 Intra-edge addition
4.2.2 Inter-edge addition
4.2.3 Intra-edge deletion
4.2.4 Inter-edge deletion
4.3 Directed permanence-based deception
Symbol | Meaning | Formula |
---|---|---|
\(E_o(u,C_i)\) | Outgoing edges of u to members of \(C_i\) | \(\{(u,v)\mid u\in C_i\}\) |
\(E_i(u,C_i)\) | Incoming edges of u from members of \(C_i\) | \(\{(v,u)\mid v\in C_i\}\) |
\(E_o^{max}(u)\) | Maximum number of edges originated from u connecting u to a neighbour community | \(\max _{C\in \overline{C}}\vert \{(u,v)\mid v\in C\}\vert \) |
\(E_i^{max}(u)\) | Maximum number of incoming edges of u connecting u to a neighbour community | \(\max _{C\in \overline{C}}\vert \{(v,u)\mid v\in C\}\vert \) |
\(C_{in}(u)\) | Clustering coefficient of u’s neighbours | \(\frac{\vert \{(v,w)\in E:v,w\in N_u\}\vert }{\delta (u)(\delta (u)-1)}\) |
\(\delta (u)\) | Degree of u | \(\vert \{(u,v)\}\vert +\vert \{(v,u)\}\vert \) |
4.3.1 Intra-edge addition
4.3.2 Inter-edge addition
4.3.3 Intra-edge deletion
4.3.4 Inter-edge deletion
4.4 Directed deception in practice
infomap
detection algorithm (Rosvall and Bergstrom 2008). In the following, we suppose that the target community is \(C_1=\{\)0, 1, 3, 4, 5\(\}\) and it has been completely disclosed. Moreover, we will consider a budget of update \(\beta =4\).5 Experimental evaluation
5.1 Experimental setting
5.1.1 Detectors
-
Surprise community (Traag et al. 2015) (
surprise
): this algorithm uses the notion of asymptotic surprise, which assesses the quality of the partition of a network into communities. -
InfoMap (Rosvall and Bergstrom 2008) (
infomap
): a detection algorithms that leverages information theory (the shortest description length for a random walk) to return a community structure. -
Gemsec (Rozemberczki et al. 2019) (
gemsec
): an approach that leverages random walks to approximate the point-wise mutual information matrix obtained by pooling normalized adjacency matrix powers. This matrix is decomposed by an approximate factorization technique which is combined with a k-means-like clustering cost.
5.1.2 Deceptors
-
Approaches devised for undirected networks: to make the experiments possible for these approaches, we treat the directed network under consideration as undirected. We considered:
-
Delete Internal Connect External (Waniek et al. 2018) (
DICE
): this community deception algorithm is based on the heuristic of deleting intra-community edges and adding inter-community edges.DICE
is based on the assumption that such kinds of edge updates always minimize modularity. -
Modularity Minimization (Fionda and Pirrò 2018) (
modMin
): this approach corrects for some issues withDICE
; the authors ofmodMin
showed that in some cases,DICE
fails to perform edge updates that minimize modularity. -
Safeness-based deception (Fionda and Pirrò 2018) (
SAF
): this approach introduces safeness maximization for community deception. -
Permanence-based deception (Mittal et al. 2021) (
NEUR
): this approach is based on permanence minimization. -
Random edge updates (
RND
): we consider an approach that randomly selects both the type of update and the endpoints of the edge addition/deletion.
-
-
Approaches devised for directed networks: for this category of deceptors we consider edge direction. In this case, we considered all the novel approaches described in the present paper.
5.2 Datasets
Network | \(\vert V\vert \) | \(\vert E\vert \) | Number of communities | ||||
---|---|---|---|---|---|---|---|
leiden | dm | surprise | infomap | gemsec | |||
Freeman | \(\sim \)50 | \(\sim \)500 | 5 | 5 | 7 | 6 | 5 |
Email | \(\sim \)1K | \(\sim \)25K | 28 | 32 | 21 | 12 | 16 |
AnyBeat | \(\sim \)12K | \(\sim \)67K | 129 | 81 | 143 | 156 | 112 |
WikiVote | \(\sim \)7K | \(\sim \)103K | 30 | 34 | 43 | 51 | 49 |
Facebook-like | \(\sim \)900 | \(\sim \)142K | 6 | 5 | 6 | 7 | 5 |
Epinions | \(\sim \)75K | \(\sim \)508K | 795 | 986 | – | – | – |
Slashdot | \(\sim \)77K | \(\sim \)905K | 825 | 1115 | – | – | – |
SocialNet | \(\sim \)82K | \(\sim \)1M | 841 | 1120 | – | – | |
Academia | \(\sim \)200K | \(\sim \)1.4M | 89 | 93 | – | – | – |
GooglePlus | \(\sim \)211K | \(\sim \)1.5M | 2105 | – | – | – | – |
5.3 Evaluation methodology
-
Deception Score: this score, which ranges between 0 and 1, combines a measure of reachability preservation among \(\mathscr {C}\) members, community spread (in how many communities are \(\mathscr {C}\)’s members spread), and community hidings (\(\mathscr {C}\)’s members should be included in the largest communities) (Fionda and Pirrò 2018).
-
Normalized Mutual Information(NMI) (Danon et al. 2005): this is a measure that we use to check how deception affects the original community structure. In particular, given the community structure before deception \(\overline{C}\) and the community structure after deception \(\overline{C}\)’, we have NMI\((\overline{C},\overline{C}') \in [0,1]\).
-
Running time: we also measured deception running time for the various algorithms without considering the time to find communities.
5.4 Evaluating directed community deception
5.4.1 Deception score
Facebook
), obtaining larger values for the deception score seems easier. In general, the deception score always reaches a value greater than 0.5. This can be considered a reasonably good value because the initial deception score was 0 (that is, \(\mathscr {C}\) was completely revealed). The deception score increases as the number of edge updates increase; this is consistent for all deception algorithms, detection algorithms, and networks. dsaf
performs better than the other algorithms in almost all settings. One exception is the leiden
detection algorithm, where dmod
seems to perform slightly better than dsaf
. dper
seems to be the less performing detection algorithm. We looked into the deception score’s community spread and reachability components to shed light on this behavior. In several cases, edge updates suggested by dper
result in internal edge deletions that result in a disconnection of the community when the number of edge updates increases. gemsec
seems to be a relatively robust detection algorithm for all three deception approaches. This is especially true in the Anybeat
network where, when the number of edge updates is below 60% of the number of edges in \(\mathscr {C}\), the deception score remains quite low. Figure 5 shows the results for the largest networks considered. We note that only leiden
and dm
were able to complete the community detection task within a timeout of 3h. Even in larger networks, it can be observed that larger budget values (x-axis) correspond to larger values of the deception score. In particular, with a budget equal to 60% of the total number of edges in the community in all cases, the deception score is greater than 0.5. We recall that experiments were conducted in the worst-case scenario with a deception score pre-deception equal to 0 (\(\mathscr {C}\) completely revealed). However, it is reasonable to assume that the initial deception score is larger; in reality, when deception algorithms are applied, \(\mathscr {C}\) is not completely revealed. The larger the network, the more difficult it becomes the hide. With the same budget percentage, the results in terms of deception score are lower. By further digging out on the results, we observe that for Epinions
the number of communities found by leiden
and dm
is 795 and 896, respectively. The average size of the \(\mathscr {C}\) considered in the experiments is around 400. hence, with 160 updates, dsaf
can achieve a score greater than 0.5.leiden
is 2105, and the average size of the \(\mathscr {C}\) considered in the experiments is 500. In this case, with 300 updates, dsaf
can reach a deception score value greater than 0.5. We again note the leiden
was the only algorithm able to complete the detection task within a 3h timeout.
dper
, which is around 20% and 15% less performing than dsaf
and dmod
, respectively. One interesting case is the Academia
social network, where nodes represent members that follow other members (hence the network is directed). On the one hand, we observe that dsaf
even with lower budget values can achieve a significant result than the other approaches. On the other hand, we observe that dsaf
seems to reach a saturation point where further edge updates do not add any benefit. The same is not true for dmod
and dper
, the performance of which has a significant increase when moving from a 30% to a 60% budget. Here, the average size of \(\mathscr {C}\) is 100.leiden
is more robust to the deception strategies than dm
. The reason for this can be found in the fact that leiden
finds communities by ensuring that they are well-connected, while dm
is an adaptation of modularity optimization to the directed case. Interestingly, the direct competitor of dm
would be leiden
, also based on directed modularity. However, dsaf
consistently performs better both on medium and large networks.5.4.2 Normalized mutual information (NMI)
dsaf
appears to be the deception algorithm that better preserves the community structure, followed by dmod
. The detection algorithms that seem to suffer less from deception in terms of NMI are surprise and infomap
. On larger networks (Fig. 6 (b)), the NMI values are a bit higher in general. Still, we have that dper
is the deception approach that most changes the community structure. To shed more light on the results, we investigated the relationship between the number of communities before and after deception (referred to as \(\Delta \)). Figure 7 reports the results for medium size networks when the budget is set to 60% of the number of edges in \(\mathscr {C}\). We note that the number of communities after deception decreases; this is always true for the leiden
, infomap
and gemsec
detection algorithms.
Anybeat
network, the number of communities significantly increases (the initial number was 126 for leiden
and becomes 167). For the surprise
detection algorithm, the larger number of communities is observed in the email network (almost 100 additional communities) for dsaf
. In this case, dmod
and dper
decreased the overall number of communities. By looking at the deception score related to this case (see Fig. 4), we note that dsaf
obtained a score higher than dmod
and dper
. The explanation for this improvement is the significant change in the number of communities, which in turn corresponds to an increase in the community spread, that is, the number of communities where \(\mathscr {C}\)’s members are scatted; indeed, this value went from 1 (the initial setting) to 67. A similar observation can be made for the WikiVote
network and the leiden
detection algorithm; here, dsaf
added a larger number of communities that resulted in a more significant deception score than dmod
and dper
.Epinion
and SocialNet
for the leiden
detection algorithm and the dsaf
deception algorithm. We note that dsaf
reaches a deception score of 0.7 (see Fig. 5).5.5 Comparison with undirected deception
Academia
network represents follower-followee relations that naturally carry a direction. To run the experiments with the state-of-the-art deception algorithms, we treated the networks as undirected and considered the same \(\mathscr {C}\). In these experiments, we focus on a budget of updates equal to 60% of the edges of \(\mathscr {C}\) as this configuration worked best for all approaches. In what follows, we report on comparison in terms of deception score and running time.5.5.1 Deception score
leiden
. Here, we observe that for the Facebook
network, the undirected approaches (excluding the random edge update approach) perform better. To shed more light on this aspect, we looked into the difference between the number of communities after and before deception. In this case, the undirected approaches introduced a larger number of communities than the directed ones. This relation between the number of communities after deception and deception score was also observed when focusing on directed approaches alone (see Sect. 5.4.2).
Epinions
network. When considering leiden
, the best performing approach was dmod
while with leiden
, dsaf
obtained slightly better results. One crucial observation is that the undirected algorithms performed significantly worse, reaching in only a few cases a deception score greater than 0.5. As one would expect, the worst-performing deceptor is RND
, which adds/remove edges randomly starting from \(\mathscr {C}\)’s members. Also, NEUR
seems to perform worse than other undirected approaches. To shed more light on this behavior, we looked again at the changes in the community structure and the structure of \(\mathscr {C}\); even in this case, we observed that NEUR
frequently performs internal edge deletions that disconnect \(\mathscr {C}\).5.6 Deception with ground truth communities
email
available from the SNAP repository7, which represents communication between members of an EU institution with ground-truth communities witnessing the membership of individuals in one of the 42 departments. The network consists of \(\sim \)1K nodes, 25K edges, and 42 ground-truth communities.Deception score | |||||
---|---|---|---|---|---|
Leiden | dm | Surprise | Infomap | Gemsec | |
dsaf | 0.623 | 0.893 | 0.665 | 0.635 | 0.654 |
dmod | 0.598 | 0.823 | 0.648 | 0.642 | 0.642 |
dper | 0.578 | 0.789 | 0.514 | 0.612 | 0.591 |
DICE | 0.468 | 0.658 | 0.498 | 0.502 | 0.471 |
modMin | 0.512 | 0.662 | 0.501 | 0.512 | 0.484 |
SAF | 0.532 | 0.698 | 0.503 | 0.594 | 0.493 |
NEUR | 0.516 | 0.658 | 0.509 | 0.449 | 0.472 |
RND | 0.235 | 0.123 | 0.256 | 0.226 | 0.194 |
dsaf
outperforms all the competitors, with approaches devised for undirected networks offering inferior performance. An interesting case is rnd
, which performs worse than before. This indicates that the deception strategies do not heavily depend on the particular community chosen. However, we noticed that when the size of \(\mathscr {C}\) is small, it is, in general, easier to obtain larger values of deception.\(NMI(\overline{C}_G,\overline{C}_B)\) | |||||
---|---|---|---|---|---|
Leiden | dm | Surprise | Infomap | Gemsec | |
0.891 | 0.877 | 0.797 | 0.862 | 0.872 |
leiden
being the most performing one. We now move to the analysis of the NMI values by comparing ground-truth communities and communities after applying community deception techniques.\(NMI(\overline{C}_G,\overline{C}_A)\) | |||||
---|---|---|---|---|---|
Leiden | dm | Surprise | Infomap | Gemsec | |
dsaf | 0.585 | 0.591 | 0.563 | 0.546 | 0.603 |
dmod | 0.591 | 0.594 | 0.574 | 0.567 | 0.57 |
dper | 0.581 | 0.515 | 0.559 | 0.536 | 0.586 |
DICE | 0.593 | 0.504 | 0.591 | 0.504 | 0.581 |
modMin | 0.533 | 0.582 | 0.629 | 0.597 | 0.589 |
SAF | 0.514 | 0.586 | 0.615 | 0.512 | 0.609 |
NEUR | 0.593 | 0.565 | 0.625 | 0.536 | 0.598 |
RND | 0.589 | 0.562 | 0.624 | 0.547 | 0.586 |
leiden
detection algorithm and the dsaf
deception algorithm, which was the best performing detection algorithm, we note that values of NMI drop from 0.891 to 0.585 on average. This means that no matter which of the 42 ground-truth communities we chose as \(\mathscr {C}\), there will be a significant difference between the ground truth communities and the communities returned after applying dsaf
. This same reasoning applies to all other deception techniques. However, we have two observations. First, not always lower values of NMI correspond to higher values of the deception score, which is what ultimately community deception strives to obtain. As an example, although the value of NMI for the DICE
deception algorithm when considering communities returned by the dm
detection algorithm is lower than that of dsaf
, we observe that with the latter, a much larger value of the deception score was obtained (see Table 6). This reasoning is evident when considering the RND
deception strategy, which adds and removes edges without any clear objective. In fact, while the NMI values are always above 0.5 the corresponding deception score values are very low.5.6.1 Deception running time
Freeman
, Facebook
). An exception is the modMin
algorithm in the Facebook
and Email
networks when considering leiden
and dm
, respectively. We further investigated this behavior and hypothesized that it is difficult to exclude bridge edges from possible edge deletions. The same happens in the email network when considering leiden
. We recall that both SAF
and modMin
try to exclude the deletion of internal bridge edges that would disconnect \(\mathscr {C}\).
leiden
and dm
appear to be the most costly in terms of running time. We also observe that directed approaches, especially dper
, require more time than undirected ones in most cases. This comes as no surprise since, from the analysis conducted in this paper (Sect. 4), finding the best edge updates requires a more involved formula where the edge direction and in/out node degrees play a significant role.
GooglePlus
network and the leiden
algorithm, where the dper
algorithm required almost 400 seconds to perform the updates. By further analyzing the numbers, we observed that for the largest network, that is, GooglePlus
the number of updates to be performed was in the order of the hundreds. However, in reality, one can expect that communities that want to implement deception strategies would have a smaller size. This would be reasonable since coordinating among a large group of people can quickly become problematic; in fact, edge updates need to be performed in a real scenario by friending/unfriending or following/unfollowing other nodes.DICE
and RND
is not reported in the table since, at each iteration, the update is selected randomly. Thus, the theoretical complexity only depends on the number of updates (\(\beta \)) that have to be performed. In the case of modularity minimization (row 1 in Table 9) the initialization (corresponding to the computation of \(\delta \), \(\eta \), and the (input/output) total degrees of the communities. Then, the best update to be performed can be computed in \(\mathcal {O}(k+\vert E_\mathscr {C}\vert +\vert V_\mathscr {C}\vert )\), where k is the number of communities found by the community detection algorithm and derives from the computation of the best inter-edge addition, while the term \(\vert E_\mathscr {C}\vert +\vert V_\mathscr {C}\vert \) is necessary to compute new bridge edges in \(\mathscr {C}\) if the performed update is an intra-edge deletion. The asymptotic complexity of the safeness-based algorithm is dominated by the computation of the best \(\beta \) updates. Indeed, in this case, the initialization has a cost \(\mathcal {O}(\vert E_\mathscr {C}\vert +\vert V_\mathscr {C}\vert )\) needed for the computation of the connected components of \(\mathscr {C}\) and the intra and inter (input/output) degree of the nodes of \(\mathscr {C}\). Then, the cost of computing each update is \(\mathcal {O}(\vert E_\mathscr {C}\vert +\vert V_\mathscr {C}\vert )\), where the cost of computing the best inter-community addition is \(\mathcal {O}(\vert V_\mathscr {C}\vert )\) (to find the node bringing the maximum safeness gain). The cost of computing the best intra-community deletion is \(\mathcal {O}(\vert E_\mathscr {C}\vert +\vert V_\mathscr {C}\vert )\) since it is dominated by the recomputing of the new bridges after the deletion. Finally, the asymptotic complexity of the permanence-based algorithm is \(\mathcal {O}(\beta \cdot (\vert V_\mathscr {C}\vert +\vert E_\mathscr {C}\vert ^2))\) (as reported in row 3 of Table 9). In this case the cost of the initialization is \(\mathcal {O}(\vert V_\mathscr {C}\vert \cdot \vert E_\mathscr {C}\vert ))\) to compute intra and inter (input/output) degrees and clustering coefficients. Moreover, each update can be computed with a cost of \(\mathcal {O}(\vert V_\mathscr {C}\vert +\vert E_\mathscr {C}\vert ^2)\), where the best intra-community edge addition and intra-community edge deletion can be computed with a cost \(\mathcal {O}(\vert V_\mathscr {C}\vert \cdot \vert E_\mathscr {C}\vert ))\), dominated by the recomputation of the clustering coefficient for each possible modification. The computation of the best inter-community edge addition can be computed in \(\mathcal {O}(\vert V_\mathscr {C}\vert )\).Deception measure | Asymptotic complexity |
---|---|
Modularity | \(\mathcal {O}(\vert E\vert +\vert V\vert +\beta \cdot (k+\vert E_\mathscr {C}\vert +\vert V_\mathscr {C}\vert ))\) |
Safeness | \(\mathcal {O}(\beta \cdot (\vert E_\mathscr {C}\vert +\vert V_\mathscr {C}\vert ))\) |
Permanence | \(\mathcal {O}(\beta \cdot (\vert V_\mathscr {C}\vert +\vert E_\mathscr {C}\vert ^2))\) |