1 Introduction
-
Disjoint community detection: The goal here is to compute a partition of the graph's node set, where each node belongs to exactly one community. Most of the work in the area of community detection deals with this problem [21].
-
Proposing LICOD, a general framework for implementing LdCD algorithms.
-
Introducing task-oriented evaluation of community detection algorithms, and providing an approach for comparing different algorithms on data clustering tasks.
2 Definitions and notations
3 Related work
3.1 Community detection approaches
3.1.1 Group-based approaches
-
High mutual connectivity: a community can be assimilated to a maximal clique or to a \(\gamma \)-quasi-clique. A subgraph \(G\) is said to be a \(\gamma \)-quasi-clique if \(d(G) \ge \gamma \). Finding maximal cliques in a graph is known to be an NP-hard problem. Generally, cliques of reduced size are used as seeds to find larger communities; an example is the clique percolation algorithm [1, 82]. Such approaches are relevant for networks that are rather dense.
-
High internal reachability: One way to relax the constraint of having cliques or quasi-cliques is to consider the internal reachability of nodes within a community. Following this, a community core can be approximated by a maximal k-clique, k-club or k-core subgraph. A k-clique (resp. k-club) is a maximal subgraph in which the longest shortest path between any two nodes (resp. the diameter) is \(\le \) k. A k-core is a maximal connected subgraph in which each node has degree \(\ge \) k. In [86], the authors introduce the concept of a k-community, defined as a connected subgraph \(G' = \langle V' \subset V, E' \subset E\rangle \) of a graph \(G\) in which every pair of nodes \(u,v \in V'\) satisfies \(|\Gamma _G(v) \cap \Gamma _G(u)| \ge k\). The computational complexity of finding k-cores and k-communities is polynomial. However, these structures do not cover the whole community; they are rather used as seeds for computing communities, and an additional step is needed to assign the remaining, non-clustered nodes. In [67], the authors propose computing k-cores as a means to accelerate community detection: standard algorithms are applied to the resulting size-reduced graphs.
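The pruning behaviour of k-cores can be illustrated with networkx's built-in `k_core` routine; a minimal sketch on a toy graph:

```python
import networkx as nx

# Toy graph: a dense 4-clique plus a pendant chain.
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),  # clique
                  (4, 5), (5, 6)])                                 # chain

# The 3-core iteratively removes nodes whose degree falls below 3:
core = nx.k_core(G, k=3)
print(sorted(core.nodes()))  # [1, 2, 3, 4] — the clique survives, the chain is pruned
```

The surviving core is a seed, not a full community: nodes 5 and 6 would still need to be assigned in a later step.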
3.1.2 Network-based approaches
-
Agglomerative approaches: These implement a bottom-up strategy in which the algorithm starts by considering each single node as a community, then iteratively merges communities guided by some quality criterion. The Louvain algorithm [7] is a well-known example of such approaches. The algorithm is composed of two phases. First, it looks for small communities by optimizing modularity locally. Second, it aggregates the nodes of each community and builds a new network whose nodes are the communities. Two adjacent communities are merged if doing so increases the overall modularity of the obtained partition. These steps are repeated iteratively until a maximum of modularity is reached. The computing complexity of the approach is empirically evaluated to be \(\mathcal {O}(n \log( n)) \).
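The two-phase procedure is available out of the box in recent networkx versions as `louvain_communities` (the `seed` parameter fixes the random node ordering); a minimal sketch:

```python
import networkx as nx

G = nx.karate_club_graph()

# Phase 1 optimises modularity locally; phase 2 contracts each community
# into a super-node and repeats, until modularity stops improving.
partition = nx.community.louvain_communities(G, seed=42)

Q = nx.community.modularity(G, partition)
print(len(partition), round(Q, 2))
```

The returned partition is a list of disjoint node sets covering the whole graph.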
-
Separative approaches: These implement a top-down strategy, in which the algorithm starts by considering the whole network as a single community and iteratively selects ties to remove in order to split the network into communities. Different criteria can be applied for tie selection. The Newman–Girvan algorithm is the best-known representative of this class of approaches [58]. The algorithm is based on the simple observation that a tie linking two communities should have a high betweenness centrality. This is naturally true, since an inter-community tie is traversed by a high fraction of the shortest paths between nodes belonging to the two communities. Starting from the whole graph \(G\), the algorithm iterates \(m_G\) times, cutting at each iteration the tie with the highest betweenness centrality. This builds a hierarchy of communities, the root of which is the whole graph and whose leaves are communities composed of isolated nodes. The partition with the highest modularity is returned as output. The algorithm is simple to implement and has the advantage of automatically discovering the number of communities. However, its computational complexity, \(\mathcal {O}(n^2 \cdot m + (n)^3\log(n))\), is prohibitive for large-scale networks.
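The edge-removal loop can be sketched with networkx's `girvan_newman` generator, keeping the split of highest modularity; only the first few splits are examined here, for speed:

```python
import networkx as nx
from itertools import islice

G = nx.karate_club_graph()

# girvan_newman yields successive partitions as the edge with the highest
# betweenness is repeatedly removed; retain the best-scoring one.
splits = islice(nx.community.girvan_newman(G), 10)
best = max(splits, key=lambda p: nx.community.modularity(G, p))
print(len(best), round(nx.community.modularity(G, best), 2))
```

In a full run the generator would be consumed until the graph dissolves into singletons, exactly mirroring the hierarchy described above.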
-
The best partition of a graph is the one that maximizes the modularity.
-
If a network has a community structure, then it is possible to find a precise partition with maximal modularity.
-
If a network has a community structure, then partitions inducing high modularity values are structurally similar.
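The first of these assumptions can be probed on a small example: a partition matching a real community structure scores a clearly positive modularity, while the trivial one-community partition scores exactly zero. A minimal check using networkx's built-in karate-club graph:

```python
import networkx as nx

G = nx.karate_club_graph()

# Ground-truth split (the 'club' node attribute) vs. the trivial partition.
truth = [{v for v in G if G.nodes[v]["club"] == "Mr. Hi"},
         {v for v in G if G.nodes[v]["club"] != "Mr. Hi"}]
trivial = [set(G.nodes())]

q_truth = nx.community.modularity(G, truth)      # clearly positive
q_trivial = nx.community.modularity(G, trivial)  # 0 by definition
print(round(q_truth, 3), q_trivial)
```

For the trivial partition, the internal-edge fraction and the squared degree fraction both equal 1, so the two terms of the modularity formula cancel.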
3.1.3 Propagation-based approaches
-
First, there is no formal guarantee of convergence to a stable state.
-
Lastly, it lacks robustness, since different runs produce different partitions due to random tie-breaking.
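The robustness issue is easy to observe in practice; a sketch using networkx's asynchronous label propagation (`asyn_lpa_communities`), whose output can vary with the random seed:

```python
import networkx as nx

G = nx.karate_club_graph()

# Each node repeatedly adopts the label most common among its neighbours,
# with ties broken at random — hence run-to-run variability.
runs = [list(nx.community.asyn_lpa_communities(G, seed=s)) for s in range(5)]

# Different seeds may yield different numbers of communities:
print(sorted(len(r) for r in runs))
```

Every run still yields a valid partition of the node set; only its shape is unstable.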
3.1.4 Seed-centric approaches
3.2 Community evaluation approaches
3.2.1 Ground-truth comparison approaches
-
Annotation by experts: For some networks representing real systems, experts in the system's field have been able to define the community structure. Examples of such networks are given in Sect. 5.1. In general, these networks are rather small (hence manageable by experts), and the defined community structure is usually given as a partition of the studied graph, with no overlap between communities.
-
Use of network generators: The idea here is to generate artificial networks with a predefined community structure. Early work in this area includes the Girvan–Newman benchmark graph [23]. A more sophisticated generator is proposed in [42], where the user can control different parameters of the network, including the size, the density, the degree distribution law, the clustering coefficient, the distribution of community sizes, and the separability of the obtained communities. While the approach is interesting, the generated networks are not guaranteed to be sufficiently similar to the complex networks observed in real-world applications.
-
Implicit community definition: This approach infers the community structure of a graph by applying simple rules, usually taking the semantics of ties into account. For example, in [90] the authors define a community in the LiveJournal social network as the group of fans of a given artist, and communities in a co-authorship network of scientific publications as the authors participating in the same venue. The relevance of such rules is questionable.
-
\(S_{11}\) = {pairs that are in the same cluster under \(P_i\) and \(P_j\)}
-
\(S_{00}\) = {pairs that are in different clusters under \(P_i\) and \(P_j\)}
-
\(S_{10}\) = {pairs that are in the same cluster under \(P_i\) but in different ones under \(P_j\)}
-
\(S_{01}\) = {pairs that are in different clusters under \(P_i\) but in the same under \(P_j\)}
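From these four pair sets, agreement indices such as the Rand index follow directly; a self-contained sketch (the partitions `p_i`, `p_j` below are illustrative):

```python
from itertools import combinations

def pair_counts(p1, p2):
    """Count the pair categories S11, S00, S10, S01 between two
    partitions given as {element: cluster_label} dictionaries."""
    s11 = s00 = s10 = s01 = 0
    for a, b in combinations(sorted(p1), 2):
        same1 = p1[a] == p1[b]
        same2 = p2[a] == p2[b]
        if same1 and same2:
            s11 += 1
        elif not same1 and not same2:
            s00 += 1
        elif same1:
            s10 += 1
        else:
            s01 += 1
    return s11, s00, s10, s01

p_i = {"a": 0, "b": 0, "c": 1, "d": 1}
p_j = {"a": 0, "b": 0, "c": 0, "d": 1}
s11, s00, s10, s01 = pair_counts(p_i, p_j)

# Rand index: fraction of pairs on which the two partitions agree.
rand = (s11 + s00) / (s11 + s00 + s10 + s01)
print(s11, s00, s10, s01, rand)  # 1 2 1 2 0.5
```

The adjusted Rand index (ARI) used later in the experiments corrects this raw agreement for chance.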
3.2.2 Topological measures for community evaluation
-
Global measures, which evaluate the quality of the computed partition as a whole. The modularity \(Q\) defined in [57] (see formula 1) is the most widely applied measure. Other modularity measures have also been proposed [51, 54]. However, the modularity limitations discussed earlier (see Sect. 3.1.2) hinder its utility as an evaluation metric.
-
Local topological measures. A number of local topological measures have been proposed to evaluate the quality of a given community. Most are used in the context of identifying ego-centered communities [4, 11]. In [90], the authors present an interesting survey of these measures. Let \(f(c)\) be a community evaluation measure. The quality of a partition \(\mathcal {C} = \{c_1, \ldots , c_k\}\) is then simply given by:$$\begin{aligned} Q(\mathcal {C} )= \frac{\sum \nolimits _{i} f(c_i)}{| \mathcal {C} |} \end{aligned}$$(5)
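Formula (5) can be sketched directly; here `separation` (one minus conductance) stands in as an illustrative local measure \(f\), not one prescribed by the paper:

```python
import networkx as nx

def partition_quality(G, communities, f):
    # Formula (5): average a per-community score f over the partition.
    return sum(f(G, c) for c in communities) / len(communities)

def separation(G, c):
    # 1 - conductance: higher values mean fewer boundary edges
    # relative to the community's internal volume.
    return 1.0 - nx.conductance(G, c)

G = nx.karate_club_graph()
truth = [{v for v in G if G.nodes[v]["club"] == "Mr. Hi"},
         {v for v in G if G.nodes[v]["club"] != "Mr. Hi"}]
q = partition_quality(G, truth, separation)
print(round(q, 3))
```

Any of the surveyed local measures could be dropped in for `f` without changing the aggregation step.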
3.2.3 Task-driven evaluation
4 The LICOD approach
4.1 Informal presentation
4.2 Implementation issues
4.2.1 Function \(isLeader()\)
-
Degree centrality (denoted \(dc\)): This is given by the proportion of nodes directly connected to the target node. Formally, the degree centrality of a node \(v\) is given by \( \mathrm{dc}(v) = \frac{d_G(v)}{n_G -1}\). Its computation complexity is \(\mathcal {O}(n_G)\).
-
Betweenness centrality \(\mathrm{BC}(v)\): This is given by the fraction of all-pairs shortest paths that pass through the target node. Formally, the betweenness centrality of a node \(v\) is given by \(\mathrm{BC}(v) = \sum _{s,t \in V} \frac{\sigma (s,t | v)}{\sigma (s,t)}\), where \(\sigma (s,t)\) is the number of shortest paths linking \(s\) to \(t\), and \(\sigma (s,t | v)\) is the number of those paths passing through node \(v\), with \(v\) distinct from \(s\) and \(t\). The best known algorithm for computing this centrality has a computation complexity of \(\mathcal {O}(n_G.m_G + (n_G)^2\mathrm{log}(n_G))\) [9].
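Both centralities are available in networkx. The sketch below shows how they might feed a leader test; the neighbour-comparison rule and the `theta` threshold are illustrative assumptions, not LICOD's exact `isLeader()` criterion:

```python
import networkx as nx

G = nx.karate_club_graph()

dc = nx.degree_centrality(G)       # d(v) / (n - 1)
bc = nx.betweenness_centrality(G)  # normalised shortest-path betweenness

def is_leader(v, centrality, theta=0.5):
    # Illustrative rule: v leads when its centrality is at least as
    # high as that of a fraction theta of its neighbours.
    nbrs = list(G[v])
    higher = sum(centrality[v] >= centrality[u] for u in nbrs)
    return higher / len(nbrs) >= theta

leaders = [v for v in G if is_leader(v, dc)]
print(sorted(leaders)[:5])
```

The two hub nodes of the karate network (the administrator and the instructor) are flagged as leaders under either centrality.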
4.2.2 Function \(computecommunitiesleaders\)
4.2.3 Function \(membership(v,c)\)
4.2.4 Rank aggregation approaches
4.2.5 Community assignment
5 Experimentation
5.1 Evaluation on benchmark networks
-
Zachary’s karate club This is a social network of friendships between 34 members of a karate club at a US university in 1970 [91]. Following a dispute between the club’s administrator and the club’s instructor, the club split into two groups; the dispute ended with the instructor creating his own club and taking about half of the initial members with him. The network can hence be divided into two main communities.
-
Dolphins social network This network is an undirected social network resulting from observations of a community of 62 dolphins over a period of 7 years [50]. Nodes represent dolphins and edges represent frequent associations between dolphin pairs occurring more often than expected by chance. Analysis of the data revealed two main groups.
-
American college football dataset This dataset contains the network of American college football games [23]. The 115 nodes represent teams, and the edges represent games between two teams. The teams are divided into 12 groups of around 8–12 teams each, and games are more frequent between members of the same group. Teams that are geographically close but belong to different groups are also more likely to play one another than teams separated by a large distance. In this dataset, the groups can therefore be considered as known communities.
-
American political books This is a political books co-purchasing network. Nodes represent books about US politics sold by the online bookseller Amazon.com. Edges represent frequent co-purchasing of books by the same buyers, as indicated by the “customers who bought this book also bought these other books” feature on Amazon. Books are classified into three disjoint classes: liberal, neutral or conservative. The classification was made separately by Mark Newman based on a reading of the descriptions and reviews of the books posted on Amazon.
Dataset | # Nodes | # Edges | # Real communities |
---|---|---|---|
Zachary | 34 | 78 | 2 |
Football | 115 | 616 | 11 |
US Politics | 100 | 411 | 2 |
Dolphin | 62 | 159 | 2 |
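The first row of the table can be checked directly, since Zachary's karate club ships with networkx:

```python
import networkx as nx

# Node and edge counts match the benchmark table: 34 nodes, 78 edges.
G = nx.karate_club_graph()
print(G.number_of_nodes(), G.number_of_edges())  # 34 78

# The two real communities come from the 'club' node attribute:
groups = {G.nodes[v]["club"] for v in G}
print(len(groups))  # 2
```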
-
Centrality metrics = [Degree centrality (dc), Betweenness centrality (BC)]
-
Voting method = [Borda, Local Kemeny]
-
\(\sigma \in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]\)
-
\(\delta \in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]\)
-
\(\epsilon \in [0.0, 0.1, 0.2]\)
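The full parameter grid then contains \(2 \times 2 \times 6 \times 6 \times 3 = 432\) configurations; a sketch of its enumeration (the variable names are illustrative, not from the paper):

```python
from itertools import product

centralities = ["dc", "BC"]
voting = ["Borda", "Local Kemeny"]
sigmas = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
deltas = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
epsilons = [0.0, 0.1, 0.2]

# Cartesian product over all tested settings:
grid = list(product(centralities, voting, sigmas, deltas, epsilons))
print(len(grid))  # 432
```

Each tuple in `grid` corresponds to one LICOD configuration to be evaluated against the benchmark networks.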
Dataset | Algorithm | NMI | ARI | Q | # Communities |
---|---|---|---|---|---|
Zachary | Newman | 0.57 | 0.46 | 0.40 | 5 |
 | Louvain | 0.58 | 0.46 | 0.41 | 4 |
 | Walktrap | 0.50 | 0.33 | 0.35 | 5 |
 | LICOD | **0.60** | **0.62** | 0.24 | 3 |
Football | Newman | 0.87 | 0.77 | 0.59 | 10 |
 | Louvain | 0.89 | 0.80 | 0.60 | 10 |
 | Walktrap | 0.88 | 0.81 | 0.60 | 10 |
 | LICOD | 0.83 | 0.69 | 0.49 | 16 |
US Politics | Newman | 0.55 | 0.68 | 0.51 | 5 |
 | Louvain | 0.57 | 0.55 | 0.52 | 4 |
 | Walktrap | 0.53 | 0.65 | 0.50 | 4 |
 | LICOD | **0.68** | 0.67 | 0.42 | 6 |
Dolphins | Newman | 0.55 | 0.39 | 0.51 | 5 |
 | Louvain | 0.51 | 0.32 | 0.51 | 5 |
 | Walktrap | 0.53 | 0.41 | 0.48 | 4 |
 | LICOD | 0.41 | 0.32 | 0.35 | **2** |
5.2 Data clustering-driven evaluation
Dataset | Glass | Iris | Wine | Vehicle | Abalone |
---|---|---|---|---|---|
#Instances | 214 | 150 | 178 | 846 | 772 |
#Attributes | 10 | 4 | 13 | 18 | 8 |
#Classes | 7 | 3 | 3 | 4 | 29 |
Distance | Formula |
---|---|
Euclidean distance | \(dist_{euc}(x,y)=\sqrt{\sum _{i=1}^{n}|x_i - y_i|^{2}}\) |
Cosine similarity | \(dist_{cos}(x,y)=1-\frac{x.y}{|x| |y|}\) |
Chebyshev distance | \(d_{cheb}(x,y)=\max _{i} |x_i - y_i|\) |
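The three distances are straightforward to implement; a minimal sketch (note the absolute value in the Chebyshev distance):

```python
import math

def dist_euc(x, y):
    # Euclidean: square root of summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dist_cos(x, y):
    # Cosine distance: 1 minus the cosine of the angle between x and y.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def dist_cheb(x, y):
    # Chebyshev: largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(dist_euc(x, y), dist_cheb(x, y))      # 5.0 4.0
print(round(dist_cos((1, 0), (0, 1)), 1))   # 1.0 (orthogonal vectors)
```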
Dataset | Feature | Euclidean | Chebyshev | Cosine |
---|---|---|---|---|
Iris | # Edges | 382 | 2,468 | 426 |
 | Diameter | 33 | 14 | 25 |
 | Average degree | 5.09 | 32.9 | 5.68 |
 | Density | 0.034 | 0.220 | 0.038 |
 | Transitivity | 0.055 | 0.340 | 0.011 |
Glass | # Edges | 558 | 7,786 | 552 |
 | Diameter | 21 | 8 | 24 |
 | Average degree | 5.21 | 72.76 | 5.15 |
 | Density | 0.024 | 0.341 | 0.024 |
 | Transitivity | 0.0139 | 0.252 | 0.011 |
Wine | # Edges | 380 | 514 | 438 |
 | Diameter | 102 | 84 | 59 |
 | Average degree | 4.26 | 5.77 | 4.92 |
 | Density | 0.024 | 0.032 | 0.027 |
 | Transitivity | 0 | 0.178 | 0 |
Vehicle | # Edges | 2,598 | 4,072 | 2,764 |
 | Diameter | 63 | 54 | 45 |
 | Average degree | 6.14 | 9.62 | 6.53 |
 | Density | 0.007 | 0.011 | 0.007 |
 | Transitivity | 0.002 | 0.091 | 0 |
Abalone | # Edges | 2,542 | 89,338 | 2,158 |
 | Diameter | 38 | 22 | 50 |
 | Average degree | 6.58 | 231.44 | 5.59 |
 | Density | 0.008 | 0.30 | 0.007 |
 | Transitivity | 0 | 0.49 | 0 |
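The table above reports graphs built from the clustering datasets under each distance. One common construction, an \(\epsilon\)-neighbourhood graph, is sketched below; the paper's exact graph-building procedure may differ, so treat this as an assumption:

```python
import math
import networkx as nx

def threshold_graph(points, dist, eps):
    """Connect two instances when their distance is below eps
    (an epsilon-neighbourhood construction)."""
    G = nx.Graph()
    G.add_nodes_from(range(len(points)))
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) < eps:
                G.add_edge(i, j)
    return G

def euc(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Three nearby points form a triangle; the outlier stays isolated.
pts = [(0, 0), (0, 1), (1, 0), (5, 5)]
G = threshold_graph(pts, euc, eps=1.5)
print(G.number_of_edges())  # 3
```

Varying the distance function (or `eps`) changes edge counts and density dramatically, which is consistent with the spread seen in the table.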
Dataset | Algorithm | NMI | ARI | Q | # Communities |
---|---|---|---|---|---|
Iris | Newman | 0.66 | 0.44 | 0.72 | 9 |
 | Louvain | 0.59 | 0.40 | 0.72 | 8 |
 | Walktrap | 0.64 | 0.47 | 0.68 | 12 |
 | LICOD | 0.59 | 0.42 | 0.64 | 8 |
Glass | Newman | 0.45 | 0.21 | 0.76 | 11 |
 | Louvain | 0.47 | 0.21 | 0.75 | 12 |
 | Walktrap | 0.49 | 0.15 | 0.73 | 22 |
 | LICOD | 0.46 | 0.17 | 0.70 | 18 |
Wine | Newman | 0.32 | 0.14 | 0.79 | 11 |
 | Louvain | 0.31 | 0.13 | 0.79 | 12 |
 | Walktrap | 0.32 | 0.11 | 0.77 | 15 |
 | LICOD | **0.34** | **0.21** | 0.72 | 14 |
Vehicle | Newman | 0.23 | 0.10 | 0.79 | 17 |
 | Louvain | 0.25 | 0.11 | 0.78 | 14 |
 | Walktrap | 0.23 | 0.06 | 0.75 | 32 |
 | LICOD | 0.21 | 0.05 | 0.65 | 41 |
Abalone | Newman | 0.34 | 0.10 | 0.83 | 15 |
 | Louvain | 0.35 | 0.10 | 0.83 | 19 |
 | Walktrap | 0.33 | 0.08 | 0.82 | 21 |
 | LICOD | **0.44** | 0.08 | 0.70 | 68 |
6 Conclusion
-
Providing a new efficient algorithm for computing (possibly overlapping) communities.
-
Proposing a new approach for qualitative community evaluation using classical data clustering tasks.