1 Introduction
2 Cohesive subgraphs with exceptional attributes
Symbols | Definitions |
---|---|
\(G=(V,E,{\hat{A}})\) | A graph G with V a set of vertices, \(E \subseteq V \times V\) a set of edges, and \({\hat{A}}\) a set of count attributes on vertices. For \({\hat{a}}\in {\hat{A}}\) and \(v \in V\), \({\hat{a}}(v)\in \text{ Dom }_a\) denotes the value of attribute \({\hat{a}}\) on v |
A | The set \(a \in A\) such that for each \(v \in V\), a(v) is a random variable taking as empirical value \({\hat{a}}(v)\) in G |
\(N_d(v)\) | The geodesic neighborhood of radius d from a vertex v. Formally, \(N_d(v)=\{u \in V \text { } | \text { } dist(v,u) \le d \}\) |
(U, S) | A CSEA pattern with \(U \subseteq V\) as cover set and \(S\subseteq \{ (a,[k_a,\ell _a])\mid a\in A\}\) as characteristic |
\(\text{ Pr }(U,S)\) | The probability that the CSEA pattern (U, S) is present in G |
\(\text {SI}(U,S)\) | The Subjective Interestingness of (U, S), defined as \(\text {SI}(U,S)=\frac{\text {IC}(U,S)}{\text {DL}(U,S)}\) |
\(\text {IC}(U,S)\) | The Information Content of (U, S), defined as \(\text {IC}(U,S)=-\log (\text{ Pr }(U,S)).\) |
\(\text {DL}(U,S)\) | The Description Length of (U, S), defined as \(\text {DL}(U,S)=\text {DL}_{A}(S)+\text {DL}_{V}(U),\) where \(\text {DL}_{A}(S)\) (resp. \(\text {DL}_{V}(U)\)) is the description length of S (resp. U) |
\({\hat{c}}_a(v)\) | p-value for a(v) under the background distribution as null hypothesis: \({\hat{c}}_a(v)\triangleq \text{ Pr }(a(v)\ge {\hat{a}}(v))\) |
\({\mathcal {N}}, {\mathcal {N}}(U)\) | \({\mathcal {N}}\) is the set of all considered neighborhoods \({\mathcal {N}}=\{N_d(v) \text { } | \text { } v \in V,\, d \in {\mathbb {N}} \text{ and } d \le D\}\) (with D the maximum considered radius). \({\mathcal {N}}(U)\) is the subset of neighborhoods that contain U: \({\mathcal {N}}(U)=\{N_d(v) \in {\mathcal {N}} \mid U \subseteq N_d(v) \}\) |
\(\text {DL}_{V}(U)\) | The description length of \(U \subseteq V\). Given a parameter \(\alpha \in {\mathbb {N}}\) used to limit the number of core vertices, \(\text {DL}_{V}(U)=\min _{\begin{array}{c} X \subseteq {\mathcal {N}}(U) \\ |X| \le \alpha \end{array}} f(X,U)\) |
f(X, U) | The length of the description of U by the intersection of neighborhoods \(X \subseteq {\mathcal {N}}(U)\), along with the set of exceptions \(\text{ exc }(X,U)\triangleq \cap _{N_d(v)\in X}N_d(v)\backslash U\) |
3 Subjective interestingness of CSEA patterns
3.1 The information content of a CSEA pattern
3.2 Information content and prior beliefs for count attributes
3.3 Description length
4 Iterative mining of CSEA patterns
5 SIAS-Miner algorithm
5.1 Pattern enumeration
5.2 Computing \(\text {DL}_{V}(U)\)
5.3 Time complexity of SIAS-Miner
5.4 Using SIAS-Miner to iteratively mine CSEA patterns
6 Experiments
- The
London
graph (\(|V|=289 ,|E|=544 ,|{\hat{A}}|=10\)) is based on the social network Foursquare.6 Each vertex depicts a district in London and edges link adjacent districts. Each attribute stands for the number of places of a given type (e.g. outdoors, colleges, restaurants, etc.) in each district. The total number of represented venues in this graph is 25029. - The
Ingredients
graph (\(|V|=2400 ,|E|=7932 ,|{\hat{A}}|=20\)) is built from the data provided by Kaggle7 and features given by Yummly.8 Each vertex is a recipe ingredient. The attributes correspond to the number of recipes grouped by nationality (greek, italian, mexican, thai, etc.) that include this ingredient. An edge exists between two ingredients if the Jaccard similarity between their recipes is higher than 0.03. The total number of used recipes is 39774. - The US
Flights
graph (\(|V|=322 ,|E|=2039 ,|{\hat{A}}|=14\)) is a dataset provided by Kaggle9 which contains information about flights between USA airports in 2015. The vertices represent USA airports and the attributes depict the number of flights per airline company in the corresponding airports. Edges connect two airports if there are at least 100 flights between them. - The
DBLP
graph (\(|V|=38,895 ,|E|=112,404 ,|{\hat{A}}|=10\)) is a co-authorship graph built from the DBLP digital library. Each vertex represents an author who has published at least one paper in 10 major conferences and journals of in the Data Mining and Databases communities,10 between January 1990 and March 2019. Each edge links two authors who co-authored at least one paper in these conferences and journals. The attributes depict for each author the number of publications in the 10 aforementioned conferences and journals.Considering all the numerical values of attributes is computationally expensive, we pre-process each graph so that for each attribute, the values \(c_a(v)\) are binned into five quantiles. This means that each attribute \(c_a(v)\) is numerical and takes only 5 different values.There is no approach that supports the discovery of subjectively interesting attributed subgraphs in the literature. Nevertheless, we identify three approaches whose goal shares some similarities with ours. These three approaches are representative in comparison with the state-of-the-art. Other methods from the literature are not considered in this experimental study because they do not handle (several) numerical attributes or they require some additional settings (e.g., community search). We thus consider in our study the following methods: - P-N-RMiner (Lijffijt et al. 2016) is an algorithm that mines multi-relational datasets to enumerate a specific structure of patterns called Maximal Complete Connected Subsets (MCCS). Any vertex-attributed graph can be mapped to an entity-relational model where the pattern syntax of P-N-RMiner is equivalent to ours [each MCCS corresponds to a closed pattern \((clo(U),max_S(U))\) in our context]. This means that P-N-RMiner is very suitable to evaluate the performance of our algorithm. The mapping from a graph to the required relational format is detailed in Appendix A. Although the pattern syntax in this design is equivalent to our approach, our interestingness quantification is very different, because the information contained in the patterns shown to the user does not align with the ranking of P-N-RMiner. Hence, we use P-N-RMiner only to compare the runtime performance of our approach.
- Cenergetics (Bendimerad et al. 2018) aims at discovering connected subgraphs involving overrepresented and/or underrepresented attributes. It assesses exceptionality with the weighted relative accuracy (WRAcc) measure that accounts for margins but cannot account for other background knowledge.
- GAMER (Günnemann et al. 2010): Given an attributed graph, this method finds dense subgraphs (quasi-cliques) where vertices show a high similarity in subsets of attributes. In these subgraphs, these attribute values fall into narrow intervals whose width does not exceed a specified threshold W. The main difference with our approach is that GAMER looks only for similarity and cohesiveness, but not exceptionality and surprisingness.
- Density and relative degree: The density of a pattern (U, S) is \(\frac{2 \times |E(U)|}{|U|\times (|U|-1)}\) where |E(U)| is the number of edges in the induced subgraph G[U], and the relative degree of a vertex v in a set of vertices U is \(\frac{|N_1(v)\cap U|}{|U|-1}\). These properties are the highest for GAMER, this can be explained by the fact that its patterns are made of quasi-cliques, which makes them denser. In
Flights
andIngredients
datasets, the density and the degree for SIAS-Miner are higher than those of Cenergetics, while they are lower for theLondon
graph. In fact, for SIAS-Miner, D was set to 3 forLondon
, and to 1 forFlights
andIngredients
. The higher the value of D, the sparser the results can be. In general, Cenergetics patterns have a low density and relative degree, because this approach requires only the connectivity of vertices in the patterns, some of these patterns can be very sparse subgraphs. - Diameter: The diameter of a subgraph U is the maximum pairwise distance between vertices of U. GAMER patterns have the smallest average diameter, while Cenergetics patterns have the highest ones. The diameter of SIAS-Miner is comparable to the one of Cenergetics in
London
(\(D=3\)) andDBLP
(\(D=1\)), but it is smaller forFlights
andIngredients
(\(D=1\)). - Size and number of covered attributes: The size of a pattern (U, S) corresponds to |U| and the number of covered attributes is |S|. GAMER has patterns with the smallest average size, this is reasonable because it requires a harder constraint on the structure of patterns, which is the quasi-cliqueness. This small size of U allows GAMER patterns to be covered with a larger number of attributes comparing with the other approaches.
- Contrast of attributes: Given the identified patterns (U, S) we want to measure how much the characteristics S are over (or under) expressed in U. First, we define the constrast of a given attribute a in a given set of vertices U as the absolute difference between its average ratio in U and its overall average ratio: \(contrast(a,U)=| \frac{1}{|U|} \times \sum _{v \in U} \frac{{\hat{a}}(v)}{{\hat{A}}(v)} - \frac{1}{|V|} \times \sum _{v \in V} \frac{{\hat{a}}(v)}{{\hat{A}}(v)} |\), with \({\hat{A}}(v) = \sum _{a \in A} {\hat{a}}(v)\). The contrast of a pattern (U, S) is the average contrast contrast(a, U) among the attributes a that appear in S. As expected, GAMER has the minimum values of contrasts for all the datasets, indeed, GAMER is only interested by the similarity of attributes in the pattern but not by their exceptionality. The contrasts for SIAS-Miner and Cenergetics are higher, and they are comparable in
London
Flights
andDBLP
datasets, while it is higher for SIAS-Miner inIngredients
dataset.
London
graph. We chose the top 4 patterns, and \(P_{14}\) which is the best pattern with more than one core vertex. Green cells represent vertices covered by a CSEA pattern while blue cells are the core vertices that are also in the cover, and the red cells are normal exceptions. For example, \(P_1\) covers the distance three neighbors of the blue vertex, with an exceptional prevalence of nightlife and professional venues. Indeed, \(P_1\) includes the City of London which contains the primary central business district (CBD) of London, and it is known by its prevalent number of nightlife venues (pubs, bars, etc.) in some areas like the Soho and the South Bank neighborhood. The second pattern \(P_2\) covers Stoke Newington, with a high prevalence of nightlife spots (e.g., bars) and food venues, and a non-prevalence of colleges and universities. Specifically, this area is characterized by a wide range of bars and cafes, especially around the Church Street and lining the road of Dalston. The patterns \(P_3\) and \(P_4\) are two other areas where food venues are prevalent. Finally, Bermondsey is covered by \(P_{14}\) and described by a high presence of professional venues, and a non-prevalence of event venues, colleges and universities.
Ingredients
graph. \(P_{1}\) corresponds to a set of ingredients that appear a lot in Mexican recipes. They are described as neighbors of enchilada sauce which is an ingredient originally from Mexico. Notice that this pattern does not contain any exception. \(P_2\) is defined by a set of 27 ingredients that are highly present in Indian recipes (e.g., green cardamom, garam masala, curry leaves, etc.), with ghee as a core vertex. This last ingredient is a class of butter that originated in ancient India, and is commonly used in cuisine of the Indian subcontinent. \(P_3\) presents some ingredients that are characteristic of the Italian cuisine (e.g., pasta sauce, pizza sauce, fresh basil, etc.), and describes them as neighbors of mozzarella cheese, with a number of 2 exceptions. \(P_4\) is another group of ingredients that are highly present in Mexican food, and they are co-occurrent with white onion in a large number of recipes. In Fig. 12, we also present some other specific patterns, \(P_{15}\) is the best pattern with more than one attribute in the characteristic (an exceptional prevalence in Chinese, Thai and Japanese food), and \(P_{71}\) is the best pattern with more than one core vertex (the cores are garlic cloves and chili powder). This last pattern has also an exceptional prevalence in several types of cuisine, but the prevalence is much more significant in Mexican food (with \(C_{mexican}(v) \le 8.6 \cdot 10^{-4}\)). For more details, we display in the companion page the complete set of top 100 patterns of each of London
and Ingredients
datasets.DBLP
graph. \(P_1\) consists of researchers who are co-authors of Milind Tambe in some of the 10 studied conferences and journals. They have published a significant number of papers in the IJCAI conference (156 publications), and they have not collaborated in papers of the VLDB conference or the VLDB Journal. Particularly, the central author Milind Tambe has co-authored 26 IJCAI papers with this group of researchers. \(P_2\) describes the co-authors of Yoshua Bengio, except 5 of them. They have significantly published in the ICML conference (174 papers in overall) and have not published in the VLDB or DMKD journals. \(P_3\) shows another group of researchers belonging to the artificial intelligence community as they have a large number of publications in IJCAI (314 papers). Also, these researchers have not collaborated in papers published in KDD, VLDB, or VLDB Journal. They are described exactly as the set of co-authors of Jérôme Lang, who is a highly influential researcher from the artificial intelligence community. We also show in Fig. 14 the best DBLP
pattern with more than one core vertex. It corresponds to the common co-authors of Hui Xiong and Enhong Chen, for whom there is a large number of publications in ICDM (87 papers), a low number of publications in ICML (only one paper), and no publication in VLDB.
7 Related work
Graph topology | Vertex attributes | Numerical attributes | Pattern exceptionality | Prior knowledge | |
---|---|---|---|---|---|
SSG-Miner (van Leeuwen et al. 2016) | \(\times \) | \(\times \) | |||
Cosmic (Kaytoue et al. 2017) | \(\times \) | \(\times \) | \(\times \) | ||
\(\times \) | \(\times \) | \(\times \) | \(\times \) | ||
Comodo (Atzmueller et al. 2016) | \(\times \) | \(\times \) | \(\times \) | ||
\(\times \) | \(\times \) | \(\times \) | |||
GAMER (Günnemann et al. 2010), TopGraphMiner (Prado et al. 2013), FocusCO (Perozzi et al. 2014) (k,r)-Core (Zhang et al. 2017), Community search (Fang et al. 2017b, 2016; Shang et al. 2016; Huang et al. 2017; Huang and Lakshmanan 2017; Fang et al. 2017a), ST compression (Silva et al. 2015), Multiresolution representation (Chen et al. 2018) | \(\times \) | \(\times \) | \(\times \) | ||
\(\times \) | \(\times \) | ||||
\(\times \) | \(\times \) | ||||
SIAS-Miner (this paper) | \(\times \) | \(\times \) | \(\times \) | \(\times \) | \(\times \) |