1 Introduction
-
Novel definitions of single-subgroup patterns and bi-subgroup patterns, as well as patterns that are global summaries for attributed graphs. (Sect. 3)
-
A quantification of their Subjective Interestingness (SI), based on what prior beliefs an analyst holds, or what information an analyst gains when observing a pattern. (Sect. 4)
-
An algorithm to mine bi-subgroup patterns based on beam search. (Sect. 5)
-
An algorithm to mine global (or summarization) patterns from which a series of interesting single-subgroup and bi-subgroup patterns can be revealed. (Sect. 5)
-
An empirical evaluation of our method on real-world data, to investigate its ability to encode the analyst’s prior beliefs and identify subjective interesting patterns. (Sect. 6)
2 Related work
2.1 Graph modelling
2.2 Pattern mining in attributed graphs
2.2.1 Local pattern mining
2.2.2 Global pattern mining by summarizing or clustering
3 Subgroup pattern and summary syntaxes for graphs
3.1 Local pattern
3.1.1 Single-subgroup pattern
3.1.2 Bi-subgroup pattern
3.2 Global pattern: summarization for graphs
4 Formalizing the subjective interestingness
4.1 General approach
4.2 The background distribution
4.2.1 The initial background distribution
4.2.2 Updating the background distribution
4.3 The subjective interestingness measure
4.3.1 The SI measure for a local pattern
4.3.2 The SI measure for a global pattern
5 Algorithms
5.1 Local pattern mining
5.1.1 Beam search
5.1.2 Nested beam search
Notation | Description |
---|---|
OuterBeam | The outer beam storing best description pairs \((W_1, W_2)\) during the search |
InnerBeam | The inner beam only storing best descriptions \(W_2\) |
\(x_1\) | The outer beam width (i.e., the minimum number of different descriptions \(W_1\) contained in the outer beam |
\(x_2\) | The inner beam width |
D | The search depth (i.e., maximum number of selectors combined in a description) |
5.2 Global pattern mining
5.2.1 The basic search strategy
5.2.2 Speedup strategies
5.3 Implementation
6 Experiments
6.1 Data
Lastfm.com
users, generated from the publicly available dataset (Cantador et al. 2011) in the HetRec 2011 workshop. In this dataset, tag assignments of a list of most-listened musical artists provided by each user are given in [user, tag, artist] tuples, where those tags are unstructured text labels that users used to express songs of artists. We then took tags that a user ever assigned to any artist and assigned those to the user as binary attributes expressing a user’s music interests. This dataset has been used in many publications to evaluate local pattern mining methods (Pool et al. 2014; Atzmueller et al. 2016; Galbrun et al. 2014).Dataset | Type | |V| | |E| | Attribute type | #Attributes | |S| |
---|---|---|---|---|---|---|
Caltech36 | Undirected | 762 | 16,651 | Nominal | 7 | 602 |
Reed98 | Undirected | 962 | 18,812 | Nominal | 7 | 748 |
Lastfm | Undirected | 1892 | 12,717 | Binary | 11,946 | 21,695 |
DBLPtopics | Directed | 10,837 | 6883 | Numerical | 50 | 300 |
DBLPaffs | Directed | 6472 | 3066 | Binary | 116 | 232 |
MPvotes | Undirected | 650 | 49,631 | Binary | 39 | 78 |
6.2 Parameter sensitivity (RQ1)
6.2.1 Experimental setup
6.2.2 Results
6.3 Comparative evaluation (RQ2)
Rank |
W
|
I
|
\(k_W\)
|
\(|\varepsilon (W)|\)
|
\(p_W\cdot n_W\)
| #inter-edges |
---|---|---|---|---|---|---|
1 | idm = 1 | 0 | 96 | 78 | 8.93 | 496 |
2 | heavy metal = 1 | 0 | 220 | 165 | 60.04 | 1322 |
3 | synthpop = 1 | 0 | 208 | 131 | 57.32 | 1307 |
4 | new wave = 1 | 0 | 292 | 191 | 104.01 | 1731 |
Measure |
W
|
I
|
\(k_W\)
|
\(|\varepsilon (W)|\)
| #inter-edges |
---|---|---|---|---|---|
Edge density | 1981 songs = 1 | 0 | 1 | 2 | 21 |
africa = 1 | 0 | 1 | 2 | 76 | |
40s = 1 | 0 | 1 | 2 | 22 | |
early reggae = 1 | 0 | 1 | 2 | 10 | |
Average degree | post rock = 0 \(\wedge \) post-rock = 0 | 0 | 12181 | 1783 | 498 |
post-rock = 0 \(\wedge \) dark ambient = 0 | 0 | 12092 | 1770 | 573 | |
post-rock = 0 \(\wedge \) grindcore = 0 | 0 | 12032 | 1762 | 634 | |
post-rock = 0 \(\wedge \) technical death metal = 0 | 0 | 12106 | 1773 | 560 | |
Pool’s community score | bionic = 1 \(\wedge \) 30 seconds to mars = 0 | 0 | 8 | 6 | 343 |
bionic = 1 \(\wedge \) taylor swift = 0 | 0 | 8 | 6 | 343 | |
or Edge surplus | bionic = 1 \(\wedge \) latin = 0 | 0 | 8 | 6 | 343 |
bionic = 1 \(\wedge \) spanish = 0 | 0 | 8 | 6 | 343 | |
Segregation index | gluhie 90e = 0 \(\wedge \) lithuanian black metal = 1 | 0 | 3 | 3 | 1 |
goddesses = 0 \(\wedge \) pagan black metal = 1 | 0 | 3 | 3 | 1 | |
gluhie 90e = 0 \(\wedge \) pagan black metal = 1 | 0 | 3 | 3 | 1 | |
heartbroke = 0 \(\wedge \) lithuanian black metal = 1 | 0 | 3 | 3 | 1 | |
Modularity of a single community | pop = 1 \(\wedge \) new wave = 0 | 0 | 2689 | 475 | 4913 |
pop = 1 \(\wedge \) progressive rock = 0 | 0 | 2943 | 514 | 5083 | |
pop = 1 \(\wedge \) experimental = 0 | 0 | 2844 | 497 | 5083 | |
pop = 1 \(\wedge \) metal = 0 | 0 | 2761 | 496 | 5067 |
6.3.1 Experimental setup
-
Edge density. The number of edges divided by the maximal number of edges.
-
Average degree. The degree sum for all vertices divided by the number of vertices.
-
Pool’s community score (Pool et al. 2014). The reduction in the number of erroneous links between treating each vertex as a single community and treating all vertices as a whole.
-
Edge surplus (Tsourakakis et al. 2013). The number of edges exceeding the expected number of edges assuming each edge is present at the same probability \(\alpha \).
-
Segregation index (Freeman 1978). The difference between the number of expected inter-edges to the number of observed inter-edges, normalized by the expectation.
-
Inverse average-ODF (out-degree fraction) (Yang and Leskovec 2015). 1 minus the average fraction of vertices’ out-degrees to degrees.
-
Inverse conductance. The number of edges inside the cluster divided by the number of edges leaving the cluster.
6.3.2 Results
6.4 The effects of different prior beliefs: a subjective evaluation (RQ3)
6.4.1 Experimental setup
6.4.2 Results
Notation | Description |
---|---|
\(W_1 / W_2\)
| The description of the first/second subgroup |
\(|\varepsilon (W_1)| / |\varepsilon (W_2)| \)
| The subgroup of vertices satisfying the description \(W_1 / W_2\) |
\(k_W\)
| The number of observed edges between \(\varepsilon (W_1)\) and \(\varepsilon (W_2)\) |
\(p_W\cdot n_W\)
| The expected number of edges between \(\varepsilon (W_1)\) and \(\varepsilon (W_2)\) |
w.r.t. the background distribution | |
I
| The indicator equal to 0 if the observed pattern is dense for the |
analyst (i.e., \(k_W > p_W\cdot n_W\)) or 1 otherwise (i.e., \(k_W < p_W\cdot n_W\)) |
Rank |
\(W_1\)
|
\(W_2\)
|
I
|
\(k_W\)
|
\(|\varepsilon (W_1)|\)
|
\(|\varepsilon (W_2)|\)
|
\(p_W\cdot n_W\)
| |
---|---|---|---|---|---|---|---|---|
Prior 1 | 1 | year = 2006 | year = 2008 | 1 | 1346 | 153 | 173 | 2379.10 |
2 | status = student \(\wedge \) year = 2008 | status = alumni | 1 | 842 | 167 | 159 | 1783.26 | |
3 | status = student \(\wedge \) year = 2008 | year = 2006 | 1 | 1330 | 167 | 153 | 2367.96 | |
4 | status = student \(\wedge \) year = 2006 | year = 2008 | 1 | 1346 | 152 | 173 | 2377.53 | |
Prior 1+ Prior 2 | 1 | dorm/house = 169 | dorm/house = 171 | 1 | 194 | 99 | 67 | 569.56 |
2 | dorm/house = 169 | dorm/house = 166 | 1 | 237 | 99 | 70 | 620.42 | |
3 | dorm/house = 169 | dorm/house = 172 | 1 | 319 | 99 | 91 | 706.65 | |
4 | dorm/house = 169 | dorm/house = 170 | 1 | 300 | 99 | 87 | 646.04 | |
Prior 1+ Prior 2+ Prior 3 | 1 | status = student \(\wedge \) year = 2004 | year = 2008 | 0 | 108 | 3 | 173 | 25.23 |
2 | status = student \(\wedge \) year = 2004 | year = 2008 \(\wedge \) minor = 0 | 0 | 71 | 3 | 114 | 15.67 | |
3 | status = student \( \wedge \) year = 2004 | year = 2008 \(\wedge \) gender = male | 0 | 71 | 3 | 116 | 16.97 | |
4 | student status = student \(\wedge \) dorm/house = 166 | student status = alumni \(\wedge \) high school = 19445 | 0 | 51 | 53 | 1 | 17.52 |
Rank |
\(W_1\)
|
\(W_2\)
|
I
|
\(k_W\)
|
\(|\varepsilon (W_1)|\)
|
\(|\varepsilon (W_2)|\)
|
\(p_W\cdot n_W\)
| |
---|---|---|---|---|---|---|---|---|
Prior 1 | 1 | year = 2008 | year = 2005 | 1 | 495 | 209 | 117 | 1401.97 |
2 | year = 2007 | year = 2009 | 1 | 112 | 165 | 158 | 661.41 | |
3 | status = student \(\wedge \) year = 2008 | year = 2005 | 1 | 495 | 209 | 117 | 1401.97 | |
4 | year = 2008 | year = 2006 | 1 | 765 | 209 | 131 | 1643.38 | |
Prior 1+Prior 2 | 1 | dorm/house = 89 | dorm/house = 88 | 0 | 188 | 23 | 37 | 68.80 |
2 | dorm/house = 89 \(\wedge \) status = student | dorm/house = 88 | 0 | 188 | 22 | 37 | 68.45 | |
3 | dorm/house = 88 \(\wedge \) status = student | dorm/house = 89 | 0 | 183 | 36 | 23 | 65.47 | |
4 | dorm/house = 111 \(\wedge \) year = 0 | year = 2009 | 0 | 24 | 1 | 158 | 0.66 | |
7 | dorm/house = 96 \(\wedge \) year = 2005 | year = 2009 | 0 | 12 | 1 | 158 | 0.07 |
6.5 Evaluation on iterative pattern mining (RQ4)
6.5.1 Experimental setup
6.5.2 Results
Rank | \(W_1\) | \(W_2\) | I | \(k_W\) | \(|\varepsilon (W_1)|\) | \(|\varepsilon (W_2)|\) | \(p_W\cdot n_W\) | |
---|---|---|---|---|---|---|---|---|
Iteration 1 | 1 | USA = 1 | USA = 0 | 1 | 335 | 3132 | 3340 | 765.83 |
2 | USA = 1 \(\wedge \) China = 0 | USA = 0 | 1 | 288 | 2969 | 3340 | 725.97 | |
3 | USA = 1 \(\wedge \) Australia = 0 | USA = 0 | 1 | 320 | 3092 | 3340 | 756.05 | |
Iteration 2 | 1 | NJ (New Jersey) = 0 | NJ = 1 \( \wedge \) CA (California) = 1 | 0 | 93 | 6262 | 15 | 6.91 |
2 | CA = 0 | NJ = 1 \( \wedge \) CA = 1 | 0 | 86 | 5584 | 15 | 6.13 | |
3 | NJ = 1 \( \wedge \) Israel = 0 | NJ = 1 \( \wedge \) CA = 1 | 0 | 93 | 6153 | 15 | 6.76 | |
Iteration 3 | 1 | China = 0 | China = 1 | 1 | 144 | 5599 | 873 | 271.02 |
2 | China = 0 | China = 1 \(\wedge \) IL (Illinois) = 0 | 1 | 128 | 5599 | 861 | 266.10 | |
3 | China = 0 \(\wedge \) USA = 0 | China = 1 | 1 | 64 | 2630 | 873 | 168.09 | |
Iteration 4 | 1 | CA = 1 | CA = 0 \(\wedge \) WA = 1 | 0 | 55 | 888 | 184 | 11.73 |
2 | WA = 0 | WA = 1 | 0 | 182 | 6254 | 218 | 97.78 | |
3 | CA = 1 \(\wedge \) TX (Texas) = 0 | CA = 0 \(\wedge \) WA = 1 | 0 | 55 | 876 | 184 | 11.57 |
6.6 Empirical results on the discovered global patterns (RQ5)
6.6.1 Case study on DBLPaffs
6.6.2 Case study on DBLPtopics
Attribute | Meaning (Top 5 most strongly associated fields of study by absolute weight) |
---|---|
\(a_1\) | Data mining (0.55) |
Machine Learning \((-\,0.49)\) | |
Database (0.32) | |
Computer Science (0.28) | |
Information retrieval (0.25) | |
\(a_3\) | Data mining (0.41) |
Computer science \((-\,0.40)\) | |
Mathematics (0.39) | |
Information retrieval (0.30) | |
Pattern recognition (0.24) | |
\(a_5\) | Database (0.61) |
Information retrieval \((-\,0.49)\) | |
Query optimization (0.21) | |
World Wide Web \((-\,0.18)\) | |
Mathematics (0.15) | |
\(a_8\) | Mathematical optimization (0.45) |
Information retrieval (0.44) | |
Database (0.37) | |
Data mining \((-\,0.25)\) | |
Computer science (0.22) |
6.7 Scalability evaluation (RQ6)
6.7.1 Experimental setup
6.7.2 Results
Dataset | |S| | |V| | Run time (s) | |
---|---|---|---|---|
Single-subgroup pattern mining (RQ2) | Lastfm | 21,695 | 1892 | 278.49 |
DBLPaffs | 232 | 6472 | 32.40 | |
Bi-subgroup pattern mining (RQ3 and RQ4) | Caltech36 | 602 | 762 | 1312.57 |
Reed98 | 748 | 962 | 1965.41 | |
Lastfm | 200 | 1892 | 679.85 | |
DBLPaffs | 232 | 6472 | 3114.78 | |
Global pattern mining (RQ5) | DBLPaffs | 232 | 6472 | 830.69 |
DBLPtopics | 150 | 10,837 | 1570.90 | |
MPvotes | 78 | 650 | 12.73 |