1 Introduction
-
We define the new problem of finding subjectively interesting trees and forests connecting a set of query vertices in a graph (Sect. 2).
-
We propose heuristics for mining the most interesting trees efficiently, both for undirected and directed graphs (Sect. 7).
-
We evaluate and compare the effectiveness of these heuristics on real data and study the utility of the resulting trees, showing that the results are truly and usefully dependent on the assumed prior beliefs of the user (Sect. 8).
-
The pattern syntax is broader, covering also trees in undirected graphs. Previously, only connecting arborescences were discussed (directed trees). We achieved this by providing methods for finding connecting trees, forests, and branchings. This broadens applicability to both undirected and directed graphs, as well as allowing for a possible partitioning of the query vertices (Sect. 2.2).
-
Thorough analysis of the Description Length of connecting trees (Sect. 3).
-
A general method for efficiently fitting background distributions for a large class of prior belief types. Adriaens et al. (2017) discussed how to model prior beliefs about vertex degrees and a time-order relation between the vertices in the graph. The current paper extends this to prior knowledge on the density of any part of the adjacency matrix—a generalization of previous work, which also has important applicability in contexts beyond this work (Sects. 4.3 and 6).
-
A more thorough experimental evaluation and an empirical comparison with related work. The efficiency of our proposed methods is tested in more detail on a wider variety of different types of graphs, and a direct comparison with an existing related method is added (Sect. 8).
2 Subjectively interesting trees in graphs
2.1 Notation and terminology
2.2 Connecting trees, forests, arborescences and branchings as data mining patterns
2.3 A subjective interestingness measure
3 The description length of connecting trees, forests, branchings and arborescences
4 The information content and inferring the background distribution
-
In a friendship network, people with similar age/education/location are more likely to be friends.
-
In a paper citation network, the number of papers citing a paper with an equal or higher publication year is very low (i.e., the network is essentially a DAG).
-
Search engines (e.g., Google, Bing) are hubs in the WWW graph.
-
In some social networks, vertices tend to be connected to other vertices that have similar degree values (assortative mixing).
4.1 Prior beliefs on the density of sets of edges
4.2 What if there is a mismatch between the stated and actual prior beliefs?
4.3 Identifying equivalent Lagrange multipliers
5 Discussing two prior belief types in more detail
5.1 Prior beliefs when vertices represent timed events
5.2 Prior beliefs on degree assortativity
6 A fast heuristic for identifying equivalent Lagrange multipliers
7 Algorithms for finding the most interesting trees
7.1 Proposed methods for finding arborescences
7.2 Proposed methods for finding trees, branchings and forests
8 Experiments
8.1 Fitting the different background models
Prior beliefs dataset | |V| | |E| | unique Lagr. mult. (% of total) | Time (s) Heuristic | Time (s) Newton |
---|---|---|---|---|---|
Degree
| |||||
Amazon | 334,863 | 925,872 | 218 (0.06) | 2 | 3 |
DBLP-4-Area | 329,541 | 1,093,877 | 198 (0.06) | 1.5 | 8 |
Google web | 875,713 | 5,105,039 | 894 (0.1) | 14 | 22 |
Youtube | 1,134,890 | 2,987,624 | 1194 (0.11) | 20 | 47 |
Assortativity
| |||||
DBLP | 317,080 | 1,049,866 | 3262 (0.51) | 5 | 75 |
roadNet-CA | 1,965,206 | 2,766,607 | 72 (0.001) | 14 | 5.4 |
LiveJournal | 3,997,962 | 34,681,189 | 22,110 (0.27) | 128 | 564 |
Bins
| |||||
ACM v8 (16 bins) | 2,381,688 | 10,476,564 | 4838 (0.11) | 13 | 85 |
ACM v8 (78 bins) | – | – | 13,429 (0.28) | 48 | 1095 |
NBER (6 bins) | 2,923,922 | 16,518,948 | 2262 (0.03) | 11 | 24 |
NBER (59 bins) | – | – | 8026 (0.14) | 69 | 591 |
8.2 Testing the relative performance of the heuristics
Dataset |
\(|Q|=4\)
|
\(|Q|=8\)
|
\(|Q|=12\)
|
---|---|---|---|
ACM | 134,740 | 174,910 | 30,878 |
Amazon | 12,889 | 2570 | 780 |
DBLP | 33,032 | 10,081 | 8527 |
8.3 Scalability with varying tree depth
8.4 Testing the influence of the prior belief model on the resulting trees
Amazon | Uniform prior | Degree prior | Ties | p-value |
---|---|---|---|---|
Avg. degree | 11.28 | 10.2 | 143/200 | 2.71e–06 |
ACM | Degree prior | Degree+time prior | Ties | p-value |
---|---|---|---|---|
Avg. difference in publication year | 4.34 | 5.27 | 41/200 | 3.92e–13 |
DBLP | Degree prior | Degree+assortativity prior | Ties | p-value |
---|---|---|---|---|
Avg. degree difference | 19.97 | 27 | 131/200 | 0.0015 |