1 Introduction
- We propose two novel approaches for graph classification by training data inflation. In one approach, each sample in the inflated data is a randomly chosen vertex whose feature representation is obtained using a neural network-based language model. In the other approach, each sample is a partition-induced subgraph whose feature representation is obtained through traditional graph topology metrics. Both proposed methods are substantially better in terms of feature computation time and classification performance, especially for large graphs.
- We empirically evaluate the performance of our proposed classification algorithms on multiple real-world datasets. Specifically, we use 43 real-life graphs from 7 different domains and classify these graphs into their respective domains.
2 Related Works
2.1 Graph Classification
An early work in this direction is SubdueCL, which finds discriminatory subgraphs from a set of graphs and uses these subgraphs as features for graph classification. Deshpande et al. [9] use a similar approach to subgraph feature extraction for classifying chemical compounds using an SVM. [31] propose DT-CLGBI, which uses a custom-made decision tree for graph classification in which each node of the tree represents a mined subgraph. In all these works, feature extraction is isolated from the classification task. A collection of follow-up works integrates subgraph mining and graph classification in a unified framework. gBoost [39], one of the earliest among these, uses mathematical programming. [11] use boosting with decision stumps, where each decision stump is associated with a subgraph. Other recent methods that take a similar approach are gActive [23], RgMiner [22], cogboost [33], GAIA [21] and Cork [49].

Another line of work represents a graph through topological metrics such as Leadership (the extent to which the edge connectivity of a graph is dominated by a single vertex), Bonding (a.k.a. the clustering coefficient) and Diversity (based on the number of disjoint edges, i.e., edges that share no common end points). In recent years, Berlingerio et al. [1] proposed an algorithm called NetSimile, which computes features of a graph motivated by different social theories.

2.2 Vertex Representation
One such method is ICA, an iterative method based on constructing feature vectors for vertices from information about them and their neighborhoods. Henderson et al.'s method, called ReFeX [17], captures the behavioral features of a vertex in the graph by recursively combining each node's local features with its neighborhood (egonet-based) features. Koutra et al. [24] compare and contrast several guilt-by-association approaches for vertex representation. Tang et al. [47] propose to extract latent social dimensions based on network information and then use them as features for discriminative learning. In a follow-up work [48], they propose a framework called SocioDim, which extracts social dimensions based on the network structure to capture prominent interaction patterns between nodes in order to learn a discriminative classifier. Han et al. [15] suggest that frequent neighborhood patterns can be used for constructing strong structure-aware features, which are effective for within-network classification tasks. Recently, Perozzi et al. [35] proposed an algorithm for finding a neighborhood-based feature representation of a vertex in a graph.

3 Method
3.1 Vertex Feature Representation Using Random Walk
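As a concrete illustration, the truncated random walks that form the inflated training samples can be generated as below. This is a minimal sketch in plain Python, not the authors' exact implementation; the function and parameter names are ours.

```python
import random

def random_walks(adj, walk_len=40, walks_per_vertex=10, seed=7):
    """Generate truncated random walks over a graph.

    adj maps each vertex to a list of its neighbors.  Each walk is a
    'sentence' of vertex ids; feeding all walks to a word-embedding
    model (e.g. gensim's Word2Vec) yields a d-dimensional vector per
    vertex, which becomes one sample of the inflated training data.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_vertex):
        for start in adj:
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:                 # isolated vertex: stop early
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(v) for v in walk])
    return walks
```

With l = 40, t = 10 and d = 30 (the settings used for small graphs in Sect. 4.1), the embedding step would be roughly `Word2Vec(walks, vector_size=30, min_count=0)`; these hyperparameter choices are our assumption.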
3.2 Partition-Induced Subgraph Feature Representation
- \(d_u\), the degree of vertex u of G.
- \(d_{{\mathrm{nei}}(u)}= \frac{1}{d_u} \sum _{v\in {\mathrm{nei}}(u)} d_v\), the average degree of vertex u's neighbors.
- \(|E_{{\mathrm{ego}}(u)}|\), the number of edges in node u's egonet (\({\mathrm{ego}}(u)\) returns node u's egonet).
- \(CC_u\), the clustering coefficient of node u, defined as the number of triangles connected to vertex u over the number of connected triples centered at vertex u.
- \(CC_{{\mathrm{nei}}(u)}= \frac{1}{d_u} \sum _{v\in {\mathrm{nei}}(u)} CC_v\), the average clustering coefficient of vertex u's neighbors.
- \(|E^{in}_{{\mathrm{ego}}(u)}|\), the number of edges incident to \({\mathrm{ego}}(u)\).
- \(d_{{\mathrm{ego}}(u)}\), the degree of \({\mathrm{ego}}(u)\), i.e., its number of neighbors.
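Assuming an undirected simple graph stored as an adjacency-set dictionary, the seven per-vertex features above can be computed directly. This is a self-contained sketch; the authors' implementation uses NetworkX instead.

```python
def vertex_features(adj, u):
    """Seven NetSimile-style local features for vertex u.

    adj: dict mapping each vertex to the set of its neighbors.
    Returns (d_u, d_nei, |E_ego|, CC_u, CC_nei, |E^in_ego|, d_ego).
    """
    def clustering(v):
        nbrs = adj[v]
        k = len(nbrs)
        if k < 2:
            return 0.0
        # edges among v's neighbors (each counted once via a < b)
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        return 2.0 * links / (k * (k - 1))

    nbrs = adj[u]
    d_u = len(nbrs)
    d_nei = sum(len(adj[v]) for v in nbrs) / d_u if d_u else 0.0
    cc_u = clustering(u)
    cc_nei = sum(clustering(v) for v in nbrs) / d_u if d_u else 0.0

    ego = nbrs | {u}                                  # u's egonet vertices
    e_ego = sum(1 for a in ego for b in ego if a < b and b in adj[a])
    # edges with exactly one endpoint inside the egonet
    e_in = sum(1 for a in ego for b in adj[a] if b not in ego)
    # distinct vertices just outside the egonet
    d_ego = len({b for a in ego for b in adj[a]} - ego)
    return (d_u, d_nei, e_ego, cc_u, cc_nei, e_in, d_ego)
```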
3.3 Classification Model
Table 1 Real-world graphs used in our experiments, grouped by domain

Domain | Dataset | (Vertices; edges) | Description
---|---|---|---
Animal | Bison | (26; 314) | Dominance between American bisons
Animal | Hen | (32; 496) | Dominance between White Leghorn hens
Animal | Dolphin | (62; 159) | Social network of bottlenose dolphins
Animal | Kangaroo | (17; 91) | Interactions between free-ranging gray kangaroos
Animal | Cattle | (28; 217) | Dominance behaviors observed between dairy cattle
Animal | Zebra | (27; 111) | Interactions between Grevy's zebras
Animal | Sheep | (28; 250) | Dominance behavior between bighorn sheep
Animal | Macaques | (62; 1,187) | Dominance behavior between female Japanese macaques
Communication | UC Irvine messages | (1,899; 20,296) | Messages between students of UC, Irvine
Communication | Enron | (87,273; 321,918) | Email communication between employees of Enron
Communication | Digg | (30,398; 86,404) | Reply network of the social news website Digg
Communication | FB Wall Post | (46,952; 274,086) | Subset of posts to other users' walls on Facebook
Communication | LKML | (63,399; 242,976) | Communication network of the Linux Kernel Mailing List
Communication | EU institution | (265,214; 420,045) | Email communication of an undisclosed European institution
Communication | U. Rovira email | (1,133; 5,451) | Email communication at the University Rovira i Virgili
Human Contact | Train bombing | (64; 143) | Contacts between terrorists involved in the train bombing
Human Contact | Windsurfers | (43; 336) | Interpersonal contacts between windsurfers
Human Contact | Infectious | (410; 2,765) | Face-to-face behavior of people during an exhibition
Human Contact | Conference | (113; 2,196) | Face-to-face contacts of the attendees of a conference
Coauthorship | arXiv hep-th | (22,908; 2,673,133) | Collaboration graph of arXiv's High Energy Physics—Theory
Coauthorship | arXiv astro-ph | (18,771; 198,050) | Collaboration graph of arXiv's Astrophysics section
Coauthorship | DBLP coauthorship | (317,080; 1,049,866) | Collaboration graph from the DBLP computer science bibliography
Coauthorship | arXiv hep-ph | (28,093; 3,148,447) | Collaboration graph of arXiv's High Energy Physics—Phenomenology
Citation | arXiv hep-ph Cit. | (34,546; 421,578) | Citation graph of arXiv's High Energy Physics—Phenomenology
Citation | arXiv hep-th Cit. | (27,770; 352,807) | Citation graph of arXiv's High Energy Physics—Theory
Citation | Cora citation | (23,166; 91,500) | Cora citation network
Citation | DBLP | (12,591; 49,743) | Citation graph of DBLP
Human Social | Jazz | (198; 2,742) | Collaboration network between Jazz musicians
Human Social | HighSchool | (70; 366) | Friendships between boys in a high school
Human Social | Residence hall | (217; 2,672) | Friendships between residents of a residence hall
Human Social | Taro exchange | (22; 78) | Gift-giving (taro) between households in a Papuan village
Human Social | Dutch college | (32; 3,062) | Friendships between university freshmen
Human Social | Sampson | (18; 188) | Ratings between monks related to a crisis
Human Social | Zachary karate | (34; 78) | Interactions between members of a karate club
Human Social | Seventh graders | (29; 376) | Ratings between students from the seventh grade
Human Social | Adolescent health | (2,539; 12,969) | Network created from an adolescent health survey
Human Social | Tribes | (16; 58) | Social network of tribes of the Gahuku-Gama
Infrastructure | US-Airports | (1,574; 28,236) | Network of flights between US airports in 2010
Infrastructure | Air traffic control | (1,226; 2,615) | Network of preferred route recommendations
Infrastructure | OpenFlights | (2,939; 30,501) | Flights between airports of the world
Infrastructure | US power grid | (4,941; 6,594) | Power supply lines of the US power grid
Infrastructure | EuroRoad | (1,174; 1,417) | Road network in Europe
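The classification model we use is a softmax classifier with L2 regularization (Sect. 4.1 sets \(\lambda\) = 1e−4). A minimal batch-gradient-descent version can be sketched as below; the learning rate and epoch count are our assumptions, not settings from the paper.

```python
import numpy as np

def train_softmax(X, y, n_classes, lam=1e-4, lr=0.5, epochs=300, seed=0):
    """Minimal softmax (multinomial logistic) classifier trained by
    batch gradient descent on cross-entropy loss with L2 penalty."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.01, size=(d, n_classes))
    Y = np.eye(n_classes)[y]                      # one-hot labels, (n, C)
    for _ in range(epochs):
        Z = X @ W
        Z -= Z.max(axis=1, keepdims=True)         # numerical stability
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)         # class probabilities
        grad = X.T @ (P - Y) / n + lam * W        # cross-entropy + L2 gradient
        W -= lr * grad
    return W

def predict(W, X):
    """Predict the most probable class for each row of X."""
    return np.argmax(X @ W, axis=1)
```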
3.4 Pseudo-Code
4 Experiments and Results
Animal, Human Social and Human Contact domains have smaller graphs, whereas Citation, Coauthorship, Communication and Infrastructure contain moderate and large graphs.

4.1 Experimental Setup
We use the gensim library (https://radimrehurek.com/gensim/), which contains an open-source Python implementation of the Word2Vec algorithm, and we write our own random walk generator in Python. We set the length of the random walk (l) and the feature vector size (d) to 40 and 30, respectively, for small and moderate size graphs (Animal, Human Contact, Human Social). For large graphs (Citation, Coauthorship, Communication and Infrastructure), we set these numbers to 60 and 70, respectively. In both cases, we set the number of random walks per vertex (t) to 10. In Sect. 4.8, we discuss in detail the effect of parameter values on the performance of the classification task.

We use NetSimile by Berlingerio et al. [1] to compute features of the partition-induced subgraph constructed from each partition of a graph. We implement our own version of NetSimile in Python, where all topological features are computed using the NetworkX [14] package. To partition the graphs in the dataset, we use GraClus by Kulis et al. [10]. GraClus takes the number of partitions k as a user-defined parameter. We set k to a small number for smaller graphs and a reasonably high number for larger graphs; in this work, we choose k to be 60, 20 and 5 for large, moderate and small graphs, respectively. We implement our own softmax classifier in Python and set the regularization parameter \(\lambda\) to 1e−4 for all executions. We perform fourfold cross-validation over the data, using three folds to train and one fold to test. To measure the classifier's performance, we use percentage accuracy and micro-F1. We run all experiments on a 3 GHz Intel machine with 16 GB of memory.

We refer to Li et al.'s method [26] as Li and to Berlingerio et al.'s method as NetSimile (the name of their algorithm). Li is a topological feature-based approach that works in a graph classification setting with a large number of small graphs. NetSimile, on the other hand, represents the class of algorithms that can handle a small number of large graphs, similar to our problem setup. Given a graph, Li computes several (20) topological metrics (closeness, average degree, clustering coefficient, etc.) and uses them as features of the graph. NetSimile computes seven local/global topological features (clustering coefficient, egonet size, degree of egonets, etc.) to capture connectivity, transitivity and reciprocity among the nodes and their neighbors. The third method that we compare against is RgMiner, proposed by Keneshloo et al. [22]. It is a frequent subgraph-based graph classification algorithm, which mines discriminatory subgraphs from the set of input graphs and then uses these subgraphs as features for graph classification. Note that frequent subgraph mining-based algorithms do not scale to large graphs, but for the sake of completeness we compare our method with RgMiner whenever possible. Besides classification accuracy, we also compare the execution time of these algorithms with that of our proposed methods.

Table 2 Classification accuracy of the proposed methods and their improvement over the best existing method

Domains | No. of classes | Vertex multi-instance accuracy (%) | Improvement (%) w.r.t. current best | Partition multi-instance accuracy (%) | Improvement (%) w.r.t. current best | Current best accuracy (%)
---|---|---|---|---|---|---
A-C | 2 | 85.1 | 13.4 | 81.6 | 8.8 | 75.0
B-C | 2 | 97.0 | 16.4 | 87.5 | 5.0 | 83.3
A-B | 2 | 81.8 | 13.2 | 83.3 | 15.3 | 72.2
A-D | 2 | 89.0 | 11.2 | 80.0 | 0.0 | 80.0
C-D | 2 | 83.6 | 4.5 | 91.6 | 14.5 | 80.0
B-D | 2 | 88.4 | 17.8 | 97.5 | 30.0 | 75.0
A-B-C | 3 | 85.2 | 11.8 | 80.0 | 4.9 | 76.2
A-B-D | 3 | 85.1 | 6.3 | 83.3 | 4.1 | 80.0
A-C-D | 3 | 81.2 | 4.5 | 81.3 | 4.6 | 77.7
B-C-D | 3 | 88.1 | 22.1 | 87.5 | 21.5 | 72.0
A-B-C-D | 4 | 80.0 | 12.0 | 77.5 | 8.5 | 71.4
E-F | 2 | 83.4 | 25.2 | 70.0 | 5.1 | 66.6
G-F | 2 | 75.0 | 75.2 | 65.0 | 51.8 | 42.8
E-G-F | 3 | 65.0 | 32.7 | 61.5 | 28.4 | 47.9
Table 3 Micro-F1 scores of the proposed methods

Domains | No. of classes | Vertex multi-instance micro-F1 (%) | Partition multi-instance micro-F1 (%)
---|---|---|---
A-C | 2 | 86.7 | 84.7
B-C | 2 | 93.2 | 82.5
A-B | 2 | 80.0 | 81.2
A-D | 2 | 91.2 | 80.5
C-D | 2 | 78.2 | 92.1
B-D | 2 | 80.0 | 92.2
A-B-C | 3 | 85.4 | 75.2
A-B-D | 3 | 84.4 | 85.1
A-C-D | 3 | 82.1 | 79.1
B-C-D | 3 | 85.9 | 88.1
A-B-C-D | 4 | 77.6 | 77.9
E-F | 2 | 78.8 | 69.3
G-F | 2 | 71.4 | 62.6
E-G-F | 3 | 61.4 | 59.1
4.2 Experiment on Classification Performance
We form classification tasks between domains, e.g., Citation versus Coauthorship, Animal versus Human Social, etc. The vertex multi-instance classifier performs very well, as Column 3 of Table 2 shows; for example, in the Coauthorship-Communication task it achieves 97.0% accuracy. In Column 4 of the same table, we show the percentage improvement in accuracy over the current best approach (either NetSimile or Li, because we are unable to run RgMiner over all graphs from Table 1) for the vertex multi-instance method. As we can see, the vertex multi-instance approach achieves a 4% to 75% improvement over the current best method. In Columns 5 and 6, we report the percentage accuracy and percentage improvement for the partition multi-instance method. This method also delivers excellent classification performance; for example, in the Coauthorship-Infrastructure task it achieves 97.5% accuracy. The partition multi-instance approach achieves a 4–51.8% improvement over the current best method. In all 14 classification tasks, the vertex multi-instance and partition multi-instance approaches outperform the existing methods. In the last column of Table 2, we show the performance of the best existing method, either NetSimile or Li, whichever is better. In Table 3, we also report the average performance of the vertex and partition multi-instance-based graph classifiers in terms of micro-F1 score. (Micro-F1 is computed from the aggregate counts of true positives, false positives and false negatives over all class labels.)
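Micro-F1 pools the per-class prediction counts before computing a single precision/recall pair. A minimal sketch (our own helper, not code from the paper):

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: aggregate true/false positives and false
    negatives over all classes, then compute one precision, recall
    and F1 from the pooled counts."""
    tp = fp = fn = 0
    for c in labels:
        tp += sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp += sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        fn += sum(1 for t, p in zip(y_true, y_pred) if p != c and t == c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note that in a single-label multi-class setting, every misprediction contributes exactly one false positive and one false negative, so micro-F1 coincides with plain accuracy.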
4.3 Comparison with the Existing Algorithms
We compare the graph classification performance of our methods with two existing algorithms (Li and NetSimile) on all the graphs mentioned in Table 1. To compare with a recent state-of-the-art frequent subgraph-based graph classification algorithm, RgMiner [22], we pick the Animal, Human Social and Human Contact domains (Table 1), because these domains have smaller graphs.

To compare against Li and NetSimile, we create a scatter plot (Fig. 4) by placing the accuracy of our method on the x-axis and that of the competing method on the y-axis for each classification task shown in Column 1 of Table 2. We also place a diagonal line in the plotting area; the number of points that lie in the lower triangle is the number of tasks in which our methods beat the existing ones. Figure 4a, b shows the performance comparison of the vertex multi-instance approach with Li and NetSimile: in all tasks, the vertex multi-instance approach performs better than both. In Fig. 4c, d, we compare the accuracy of the partition multi-instance approach with Li and NetSimile; among the 14 classification tasks, our method performs better in 13 and ties in 1. The superior performance of our proposed methods is essentially driven by the training data inflation technique, which helps alleviate the fat-matrix phenomenon (discussed in Sect. 1) and improves graph classification accuracy.

We also run RgMiner [22] to perform frequent subgraph-based graph classification. The average accuracies we obtain for the "animal–human contact," "human contact–human social" and "animal–human contact–human social" settings are 52.3, 50.0 and 43.7%, respectively, which are lower than the accuracies of both of our proposed methods in the corresponding setups (see the last 3 rows of Table 2).

Table 4 Feature computation time per domain

Domain | Avg. no. of vertices | Time Li [26] (s) | Time NetSimile [1] (s) | Time VML (s) | Time PML (s)
---|---|---|---|---|---
Animal | 35 | 0.05 | 0.135 | 0.09 | 0.08
Human Contact | 157 | 1.71 | 0.69 | 0.45 | 0.17
Human Social | 317 | 2.73 | 0.38 | 1.1 | 0.32
Infrastructure | 2K | 27.7 | 18.2 | 8.2 | 1.5
Citation | 24K | 8634.7 | 1142.8 | 121.6 | 89.1
Communication | 62K | 41217.8 | 2136.4 | 289.1 | 254.6
Coauthorship | 96K | 137106.4 | 117811.5 | 562.4 | 8367.1
4.4 Experiment with Vertically Scaled Dataset
4.5 Timing Analysis
In Table 4, we compare the feature computation time of our proposed methods with that of Li and NetSimile. For the Communication and Coauthorship category graphs, the vertex multi-instance method achieves 142- and 243-fold improvements over Li, and 7- and 209-fold improvements over NetSimile, respectively. The partition multi-instance approach achieves 162- and 8-fold improvements over Li and NetSimile for Communication, but only 16- and 14-fold for the Coauthorship domain. Note that we could execute RgMiner only on the smaller graphs from the Animal, Human Contact and Human Social categories, and such executions alone are not enough to portray a complete picture of the running-time comparison; therefore, we do not report RgMiner's running time in Table 4. Nevertheless, RgMiner takes 17.2, 15.8 and 21.3 s to mine frequent subgraphs using Gaston at 30% support and to compute the feature representations for the "animal–human contact," "human contact–human social" and "animal–human contact–human social" settings, respectively.