1 Introduction
- SE4LP: We propose an end-to-end, scalable link representation learning framework via subgraph contrast, which utilizes the informative local subgraphs surrounding links to learn highly expressive link representations.
- Scalability: At each training step, we take as input only the receptive-field subgraphs extracted around a batch of links, so that link representations are learned efficiently and SE4LP scales well to large graphs (a minimal extraction sketch follows this list).
- Effectiveness: Extensive experiments demonstrate the superiority of our framework in terms of performance and scalability on link prediction. Furthermore, introducing self-supervised learning into link prediction helps to learn effective link representations with fewer training samples.
2 Preliminaries and related work
2.1 Preliminaries
2.2 Link prediction
2.2.1 Random walks
2.2.2 Graph auto-encoders (GAEs)
2.2.3 Supervised learning
2.2.4 Contrastive methods
3 Methodology
3.1 Subgraph extraction and sampling
3.1.1 Subgraph extraction
3.1.2 Subgraph sampling
3.2 Subgraph contrastive learning for link representation
3.3 Graph augmentation
3.3.1 Attribute masking
3.3.2 Edge removing
3.3.3 Attribute similarity
3.3.4 KNN graph
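The four views named in 3.3.1–3.3.4 correspond to common graph augmentations. Below is a minimal sketch of plausible implementations, assuming node features `x` of shape \([N, F]\) and an `edge_index` of shape \([2, E]\) as PyTorch tensors; the masking/removal ratios and the use of cosine similarity are illustrative assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def attribute_masking(x, mask_ratio=0.2):
    # 3.3.1: zero out a random subset of feature dimensions.
    mask = torch.rand(x.size(1), device=x.device) < mask_ratio
    x = x.clone()
    x[:, mask] = 0.0
    return x

def edge_removing(edge_index, drop_ratio=0.2):
    # 3.3.2: keep each edge independently with probability 1 - drop_ratio.
    keep = torch.rand(edge_index.size(1), device=edge_index.device) >= drop_ratio
    return edge_index[:, keep]

def knn_view(x, k=5):
    # 3.3.3 / 3.3.4: build a k-nearest-neighbour graph from attribute
    # (cosine) similarity between node feature vectors.
    x_norm = F.normalize(x, dim=1)
    sim = x_norm @ x_norm.t()
    sim.fill_diagonal_(float('-inf'))          # exclude self-loops
    nbrs = sim.topk(k, dim=1).indices          # k most similar nodes per node
    src = torch.arange(x.size(0), device=x.device).repeat_interleave(k)
    return torch.stack([src, nbrs.reshape(-1)], dim=0)
```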
3.4 Model training
4 Experiments
- Citation benchmarks: Cora, Citeseer and Pubmed [41], where nodes represent papers with corresponding bag-of-words features, and edges represent citation relationships between papers.
- Facebook: a page-page web graph of verified Facebook sites. Nodes represent official Facebook pages, and links are mutual likes between sites. Node features are extracted from the site descriptions created by the page owners to summarize the purpose of the site.
- Github: a large social network where nodes are GitHub developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. Node features are extracted from the developer's location, starred repositories, employer, and e-mail address. A loading sketch for all five datasets follows the statistics table below.
| Scale | Dataset | Type | Nodes | Edges | Features |
|---|---|---|---|---|---|
| Small-scale | Cora | Citation network | 2,708 | 5,429 | 1,433 |
| Small-scale | Citeseer | Citation network | 3,327 | 4,732 | 3,703 |
| Small-scale | Pubmed | Citation network | 19,717 | 44,338 | 500 |
| Large-scale | Facebook | Web network | 22,470 | 171,002 | 128\(*\) |
| Large-scale | Github | Social network | 37,700 | 289,003 | 128\(*\) |
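All five datasets are available through PyTorch Geometric loaders (an assumption about tooling; the paper does not state which loaders it uses). A minimal loading sketch with a random link-level split, using illustrative split ratios:

```python
from torch_geometric.datasets import Planetoid, FacebookPagePage, GitHub
from torch_geometric.transforms import RandomLinkSplit

# Citation benchmarks (small-scale) and the two page/developer graphs (large-scale).
cora     = Planetoid(root='data/Planetoid', name='Cora')[0]
citeseer = Planetoid(root='data/Planetoid', name='CiteSeer')[0]
pubmed   = Planetoid(root='data/Planetoid', name='PubMed')[0]
facebook = FacebookPagePage(root='data/Facebook')[0]
github   = GitHub(root='data/GitHub')[0]

# Split edges into train/val/test sets with sampled negative links
# (5% / 10% held out here purely for illustration).
split = RandomLinkSplit(num_val=0.05, num_test=0.10, is_undirected=True)
train_data, val_data, test_data = split(cora)
```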
4.1 Experimental setting
| Method | Cora AUC (%) | Cora AP (%) | Citeseer AUC (%) | Citeseer AP (%) | Facebook AUC (%) | Facebook AP (%) |
|---|---|---|---|---|---|---|
| CN | 56.19±0.099 | 63.08±0.059 | 58.76±0.095 | 65.31±0.060 | 88.70±0.076 | 91.72±0.045 |
| Salton | 56.85±0.094 | 61.73±0.065 | 59.32±0.090 | 63.79±0.065 | 88.61±0.077 | 91.07±0.054 |
| AA | 57.33±0.087 | 59.36±0.081 | 58.97±0.089 | 60.90±0.083 | 87.24±0.093 | 91.18±0.060 |
| RA | 57.77±0.090 | 64.75±0.055 | 59.11±0.091 | 65.88±0.058 | 85.08±0.101 | 90.43±0.057 |
| \(\diamond \) DeepWalk | 88.14±0.055 | 87.87±0.045 | 85.67±0.052 | 86.11±0.044 | 87.65±0.012 | 85.41±0.011 |
| \(\diamond \) Node2Vec | 88.65±0.058 | 89.62±0.049 | 87.36±0.065 | 88.15±0.059 | 85.64±0.022 | 85.25±0.028 |
| \(\diamond \) GAE | 93.79±0.038 | 93.43±0.038 | 92.63±0.013 | 93.50±0.016 | OOM | OOM |
| \(\diamond \) VGAE | 94.30±0.006 | 94.60±0.082 | 93.78±0.046 | 94.55±0.051 | OOM | OOM |
| \(\diamond \) ARGA | 90.27±0.067 | 90.01±0.069 | 89.00±0.040 | 89.60±0.039 | OOM | OOM |
| \(\diamond \) ARGVA | 93.26±0.041 | 93.61±0.045 | 94.03±0.013 | 94.30±0.014 | OOM | OOM |
| \(\diamond \) DGI | 93.15±0.023 | 92.70±0.030 | 91.84±0.014 | 92.31±0.014 | OOM | OOM |
| \(\star \) SEAL(\(h=1\)) | 95.46±0.007 | 95.84±0.010 | 91.20±0.010 | 93.11±0.008 | 97.83±0.018 | 97.29±0.023 |
| \(\star \) SEAL(\(h=2\)) | \(\underline{95.88{\pm }0.007}\) | \(\underline{96.14{\pm }0.008}\) | 91.35±0.012 | 93.10±0.008 | 98.58±0.001 | 98.71±0.001 |
| \(\star \) SE4LP(\(h=1\)) | 96.05±0.007 | 96.28±0.010 | 94.74±0.007 | 95.24±0.008 | \(\underline{97.94{\pm }0.002}\) | \(\underline{97.64{\pm }0.001}\) |
| \(\star \) SE4LP(\(h=2\)) | 94.33±0.008 | 94.03±0.012 | \(\underline{93.38{\pm }0.006}\) | \(\underline{93.67{\pm }0.008}\) | 97.53±0.001 | 97.24±0.001 |
| Method | Pubmed AUC (%) | Pubmed AP (%) | Github AUC (%) | Github AP (%) |
|---|---|---|---|---|
| CN | 65.53±0.050 | 67.54±0.034 | 67.87±0.105 | 73.26±0.069 |
| Salton | 64.78±0.057 | 66.12±0.047 | 62.84±0.085 | 60.69±0.054 |
| AA | 65.26±0.049 | 67.49±0.035 | 67.69±0.108 | 75.23±0.085 |
| RA | 65.27±0.047 | 67.78±0.032 | 67.56±0.078 | 77.14±0.050 |
| \(\diamond \) DeepWalk | 90.88±0.021 | 87.48±0.026 | 81.25±0.013 | 80.25±0.012 |
| \(\diamond \) Node2Vec | 90.60±0.021 | 89.29±0.027 | 80.49±0.019 | 79.62±0.016 |
| \(\diamond \) GAE | 91.93±0.051 | 91.84±0.051 | OOM | OOM |
| \(\diamond \) VGAE | 89.36±0.056 | 89.37±0.056 | OOM | OOM |
| \(\diamond \) ARGA | 88.00±0.049 | 88.33±0.045 | OOM | OOM |
| \(\diamond \) ARGVA | 90.48±0.036 | 90.32±0.035 | OOM | OOM |
| \(\diamond \) DGI | 91.45±0.004 | 90.87±0.005 | OOM | OOM |
| \(\star \) SEAL(\(h=1\)) | 94.24±0.015 | 92.56±0.021 | 96.27±0.019 | 96.01±0.017 |
| \(\star \) SEAL(\(h=2\)) | 96.82±0.009 | 98.11±0.011 | 97.11±0.014 | 97.02±0.011 |
| \(\star \) SE4LP(\(h=1\)) | 98.36±0.001 | 98.25±0.002 | \(\underline{96.43{\pm }0.005}\) | \(\underline{96.11{\pm }0.005}\) |
| \(\star \) SE4LP(\(h=2\)) | \(\underline{98.35{\pm }0.004}\) | \(\underline{98.18{\pm }0.005}\) | 96.12±0.010 | 95.86±0.015 |
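The AUC and AP values above can be computed with scikit-learn's `roc_auc_score` and `average_precision_score` (an assumed tool choice; the paper does not specify its evaluation code). As an illustration, the sketch below scores test links with the common-neighbours (CN) heuristic, \(|\Gamma(u) \cap \Gamma(v)|\), and evaluates it against sampled negative links, assuming a NetworkX graph and lists of (u, v) pairs:

```python
import networkx as nx
from sklearn.metrics import roc_auc_score, average_precision_score

def common_neighbour_scores(G, links):
    # CN heuristic: number of shared neighbours of the two endpoints.
    return [len(set(G[u]) & set(G[v])) for u, v in links]

def evaluate_link_prediction(G, pos_links, neg_links):
    # Positive test links get label 1, sampled negative links get label 0.
    scores = common_neighbour_scores(G, pos_links + neg_links)
    labels = [1] * len(pos_links) + [0] * len(neg_links)
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```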
4.2 Evaluation on link prediction
4.3 Analysis of augmentation views
| Item | Method | Cora | Citeseer | Pubmed | Facebook | Github |
|---|---|---|---|---|---|---|
| Training Time (s) | DeepWalk | 609.2 | 580.7 | 6023.0 | 13531.5 | 21948.7 |
| Training Time (s) | GAE | 0.0 | 0.0 | 0.0 | – | – |
| Training Time (s) | SEAL | 1.3 | 0.8 | 2.2 | 8.5 | 22.8 |
| Training Time (s) | SE4LP | 1.6 | 1.2 | 5.1 | 8.7 | 13.3 |
| Memory (MB) | DeepWalk | 346 | 414 | 1679 | 3082 | 3099 |
| Memory (MB) | GAE | 4935 | 4932 | 4994 | OOM | OOM |
| Memory (MB) | SEAL | 2709 | 3079 | 2820 | 2724 | 3039 |
| Memory (MB) | SE4LP | 5698 | 6000 | 5589 | 6044 | 8740 |
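For reference, training time and peak GPU memory of the kind reported above can be measured as sketched below (assuming PyTorch on a CUDA device; `train_one_epoch` is a hypothetical stand-in for each method's training routine, and the paper may measure memory differently):

```python
import time
import torch

def profile_epoch(train_one_epoch):
    # Reset the peak-memory counter, run one epoch, and report wall-clock
    # time plus the maximum GPU memory allocated during the run.
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    train_one_epoch()              # hypothetical per-epoch training call
    torch.cuda.synchronize()       # wait for all queued GPU work to finish
    elapsed = time.time() - start
    peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
    return elapsed, peak_mb
```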