Background
Previous work
Methods
Notation
Motivation
-
Salton index or cosine similarity (Salton)$$\begin{aligned} s_{Salton}=\frac{S_{CN}}{\sqrt{k_{a}k_{b}}} \end{aligned}$$(2)
-
Jaccard index (Jaccard)$$\begin{aligned} s_{Jaccard}=\frac{S_{CN}}{\left| N_{a}\cup N_{b}\right| } \end{aligned}$$(3)
-
Hub promoted index (HPI)$$\begin{aligned} s_{HPI}=\frac{S_{CN}}{\min (k_{a},k_{b})} \end{aligned}$$(4)
-
Hub depressed index (HDI)$$\begin{aligned} s_{HDI}=\frac{S_{CN}}{\max (k_{a},k_{b})} \end{aligned}$$(5)
Schemes of node neighborhood sets
-
Scheme 1 : \(P_{a,b}^{(1)}=N_{a\backslash b}\) and \(R_{a,b}^{(1)}=S_{b\backslash a}\backslash S_{a\backslash b}\)How many of a's friends know b and his friends outside of the relationship with a?
-
Scheme 2 : \(P_{a,b}^{(2)}=N_{a\backslash b}\) and \(R_{a,b}^{(2)}=S_{b\backslash a}\)How many of a’s friends know b and his friends?
-
Scheme 3 : \(P_{a,b}^{(3)}=S_{a\backslash b}\) and \(R_{a,b}^{(3)}=S_{b\backslash a}\backslash S_{a\backslash b}\)How many of a and his friends know b and his friends outside of the relationship with a?
-
Scheme 4 : \(P_{a,b}^{(4)}=S_{a\backslash b}\) and \(R_{a,b}^{(4)}=S_{b\backslash a}\)How many of a and his friends know b and his friends?
Expected number of edges between two sets of nodes
Erdős- Rényi random graph generation model
-
Scheme 1:$$\begin{aligned} e_{\overline{ab}}^{(1)}=\left( k_{a}k_{b}-\frac{1}{2}(k_{a}+k_{b})\left( 1+k_{ab}\right) +k_{ab}\right) d_{G} \end{aligned}$$(12)
-
Scheme 2:$$\begin{aligned} e_{\overline{ab}}^{(2)}=\left( k_{a}k_{b}-\frac{1}{2}\left( k_{a}+k_{b}\right) -k_{ab}\right) d_{G} \end{aligned}$$(13)
-
Scheme 3:$$\begin{aligned} e_{\overline{ab}}^{(3)}=\left( k_{a}k_{b}-\frac{1}{2}(k_{a}+k_{b})k_{ab}\right) d_{G} \end{aligned}$$(14)
-
Scheme 4:$$\begin{aligned} e_{\overline{ab}}^{(4)}=\left( k_{a}k_{b}-k_{ab}\right) d_{G} \end{aligned}$$(15)
Preferential attachment random graph generation model
-
Scheme 1:$$\begin{aligned} e_{\overline{ab}}^{(1)}=\frac{1}{4m}\left( \sum _{i\in P_{a,b}^{(1)}}\sum _{j\in R_{a,b}^{(1)}}k_{i}k_{j}+\sum _{i\in P_{b,a}^{(1)}}\sum _{j\in R_{b,a}^{(1)}}k_{i}k_{j}\right) \end{aligned}$$(19)
-
Scheme 2:$$\begin{aligned} e_{\overline{ab}}^{(2)}=\frac{1}{4m}\left( \sum _{i\in P_{a,b}^{(2)}}\sum _{j\in R_{a,b}^{(2)}}k_{i}k_{j}+\sum _{i\in P_{b,a}^{(2)}}\sum _{j\in R_{b,a}^{(2)}}k_{i}k_{j}\right) \end{aligned}$$(20)
-
Scheme 3:$$\begin{aligned} e_{\overline{ab}}^{(3)}=\frac{1}{4m}\left( \sum _{i\in P_{a,b}^{(3)}}\sum _{j\in R_{a,b}^{(3)}}k_{i}k_{j}+\sum _{i\in P_{b,a}^{(3)}}\sum _{j\in R_{b,a}^{(3)}}k_{i}k_{j}\right) \end{aligned}$$(21)
-
Scheme 4:$$\begin{aligned} e_{\overline{ab}}^{(4)}=\frac{1}{4m}\left( \sum _{i\in P_{a,b}^{(4)}}\sum _{j\in R_{a,b}^{(4)}}k_{i}k_{j}+\sum _{i\in P_{b,a}^{(4)}}\sum _{j\in R_{b,a}^{(4)}}k_{i}k_{j}\right) \end{aligned}$$(22)
Authentic score using the PA model
Authentic score compensation
Matrix of degree products
Evaluation of the proposed algorithms
Comparison of different combinations of the proposed algorithm
Comparison of outlier edge detection algorithms
Nodes | Edges | Density | GCC (%) | Reference | |
---|---|---|---|---|---|
Advogato | 6.5 k | 51 k |
\(1.2\times 10^{-3}\)
| 9.2 | [25] |
Twitter-icwsm | 465 k | 835 k |
\(3.9\times 10^{-6}\)
| 0.06 | [26] |
Brightkite | 58 k | 214 k |
\(1.3\times 10^{-4}\)
| 11 | [23] |
Facebook-wosn | 63 k | 817 k |
\(4.0\times 10^{-4}\)
| 14.8 | [27] |
Ca-cit-HepPh | 28 k | 4.6 m |
\(8.0\times 10^{-3}\)
| 28 | [28] |
Youtube-friend | 1.1 m | 3.0 m |
\(4.6\times 10^{-6}\)
| 0.6 | [29] |
Web-Google | 875 k | 5.1 m |
\(6.7\times 10^{-6}\)
| 5.5 | [30] |
ER | PA | Jaccard | HPI | PAI | |
---|---|---|---|---|---|
Advogato | 0.887 |
0.893
| 0.858 | 0.859 | 0.877 |
Twitter-icwsm | 0.531 | 0.942 | 0.527 | 0.530 |
0.997
|
Brightkite | 0.885 |
0.905
| 0.833 | 0.827 | 0.873 |
Facebook-wosn | 0.968 |
0.970
| 0.947 | 0.946 | 0.878 |
Ca-cit-HepPh | 0.970 | 0.967 |
0.993
| 0.991 | 0.888 |
Youtube-friend | 0.770 | 0.842 | 0.731 | 0.738 |
0.898
|
Web-Google | 0.985 |
0.992
| 0.944 | 0.945 | 0.859 |
Change of graph properties
Original | ER model | PA model | Random | |
---|---|---|---|---|
GCC | 0.111 | 0.121 | 0.120 | 0.105 |
ALCC | 0.172 | 0.180 | 0.183 | 0.158 |
Diameter | 18 | 19 | 20 | 18 |
ED | 5.91 | 6.78 | 6.36 | 5.95 |
MSP | 3.92 | 4.10 | 4.10 | 3.95 |
Applications
Impact on graph clustering algorithms
\(\mu\)
| LRW | GN | SLM | Danon | Louvain | Infomap |
---|---|---|---|---|---|---|
0.2 | 1.0/1.0 | 0.99/1.0 | 1.0/1.0 | 0.99/1.0 | 1.0/1.0 | 1.0/1.0 |
0.25 | 0.97/1.0 | 0.98/0.99 | 1.0/1.0 | 0.99/0.98 | 1.0/1.0 | 1.0/1.0 |
0.3 | 0.89/0.95 | 0.93/0.97 | 1.0/1.0 | 0.95/0.98 | 1.0/1.0 | 0.92/1.0 |
0.35 | 0.78/0.82 | 0.74/0.72 | 0.96/0.94 | 0.66/0.84 | 0.90/0.86 | 0.36/0.91 |
0.4 | 0.80/0.86 | 0.66/0.70 | 0.83/0.81 | 0.67/0.70 | 0.84/0.81 | 0.78/0.83 |
0.45 | 0.25/0.73 | 0.53/0.52 | 0.71/0.67 | 0.51/0.55 | 0.68/0.60 | 0.22/0.43 |
0.5 | 0.03/0.61 | 0.39/0.47 | 0.58/0.56 | 0.39/0.49 | 0.51/0.53 | 0/0.47 |
\(\mu\)
| LRW | GN (%) | SLM | Danon (%) | Louvain | Infomap |
---|---|---|---|---|---|---|
0.2 | 0 | 0.8 | 0 | 1.0 | 0 | 0 |
0.25 | 3.3% | 1.5 | 0 | −1.0 | 0 | 0 |
0.3 | 7.3% | 5.0 | 0 | 3.5 | 0 | 9.1% |
0.35 | 5.7% | −2.2 | −2.1 | 26 | −4.9% | 155% |
0.4 | 8.5% | 6.7 | −2.2 | 4.8 | −3.0% | 5.8% |
0.45 | 190% | −1.1 | −6.2 | 8.4 | −12% | 95% |
0.5 | 1730% | 19 | −4.4 | 26 | 2.4% |
\(\infty\)
|
\(\mu\)
| LRW (%) | GN (%) | SLM (%) | Danon (%) | Louvain (%) | Infomap (%) |
---|---|---|---|---|---|---|
0.2 | −52 | −11 | −36 | −3.1 | −33 | −47 |
0.25 | −23 | −18 | 1.0 | −1.0 | −41 | −16 |
0.3 | −8.9 | −9.3 | 7.7 | −1.4 | −31 | −13 |
0.35 | −11 | −0.3 | −21 | −3.5 | −35 | 31 |
0.4 | −11 | −5.7 | −5.3 | −3.0 | −20 | 17 |
0.45 | −16 | 2.8 | −14.4 | 2.1 | −41 | 33 |
0.5 | −21 | −6.7 | −1.9 | −3.4 | −39 | 55 |
Outlier node detection in social network graphs
Node id | Outlier edges | Degree | Degree rank | LCC |
---|---|---|---|---|
41 | 21 | 1134 | 1 | 0.005 |
458 | 16 | 1055 | 2 | 0.001 |
115 | 9 | 838 | 4 | 0.004 |
175 | 7 | 270 | 39 | 0.001 |
989 | 7 | 270 | 40 | 0.015 |
2443 | 7 | 379 | 16 | 0.010 |
36 | 5 | 467 | 11 | 0.005 |
158 | 5 | 833 | 5 | 0.004 |
Clustering of noisy data
Removed edges | 2.6% | 2.7% | 2.8% | 3.5% | 6% | 33.3% |
Number of clusters | 2 | 3 | 4 | 5 | 6 | 7 |
Dataset | k-Means | a-Link | N-Cuts | GDL | Proposed |
---|---|---|---|---|---|
(a) | 0.031 | 0.099 | 0.053 | 0.650 |
0.672
|
(b) | 0.743 | 0.743 | 0.743 | 0.743 |
0.848
|
(c) | 0 | 0.004 | 0.559 | 0.654 |
0.755
|
(d) | 0.208 | 0.161 | 0.367 | 0.553 |
0.619
|
(e) | 0.001 | 0.133 | 0.680 | 0.701 |
0.744
|
(f) | 0.001 | 0.162 | 0.627 | 0.612 |
0.714
|