1 Introduction
-
We propose GCNEXT, a novel GCN-based end-to-end framework using expanded balance theory for fraudulent user detection that incorporates both the signs and directions of edges.
-
From the experimental results, we show that the proposed framework performs well, or even best compared to the others we experimented. It also shows remarkable stability in inductive settings, which is associated with the detection of new fraudulent users on a rating platform.
-
We present an analysis of a rating platform based on expanded balance theory and provide new insight into the behavior of fraudulent users in rating networks.
2 Related work
2.1 Fraudulent user detection on online rating platforms
2.1.1 Feature-based approaches
2.1.2 Graph-based
2.2 Graph convolutional networks
2.3 Balance theory for SGCN Derr et al. (2018)
3 GCNEXT: graph convolutional network with expanded balance theory
3.1 Expanded balance theory for signed directed networks
-
“Out-Out Friends” (hereinafter, “O/O F”) is derived from the intuition “Someone liked by my favorite person is my friend” (as well as its opposite, replacing “liked” and “favorite” with their antonyms).
-
“Out-In Friends” (“O/I F”) is derived from the intuition “Someone who has the same opinion of the same entity as I do is my friend.”
-
“In-Out Friends” (“I/O F”) is derived from the intuition “Someone who is rated in the same way by the same person as I am is my friend.”
-
“In-In Friends” (“I/I F”) is derived from the intuition “Someone who likes someone who likes me is my friend” (as well as its opposite).
3.2 Flow of model construction
3.2.1 Notations
Notation | Description |
---|---|
\(\mathcal {U}=\{u_1,u_2, \ldots ,u_n\}\) | Set of nodes |
\(sign \in \{+, -\}\) | Sign of edge |
\(dir \in \{out, in\}\) | Edge direction |
\(x_i\) | Initialized vector of \(u_i\) |
\(N_i^{sign,dir}\) | Neighbors of \(u_i\) along with |
Edges of sign and direction | |
\(h_i^{sign,dir}\) | Hidden representation of \(u_i\) |
After the 1st layer | |
\(W_1^{sign,dir}\) | Weight matrix of the 1st layer |
\(agg\in AGG\) | Aggregator in the 2nd layer |
\(\mathscr {N}_i\) , \(h'_i\) | Map functions |
\(W_2^{agg}\) | Weight matrix of the 2nd layer |
\(W_3\) | Weight matrix of fully |
Connected layer | |
Z | final embedding matrix |
3.2.2 Initialized vectors
3.2.3 Details and formulas
\(agg \in\) AGG | Map functions | |
---|---|---|
\(\mathscr {N}_i(agg, sign)\) | \(h'_i(agg,sign)\) | |
O/O F | \(N_i^{sign,out}\) | \(h^{sign,out}\) |
O/I F | \(N_i^{sign,out}\) | \(h^{sign,in}\) |
I/O F | \(N_i^{sign,in}\) | \(h^{sign,out}\) |
I/I F | \(N_i^{sign,in}\) | \(h^{sign,in}\) |
O/O E | \(N^{sign,out}\) | \(h^{\overline{sign},out}\) |
O/I E | \(N^{sign,out}\) | \(h^{\overline{sign},in}\) |
I/O E | \(N^{sign,in}\) | \(h^{\overline{sign},out}\) |
I/I E | \(N^{sign,in}\) | \(h^{\overline{sign},in}\) |
4 Experiments
4.1 Datasets
-
Bitcoin OTC is a homogeneous user-to-user trust network of Bitcoin users on OTC platform (Kumar et al. 2016). The set of rating values is \(\{x \in \mathbb {Z} \mid -10\le x \le 10, x \ne 0 \}\). As for ground truth, the platform’s founder and users he rated highly positively ( \(\ge 5\)) are defined as benign, and the users whom these benign users uniformly rated negatively ( \(\le - 5\) ) are defined as fraudulent.
-
Bitcoin Alpha is also a user-to-user trust network of Bitcoin users on the Alpha platform (Kumar et al. 2016). The set of rating values is the same as those of Bitcoin OTC. Ground truth is defined in a similar way to OTC, starting from the founder of this platform.
-
Epinions is a user-to-post rating network (Massa et al. 2007) with integer rating values from 1 to 6. Ground truth is defined using a user-to-user trust network (Massa et al. 2007), which is independent of the user-to-post rating network. A user is defined as benign if its total trust rating is \(\ge +10\) but as fraudulent if \(\le -10\). Note that if a user rates multiple posts by another user, there will be multiple edges between those two users in the graph.
-
Amazon is a user-to-product bipartite rating network (McAuley et al. 2013) with a five-point rating scale. The helpfulness vote, which can be used to indicate malicious behavior (Fayazi et al. 2015), is used to define ground truth. Benign users are those who receive at least 50 votes with a ratio of helpful-to-total votes of \(\ge 0.75\). Those who receive at least 50 votes with a ratio of helpful-to-total votes of \(\le 0.25\) are defined as fraudulent.
-
Amazon App, Amazon Music, and Amazon Home are also user-to-product bipartite rating networks from SNAP (Leskovec and Krevl 2014) with five-point rating scale, corresponding to “Apps for Android,” “Digital Music,” and “Home and Kitchen,” respectively. The definition of ground truth follows Amazon dataset described above.
Dataset | Nodes | Edges | Benign users | Fraudulent users | Average degree | Cluster coefficient |
---|---|---|---|---|---|---|
OTC | 5881 | 35,592 | 136 | 180 | 12.10 | 0.15 |
Alpha | 3783 | 24,186 | 138 | 102 | 12.79 | 0.16 |
Amazon | 271,082 | 415,390 | 2358 | 241 | 3.06 | 0.00 |
Epinions | 4180 | 70,227 | 726 | 70 | 33.60 | 0.18 |
Amazon App | 98401 | 465,350 | 7998 | 194 | 9.46 | 0.00 |
Amazon Music | 8901 | 37,836 | 816 | 115 | 8.50 | 0.00 |
Amazon Home | 93,820 | 376,802 | 7339 | 52 | 8.03 | 0.00 |
4.2 Settings
4.2.1 Proposed framework
4.2.2 Baselines
-
Rev2 (Kumar et al. 2018b) is the state-of-the-art method for fraudulent user detection in all four datasets in this experiment.
-
R-GCN (Schlichtkrull et al. 2018) is a GCN-based model for edge-attributed networks. Here, we conduct the experiments in three variants of R-GCN. The first variant is the most standard form of R-GCN, which learns different convolution kernels for different edge ratings. The second is similar to the first, but a hyperparameter named “base” is set to 3. Setting a smaller base value than the actual number of unique edge types is reported to be effective for avoiding overfitting and capturing the similarity of edge types. In the last variants, we preprocess edge types in the same way as the proposed framework. Here edges are recategorized into binary classes (positive and negative) with a threshold at the center of the possible rating values. We report the results from the best variants of each dataset.
-
SIDE (Kim et al. 2018) is a random-walk-based network embedding method utilizing balance theory, which is designed for signed directed networks. The major difference from our proposed method is that SIDE learns in unsupervised fashion. The output of SIDE is a distributed representation, not a class label. So we use random forest to build a classification model with the distributed representation as input.
-
SGCN (Derr et al. 2018) is a GCN-based model for signed networks. As described in Sect. 3.1, limited types of relationships between nodes are considered unlike the proposed framework. Although it is originally designed for link sign prediction, we use the final embedding to the node classification task where parameters are learned jointly.
4.3 Results
4.3.1 Transductive settings
Methods | OTC | Alpha | Amazon | Epinions | Amazon App | Amazon Music | Amazon Home |
---|---|---|---|---|---|---|---|
Rev2 | 0.893 | 0.84 | 0.857 | 0.854 | 0.737 | 0.962 | 0.730 |
R-GCN | 0.960 | 0.926 | 0.818 | 0.767 | 0.616 | 0.824 | 0.713 |
SIDE | 0.975 | 0.921 | 0.855 | 0.938 | 0.757 | 0.912 | 0.708 |
SGCN | 0.994 | 0.959 | 0.871 | 0.959 | 0.704 | 0.974 | 0.751 |
GCNEXT | 0.996 | 0.97 | 0.875 | 0.973 | 0.766 | 0.972 | 0.756 |
Methods | OTC | Alpha | Amazon | Epinions | Amazon App | Amazon Music | Amazon Home |
---|---|---|---|---|---|---|---|
Rev2 | 0.691 | 0.722 | 0.79 | 0.624 | 0.669 | 0.947 | 0.565 |
R-GCN | 0.886 | 0.732 | 0.738 | 0.682 | 0.552 | 0.654 | 0.608 |
SIDE | 0.902 | 0.765 | 0.733 | 0.763 | 0.591 | 0.733 | 0.576 |
SGCN | 0.934 | 0.777 | 0.807 | 0.933 | 0.682 | 0.945 | 0.705 |
GCNEXT | 0.947 | 0.818 | 0.796 | 0.949 | 0.685 | 0.953 | 0.712 |