Background
Related work
- In this study we experiment with a much larger dataset. To obtain it, we use the same raw data as [15], but with slight modifications to the thresholds used to filter out advertisements that are less likely to be related to human trafficking.
- In contrast to our previous research, which deployed only one feature space, this work uses two feature spaces with complementary roles.
- In this paper we present a new framework built on the existing Laplacian SVM [14]: we add a regularization term to the standard optimization problem and solve the resulting optimization problem. In contrast, [15] utilized the off-the-shelf graph-based semi-supervised learner LabelSpreading [24] without any further modification of the original approach.
- Unlike [15], in which we did not compare our method with other approaches, this work compares the proposed framework against other semi-supervised and supervised learners. Also, unlike our previous work, in which a single group of human trafficking related advertisements was passed to two experts for validation, here, to reduce inconsistency, two control groups of advertisements (those of interest to law enforcement and those not) are sent to a single expert for verification.
Data preparation
Feature engineering
| No. | Feature group | References |
|---|---|---|
| 1 | Advertisement language pattern | |
| | Third person language | |
| | First person plural pronouns | |
| | Kolmogorov complexity | |
| | n-grams (1) | |
| | n-grams (2) | |
| | n-grams (3) | |
| 2 | Words and phrases of interest | |
| 3 | Countries of interest | [3] |
| 4 | Multiple victims advertised | [6] |
| 5 | Victim weight | |
| 6 | Reference to website or spa massage therapy | [6] |
| | Reference to a website | |
| | Reference to a spa massage therapy | |
Advertisement language pattern
Words and phrases of interest
Countries of interest
Multiple victims advertised
Victim weight
Reference to website or spa massage therapy
Unsupervised filtering
Expert assisted labeling
| Name | Value |
|---|---|
| Raw | 20,822 |
| Filtered | 3543 |
| Unlabeled | 3343 |
| Labeled | 200 (70 positive, 130 negative) |
Semi-supervised learning framework
Technical preliminaries
The proposed method
Experimental study
Approaches
- Supervised learners: SVM, KNN, Gaussian naïve Bayes, logistic regression, AdaBoost and random forest.
- \(S^3VM-R\): we set the penalty parameter \(C_l=0.6\) and the regularization parameters \(C_r=0.2\) and \(C_s=0.2\). A linear kernel was used in our approach.
- Laplacian SVM: we used a linear kernel and set the parameters \(C_l=0.6\) and \(C_s=0.6\).
- LabelSpreading (RBF): an RBF kernel was used and \(\gamma \) was set to the default value of 20.
- LabelSpreading (KNN): a KNN kernel was used and the number of neighbors was set to 5.
- Co-training (SVM): we followed the algorithm introduced in [36] and used two SVMs as our classifiers. For both SVMs we set the tolerance of the stopping criterion to 0.001 and the penalty parameter \(C=1\).
- SVM: the tolerance of the stopping criterion was set to the default value of 0.001, the penalty parameter \(C\) was set to 1, and a linear kernel was used.
- KNN: the number of neighbors was set to 5.
- Gaussian NB: there were no specific parameters to tune.
- Logistic regression: we used the ‘l2’ penalty, set the parameter \(C=1\) (the inverse of the regularization strength), and set the tolerance of the stopping criterion to 0.01.
- AdaBoost: the number of estimators was set to 200 and the learning rate to 0.01.
- Random forest: we used 200 estimators and the ‘entropy’ criterion (a configuration sketch for the off-the-shelf baselines follows this list).
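For concreteness, the following is a minimal sketch of the baseline configurations listed above, assuming a scikit-learn implementation (the paper does not state which library was used); \(S^3VM-R\), the Laplacian SVM and Co-training (SVM) are custom or reimplemented methods and are therefore omitted here.

```python
# Hedged sketch: off-the-shelf baselines with the parameter settings listed
# above, assuming scikit-learn (the library actually used is not stated).
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.semi_supervised import LabelSpreading

baselines = {
    # Graph-based semi-supervised baselines (unlabeled samples carry label -1).
    "LabelSpreading (RBF)": LabelSpreading(kernel="rbf", gamma=20),
    "LabelSpreading (KNN)": LabelSpreading(kernel="knn", n_neighbors=5),
    # Supervised baselines.
    "SVM": SVC(kernel="linear", C=1, tol=0.001),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Gaussian NB": GaussianNB(),
    "Logistic regression": LogisticRegression(penalty="l2", C=1, tol=0.01),
    "AdaBoost": AdaBoostClassifier(n_estimators=200, learning_rate=0.01),
    "Random forest": RandomForestClassifier(n_estimators=200, criterion="entropy"),
}
```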
Classification results
- Overall, our approach achieved the highest performance on \({\mathcal {F}}_1\) (Tables 3, 4) and \(\{{\mathcal {F}}_1, {\mathcal {F}}_2\}\) (Table 6) in terms of all metrics. However, it did not perform well when using solely \({\mathcal {F}}_2\) (Table 5), i.e. when \(C_r=0\). This clearly demonstrates the importance of \(C_r\) over \(C_s\).
- When the feature space used is \({\mathcal {F}}_2\), Co-training (SVM) is the best method, followed by the supervised learners KNN and Gaussian NB. Three remarks can be made here. First, our approach could not always beat the supervised learners, as seen in Tables 3 and 5. This is not surprising and in fact reflects the inherent difference between semi-supervised and supervised methods: unlabeled examples can make the trained model susceptible to error propagation and hence to wrong estimates. Second, as seen in Tables 4, 5 and 6, achieving very high recall on the negative examples and a low score on the positive ones should not be treated as a desirable property; otherwise a trivial classifier that assigns negative labels to all samples would be the best learner. Third, using \(C_r\) always improves performance over \(C_s\). One point that needs to be clarified is that our ultimate goal is not to achieve high performance on the labeled data, but rather to detect the suspicious (unlabeled) advertisements that could be related to human trafficking; this is explained in more detail in “Blind evaluation”.
- Compared to the other semi-supervised approaches, our approach achieved higher or comparable AUC scores. The reason we performed exactly the same as the Laplacian SVM is that, by setting \(C_r=0\), the two approaches become inherently the same.
- For the Laplacian SVM to run on \({\mathcal {F}}_1\), the Laplacian \({\mathcal {L}}^{\prime}\) has to be constructed from \({\mathcal {F}}_1\), although it is inherently meant to be built from \({\mathcal {F}}_2\). This is because \(C_r\) is essentially associated with \({\mathcal {F}}_1\), while \(C_s\) corresponds to \({\mathcal {L}}^{\prime}\) and hence to \({\mathcal {F}}_2\). The same holds for \(\{{\mathcal {F}}_1,{\mathcal {F}}_2\}\), where we need to construct a new feature space by concatenating \({\mathcal {F}}_1\) and \({\mathcal {F}}_2\), as the Laplacian SVM does not inherently use \({\mathcal {F}}_1\) at all. The new feature space is then used to construct the Laplacian \({\mathcal {L}}^{\prime}\).
- Since our approach inherently incorporates the Laplacian matrices corresponding to both feature spaces \({\mathcal {F}}_1\) and \({\mathcal {F}}_2\), all other baselines were also run on the concatenation of these two feature spaces for the sake of a fair comparison. Unlike our approach, which combines \({\mathcal {F}}_1\) and \({\mathcal {F}}_2\) in a principled way, the other methods do not gain high AUC by simply concatenating the feature spaces; a sketch of this graph construction appears right after this list.
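As a minimal sketch (not the paper's exact construction), the Laplacian \({\mathcal {L}}^{\prime}\) can be built from pairwise similarities over whichever feature space is chosen; the RBF similarity, its \(\gamma\), and the helper name below are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def graph_laplacian(X, gamma=1.0):
    """Unnormalized graph Laplacian L = D - W over the samples in X.

    The RBF similarity (and its gamma) is an assumption for illustration;
    the paper does not specify the similarity measure used to build L'.
    """
    W = rbf_kernel(X, gamma=gamma)   # pairwise similarities among samples
    np.fill_diagonal(W, 0.0)         # remove self-loops
    D = np.diag(W.sum(axis=1))       # degree matrix
    return D - W

# L' is normally built from F2; for the runs on F1 alone it is built from F1,
# and for {F1, F2} the two feature matrices are concatenated column-wise first:
# X_f1, X_f2 = ...                   # hypothetical feature matrices
# L_prime = graph_laplacian(np.hstack([X_f1, X_f2]))
```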
| Learner | AUC \({\mathcal {F}}_1\) | AUC \({\mathcal {F}}_2\) | AUC \(\{{\mathcal {F}}_1, {\mathcal {F}}_2\}\) | Accuracy \({\mathcal {F}}_1\) | Accuracy \({\mathcal {F}}_2\) | Accuracy \(\{{\mathcal {F}}_1, {\mathcal {F}}_2\}\) |
|---|---|---|---|---|---|---|
| \(S^3VM-R\) | 0.91 | 0.9 | 0.96 | 0.91 | 0.9 | 0.97 |
| Laplacian SVM | 0.9 | 0.9 | 0.9 | 0.91 | 0.9 | 0.92 |
| LabelSpreading (RBF) | 0.78 | 0.87 | 0.84 | 0.8 | 0.85 | 0.86 |
| LabelSpreading (KNN) | 0.68 | 0.80 | 0.74 | 0.71 | 0.8 | 0.8 |
| Co-training (SVM) | 0.82 | 0.94 | 0.92 | 0.85 | 0.94 | 0.93 |
| SVM | 0.82 | 0.9 | 0.91 | 0.85 | 0.92 | 0.93 |
| KNN | 0.76 | 0.91 | 0.81 | 0.79 | 0.92 | 0.84 |
| Gaussian NB | 0.78 | 0.91 | 0.9 | 0.82 | 0.9 | 0.9 |
| Logistic regression | 0.82 | 0.89 | 0.88 | 0.85 | 0.92 | 0.92 |
| AdaBoost | 0.82 | 0.85 | 0.85 | 0.85 | 0.88 | 0.88 |
| Random forest | 0.81 | 0.89 | 0.89 | 0.83 | 0.91 | 0.92 |
| Learner | Precision \(class_p\) | Precision \(class_n\) | Recall \(class_p\) | Recall \(class_n\) | F1-score \(class_p\) | F1-score \(class_n\) |
|---|---|---|---|---|---|---|
| \(S^3VM-R\) | 0.91 | 0.92 | 0.91 | 0.93 | 0.91 | 0.92 |
| Laplacian SVM | 0.86 | 0.89 | 0.88 | 0.9 | 0.87 | 0.88 |
| LabelSpreading (RBF) | 0.76 | 0.78 | 0.77 | 0.73 | 0.8 | 0.81 |
| LabelSpreading (KNN) | 0.65 | 0.7 | 0.71 | 0.68 | 0.69 | 0.73 |
| Co-training (SVM) | 0.81 | 0.84 | 0.71 | 0.92 | 0.73 | 0.87 |
| SVM | 0.86 | 0.83 | 0.68 | 0.96 | 0.74 | 0.88 |
| KNN | 0.72 | 0.8 | 0.63 | 0.88 | 0.65 | 0.83 |
| Gaussian NB | 0.79 | 0.81 | 0.72 | 0.85 | 0.73 | 0.81 |
| Logistic regression | 0.81 | 0.85 | 0.71 | 0.93 | 0.74 | 0.88 |
| AdaBoost | 0.86 | 0.83 | 0.68 | 0.95 | 0.74 | 0.88 |
| Random forest | 0.77 | 0.85 | 0.73 | 0.89 | 0.73 | 0.86 |
| Learner | Precision \(class_p\) | Precision \(class_n\) | Recall \(class_p\) | Recall \(class_n\) | F1-score \(class_p\) | F1-score \(class_n\) |
|---|---|---|---|---|---|---|
| \(S^3VM-R\) | 0.91 | 0.9 | 0.9 | 0.9 | 0.91 | 0.92 |
| Laplacian SVM | 0.91 | 0.9 | 0.9 | 0.91 | 0.89 | 0.92 |
| LabelSpreading (RBF) | 0.8 | 0.86 | 0.82 | 0.83 | 0.81 | 0.85 |
| LabelSpreading (KNN) | 0.7 | 0.75 | 0.73 | 0.78 | 0.79 | 0.77 |
| Co-training (SVM) | 0.96 | 0.91 | 0.91 | 0.97 | 0.93 | 0.93 |
| SVM | 0.93 | 0.91 | 0.84 | 0.97 | 0.87 | 0.93 |
| KNN | 0.87 | 0.92 | 0.88 | 0.94 | 0.87 | 0.93 |
| Gaussian NB | 0.78 | 0.96 | 0.94 | 0.87 | 0.84 | 0.91 |
| Logistic regression | 0.98 | 0.89 | 0.81 | 0.98 | 0.88 | 0.93 |
| AdaBoost | 0.88 | 0.88 | 0.75 | 0.95 | 0.78 | 0.91 |
| Random forest | 0.93 | 0.89 | 0.81 | 0.97 | 0.85 | 0.93 |
| Learner | Precision \(class_p\) | Precision \(class_n\) | Recall \(class_p\) | Recall \(class_n\) | F1-score \(class_p\) | F1-score \(class_n\) |
|---|---|---|---|---|---|---|
| \(S^3VM-R\) | 0.97 | 0.97 | 0.95 | 0.98 | 0.94 | 0.95 |
| Laplacian SVM | 0.96 | 0.94 | 0.91 | 0.96 | 0.91 | 0.93 |
| LabelSpreading (RBF) | 0.83 | 0.86 | 0.82 | 0.84 | 0.81 | 0.86 |
| LabelSpreading (KNN) | 0.71 | 0.74 | 0.75 | 0.78 | 0.8 | 0.78 |
| Co-training (SVM) | 0.92 | 0.9 | 0.9 | 0.94 | 0.91 | 0.92 |
| SVM | 0.96 | 0.92 | 0.84 | 0.97 | 0.89 | 0.94 |
| KNN | 0.84 | 0.83 | 0.67 | 0.95 | 0.73 | 0.88 |
| Gaussian NB | 0.77 | 0.96 | 0.94 | 0.87 | 0.84 | 0.91 |
| Logistic regression | 0.95 | 0.9 | 0.79 | 0.97 | 0.85 | 0.93 |
| AdaBoost | 0.88 | 0.88 | 0.75 | 0.95 | 0.78 | 0.91 |
| Random forest | 0.93 | 0.9 | 0.82 | 0.97 | 0.86 | 0.93 |
Blind evaluation
| Learner | AUC \({\mathcal {F}}_1\) | AUC \({\mathcal {F}}_2\) | AUC \(\{{\mathcal {F}}_1, {\mathcal {F}}_2\}\) |
|---|---|---|---|
| Laplacian SVM | 0.9 | 0.92 | 0.93 |
| LabelSpreading (RBF) | 0.75 | 0.85 | 0.87 |
| LabelSpreading (KNN) | 0.7 | 0.82 | 0.79 |
| Co-training (SVM) | 0.8 | 0.9 | 0.91 |
| SVM | 0.8 | 0.65 | 0.69 |
| KNN | 0.74 | 0.62 | 0.77 |
| Gaussian NB | 0.77 | 0.51 | 0.52 |
| Logistic regression | 0.76 | 0.62 | 0.75 |
| AdaBoost | 0.77 | 0.74 | 0.74 |
| Random forest | 0.8 | 0.8 | 0.8 |
Hyperparameter sensitivity
Significance of features
| No. | Feature group | \(\chi ^2\) | Selected |
|---|---|---|---|
| 1 | Advertisement language pattern | | |
| | Third person language | 8.4 | \(\checkmark \) |
| | First person plural pronouns | 9.5 | \(\checkmark \) |
| | Kolmogorov complexity | 0.7 | \(\checkmark \) |
| | n-grams (1) | 0.4 | |
| | n-grams (2) | 0.0 | |
| | n-grams (3) | 0.4 | |
| 2 | Words and phrases of interest | 0.0 | |
| 3 | Countries of interest | 59.3 | \(\checkmark \) |
| 4 | Multiple victims advertised | 14.1 | \(\checkmark \) |
| 5 | Victim weight | 0.2 | |
| 6 | Reference to website or spa massage therapy | | |
| | Reference to website | 0.1 | |
| | Reference to spa massage therapy | 33.5 | \(\checkmark \) |
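As an illustrative sketch, \(\chi^2\) scores like those in the table above could be computed with scikit-learn; the synthetic data, variable names, and the choice of k below are assumptions for demonstration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical stand-ins: a small non-negative feature matrix for F1
# (rows = labeled ads, columns = feature groups) and 0/1 expert labels.
rng = np.random.default_rng(0)
X_f1 = rng.integers(0, 5, size=(200, 12)).astype(float)
y = rng.integers(0, 2, size=200)

# Per-feature chi-square statistics, analogous to the scores in the table above.
scores, p_values = chi2(X_f1, y)

# Keep the most discriminative features to form F1*; k=6 mirrors the number of
# check-marked rows in the table, though the paper's exact rule is not stated.
selector = SelectKBest(chi2, k=6).fit(X_f1, y)
X_f1_star = selector.transform(X_f1)
```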
- Considering only the feature space \({\mathcal {F}}_1\), our approach achieved higher performance than all other baselines, whether using the whole feature space or only the most discriminative features \({\mathcal {F}}^*_1\).
- Deploying only the features from \({\mathcal {F}}^*_1\), we achieved results comparable to those obtained with the whole feature space \({\mathcal {F}}_1\).
| Name | \(\overline{{\mathcal {F}}^*_1}\) | \({\mathcal {F}}^*_1\) | \({\mathcal {F}}_1\) |
|---|---|---|---|
| \(S^3VM-R\) | 0.82 | 0.87 | 0.91 |