1 Introduction
1.1 Experimental method
-
We propose a learning-based mechanism to evaluate the similarity of web page layouts and identify phishing pages.
-
We define rules to extract and construct effective page layout features and develop a phishing page classifier based on four typical learning algorithms: Support Vector Machine, Decision Tree, AdaBoost, and Random Forest.
-
We prototyped our approach and evaluated it with real-world web page samples from phishtank.com. The experimental results demonstrate the effectiveness of our approach.
1.2 Paper organization
2 Related work
2.1 Page-feature-based phishing detection
2.2 Learning-based phishing detection
Detection approach | Input features | Input samples | Classifiers | Precision |
---|---|---|---|---|
Pan et al. [10] | 7 features | About 380 | 1 | *** |
Xiang et al. [11] | 15 features | About 8120 | 6 | **** |
Abdelhamid et al. [22] | 16 features | About 1350 | 6 | ** |
Mao et al. [24] | 1 feature | About 2930 | 2 | ** |
Our work | 1 feature | About 26580 | 4 | ** |
3 Background and overview
3.1 Page layout features
3.2 Learning-based layout similarity detection
3.3 Approach overview
3.3.1 Similarity classifier training
3.3.2 Phishing web page detection based on layout similarity
4 Learning-based similar page layout classification
4.1 Property vector extraction
4.1.1 Property vector generation
4.1.2 Comparison vector generation
-
We first unify the values of the same property into the same dimension. For a given property \(P_{k}\), suppose Page1 has \(m_{1}\) value-weight pairs \(\left \{{V_{k}^{1}}:{A_{k}^{1}}, \ldots, V_{k}^{m_{1}}:A_{k}^{m_{1}}\right \}\) and Page2 has \(m_{2}\) pairs \(\left \{{V_{k}^{1}}:{A_{k}^{1}}, \ldots, V_{k}^{m_{2}}:A_{k}^{m_{2}}\right \}\). We take \(m = \max (m_{1}, m_{2})\) and extend the shorter property list to length \(m\) by padding with zeros. The outputs of this step are Page1 \(\left \{{V_{k}^{1}}:{A_{k}^{1}}, \ldots, {V_{k}^{m}}:{A_{k}^{m}}\right \}\) and Page2 \(\left \{{V_{k}^{1}}:{A_{k}^{1}}, \ldots, {V_{k}^{m}}:{A_{k}^{m}}\right \}\) with the same dimension.
-
We compute the difference between \(Page1:{V_{k}^{i}}\) and \(Page2:{V_{k}^{i}}\) for each \(i \in \{1, \ldots, m\}\) and weight it by the maximum of \(Page1:{A_{k}^{i}}\) and \(Page2:{A_{k}^{i}}\). The result is denoted by \({\varepsilon _{k}^{i}}\): \({\varepsilon _{k}^{i}} = \left |Page1:{V_{k}^{i}} - Page2:{V_{k}^{i}}\right | \times \max \left (Page1:{A_{k}^{i}}, Page2:{A_{k}^{i}}\right)\). This yields one value \({\varepsilon _{k}^{i}}\) for the ith pair \({V_{k}^{i}}:{A_{k}^{i}}\).
-
We calculate all the \({\varepsilon _{k}^{i}}\) of \(P_{k}\) and obtain \(\varepsilon _{k} = \sum _{i=1}^{m} {\varepsilon _{k}^{i}}\).
-
After repeating the previous steps for each property \(P_{k}\), we finally obtain the comparison property vector of Page1 and Page2, denoted \([\varepsilon _{1}, \varepsilon _{2}, \ldots, \varepsilon _{k}]\); a code sketch of the whole procedure follows below.
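The following is a minimal Python sketch of this procedure, assuming each page has already been reduced to a mapping from properties to lists of (value, weight) pairs; the data representation and function name are illustrative, not taken from the paper.

```python
# Minimal sketch of comparison-vector generation. Each page is assumed to be
# a dict {property: [(value, weight), ...]}; this representation is an
# illustrative assumption, not the paper's concrete data format.
def comparison_vector(page1, page2):
    vector = []
    # Iterate over the union of properties in a fixed order, one epsilon_k each.
    for prop in sorted(set(page1) | set(page2)):
        pairs1 = list(page1.get(prop, []))
        pairs2 = list(page2.get(prop, []))
        # Unify dimensions: pad the shorter list with zeros up to length m.
        m = max(len(pairs1), len(pairs2))
        pairs1 += [(0.0, 0.0)] * (m - len(pairs1))
        pairs2 += [(0.0, 0.0)] * (m - len(pairs2))
        # epsilon_k = sum_i |V1_i - V2_i| * max(A1_i, A2_i)
        eps = sum(abs(v1 - v2) * max(a1, a2)
                  for (v1, a1), (v2, a2) in zip(pairs1, pairs2))
        vector.append(eps)
    return vector
```

Two visually similar pages then yield a comparison vector close to the zero vector, which is what the similarity classifier learns to recognize.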
4.2 Classifier building
-
Support Vector Machine (SVM). SVM seeks an optimal separating hyperplane by maximizing the margin between the classes' closest points. Only the small set of support vectors (SVs) produced during training determines the classifier's decision, which avoids the curse of dimensionality and offers good robustness. Table 2 compares the four classification algorithms.

Table 2 Comparison of the four classification algorithms

Classifier | Robustness and efficiency | Dataset scale |
---|---|---|
SVM | ∘∘∘∘∘ | * |
DT | ∘∘∘∘ | * |
AB | ∘∘∘ | ** |
RF | ∘∘∘∘∘ | ** |
-
Decision tree (DT). DT classifies items by making a decision at each branch so as to obtain as much information gain as possible. A decision tree consists of a root node, several internal nodes, and leaf nodes. Leaf nodes denote the classification results, while each of the other nodes tests an attribute. Every route from the root node to a leaf node corresponds to a sequence of decision tests. The algorithm follows the divide-and-conquer principle.
-
AdaBoost (AB). Boosting is a family of ensemble learning algorithms that combine weak learners into a strong learner; AB is a representative boosting algorithm. Training starts with a base learner, adjusts the sample distribution according to that learner's performance, and then trains the next base learner on the adjusted distribution, iterating this process. The base learners' outputs are given different weights that contribute to the final output of the boosted classifier. AB is thus a serial ensemble algorithm.
-
Random Forest (RF). In contrast to boosting, Bagging is a parallel ensemble learning algorithm. It draws different sample sets from the training set, trains a base learner on each, and combines the base learners to produce the final result. RF extends Bagging by building many decision trees and outputting the majority-vote class. It also introduces random attribute selection, which yields stronger generalization. A sketch of building all four classifiers follows this list.
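As a concrete illustration, the sketch below builds the four classifiers with scikit-learn. The synthetic dataset is a stand-in for the comparison vectors of Section 4, and the parameter values anticipate the near-optimal settings found in Section 5; none of this is the paper's exact implementation.

```python
# Minimal sketch of the four classifiers with scikit-learn. The synthetic
# data below stands in for real comparison vectors; parameter values follow
# the near-optimal settings reported in the evaluation section.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Stand-in dataset: 2000 labeled comparison vectors with 20 dimensions.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

classifiers = {
    "SVM": SVC(gamma=0.0002),
    "DT": DecisionTreeClassifier(max_depth=25),
    "AB": AdaBoostClassifier(n_estimators=250),
    "RF": RandomForestClassifier(n_estimators=100),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)               # train on comparison vectors
    print(name, clf.score(X_test, y_test))  # mean accuracy on the held-out set
```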
5 Evaluation
Source: PhishTank

Dataset | Positive samples | Negative samples |
---|---|---|
Training set | 3719 | 17926 |
Testing set | 414 | 1992 |
5.1 Classifier effectiveness
-
Support Vector Machine (SVM). We employ SVM as the classifier and measure the four metrics against the parameter gamma of the SVM algorithm. According to the experimental results shown in Fig. 3, the accuracy is around 96%, while the other three metrics are mostly above 80%. As gamma rises, accuracy and F1 remain almost constant, recall falls slightly, and precision rises slightly. When gamma is about 0.0002, the four metrics approach their best values.
-
Decision Tree (DT). We employ DT as the classifier and measure the four metrics against the parameter max_depth of the DT algorithm. The results are shown in Fig. 4: the four metrics remain nearly constant once max_depth exceeds 20, fluctuating only slightly. The accuracy is about 93%, while precision is the lowest at around 80%. When max_depth is about 25, the four metrics reach their best values.
-
AdaBoost (AB). We employ AB as the classifier and measure the four metrics against the parameter n_estimators of the AB algorithm. The four metrics displayed in Fig. 5 are all above 82.5% and increase slightly as n_estimators grows. The accuracy is close to 94%. When n_estimators is about 250, the system reaches a near-optimal performance.
-
Random Forest (RF). We employ RF as the classifier and measure the four metrics against the parameter n_estimators of the RF algorithm. Figure 6 gives the experimental results: the accuracy is above 96%, the other three metrics are above 90%, and all remain nearly stable across different values of n_estimators. The system performs best when n_estimators is about 100. A sketch of one such parameter sweep follows below.
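The sketch below illustrates one such sweep (RF over n_estimators) together with the computation of the four metrics; the synthetic data again stands in for the real comparison vectors, and the sweeps over gamma and max_depth proceed analogously.

```python
# Minimal sketch of a parameter sweep computing the four metrics of Figs. 3-6.
# Synthetic data stands in for real comparison vectors; the grid of
# n_estimators values is illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for n in (50, 100, 150, 200, 250):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    print(f"n_estimators={n}: "
          f"acc={accuracy_score(y_test, y_pred):.3f} "
          f"prec={precision_score(y_test, y_pred):.3f} "
          f"rec={recall_score(y_test, y_pred):.3f} "
          f"f1={f1_score(y_test, y_pred):.3f}")
```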
5.2 Effect of positive-negative sample distributions
5.3 Sensitivity to size of training set
5.4 Results and discussion
Classifier | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|
SVM | 0.96948 | 0.96235 | 0.86115 | 0.90134 |
DT | 0.93676 | 0.80674 | 0.90015 | 0.84467 |
AB | 0.94500 | 0.85218 | 0.93378 | 0.87145 |
RF | 0.97310 | 0.93695 | 0.92046 | 0.92078 |
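For reference, the four metrics follow their standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\text{Precision} = \frac{TP}{TP + FP}, \quad
\text{Recall} = \frac{TP}{TP + FN}, \quad
F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]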