Introduction
Related works
Related work | Big Data tool | Algorithm | Dataset |
---|---|---|---|
[7] | Apache Spark | K-Means | KDD 99 |
[8] | Anaconda | K-Means++ | KDD 99 |
[9] | Anaconda | Decision tree | KDD 99 |
[10] | Apache Spark | SVM, Naïve Bayes, Decision Tree and Random Forest | UNSW-NB 15 |
[11] | Apache Storm | C-SVM | KDD 99 |
[12] | Apache Spark | Neural network, SVM, Decision Tree, Naïve Bayes and Random Forest | Synchrophasor |
[13] | Apache Spark | Naïve Bayes, REP TREE, Random Tree, Random Forest, Random Committee, Bagging and Randomizable Filtered | UNSW-NB 15 |
[14] | Apache Spark | SVM | KDD 99 |
[15] | Hadoop | Parallel Naïve Bayes | KDD 99 |
Methods
Spark Chi SVM proposed model
Dataset description
No | Attribute name | No | Attribute name |
---|---|---|---|
1 | Duration | 22 | Is_guest_login |
2 | Protocol_type | 23 | Count |
3 | Service | 24 | Serror_rate |
4 | Src_bytes | 25 | Rerror_rate |
5 | Dst_bytes | 26 | Same_srv_rate |
6 | Flag | 27 | Diff_srv_rate |
7 | Land | 28 | Srv_count |
8 | Wrong_fragment | 29 | Srv_serror_rate |
9 | Urgent | 30 | Srv_rerror_rate |
10 | Hot | 31 | Srv_diff_host_rate |
11 | Num_failed_logins | 32 | Dst_host_count |
12 | Logged_in | 33 | Dst_host_srv_count |
13 | Num_compromised | 34 | Dst_host_same_srv_rate |
14 | Root_shell | 35 | Dst_host_diff_srv_rate |
15 | Su_attempted | 36 | Dst_host_same_src_port_rate |
16 | Num_root | 37 | Dst_host_srv_diff_host_rate |
17 | Num_file_creations | 38 | Dst_host_serror_rate |
18 | Num_shells | 39 | Dst_host_srv_serror_rate |
19 | Num_access_files | 40 | Dst_host_rerror_rate |
20 | Num_outbound_cmds | 41 | Dst_host_srv_rerror_rate |
21 | Is_host_login | 42 | class |
Apache Spark
Data preprocessing
Standardization
Stage | Dataset record (LabeledPoint) |
---|---|
The record before standardization | res1:org.apache.spark.mllib.regression.LabeledPoint = (1.0,[0.0,181.0,5450.0,0.0,0.0,0.0,0.0,0.0,1.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,8.0, 0.0,0.0,0.0,0.0,1.0,0.0,0.0,9.0,9.0,1.0,0.0,0.11, 0.0,0.0,0.0,0.0,0.0]) |
The record after standardization | res2:org.apache.spark.mllib.regression.LabeledPoint = (1.0,[0.0,1.8315794844034117E-4,0.16495156759878019, 0.0,0.0,0.0,0.0,0.0,2.814168444874875, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.03753270996838475,0.03247770581832668,0.0,0.0,0.0,0.0, 2.576061480099788,0.0,0.0,0.13900605646702138, 0.0848732827397667,2.434387313317322,0.0, 0.22854329046843286,0.0,0.0,0.0,0.0,0.0]) |
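
The two records above are consistent with dividing each feature by its standard deviation without subtracting the mean: zero entries stay zero and the label is unchanged. That reading is an inference from the numbers, not something the table states. A minimal plain-Python sketch of this kind of scaling follows; the function name `scale_to_unit_std` and the toy rows are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import stdev


def scale_to_unit_std(rows):
    """Divide each column by its sample standard deviation (no mean centering).

    Columns with zero variance are mapped to 0.0 (an assumption of this sketch).
    """
    columns = list(zip(*rows))
    stds = [stdev(col) for col in columns]
    return [
        [value / s if s > 0 else 0.0 for value, s in zip(row, stds)]
        for row in rows
    ]


# Toy rows with hypothetical values (not taken from KDD 99).
rows = [
    [0.0, 100.0, 5000.0],
    [0.0, 250.0, 400.0],
    [0.0, 175.0, 1300.0],
]
scaled = scale_to_unit_std(rows)

for row in scaled:
    print(row)
```

After scaling, every non-constant column has unit standard deviation, so features measured in large units (such as byte counts) no longer dominate features expressed as rates.
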
Feature selection
numTopFeatures | AUROC result (%) |
---|---|
25 | 99.49 |
22 | 99.51 |
17 | 99.55 |
15 | 99.49 |
11 | 92.81 |
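
The `numTopFeatures` values above control how many features survive chi-squared selection. As a rough illustration of the idea, the sketch below scores each feature by its Pearson chi-squared statistic against the class label and keeps the highest-scoring ones. It is a simplified plain-Python sketch, not the Spark implementation: the function names and toy data are hypothetical, and it treats every feature value as a category.

```python
from collections import Counter


def chi_squared(feature, labels):
    """Pearson chi-squared statistic of the feature/label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


def select_top_features(rows, labels, num_top_features):
    """Return the sorted indices of the highest-scoring columns."""
    columns = list(zip(*rows))
    scores = [chi_squared(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_top_features])


# Toy data (hypothetical): column 0 tracks the label, column 1 is
# independent of it, and column 2 is constant.
rows = [
    [1, 0, 5],
    [1, 1, 5],
    [0, 0, 5],
    [0, 1, 5],
]
labels = [1, 1, 0, 0]
selected = select_top_features(rows, labels, num_top_features=1)
print(selected)
```

In the table above, AUROC peaks at 17 features and drops sharply at 11, which suggests that the features ranked 12 to 17 still carry information the classifier needs.
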
Model classifier
Results and discussion
Classifiers | AUROC (%) | AUPR (%) |
---|---|---|
SVM only | 96.80 | 94.36 |
Chi-SVM | 99.55 | 96.24 |
Logistic regression | 92.70 | 92.77 |
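
For readers unfamiliar with the AUROC column: it can be read as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison. It is an illustrative plain-Python version with hypothetical scores and labels, and it is not necessarily how the values in the table were computed.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.35, 0.6, 0.1]
labels = [1, 1, 1, 0, 0]
area = auroc(scores, labels)
print(f"AUROC = {area:.4f}")
```

The pairwise form is quadratic in the number of records, so it suits small examples only; on a dataset the size of KDD 99 a rank-based or curve-based computation would be used instead.
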
Classifiers | Training time | Prediction time |
---|---|---|
SVM only | 25.5 s | 1.37 s |
Chi-SVM | 10.79 s | 1.21 s |
Logistic regression | 25.44 s | 1.58 s |
Work | Training time (s) | Prediction time (s) |
---|---|---|
Spark-Chi-SVM | 10.79 | 1.21 |
SVM classifier in [10] | 38.91 | 0.20 |
[15] | 1467 | 792 |
SVM classifier in [30] | 530.45 | 19.02 |
SVM classifier in [31] | 561.044 | 26.369 |
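
To put the training times in the table above on a common footing, the following short sketch expresses each one as a multiple of the Spark-Chi-SVM training time. The figures are copied from the table; the variable names are only illustrative.

```python
# Training times in seconds, copied from the comparison table.
training_time_s = {
    "Spark-Chi-SVM": 10.79,
    "SVM classifier in [10]": 38.91,
    "[15]": 1467.0,
    "SVM classifier in [30]": 530.45,
    "SVM classifier in [31]": 561.044,
}

baseline = training_time_s["Spark-Chi-SVM"]

# Ratio of each training time to the Spark-Chi-SVM training time.
ratios = {name: t / baseline for name, t in training_time_s.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the Spark-Chi-SVM training time")
```

These ratios are indicative only, since the table does not say whether the compared works used the same hardware or data. The prediction-time column also does not uniformly favor Spark-Chi-SVM: the SVM classifier in [10] reports 0.20 s against 1.21 s.
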