Introduction
This work extends the previous research of [30] to develop an accurate classification-based NIDS that is robust to adversarial attacks. It presents a new multiple-clusterings-based method that handles imbalanced classification by undersampling the majority class. The proposed framework allows any classification algorithm to learn better from the minority class, which represents attack connections. As such, the resulting model becomes more effective at recognizing obfuscated intrusions, whose appearances partly overlap with those of known direct attacks. In addition, the resulting framework generalizes to a wider range of non-payload adversarial attacks.
The proposed undersampling technique is based on an iterative, greedy optimization process that selects the best alternative from a pool of centroids representing different clustering results or data partitions. This is inspired by the concept of ensemble or consensus clustering [31, 34], which usually provides a more accurate data partition than a single clustering. The idea is novel and may improve upon the initial work of [46], which uses only one clustering to guide the sampling procedure. The aforementioned selection is designed as an optimization problem that maximizes one of three objective functions: (i) the maximum distance from the centroid in question to its k nearest centroids in the pool, (ii) the maximum distance from the examined centroid to its k nearest instances of the minority class, and (iii) the average of functions (i) and (ii).
The proposed model is evaluated against the single-clustering alternative of [46]; RUS (Random UnderSampling [60]) as the conventional undersampling model; other well-known ensemble-level techniques, namely RUSBoost [60] and IRUS (Inverse Random UnderSampling [66]); the oversampling technique of [15]; and its recent extension [70]. To generalize this experimental investigation, a range of basic classifiers and a classifier ensemble method are employed. Specific to the line of IDS research, a state-of-the-art model is also included as a comparison method. Furthermore, a parameter analysis specific to the proposed technique is included, illustrating and discussing the relation between predictive performance and algorithmic variables. Anyone applying the proposed model to a future problem would find this analysis helpful for tailoring it to achieve good performance.
Abbreviation | Description |
---|---|
AdaBoost | AdaBoost technique for boosting-based ensemble generation |
ANN | Artificial neural networks |
ASNM | Advanced security network metrics |
C4.5 | A decision tree algorithm |
DNN | Deep neural networks |
DT | Decision tree |
HTTP | Hypertext transfer protocol |
IDS | Intrusion detection system |
IP | Internet protocol |
IRUS | Inverse random undersampling |
KNN | K nearest neighbours |
LDA | Linear discriminant analysis |
LR | Logistic regression |
MTU | Maximum transmission unit |
NB | Naive Bayes |
NIDS | Network-based intrusion detection system |
PCA | Principal component analysis |
RF | Random forest |
RUS | Random undersampling |
RUSBoost | Random undersampling with boosting-based ensemble generation |
SMOTE | Synthetic minority oversampling technique |
SVM | Support vector machine |
TCP | Transmission control protocol |
TCP FIN | TCP finish flag |
TCP URG | TCP urgent flag |
Related works
Proposed work | Exploited technique | Treatment of class imbalance |
---|---|---|
Online oversampling PCA for anomaly detection [43] | PCA | Oversampling |
Adaptive ensemble learning model [23] | DT, RF, KNN & DNN | n/a |
Handling of imbalance class in IDS [1] | RF and DNN | Oversampling & undersampling |
Hybrid data mining approach [55] | Clustering, feature selection & KNN | Oversampling |
Information-Gain based feature selection [69] | RF & feature selection | Oversampling |
Filter-based attribute reduction [14] | K-means & SVM | n/a |
Handling of imbalance class in IDS [36] | DT, RF, KNN & LDA | Oversampling |
Determining informative features [19] | RF & feature selection | n/a |
Recursive feature elimination [62] | SVM, DT & RF | n/a |
Feature reduced IDS [4] | ANN | n/a |
Artificial bee colony for improved classifier ensemble [48] | DT & AdaBoost | n/a |
Forward feature selection [30] | DT, NB, LR & SVM | n/a |
Classification-based NIDS and handling of the class imbalance problem
Adversarial attack and non-payload-based obfuscation
- Spreading packets in time: constant delays of 1 s and 8 s, as well as normally distributed delays with a mean of 5 s and a standard deviation of 2.5 s (25% correlation)
- Packet loss: 25% of packets
- Unreliable network channel simulation: 25% of packets damaged, 35% of packets damaged, and 35% of packets damaged with 25% correlation
- Packet duplication: 5% of packets
- Packet order modification: reordering of 25% and 50% of packets; reordered packets are sent with a 10 ms delay and 50% correlation
- Fragmentation: MTU 1000, MTU 750, MTU 500 and MTU 250
- Combinations of the aforementioned techniques (a sketch of how such perturbations can be emulated follows this list)
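Perturbations like these are commonly emulated with the Linux netem queueing discipline. As a hedged sketch (in Python, for consistency with the later examples), the snippet below shows how a few of the listed settings might be expressed; the interface name, the profile names, and the choice to shell out to tc are illustrative assumptions, not details taken from the original study.

```python
import subprocess

# Each entry maps one obfuscation from the list above onto a tc-netem discipline;
# consult tc-netem(8) for the exact grammar of each option.
NETEM_PROFILES = {
    "delay_normal": "delay 5s 2.5s 25%",      # delays ~ N(5 s, 2.5 s), 25% correlation
    "loss": "loss 25%",                       # drop 25% of packets
    "corrupt": "corrupt 35% 25%",             # damage 35% of packets, 25% correlation
    "duplicate": "duplicate 5%",              # duplicate 5% of packets
    "reorder": "delay 10ms reorder 25% 50%",  # reorder 25% of packets, 50% correlation
}

def apply_profile(dev: str, profile: str) -> None:
    """Attach the chosen netem discipline to an interface (requires root)."""
    subprocess.run(["tc", "qdisc", "replace", "dev", dev, "root", "netem",
                    *NETEM_PROFILES[profile].split()], check=True)

def set_mtu(dev: str, mtu: int) -> None:
    """Force fragmentation by lowering the interface MTU (e.g., 1000, 750, 500, 250)."""
    subprocess.run(["ip", "link", "set", "dev", dev, "mtu", str(mtu)], check=True)
```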
Proposed method
- Random-k method: these clusterings are generated using k-means with a cluster number that is randomly chosen from the range \(\{2, \ldots, \sqrt{|X_0|}\}\), or \(\{2, \ldots, 50\}\) when \(\sqrt{|X_0|} > 50\) (see the report of [32] for relevant details).
- Random-subspace method: each base clustering is generated from a random feature subspace \(V_0^{\prime } \in [0, 1]^{n_0 \times d^{\prime }}\) of the feature space \(V_0\). The dimension \(d^{\prime }\) of each data subspace is determined by$$\begin{aligned} d^{\prime } = d^{\prime }_{min} + \lfloor \epsilon (d^{\prime }_{max} - d^{\prime }_{min}) \rfloor , \end{aligned}$$(7)provided that \(\epsilon \in [0, 1]\) is a uniform random variable, while \(d^{\prime }_{min}\) and \(d^{\prime }_{max}\) denote the lower and upper bounds of the generated subspace \(V_0^{\prime }\). Following the initial work of [33], \(d^{\prime }_{min}\) and \(d^{\prime }_{max}\) are set to 0.75d and 0.85d, respectively. With the dimension decided, one of the d features is picked at a time until the desired subspace of \(d^{\prime }\) non-duplicated features is obtained. The index of each randomly selected feature is determined by$$\begin{aligned} h = \lfloor 1 + \eta d \rfloor , \end{aligned}$$(8)where h denotes the \(h^{th}\) feature in the pool of d attributes and \(\eta \in [0, 1)\) is another uniform random variable. As a working example, Fig. 2 summarizes the process of generating a set of representatives from a clustering of two clusters, while Fig. 3 extends this procedure to a selection of representatives from a pool of centroids generated by multiple clusterings. Note that the underlying selection process will be explained next; a brief code sketch of this pool-generation step follows the list.
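To make the pool-generation step concrete, the following is a minimal Python sketch assuming numpy and scikit-learn. The function names, the flat array layout of the pool \(Z_{\varPi }\), and the decision to recompute subspace-clustering centroids in the full feature space (so that all centroids remain mutually comparable) are illustrative assumptions rather than details prescribed by the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def random_subspace_indices(d, rng, lo=0.75, hi=0.85):
    """Draw 0-based column indices of a random feature subspace per Eqs. (7)-(8)."""
    d_min, d_max = int(lo * d), int(hi * d)                # d'_min = 0.75d, d'_max = 0.85d [33]
    d_prime = d_min + int(rng.random() * (d_max - d_min))  # Eq. (7)
    chosen = []
    while len(chosen) < d_prime:                           # keep features non-duplicated
        h = int(1 + rng.random() * d)                      # Eq. (8), 1-based feature index
        if h not in chosen:
            chosen.append(h)
    return np.array(chosen) - 1                            # back to 0-based columns

def generate_centroid_pool(X0, M, rng, subspace=False):
    """Collect the centroid pool Z_Pi from M base k-means clusterings of X0."""
    k_cap = min(int(np.sqrt(len(X0))), 50)  # Random-k: k from {2, ..., min(sqrt(|X0|), 50)}
    pool = []
    for _ in range(M):
        k = int(rng.integers(2, k_cap + 1))
        cols = random_subspace_indices(X0.shape[1], rng) if subspace else slice(None)
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=int(rng.integers(2**31))).fit_predict(X0[:, cols])
        # centroids are recomputed in the full feature space so they stay comparable
        pool.extend(X0[labels == c].mean(axis=0) for c in range(k))
    return np.vstack(pool)

# usage: rng = np.random.default_rng(0); Z_pi = generate_centroid_pool(X0, M=150, rng=rng)
```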
- Step 1: To start with, the first member \(z_1\) of Z is selected from \(Z_{\varPi }\), provided that \(z_1\) maximizes the average distance to the other centroids in \(Z_{\varPi }\).$$\begin{aligned} z_1 = {\mathop {{{\,\mathrm{argmax}\,}}}\limits _{\forall z_a \in Z_{\varPi }}} \> \> \frac{\sum \nolimits _{\forall z_b \in Z_{\varPi }, z_b \ne z_a} \> {d(z_a, z_b)}}{|Z_{\varPi }| - 1} \end{aligned}$$(9)where \(d(x_i, x_j)\) denotes the distance between two samples \(x_i\) and \(x_j\); the Euclidean metric is employed in this research to estimate it. In addition, |A| represents the size of set A. At the end of this step, |Z| is 1, while \(|Z_{\varPi }|\) is reduced by 1, i.e., \(Z_{\varPi } = Z_{\varPi } - z_1\).
- Step 2: Next, a new member is iteratively chosen from \(Z_{\varPi }\) and moved to Z. To be exact, in each iteration, a greedy optimization is exploited to determine the best centroid \(z_c \in Z_{\varPi }\) using one of the three objective functions defined below. As a result, the two sets are updated by \(Z = Z \cup z_c\) and \(Z_{\varPi } = Z_{\varPi } - z_c\), respectively. This is repeated until all \(\rho \) members are obtained, i.e., \(|Z| = \rho \). The first objective function, called Furthest-First-Majority or shortly FF-Majority\((Z, Z_{\varPi })\), finds a centroid representing a unique feature space that minimally overlaps with those of the others in \(Z_{\varPi }\). This is coupled with a similar assessment against the existing members of Z, which forms the first term of the function.$$\begin{aligned} z_c = {\mathop {{{\,\mathrm{argmax}\,}}}\limits _{\forall z_a \in Z_{\varPi }}} \> \left( \frac{\sum \nolimits _{\forall z_e \in Z} \> d(z_a, z_e)}{|Z|} + \frac{\sum \nolimits _{\forall z_b \in Z_{\varPi }, z_b \ne z_a} \> {d(z_a, z_b)}}{|Z_{\varPi }| - 1} \right) \end{aligned}$$(10)The second objective function, called Furthest-First-Minority or shortly FF-Minority\((Z, Z_{\varPi }, X_1)\), finds a centroid in \(Z_{\varPi }\) that is largely different from the samples of the minority class \(X_1\). This can be formally specified as follows.$$\begin{aligned} z_c = {\mathop {{{\,\mathrm{argmax}\,}}}\limits _{\forall z_a \in Z_{\varPi }}} \> \left( \frac{\sum \nolimits _{\forall z_e \in Z} \> d(z_a, z_e)}{|Z|} + \frac{\sum \nolimits _{\forall x_i \in X_1} \> {d(z_a, x_i)}}{|X_1|} \right) \end{aligned}$$(11)The third alternative, called Furthest-First-Overall or shortly FF-Overall\((Z, Z_{\varPi }, X_1)\), combines the two previous functions to obtain an overall justification based on samples belonging to both the majority and minority classes. In particular, \(z_c\) is any \(z_a \in Z_{\varPi }\) that maximizes the following measurement; a code sketch of Steps 1 and 2 is given after this list.$$\begin{aligned}&\Bigg (\frac{\sum \nolimits _{\forall z_e \in Z} \> d(z_a, z_e)}{|Z|}+ \frac{1}{2}\Bigg ( \frac{\sum \nolimits _{\forall z_b \in Z_{\varPi }, z_b \ne z_a} \> {d(z_a, z_b)}}{|Z_{\varPi }| - 1} + \frac{\sum \nolimits _{\forall x_i \in X_1} \> {d(z_a, x_i)}}{|X_1|} \Bigg ) \Bigg ) \end{aligned}$$(12)
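The greedy selection itself can then be rendered as below; this is a minimal sketch assuming dense numpy arrays for the pool and the minority class, with illustrative names and a simple guard for an exhausted remainder. As a design choice, the first term shared by Eqs. (10)-(12) is computed once per candidate and reused across the three objectives.

```python
import numpy as np
from scipy.spatial.distance import cdist

def select_representatives(pool, X1, rho, objective="overall"):
    """Greedily pick rho representatives Z from the centroid pool Z_Pi (Eqs. 9-12)."""
    m = len(pool)
    D = cdist(pool, pool)                 # pairwise centroid distances (Euclidean)
    d_min = cdist(pool, X1).mean(axis=1)  # mean distance of each centroid to X_1
    remaining = list(range(m))

    # Step 1 (Eq. 9): start from the centroid farthest, on average, from all others
    first = int(np.argmax(D.sum(axis=1) / (m - 1)))
    Z = [first]
    remaining.remove(first)

    # Step 2 (Eqs. 10-12): iteratively add the maximizer of the chosen objective
    while len(Z) < rho and remaining:
        best, best_score = remaining[0], -np.inf
        for a in remaining:
            to_Z = D[a, Z].mean()                     # first term, shared by all three
            others = [b for b in remaining if b != a]
            to_rest = D[a, others].mean() if others else 0.0
            if objective == "majority":               # FF-Majority, Eq. (10)
                score = to_Z + to_rest
            elif objective == "minority":             # FF-Minority, Eq. (11)
                score = to_Z + d_min[a]
            else:                                     # FF-Overall, Eq. (12)
                score = to_Z + 0.5 * (to_rest + d_min[a])
            if score > best_score:
                best, best_score = a, score
        Z.append(best)
        remaining.remove(best)
    return pool[Z]  # the undersampled stand-ins for the majority class
```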
Network service | Legitimate | Direct attack | Obfuscated attack | Total |
---|---|---|---|---|
Apache Tomcat | 809 | 61 | 163 | 1033 |
DistCC | 100 | 12 | 23 | 135 |
MSSQL | 532 | 31 | 103 | 666 |
PostgreSQL | 737 | 13 | 45 | 795 |
Samba | 4641 | 19 | 44 | 4704 |
Server | 3339 | 26 | 100 | 3465 |
Other legitimate traffic | 647 | n/a | n/a | 647 |
All services | 10,805 | 162 | 478 | 11,445 |
Category (total number: selected number) | Feature name | Description |
---|---|---|
Statistical (total 77: selected 4) | SigPktLenOut | Standard deviation of outbound packet lengths |
| MeanPktLenIn | Mean of packet sizes in inbound traffic of a connection |
| ConTcpFinCntIn | Number of inbound packets of a connection with the FIN flag set |
| ConTcpPshCntIn | Number of inbound packets of a connection with the PSH flag set |
Localization (total 8: selected 0) | | |
Distributed (total 34: selected 0) | | |
Dynamic (total 32: selected 0) | | |
Behavioral (total 43: selected 10) | CntOfOldFlows | Number of mutual flows between the client/server hosts of the analyzed connection 5 min before it started |
| CntOfNewFlows | Number of mutual flows between the client/server hosts of the analyzed connection 5 min after it finished |
| FourGonModulIn[1] | The module of the 2nd coefficient of the FFT in goniometric representation |
| FourGonModulOut[1] | Same as the previous, but for outbound traffic |
| FourGonAngleN[9] | The angle of the 10th coefficient of the FFT in goniometric representation |
| FourGonModulN[0] | Same as the previous, but it represents the module of the 1st coefficient of the FFT |
| PolyInd3ordOut[3] | Same as the previous, but it represents the 4th coefficient of the approximation |
| GaussProds8Out[7] | Same as the previous, but computed over outbound packets; it represents the product of the 8th slice of packets with a Gaussian function fitted to the interval of the packets' slice |
| OutPktLen32s10i[3] | Same as the previous, but computed over the first 32 s of a connection; it totals the outbound packet lengths of the 4th interval |
| OutPktLen4s10i[2] | Same as the previous, but computed over the first 4 s of a connection; it totals the outbound packet lengths of the 3rd interval |
Performance evaluation
Investigated dataset and experimental design
- For the proposed framework, the target size of the multiple-clustering pool, M, is set to 150, and each specific setting of these models is repeated for 30 trials so that a reliable conclusion can be drawn from the non-deterministic processes, based on averages across multiple runs.
- For the classification algorithm \(\alpha \), four techniques are included: decision tree (C4.5) with a maximum depth of 15, Naive Bayes (NB) with the Gaussian kernel function, support vector machine (SVM) with the radial basis function kernel, and logistic regression (LR). These are employed with the dataset generated by the filter and wrapper techniques to create a classifier.
- Since the current research focuses on the robustness of machine-learning-based NIDS to obfuscated intrusions, a classifier is trained with instances representing legitimate connections and direct attacks only. Without knowledge of the obfuscated instances, the aim of this study is to see how well different methods recognize unseen threats. As such, stratified 10-fold cross-validation is first applied to the task of classifying legitimate connections and direct attacks, i.e., the examined data is the combination of instances belonging to these two classes only. The corresponding results illustrate the quality of classifiers at capturing usual patterns. On top of that, the classifier trained in each fold is used to predict each of the 478 obfuscated intrusions as either a legitimate or an attack connection. The latter experiment leads to a comparison of robustness as a classifier encounters truly new intrusive connections. Finally, the metrics used to assess predictive performance are TPR (True Positive Rate), FPR (False Positive Rate) and F1, respectively; results with respect to the Precision and Recall metrics are also provided in the Supplementary.1 However, TPR is the only appropriate alternative for the second experiment, as all examined samples belong to the same class, i.e., obfuscated intrusion. A sketch of this two-stage protocol follows this list.
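As a hedged illustration of the two-stage evaluation, the sketch below uses scikit-learn; the variable names (X, y, X_obf), the Gaussian NB default, and the omission of the undersampling step (which would be applied to the training split of each fold before fitting) are assumptions made for brevity, not artifacts of the original study.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

def evaluate(X, y, X_obf, make_clf=GaussianNB, n_splits=10, seed=0):
    """X, y: legitimate (0) and direct-attack (1) samples; X_obf: obfuscated attacks."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    f1s, obf_tprs = [], []
    for tr, te in skf.split(X, y):
        # in the full pipeline, the majority class of X[tr] is undersampled here
        clf = make_clf().fit(X[tr], y[tr])
        f1s.append(f1_score(y[te], clf.predict(X[te])))  # quality on usual patterns
        # robustness: share of unseen obfuscated attacks still flagged as attacks (TPR)
        obf_tprs.append(float(np.mean(clf.predict(X_obf) == 1)))
    return np.mean(f1s), np.mean(obf_tprs)
```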
Experimental results
Examined Method | NB | C4.5 | SVM | LR |
---|---|---|---|---|
Baseline | 81.172 (4.182) | 36.402 (2.891) | 15.690 (3.446) | 63.180 (3.852) |
SingleClus | 82.636 (5.376) | 38.494 (3.784) | 17.782 (4.103) | 65.272 (3.448) |
FF-Majority | 83.891 (3.842) | 39.540 (3.019) | 19.038 (3.211) | 66.946 (2.562) |
FF-Minority | 84.100 (2.871) | 39.540 (2.995) | 18.619 (3.103) | 66.527 (2.383) |
FF-Overall | 85.146 (3.004) | 40.377 (2.981) | 19.665 (3.164) | 67.782 (2.410) |
RUS | 80.753 (5.122) | 35.983 (4.721) | 16.946 (5.673) | 63.808 (6.714) |
RUSBoost | 81.172 (4.219) | 36.402 (4.016) | 17.573 (4.862) | 64.644 (4.673) |
IRUS | 81.381 (5.134) | 36.611 (3.673) | 17.782 (4.523) | 64.435 (4.381) |
SMOTE | 79.079 (4.873) | 35.565 (4.822) | 16.318 (3.749) | 64.854 (4.006) |
Outlier-SMOTE | 79.498 (4.027) | 36.192 (3.760) | 16.946 (3.662) | 65.272 (3.885) |
Examined Method | NB (better/worse) | C4.5 (better/worse) | SVM (better/worse) | LR (better/worse) |
---|---|---|---|---|
Baseline | 24/45 | 38/43 | 8/70 | 13/82 |
SingleClus | 50/31 | 46/37 | 41/31 | 47/38 |
FF-Majority | 73/23 | 70/30 | 55/18 | 74/28 |
FF-Minority | 87/20 | 69/28 | 49/23 | 67/32 |
FF-Overall | 101/12 | 93/19 | 80/14 | 90/13 |
RUS | 20/54 | 13/64 | 21/52 | 20/69 |
RUSBoost | 26/41 | 36/42 | 36/36 | 34/52 |
IRUS | 30/37 | 38/40 | 39/29 | 31/58 |
SMOTE | 8/96 | 5/69 | 16/58 | 36/49 |
Outlier-SMOTE | 16/76 | 19/55 | 32/46 | 49/40 |