Dataset similarity
First, we report Jaccard similarity scores between the FST Agreement Lists of the three web attacks: Brute Force, SQL Injection, and XSS. Scores are provided at four levels of the FST agreement criterion, \(\text {fst}_{\mathrm{A}}\)={4,5,6,7}. Tables 9, 10, 11, and 12 include the Jaccard similarity scores for the FST Agreement Lists of these three web attacks at each level of \(\text {fst}_{\mathrm{A}}\).
From Tables 9 and 10, where \(\text {fst}_{\mathrm{A}}\)=4 and \(\text {fst}_{\mathrm{A}}\)=5 respectively, we can easily observe that SQL and XSS have the most features in common between their respective subsets. However, it is less clear which pair of web attacks has the fewest features in common: BF/XSS has the lowest score for \(\text {fst}_{\mathrm{A}}\)=4, while BF/SQL has the lowest score for \(\text {fst}_{\mathrm{A}}\)=5. Regardless, these scores are close enough for our purpose of generating Feature Popularity Lists, as we were able to obtain a desirable number of popular features in Tables 5 and 6 (lists with fewer than 20 features but more than 2-3 features). In the next section, we employ machine learning to determine whether these shorter feature lists cause serious performance degradation.
For the Jaccard similarity scores of Tables 11 and 12, where six or seven FSTs must agree for the FST Agreement Lists of the three web attacks, we notice a sharp dropoff in scores. This is also evidenced in the Feature Popularity Lists for those FST agreement criteria in Tables 7 and 8, which are mostly empty. In this regard, Jaccard similarity scores may help identify more desirable FST agreement criteria thresholds (by looking for steep dropoffs in scores).
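As a concrete illustration, the Jaccard scores in the tables below can be computed directly from pairs of FST Agreement Lists. The following is a minimal Python sketch; the feature names are hypothetical stand-ins for whatever features the FSTs agreed on:

```python
def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two feature sets.

    Returns None when either list is empty, mirroring the 'na' entries in
    the tables (no agreed-upon feature for at least one dataset in a pair).
    """
    a, b = set(a), set(b)
    if not a or not b:
        return None
    return len(a & b) / len(a | b)

# Hypothetical FST Agreement Lists (feature names illustrative only)
bf = {"Flow_Bytes_s", "Flow_IAT_Max", "Fwd_IAT_Std"}
sql = {"Flow_Bytes_s", "Flow_IAT_Max", "Fwd_IAT_Total"}
print(jaccard(bf, sql))  # 2 shared of 4 total -> 0.5
```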
Table 9
Jaccard similarities by dataset for FST Agreement Lists where 4/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=4)
 | BF | SQL | XSS |
BF | 1.0 | 0.42105 | 0.38095 |
SQL | 0.42105 | 1.0 | 0.61905 |
XSS | 0.38095 | 0.61905 | 1.0 |
Table 10
Jaccard similarities by dataset for FST Agreement Lists where 5/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=5)
 | BF | SQL | XSS |
BF | 1.0 | 0.45455 | 0.6 |
SQL | 0.45455 | 1.0 | 0.66667 |
XSS | 0.6 | 0.66667 | 1.0 |
Table 11
Jaccard similarities by dataset for FST Agreement Lists where 6/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=6); na values indicate there is no feature 6 out of 7 FSTs agree on for at least one dataset in a pair
 | BF | SQL | XSS |
BF | 1.0 | na | 0.2 |
SQL | na | 1.0 | 0.16667 |
XSS | 0.2 | 0.16667 | 1.0 |
Table 12
Jaccard similarities by dataset for FST Agreement Lists where 7/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=7); na values indicate there is no feature all 7 FSTs agree on for at least one dataset in a pair
 | BF | SQL | XSS |
BF | 1.0 | na | na |
SQL | na | 1.0 | na |
XSS | na | na | 1.0 |
Similarly, Jaccard similarity scores could help identify which types of cyberattacks are good candidates for generating a combined Feature Popularity List. Conversely, very low Jaccard similarity scores between certain cyberattacks could indicate they should not be grouped into the same Feature Popularity List, and different classes of attacks might be better served by separate lists. For example, if Denial of Service attack types and web attack types obtained very divergent Jaccard similarity scores for their FST Agreement Lists, then separate Feature Popularity Lists could be created for each class of attack as appropriate.
This is an introductory study of our feature popularity framework with only three different attacks. Applying these techniques to dozens or even hundreds of types of cyberattacks could be even more helpful, using the Jaccard similarity scores of their respective FST Agreement Lists to properly group cyberattacks into different Feature Popularity Lists. Future work can extend the framework to many different attack types by grouping them into separate Feature Popularity Lists. Collectively, additional Feature Popularity List groupings for different cyberattacks might even improve classification performance; at a minimum, they would provide better insight into the application domain through easier-to-explain models.
Classification performance results
In this section, classification performance results are provided both before and after applying our new feature popularity framework. Overall, we observe that classification performance is not seriously degraded with our Feature Popularity Lists, which have fewer features; in some cases, it is even improved. Regardless of classification performance, Feature Popularity Lists are a powerful tool that enabled us to uncover previously unseen insights into the attack detection process with CSE-CIC-IDS2018 data.
Table
13 provides the classification performance results with four Feature Popularity Lists and “All Features” for five classifiers (CB, DT, LGB, RF, and XGB) with three different web attacks: BF, SQL, and XSS. For the FST column, “All Features” refers to the full feature set of 66 features (before any feature selection technique is applied), and the four Feature Popularity Lists comprise: “2/3 & 4/7 Agree”, “2/3 & 5/7 Agree”, “3/3 & 4/7 Agree”, and “3/3 & 5/7 Agree”. With these Feature Popularity Lists, the first fractional term refers to
\(\hbox {ds}_{\mathrm{A}}\) (specifying how many datasets agree) and the second fractional term refers to
\(\text {fst}_{\mathrm{A}}\) (specifying how many FSTs agree). Classification performance is presented in terms of AUC for these five feature sets in the FST column, across the five classifiers for each of the three web attacks. The three web attacks are represented as three column groups in the table. “SD AUC” refers to the standard deviation of each AUC score. The top AUC score is indicated in bold for each combination of classifier and web attack.
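The two agreement thresholds amount to two counting passes: \(\text {fst}_{\mathrm{A}}\) filters features within a dataset by how many FSTs selected them, and \(\hbox {ds}_{\mathrm{A}}\) then filters across datasets. A minimal Python sketch follows; the feature names and the use of only four FSTs are illustrative assumptions, not the actual lists from our experiments:

```python
from collections import Counter

def fst_agreement_list(fst_rankings, fst_a):
    """Features selected by at least fst_a of the FSTs for one dataset."""
    votes = Counter(f for ranking in fst_rankings for f in set(ranking))
    return {f for f, n in votes.items() if n >= fst_a}

def feature_popularity_list(agreement_lists, ds_a):
    """Features present in the FST Agreement Lists of at least ds_a datasets."""
    votes = Counter(f for agreed in agreement_lists for f in agreed)
    return {f for f, n in votes.items() if n >= ds_a}

# Toy example: 4 FSTs instead of 7, 3 datasets; names are illustrative
bf_agree = fst_agreement_list([["f1", "f2"], ["f1", "f3"], ["f1"], ["f2"]], 2)
sql_agree = {"f1", "f3"}
xss_agree = {"f1", "f2"}
print(feature_popularity_list([bf_agree, sql_agree, xss_agree], 3))  # {'f1'}
```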
Overall, when visually inspecting Table 13, we can see that classification performance with the Feature Popularity Lists is not seriously degraded. For 5 of the 15 classifier and web attack combinations, a Feature Popularity List scores higher than “All Features”, and the best Feature Popularity List score is never more than 0.02 AUC below the “All Features” score. In particular, even the least restrictive “2/3 & 4/7 Agree” Feature Popularity List never performs more than 0.02 AUC below “All Features”; the other Feature Popularity Lists are simply subsets of this list (with fewer features due to more restrictive agreement criteria). Throughout this experiment, classification performance is at most mildly degraded by employing Feature Popularity Lists, and is even improved in several cases.
Table 13
Classification performance for 4 feature popularity lists (plus all features), 3 web attacks, and 5 classifiers
CatBoost |
FST | BF AUC | BF SD AUC | SQL AUC | SQL SD AUC | XSS AUC | XSS SD AUC |
All Features | 0.93277 | 0.00920 | 0.90008 | 0.02382 | 0.93743 | 0.01432 |
2/3 & 4/7 Agree | 0.91121 | 0.01242 | 0.87876 | 0.03297 | 0.92574 | 0.01290 |
2/3 & 5/7 Agree | 0.91121 | 0.01322 | 0.86604 | 0.02766 | 0.92791 | 0.01226 |
3/3 & 4/7 Agree | 0.88749 | 0.01398 | 0.87531 | 0.03579 | 0.92627 | 0.01420 |
3/3 & 5/7 Agree | 0.88858 | 0.01770 | 0.87057 | 0.03077 | 0.92167 | 0.01696 |
Decision Tree |
FST | BF AUC | BF SD AUC | SQL AUC | SQL SD AUC | XSS AUC | XSS SD AUC |
All Features | 0.92252 | 0.01355 | 0.90876 | 0.03032 | 0.93830 | 0.01244 |
2/3 & 4/7 Agree | 0.90346 | 0.01235 | 0.91000 | 0.02972 | 0.93417 | 0.01456 |
2/3 & 5/7 Agree | 0.91178 | 0.01391 | 0.91206 | 0.03209 | 0.94020 | 0.01389 |
3/3 & 4/7 Agree | 0.87643 | 0.01820 | 0.91527 | 0.03260 | 0.93716 | 0.01701 |
3/3 & 5/7 Agree | 0.87384 | 0.01813 | 0.91334 | 0.03429 | 0.93335 | 0.01301 |
LightGBM |
FST | BF AUC | BF SD AUC | SQL AUC | SQL SD AUC | XSS AUC | XSS SD AUC |
All Features | 0.93863 | 0.00975 | 0.93499 | 0.03401 | 0.94622 | 0.01416 |
2/3 & 4/7 Agree | 0.93511 | 0.01052 | 0.91610 | 0.03595 | 0.94354 | 0.01320 |
2/3 & 5/7 Agree | 0.93536 | 0.01078 | 0.90934 | 0.03810 | 0.94191 | 0.01491 |
3/3 & 4/7 Agree | 0.93379 | 0.01159 | 0.90404 | 0.04282 | 0.94170 | 0.01360 |
3/3 & 5/7 Agree | 0.93652 | 0.01084 | 0.91161 | 0.04016 | 0.94166 | 0.01347 |
Random Forest |
FST | BF AUC | BF SD AUC | SQL AUC | SQL SD AUC | XSS AUC | XSS SD AUC |
All Features | 0.93945 | 0.00910 | 0.91773 | 0.02602 | 0.94216 | 0.01316 |
2/3 & 4/7 Agree | 0.93799 | 0.00944 | 0.91105 | 0.03448 | 0.94455 | 0.01214 |
2/3 & 5/7 Agree | 0.93441 | 0.01089 | 0.90349 | 0.03102 | 0.94036 | 0.01592 |
3/3 & 4/7 Agree | 0.92851 | 0.01070 | 0.89572 | 0.03404 | 0.94274 | 0.01418 |
3/3 & 5/7 Agree | 0.92106 | 0.01235 | 0.90950 | 0.02971 | 0.93966 | 0.01546 |
XGBoost |
FST | BF AUC | BF SD AUC | SQL AUC | SQL SD AUC | XSS AUC | XSS SD AUC |
All Features | 0.93668 | 0.00891 | 0.89892 | 0.04456 | 0.93581 | 0.01363 |
2/3 & 4/7 Agree | 0.92197 | 0.01063 | 0.89720 | 0.03518 | 0.92918 | 0.01379 |
2/3 & 5/7 Agree | 0.92613 | 0.01357 | 0.90861 | 0.03699 | 0.94058 | 0.01425 |
3/3 & 4/7 Agree | 0.91230 | 0.01552 | 0.90421 | 0.03625 | 0.93121 | 0.01521 |
3/3 & 5/7 Agree | 0.91188 | 0.01677 | 0.90233 | 0.03867 | 0.92933 | 0.01460 |
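The AUC and SD AUC entries above are the mean and standard deviation of per-fold AUC scores from cross-validation. As a sketch of how such entries are computed, AUC can be derived from the Mann-Whitney U statistic; the fold data below is toy data, not the CSE-CIC-IDS2018 results:

```python
from statistics import mean, stdev

def auc(y_true, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy per-fold (labels, scores) pairs; real values come from the CV splits
folds = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
]
fold_aucs = [auc(y, s) for y, s in folds]
print(round(mean(fold_aucs), 5), round(stdev(fold_aucs), 5))
```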
Cybersecurity analysis and insights
A major benefit of feature popularity is providing domain experts with new insights from models which are more explainable. Our feature popularity framework led us to major discoveries into the web attack detection process within the CSE-CIC-IDS2018 dataset, even though we had intensely researched this dataset in prior work [
21]. Based on our survey of other CSE-CIC-IDS2018 studies, none of them have identified these insights into the web attack detection process as of the date of this writing.
Our most restrictive “3/3 & 5/7 Agree” Feature Popularity List (
\(\hbox {ds}_{\mathrm{A}}\)=3 and
\(\text {fst}_{\mathrm{A}}\)=5) only includes the following four features from Table
6: Flow_Bytes_s (flow bytes per second), Flow_IAT_Max (maximum time between two packets in the flow), Fwd_IAT_Std (standard deviation of the time between two packets sent in the forward direction), and Fwd_IAT_Total (total time between two packets sent in the forward direction). Using only these four input features, our machine learning models from Table
13 achieved favorable classification performance, nearly as good as with the “All Features” dataset. All four of these features are based mainly on the time dimension. From a cybersecurity analyst’s perspective, however, these four features do not truly signal SQL Injection or XSS web attacks; in other words, detection of these two web attacks should not primarily rest on temporal features. For Brute Force, the third and only other web attack label in CSE-CIC-IDS2018, it is less clear whether detection should be based primarily on time-based features, so we discuss it separately below.
Attack characteristics of SQL Injection and XSS web attacks are mainly found in the application layer (7) of the OSI model [41], as the payloads for these two web attacks operate over protocols residing in layer 7. The four features (Flow_Bytes_s, Flow_IAT_Max, Fwd_IAT_Std, and Fwd_IAT_Total) are based on NetFlows [42, 43] and operate at the lower layers 3 and 4 of the OSI model. Overall, these four features do not indicate attack signatures for these web attacks, because their attack fingerprints occur in the application layer (7) of the OSI model.
For example, Flow_Bytes_s does not signal a SQL Injection or XSS web attack; it merely indicates the number of bytes per second in a network flow. Normal web traffic can just as easily produce lower, similar, or higher values of Flow_Bytes_s than SQL Injection or XSS attack traffic, so this feature does not properly discriminate between normal web traffic and these web attacks.
One small and brief web request representing normal traffic could easily have similar Flow_Bytes_s values as a slow and stealthy SQL Injection or XSS web attack, and the same logic applies to moderate-velocity normal traffic versus moderate-velocity attack traffic. While it could be argued that very high Flow_Bytes_s values could signal web attacks such as SQL Injection or XSS, this is simply not the case in the CSE-CIC-IDS2018 dataset, as high-velocity attack traffic does not exist for these two web attack labels: the SQL Injection label encompasses only 87 instances and the XSS label only 230 instances. Moreover, such an approach would not detect slow and stealthy web attacks.
The other three features (Flow_IAT_Max, Fwd_IAT_Std, and Fwd_IAT_Total) have the same problems as Flow_Bytes_s in discriminating between normal web traffic and SQL Injection or XSS web attacks. These features all signal information from layers 3 and 4 of the OSI model, not layer 7, and all four are heavily focused on the time dimension. SQL Injection and XSS web attacks do not typically have characteristics based on temporal features (especially when executed in a slow and stealthy fashion by attackers seeking to avoid detection). Instead of detecting these classes of web attacks based on time, other attack characteristics could be used, such as those found in the application layer: parsing text payloads for malicious sequences of characters, or monitoring error logs (both in the application layer of the OSI model).
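As an illustration of what application-layer payload parsing might look like, the sketch below flags naive signature patterns in request payloads. The patterns are deliberately simplistic placeholders of our own invention; production systems such as web application firewalls use far richer rule sets and would still miss obfuscated payloads:

```python
import re

# Hypothetical, intentionally naive layer-7 signatures for illustration only
SQLI_PATTERN = re.compile(r"('|\b(union|select|or\s+1=1)\b)", re.IGNORECASE)
XSS_PATTERN = re.compile(r"<\s*script\b", re.IGNORECASE)

def flag_payload(payload):
    """Return which (if any) naive attack signatures appear in a payload."""
    return {
        "sqli": bool(SQLI_PATTERN.search(payload)),
        "xss": bool(XSS_PATTERN.search(payload)),
    }

print(flag_payload("id=1 OR 1=1"))  # {'sqli': True, 'xss': False}
print(flag_payload("<script>alert(1)</script>"))  # {'sqli': False, 'xss': True}
```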
Then, the question arises of what could be signaling such good detection of SQL Injection and XSS web attacks within the CSE-CIC-IDS2018 dataset. We can only speculate, as this question deserves further research. One possibility is unintentional contamination in the data collection process, where the machine learning models discriminate between attack and normal traffic based on temporal patterns of the data collection rather than the underlying signatures of the web attacks. Future work can investigate this phenomenon further.
With regard to Brute Force web attacks, the same arguments apply as for SQL Injection and XSS: these four features (Flow_Bytes_s, Flow_IAT_Max, Fwd_IAT_Std, and Fwd_IAT_Total) do not necessarily signal a web attack. They might signal a Brute Force web attack during a very extreme scenario of massive web traffic spikes, for example a Brute Force attack resembling a Denial of Service attack in which the attacker causes a massive flood of web traffic. However, this approach would not detect slower and stealthier Brute Force web attacks. Many attackers seek to evade detection, and relying only on these four features would effectively miss one of the most important classes of adversaries (those seeking to avoid detection).
Most importantly, the CSE-CIC-IDS2018 dataset contains only 611 Brute Force web attack instances compared to over 2 million “Normal” instances across the two days of data collection for web attacks. Given that Brute Force web attack labels make up only 0.03% of the traffic for those two days, our machine learning models are not detecting some sort of “flood” type of Brute Force web attack. Instead, they are likely detecting other patterns in the data collection, which requires future research.
Even for Brute Force web attacks, the application layer (7) of the OSI model contains better attack characteristics than the lower layers 3 and 4 (where the NetFlow features reside). The OWASP Top 10 [
11] contains two items on how to handle Brute Force web attacks at the application layer. First, “OWASP A2:2017-Broken Authentication” [
44] indicates that web applications should “limit or increasingly delay failed login attempts” and “log all failures and alert administrators when credential stuffing, brute force, or other attacks are detected”. Second, “OWASP A10:2017-Insufficient Logging & Monitoring” [
45] highlights that “exploitation of insufficient logging and monitoring is the bedrock of nearly every major incident”. Essentially, properly designed web applications would neutralize “flood” types of Brute Force web attacks by increasingly delaying failed logins. Sensors drawing on application layer logs would likewise be best equipped to detect slower and stealthier Brute Force web attacks.
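The first OWASP recommendation, increasingly delaying failed login attempts, can be sketched as a simple back-off policy; the constants below are arbitrary placeholders, not OWASP-mandated values:

```python
def login_delay(failed_attempts, base=0.5, cap=30.0):
    """Seconds to delay the next login attempt, doubling per failure up to a cap."""
    return min(cap, base * (2 ** failed_attempts))

# 0 failures -> 0.5s, 3 failures -> 4.0s, 10 failures -> capped at 30.0s
```

A cap keeps the delay from growing unbounded while still removing the throughput an attacker needs for a "flood" style Brute Force attack.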
Even though we obtained respectable classification results for web attacks in this study, our newly conceived feature popularity framework allowed us to realize that the features we were detecting on did not make sense from a cybersecurity analyst’s perspective. When looking at all 79 independent features of the downloaded CSE-CIC-IDS2018 dataset, it can be difficult for a cybersecurity analyst to ascertain whether NetFlow-based features might be good candidates for detecting web attacks. Even after generating a myriad of Feature Importance Lists in Tables
14,
15,
16,
17,
18,
19, it still was not clear, as these lists of features diverged widely from each other. After employing feature popularity, which enabled us to build more explainable models, we could ascertain that our top four features (Flow_Bytes_s, Flow_IAT_Max, Fwd_IAT_Std, and Fwd_IAT_Total) did not properly characterize the web attack signatures in question. Overall, future research can further answer the question of whether NetFlow-based features are even good candidates for detecting web attacks occurring at the application layer of the OSI model.