Introduction
-
We first briefly discuss the concept of cybersecurity data science and the relevant methods, to clarify how it applies to data-driven intelligent decision making in cybersecurity. To this end, we also briefly review different machine learning tasks in cybersecurity and summarize various cybersecurity datasets, highlighting their usage in different data-driven cyber applications.
-
We then discuss and summarize a number of associated research issues and future directions in cybersecurity data science that could help both academia and industry further research and development in relevant application areas.
-
Finally, we provide a generic multi-layered framework of the cybersecurity data science model based on machine learning techniques. In this framework, we briefly discuss how the model can be used to discover useful insights from security data and make data-driven intelligent decisions to build smart cybersecurity systems.
Background
Cybersecurity
-
Confidentiality is a property used to prevent the access and disclosure of information to unauthorized individuals, entities or systems.
-
Integrity is a property used to prevent any modification or destruction of information in an unauthorized manner.
-
Availability is a property used to ensure timely and reliable access to information assets and systems for an authorized entity.
Cyberattacks and security risks
-
Unauthorized access describes the act of accessing a network, systems, or data without authorization, resulting in a violation of a security policy [2];
-
Malware, known as malicious software, is any program or software that is intentionally designed to cause damage to a computer, client, server, or computer network, e.g., botnets. Examples of different types of malware include computer viruses, worms, Trojan horses, adware, ransomware, spyware, malicious bots, etc. [3, 26]. Ransom malware, or ransomware, is an emerging form of malware that prevents users from accessing their systems, devices, or personal files, then demands an anonymous online payment to restore access;
-
Denial-of-Service is an attack meant to shut down a machine or network, making it inaccessible to its intended users by flooding the target with traffic that triggers a crash. A denial-of-service (DoS) attack typically uses one computer with an Internet connection, while a distributed denial-of-service (DDoS) attack uses multiple computers and Internet connections to flood the targeted resource [2];
-
Phishing is a type of social engineering, used for a broad range of malicious activities accomplished through human interactions, in which a fraudulent attempt is made to obtain sensitive information such as banking and credit card details, login credentials, or personally identifiable information by posing as a trusted individual or entity via electronic communication such as email, text, or instant message [26].
Cybersecurity defense strategies
-
Signature-based IDS: A signature can be a predefined string, pattern, or rule that corresponds to a known attack. In a signature-based IDS, matching a particular pattern counts as detecting the corresponding attack. Examples of signatures are known patterns or byte sequences in network traffic, or sequences used by malware. Anti-virus software uses such sequences or patterns as signatures when performing the matching operation. Signature-based IDS is also known as knowledge-based or misuse detection [41]. This technique can process a high volume of network traffic efficiently, but it is strictly limited to known attacks. Thus, detecting new or unseen attacks is one of the biggest challenges faced by signature-based systems.
-
Anomaly-based IDS: The concept of anomaly-based detection overcomes the issues of signature-based IDS discussed above. In an anomaly-based intrusion detection system, the behavior of the network is first examined to find dynamic patterns, to automatically create a data-driven model, to profile the normal behavior, and thus it detects deviations in the case of any anomalies [41]. Thus, anomaly-based IDS can be treated as a dynamic approach, which follows behavior-oriented detection. The main advantage of anomaly-based IDS is the ability to identify unknown or zero-day attacks [42]. However, the issue is that the identified anomaly or abnormal behavior is not always an indicator of intrusions. It sometimes may happen because of several factors such as policy changes or offering a new service.
Approach | Pros | Cons |
---|---|---|
Signature-based IDS | Simplest and effective method to detect known attacks | Ineffective to detect unknown attacks |
Anomaly-based IDS | Effective to detect new and unforeseen vulnerabilities | Anomaly is not always an indicator of intrusions, and may increase false positive rate |
Hybrid approach | Reduce the false positive rate of unknown attacks | Model might be complex |
Stateful protocol analysis approach | Know and trace the protocol states | Unable to inspect attacks looking like benign protocol behaviors |
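
The contrast between the first two approaches can be sketched in a few lines. This is a minimal illustration, not a production detector: the signature table, the payloads, and the requests-per-minute values are all made up, and the z-score threshold of 3 is an assumed convention rather than anything prescribed by the cited work.

```python
from statistics import mean, stdev

# Hypothetical byte-sequence signatures for known attacks (illustrative only).
SIGNATURES = {
    b"/etc/passwd": "path-traversal probe",
    b"' OR '1'='1": "SQL injection attempt",
}

def signature_match(payload: bytes) -> list[str]:
    """Return the names of all known signatures found in the payload."""
    return [name for pattern, name in SIGNATURES.items() if pattern in payload]

def fit_profile(normal_values: list[float]) -> tuple[float, float]:
    """Profile normal behavior as the mean and standard deviation of one feature."""
    return mean(normal_values), stdev(normal_values)

def is_anomalous(value: float, profile: tuple[float, float], threshold: float = 3.0) -> bool:
    """Flag a value whose z-score against the normal profile exceeds the threshold."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Requests per minute observed during normal operation (made-up numbers).
profile = fit_profile([98, 102, 101, 97, 100, 103, 99, 100])

known_attack = signature_match(b"GET /../../etc/passwd HTTP/1.1")  # matches a signature
unseen_payload = signature_match(b"GET /index.html HTTP/1.1")      # no signature matches
burst_flagged = is_anomalous(450, profile)   # far outside the normal profile
normal_flagged = is_anomalous(101, profile)  # within the normal profile
```

The signature matcher can only report patterns already in its table, while the profile-based check reacts to any large deviation, including benign ones such as a newly launched service, which is where the false positives listed in the table come from.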
Data science
Cybersecurity data science
Understanding cybersecurity data
Dataset | Description |
---|---|
DARPA | |
KDD’99 Cup | Most widely used data set containing 41 features for evaluating anomaly detection methods, where attacks are categorized into four major classes, namely denial of service (DoS), remote-to-local (R2L), user-to-root (U2R), and probing [50]. KDD’99 Cup dataset can be used to evaluate ML-based attack detection models |
NSL-KDD | A refined version of the KDD’99 Cup dataset in which redundant records are eliminated. Thus, an ML classification-based security model trained on the NSL-KDD dataset will not be biased towards more frequent records [51] |
CAIDA | |
ISOT’10 | |
ISCX’12 | |
CTU-13 | A labeled malware dataset including botnet, normal, and background traffic that was captured at CTU University, Czech Republic [58]. CTU-13 can be used for data-driven malware analysis using ML techniques and to evaluate the malware detection system |
UNSW-NB15 | A dataset created at the University of New South Wales in 2015, with 49 features and nine different types of attacks including DoS [59]. UNSW-NB15 can be used for evaluating ML-based anomaly detection systems in cyber applications |
CIC-IDS2018 CIC-IDS2017 | The datasets include different attack scenarios, namely Brute-force, Heartbleed, Botnet, HTTP DoS, DDoS, Web attacks, and insider attack, collected by the Canadian Institute for Cybersecurity [60]. Datasets can be used for evaluating ML based intrusion detection systems including Zero-Day attacks |
CIC-DDoS2019 | A dataset containing DDoS attacks, collected by the Canadian Institute for Cybersecurity [61]. CIC-DDoS2019 can be used for network traffic behavioral analytics to detect DDoS attacks using ML techniques |
MAWI | A traffic dataset maintained by a group of Japanese network research and academic institutions, used to detect and evaluate DDoS intrusions using ML techniques [62] |
ADFA IDS | An intrusion dataset with different versions named ADFA-LD and ADFA-WD, issued by the Australian Defence Force Academy (ADFA) [63]. They are designed for evaluating host-based IDS |
CERT | |
Email | |
DGA | |
Malware | |
Bot-IoT | A dataset that incorporates legitimate and simulated IoT network traffic, along with different attacks for network forensic analytics in the area of Internet of Things [80]. Bot-IoT can be used to evaluate the reliability using different statistical and machine learning methods for forensics purposes |
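
The NSL-KDD row above notes that eliminating redundant records keeps a classifier from being biased towards frequent records. A minimal sketch of one simple way to do such deduplication follows; the records and their four fields are made up for illustration and are not taken from any of the listed datasets.

```python
from collections import Counter

# Made-up connection records: (protocol, service, flag, label). Not real KDD data.
records = [
    ("tcp", "http", "SF", "normal"),
    ("tcp", "http", "SF", "normal"),
    ("tcp", "http", "SF", "normal"),
    ("icmp", "ecr_i", "SF", "dos"),
    ("icmp", "ecr_i", "SF", "dos"),
    ("tcp", "telnet", "S0", "probe"),
]

def drop_duplicates(rows):
    """Keep the first occurrence of each record, preserving the original order."""
    return list(dict.fromkeys(rows))

unique_records = drop_duplicates(records)

# Label counts before and after deduplication.
before = Counter(label for *_, label in records)
after = Counter(label for *_, label in unique_records)
```

Before deduplication the repeated "normal" record dominates the label counts; afterwards each distinct record contributes once, so a frequency-sensitive learner no longer sees the repeated record three times.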
Defining cybersecurity data science
Key terms | Description |
---|---|
Security incident or attack | An incident or cyber-attack, is any act that threatens the security, confidentiality, integrity, or availability of information assets, information systems, or the networks that deliver the information |
Data breach | An intentional or unintentional release of secure data to an untrusted environment, which is also known as data spill or data leak |
Cyber anomaly | Anomalies are data points, items, observations or events that do not conform to the expected pattern of a given group, such as cyber intrusions or fraud. Anomalies are also referred to as outliers, noise, deviations, and exceptions in cyber data |
Cybercrime | A criminal activity done using computers and the Internet, that can be committed against government and private organizations |
Cybersecurity | A set of technologies and processes designed to protect networks, devices, programs, and data from various cyber attacks, damages, or unauthorized access |
Data science | Focuses on the collection and application of data to provide insights or meaningful information in industry, academia, or the context of human life |
Artificial intelligence (AI) | A technology that behaves intelligently with the ability of thinking and working like humans, e.g., intelligent decision making in cyber domain |
Machine learning | A significant part of AI, which deals with the scientific study of algorithms and statistical models that learn from cybersecurity data to perform a specific task without using explicit instructions, relying on security incident patterns and inference instead. |
Deep learning | A significant part of machine learning in AI that typically builds security models based on artificial neural networks consisting of several data processing layers |
Cyber features | These are attributes, extracted from cyber data sources to analyze and build target cyber security models |
Security models | Models take features as inputs and they apply simple or hybrid machine learning algorithms to come up with a specific outcome for a cybersecurity use case for intelligent decision making |
Threat intelligence | Deals with gathering raw data of threats, and then analyzes and filters the data to produce usable information for automated security control systems, i.e., evidence-based knowledge in cybersecurity |
Behavioral analytics | Deals with the behavioral patterns of various security incidents or the malicious behavior in the data |
Internet-of-Things (IoT) | A smart environment in which objects that can identify themselves gain capability by connecting to surrounding objects and to the extensive data flowing around them, and in which cyber criminals are also active |
Machine learning tasks in cybersecurity
Used Technique | Purpose | References |
---|---|---|
SVM | To classify various attacks such as DoS, Probe, U2R, and R2L | Kotpalliwar et al. [85] |
SVM | Feature selection, intrusion detection and classification | |
SVM | DDoS detection and analysis in SDN-based environment | Kokila et al. [90] |
SVM | Evaluating host-based anomaly detection systems | Xie et al. [91] |
SVM-PSO | To build intrusion detection system | Saxena et al. [92] |
FCM clustering, ANN and SVM | To build network intrusion detection system | Chandrasekhar et al. [93] |
KNN | Network intrusion detection system | |
KNN | To reduce the false alarm rate | Meng et al. [96] |
SVM and KNN | To build intrusion detection system | Dada et al. [97] |
K-means and KNN | To build intrusion detection system | Sharifi et al. [98] |
KNN and Clustering | To build intrusion detection system | Lin et al. [99] |
Naive Bayes | To build an intrusion detection system for multi-class classification. | Koc et al. [100] |
Decision Tree | To detect the malicious code’s behavior information by running malicious code on the virtual machine and analyze the behavior information for intrusion detection. | Moon et al. [101] |
Decision Tree | Feature selection and to build an effective network intrusion detection system | |
Decision Tree and KNN | Anomaly intrusion detection system | Balogun et al. [108] |
Genetic Algorithm and Decision Tree | To solve the problem of small disjunct in the decision tree based intrusion detection system | Azad et al. [109] |
Decision Tree and ANN | To measure the performance of intrusion detection system | Jo et al. [110] |
Random Forests | To build network intrusion detection systems | Zhang et al. [111] |
Association Rule | To build network intrusion detection systems | Tajbakhsh et al. [112] |
Behavior Rule | To build intrusion detection system for safety critical medical cyber physical systems | Mitchell et al. [113] |
Supervised | For malware detection and analysis | |
Semi-supervised Adaboost | For network anomaly detection | Yuan et al. [115] |
Hidden Markov Models | To build an intrusion detection system | |
Genetic Algorithm | For prevention of cyberterrorism through dynamic and evolving intrusion detection | |
Deep Learning Recurrent, RNN, LSTM | To build anomaly intrusion detection system and attack classification | |
Deep Learning Convolutional | Malware traffic classification system | |
Deep and Reinforcement Learning | Malicious activities and intrusion detection system | |
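
To make one of the tabulated techniques concrete, the following is a minimal sketch of k-nearest-neighbor (KNN) classification applied to connection records. It is not the method of any cited work: the two features, their values, the labels, and the choice of k = 3 are all assumptions made for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up, pre-scaled features: (connection duration, failed logins).
train = [
    ((0.10, 0.00), "normal"),
    ((0.20, 0.10), "normal"),
    ((0.15, 0.05), "normal"),
    ((0.90, 0.80), "attack"),
    ((0.85, 0.90), "attack"),
    ((0.95, 0.70), "attack"),
]

prediction_a = knn_predict(train, (0.12, 0.02))  # close to the "normal" cluster
prediction_b = knn_predict(train, (0.88, 0.85))  # close to the "attack" cluster
```

Because KNN relies on Euclidean distance, features on different scales would need to be normalized first; the sketch sidesteps this by using values already in the range 0 to 1.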
Supervised learning
Unsupervised learning
Neural networks and deep learning
Other learning techniques
Research issues and future directions
-
Cybersecurity datasets: Source datasets are the primary component for work in cybersecurity data science. Most of the existing datasets are old and might be insufficient for understanding the recent behavioral patterns of various cyber-attacks. Although the data can be transformed into a meaningful form after several processing tasks, the characteristics of recent attacks and their patterns of occurrence remain poorly understood. Thus, further processing or machine learning algorithms may yield low accuracy for the target decisions. Therefore, establishing a large number of recent datasets for a particular problem domain like cyber risk prediction or intrusion detection is needed, which could be one of the major challenges in cybersecurity data science.
-
Handling quality problems in cybersecurity datasets: The cyber datasets might be noisy, incomplete, insignificant, imbalanced, or may contain inconsistent instances related to a particular security incident. Such problems in a data set may affect the quality of the learning process and degrade the performance of machine learning-based models [162]. To make data-driven intelligent decisions for cybersecurity solutions, such problems in the data need to be dealt with effectively before building the cyber models. Therefore, understanding such problems in cyber data and handling them effectively, using existing or newly proposed algorithms for a particular problem domain like malware analysis or intrusion detection and prevention, is needed, which could be another research issue in cybersecurity data science.
-
Security policy rule generation: Security policy rules reference security zones and enable a user to allow, restrict, and track traffic on the network based on the corresponding user or user group, and the service or application. During execution, the policy rules, both general and more specific, are compared against the incoming traffic in sequence, and the rule that matches the traffic is applied. The policy rules used in most cybersecurity systems are static and are either generated by human expertise or ontology-based [163, 164]. Although association rule learning techniques produce rules from data, they tend to generate redundant rules [153], which makes the policy rule-set complex. Therefore, understanding such problems in policy rule generation and handling them effectively, using existing or newly proposed algorithms for a particular problem domain like access control [165], is needed, which could be another research issue in cybersecurity data science.
-
Hybrid learning method: Most commercial products in the cybersecurity domain contain signature-based intrusion detection techniques [41]. However, missing features or insufficient profiling can cause these techniques to miss unknown attacks. In that case, anomaly-based detection techniques, or a hybrid technique combining signature-based and anomaly-based detection, can be used to overcome such issues. A hybrid technique combining multiple learning techniques, or a combination of deep learning and machine learning methods, can be used to extract the target insight for a particular problem domain like intrusion detection, malware analysis, or access control, and to make intelligent decisions for the corresponding cybersecurity solutions.
-
Protecting the valuable security information: Another consequence of a cyber data attack is the loss of extremely valuable data and information, which could be damaging for an organization. With the use of encryption or highly complex signatures, one can stop others from probing into a dataset. In such cases, cybersecurity data science can be used to build a data-driven impenetrable protocol to protect such security information. To achieve this goal, cyber analysts can develop algorithms that analyze the history of cyberattacks to detect the most frequently targeted chunks of data. Thus, understanding such data protection problems and designing corresponding algorithms to handle them effectively could be another research issue in the area of cybersecurity data science.
-
Context-awareness in cybersecurity: Existing cybersecurity work mainly starts from cyber data containing several low-level features. When data mining and machine learning techniques are applied to such datasets, a related pattern can be identified that describes the data properly. However, broader contextual information [140, 145, 166], such as temporal and spatial context, relationships among events or connections, and dependencies, can be used to decide whether a suspicious activity exists or not. For instance, some approaches may consider individual connections as DoS attacks, while security experts might not treat them as malicious by themselves. Thus, a significant limitation of existing cybersecurity work is that it does not use contextual information for predicting risks or attacks. Therefore, context-aware adaptive cybersecurity solutions could be another research issue in cybersecurity data science.
-
Feature engineering in cybersecurity: The efficiency and effectiveness of a machine learning-based security model have always been a major challenge due to the high volume of network data with a large number of traffic features. The large dimensionality of data has been addressed using several techniques such as principal component analysis (PCA) [167] and singular value decomposition (SVD) [168]. In addition to low-level features in the datasets, the contextual relationships between suspicious activities might be relevant. Such contextual data can be stored in an ontology or taxonomy for further processing. Thus, how to effectively select the optimal features, or extract the significant features, considering both the low-level and the contextual features for effective cybersecurity solutions could be another research issue in cybersecurity data science.
-
Remarkable security alert generation and prioritizing: In many cases, the cybersecurity system may not be well defined and may cause a substantial number of false alarms that are unexpected in an intelligent system. For instance, an IDS deployed in a real-world network generates around nine million alerts per day [169]. A network-based intrusion detection system typically inspects the incoming traffic, matching the associated patterns to detect risks, threats, or vulnerabilities, and generates security alerts. However, responding to each such alert might not be effective, as it consumes a relatively large amount of time and resources and may consequently result in a self-inflicted DoS. To overcome this problem, a high-level management layer is required that correlates the security alerts, considering the current context and their logical relationships, and prioritizes them before reporting them to users, which could be another research issue in cybersecurity data science.
-
Recency analysis in cybersecurity solutions: Machine learning-based security models typically use a large amount of static data to generate data-driven decisions. Anomaly detection systems rely on constructing such a model of normal and anomalous behavior according to their patterns. However, normal behavior in a large and dynamic security system is not well defined and may change over time, which can be viewed as an incrementally growing dataset. The patterns in such incremental datasets may change in several cases. This often results in a substantial number of false alarms, known as false positives. Thus, a recent malicious behavioral pattern is more likely to be interesting and significant than older ones for predicting unknown attacks. Therefore, effectively using the concept of recency analysis [170] in cybersecurity solutions could be another issue in cybersecurity data science.
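
The effect of changing normal behavior on false positives can be illustrated with a small sketch. It uses an exponentially weighted moving average as a simple stand-in for recency weighting; this is an assumed technique chosen for illustration, not the recency analysis method of [170], and the login counts, the smoothing factor, and the 25% tolerance are made up.

```python
def ewma_baseline(values, alpha=0.5):
    """Exponentially weighted mean: recent observations count more than old ones."""
    baseline = values[0]
    for value in values[1:]:
        baseline = alpha * value + (1 - alpha) * baseline
    return baseline

def deviates(value, baseline, tolerance=0.25):
    """Flag a value that differs from the baseline by more than the tolerance fraction."""
    return abs(value - baseline) > tolerance * baseline

# Made-up daily login counts; normal behavior shifts upward after a new service launches.
history = [100, 100, 100, 100, 200, 200, 200, 200]

static_mean = sum(history) / len(history)  # treats old and new data equally
recent_baseline = ewma_baseline(history)   # leans towards the recent level

new_value = 205
flag_static = deviates(new_value, static_mean)      # compared with the stale average
flag_recent = deviates(new_value, recent_baseline)  # compared with the recent level
```

With these numbers the static average flags the new observation even though it matches the recent level of activity, while the recency-weighted baseline does not, which is the kind of false positive the paragraph above describes.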