Fuzziness based semi-supervised learning approach for intrusion detection system
Introduction
Intrusion detection (ID) is the process of monitoring, detecting, and analyzing events that violate the security policies of a networked environment [45]. Denning [12] introduced the concept of detecting cyber-based attacks on computer networks by providing a framework for an intrusion detection system (IDS), based on the hypothesis that security violations can be detected by monitoring system audit records for abnormal patterns of system usage. Organizations deploy their own access controls to grant or restrict access to their assets, but this approach does not guarantee an appropriate level of assurance and protection for a particular resource [10]. This problem is evident in various security incidents around the world, for example, the compromise of Yahoo’s and Amazon’s websites after sophisticated persistent attacks [6]. Intruders and attackers continually seek to disrupt network traffic and degrade network performance with different types of attacks or intrusions. A network intrusion refers to a suspicious and sudden deviation from the normal behavior of the system that destabilizes the security of the network. According to Qui et al. [40], Hernández-Pereira et al. [16], and Yan and Yu [56], an intrusion can be described as a set of actions that attempt to compromise the confidentiality, integrity, or availability (CIA) of information resources; therefore, it is necessary to take measures to minimize such risks.
The Internet has become an indispensable medium for exchanging information among users and organizations; therefore, security has become an essential aspect of this type of communication. IDSs are often used to sniff network packets, providing a better understanding of what is happening in a particular network. The two main types of IDS are (1) host-based IDSs and (2) network-based IDSs. The detection methods used in IDSs are anomaly based and misuse based (also called signature or knowledge based), each with its own advantages and restrictions. In misuse-based detection, data gathered from the system is compared to a set of rules or patterns, also known as signatures, that describe network attacks. The core difference between the two techniques is that an anomaly-based IDS uses collections of data containing examples of normal behavior to build a model of familiarity; any action that deviates from the model is considered suspicious and is classified as an intrusion [20]. According to Mukkamala et al. [31], in misuse-based detection, attacks are represented by signatures or patterns. However, this approach contributes little to the detection of zero-day attacks. The main issue is how to build permanent signatures that cover all possible variations and non-intrusive activities, so as to lower the false-negative and false-positive alarms.
The KDDCUP’99 dataset [18] was derived in 1999 from the DARPA98 network traffic dataset and is a very popular benchmark dataset that was used in the International Knowledge Discovery in Databases (KDD) competition. The literature shows that this dataset is widely used for the evaluation of anomaly-based IDSs. Many machine learning techniques, either supervised or unsupervised, have been used to increase the efficacy of IDSs. Supervised learning techniques are applied to training data in which instances are tagged with labels, and each label indicates the class of a particular instance. Many supervised algorithms, such as k-nearest neighbor (KNN) [24], neural network (NN) [29], and support vector machine (SVM) [30], have been extensively used to detect intrusions. These algorithms build a model that assigns the correct label to a new, unseen example or instance. Researchers have reported many advantages and disadvantages of supervised learning for IDSs. One of the shortcomings of supervised learning is the need for labeled instances. The only dataset available for ID is the KDDCUP’99 dataset [18], and many new types of attacks have been developed since it was created. Therefore, this dataset is considered obsolete, and accuracy drops on new types of examples [22]. Many researchers still use the KDDCUP’99 dataset because it is the only publicly available dataset for the ID problem and useful information can still be extracted from it. Despite this disadvantage, supervised learning has the advantage of achieving better accuracy when classifying similar examples [22]. Unsupervised learning techniques deal with learning tasks on unlabeled or untagged data. Clustering is the most popular unsupervised learning technique [25]. In clustering, the learning algorithm finds similarities among instances to build clusters (i.e., groups of instances).
Instances that belong to the same cluster are assumed to have similar characteristics or properties and are therefore assigned to the same class. The disadvantage of unsupervised learning is the manual assignment of the number of clusters, which results in low prediction accuracy. However, it has the advantage of detecting new examples better than supervised learning techniques and is considered more robust in IDSs. According to Laskov et al. [22], because many new attacks have been developed and examples may be improperly labeled, unsupervised learning and SSL techniques could be the best choices for improving the accuracy of IDSs.
In view of the aforementioned developments in this area, the main objective of our work is not only to seek the smallest classification error but also to find a model that can incorporate new data while retaining good generalization ability. We compute the fuzziness of the classifier’s output for every unlabeled sample and try to discover its relationship with misclassification. Apart from [51], [53], we have not found any studies in the literature on generalization based on the fuzziness of a classifier. Therefore, building on our preliminary work [51], in which samples are categorized according to their fuzziness, we propose a new algorithm for IDSs. The experimental results demonstrate that samples belonging to the low and high fuzziness categories play an important role in improving the accuracy of IDSs.
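The fuzziness computation and the categorization of samples mentioned above can be illustrated with a minimal Python sketch. It assumes that the classifier output for each sample is a membership vector with entries in [0, 1] and that fuzziness is measured with the non-probabilistic entropy of De Luca and Termini, which appears in the reference list; the exact formula used in the paper is not shown in this section. The function names and the split into three equally sized groups are hypothetical choices made only for this example.

```python
import math


def fuzziness(memberships):
    """Fuzziness of one membership vector (assumed De Luca-Termini entropy).

    Each membership mu in [0, 1] contributes
    -(mu * log(mu) + (1 - mu) * log(1 - mu)); the sum is averaged over
    the vector. Memberships of exactly 0 or 1 contribute nothing.
    """
    total = 0.0
    for mu in memberships:
        for p in (mu, 1.0 - mu):
            if p > 0.0:
                total -= p * math.log(p)
    return total / len(memberships)


def split_by_fuzziness(outputs):
    """Rank samples by fuzziness and return (low, mid, high) index lists.

    Assumption for illustration: the lowest third and the highest third
    of the ranking form the low and high groups; the rest is the mid group.
    """
    order = sorted(range(len(outputs)), key=lambda i: fuzziness(outputs[i]))
    k = len(order) // 3
    low = order[:k]
    mid = order[k:len(order) - k]
    high = order[len(order) - k:]
    return low, mid, high


# Toy classifier outputs for six unlabeled samples (two classes each).
outputs = [
    [0.5, 0.5],
    [1.0, 0.0],
    [0.9, 0.1],
    [0.6, 0.4],
    [0.99, 0.01],
    [0.7, 0.3],
]
low, mid, high = split_by_fuzziness(outputs)
```

In this sketch a crisp output such as [1.0, 0.0] has zero fuzziness, while an undecided output such as [0.5, 0.5] has the maximum value, log 2. The low and high groups correspond to the categories that the experiments single out as useful.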
The rest of the paper is organized as follows. Section 2 presents the background of semi-supervised learning (SSL). Section 3 details the proposed fuzziness based algorithm using the neural network with random weights. The performance evaluation is presented in Section 4. Finally, Section 5 concludes the paper and provides future directions for this research.
Semi-supervised learning (SSL)
SSL is a combination of supervised and unsupervised learning techniques. It deals with learning tasks by utilizing both labeled and unlabeled data [65]. Labeled instances are expensive and time-consuming to obtain and require the efforts of domain experts. Unlabeled data, in contrast, can easily be obtained in many real-world applications. SSL methods assign labels by considering unlabeled instances, together with the labeled instances, and then build a
Proposed fuzziness based algorithm using the neural network with random weights for IDS
In this section, we first discuss fuzziness and introduce a fast learning mechanism for a single hidden layer feed-forward neural network (SLFN), i.e., the neural network with random weights, and then propose an algorithm for IDS.
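To make the idea of an SLFN with random weights concrete, the following is a minimal NumPy sketch. It assumes the commonly used scheme in which the hidden-node weights and biases are drawn at random and left fixed, and only the output weights are computed, here by a least-squares solution through the Moore-Penrose pseudo-inverse. The function names, the uniform range [-1, 1], the sigmoid activation, and the number of hidden nodes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def hidden_layer(X, W, b):
    """Sigmoid activations of the hidden layer for inputs X."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def train_random_weight_net(X, T, n_hidden=20, seed=0):
    """Fit an SLFN whose hidden parameters are random and never trained.

    X: (n_samples, n_features) inputs; T: (n_samples, n_classes) targets.
    Only the output weights beta are learned, in closed form.
    """
    rng = np.random.default_rng(seed)
    # Assumed initialization: uniform random weights and biases in [-1, 1].
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = hidden_layer(X, W, b)
    # Least-squares output weights via the pseudo-inverse of H.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta


def predict(X, W, b, beta):
    """Network outputs, one row per sample and one column per class."""
    return hidden_layer(X, W, b) @ beta


# Toy usage: an XOR-style problem with one-hot targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1, 1, 0])
T = np.eye(2)[labels]
W, b, beta = train_random_weight_net(X, T)
Y = predict(X, W, b, beta)
```

Because no iterative tuning of the hidden layer is involved, training reduces to a single linear solve, which is what makes this kind of network fast to fit.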
Performance evaluation
Tavallaee et al. [48] statistically analyzed the original KDDCUP’99 dataset and identified shortcomings that adversely affect the performance of evaluated systems and result in inefficient anomaly detection schemes. To counter these issues, they proposed an enhanced dataset called NSL-KDD [35], which provides a more efficient scheme for comparing various ID models. They also mentioned some advantages of the new dataset over the original dataset: (1) non-availability of redundant records
Conclusion
In this paper, we have designed a new SSL algorithm for improving classifier performance on ID datasets by investigating a divide-and-conquer strategy in which unlabeled samples with their predicted labels are categorized according to the magnitude of their fuzziness. We used the neural network with random weights as the base classifier because it is computationally efficient and has excellent learning performance. The hidden-node parameters (i.e., weights and biases) in this network are selected
Acknowledgments
The authors would like to extend their sincere appreciation to the Deanship of Scientific Research at King Saud University for its funding of this research through the Research Group Project no. RG-1435-048. This research is also supported by China Postdoctoral Science Foundation (2015M572361), Basic Research Project of Knowledge Innovation Program in Shenzhen (JCYJ20150324140036825), and National Natural Science Foundations of China (61503252 and 71371063).
References (65)
- et al. Fast decorrelated neural network ensembles with random weights. Inf. Sci. (2014)
- et al. A probabilistic learning algorithm for robust modeling using neural networks with random weights. Inf. Sci. (2015)
- et al. A definition of a nonprobabilistic entropy in the setting of fuzzy sets theory. Inf. Control (1972)
- et al. A novel semi-supervised learning for face recognition. Neurocomputing (2015)
- et al. Use of k-nearest neighbor classifier for intrusion detection. Comput. Secur. (2002)
- et al. Semi-supervised evolutionary ensembles for web video categorization. Knowl. Based Syst. (2015)
- et al. Intrusion detection using an ensemble of intelligent paradigms. J. Netw. Comput. Appl. (2005)
- et al. Local margin based semi-supervised discriminant embedding for visual recognition. Neurocomputing (2011)
- et al. Learning and generalization characteristics of the random vector functional-link net. Neurocomputing (1994)
- et al. Semi-supervised classification with privileged information. Int. J. Mach. Learn. Cybern. (2015)