Hybrid Weighted K-Means Clustering and Artificial Neural Network for an Anomaly-Based Network Intrusion Detection System

Rafath Samrin; Devara Vasumathi

doi:10.1515/jisys-2016-0105

Open Access Published by De Gruyter November 15, 2016

Rafath Samrin
Rafath Samrin obtained her Bachelor’s degree in Computer Science and IT from Syed Hashim College of Science and Technology, Pregnapur, Medak Dist., in 2004, and her Master’s degree in Computer Science and Engineering from Jawaharlal Nehru Technological University, Hyderabad, in 2008. She is pursuing her PhD at Jawaharlal Nehru Technological University, Hyderabad, and is currently working as an Associate Professor in the Department of Computer Science and Engineering at Syed Hashim College of Science and Technology. Her areas of interest include data mining, networking, artificial neural networks, and databases. She is also a member of ISTE. Her current research interests are data mining and artificial neural networks.
and Devara Vasumathi
Devara Vasumathi obtained her Bachelor’s degree in Computer Science from Jawaharlal Nehru Technological University, Hyderabad. She then obtained her Master’s degree in Computer Science and her PhD in Computer Science and Engineering from the same university. She is a Professor of CSE at JNTUH, Hyderabad. Her areas of interest include big data analytics, data mining, databases, web mining, and networking. She is also a member of several professional bodies, such as the Computer Society of India (CSI & LMISTE). She has more than 100 research publications in international and national journals and conferences to her credit. She has chaired many sessions at international and national conferences and has delivered many lectures. She serves on various advisory committees as part of administration. She has visited Bangkok and Dubai to present research papers. She has guided eight PhD students, and four more are in progress.

From the journal Journal of Intelligent Systems

Abstract

Despite rapid developments in information technology, intruders remain among the most prominent threats to security. Network intrusion detection systems are now a standard component of network security architectures. In this paper, we present an intrusion detection scheme that combines a weighted K-means clustering algorithm with an artificial neural network (WKMC+ANN). The scheme comprises two modules: clustering and intrusion detection. In the clustering module, the input dataset is grouped into clusters using WKMC. In the intrusion detection module, an ANN is trained on each cluster and its structure is stored. In the testing process, each test record is classified by the ANN that corresponds to the cluster closest to it, according to a distance or similarity measure. For experimental evaluation, we used the KDD CUP 99 benchmark dataset, and the results show that the proposed technique achieves better accuracy than the existing technique.

Keywords: Intrusion detection; weighted K-means; artificial neural network; centroid; intruded data

1 Introduction

The use of the Internet and local networks is increasing, and consequently intrusions into computer systems are also increasing [27]. Because of increased network connectivity, computer systems are easily exposed to attacks by anyone. The chief aim of such attacks is to defeat the conventional security mechanisms of a system and carry out operations beyond the intruder’s authorization. Intruder operations include accessing secured or private information and doing malicious damage to the system or user files [20]. With suitably sophisticated tools, a security operator can readily detect malicious actions when they happen, and can monitor and report activities at the same time. Intrusion detection systems are vital for maintaining proper network security [2, 6, 27]. The key function of an intrusion detection system (IDS) is to monitor networked devices and discover anomalous or malicious behavior in the patterns of activity in the audit stream [4]. Such software monitors the events that occur in a computer system or network, analyzes them, finds suspected intrusions, and then raises an alarm. IDSs can be classified into two kinds: host-based and network-based. Host-based IDSs are also called host-based sensors, and network-based IDSs are also called network-based sensors [24].

Host-based technology is used to examine accessed files and executed applications [13]. A host-based IDS gathers data from an individual computer system, such as operating system audit trails, C2 audit logs, and system logs, and performs its task on that basis [1, 5]. Host-based intrusion detection is a powerful tool for recognizing impending attacks and for determining appropriate techniques to block future attempts. Because a host-based IDS uses audit logs, it is much more automated and supports refined and responsive detection approaches. Fundamentally, a host-based IDS monitors system, event, and security logs on Windows NT and syslog in UNIX environments. When any of these files changes, the IDS compares the new log entry with attack signatures to check whether there is a match [10]. A network-based IDS, in turn, is mostly used to detect the unauthorized use of a computer over a network such as the Internet [8]. It gathers an enormous amount of information from the network [1] and examines events as packets of data exchanged among computers (network traffic) [13], using raw network packets as its information source. A network-based IDS typically uses a network adapter in promiscuous mode to monitor and inspect all traffic over the network in real time. Its attack recognition module uses four fundamental methods to recognize an attack signature: (i) pattern, expression, or byte code matching; (ii) frequency or threshold crossing; (iii) correlation of lesser events; and (iv) statistical anomaly detection [10]. Host-based systems tend to detect more insider activity, whereas network-based systems tend to detect more incoming network activity. Both kinds of system can report to and/or alert the security officers in an appropriate way [13].

A good intrusion detection scheme can distinguish between normal and abnormal user activity [29]. Abnormalities in any event, state, content, or behavior are detected against a predefined standard. Data mining-based intrusion detection techniques can be divided according to their detection scheme into two major approaches [21]: misuse detection, which identifies intrusions by matching well-known attacks or weak spots of the system [15], and anomaly detection, which flags deviations from established normal usage patterns as intrusions [4, 15]. Among the data mining methods, clustering is one of the most important for intrusion detection. Clustering is the unsupervised classification of patterns into groups: it groups a set of objects into subsets whose members are similar to each other. It is a basic procedure for condensing and summarizing data, and it provides a summary of the stored information. Applications of clustering include family formation for group technology, image segmentation, information retrieval, web page grouping, market segmentation, and scientific and engineering analysis [22]. The clustering problem has been studied in many disciplines, such as statistics, pattern recognition, signal processing (e.g. vector quantization), and biology. Depending on the purpose, these communities have proposed a number of clustering techniques spanning different paradigms, such as partitional [17, 18], hierarchical [11], spectral [25], density-based [7], and mixture modeling. Partitional clustering methods produce a one-level (unnested) partitioning of the data points. If K is the desired number of clusters, a partitional method finds all K clusters at once. Because of its computational requirements, partitional clustering is currently considered the most suitable approach for clustering a large dataset [3]. The main aim of clustering in intrusion detection is to determine natural groupings of data in a large dataset so as to produce a concise representation of a system’s behavior [31].

We therefore present a novel clustering and classification method for improving the classification accuracy of a network intrusion detection scheme. The clustering method is weighted K-means clustering (WKMC) and the classification method is the artificial neural network (ANN). The rest of the article is organized as follows. Section 2 briefly reviews the work related to the proposed method. Section 3 presents the background of the research algorithm, and Section 4 describes the proposed intrusion detection scheme based on WKMC+ANN. Section 5 gives the detailed experimental results and discussion. Section 6 concludes the paper.

2 Literature Survey

Intrusion detection currently attracts considerable research interest because it helps preserve security over the network, and a number of intrusion detection methods have been reported. Singh et al. [26] proposed an intrusion detection system based on network profiling and an online sequential extreme learning machine. In this method, alpha profiling is used to reduce the time complexity, while irrelevant features are discarded using an ensemble of filtered, correlation-based, and consistency-based feature selection techniques. Beta profiling is used to reduce the size of the training dataset instead of sampling. The standard NSL-KDD 2009 (Network Security Laboratory-Knowledge Discovery and Data Mining) dataset is used to evaluate the performance of the method. The time and space complexity of the method are also discussed in that work.

Al-mamory and Jassim [3] presented a network intrusion detection system with two grain levels. In the normal case, when no intrusions are detected, the coarse-grained level is the most suitable for improving IDS performance. As soon as the coarse-grained IDS identifies an intrusion, the fine-grained level is activated to identify the probable attack details. A decision tree algorithm was used for both detection levels. The KDD CUP 99 offline dataset and a real traffic dataset were used to examine the efficacy of their model. Ravale et al. [23] reported a feature selection-based hybrid anomaly intrusion detection scheme using K-means and the radial basis function (RBF) kernel function. Among the chief challenges for intrusion detection are misjudgment, misdetection, and the lack of real-time response to attacks. Their hybrid technique combines data mining methods: the K-means clustering algorithm and the RBF kernel function of a support vector machine as the classification module.

Kumar and NandaMohan [14] stressed the importance of an effective IDS. Their scheme combines three techniques (K-means clustering, fuzzy logic, and neural networks), two of which are machine learning schemes, to build a practical intrusion detection scheme. The work showed the advantage of combining the K-means, fuzzy, and neural network techniques: it removes the need for human analysts to intervene. Gaddam et al. [9] presented a technique called “K-means+ID3” to improve the accuracy and efficiency of intrusion detection. This technique cascades K-means clustering and the ID3 decision tree to differentiate between anomalous and normal activities in a computer network, an active electronic circuit, and a mechanical mass-beam system. Zarrabi and Zarrabi [30] introduced an IDS offered as a cloud service to secure the client network effectively. It exploits the capabilities of the cloud while requiring only a simplified supply of the relevant information from the client network for evaluation. The model is intended to be extensible, allowing clients to subscribe to several classes of IDS simultaneously and to combine the strengths of their different outputs for more reliable IDS results. Because of the complexity of the cloud architecture, Mathew and Jose [19] emphasized the need for a deployment architecture for intrusion detection systems in the cloud. They identified and listed various problems in the cloud infrastructure and discussed how an IDS can be deployed and implemented effectively in the cloud.

3 Background of the Research Algorithm

We first explain the background of the proposed WKMC algorithm in this section.

3.1 K-Means Clustering Algorithm

K-means clustering is a cluster analysis algorithm that groups objects into K disjoint clusters on the basis of their attribute values. Objects with similar attribute values are placed in the same cluster. Here, K is a positive integer denoting the number of clusters, and it has to be specified in advance. The steps of the K-means clustering algorithm are as follows:

1. Define the number of clusters K.
2. Initialize the K cluster centroids. This can be done by arbitrarily dividing all objects into K clusters, computing their centroids (C), and verifying that all centroids are different from each other. Alternatively, the centroids can be initialized to K arbitrarily chosen, different objects.
3. Iterate over all objects and calculate their distances to the centroids of all clusters. Assign each object to the cluster with the nearest centroid.
4. Recalculate the centroids of the modified clusters.
5. Repeat steps 3 and 4 until the centroids no longer change.
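
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: the function name `kmeans`, the `seed` and `max_iter` parameters, and the handling of empty clusters are assumptions made for the example, and the centroids are initialized to K randomly chosen objects (the second option in step 2).

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 2: initialise the centroids to k randomly chosen objects.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 3: assign every object to the cluster with the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 5: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns two centroids and one label per point; the result can depend on the initial centroids, which is one of the drawbacks discussed next.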

The K-means algorithm is very effective for clustering large datasets; however, its drawbacks are also evident:

1. It can only cluster numeric data because it relies on Euclidean distance.
2. The clustering results depend on the input order of the objects and on the initial cluster centers.
3. The K-means algorithm uses Euclidean distance to measure the difference between data points. This measure includes all attributes of the data and assumes that every attribute contributes equally to the distance, which often leads to ambiguity, i.e. the so-called curse of dimensionality [5].

In the K-means algorithm, every data point has equal importance in determining the centroid of a cluster. This assumption is not always appropriate. Consequently, the clustering algorithm should take into account a weight associated with each data point when computing the cluster centers. The resulting extension of the K-means algorithm is known as weighted K-means (Figure 1).

Figure 1: Basic Diagram of Data Clustering.

4 Proposed IDS

The main objective of our work is to present an anomaly-based network IDS based on WKMC together with an ANN. Training and testing are the two chief procedures of the intrusion detection scheme, and the proposed strategy uses both to determine whether the input data are intruded or not. The dataset used in our proposed scheme is the KDD CUP 99 dataset. First, the data are preprocessed to remove unwanted data. The preprocessed data are then passed to a clustering module, in which WKMC splits the data into K_C clusters. Finally, the clustered data are passed to the classification module, in which the ANN classifies whether the given data are intruded or not. The complete structure of the proposed IDS is given in Figure 2.

Figure 2: Overall Diagram of the Proposed Network Intrusion Detection System.

4.1 WKMC-Based Clustering Module

The main aim of the weighted K-means clustering module is to partition a given set of data into clusters. The clustering module divides the training set (TR) into several subsets. As a result, the size and complexity of each training subset decrease, and the efficiency and effectiveness of the subsequent NN modules can be improved. WKMC is a data clustering algorithm in which every point is assigned to a cluster to a degree specified by a membership grade. It is based on minimizing the following objective function:

$$O(p, q) = \sum_{l=1}^{L} \sum_{p_i = l} \|w_i\|^2 \, \|a_i - q_l\|^2 \tag{1}$$

$$\phantom{O(p, q)} = \sum_{l=1}^{L} \sum_{p_i = l} \operatorname{dist}(a_i, q_l), \tag{2}$$

where $q = (q_1, \ldots, q_l, \ldots, q_L)^T$ denotes the vector of cluster centers, $p$ is the grouping vector whose entry $p_i = l$ assigns data point $a_i$ to cluster $l$, and $\operatorname{dist}(a_i, q_l) = \|w_i\|^2 \|a_i - q_l\|^2$ denotes the weighted Euclidean distance between data point $a_i$ and cluster center $q_l$. The objective function O is thus a sum of weighted Euclidean distances that must be minimized with respect to p and q. O is not convex in p and q jointly; however, when p is fixed, O is convex with respect to q. The optimal partition can therefore be found by alternately refining the grouping vector p and the cluster centers q, step by step. The detailed iterative process of the proposed technique is as follows:

1. Consider the data points and the corresponding weights:
$a = [a_1, a_2, \ldots, a_N]$ and $w = [w_1, w_2, \ldots, w_N]$.
2. Fix the maximum number of iterations, $t_{\max}$. The initial L-partition of the data set is represented as $p^{(0)}$, and the set of initial L cluster centers is $q^{(0)} = (q_1^{(0)}, \ldots, q_l^{(0)}, \ldots, q_L^{(0)})^T$, where the cluster center $q_l^{(0)}$ is computed by setting the derivative of O to zero:
$$\frac{\partial O}{\partial q_l^{(0)}} = \frac{\partial}{\partial q_l^{(0)}} \left[ \sum_{l'=1}^{L} \sum_{p_i^{(0)} = l'} \|w_i\|^2 \, \|a_i - q_{l'}^{(0)}\|^2 \right] \tag{3}$$
$$\phantom{\frac{\partial O}{\partial q_l^{(0)}}} = 2 \sum_{p_i^{(0)} = l} \|w_i\|^2 \, q_l^{(0)} - 2 \sum_{p_i^{(0)} = l} \|w_i\|^2 \, a_i = 0. \tag{4}$$
It follows that $q_l^{(0)}$ has the closed-form solution
$$q_l^{(0)} = \frac{\sum_{p_i^{(0)} = l} \|w_i\|^2 \, a_i}{\sum_{p_i^{(0)} = l} \|w_i\|^2}, \quad i = 1, \ldots, N;\ l = 1, \ldots, L.$$
3. When the set of cluster centers q is fixed, the objective function is minimized over p by assigning each data point to its nearest cluster center so as to minimize $\operatorname{dist}(a_i, q_l)$; hence, the membership can be written as
$$p_i^{(t+1)} = \arg\min_{l} \operatorname{dist}(a_i, q_l^{(t)}) \tag{5}$$
$$\phantom{p_i^{(t+1)}} = \arg\min_{l} \|w_i\|^2 \, \|a_i - q_l^{(t)}\|^2. \tag{6}$$
If $\operatorname{dist}(a_i, q_l^{(t)})$ is the minimum distance over all cluster centers, then $p_i = l$, which means that the $i$th element belongs to the $l$th subgroup.
4. According to the memberships $p^{(t+1)}$ obtained in step 3, the new cluster center $q_l^{(t+1)}$ is computed using Eq. (7):
$$q_l^{(t+1)} = \frac{\sum_{p_i^{(t+1)} = l} \|w_i\|^2 \, a_i}{\sum_{p_i^{(t+1)} = l} \|w_i\|^2}, \quad i = 1, 2, \ldots, N;\ l = 1, 2, \ldots, L. \tag{7}$$
5. Repeat steps 3 and 4 until the cluster centers have converged or the iteration index $t > t_{\max}$. Because the initial clusters are chosen arbitrarily, the procedure cannot guarantee a global optimum; it therefore has to be run several times with different initial clusters, and the partition with the minimum value of the objective function is selected.
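
The iteration above can be illustrated with a short Python sketch. It is not the authors' MATLAB implementation: it assumes scalar weights (so that ∥w_i∥² is simply w_i²), initializes the centers to L randomly chosen data points instead of an initial partition, and the names `weighted_kmeans` and `best_of_restarts`, as well as the use of the objective O to pick the best restart, are choices made for the example.

```python
import math
import random


def weighted_kmeans(points, weights, L, t_max=100, seed=0):
    """Weighted K-means sketch: returns (centers, memberships, objective)."""
    rng = random.Random(seed)
    sq_w = [w * w for w in weights]      # ||w_i||^2 for scalar weights (assumption)
    centers = rng.sample(points, L)      # assumed initialisation: L random data points
    p = [0] * len(points)
    dim = len(points[0])
    for _ in range(t_max):
        # Eqs. (5)-(6): assign each point to the centre with the smallest
        # weighted squared distance.
        p = [
            min(range(L), key=lambda l: sq_w[i] * math.dist(a, centers[l]) ** 2)
            for i, a in enumerate(points)
        ]
        # Eq. (7): each new centre is the weighted mean of its members.
        new_centers = []
        for l in range(L):
            idx = [i for i, lab in enumerate(p) if lab == l]
            total = sum(sq_w[i] for i in idx)
            if total > 0:
                new_centers.append(tuple(
                    sum(sq_w[i] * points[i][d] for i in idx) / total
                    for d in range(dim)
                ))
            else:
                new_centers.append(centers[l])  # empty cluster: keep old centre
        if new_centers == centers:               # centres have converged
            break
        centers = new_centers
    # Eq. (1): value of the objective for the final centres and memberships.
    objective = sum(
        sq_w[i] * math.dist(a, centers[p[i]]) ** 2 for i, a in enumerate(points)
    )
    return centers, p, objective


def best_of_restarts(points, weights, L, restarts=5):
    """Step 5: run from several initialisations and keep the lowest objective."""
    runs = [weighted_kmeans(points, weights, L, seed=s) for s in range(restarts)]
    return min(runs, key=lambda run: run[2])
```

Note that, for a given point, the factor ∥w_i∥² is the same for every candidate center, so it does not change which center is nearest in Eq. (6); the weights influence the result through the weighted means in Eq. (7) and through the value of O.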

After the clustering process, we obtain K_C clusters, each defined by its centroid. The number of clusters K_C also determines the number of ANNs: one ANN is trained per cluster. The flow diagram of the proposed WKMC algorithm is shown in Figure 3.

Figure 3: Flow Diagram of the Proposed WKMC Algorithm.

4.2 ANN-Based Intrusion Detection Module

Each output cluster obtained from the WKMC algorithm is used to train one of the K_C ANN classifiers. Current IDSs suffer from low detection accuracy and insufficient robustness against new and rare security breaches. To improve the efficiency of the IDS, we apply the classification process after the clustering process. Here, the number of clusters and the number of neural networks are identical. The ANN module learns the characteristic features of each subset. An ANN is a biologically inspired form of distributed computation. The network connects an input layer and an output layer through one or more hidden layers in between. The training of the K_C ANN classifiers on the WKMC output clusters is shown in Figure 4.

Figure 4: Structure of the Neural Network Training Stage.

The ANN module is chiefly used to learn the pattern of each subset. An ANN is a biologically inspired form of distributed computation, composed of simple processing units and the connections between them. In this paper, we employ classic feed-forward neural networks trained with the back-propagation algorithm to predict intrusion. A feed-forward neural network has an input layer, an output layer, and one or more hidden layers between them. The ANN operates as follows.

Each node x in the input layer holds a value $I_x$ as the network’s input, which is multiplied by a weight $w_{xy}$ between the input layer and the hidden layer. Each node y in the hidden layer receives the signal $H(y)$, as follows:

$$H(y) = \theta_y + \sum_{x=1}^{n} I_x \, w_{xy}, \tag{8}$$

which is then passed through the bipolar sigmoid activation function given in Eq. (9):

$$f(I) = \frac{2}{1 + \exp(-I)} - 1. \tag{9}$$

The output of the activation function, $f(H(y))$, is then broadcast to all of the neurons in the output layer:

$$O_k = \theta_k + \sum_{y=1}^{m} w_{yk} \, f(H(y)), \tag{10}$$

where $\theta_y$ and $\theta_k$ are the biases in the hidden layer and the output layer, respectively. The output value is compared with the target; in this paper, we use the mean squared error as the error function:

$$ER = \frac{1}{2n} \sum_{k} (T_k - O_k)^2, \tag{11}$$

where n represents the number of training patterns, $O_k$ represents the output value, and $T_k$ represents the target value. If the error is large, the weights are updated repeatedly until the minimum error is obtained. The weights are adjusted according to Eq. (12):

$$w(t+1) = w(t) - \eta \, \frac{\partial ER(t)}{\partial w(t)}, \tag{12}$$

where t represents the epoch number and $\eta$ represents the learning rate. To speed up the convergence of the error during learning, a momentum term with momentum gain $\beta$ is incorporated into Eq. (12):

$$w(t+1) = w(t) - \eta \, \frac{\partial ER(t)}{\partial w(t)} + \beta \, \Delta w(t). \tag{13}$$

Here, the value of $\beta$ is between 0 and 1. Once the minimum error is reached, the weights of that ANN structure are stored and passed to the testing process.
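
Equations (8)-(13) can be traced with a small Python sketch. It assumes a single hidden layer, stores the weights as nested lists (`w_hidden[x][y]` and `w_out[y][k]`, both hypothetical layouts), and takes the gradient in the momentum update as an input, because the back-propagation derivation of the gradient is not reproduced here. The default values of `eta` and `beta` are illustrative only.

```python
import math


def bipolar_sigmoid(x):
    """Eq. (9): bipolar sigmoid, with outputs in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0


def forward(inputs, w_hidden, theta_hidden, w_out, theta_out):
    """Forward pass through one hidden layer, following Eqs. (8)-(10)."""
    # Eq. (8): H(y) = theta_y + sum_x I_x * w_xy
    hidden_net = [
        theta_hidden[y] + sum(inputs[x] * w_hidden[x][y] for x in range(len(inputs)))
        for y in range(len(theta_hidden))
    ]
    # Eq. (9) applied to every hidden node.
    hidden_act = [bipolar_sigmoid(h) for h in hidden_net]
    # Eq. (10): O_k = theta_k + sum_y w_yk * f(H(y))
    return [
        theta_out[k] + sum(w_out[y][k] * hidden_act[y] for y in range(len(hidden_act)))
        for k in range(len(theta_out))
    ]


def error(targets, outputs, n):
    """Eq. (11): squared error scaled by 1 / (2n)."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / (2.0 * n)


def momentum_step(w, grad, prev_delta, eta=0.1, beta=0.9):
    """Eq. (13) for one weight: gradient step plus momentum term.

    grad is dER/dw at the current epoch and prev_delta is the previous
    weight change; returns the new weight and the change just applied.
    """
    delta = -eta * grad + beta * prev_delta
    return w + delta, delta
```

Setting `beta=0` in `momentum_step` reduces the update to the plain gradient step of Eq. (12).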

4.3 Testing Phase

After the training phase, we carry out the testing procedure. The test data are first passed to the preprocessing stage. The preprocessed data are then passed to the clustering procedure, where they are grouped according to WKMC. Each clustered record is then delivered to the corresponding neural network, whose trained weights are used in the testing procedure. On the basis of these weights, we obtain a score value. Finally, we check whether the given data are intruded or not on the basis of a threshold value Th. A score above the threshold means the data are intruded; otherwise, the data are normal. The score is thus evaluated with Eq. (14) to classify the data:

$$\text{Result} = \begin{cases} \text{data are normal}, & Th \ge \text{score}, \\ \text{data are intruded}, & Th < \text{score}. \end{cases} \tag{14}$$
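
The testing phase can be sketched as follows. The trained networks are represented by hypothetical callables that map a record to a score; in practice each would wrap the stored ANN weights of one cluster. The function names and the use of plain Euclidean distance to find the closest cluster center are assumptions made for this illustration.

```python
import math


def nearest_cluster(sample, centers):
    """Index of the cluster centre closest to the sample (Euclidean distance)."""
    return min(range(len(centers)), key=lambda l: math.dist(sample, centers[l]))


def classify(sample, centers, classifiers, threshold):
    """Route the sample to its cluster's classifier and apply Eq. (14).

    classifiers[l] is a hypothetical callable standing in for the ANN
    trained on cluster l; it returns a score for the sample.
    """
    l = nearest_cluster(sample, centers)
    score = classifiers[l](sample)
    # Eq. (14): normal if Th >= score, intruded if Th < score.
    return "intruded" if score > threshold else "normal"
```

For example, with two cluster centers and two stub classifiers that return fixed scores of 0.2 and 0.9, a record near the first center is labeled normal and a record near the second is labeled intruded when the threshold is 0.5.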

5 Results and Discussion

In this section, we discuss the results obtained with the proposed method. We implemented the method in MATLAB version 7.12. The experiments were run on a Windows machine with an Intel Core i5 processor at 1.6 GHz and 4 GB RAM. The KDD CUP 99 dataset was used to evaluate performance.

5.1 Dataset Description

The KDD CUP 99 dataset is used to examine and assess the performance of the proposed technique. KDD CUP 99 is a version of the original 1998 DARPA intrusion detection evaluation program data. It is also one of the publicly available datasets that contain actual attacks [16]; we therefore used it to design and analyze our intrusion detection scheme. The dataset was built from 9 weeks of TCP dump data. The collected network traffic comprises both normal and malicious connections, with five million records as training data and two million records as test data. Each record is described by 41 features and is labeled as either normal or an attack. The attacks found in the training and testing data can be categorized into four kinds: probe (PROBE), remote to local, denial of service, and user to root [28]. The KDD CUP 99 data are available in three different files: the KDD full dataset with 4,898,431 instances, the KDD CUP 10% dataset with 494,021 instances, and the KDD corrected dataset with 311,029 instances. Of the 41 attributes, 38 are continuous or discrete numeric attributes and three are categorical. Each instance is labeled as either normal or one specific attack. The dataset comprises one normal class and 22 different attacks, for a total of 23 class labels, and the 22 attacks fall into the four categories mentioned above [12]. The full KDD CUP 99 dataset is very large and expensive to process in experiments. Consequently, we used only the 10% subset of the KDD CUP 99 dataset for our experiments.

5.2 Evaluation Metrics

The evaluation metrics used for the proposed technique are based on the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). TP is the number of attacks that are correctly classified; a TP indicates that the intrusion detection scheme has correctly detected an attack. TN is the number of valid records that are correctly classified; a TN indicates that the IDS has not made an error in recognizing a normal record. FP is the number of records that were wrongly classified as attacks although they are in fact valid; an FP indicates a false detection of an attack by the IDS. FPs are frequently caused by loose detection conditions, and they limit the accuracy of the detection scheme. FN is the number of records that were wrongly classified as valid activity although they are in fact attacks; an FN indicates that the IDS failed to detect an intrusion after an attack occurred. The performance of our intrusion detection scheme is evaluated using (i) accuracy, (ii) sensitivity, (iii) specificity, (iv) mean absolute percentage error (MAPE), (v) root mean square error (RMSE), (vi) mean absolute deviation (MAD), and (vii) mean square error (MSE); the first three are computed from TP, TN, FP, and FN.

Accuracy
The accuracy of our scheme is obtained using the expression below:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
Accuracy is the probability that the proposed scheme correctly predicts positive and negative examples.
Sensitivity
Sensitivity is the probability that the algorithm correctly predicts positive examples:
$$\text{Sensitivity} = \frac{TP}{TP + FN}.$$
Specificity
Specificity is the probability that the algorithm correctly predicts negative examples:
$$\text{Specificity} = \frac{TN}{TN + FP}.$$
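
The three ratios can be computed directly from the four counts. The short Python sketch below shows this; the counts in the usage note are invented purely to illustrate the arithmetic and are not results from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all records (attacks and normal) classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """Fraction of actual attacks that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of normal records that were recognised as normal."""
    return tn / (tn + fp)
```

With hypothetical counts TP = 80, TN = 90, FP = 10, and FN = 20, the accuracy is 170/200 = 0.85, the sensitivity is 80/100 = 0.80, and the specificity is 90/100 = 0.90.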

5.3 Comparative Analysis

In this section, our proposed method is compared with the existing K-means clustering+ANN method. The accuracy, specificity, sensitivity, mean absolute percentage error (MAPE), and MAD of the proposed WKMC+ANN and the existing method are evaluated for cluster sizes 10, 15, 20, 25, and 30; the accuracy, sensitivity, and specificity are shown in Table 1.

Table 1: Accuracy, Sensitivity, and Specificity Obtained for Cluster Sizes 10, 15, 20, 25, and 30.

Cluster size	Accuracy, proposed (%)	Accuracy, existing (%)	Sensitivity, proposed (%)	Sensitivity, existing (%)	Specificity, proposed (%)	Specificity, existing (%)
10	88	78	80	76	66	75
15	89	81	83	74	68	71
20	92	84	76	61	68	78
25	88	78	73	65	69	68
30	89	84	83	75	59	71

Table 1 shows that the accuracy of the proposed technique is 88% for cluster size 10 and 89% for cluster size 15, which is higher than the accuracy of the existing method (78% for cluster size 10 and 81% for cluster size 15); the same holds for the other cluster sizes. The maximum accuracy in Table 1 is 92%, obtained for cluster size 20. Similarly, the sensitivity of the proposed technique is 80% for cluster size 10 and 83% for cluster size 15, better than that of the existing scheme, which has 76% sensitivity for cluster size 10 and 74% for cluster size 15. According to Table 1, however, the specificity of the proposed method exceeds that of the existing method only for cluster size 25 (69% vs. 68%); for the other cluster sizes, the existing method has the higher specificity.
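
The per-metric comparison in Table 1 can be tallied mechanically. The Python sketch below transcribes the table values and counts, for each metric, the cluster sizes at which the proposed method scores higher; the names `TABLE_1`, `wins`, and `margins` are introduced for this example only.

```python
# Table 1 values, transcribed as (proposed %, existing %) per cluster size.
TABLE_1 = {
    "accuracy":    {10: (88, 78), 15: (89, 81), 20: (92, 84), 25: (88, 78), 30: (89, 84)},
    "sensitivity": {10: (80, 76), 15: (83, 74), 20: (76, 61), 25: (73, 65), 30: (83, 75)},
    "specificity": {10: (66, 75), 15: (68, 71), 20: (68, 78), 25: (69, 68), 30: (59, 71)},
}


def wins(table):
    """Number of cluster sizes at which the proposed method scores higher."""
    return {
        metric: sum(1 for proposed, existing in rows.values() if proposed > existing)
        for metric, rows in table.items()
    }


def margins(table):
    """Proposed minus existing, in percentage points, per metric and cluster size."""
    return {
        metric: {size: proposed - existing for size, (proposed, existing) in rows.items()}
        for metric, rows in table.items()
    }
```

Calling `wins(TABLE_1)` gives 5 of 5 cluster sizes for accuracy, 5 of 5 for sensitivity, and 1 of 5 for specificity; `margins(TABLE_1)` gives the corresponding differences in percentage points.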

5.3.1 Comparative Analysis of Other Important Measures

In this section, we compare our proposed work with the existing work based on the following metrics: MAPE, MSE, and RMSE. This comparison is also used to prove the effectiveness of the approach.

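MSE:

$$\text{MSE} = \frac{1}{N} \sum_{t=1}^{N} (P_t - D_t)^2.$$

MAPE:

$$\text{MAPE} = \frac{1}{N} \sum_{t=1}^{N} \left| \frac{P_t - D_t}{P_t} \right|.$$

RMSE:

$$\text{RMSE} = \sqrt{\frac{\sum_{t=1}^{N} (P_t - D_t)^2}{N}}.$$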
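
The three error measures can be computed as in the following Python sketch. The symbols P_t and D_t are not defined in the text; the sketch treats them as two equal-length sequences `p` and `d` (for instance predicted and desired values, which is an assumption), squares the difference in the MSE so that the RMSE is its square root, and divides by P_t in the MAPE as in the formula above.

```python
import math


def mse(p, d):
    """Mean square error between two equal-length sequences."""
    return sum((pt - dt) ** 2 for pt, dt in zip(p, d)) / len(p)


def mape(p, d):
    """Mean absolute percentage error, normalised by P_t (P_t must be non-zero)."""
    return sum(abs((pt - dt) / pt) for pt, dt in zip(p, d)) / len(p)


def rmse(p, d):
    """Root mean square error: the square root of the MSE."""
    return math.sqrt(mse(p, d))
```

For the illustrative sequences `p = [2.0, 4.0]` and `d = [1.0, 2.0]`, the MSE is (1 + 4)/2 = 2.5, the MAPE is (0.5 + 0.5)/2 = 0.5, and the RMSE is the square root of 2.5.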

Figure 5 compares the MAPE of the two methods for varying cluster counts. The MAPE of our proposed method is 0.16 for cluster size 10 and 0.22 for cluster size 15, whereas the corresponding values for the existing approach are 0.22 and 0.25. Likewise, Figure 6 compares the RMSE for varying cluster counts. The RMSE of our proposed method is 0.08 for cluster size 10 and 0.12 for cluster size 15, which is lower than that of the existing method: 0.16 for cluster size 10 and 0.19 for cluster size 15. The MSE of our proposed method is also lower than the MSE of the existing method (Figure 7), which indicates that our proposed method performs better than the existing IDS.

Figure 5: Performance Comparison of MAPE Plots by Varying Cluster Counts.

Figure 6: Performance Comparison of RMSE Plots by Varying Cluster Counts.

Figure 7: Performance Comparison of MSE Plots by Varying Cluster Counts.

6 Conclusion

Network security is currently among the main concerns because of the various attacks and vulnerabilities on the Internet, and intrusion detection is a significant component of the solution. In this article, we proposed a novel IDS based on WKMC and an ANN classifier. WKMC is the clustering technique used here, and it delivers better clustering output than the conventional K-means clustering technique. Classification is performed by the ANN classifier: after the training procedure, the test dataset is given as input and the classified output is obtained. The experimental results on the KDD CUP 99 dataset demonstrate the effectiveness of our method, showing that it delivers improved classification accuracy compared with the existing technique.

About the authors

Rafath Samrin

Rafath Samrin obtained the Bachelor’s degree in Computer Science and IT department from Syed Hashim College of Science and Technology, Pregnapur, Medak Dist in 2004. Then she obtained Master’s degree in Computer Science and Engineering in 2008 from Jawaharlal Nehru Technological University, Hyderabad. She is doing her PhD from Jawaharlal Nehru Technological University, Hyderabad. She is currently working as Associate Professor in the department of Computer Science and Engineering at Syed Hashim College of Science and Technology. Her area of interest includes Data Mining, Networking, Artificial Neural Network, Data Bases etc. She has also member of ISTE.Her current research interest in Data mining and Artificial Neural Networks.

Devara Vasumathi

Devara Vasumathi obtained her Bachelor’s degree in Computer Science from Jawaharlal Nehru Technological University, Hyderabad. She then obtained her Master’s degree in Computer Science and her PhD in Computer Science and Engineering from Jawaharlal Nehru Technological University, Hyderabad. She is Professor of CSE at JNTUH, Hyderabad. Her areas of interest include Big Data Analytics, Data Mining, Databases, Web Mining, and Networking. She is also a member of many professional bodies, such as the Computer Society of India (CSI&LMISTE). She has more than 100 research publications in international and national journals and conferences to her credit. She has chaired many sessions at international and national conferences and has delivered many lectures. She serves on various advisory committees as part of administration. She has visited Bangkok and Dubai to present research papers. She has guided eight PhD students, and four more are in progress.

Bibliography

[1] A. O. Adetunmbi, S. O. Falaki, O. S. Adewale and B. K. Alese, Network intrusion detection based on rough set and K-nearest neighbour, Int. J. Comput. ICT Res.2 (2008), 60–66.Search in Google Scholar

[2] J. Allen, A. Christie and W. Fithen, State of the Practice of Intrusion Detection Technologies, Technical Report, CMU/SEI-99-TR-028, 2000.10.21236/ADA375846Search in Google Scholar

[3] S. O. Al-mamory and F. S. Jassim, On the designing of two grains levels network intrusion detection system, J. Karbala Int. J. Modern Sci.1 (2015), 15–25.10.1016/j.kijoms.2015.07.002Search in Google Scholar

[4] M. M. Campos and B. L. Milenova, Creation and deployment of data mining-based intrusion detection systems in Oracle Database 10g, in: Proceedings of the Fourth International Conference on Machine Learning and Applications, 2005.Search in Google Scholar

[5] B. R. Cha, K. W. Park and J. H. Seo, Neural network techniques for host anomaly intrusion detection using fixed pattern transformation, in: International Conference on Computational Science and Its Applications (ICCSA2005), LNCS 3481, pp. 254–263, Springer, Berlin, Heidelberg, 2005.10.1007/11424826_27Search in Google Scholar

[6] B. V. Dasarathy, Intrusion detection, Inform. Fusion4 (2003), 243–245.10.1016/j.inffus.2003.08.003Search in Google Scholar

[7] M. Ester, H. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proc. of KDD, AAAI, Palo Alto, California, USA, 1996.Search in Google Scholar

[8] D. M. Farid and M. Z. Rahman, Anomaly network intrusion detection based on improved self adaptive Bayesian algorithm, J. Comput.5 (2010), 23–31.10.4304/jcp.5.1.23-31Search in Google Scholar

[9] S. R. Gaddam, V. V. Phoha and K. S. Balagani, K-means+ID3: a novel method for supervised anomaly detection by cascading K-means clustering and ID3 decision tree learning methods, IEEE Trans. Knowl. Data Eng.19 (2007), 345–354.10.1109/TKDE.2007.44Search in Google Scholar

[10] ISS Internet Security Systems, Network- vs. Host-Based Intrusion Detection: A Guide to Intrusion Detection, White Paper, Atlanta, GA, 1998.Search in Google Scholar

[11] C. J. Jardine, N. Jardine and C. Sibson, The structure and construction of taxonomic hierarchies, Math. Bio-sci.1 (1967), 173–179.10.1016/0025-5564(67)90032-6Search in Google Scholar

[12] M. Jin, Z. Xu, R. Li and D. Wu, Fuzzy ARTMAP ensemble based decision making and application, Math. Probl. Eng.2013 (2013). Article ID 124263, 7 pages. Available at: http://dx.doi.org/10.1155/2013/124263.10.1155/2013/124263Search in Google Scholar

[13] H. Kozushko, Intrusion Detection: Host-Based and Network-Based Intrusion Detection Systems, White Paper from Independent Study, 2003. Available at: https://pdfs.semanticscholar.org/471b/6047150e82d5b94cbcf1fed36586dcf929c1.pdf.Search in Google Scholar

[14] K. S. A. Kumar and V. NandaMohan, Novel anomaly intrusion detection using neuro-fuzzy inference system, IJCSNS Int. J. Comput. Sci. Netw. Secur.8 (2008), 6–11.Search in Google Scholar

[15] W. Lee and S. J. Stolfo, Data mining approaches for intrusion detection, in: Proceedings of the 7th USENIX Security Symposium, San Antonio, TX, January 26–29, 1998.Search in Google Scholar

[16] W. Lee, S. Stolfo and K. Mok, A data mining framework for building intrusion detection model, in: Proceedings of the IEEE Symposium on Security and Privacy, pp. 120–132, IEEE Computer Society Press, Oakland, CA, 1999.Search in Google Scholar

[17] S. P. Lloyd, Least square quantization in PCM, IEEE Trans. Inform. Theory28 (1982), 129–136.10.1109/TIT.1982.1056489Search in Google Scholar

[18] J. B. MacQueen, Some method for classification and analysis of multivariate observations, in: Proceedings of Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, University of California Press, Berkeley, 1967.Search in Google Scholar

[19] S. Mathew and A. P. Jose, Securing cloud from attacks based on intrusion detection system, Int. J. Adv. Res. Comput. Commun. Eng.1 (2012), 753–759.Search in Google Scholar

[20] N. Naidu and R.V. Dharaskar, An effective approach to network intrusion detection system using genetic algorithm, Int. J. Comput. Appl.1 (2010), 26–32.10.5120/89-188Search in Google Scholar

[21] S. Noel, D. Wijesekera and C. Youman, Modern intrusion detection, data mining, and degrees of attack guilt, in: Applications of Data Mining in Computer Security, pp. 2–25, Kluwer Academic Publishers, 2002.10.1007/978-1-4615-0953-0_1Search in Google Scholar

[22] D. T. Pham and A. A. Afify, Clustering techniques and their applications in engineering, Proc. Inst. Mech. Eng. Pt. C J. Mech. Eng. Sci.221 (2007), 1445–1460.10.1243/09544062JMES508Search in Google Scholar

[23] U. Ravale, N. Marathe and P. Padiya, Feature selection based hybrid anomaly intrusion detection system using K means and RBF kernel function, Proc. Comput. Sci.45 (2015), 428–435.10.1016/j.procs.2015.03.174Search in Google Scholar

[24] A. Sarmah, Intrusion Detection Systems: Definition, Need and Challenges, White Paper from SANS Institute, SANS Institute, Swansea, UK, 2001.Search in Google Scholar

[25] J. Shi and J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell.22 (2000), 888–905.10.1109/34.868688Search in Google Scholar

[26] R. Singh, H. Kumarb and R. K. Singla, An intrusion detection system using network traffic profiling and online sequential extreme learning machine, Expert Syst. Appl.42 (2015), 8609–8624.10.1016/j.eswa.2015.07.015Search in Google Scholar

[27] J. T. Yao, S. L. Zhao and L. V. Saxton, A study on fuzzy intrusion detection, in: Proceedings of the Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security, vol. 5812, pp. 23–30, SPIE, Orlando, FL, USA, 2005.Search in Google Scholar

[28] K. Yoshida, Entropy based intrusion detection, in: Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, vol. 2, pp. 840–843, 2003.Search in Google Scholar

[29] A. Zainal, M. A. Maarof and S. M. Shamsudin, Research issues in adaptive intrusion detection, in: Proceedings of the 2nd Postgraduate Annual Research Seminar (PARS’06), Faculty of Computer Science & Information Systems, pp. 24–25, Universiti Teknologi Malaysia, 2006.Search in Google Scholar

[30] A. Zarrabi and A. Zarrabi, Internet intrusion detection system service in a cloud, Int. J. Comput. Sci.9 (2012), 308–315.Search in Google Scholar

[31] Y. Zhao and G. Karypis, Empirical and theoretical comparisons of selected criterion functions for document clustering, Mach. Learn.55 (2004), 311–331.10.1023/B:MACH.0000027785.44527.d6Search in Google Scholar

Received: 2016-7-6

Published Online: 2016-11-15

Published in Print: 2018-3-28

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
