ABSTRACT
This paper integrates big data and deep learning techniques to improve the performance of intrusion detection systems. Three classifiers are used to classify network traffic: a deep feed-forward neural network (DNN) and two ensemble techniques, Random Forest and Gradient Boosting Tree (GBT). To select the most relevant attributes from the datasets, we evaluate features with a homogeneity metric. The proposed method is evaluated on two recently published datasets, UNSW-NB15 and CICIDS2017, and the machine learning models are assessed with 5-fold cross-validation. We implemented the method in the distributed computing environment Apache Spark: the deep learning technique uses the Keras deep learning library integrated with Spark, while the ensemble techniques use the Apache Spark machine learning library. On the UNSW-NB15 dataset, DNN shows high accuracy, at 99.16% for binary classification and 97.01% for multiclass classification. On the CICIDS2017 dataset, GBT achieves the best accuracy for binary classification at 99.99%, while DNN has the highest accuracy for multiclass classification at 99.56%.
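As an illustration of the feature evaluation and validation steps named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the common entropy-based definition of homogeneity, h = 1 - H(C|K)/H(C), where C are the class labels and K are cluster assignments. Equal-width binning of each feature stands in for a real clustering step, and all function names and the toy data are hypothetical. The Spark and Keras pipeline itself is not shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def homogeneity(classes, clusters):
    """Homogeneity h = 1 - H(C|K) / H(C); 1.0 when every cluster holds one class."""
    h_c = entropy(classes)
    if h_c == 0.0:
        return 1.0  # only one class present: trivially homogeneous
    n = len(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    # H(C|K) = -sum over (c, k) of p(c, k) * log(p(c, k) / p(k))
    h_c_given_k = -sum(
        (n_ck / n) * math.log(n_ck / cluster_sizes[k])
        for (_, k), n_ck in joint.items()
    )
    return 1.0 - h_c_given_k / h_c

def bin_feature(values, n_bins=4):
    """Equal-width binning: an assumed stand-in for a real clustering step."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def rank_features(columns, classes, n_bins=4):
    """Return (name, score) pairs sorted from most to least homogeneous."""
    scores = {name: homogeneity(classes, bin_feature(vals, n_bins))
              for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) for k contiguous folds, without shuffling."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Toy data (hypothetical): 'duration' separates the classes, 'noise' does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "duration": [1.0, 1.2, 0.9, 1.1, 9.0, 9.5, 8.8, 9.2],
    "noise":    [1.0, 9.0, 1.0, 9.0, 1.0, 9.0, 1.0, 9.0],
}
ranking = rank_features(features, labels)
folds = list(k_fold_indices(len(labels), k=5))
```

In this toy example `duration` ranks above `noise`, because its bins each contain a single class. In practice the folds would usually be shuffled and stratified by class, and each classifier would be trained on the `train` indices and scored on the `test` indices of every fold.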