Abstract
Most research in the field of network intrusion detection relies heavily on datasets. Datasets in this field, however, are scarce and difficult to reproduce. To compare, evaluate, and test related work, researchers usually need the same datasets, or at least datasets with characteristics similar to those used in related work. In this work, we present concepts and the Intrusion Detection Dataset Toolkit (ID2T) to alleviate the problem of reproducing datasets with desired characteristics, enabling an accurate replication of scientific results. ID2T facilitates the creation of labeled datasets by injecting synthetic attacks into background traffic. The injected synthetic attacks blend with the background traffic by mimicking its properties.
This article makes three core contributions. First, we present a comprehensive survey of intrusion detection datasets. In the survey, we propose a classification that groups the negative qualities found in the datasets. Second, the architecture of ID2T is revised, improved, and expanded in comparison to previous work. The architectural changes enable ID2T to inject recent and advanced attacks, such as the EternalBlue exploit or a peer-to-peer botnet. ID2T also provides a set of tests, known as TIDED, that helps identify potential defects in the background traffic into which attacks are injected. Third, we illustrate how ID2T is used in different use-case scenarios to replicate scientific results with the help of reproducible datasets. ID2T is open source software and is made available to the community to expand its arsenal of attacks and capabilities.
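The idea of injecting labeled synthetic attacks that mimic a property of the background traffic can be illustrated with a small sketch. This is not ID2T's implementation or API. The names `Packet`, `mean_interarrival`, and `inject_attack` are hypothetical, the trace is a simplified in-memory list, and the only background property mimicked here is the mean packet inter-arrival time, which is an assumption chosen for brevity.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class Packet:
    """Hypothetical, simplified packet record (not an ID2T type)."""
    timestamp: float  # seconds since the start of the capture
    src: str
    dst: str
    label: str = "benign"


def mean_interarrival(packets):
    """Average gap between consecutive packets in the trace."""
    times = sorted(p.timestamp for p in packets)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return mean(gaps)


def inject_attack(background, attacker, victim, count, start, seed=0):
    """Return a new time-ordered trace containing `count` labeled attack packets.

    Gaps between attack packets are drawn from an exponential distribution
    whose mean equals the background's mean inter-arrival time, so the
    injected packets follow the pacing of the background traffic.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    gap = mean_interarrival(background)
    t = start
    attack = []
    for _ in range(count):
        t += rng.expovariate(1.0 / gap)
        attack.append(Packet(t, attacker, victim, label="attack"))
    # Merge and re-sort so attack packets interleave with the background.
    return sorted(background + attack, key=lambda p: p.timestamp)


# Assumed toy background: 20 benign packets, one every 0.5 seconds.
background = [Packet(0.5 * i, "10.0.0.1", "10.0.0.2") for i in range(20)]
dataset = inject_attack(background, "10.0.0.99", "10.0.0.2", count=5, start=2.0)
labels = [p.label for p in dataset]
```

Because the random generator is seeded, running the sketch twice yields the same labeled trace, which is the reproducibility property the abstract emphasizes. A real toolkit would mimic many more properties than packet pacing.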
Index Terms
- On Generating Network Traffic Datasets with Synthetic Attacks for Intrusion Detection