
2024 | OriginalPaper | Chapter

Machine Learning Metrics for Network Datasets Evaluation

Authors: Dominik Soukup, Daniel Uhříček, Daniel Vašata, Tomáš Čejka

Published in: ICT Systems Security and Privacy Protection

Publisher: Springer Nature Switzerland

Abstract

The chapter explores the challenges of using traditional methods like deep packet inspection for network traffic analysis, highlighting the promising results of machine learning in detecting security events even in encrypted traffic. It emphasizes the importance of enhancing ML automation with Active Learning and data drift detection methods to ensure consistent and reliable results across different environments. The authors propose three novel metrics to evaluate the quality and suitability of network traffic datasets, addressing the limitations of existing approaches. These metrics are designed to universally assess linear and non-linear multi-class tasks and can be used to more accurately assess dataset quality over time. The chapter includes a detailed evaluation of the proposed metrics on publicly available datasets, demonstrating their benefits and added value compared to existing solutions. The findings underscore the need for additional dataset evaluation techniques in both scientific research and the production deployment of ML technologies in network security.

Metadata

Title: Machine Learning Metrics for Network Datasets Evaluation
Authors: Dominik Soukup, Daniel Uhříček, Daniel Vašata, Tomáš Čejka
Copyright Year: 2024
DOI: https://doi.org/10.1007/978-3-031-56326-3_22