Skip to main content
Erschienen in: Empirical Software Engineering 2/2023

01.03.2023

An empirical study of text-based machine learning models for vulnerability detection

verfasst von: Kollin Napier, Tanmay Bhowmik, Shaowei Wang

Erschienen in: Empirical Software Engineering | Ausgabe 2/2023

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

With an increase in complexity and severity, it is becoming harder to identify and mitigate vulnerabilities. Although traditional tools remain useful, machine learning models are being adopted to expand efforts. To help explore methods of vulnerability detection, we present an empirical study on the effectiveness of text-based machine learning models by utilizing 344 open-source projects, 2,182 vulnerabilities and 38 vulnerability types. With the availability of vulnerabilities being presented in forms such as code snippets, we construct a methodology based on extracted source code functions and create equal pairings. We conduct experiments using seven machine learning models, five natural language processing techniques and three data processing methods. First, we present results based on full context function pairings. Next, we introduce condensed functions and conduct a statistical analysis to determine if there is a significant difference between the models, techniques, or methods. Based on these results, we answer research questions regarding model prediction for testing within and across projects and vulnerability types. Our results show that condensed functions with fewer features may achieve greater prediction results when testing within rather than across. Overall, we conclude that text-based machine learning models are not effective in detecting vulnerabilities within or across projects and vulnerability types.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
10
The original database link provided by the paper is unavailable, but an alternative link was found: https://​github.​com/​announce/​vcc-base
 
22
CVE Details does provide a disclaimer that the site and all data are provided “as is”, meaning it is not guaranteed to be accurate or complete.
 
Literatur
Zurück zum Zitat Bates S, Cozby P (2017) Methods in behavioral research. McGraw-Hill Education, New York Bates S, Cozby P (2017) Methods in behavioral research. McGraw-Hill Education, New York
Zurück zum Zitat Cor K, Sood G (2018) Pwned: How often are Americans’ online accounts breached? arXiv:1808.01883 Cor K, Sood G (2018) Pwned: How often are Americans’ online accounts breached? arXiv:1808.​01883
Zurück zum Zitat Czerwonka J, Greiler M, Tilford J (2015) Code reviews do not find bugs. How the current code review best practice slows us down. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering, vol 2. IEEE, pp 27–28. https://doi.org/10.1109/ICSE.2015.131 Czerwonka J, Greiler M, Tilford J (2015) Code reviews do not find bugs. How the current code review best practice slows us down. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering, vol 2. IEEE, pp 27–28. https://​doi.​org/​10.​1109/​ICSE.​2015.​131
Zurück zum Zitat Dowd M, McDonald J, Schuh J (2006) The art of software security assessment: Identifying and preventing software vulnerabilities. Pearson Education Dowd M, McDonald J, Schuh J (2006) The art of software security assessment: Identifying and preventing software vulnerabilities. Pearson Education
Zurück zum Zitat Fischer F, Böttinger K, Xiao H, Stransky C, Acar Y, Backes M, Fahl S (2017) Stack overflow considered harmful? the impact of copy&paste on android application security. In: 2017 IEEE symposium on security and privacy (SP). https://doi.org/10.1109/SP.2017.31. IEEE, pp 121–136 Fischer F, Böttinger K, Xiao H, Stransky C, Acar Y, Backes M, Fahl S (2017) Stack overflow considered harmful? the impact of copy&paste on android application security. In: 2017 IEEE symposium on security and privacy (SP). https://​doi.​org/​10.​1109/​SP.​2017.​31. IEEE, pp 121–136
Zurück zum Zitat Grieco G, Grinblat GL, Uzal L, Rawat S, Feist J, Mounier L (2016) Toward large-scale vulnerability discovery using machine learning. In: Proceedings of the 6th ACM conference on data and application security and privacy, pp 85–96. https://doi.org/10.1145/2857705.2857720 Grieco G, Grinblat GL, Uzal L, Rawat S, Feist J, Mounier L (2016) Toward large-scale vulnerability discovery using machine learning. In: Proceedings of the 6th ACM conference on data and application security and privacy, pp 85–96. https://​doi.​org/​10.​1145/​2857705.​2857720
Zurück zum Zitat Harer JA, Kim LY, Russell RL, Ozdemir O, Kosta LR, Rangamani A, Hamilton LH, Centeno GI, Key JR, Ellingwood PM et al (2018) Automated software vulnerability detection with machine learning. arXiv:1803.04497 Harer JA, Kim LY, Russell RL, Ozdemir O, Kosta LR, Rangamani A, Hamilton LH, Centeno GI, Key JR, Ellingwood PM et al (2018) Automated software vulnerability detection with machine learning. arXiv:1803.​04497
Zurück zum Zitat Koroteev M (2021) Bert: A review of applications in natural language processing and understanding. arXiv:2103.11943 Koroteev M (2021) Bert: A review of applications in natural language processing and understanding. arXiv:2103.​11943
Zurück zum Zitat Lin G, Zhang J, Luo W, Pan L, Xiang Y (2017) Poster: Vulnerability discovery with function representation learning from unlabeled projects. In: Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pp 2539–2541. https://doi.org/10.1145/3133956.3138840 Lin G, Zhang J, Luo W, Pan L, Xiang Y (2017) Poster: Vulnerability discovery with function representation learning from unlabeled projects. In: Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pp 2539–2541. https://​doi.​org/​10.​1145/​3133956.​3138840
Zurück zum Zitat Perl H, Dechand S, Smith M, Arp D, Yamaguchi F, Rieck K, Fahl S, Acar Y (2015) VCCFinder: Finding potential vulnerabilities in open-source projects to assist code audits. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp 426–437. https://doi.org/10.1145/2810103.2813604 Perl H, Dechand S, Smith M, Arp D, Yamaguchi F, Rieck K, Fahl S, Acar Y (2015) VCCFinder: Finding potential vulnerabilities in open-source projects to assist code audits. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp 426–437. https://​doi.​org/​10.​1145/​2810103.​2813604
Zurück zum Zitat Piessens F (2002) A taxonomy of causes of software vulnerabilities in internet software. In: Supplementary Proceedings of the 13th international symposium on software reliability engineering. Citeseer, pp 47–52 Piessens F (2002) A taxonomy of causes of software vulnerabilities in internet software. In: Supplementary Proceedings of the 13th international symposium on software reliability engineering. Citeseer, pp 47–52
Zurück zum Zitat Shin Y, Williams L (2008) An empirical model to predict security vulnerabilities using code complexity metrics. In: Proceedings of the 2nd ACM-IEEE international symposium on Empirical software engineering and measurement, pp 315–317. https://doi.org/10.1145/1414004.1414065 Shin Y, Williams L (2008) An empirical model to predict security vulnerabilities using code complexity metrics. In: Proceedings of the 2nd ACM-IEEE international symposium on Empirical software engineering and measurement, pp 315–317. https://​doi.​org/​10.​1145/​1414004.​1414065
Zurück zum Zitat Shu X, Tian K, Ciambrone A, Yao D (2017) Breaking the target: An analysis of target data breach and lessons learned. arXiv:1701.04940 Shu X, Tian K, Ciambrone A, Yao D (2017) Breaking the target: An analysis of target data breach and lessons learned. arXiv:1701.​04940
Zurück zum Zitat Sultana KZ, Deo A, Williams BJ (2016) A preliminary study examining relationships between nano-patterns and software security vulnerabilities. In: 2016 IEEE 40th annual computer software and applications conference (COMPSAC). https://doi.org/10.1109/COMPSAC.2016.34, vol 1. IEEE, pp 257–262 Sultana KZ, Deo A, Williams BJ (2016) A preliminary study examining relationships between nano-patterns and software security vulnerabilities. In: 2016 IEEE 40th annual computer software and applications conference (COMPSAC). https://​doi.​org/​10.​1109/​COMPSAC.​2016.​34, vol 1. IEEE, pp 257–262
Zurück zum Zitat Tang G, Meng L, Wang H, Ren S, Wang Q, Yang L, Cao W (2020) A comparative study of neural network techniques for automatic software vulnerability detection. In: 2020 international symposium on theoretical aspects of software engineering (TASE). IEEE, pp 1–8. https://doi.org/10.1109/TASE49443.2020.00010 Tang G, Meng L, Wang H, Ren S, Wang Q, Yang L, Cao W (2020) A comparative study of neural network techniques for automatic software vulnerability detection. In: 2020 international symposium on theoretical aspects of software engineering (TASE). IEEE, pp 1–8. https://​doi.​org/​10.​1109/​TASE49443.​2020.​00010
Zurück zum Zitat Zhu M (2004) Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo. Waterloo 2(30):6 Zhu M (2004) Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo. Waterloo 2(30):6
Zurück zum Zitat Zhu Y, Kiros R, Zemel R, Salakhutdinov R, Urtasun R, Torralba A, Fidler S (2015) Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE international conference on computer vision, pp 19–27. https://doi.org/10.1109/ICCV.2015.11 Zhu Y, Kiros R, Zemel R, Salakhutdinov R, Urtasun R, Torralba A, Fidler S (2015) Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE international conference on computer vision, pp 19–27. https://​doi.​org/​10.​1109/​ICCV.​2015.​11
Metadaten
Titel
An empirical study of text-based machine learning models for vulnerability detection
verfasst von
Kollin Napier
Tanmay Bhowmik
Shaowei Wang
Publikationsdatum
01.03.2023
Verlag
Springer US
Erschienen in
Empirical Software Engineering / Ausgabe 2/2023
Print ISSN: 1382-3256
Elektronische ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-022-10276-6

Weitere Artikel der Ausgabe 2/2023

Empirical Software Engineering 2/2023 Zur Ausgabe

Premium Partner