Published in: Empirical Software Engineering 2/2023

01-03-2023

An empirical study of text-based machine learning models for vulnerability detection

Authors: Kollin Napier, Tanmay Bhowmik, Shaowei Wang



Abstract

As vulnerabilities grow in complexity and severity, they are becoming harder to identify and mitigate. Although traditional tools remain useful, machine learning models are being adopted to extend these efforts. To explore such methods of vulnerability detection, we present an empirical study of the effectiveness of text-based machine learning models, using 344 open-source projects, 2,182 vulnerabilities and 38 vulnerability types. Because vulnerabilities are often available in forms such as code snippets, we construct a methodology based on extracted source code functions and create equal pairings. We conduct experiments using seven machine learning models, five natural language processing techniques and three data processing methods. First, we present results based on full-context function pairings. Next, we introduce condensed functions and conduct a statistical analysis to determine whether there is a significant difference between the models, techniques, or methods. Based on these results, we answer research questions regarding model prediction when testing within and across projects and vulnerability types. Our results show that condensed functions with fewer features may achieve better prediction results when testing within, rather than across, projects and vulnerability types. Overall, we conclude that text-based machine learning models are not effective in detecting vulnerabilities within or across projects and vulnerability types.
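The distinction between testing within and across projects can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy functions, the project names, the regex tokenizer, and the reading of "equal pairings" as an equal number of vulnerable and non-vulnerable functions are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy data: (project, function source text, label); 1 = vulnerable.
# Each vulnerable function is followed by a non-vulnerable counterpart.
SAMPLES = [
    ("proj_a", "void copy(char *d, char *s) { strcpy(d, s); }", 1),
    ("proj_a", "void copy(char *d, char *s, size_t n) { strncpy(d, s, n); }", 0),
    ("proj_a", "void fmt(char *m) { printf(m); }", 1),
    ("proj_a", "void fmt(char *m) { printf(\"%s\", m); }", 0),
    ("proj_b", "int rd(void) { char buf[8]; return gets(buf) != 0; }", 1),
    ("proj_b", "int rd(void) { char buf[8]; return fgets(buf, 8, stdin) != 0; }", 0),
]


def tokenize(source):
    """Split source text into identifier-like tokens, treating code as plain text."""
    return re.findall(r"[A-Za-z_]\w*", source)


def bag_of_words(source):
    """Count token occurrences; a stand-in for one text-based feature technique."""
    return Counter(tokenize(source))


def is_balanced(samples):
    """Check the assumed 'equal pairing': as many vulnerable as non-vulnerable."""
    counts = Counter(label for _, _, label in samples)
    return counts[0] == counts[1]


def cross_project_split(samples, held_out):
    """Across projects: train on every other project, test on the held-out one."""
    train = [s for s in samples if s[0] != held_out]
    test = [s for s in samples if s[0] == held_out]
    return train, test


def within_project_split(samples, project, n_train):
    """Within a project: train and test on samples from the same project."""
    own = [s for s in samples if s[0] == project]
    return own[:n_train], own[n_train:]
```

Calling `cross_project_split(SAMPLES, "proj_b")` trains only on `proj_a` functions, whereas `within_project_split(SAMPLES, "proj_a", 2)` keeps both sets inside one project. The same idea would apply to vulnerability types by grouping on a type field instead of a project name.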


Footnotes
10
The original database link provided by the paper is unavailable, but an alternative link was found: https://github.com/announce/vcc-base
 
22
CVE Details does provide a disclaimer that the site and all data are provided “as is”, meaning it is not guaranteed to be accurate or complete.
 
Metadata
Title
An empirical study of text-based machine learning models for vulnerability detection
Authors
Kollin Napier
Tanmay Bhowmik
Shaowei Wang
Publication date
01-03-2023
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 2/2023
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-022-10276-6
