An empirical study of text-based machine learning models for vulnerability detection

Written by: Kollin Napier, Tanmay Bhowmik, Shaowei Wang

Published in: Empirical Software Engineering | Issue 2/2023

Published: 1 March 2023


Abstract

As vulnerabilities increase in complexity and severity, they are becoming harder to identify and mitigate. Although traditional tools remain useful, machine learning models are being adopted to expand detection efforts. To help explore methods of vulnerability detection, we present an empirical study on the effectiveness of text-based machine learning models, utilizing 344 open-source projects, 2,182 vulnerabilities and 38 vulnerability types. Because vulnerabilities are often available in forms such as code snippets, we construct a methodology based on extracted source code functions and create equal pairings. We conduct experiments using seven machine learning models, five natural language processing techniques and three data processing methods. First, we present results based on full context function pairings. Next, we introduce condensed functions and conduct a statistical analysis to determine whether there is a significant difference between the models, techniques, or methods. Based on these results, we answer research questions regarding model prediction for testing within and across projects and vulnerability types. Our results show that condensed functions with fewer features may achieve better prediction results when testing within, rather than across, projects and vulnerability types. Overall, we conclude that text-based machine learning models are not effective in detecting vulnerabilities within or across projects and vulnerability types.
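To make the idea of treating functions as text and testing across projects more concrete, the following is a minimal Python sketch. It is not the authors' pipeline: the toy records, the project names, and the helper names `tokenize`, `bag_of_words` and `cross_project_split` are all hypothetical, and a simple token count stands in for the natural language processing techniques the study actually compares.

```python
import re
from collections import Counter

# Hypothetical toy records: (project, label, function source).
# label 1 = vulnerable, 0 = non-vulnerable. Not taken from the study's dataset.
SAMPLES = [
    ("proj_a", 1, "void f(char *s) { char b[8]; strcpy(b, s); }"),
    ("proj_a", 0, "void f(char *s) { char b[8]; strncpy(b, s, 7); b[7] = 0; }"),
    ("proj_b", 1, "int g(int n) { int *p = malloc(n); return p[0]; }"),
    ("proj_b", 0, "int g(int n) { int *p = malloc(n); if (!p) return -1; return p[0]; }"),
]


def tokenize(code):
    """Split source text into identifiers, numbers and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def bag_of_words(code):
    """Represent a function as token counts, ignoring token order."""
    return Counter(tokenize(code))


def cross_project_split(samples, held_out):
    """Train on every project except `held_out`, test on `held_out` only."""
    train = [s for s in samples if s[0] != held_out]
    test = [s for s in samples if s[0] == held_out]
    return train, test


if __name__ == "__main__":
    train, test = cross_project_split(SAMPLES, "proj_b")
    print("train projects:", sorted({project for project, _, _ in train}))
    print("test projects:", sorted({project for project, _, _ in test}))
    print("features of first training function:", bag_of_words(train[0][2]))
```

A within-project evaluation would instead divide the functions of a single project into training and test sets. Any classifier could then be trained on the resulting count vectors; the choice of classifier is deliberately left open here.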




Title: An empirical study of text-based machine learning models for vulnerability detection
Written by: Kollin Napier, Tanmay Bhowmik, Shaowei Wang
Publication date: 1 March 2023
Publisher: Springer US
Published in: Empirical Software Engineering / Issue 2/2023
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI: https://doi.org/10.1007/s10664-022-10276-6
