Published in: Neural Computing and Applications 20/2021

17 May 2021 | Original Article

Deep neural-based vulnerability discovery demystified: data, model and performance

Authors: Guanjun Lin, Wei Xiao, Leo Yu Zhang, Shang Gao, Yonghang Tai, Jun Zhang


Abstract

Detecting source-code-level vulnerabilities during the development phase is a cost-effective way to prevent potential attacks at the software deployment stage. Many machine learning-based solutions, including deep learning-based ones, have been proposed to aid vulnerability discovery. However, these approaches were mainly evaluated on self-constructed or self-collected datasets, and the lack of a unified baseline dataset makes it difficult to evaluate their effectiveness. To bridge this gap, we construct a function-level vulnerability dataset from scratch, provided as source-code-label pairs. To evaluate the constructed dataset, we build a function-level vulnerability detection framework that incorporates six mainstream neural network models as vulnerability detectors. We perform experiments to investigate the performance behaviors of the neural model-based detectors using source code as raw input with continuous Bag-of-Words neural embeddings. Empirical results reveal that the variants of recurrent neural networks and the convolutional neural network perform well on our dataset, as the former are capable of handling contextual information and the latter learns features from small context windows. In terms of generalization ability, the fully connected network outperforms the other network architectures. The performance evaluation can serve as a reference benchmark for neural model-based vulnerability detection at function-level granularity. Our dataset can serve as ground truth for ML-based function-level vulnerability detection and a baseline for evaluating relevant approaches.
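
The input side of such a pipeline, source-code-label pairs fed through continuous Bag-of-Words (CBOW) embeddings, can be illustrated with a minimal sketch. It shows only how a function's source is split into tokens and how CBOW training pairs (surrounding context tokens, centre token) are formed; it does not train embeddings or a detector. The names `tokenize` and `cbow_pairs`, the regular expression, the window size of 2, and the two toy samples are all assumptions made for illustration and are not taken from the paper.

```python
import re
from typing import List, Tuple

# Hypothetical tokenizer (an assumption, not the paper's): identifiers,
# integer literals, and any other single non-whitespace character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source: str) -> List[str]:
    """Split raw function source code into a flat token sequence."""
    return TOKEN_RE.findall(source)


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Build CBOW training pairs: (context tokens, centre token).

    The context is up to `window` tokens on each side of the centre token.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # a single-token sequence has no context to predict from
            pairs.append((context, target))
    return pairs


# Toy source-code-label pairs, invented for illustration
# (label 1 = vulnerable, 0 = non-vulnerable).
samples = [
    ("void f(char *s) { char b[8]; strcpy(b, s); }", 1),
    ("int add(int a, int b) { return a + b; }", 0),
]

for source, label in samples:
    tokens = tokenize(source)
    pairs = cbow_pairs(tokens, window=2)
    print(label, len(tokens), pairs[0])
```

In a fuller setup, pairs like these would be used to train an embedding model, and each function's sequence of token vectors would then be passed to a classifier together with its label.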


Metadata
Title
Deep neural-based vulnerability discovery demystified: data, model and performance
Authors
Guanjun Lin
Wei Xiao
Leo Yu Zhang
Shang Gao
Yonghang Tai
Jun Zhang
Publication date
17 May 2021
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 20/2021
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-021-05954-3
