Toward Large-Scale Vulnerability Discovery using Machine Learning

Authors:
Gustavo Grieco, CIFASIS-CONICET, Rosario, Argentina
Guillermo Luis Grinblat, CIFASIS-CONICET, Rosario, Argentina
Lucas Uzal, CIFASIS-CONICET, Rosario, Argentina
Sanjay Rawat, Vrije Universiteit, Amsterdam, Netherlands
Josselin Feist, VERIMAG, Université Grenoble Alps, Grenoble, France
Laurent Mounier, VERIMAG, Université Grenoble Alps, Grenoble, France

CODASPY '16: Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, March 2016, Pages 85–96. https://doi.org/10.1145/2857705.2857720

Published: 09 March 2016

ABSTRACT

As software complexity continues to grow, finding security vulnerabilities in operating systems has become a necessity. Today, operating systems ship with thousands of binary executables. Unfortunately, methodologies and tools for testing programs at OS scale within a limited time budget are still missing.

In this paper we present an approach that uses lightweight static and dynamic features, combined with machine learning techniques, to predict whether a test case is likely to contain a software vulnerability. To show the effectiveness of our approach, we set up a large experiment to detect easily exploitable memory corruptions using 1,039 Debian programs obtained from the Debian bug tracker. We collected 138,308 unique execution traces and statically explored 76,083 different subsequences of function calls. We managed to predict with reasonable accuracy which programs contained dangerous memory corruptions.
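
To make the idea of features built from subsequences of function calls concrete, here is a minimal sketch in Python. It is not the paper's actual feature extraction: it assumes a trace is simply an ordered list of library call names, and the function names `call_ngrams` and `bag_of_ngrams`, the `max_n` parameter, and the example trace are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def call_ngrams(trace: List[str], n: int) -> Iterable[Tuple[str, ...]]:
    """Yield every contiguous subsequence of n calls in a trace."""
    for i in range(len(trace) - n + 1):
        yield tuple(trace[i:i + n])


def bag_of_ngrams(trace: List[str], max_n: int = 3) -> Counter:
    """Count all call subsequences of length 1..max_n in one trace."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        counts.update(call_ngrams(trace, n))
    return counts


# Hypothetical trace of C library calls (illustrative only, not from the dataset).
trace = ["fopen", "fread", "strcpy", "fread", "strcpy", "fclose"]
features = bag_of_ngrams(trace, max_n=2)

# Print the most frequent subsequences first.
for ngram, count in features.most_common():
    print(" -> ".join(ngram), count)
```

Counting subsequences this way turns a variable-length trace into a sparse, fixed-vocabulary representation that a standard classifier can consume, in the same spirit as bag-of-words models for text.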

We also developed and implemented VDiscover, a tool that uses state-of-the-art machine learning techniques to predict vulnerabilities in test cases. The tool will be released as open source to encourage research on large-scale vulnerability discovery, together with VDiscovery, a public dataset that collects the raw analyzed data.
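
As a rough illustration of the prediction step, the sketch below trains a tiny logistic regression classifier on call-count features and scores each test case. This is an assumption-laden toy, not VDiscover's model or interface: the abstract does not name a specific classifier, logistic regression is used here only because it fits in a few lines, and the helper names, hyperparameters, and labeled examples are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Features = Dict[str, float]


def predict_proba(weights: Dict[str, float], bias: float, x: Features) -> float:
    """Return the estimated probability that a test case is vulnerable."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in x.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(data: List[Tuple[Features, int]],
          epochs: int = 200,
          lr: float = 0.5) -> Tuple[Dict[str, float], float]:
    """Fit logistic regression with plain stochastic gradient descent."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            # Gradient of the log loss with respect to the linear score.
            err = predict_proba(weights, bias, x) - y
            bias -= lr * err
            for name, value in x.items():
                weights[name] = weights.get(name, 0.0) - lr * err * value
    return weights, bias


# Hypothetical labeled test cases: call counts, and 1 = vulnerable, 0 = not.
data = [
    ({"strcpy": 2, "fread": 1}, 1),
    ({"strcpy": 1, "gets": 1}, 1),
    ({"strncpy": 2, "fread": 1}, 0),
    ({"fgets": 1, "strncpy": 1}, 0),
]

weights, bias = train(data)
for x, y in data:
    print(y, round(predict_proba(weights, bias, x), 3))
```

In practice a dataset of this kind is usually heavily imbalanced, with far fewer vulnerable than non-vulnerable cases, so a real pipeline would also need held-out evaluation and some form of class rebalancing rather than scoring the training set as done here.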

Index Terms

Toward Large-Scale Vulnerability Discovery using Machine Learning
1. Security and privacy
  1. Security services
    1. Access control
2. Software and its engineering
  1. Software organization and properties
    1. Software functional properties
      1. Formal methods
        Automated static analysis
        Software verification

Published in
CODASPY '16: Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy
March 2016
340 pages
ISBN: 9781450339353
DOI: 10.1145/2857705
General Chairs: Elisa Bertino (Purdue University, USA), Ravi Sandhu (University of Texas at San Antonio, USA)
Program Chair: Alexander Pretschner (Technische Universität München, Germany)
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
dynamic analysis
machine learning
static analysis
vulnerability detection
Qualifiers
- research-article
Acceptance Rates
CODASPY '16 Paper Acceptance Rate: 22 of 115 submissions, 19%
Overall Acceptance Rate: 149 of 789 submissions, 19%