ABSTRACT
With the sustained growth of software complexity, finding security vulnerabilities in operating systems has become a necessity. Operating systems now ship with thousands of binary executables. Unfortunately, methodologies and tools for testing programs at OS scale within a limited time budget are still missing.
In this paper, we present an approach that applies machine learning techniques to lightweight static and dynamic features in order to predict whether a test case is likely to contain a software vulnerability. To show the effectiveness of our approach, we set up a large experiment to detect easily exploitable memory corruptions using 1039 Debian programs obtained from the Debian bug tracker. We collected 138,308 unique execution traces and statically explored 76,083 different subsequences of function calls. We managed to predict with reasonable accuracy which programs contained dangerous memory corruptions.
We also developed and implemented VDiscover, a tool that uses state-of-the-art machine learning techniques to predict vulnerabilities in test cases. The tool will be released as open source to encourage research on vulnerability discovery at a large scale, together with VDiscovery, a public dataset that collects the raw analyzed data.
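To make the idea of features built from subsequences of function calls more concrete, the following is a minimal sketch, not the VDiscover implementation. It assumes a trace is a plain list of called function names and that fixed-length contiguous subsequences (n-grams) are counted to form a feature vector; the helper names `call_ngrams` and `featurize`, the choice of n = 2, and the sample traces are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

CallGram = Tuple[str, ...]


def call_ngrams(trace: Sequence[str], n: int) -> Iterable[CallGram]:
    """Yield every contiguous subsequence of n calls in the trace."""
    for i in range(len(trace) - n + 1):
        yield tuple(trace[i:i + n])


def featurize(trace: Sequence[str], vocabulary: List[CallGram], n: int = 2) -> List[int]:
    """Count how often each vocabulary n-gram occurs in the trace."""
    counts = Counter(call_ngrams(trace, n))
    return [counts[gram] for gram in vocabulary]


# Illustrative traces only; real traces would come from a tracer or static analysis.
traces = [
    ["fopen", "fread", "strcpy", "fclose"],
    ["fopen", "fread", "fclose"],
]

# Assumption: the vocabulary is every n-gram observed across all traces.
vocabulary = sorted({gram for trace in traces for gram in call_ngrams(trace, 2)})
vectors = [featurize(trace, vocabulary) for trace in traces]
```

Vectors of this kind could then be passed to any standard classifier; which features and models the tool actually uses is not stated in the abstract.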
Index Terms
- Toward Large-Scale Vulnerability Discovery using Machine Learning