DOI: 10.1145/2857705.2857720
research-article

Toward Large-Scale Vulnerability Discovery using Machine Learning

Published: 09 March 2016

ABSTRACT

With the sustained growth of software complexity, finding security vulnerabilities in operating systems has become a necessity. Operating systems now ship with thousands of binary executables. Unfortunately, methodologies and tools for testing programs at OS scale within a limited time budget are still missing.

In this paper we present an approach that applies machine learning techniques to lightweight static and dynamic features to predict whether a test case is likely to contain a software vulnerability. To show the effectiveness of our approach, we set up a large experiment to detect easily exploitable memory corruptions using 1,039 Debian programs obtained from the Debian bug tracker. We collected 138,308 unique execution traces and statically explored 76,083 different subsequences of function calls. We managed to predict with reasonable accuracy which programs contained dangerous memory corruptions.
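As a rough illustration of how subsequences of function calls can become features for a classifier, the sketch below counts contiguous call n-grams in a trace and turns each trace into a fixed-length count vector. This is a minimal sketch under stated assumptions, not the paper's actual feature pipeline: the function names (`call_ngrams`, `build_vocabulary`, `vectorize`), the choice of bigrams, and the two example traces are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def call_ngrams(trace: Sequence[str], n: int = 2) -> Counter:
    """Count every contiguous subsequence of n calls in one trace."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A trace shorter than n yields an empty range, hence an empty Counter.
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))


def build_vocabulary(traces: Iterable[Sequence[str]], n: int = 2) -> List[Tuple[str, ...]]:
    """Collect the sorted set of n-grams seen across all traces."""
    vocab = set()
    for trace in traces:
        vocab.update(call_ngrams(trace, n))
    return sorted(vocab)


def vectorize(trace: Sequence[str], vocabulary: List[Tuple[str, ...]], n: int = 2) -> List[int]:
    """Map one trace to a count vector aligned with the vocabulary."""
    counts = call_ngrams(trace, n)
    return [counts[gram] for gram in vocabulary]


# Hypothetical library-call traces, invented for illustration only.
traces = [
    ["fopen", "fread", "strcpy", "fclose"],
    ["fopen", "fread", "fread", "fclose"],
]
vocabulary = build_vocabulary(traces)
vectors = [vectorize(t, vocabulary) for t in traces]
```

Vectors of this kind, paired with a label saying whether the test case triggered a memory corruption, could then be passed to any off-the-shelf classifier.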

We also developed and implemented VDiscover, a tool that uses state-of-the-art machine learning techniques to predict vulnerabilities in test cases. The tool will be released as open source to encourage research on vulnerability discovery at a large scale, together with VDiscovery, a public dataset that collects the raw analyzed data.


Published in

CODASPY '16: Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy
March 2016
340 pages
ISBN: 9781450339353
DOI: 10.1145/2857705

Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

CODASPY '16 Paper Acceptance Rate: 22 of 115 submissions, 19%
Overall Acceptance Rate: 149 of 789 submissions, 19%
