DOI: 10.1145/3230833.3230856
research-article

Discovering software vulnerabilities using data-flow analysis and machine learning

Published: 27 August 2018

ABSTRACT

We present a novel method for static analysis in which we combine data-flow analysis with machine learning to detect SQL injection (SQLi) and Cross-Site Scripting (XSS) vulnerabilities in PHP applications. We assembled a dataset from the National Vulnerability Database and the SAMATE project, containing vulnerable PHP code samples and their patched versions, in which the vulnerability has been fixed. We extracted features from the code samples by applying data-flow analysis techniques, including reaching definitions analysis, taint analysis, and reaching constants analysis. We then used these features to train several probabilistic classifiers. To demonstrate the effectiveness of our approach, we built a tool called WIRECAML and compared it with other tools for vulnerability detection in PHP code. Of the tools compared, ours performed best at detecting both SQLi and XSS vulnerabilities. We also applied our approach to a number of open-source software applications and found a previously unknown vulnerability in a photo-sharing web application.
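
To make the feature-extraction idea concrete, the following is a minimal Python sketch of reaching definitions analysis followed by a simple taint propagation over a toy control-flow graph. It is an illustration under stated assumptions, not WIRECAML's implementation: the `Stmt` class, the flags `source`, `sanitizer`, and `sink`, the two feature names, and the two hand-built example programs are all hypothetical, and the abstract does not specify the actual feature set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stmt:
    """One statement of a toy CFG (hypothetical representation)."""
    target: Optional[str] = None  # variable defined here, if any
    uses: tuple = ()              # variables read here
    succ: tuple = ()              # indices of successor statements
    source: bool = False          # assumption: reads user input, e.g. $_GET
    sanitizer: bool = False       # assumption: result is considered clean
    sink: bool = False            # assumption: e.g. a database query call


def reaching_definitions(stmts):
    """Return, per statement, the set of definitions (by index) reaching it."""
    preds = [[] for _ in stmts]
    for i, s in enumerate(stmts):
        for j in s.succ:
            preds[j].append(i)
    in_sets = [set() for _ in stmts]
    out_sets = [set() for _ in stmts]
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for i, s in enumerate(stmts):
            in_sets[i] = set().union(*(out_sets[p] for p in preds[i]))
            if s.target is None:
                new_out = set(in_sets[i])
            else:
                # Kill older definitions of the same variable, generate this one.
                kept = {d for d in in_sets[i] if stmts[d].target != s.target}
                new_out = kept | {i}
            if new_out != out_sets[i]:
                out_sets[i] = new_out
                changed = True
    return in_sets


def tainted_definitions(stmts, in_sets):
    """Mark definitions that may carry user input (sanitizers stop the flow)."""
    tainted = set()
    changed = True
    while changed:
        changed = False
        for i, s in enumerate(stmts):
            if s.target is None or s.sanitizer or i in tainted:
                continue
            flows = any(d in tainted for d in in_sets[i]
                        if stmts[d].target in s.uses)
            if s.source or flows:
                tainted.add(i)
                changed = True
    return tainted


def sink_features(stmts):
    """Build a small feature row for every sink statement."""
    in_sets = reaching_definitions(stmts)
    tainted = tainted_definitions(stmts, in_sets)
    rows = {}
    for i, s in enumerate(stmts):
        if s.sink:
            reaching = {d for d in in_sets[i] if stmts[d].target in s.uses}
            rows[i] = {"reaching_defs": len(reaching),
                       "tainted_defs": len(reaching & tainted)}
    return rows


# Vulnerable sample: $id is sanitized on only one branch before the query.
VULNERABLE = [
    Stmt(target="id", source=True, succ=(1,)),                    # $id = $_GET['id'];
    Stmt(uses=("id",), succ=(2, 3)),                              # if (...)
    Stmt(target="id", uses=("id",), sanitizer=True, succ=(3,)),   #   $id = intval($id);
    Stmt(target="q", uses=("id",), succ=(4,)),                    # $q = "SELECT ... " . $id;
    Stmt(uses=("q",), sink=True),                                 # query($q);
]

# Patched sample: the sanitizer is applied on every path.
PATCHED = [
    Stmt(target="id", source=True, succ=(1,)),
    Stmt(target="id", uses=("id",), sanitizer=True, succ=(2,)),
    Stmt(target="q", uses=("id",), succ=(3,)),
    Stmt(uses=("q",), sink=True),
]

if __name__ == "__main__":
    print(sink_features(VULNERABLE))
    print(sink_features(PATCHED))
```

In this sketch the sink in the vulnerable sample receives one tainted definition, while the sink in the patched sample receives none. Rows of this kind, labeled as vulnerable or patched, are the sort of input a probabilistic classifier could be trained on.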

Published in

ARES '18: Proceedings of the 13th International Conference on Availability, Reliability and Security
August 2018
603 pages
ISBN: 9781450364485
DOI: 10.1145/3230833

Copyright © 2018 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

ARES '18 Paper Acceptance Rate: 128 of 260 submissions, 49%
Overall Acceptance Rate: 228 of 451 submissions, 51%
