ABSTRACT
We present a novel static analysis method that combines data-flow analysis with machine learning to detect SQL injection (SQLi) and cross-site scripting (XSS) vulnerabilities in PHP applications. We assembled a dataset from the National Vulnerability Database and the SAMATE project, containing vulnerable PHP code samples and their patched versions, in which the vulnerability has been fixed. We extracted features from the code samples by applying data-flow analysis techniques, including reaching definitions analysis, taint analysis, and reaching constants analysis, and used these features to train several probabilistic classifiers. To demonstrate the effectiveness of our approach, we built a tool called WIRECAML and compared it with other tools for vulnerability detection in PHP code. Our tool performed best at detecting both SQLi and XSS vulnerabilities. We also applied our approach to a number of open-source software applications and found a previously unknown vulnerability in a photo-sharing web application.
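To make the feature-extraction idea concrete, the following is a minimal sketch of reaching definitions analysis on a toy control-flow graph, followed by a simple count-based feature vector for a sink statement. It is an illustration under stated assumptions, not the WIRECAML implementation: the CFG encoding, the statement kinds ("source", "sanitizer", "sink"), and the feature names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy CFG for a PHP-like snippet (not taken from the paper).
# node id -> (variable defined or None, statement kind, successor nodes)
CFG = {
    0: ("id", "source", [1]),     # $id = $_GET['id'];
    1: (None, "branch", [2, 3]),  # if (...)
    2: ("id", "sanitizer", [3]),  # $id = intval($id);
    3: (None, "sink", []),        # mysql_query("... $id");
}


def reaching_definitions(cfg):
    """Return IN sets: for each node, the (variable, defining node) pairs that reach it."""
    preds = defaultdict(list)
    for node, (_, _, succs) in cfg.items():
        for succ in succs:
            preds[succ].append(node)

    in_sets = {n: set() for n in cfg}
    out_sets = {n: set() for n in cfg}
    worklist = list(cfg)
    while worklist:
        n = worklist.pop(0)
        # IN[n] is the union of OUT[p] over all predecessors p.
        in_sets[n] = set().union(*(out_sets[p] for p in preds[n]))
        var = cfg[n][0]
        if var is None:
            new_out = set(in_sets[n])
        else:
            # OUT[n] = GEN[n] | (IN[n] - KILL[n]); a new definition kills older ones.
            new_out = {(v, d) for (v, d) in in_sets[n] if v != var} | {(var, n)}
        if new_out != out_sets[n]:
            out_sets[n] = new_out
            worklist.extend(cfg[n][2])  # successors must be revisited
    return in_sets


def features(cfg, in_sets, node):
    """Hypothetical features: count reaching definitions by statement kind."""
    kinds = [cfg[d][1] for (_, d) in in_sets[node]]
    return {
        "reaching_source_defs": kinds.count("source"),
        "reaching_sanitizer_defs": kinds.count("sanitizer"),
    }


if __name__ == "__main__":
    rd = reaching_definitions(CFG)
    print(sorted(rd[3]))
    print(features(CFG, rd, 3))
```

In this toy example both the unsanitized and the sanitized definition of `$id` reach the sink, because the sanitizer sits on only one branch. A feature vector of this kind could then be passed, together with a vulnerable/patched label, to any probabilistic classifier.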