Discovering software vulnerabilities using data-flow analysis and machine learning

Authors:
Jorrit Kronjee, Open University of the Netherlands, Heerlen, The Netherlands
Arjen Hommersom, Open University of the Netherlands, Heerlen, The Netherlands; Radboud University, Nijmegen, The Netherlands
Harald Vranken, Open University of the Netherlands, Heerlen, The Netherlands; Radboud University, Nijmegen, The Netherlands

ARES '18: Proceedings of the 13th International Conference on Availability, Reliability and Security, August 2018, Article No. 6, Pages 1–10
https://doi.org/10.1145/3230833.3230856

Published: 27 August 2018

ABSTRACT

We present a novel method for static analysis in which we combine data-flow analysis with machine learning to detect SQL injection (SQLi) and Cross-Site Scripting (XSS) vulnerabilities in PHP applications. We assembled a dataset from the National Vulnerability Database and the SAMATE project, containing vulnerable PHP code samples and their patched versions, in which the vulnerability has been fixed. We extracted features from the code samples by applying data-flow analysis techniques, including reaching definitions analysis, taint analysis, and reaching constants analysis. We used these features to train various probabilistic classifiers. To demonstrate the effectiveness of our approach, we built a tool called WIRECAML and compared it to other tools for vulnerability detection in PHP code. Among the tools compared, ours performed best at detecting both SQLi and XSS vulnerabilities. We also applied our approach to a number of open-source software applications and found a previously unknown vulnerability in a photo-sharing web application.
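
To make the idea of a taint-analysis feature concrete, the following is a minimal sketch in Python. It is not WIRECAML's implementation: the `Stmt` record, the `taint_features` function, and the four statement kinds are hypothetical names invented for this illustration. It runs a forward "may be tainted" data-flow analysis over a toy control-flow graph and reports, for each sink, whether tainted data can reach it. A boolean of this kind is one plausible input feature for a classifier.

```python
from collections import namedtuple

# Hypothetical toy statement (an assumption of this sketch, not the paper's model):
# `target` is the variable written, `sources` are the variables read, and `kind`
# is one of "source" (user input), "sanitizer", "assign", or "sink".
Stmt = namedtuple("Stmt", "target sources kind")


def taint_features(stmts, successors):
    """Forward may-taint analysis over a CFG given as {index: [successor indices]}.

    Returns {sink index: True if a variable read by the sink may be tainted}.
    """
    n = len(stmts)
    preds = {i: [] for i in range(n)}
    for i, succs in successors.items():
        for s in succs:
            preds[s].append(i)

    in_sets = [frozenset()] * n   # tainted variables before each statement
    out_sets = [frozenset()] * n  # tainted variables after each statement
    worklist = list(range(n))
    while worklist:
        i = worklist.pop(0)
        # Join: a variable may be tainted if it is tainted on any incoming path.
        in_set = frozenset().union(*(out_sets[p] for p in preds[i]))
        in_sets[i] = in_set
        st = stmts[i]
        tainted = set(in_set)
        if st.kind == "source":
            tainted.add(st.target)
        elif st.kind == "sanitizer":
            tainted.discard(st.target)
        elif st.kind == "assign":
            # The target inherits taint from any tainted operand.
            if any(v in in_set for v in st.sources):
                tainted.add(st.target)
            else:
                tainted.discard(st.target)
        new_out = frozenset(tainted)
        if new_out != out_sets[i]:
            out_sets[i] = new_out
            worklist.extend(successors.get(i, []))

    return {
        i: any(v in in_sets[i] for v in stmts[i].sources)
        for i in range(n)
        if stmts[i].kind == "sink"
    }


# Toy program, loosely mirroring a PHP snippet:
#   0: $id = $_GET['id'];   1: $id = intval($id);
#   2: $query = "... $id";  3: mysqli_query($db, $query);
program = [
    Stmt("id", (), "source"),
    Stmt("id", ("id",), "sanitizer"),
    Stmt("query", ("id",), "assign"),
    Stmt(None, ("query",), "sink"),
]

# Vulnerable variant: the sanitizer sits on only one branch (0 -> 2 skips it).
vulnerable = taint_features(program, {0: [1, 2], 1: [2], 2: [3], 3: []})
# Patched variant: every path passes through the sanitizer.
patched = taint_features(program, {0: [1], 1: [2], 2: [3], 3: []})
```

In the vulnerable variant the sink at index 3 is flagged because one path bypasses the sanitizer; in the patched variant it is not. The difference between a vulnerable sample and its patched version is the kind of signal a classifier trained on such features could pick up.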


Published in

ARES '18: Proceedings of the 13th International Conference on Availability, Reliability and Security
August 2018
603 pages
ISBN:9781450364485
DOI:10.1145/3230833

Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Software security
data-flow analysis
machine learning
static code analysis
vulnerability detection
Qualifiers
- research-article
- Research
- Refereed limited

Acceptance Rates
ARES '18 Paper Acceptance Rate: 128 of 260 submissions, 49%
Overall Acceptance Rate: 228 of 451 submissions, 51%

Article Metrics
- Total Citations: 25
- Total Downloads: 1,067
- Downloads (Last 12 months): 153
- Downloads (Last 6 weeks): 26