short-paper

Building a Commit-level Dataset of Real-world Vulnerabilities

Published: 15 April 2022

ABSTRACT

While CVEs have become a de facto standard for publishing vulnerability advisories, the state of current CVE databases is lackluster: the advisories alone are insufficient to bridge the gap with the vulnerability artifacts in the impacted program. As a result, the community lacks a public dataset of real-world vulnerabilities that provides this association. In this paper, we present a method that restores this missing link by analyzing vulnerabilities from the Android Open Source Project (AOSP), an aggregate of more than 1,800 projects. AOSP is an ideal target for building a representative dataset of vulnerabilities, as it covers the full spectrum that may be encountered in a modern system where a variety of low-level and higher-level components interact. More specifically, our main contribution is a dataset of more than 1,900 vulnerabilities that associates generic metadata (e.g., vulnerability type, impact level) with the respective patches at commit granularity (e.g., fix commit ID, affected files, source code language). Finally, we augment this dataset with precompiled binaries for a subset of the vulnerabilities. These binaries open up additional uses of the data, both for binary-only analysis and at the interface between source and binary. In addition to providing a common baseline benchmark, our dataset release supports the community in data-driven software security research.
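
To make the commit-level association concrete, the following is a minimal sketch of what one dataset entry could look like. The abstract does not specify the released schema, so the class name, field names, and all sample values below are hypothetical placeholders chosen for illustration only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class VulnerabilityRecord:
    """One entry joining advisory metadata to its fix commit.

    Hypothetical schema: the field names are assumptions, not the
    dataset's actual column names.
    """
    cve_id: str                      # advisory identifier
    vulnerability_type: str          # generic metadata
    impact_level: str                # generic metadata
    project: str                     # AOSP project containing the fix
    fix_commit_id: str               # commit-level link to the patch
    affected_files: tuple[str, ...]  # files touched by the fix commit
    language: str                    # source language of the patched code


def count_by_language(records):
    """Count how many records fall under each source language."""
    return Counter(r.language for r in records)


# Placeholder values, not real dataset entries.
records = [
    VulnerabilityRecord(
        cve_id="CVE-0000-0001",
        vulnerability_type="memory corruption",
        impact_level="critical",
        project="example/project-a",
        fix_commit_id="<fix-commit-id-a>",
        affected_files=("src/parser.cpp",),
        language="C++",
    ),
    VulnerabilityRecord(
        cve_id="CVE-0000-0002",
        vulnerability_type="elevation of privilege",
        impact_level="high",
        project="example/project-b",
        fix_commit_id="<fix-commit-id-b>",
        affected_files=("core/Service.java", "core/Helper.java"),
        language="Java",
    ),
]

language_counts = count_by_language(records)
```

Keeping the fix commit ID and affected files on the same record as the advisory metadata is what lets a consumer go from a CVE entry directly to the patched code, which is the missing link the abstract describes.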

Published in

CODASPY '22: Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy
April 2022
392 pages
ISBN: 9781450392204
DOI: 10.1145/3508398

Copyright © 2022 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 149 of 789 submissions, 19%
