short-paper

Building a Commit-level Dataset of Real-world Vulnerabilities

Authors:
Alexis Challande (Quarkslab, Inria, & Institut Polytechnique de Paris, Paris, France),
Robin David (Quarkslab, Paris, France),
Guénaël Renault (ANSSI, Inria, & Institut Polytechnique de Paris, Paris, France)

CODASPY '22: Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy, April 2022, Pages 101–106
https://doi.org/10.1145/3508398.3511495

Published: 15 April 2022

ABSTRACT

While CVEs have become a de facto standard for publishing vulnerability advisories, the state of current CVE databases is lackluster. In particular, CVE advisories alone are insufficient to bridge the gap with the vulnerability artifacts in the impacted program. As a result, the community lacks a public dataset of real-world vulnerabilities that provides such an association. In this paper, we present a method that restores this missing link by analyzing the vulnerabilities of the Android Open Source Project (AOSP), an aggregate of more than 1,800 projects. AOSP is the perfect target for building a representative dataset of vulnerabilities, as it covers the full spectrum that may be encountered in a modern system where a variety of low-level and higher-level components interact. More specifically, our main contribution is a dataset of more than 1,900 vulnerabilities, associating generic metadata (e.g., vulnerability type, impact level) with their respective patches at the commit granularity (e.g., fix commit-id, affected files, source code language). Finally, we augment this dataset with precompiled binaries for a subset of the vulnerabilities. These binaries open up various uses of the data, both for binary-only analysis and at the interface between source and binary. In addition to providing a common baseline benchmark, our dataset release supports the community in data-driven software security research.
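
To make the association described in the abstract concrete, the following is a minimal sketch of what one dataset entry could look like. It is not the authors' actual schema: the class names, field names, and the placeholder values below are all assumptions chosen for illustration, and only the kinds of information (vulnerability type, impact level, fix commit-id, affected files, source code language) come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class FixCommit:
    """One patch at commit granularity (hypothetical field names)."""
    commit_id: str
    affected_files: Tuple[str, ...]
    language: str


@dataclass(frozen=True)
class VulnerabilityRecord:
    """Generic metadata for one vulnerability plus its fix commits."""
    cve_id: str
    vuln_type: str
    impact_level: str
    fixes: Tuple[FixCommit, ...] = ()


def count_by_language(records: Iterable[VulnerabilityRecord]) -> Counter:
    """Count vulnerabilities per source language of their fix commits.

    A vulnerability is counted once per distinct language, even if
    several of its fix commits touch code in that language.
    """
    counts = Counter()
    for record in records:
        for language in {fix.language for fix in record.fixes}:
            counts[language] += 1
    return counts


# Placeholder entries: identifiers, paths, and labels are invented.
EXAMPLE_RECORDS = [
    VulnerabilityRecord(
        cve_id="CVE-XXXX-0001",
        vuln_type="memory corruption",
        impact_level="critical",
        fixes=(
            FixCommit("commit-a", ("media/parser.cpp",), "C++"),
            FixCommit("commit-b", ("media/parser.h",), "C++"),
        ),
    ),
    VulnerabilityRecord(
        cve_id="CVE-XXXX-0002",
        vuln_type="privilege escalation",
        impact_level="high",
        fixes=(
            FixCommit("commit-c", ("services/Manager.java",), "Java"),
            FixCommit("commit-d", ("native/bridge.c",), "C"),
        ),
    ),
    VulnerabilityRecord(
        cve_id="CVE-XXXX-0003",
        vuln_type="denial of service",
        impact_level="moderate",
        fixes=(FixCommit("commit-e", ("net/socket.cpp",), "C++"),),
    ),
]
```

Keeping the fix commits as a nested sequence lets a single vulnerability map to several patches, which is one plausible way to represent the commit-level link; a flat table keyed by commit-id would work equally well.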

Index Terms

Building a Commit-level Dataset of Real-world Vulnerabilities
1. Security and privacy
  1. Software and application security
    1. Software security engineering
  2. Systems security
    1. Vulnerability management

Published in
CODASPY '22: Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy
April 2022
392 pages
ISBN: 9781450392204
DOI: 10.1145/3508398
General Chair: Anupam Joshi (University of Maryland, Baltimore County, USA)
Program Chairs: Maribel Fernandez (King's College London, UK), Rakesh M. Verma (University of Houston, USA)
Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
binary matching
dataset
patch detection
security vulnerabilities
vulnerability research
Qualifiers
- short-paper
Acceptance Rates
Overall Acceptance Rate: 149 of 789 submissions, 19%