short-paper

A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries

Authors:
Jiahao Fan
SPACE Lab, Informatics, New Jersey Institute of Technology

Yi Li
SPACE Lab, Informatics, New Jersey Institute of Technology

Shaohua Wang
SPACE Lab, Informatics, New Jersey Institute of Technology

Tien N. Nguyen
CS Department, The University of Texas at Dallas
MSR '20: Proceedings of the 17th International Conference on Mining Software Repositories, June 2020, Pages 508–512
https://doi.org/10.1145/3379597.3387501

Published: 18 September 2020

ABSTRACT

We present Big-Vul, a large C/C++ code vulnerability dataset collected from open-source GitHub projects. We crawled the public Common Vulnerabilities and Exposures (CVE) database and the source code repositories associated with CVE entries. Specifically, we collected descriptive information about each vulnerability from the CVE database, e.g., CVE IDs, CVE severity scores, and CVE summaries. Using the CVE information and the GitHub repository links published with it, we downloaded all of the code repositories and extracted the vulnerability-related code changes. In total, Big-Vul contains 3,754 code vulnerabilities spanning 91 different vulnerability types, all extracted from 348 GitHub projects. All information is stored in CSV format, and the code changes are linked with the CVE descriptive information. Big-Vul can therefore be used for various research topics, e.g., detecting and fixing vulnerabilities and analyzing vulnerability-related code changes. Big-Vul is publicly available on GitHub.
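As a rough illustration of how a CSV that links code changes to CVE metadata might be consumed, the following minimal sketch parses a few made-up rows and groups them by vulnerability type and severity. The column names (`cve_id`, `cwe_id`, `score`, `summary`, `project`, `func_before`, `func_after`), the placeholder CVE IDs, and the helper functions are assumptions for illustration only; they are not taken from the published Big-Vul schema.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. Column names and values are illustrative
# assumptions, not the actual Big-Vul schema or data.
SAMPLE_CSV = """cve_id,cwe_id,score,summary,project,func_before,func_after
CVE-0000-0001,CWE-119,7.5,Example buffer overflow,projA,"int f(char *s){char b[8];strcpy(b,s);return 0;}","int f(char *s){char b[8];strncpy(b,s,7);b[7]=0;return 0;}"
CVE-0000-0002,CWE-20,5.0,Example input validation flaw,projB,"int g(int n){return a[n];}","int g(int n){if(n<0||n>=N)return -1;return a[n];}"
CVE-0000-0003,CWE-119,9.3,Another buffer overflow,projA,"void h(){}","void h(){ }"
"""


def load_records(text):
    """Parse CSV text into a list of dicts, one per vulnerability record."""
    return list(csv.DictReader(io.StringIO(text)))


def count_by_type(records):
    """Count records per vulnerability type (the assumed `cwe_id` column)."""
    return Counter(r["cwe_id"] for r in records)


def high_severity(records, threshold=7.0):
    """Return records whose assumed `score` column is at or above threshold."""
    return [r for r in records if float(r["score"]) >= threshold]


if __name__ == "__main__":
    records = load_records(SAMPLE_CSV)
    print(count_by_type(records))
    for r in high_severity(records):
        # Each row pairs the CVE metadata with the code before and after the fix.
        print(r["cve_id"], r["project"], r["summary"])
```

Because each row carries both the CVE description and the before/after code, the same loading step could feed either a detection task (using the assumed `func_before` column) or a repair task (using both code columns).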

References

CVE Details. 2020. CVE Details Website. https://www.cvedetails.com/.
Antonios Gkortzis, Dimitris Mitropoulos, and Diomidis Spinellis. 2018. VulinOSS: a dataset of security vulnerabilities in open-source systems. In Proceedings of the 15th International Conference on Mining Software Repositories. 18–21.
Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. 2018. Vuldeepecker: A deep learning-based system for vulnerability detection. arXiv preprint arXiv:1801.01681 (2018).
Serena Elisa Ponta, Henrik Plate, and Antonino Sabetta. 2018. Beyond metadata: Code-centric and usage-based analysis of known vulnerabilities in open-source software. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 449–460.
Serena Elisa Ponta, Henrik Plate, Antonino Sabetta, Michele Bezzi, and Cédric Dangremont. 2019. A manually-curated dataset of fixes to vulnerabilities of open-source software. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE, 383–387.
This Project. [n.d.]. Our C/C++ dataset. https://github.com/ZeoVan/MSR_20_Code_Vulnerability_CSV_Dataset.
Rebecca Russell, Louis Kim, Lei Hamilton, Tomo Lazovich, Jacob Harer, Onur Ozdemir, Paul Ellingwood, and Marc McConley. 2018. Automated vulnerability detection in source code using deep representation learning. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 757–762.
Antonino Sabetta and Michele Bezzi. 2018. A practical approach to the automatic classification of security-relevant commits. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 579–582.
Zack Whittaker. 2020. Microsoft and NSA say a security bug affects millions of Windows 10 computers. https://techcrunch.com/2020/01/14/microsoft-critical-certificates-bug/.
Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and discovering vulnerabilities with code property graphs. In 2014 IEEE Symposium on Security and Privacy. IEEE, 590–604.
Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems. 10197–10207.
Yaqin Zhou and Asankhaya Sharma. 2017. Automated identification of security issues from commit messages and bug reports. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 914–919.

Index Terms

Security and privacy

Published in

MSR '20: Proceedings of the 17th International Conference on Mining Software Repositories
June 2020
675 pages
ISBN:9781450375177
DOI:10.1145/3379597

Copyright © 2020 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
C/C++ Code
Code Changes
Common Vulnerabilities and Exposures
Qualifiers
- short-paper
- Research
- Refereed limited