1 Introduction
-
Almost all datasets are at the function level and do not provide context information (e.g., traces) explaining how a bug may happen. Moreover, they usually do not specify bug types and locations. In many cases, the function-level example does not even include the bug's root cause.
-
There are also labeling efforts based on commit messages or code diffs. Predicting code labels from commit messages is known to produce low-quality labels (Russell et al. 2018). Code-diff-based methods (Zhou et al. 2019) assume all functions in a bug-fixing commit are buggy, which may not be the case in reality. More importantly, these approaches have difficulty identifying bug types, locations, and traces.
We ran the D2A pipeline on OpenSSL, FFmpeg, libav, httpd, NGINX, and libtiff, all well-known open source projects. Out of 349,373,753 issues reported by the static analyzer, after deduplication, we labeled 18,653 unique issues as positives and 1,276,970 unique issues as negatives. Given that there is no ground truth, to validate the efficacy of the auto-labeler we randomly selected and manually reviewed 57 examples. The result shows that D2A improves the label accuracy from the 7.8% observed in the manual case study without our technique (Section 2.2.1) to the 53% obtained in the manual label validation of randomly selected samples (Table 4).
-
We propose a novel approach to label static analysis issues based on differential analysis and commit history heuristics.
-
Given that it can take several hours to analyze a single version pair (e.g., 12 hrs for FFmpeg), we parallelized the pipeline so that we can process thousands of version pairs simultaneously in a cluster, which makes D2A a practical approach.
-
We ran large-scale analyses on thousands of version pairs of real-world C/C++ programs, and created a labeled dataset of millions of samples with the hope that the dataset can be helpful to AI methods for vulnerability detection tasks.
-
Unlike existing function-level datasets, we derive samples from inter-procedural analysis and preserve more details such as bug types, locations, traces, and analyzer outputs.
-
We demonstrated a use case of the D2A dataset. We trained both classic machine learning models and a deep learning model for the static program analysis false positive reduction task, which can effectively help developers prioritize issues that are more likely to be real bugs.
-
We created a leaderboard based on the D2A dataset and made it public. It has already attracted community attention and participation. Using the leaderboard, researchers can compare their model performance on D2A with other models. The leaderboard can be found at https://ibm.github.io/D2A.
-
To facilitate future research, we make the D2A dataset and its generation pipeline publicly available at https://github.com/ibm/D2A.
2 Motivation
2.1 Existing Datasets for AI on Vulnerability Detection Task
Dataset | Type | Level | WDR | Bug Type | Bug Line | Bug Trace | CT | CE | G.A. | Labelling method
---|---|---|---|---|---|---|---|---|---|---
Juliet | synthetic | function | ✓ | ✓ | ✓ | ✘ | – | ✓ | – | predefined pattern
S-Babi | synthetic | function | ✓ | ✓ | ✓ | ✘ | – | ✓ | ✓ | predefined pattern
Choi et al. | synthetic | function | ✓ | ✓ | ✓ | ✘ | – | ✓ | ✓ | predefined pattern
Draper | mixed | function | ✓ | ✓ | ✘ | ✘ | ✘ | ✘ | ✘ | static analysis
Devign | real-world | function | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | manual + code diff
CDG | real-world | slice | ✓ | ✘ | ✘ | ✘ | ✘ | ✘ | ✓ | NVD + code diff
D2A | real-world | trace | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | differential analysis
2.2 Manual Review and False Positive Reduction
NULL_DEREFERENCE (CWE-476), UNINITIALIZED_VALUE (CWE-457), and RESOURCE_LEAK (CWE-400).
2.2.1 Manual Case Study
We use OpenSSL as a benchmark because of its importance in the open-source security ecosystem and its long commit history. We use OpenSSL version 7f0a8dc, which has 1,499 *.c/*.h files and 513.6k lines of C code in total. We ran Infer with its default settings; the results are summarized in Table 2. Infer reported 492 issues of 4 bug types: 326 DEAD_STORE, 101 UNINITIALIZED_VALUE, 64 NULL_DEREFERENCE, and 1 RESOURCE_LEAK. Among them, DEAD_STORE refers to issues where the value written to a variable is never used. Such issues are not security vulnerabilities; in fact, the stores are often intentional (e.g., overwriting passwords to avoid leaking sensitive data), so removing them might itself create a security issue. We therefore excluded them from the manual review. The remaining 166 issues may lead to security-related problems and thus were included in the study.
Error type | Reported | FP | TP | FP:TP
---|---|---|---|---
UNINITIALIZED_VALUE | 101 | 101 | 0 | –
NULL_DEREFERENCE | 64 | 51 | 13 | 4:1
RESOURCE_LEAK | 1 | 1 | 0 | –
TOTAL | 166 | 153 | 13 | 12:1
Figure 1 shows an example report of a NULL_DEREFERENCE issue. It has two sections. The bug location, bug type, and a brief justification of why Infer thinks the bug can happen are listed in lines 1–3. The bug explanation can be in different formats for different bug types. Lines 6–27 list the bug trace, which consists of the last steps of the offending execution; Figure 1 shows 3 of the 5 steps. For each step (e.g., lines 6–11), the report provides the location along with 4 additional lines of code before and after the highlighted line.
2.2.2 Feature Exploration for False Positive Reduction
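When exploring features for false positive reduction, each Infer issue can first be parsed from the analyzer's JSON report into a structured record. The sketch below assumes a simplified report shape; field names such as `bug_type`, `qualifier`, and `bug_trace` mirror Infer's `report.json`, but the exact schema may vary by Infer version.

```python
import json

def parse_report(report_json: str):
    """Parse a (simplified) Infer report.json into structured issue records.

    Each record keeps the fields the manual review relies on: bug type,
    location, a short justification, and the bug-trace steps.
    """
    issues = []
    for entry in json.loads(report_json):
        issues.append({
            "bug_type": entry["bug_type"],
            "file": entry["file"],
            "line": entry["line"],
            "qualifier": entry["qualifier"],  # why Infer thinks the bug can happen
            "trace": [(s["filename"], s["line_number"], s["description"])
                      for s in entry.get("bug_trace", [])],
        })
    return issues

# A minimal, hypothetical report entry in an Infer-like format.
sample = json.dumps([{
    "bug_type": "NULL_DEREFERENCE",
    "file": "crypto/x509/x509_vfy.c",
    "line": 123,
    "qualifier": "pointer `ctx` last assigned on line 120 could be null",
    "bug_trace": [
        {"filename": "crypto/x509/x509_vfy.c", "line_number": 120,
         "description": "assigned"},
        {"filename": "crypto/x509/x509_vfy.c", "line_number": 123,
         "description": "invalid access occurs here"},
    ],
}])

issues = parse_report(sample)
print(issues[0]["bug_type"], len(issues[0]["trace"]))  # NULL_DEREFERENCE 2
```

Once issues are in this form, the per-report features discussed later (trace length, step kinds, line positions) fall out of simple passes over the `trace` list.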
3 D2A Dataset Generation
3.1 Overview
3.2 Commit Message Analysis
3.3 Auto-labeler
infer-reportdiff tool (Facebook 2023b) to compute them.

Running the analysis is expensive: it can take hours to analyze a single version pair of a large project such as OpenSSL or FFmpeg in single-thread mode. As we need to analyze thousands of version pairs, it is impractical to do so on a PC or a small workstation. Therefore, we parallelized the analysis to process more than a thousand version pairs simultaneously in a cluster. The performance improvement depends on the availability of computation resources such as CPUs and RAM.
-
Fixed-then-unfixed issues: Because of some randomness in the way Infer selects code to analyze, it may accidentally omit a bug from the after-commit version, falsely suggesting that it was fixed by that commit. If a fixed issue reappears in a later commit pair, we assume that it is a false positive caused by an error in the static analyzer. We change the label of such cases and mark them as negative. (Note that this sequence could happen if the suspect code was removed and then later re-introduced.)
-
Untouched issues: For each fixed issue, we check which parts of the code are patched by the commit. If the commit's code diff does not overlap with any step of the bug trace, it is unlikely that the issue was fixed by the commit; more likely, it is a static analyzer error. We mark such cases negative as well.
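The two heuristics above can be sketched as simple post-processing over the differential analysis output. The `issue_id` keys and the trace/diff representations below are illustrative assumptions, not the pipeline's actual data model.

```python
def overlaps(trace_lines, diff_lines):
    """True if any bug-trace step touches a line patched by the commit."""
    return any(step in diff_lines for step in trace_lines)

def relabel(candidates, later_reports, commit_diff):
    """Apply the two heuristics to issues the differential analysis marked positive.

    candidates:    {issue_id: trace_lines} for issues that disappeared after the fix
    later_reports: set of issue_ids seen again in *later* version pairs
    commit_diff:   set of (file, line) pairs changed by the fix commit
    """
    labels = {}
    for issue_id, trace_lines in candidates.items():
        if issue_id in later_reports:                 # fixed-then-unfixed -> negative
            labels[issue_id] = 0
        elif not overlaps(trace_lines, commit_diff):  # untouched by the fix -> negative
            labels[issue_id] = 0
        else:
            labels[issue_id] = 1                      # likely a real, fixed bug
    return labels

labels = relabel(
    candidates={
        "a": {("f.c", 10), ("f.c", 12)},  # trace overlaps the patch -> positive
        "b": {("g.c", 99)},               # untouched by the patch   -> negative
        "c": {("f.c", 12)},               # reappears later          -> negative
    },
    later_reports={"c"},
    commit_diff={("f.c", 10), ("f.c", 11), ("f.c", 12)},
)
print(labels)  # {'a': 1, 'b': 0, 'c': 0}
```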
3.4 Infer’s Bug Trace
3.5 An Example in the D2A Dataset
3.6 Dataset Generation Results
3.6.1 Dataset Statistics
We ran the pipeline on six open source projects (OpenSSL, FFmpeg, httpd, NGINX, libtiff, and libav) and generated the initial version of the D2A dataset. In particular, Infer can detect more than 150 types of issues in C/C++/Objective-C/Java programs (Infer). However, some issue detectors are not production-ready and are thus disabled by default. In the pipeline, we additionally enabled the detection of all issue types related to buffer overflows, integer overflows, and memory/resource leaks, even though some of them may not be production-ready.
Project | Version pairs (CMA) | Version pairs (Infer) | Issues reported | Auto-labeler (All) | Auto-labeler (Negatives) | Auto-labeler (Positives) | After-fix (Negatives)
---|---|---|---|---|---|---|---
OpenSSL | 3,011 | 2,643 | 42,151,595 | 351,170 | 343,148 | 8,022 | 8,022 |
FFmpeg | 5,932 | 4,930 | 215,662,372 | 659,717 | 654,891 | 4,826 | 4,826 |
httpd | 1,168 | 542 | 1,681,692 | 12,692 | 12,475 | 217 | 217 |
NGINX | 785 | 635 | 3,283,202 | 18,366 | 17,945 | 421 | 421 |
libtiff | 144 | 144 | 525,360 | 12,649 | 12,096 | 553 | 553 |
libav | 3,407 | 2,952 | 86,069,532 | 241,029 | 236,415 | 4,614 | 4,614 |
Total | 14,447 | 11,846 | 349,373,753 | 1,295,623 | 1,276,970 | 18,653 | 18,653 |
3.6.2 Manual Label Validation
Bug type | Positives # | Positives A | Positives D | Negatives # | Negatives A | Negatives D | All # | All A | All D
---|---|---|---|---|---|---|---|---|---
BUFFER_OVERRUN_L1 | 2 | 0 | 2 | 1 | 1 | 0 | 3 | 1 | 2 |
BUFFER_OVERRUN_L2 | 3 | 1 | 2 | 1 | 1 | 0 | 4 | 2 | 2 |
BUFFER_OVERRUN_L3 | 6 | 1 | 5 | 4 | 4 | 0 | 10 | 5 | 5 |
BUFFER_OVERRUN_S2 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
INTEGER_OVERFLOW_L1 | 3 | 2 | 1 | 1 | 1 | 0 | 4 | 3 | 1 |
INTEGER_OVERFLOW_L2 | 13 | 6 | 7 | 3 | 3 | 0 | 16 | 9 | 7 |
INTEGER_OVERFLOW_R2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
MEMORY_LEAK | 1 | 1 | 0 | 1 | 1 | 0 | 2 | 2 | 0 |
NULL_DEREFERENCE | 2 | 1 | 1 | 1 | 0 | 1 | 3 | 1 | 2 |
RESOURCE_LEAK | 1 | 1 | 0 | 1 | 1 | 0 | 2 | 2 | 0 |
UNINITIALIZED_VALUE | 9 | 3 | 6 | 1 | 1 | 0 | 10 | 4 | 6 |
USE_AFTER_FREE | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
ALL | 41 | 17 | 24 | 16 | 13 | 3 | 57 | 30 | 27 |
ALL | 100% | 41% | 59% | 100% | 81% | 19% | 100% | 53% | 47% |
Take the OpenSSL study in Section 2.2 as an example: without the auto-labeler, the accuracy was only 7.8%, as observed in the manual case study of 166 security-related examples (Section 2.2.1).
4 Using D2A for FP Prediction
4.1 Problem Statement
libtiff case study in Table 9.
4.2 Problem Dataset
4.3 FP Prediction with Classical Machine Learning
4.3.1 Feature Engineering
Feature | Description | Feature | Description |
---|---|---|---|
error | Infer bug/issue type | error_line | line number of the error |
error_line_len | length of error line | error_line_depth | indent for the error line text |
average_error_line_depth | average indent of code lines | max_error_line_depth | max indent of code lines |
error_pos_fun | position of error within function | average_code_line_length | average length of lines in flow |
max_code_line_length | max length of lines in flow | length | the number of lines of code |
code_line_count | the number of flow lines | alias_count | the number of address assignment lines |
arithmetic_count | average operators / step | assignment_count | fraction of Assignment steps |
call_count | fraction of call steps | cfile_count | the number of different .c files |
for_count | the number of for loops in report | infinity_count | fraction of +oo (infinity) steps |
keywords_count | the number of C keywords | package_count | the number of different directories |
question_count | fraction of ‘??’ steps | return_count | average branches / step |
size_calculating_count | average size calculations / step | parameter_count | fraction of parameter steps |
offset_added | the number of “offset added”s in report | max_if_AND_count | Max logical ANDs in an if statement |
max_if_OR_count | Max logical ORs in an if statement | avg_if_AND_count | Avg logical ANDs in an if statement |
avg_if_OR_count | Avg logical ORs in an if statement | error_char | char number of the error in trace |
variable_changed | Number of variables that changed value |
Feature | Description | Feature | Description |
---|---|---|---|
CC_bug_code | Cyclomatic complexity of buggy file | params_bug_code | Number of parameters in buggy file |
loop_num_bug_code | Number of loops in buggy file | IFwoELSE_bug_code | Number of if without else statements in buggy file |
CC_functions | Cyclomatic complexity of trace functions | params_functions | Number of parameters in trace functions |
loop_num_functions | Number of loops in trace functions | IFwoELSE_functions | Number of if without else statements in trace functions |
CC_bug_function | Cyclomatic complexity of buggy function | params_bug_function | Number of parameters in buggy function |
loop_num_bug_function | Number of loops in buggy function | IFwoELSE_bug_function | Number of if without else statements in buggy function |
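Several of the code features above can be approximated directly from function source text. The sketch below uses common textbook approximations (e.g., cyclomatic complexity as 1 + the number of decision points); these are illustrative formulas, not necessarily the exact definitions used in the paper's feature extractor.

```python
import re

DECISION_KEYWORDS = ("if", "for", "while", "case", "&&", "||", "?")

def cyclomatic_complexity(src: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    count = 0
    for kw in DECISION_KEYWORDS:
        if kw.isalpha():
            count += len(re.findall(r"\b%s\b" % kw, src))  # whole-word keywords
        else:
            count += src.count(kw)                         # operators
    return 1 + count

def loop_count(src: str) -> int:
    """Number of for/while loops (the loop_num_* features)."""
    return len(re.findall(r"\b(for|while)\b", src))

def if_without_else(src: str) -> int:
    """Rough count of if statements without a matching else (IFwoELSE_*)."""
    return max(0, len(re.findall(r"\bif\b", src)) - len(re.findall(r"\belse\b", src)))

buggy_function = """
int f(int *p, int n) {
    for (int i = 0; i < n; i++) {
        if (p[i] < 0 && n > 1) return -1;
    }
    return 0;
}
"""
print(cyclomatic_complexity(buggy_function),
      loop_count(buggy_function),
      if_without_else(buggy_function))  # 4 1 1
```

The same functions can be run over the buggy file, the buggy function, or the concatenated trace functions to produce the `*_bug_code`, `*_bug_function`, and `*_functions` variants.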
4.3.2 Model Selection
4.3.3 Evaluation Metrics
4.3.4 Voting
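A minimal soft-voting scheme over the classical models can be sketched as follows; the base-model names and probability values are placeholders, and the actual ensemble's members and weights may differ.

```python
def soft_vote(probabilities, weights=None):
    """Average the per-model P(true positive) scores, optionally weighted.

    probabilities: list of P(label=1) scores, one per base model.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total

# Three hypothetical base models scoring one Infer issue.
scores = [0.9, 0.6, 0.3]       # e.g., gradient boosting, random forest, SVM
avg = soft_vote(scores)        # unweighted average: 0.6
pred = 1 if avg >= 0.5 else 0
print(round(avg, 2), pred)     # 0.6 1
```

Soft voting keeps a continuous confidence score per issue, which is what the ranking-based evaluation (AUROC, prioritized lists) needs.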
4.3.5 Stacking Ensemble
4.4 FP Prediction with Deep Learning
OpenSSL and libav).
-
Trace only: the bug trace produced by Infer (e.g., Fig. 1). It is a mixture of source code and natural language.
-
Single bug function: the body of the particular function where the bug occurs per the Infer report. It is source code only.
-
Functions: the concatenation of all the functions that appear in the Infer trace. It is source code only.
-
Trace + Bug Function: besides the Infer trace, we additionally include the function body where the bug occurs per the Infer report. It is a mixture of source code and natural language.
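These four representations can be assembled from a D2A sample roughly as below; the field names (`trace`, `bug_function`, `trace_functions`) are illustrative stand-ins for the dataset's actual schema.

```python
def build_inputs(sample: dict) -> dict:
    """Build the four model-input variants from one D2A sample (assumed schema)."""
    return {
        "trace": sample["trace"],                           # NL + code
        "bug_function": sample["bug_function"],             # code only
        "functions": "\n".join(sample["trace_functions"]),  # code only
        "trace_bug_function": sample["trace"] + "\n" + sample["bug_function"],
    }

# A toy sample with hypothetical content.
sample = {
    "trace": "pointer `x` last assigned on line 120 could be null ...",
    "bug_function": "int check(X509 *x) { ... }",
    "trace_functions": ["int check(X509 *x) { ... }", "X509 *load(void) { ... }"],
}
inputs = build_inputs(sample)
print(sorted(inputs))  # ['bug_function', 'functions', 'trace', 'trace_bug_function']
```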
4.5 False Positive Prediction Results
4.5.1 Dataset
FFmpeg and libav examples are quite similar, as libav was forked from FFmpeg (Wiki 2023). We dropped the FFmpeg examples so that the all-data combined experiment would be fair. FFmpeg examples are also more imbalanced than libav, and we leave them for future work.

The task is harder on the larger OpenSSL and libav datasets, as can be seen in Table 9. With C-BERT and ensemble models improved with new code features, the results improved on the full dataset with all error types. Table 7 shows the statistics of the full train:dev:test data used in the experiments.
Project | All errors: Negatives | All errors: Positives | All errors: N:P | Prod-ready sec. errors: Negatives | Prod-ready sec. errors: Positives | Prod-ready sec. errors: N:P
---|---|---|---|---|---|---
OpenSSL | 341,625 | 7,916 | 43:1 | 27,227 | 797 | 34:1 |
libav | 235,369 | 4,585 | 51:1 | 14,954 | 280 | 53:1 |
NGINX | 17,829 | 417 | 43:1 | 1,446 | 36 | 40:1 |
libtiff | 11,720 | 552 | 21:1 | 1,185 | 27 | 44:1 |
httpd | 11,511 | 208 | 55:1 | 174 | 11 | 16:1 |
4.5.2 Results
Project | Model | D2A Positives | Predicted Positives | Correct Positives | D2A Negatives | AUC
---|---|---|---|---|---|---
OpenSSL | vote | 81 | 506 | 58 | 2711 | 0.83
 | c-bert | 81 | 170 | 36 | 2711 | 0.80
 | vote-new | 81 | 168 | 43 | 2711 | 0.86
libav | vote | 28 | 254 | 21 | 1495 | 0.89
 | c-bert | 28 | 53 | 20 | 1495 | 0.87
 | vote-new | 28 | 35 | 21 | 1495 | 0.91
NGINX | vote | 5 | 54 | 4 | 145 | 0.78
 | c-bert | 5 | 9 | 3 | 145 | 0.82
 | vote-new | 5 | 6 | 2 | 145 | 0.89
libtiff | vote | 3 | 7 | 2 | 118 | 0.97
 | c-bert | 3 | 7 | 2 | 118 | 0.96
 | vote-new | 3 | 4 | 2 | 118 | 0.98
httpd | vote | 2 | 6 | 1 | 17 | 0.85
 | c-bert | 2 | 2 | 2 | 17 | 1.00
 | vote-new | 2 | 2 | 1 | 17 | 1.00
combined | vote | 119 | 814 | 82 | 4486 | 0.84
 | c-bert | 119 | 224 | 63 | 4486 | 0.83
 | vote-new | 119 | 291 | 70 | 4486 | 0.87
Project | Model | D2A Positives | Predicted Positives | Correct Positives | D2A Negatives | AUC
---|---|---|---|---|---|---
OpenSSL | vote | 793 | 34251 | 792 | 34149 | 0.69
 | c-bert | 793 | 2034 | 333 | 34149 | 0.75
 | vote-new | 793 | 1854 | 146 | 34149 | 0.73
libav | vote | 458 | 8 | 8 | 23536 | 0.61
 | c-bert | 458 | 1297 | 171 | 23536 | 0.68
 | vote-new | 458 | 1294 | 143 | 23536 | 0.73
NGINX | vote | 42 | 315 | 29 | 1783 | 0.77
 | c-bert | 42 | 106 | 26 | 1783 | 0.89
 | vote-new | 42 | 90 | 33 | 1783 | 0.93
libtiff | vote | 58 | 198 | 44 | 1171 | 0.89
 | c-bert | 58 | 98 | 41 | 1171 | 0.94
 | vote-new | 58 | 111 | 54 | 1171 | 0.98
httpd | vote | 20 | 263 | 13 | 1150 | 0.77
 | c-bert | 20 | 43 | 10 | 1150 | 0.82
 | vote-new | 20 | 64 | 12 | 1150 | 0.90
4.5.3 Feature Importance
4.5.4 Libtiff Results Analysis
All models achieve very good AUC on libtiff. To analyze this result further, in Fig. 9 we plot the cost of finding each True Positive, in terms of False Positives, for libtiff with the Vote-soft ensemble on all error types. The X-axis shows True Positives in decreasing order of model confidence. The Y-axis plots the count of new False Positives since the last True Positive. The purple line represents the cumulative number of False Positives; the dotted line parallel to the x-axis marks the 95% FP reduction (i.e., 5% FP rate) point, above which 95% of the False Positives lie. As mentioned before, this arbitrary point is our guess of how many false positives users would be willing to tolerate. The plot indicates that the model is quite confident in its True Positive predictions: the first 27 highly ranked samples are all TPs. This analysis justifies presenting a prioritized list of static analyzer output to developers, so that they can focus first on the samples the model confidently considers TPs.
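The per-TP false-positive cost plotted in Fig. 9 can be computed from a confidence-ranked prediction list; this is a generic sketch of that bookkeeping, not the paper's plotting code.

```python
def fp_cost_per_tp(ranked_labels):
    """For predictions sorted by decreasing model confidence, return the number
    of new false positives encountered before each true positive."""
    costs, fp_since_last_tp = [], 0
    for label in ranked_labels:
        if label == 1:               # true positive reached
            costs.append(fp_since_last_tp)
            fp_since_last_tp = 0
        else:                        # false positive encountered
            fp_since_last_tp += 1
    return costs

# 1 = TP, 0 = FP, ordered by decreasing confidence (toy data).
ranked = [1, 1, 0, 0, 1, 0, 1]
print(fp_cost_per_tp(ranked))  # [0, 0, 2, 1]
```

A run of zeros at the front of `costs` corresponds to the "first 27 highly ranked samples are all TPs" observation.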
5 D2A Leaderboard
5.1 Data
-
Infer Bug Reports (Trace): This dataset consists of Infer bug reports, which combine English text and C source code.
-
Bug function source code (Function)
-
Bug function source code, trace functions source code and bug function file URL (Code)
Task | Metrics | Total samples | Train / dev / test | N:P |
---|---|---|---|---|
Code + Trace | AUROC, F1-5%FPR | 45,957 | 36,719 / 4,634 / 4,604 | 39:1 |
Trace | AUROC, F1-5%FPR | 45,957 | 36,719 / 4,634 / 4,604 | 39:1 |
Code | AUROC, F1-5%FPR | 45,957 | 36,719 / 4,634 / 4,604 | 39:1 |
Function | Accuracy | 5,857 | 4,643 / 596 / 618 | 0.9:1 |
5.2 Tasks
-
Trace: the bug trace (bug report) contains both natural language and code; the code is limited to snippets from different functions and files. Models are expected to work with this combination of natural language and code snippets to make the prediction.
-
Code: Models can use the source code of the bug function, all the bug trace functions, and the file in which the bug function occurs to make the prediction. The file pointed to by bug_url must be downloaded before it can be used.
-
Trace + Code: Models can use all the fields from the previous two tasks to make the prediction.
-
Function: Models can use only the source code from the bug function to make the prediction. The functions have been derived from a different subset of the full D2A dataset chosen to achieve a more balanced dataset.
5.3 Metrics
-
Balanced Data: For the balanced dataset we use Accuracy to measure model performance.
-
Unbalanced Data: Because the dataset is so heavily unbalanced, we cannot use Accuracy: a model predicting only the negative class would achieve about 98% accuracy. Instead, we use the two metrics described below.
-
AUROC: Many open source project datasets are huge, with hundreds of thousands of examples and thousands of positive examples. The cost of verifying every label is high, which is why it is important to rank models by their overall confidence. We use the AUROC percentage for this purpose.
-
F1 - 5% FPR: The macro-average F1 score is generally considered a good metric for unbalanced datasets. We want the ROC curve to rise as early as possible, so we calculate the macro-average F1-score percentage at the 5% FPR point indicated in Fig. 5.
-
Overall: To get the overall model performance, we calculate the simple average percentage of all the scores across all the tasks.
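The F1-5%FPR metric can be sketched as follows: pick the score threshold that keeps the false-positive rate at or under 5%, then compute the macro-average F1 there. This is a simplified sketch; the leaderboard's exact thresholding and tie-breaking may differ.

```python
def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores (macro-average)."""
    f1s = []
    for cls in (0, 1):
        tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
        fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
        fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

def f1_at_fpr(scores, labels, max_fpr=0.05):
    """Macro-F1 at the loosest threshold whose false-positive rate <= max_fpr."""
    neg_scores = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    allowed_fp = int(max_fpr * len(neg_scores))  # negatives we may flag positive
    threshold = neg_scores[allowed_fp]           # admit exactly allowed_fp negatives
    preds = [1 if s > threshold else 0 for s in scores]
    return macro_f1(labels, preds)

# Toy 20:2 imbalanced example with hypothetical scores.
scores = [i / 100 for i in range(1, 21)] + [0.90, 0.85]
labels = [0] * 20 + [1, 1]
print(round(f1_at_fpr(scores, labels), 3))  # 0.887
```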
5.4 Leaderboard Results
Model | Code+Trace F1 | Code+Trace AUC | Trace F1 | Trace AUC | Code F1 | Code AUC | Function Accuracy | Overall Average
---|---|---|---|---|---|---|---|---
stacking | 63.4 | 83.6 | 61.1 | 81.2 | 65.8 | 85.2 | 55.2 | 70.8 |
c-bert | 66.1 | 81.7 | 62.4 | 80.4 | 62.4 | 80.2 | 60.2 | 70.5 |
vote-new | 64.3 | 85.0 | 61.3 | 80.2 | 65.2 | 85.7 | 45.6 | 69.6 |
6 Related Work
-
Pre-training is a self-supervised process, with a goal to build a general language representation that is not connected to specific tasks. To learn the statistical properties of source code, parts of the input are masked and then the model is asked to predict them back (Feng et al. 2020; Kanade et al. 2020).
-
Fine-tuning is a supervised phase and requires task-related labels, so every downstream task, like vulnerability detection, needs a specific fine-tuning dataset.
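The masked-prediction objective in the pre-training phase can be illustrated with a toy example. This is a sketch of the idea only, not C-BERT's actual tokenizer or masking policy.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens; pre-training asks the model to predict them back."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok       # position -> original token to recover
        else:
            masked.append(tok)
    return masked, targets

# Whitespace tokenization of a C snippet, for illustration only.
tokens = "if ( p == NULL ) return - 1 ;".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

The self-supervised loss is then the model's error in predicting each entry of `targets` from the masked sequence, which requires no task-specific labels.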
7 Threats to Validity
8 Conclusion
Through the libtiff results analysis, we show how a prioritized list of static analyzer issues can be helpful for developers. Our models perform well on smaller projects such as libtiff, httpd, and NGINX. More importantly, we show that adding more features improves model performance on the harder data from relatively large projects like OpenSSL and libav with all error types. We show how C-BERT, a transformer-based deep learning model, can be trained on D2A data. Deep learning models generalize better than hand-crafted features and show good performance on D2A data.