Introduction
-
We propose a variable propagation path-based changed code generation approach for method-level cross-version defect detection, which accurately determines the scope of code changes during version iteration.
-
Graph representations combining changes in software code version information are proposed and applied in practice. We built the BugPre system and conducted extensive experiments on open-source Apache projects. Compared with the existing three state-of-the-art defect detection methods, our BugPre has increased by over 16% in terms of F1-score.
Related work
Cross-version defect prediction
Change impact analysis
Graph neural network
System design
System overview
Data processing
Graph construction
AST-based node relationship construction
next-exec
. For example, there are two next-exec
edges from the condition node named “BinaryExpr” to the “ExpressionStmt” node and “forStmt” node, respectively, in Fig. 2, and the former node corresponds to the assignment statement “r = 1.0
” in the source code, and the latter node represents the for
loop structure. (2) For the loop statement, the loop judgment node “BinaryExpr” and the loop statement “BlockStmt” need to form a loop by adding two next-exec
edges to simulate the execution of the loop statement. (3) Finally, for the sequential execution of statements, the nodes are connected through next-exec
edge according to the execution order of the statements.-
VarDefinition-Edge
: It represents the definition of a variable.VarDefinition-Edge
is used to connect the parent node with the variable definition node. -
VarUse-Edge
: It indicates whether the variable is used or not. When a node uses a currently defined variable node, we will add an edge of such type between these two node pairs. -
VarModify-Edge
: We use this edge type to describe the relationships between modified/reassigned variables and their callers or passers.
Feature extraction
MethodCallExper
node in the graph, which can reflect the structure complexity of code. In addition, the statistical information also contains the characteristic information of the code structure, such as statistics of control flow nodes and comparison statements.Statistical indicators | Description |
---|---|
inDegree | In-degree of control flow structure |
outDegree | The degree of control flow structure |
nuMethodCalled | Number of function calls |
isControlNode | Whether it is a control flow node |
is_Compare | Whether it is a comparison statement |
isDeclaration | Whether it is a function declaration node |
isAssignExpr | Whether it is an assignment node |
isBinaryExpr | Whether it is a binary operational symbol |
Statistical indicators | Description |
---|---|
NR | Number of revisions |
NDC | Number of distinct committers |
NML | Number of modified Lines |
NDPV | Number of Defects in previous version |
MTPV | Modify type in previous version |
IBPV | Whether it contains defects in previous version |
Changed code generation
Obtaining change set
git diff
to find the changed code over two versions of a project. However, the matching results may lead to high false positives of the change set due to changes in logs, formats, etc. Therefore, we propose a fine-grained approach of matching function pairs to identify the actual change set. If a file in the prior version has changed, we compare it with the file in the current version to form function pairs. Then, we filter the invalid code and match the function pair to locate the real changed function, generating an accurate change set. Note that new functions (including functions in new files) do not need any comparison; they are added directly to the change set. Next, we detail the method of obtaining a changed set, including AST pruning and locating modified functions based on graph matching.getMethodDeclaration
to extract AST of each function according to node type “MethodDeclaration” and then filters meaningless code segments through the function filter
.
S
of the two graphs. A higher value of S
indicates a higher degree of similarity.Project | Version | Java files | Changed functions | Unchanged functions |
---|---|---|---|---|
1.4.0 | 1122 | |||
Camel | 1.6.0 | 1252 | 357 | 4651 |
1.5.0 | 401 | |||
Ant | 1.6.0 | 523 | 680 | 1307 |
2.4.0 | 862 | |||
Xalan | 2.5.0 | 946 | 252 | 3532 |
Generating impact set
Notation | Structure | Description |
---|---|---|
VPT\(\_\)C | List<VPT\(\_\)F>, List<VPT\(\_\)M> | It represents a complete class structure, which usually contains global variable and method. |
VPT\(\_\)F | Assigned, List<Assigner> | It represents the global variable, composed of an assigned variable and multiple assigned variables. |
VPT\(\_\)M | List<VPT\(\_\)S> | It represents the method in class, which consists of statement nodes. |
VPT\(\_\)S | Assigned, List<Assigner> | It represents the statement node, composed of an assigned variable and multiple assigned variables. |
getNodes
to return all child nodes with a deep traversal method. Then, the AST nodes are parsed to the corresponding VPT node according to its type. The nodetype
is an attribute of the AST node. It is used to represent the type node in AST, mainly including declaration type and statement type. For instance, ClassOrInterfaceDeclaration
and MethodDeclaration
are the class declaration and function declaration, respectively, and Ifstmt
denotes if-statement. The node type nodetype
is obtained through the function getNodeType
.buildControlFlow
, buildDataFlow
, and buildFunCall
to build edge relationships for VPT nodes to generate final VPT. Function buildControlFlow
in Algorithm 3 constructs control flow by adding next-token
edge and next-exac
edge to represent the inclusion relationship and the sequential execution between nodes, respectively. Function buildDataFlow
shown in Algorithm 4 is used to describe the pass of value in VPT by adding VarDefinition
, VarUse
, and VarModify
edges. Algorithm 5 shows the process of building a calling relationship between functions. The function buildFunCall
conducts a connection of FunCall
edge between the calling function node and the called function node. Figure 4 shows a Java class and its corresponding VPT.
-
Using global variable. We consider member variables and public static variables in the class as global variables and variables participating in the modification of global variables as generalized global variables. When a global variable in a class or a generalized global variable in a function is modified, the scope of its influence will not be limited to the current function. We need to consider the use of propagation in the entire class where the variable is located.
-
Using return. “return” indicates a function call relationship. The called function obtains the processed result through calculation and returns the result to the calling location. If a function with a return value gets changed, BugPre analyzes the propagation path of the modified variable in this method and finds out associated code with it. Besides, BugPre also needs to determine the position returned by the “return” statement (i.e., the calling location) and then analyze the associated code of the calling function.
-
Using local variable. The local variable is only defined in the function, and its life cycle is limited to the function body. Therefore, it is only necessary to analyze the data flow relationship in this function.
Training the predictive model
Prediction
Experimental evaluation
Experiment environment
Dataset
Datasets | Version | No. of instances | % of defective instance |
---|---|---|---|
Camel | 1.0 | 339 | 3.8 |
1.2 | 608 | 35.5 | |
1.4 | 872 | 6.6 | |
1.6 | 965 | 19.5 | |
Ivy | 1.1 | 136 | 4.7 |
1.4 | 322 | 5.0 | |
2.0 | 478 | 8.2 | |
Log4j | 1.0 | 135 | 25.2 |
1.1 | 109 | 3.3 | |
1.2 | 205 | 3.8 | |
Xalan | 2.4 | 723 | 15.2 |
2.5 | 803 | 48.2 | |
2.6 | 885 | 46.4 | |
Lucene | 2.0 | 289 | 31.1 |
2.2 | 383 | 37.6 | |
2.4 | 537 | 37.8 | |
Ant | 1.4 | 266 | 15.0 |
1.5 | 402 | 8.0 | |
1.6 | 524 | 17.7 | |
1.7 | 1066 | 15.6 |
Efficiency of filtering invalid code
Abbreviation | Description |
---|---|
CM | Number of commits of two versions projects in total |
CF | Number of changed files in version change |
CL | Number of changed lines in version change |
FML | Number of lines of format adjustment in all changed lines |
COML | Number of lines of comment in all changed lines |
OTL | Number of lines of other code such as code irrelevant to logic |
FIL | Number of lines of changed code excluding invalid code |
Project | Prior | Current | CM | CF | CL | FML | COML | OTL | FIL |
---|---|---|---|---|---|---|---|---|---|
Hadoop | 3.2.2-RC4 | 3.2.2-RC5 | 4 | 562 | 13377 | 4 | 201 | 38 | 12576 |
Lucene | 8.7.0 | 8.8.0 | 216 | 6931 | 407546 | 55 | 4802 | 823 | 401866 |
Log4j | 2.1-rc2 | 2.1-rc3 | 14 | 36 | 994 | 0 | 64 | 7 | 923 |
Ant | 1.10.8 | 1.10.9 | 51 | 5240 | 348924 | 254 | 9555 | 565 | 338550 |
Project | Prior | Current | Actual Set | VPT-based | Influence CIA | VM-CIA |
---|---|---|---|---|---|---|
NanoXML | 2.1.1 | 2.2 | 58 | 107 | 206 | 121 |
2.2 | 2.2.1 | 20 | 110 | 222 | 117 | |
JUnit | 3.4 | 3.5 | 69 | 236 | 253 | 201 |
3.5 | 3.6 | 58 | 221 | 351 | 238 | |
HttpCore | 0.9 | 1.0 | 186 | 549 | 1217 | 828 |
1.0 | 1.1 | 133 | 239 | 1448 | 332 | |
Heritrix | 0.2.0 | 0.4.0 | 279 | 439 | 865 | 598 |
0.4.0 | 0.6.0 | 480 | 859 | 1276 | 876 |
Efficiency of VPT-based associated code analysis
Efficiency of defect detection based on GCNN
Comparison with prior works
Approach | Camel | Ivy | Log4j | Xalan | Ant | Avg |
---|---|---|---|---|---|---|
DefectPre | 0.6387 | 0.7205 | 0.6685 | 0.7103 | 0.6227 | 0.6721 |
DS3 [11] | 0.2790 | 0.1890 | 0.5435 | 0.4717 | 0.3710 | 0.3708 |
WGNCS [58] | 0.4509 | 0.3603 | 0.6250 | 0.6355 | 0.4215 | 0.4986 |
HALKP [59] | 0.412 | 0.2713 | 0.6567 | 0.5313 | 0.37 | 0.4483 |
Limitation and discussion
-
Data labeling: The current version of BugPre uses the keywords such as “fix,” “bug,” and “defect” in the commit record to set labels for code. The behind reason is that, according to the git commit specification, the commit type of “fix” represents the modification of the bug. We try to get accurately labeled data from high-quality projects, but it is difficult to deal with out-of-spec submissions. We believe more advanced bug tracking approaches [60] can help label data accurately.
-
Expandability: The proposed system is currently applicable to projects with Java languages. In our approach, AST, control flow, and data flow are used to build node relationship graphs, all of which exist in most programming languages. To extend to the defect detection of projects with other languages, we only need to achieve program parsing of the target language and follow the graph construction process in this paper to build graphs for other programming languages.
-
Insufficient historical data: Insufficient historical data in the early stage of software development is a well-known challenge in cross-version defect detection. For future design, we believe that combining cross-project defect detection can mitigate this limitation.