1 Introduction
- A large data set of manual classifications of commit intents, with categories for improvements of internal and external quality.
- A confirmatory study of changes in size and complexity metric values and in static analysis warnings for quality improvements.
- An exploratory study of size and complexity metric values as well as static analysis warnings of files that are the target of quality improvements.
- A fine-tuned state-of-the-art deep learning model for automatic classification of commit intents.
- We confirm the finding of previous work that quality-increasing commits are smaller than changes unrelated to quality.
- While perfective changes have a positive impact on most static source code metric values and static analysis warnings, corrective changes have a negative impact on size and complexity.
- The files that are the target of perfective changes are already less complex and smaller than files that are not the target of perfective changes.
- The files that are the target of corrective changes are more complex and larger than files that are not the target of corrective changes.
2 Research Questions and Hypotheses
- RQ1: Does developer intent to improve internal or external quality have a positive impact on software metric values? Previous work gives some indication of the impact on software metric values. This is the confirmatory part of our study: we derive two hypotheses from previous work on how size and software metric values should change for different types of quality improvement. We formulate our assumptions as hypotheses and test them in our case study.
- H1: Intended quality improvements are smaller than non-perfective and non-corrective changes. Mockus (2000) found that corrective changes modify fewer lines while perfective changes delete more lines. Purushothaman and Perry (2005) also observed more deletions for perfective maintenance and an overall smaller size of perfective and corrective maintenance. Both studies provide measurements on which we base our hypothesis. Because both use the same closed-source project, our study can show whether the assumption also holds for multiple open-source Java projects.
Hönel et al. (2019) used size-based metrics as additional features for an automated approach to classify maintenance types. They found that the size-based metric values increased the classification performance. Moreover, just-in-time quality assurance (Kamei et al. 2013) builds on the assumption that changes and metrics derived from these changes can predict bug introduction, which implies that change types differ measurably. Therefore, we hypothesize that corrective as well as perfective maintenance consists of smaller changes. Feature additions should be larger than both, and therefore we assume that the categories we are interested in, perfective and corrective, are smaller than other non-perfective and non-corrective changes.
- H2: Intended quality improvements impact software quality metric values in a positive way. In this paper, we focus on metrics used in the Columbus Quality Model (Bakota et al. 2011, 2014). The metrics were specifically chosen for a quality model, so their values should differ by maintenance category. Prior research, e.g., Chávez et al. (2017) and Stroggylos and Spinellis (2007), found that refactorings, which are part of our classification, have a measurable impact on software metric values. We hypothesize that an improvement consciously applied by a developer via a perfective commit has a measurable, positive impact on software metric values. Positive means that the metric value changes in the expected direction, e.g., complexity is reduced. We note our expected direction for each metric together with a description in Table 4.
Defect prediction research assumes a connection between software metrics and external software quality in the form of bugs. While most publications in defect prediction do not investigate the impact of single bug-fixing changes, the most common datasets all contain coupling, size, and complexity metrics as independent variables, e.g., Jureczko and Madeyski (2010), NASA (2004), and D’Ambros et al. (2012); see also the systematic literature review by Hosseini et al. (2017). We hypothesize that fixing bugs via corrective commits has a measurable, positive impact on software metric values. While a bug fix may add complexity, our study compares bug-fixing changes with all non-corrective changes, including feature additions. Therefore, we do not hypothesize that bug fixing decreases complexity in general, but that it decreases complexity in comparison to all non-corrective changes.
In contrast to H1 we are not able to compare our results to concrete studies as we are not aware of a study that investigates metric value changes of perfective and corrective changes and compares them against all other non-perfective and non-corrective changes. We are instead trying to validate the assumption that quality improvements should have a positive impact on software quality metrics as they are found to improve detection of defects (Gyimothy et al. 2005).
- RQ2: What kind of files are the target of internal or external quality improvements? The first part of our study provides us with information about metric value changes for quality increasing commits. In this part, we are exploring which files are the target of quality increasing commits. We are interested in how complex, e.g., via cyclomatic complexity, a file is on average that receives perfective maintenance. Moreover, on the external quality side we are interested in which files are receiving corrective changes. Due to the exploratory nature of this research question, we do not derive hypotheses.
3 Related Work
4 Case Study Design
4.1 Data and Study Subject Selection
Project | Timeframe | #C | #S | #SP | #SC | #AP | #AC |
---|---|---|---|---|---|---|---|
archiva | 2005–2018 | 3,914 | 79 | 35 | 17 | 1,478 | 1,005 |
calcite | 2012–2018 | 1,987 | 40 | 8 | 14 | 565 | 665 |
cayenne | 2007–2018 | 3,738 | 75 | 31 | 14 | 1,470 | 1,007 |
commons-bcel | 2001–2019 | 884 | 18 | 9 | 6 | 588 | 171 |
commons-beanutils | 2001–2018 | 577 | 12 | 5 | 2 | 317 | 130 |
commons-codec | 2003–2018 | 828 | 17 | 12 | 1 | 619 | 76 |
commons-collections | 2001–2018 | 1,827 | 37 | 27 | 3 | 1,185 | 200 |
commons-compress | 2003–2018 | 1,598 | 32 | 17 | 6 | 873 | 317 |
commons-configuration | 2003–2018 | 2,075 | 42 | 23 | 7 | 1,027 | 253 |
commons-dbcp | 2001–2019 | 1,034 | 21 | 15 | 3 | 672 | 211 |
commons-digester | 2001–2017 | 1,256 | 26 | 16 | 0 | 744 | 113 |
commons-imaging | 2007–2018 | 682 | 14 | 10 | 2 | 476 | 96 |
commons-io | 2002–2018 | 1,036 | 21 | 15 | 3 | 613 | 171 |
commons-jcs | 2002–2018 | 788 | 16 | 10 | 1 | 400 | 162 |
commons-jexl | 2002–2018 | 1,469 | 30 | 20 | 1 | 873 | 199 |
commons-lang | 2002–2018 | 3,261 | 66 | 50 | 6 | 2,182 | 420 |
commons-math | 2003–2018 | 4,675 | 94 | 66 | 10 | 2,981 | 574 |
commons-net | 2002–2018 | 1,092 | 22 | 13 | 5 | 585 | 246 |
commons-rdf | 2014–2018 | 529 | 11 | 9 | 0 | 341 | 35 |
commons-scxml | 2005–2018 | 479 | 10 | 6 | 2 | 256 | 76 |
commons-validator | 2002–2018 | 1,573 | 32 | 18 | 6 | 900 | 296 |
commons-vfs | 2002–2018 | 1,136 | 23 | 11 | 8 | 628 | 207 |
eagle | 2015–2018 | 582 | 12 | 5 | 4 | 104 | 199 |
falcon | 2011–2018 | 1,547 | 31 | 7 | 13 | 255 | 676 |
flume | 2011–2018 | 1,489 | 30 | 5 | 14 | 266 | 591 |
giraph | 2010–2018 | 854 | 18 | 4 | 6 | 201 | 281 |
gora | 2010–2019 | 569 | 12 | 3 | 4 | 182 | 141 |
helix | 2011–2019 | 2,199 | 44 | 8 | 9 | 552 | 580 |
httpcomponents-client | 2005–2019 | 2,399 | 48 | 22 | 16 | 1,113 | 639 |
httpcomponents-core | 2005–2019 | 2,598 | 52 | 25 | 12 | 1,326 | 544 |
jena | 2002–2019 | 8,698 | 174 | 88 | 34 | 4,163 | 1,424 |
jspwiki | 2001–2018 | 4,326 | 87 | 32 | 25 | 1,523 | 941 |
knox | 2012–2018 | 1,131 | 23 | 3 | 10 | 266 | 306 |
kylin | 2014–2018 | 6,789 | 136 | 40 | 40 | 1,904 | 2,163 |
lens | 2013–2018 | 1,370 | 28 | 9 | 9 | 321 | 479 |
mahout | 2008–2018 | 2,075 | 42 | 16 | 15 | 836 | 467 |
manifoldcf | 2010–2019 | 2,867 | 58 | 10 | 21 | 602 | 1,164 |
mina-sshd | 2008–2019 | 1,281 | 26 | 10 | 6 | 381 | 396 |
nifi | 2014–2018 | 3,299 | 66 | 12 | 18 | 592 | 1,052 |
opennlp | 2008–2018 | 1,763 | 36 | 22 | 6 | 805 | 275 |
parquet-mr | 2012–2018 | 1,228 | 25 | 7 | 9 | 439 | 316 |
pdfbox | 2008–2018 | 8,256 | 166 | 81 | 69 | 3,934 | 2,904 |
phoenix | 2014–2019 | 7,835 | 157 | 23 | 83 | 828 | 4,545 |
ranger | 2014–2018 | 2,213 | 45 | 10 | 20 | 434 | 908 |
roller | 2005–2019 | 2,435 | 49 | 15 | 13 | 869 | 723 |
santuario-java | 2001–2019 | 1,455 | 30 | 14 | 5 | 627 | 406 |
storm | 2011–2018 | 2,839 | 57 | 24 | 9 | 987 | 716 |
streams | 2012–2019 | 911 | 19 | 7 | 2 | 264 | 196 |
struts | 2006–2018 | 2,945 | 59 | 21 | 18 | 1,191 | 682 |
systemml | 2012–2018 | 3,860 | 78 | 21 | 25 | 921 | 1,416 |
tez | 2013–2018 | 2,359 | 48 | 8 | 27 | 443 | 1,223 |
tika | 2007–2018 | 2,581 | 52 | 11 | 10 | 705 | 740 |
wss4j | 2004–2018 | 2,455 | 50 | 22 | 10 | 712 | 702 |
zeppelin | 2013–2018 | 1,836 | 37 | 11 | 6 | 333 | 699 |
Total | | 125,482 | 2,533 | 1,022 | 685 | 47,852 | 35,124 |
4.2 Change Type Classification Guidelines
A change is classified as perfective if…
1. the commit message says code is removed or marked as deprecated.
2. code is moved to new packages.
3. generics are introduced, new Java features are used, existing code is switched to collections, or class members are switched to final.
4. documentation is improved or example code is updated.
5. static analysis warnings are fixed even though no related bug is reported.
6. code is reformatted or the readability is otherwise improved (e.g. whitespace fixes or tabs to spaces).
7. existing code is cleaned up, simplified, or its efficiency improved.
8. dependencies are updated.
9. developer tooling is improved, e.g., build scripts or logging facilities.
10. the repository layout is cleaned, e.g., by removing compiled code or maintaining .gitignore files.
11. tests are improved or added.
Examples:
- "Eliminated unused private field. JIRA: DBCP-255": Because of other null checks it was already impossible to use the field. Thus, this is clean up.
- "[CODEC-127] Non-ascii characters in source files": While the linked issue is a bug, it only affects IDEs for developers and not the compiled code. Thus, this is an improvement of developer tooling.
- "JEXL-240: Javadoc": The message indicates that this commit only improved the code comments. Therefore, it is classified as perfective.
A change is classified as corrective if…
1. the commit message mentions bug fixes.
2. the commit message or the linked issue mentions that a wrong behaviour is fixed.
3. the commit message or the linked issue mentions that a NullPointerException is fixed.
4. a bug report is linked via the commit message that is of type bug and is not just a feature request in disguise (see Herzig et al. 2013).
Examples:
- "KYLIN-940 ,fix NPE in monitor module, apply patch from Xiaoyu Wang": This fixes a NullPointerException that is visible to the end user.
- "owl syntax checker (bug fixes)": Fixes a wrong behavior.
A change is classified as other if…
1. the commit message mentions feature or functionality addition.
2. the commit message mentions license information or copyrights changes.
3. the commit message mentions repository related information with unclear purpose, e.g., merges of branches without information, tagging of releases.
4. the commit message mentions that a release is prepared.
5. an issue is linked via the commit message that requests a feature.
6. any of the 1-5 are tangled with a perfective or corrective classification.
Examples:
- "KYLIN-715 fix license issue": License changes or additions are not direct improvements of source code.
- "Support the alpha channel for PAM files. Fix the alpha channel order when reading and writing. Add various tests.": This change adds support for a new feature, fixes something and adds tests, it is therefore highly tangled and we do not classify it as either or both.
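The precedence implied by rule 6 of the "other" category can be sketched as follows. This is only an illustration under stated assumptions: the function name `resolve_label` and the string labels are hypothetical, the input is the set of categories whose criteria a human rater found to apply, and the case in which only perfective and corrective criteria match together is left undecided because the guidelines above do not cover it.

```python
from typing import Optional, Set

PERFECTIVE, CORRECTIVE, OTHER = "perfective", "corrective", "other"


def resolve_label(matched: Set[str]) -> Optional[str]:
    """Combine the categories whose criteria matched into a single label.

    Returns None when the guidelines above do not determine the outcome.
    """
    if OTHER in matched:
        # Rule 6 of "other": tangling with perfective or corrective stays "other".
        return OTHER
    if len(matched) == 1:
        return next(iter(matched))
    # No match, or perfective and corrective together: not covered above.
    return None
```

For the PAM example above, which adds a feature, fixes something, and adds tests, all three categories match and the sketch returns "other".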
Model | Acc. | F1 | MCC | Description |
---|---|---|---|---|
von der Mosel et al. (2022) | 0.80 | 0.79 | 0.70 | BERT model pre-trained on software engineering data, fine-tuned with only commit messages |
Ghadhab et al. (2021) | 0.78 | 0.80 | – | BERT model pre-trained on natural language, includes code changes |
Gharbi et al. (2019) | – | 0.46 | – | Multi-label active learning, only commit message |
Levin and Yehudai (2017) | 0.76 | – | – | Keywords and code changes, Random Forest model |
Hönel et al. (2019) | 0.80 | – | – | LogitBoost model, includes code density |
4.3 Deep Learning for Commit Intent Classification
4.4 Metric Selection
Name | Description | Abbrev | Expected direction |
---|---|---|---|
Cyclomatic Complexity (McCabe 1976) | The number of independent control-flow paths. | McCC | ↓ |
Logical Lines of Code | Number of lines in a file without comments and empty lines. | LLOC | ↓ |
Nesting Level else-if | Maximum of nesting level in a file. | NLE | ↓ |
Number of parameters in a method | The sum of all parameters of all methods in a file. | NUMPAR | ↓ |
Clone Coverage | Ratio of code covered by duplicates. | CC | ↓ |
Comment lines of code | Sum of commented lines. | CLOC | ↑ |
Comment density | Ratio of CLOC to LLOC. | CD | ↑ |
API Documentation | Number of documented public methods, + 1 if class is documented. | AD | ↑ |
Number of Ancestors | Number of classes, interfaces, enums from which the class is inherited. | NOA | ↓ |
Coupling between object classes | Number of used classes (inheritance, function call, type reference). | CBO | ↓ |
Number of Incoming Invocations | Other methods that call the current class. | NII | ↓ |
Minor static analysis warnings | E.g., brace rules, naming conventions. | Minor | ↓ |
Major static analysis warnings | E.g., type resolution rules, unnecessary/unused code rules. | Major | ↓ |
Critical static analysis warnings | E.g., equals for string comparison, catching null pointer exceptions. | Critical | ↓ |
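The expected directions in the table above can be encoded to check whether a metric value moved in the "positive" sense used by H2. This is a minimal sketch: the directions are copied from the table, while the names `EXPECTED_DIRECTION` and `is_positive_change` and the before/after interface are assumptions made for illustration.

```python
# Expected direction of change per metric, copied from the table above:
# -1 means a decrease is expected, +1 means an increase is expected.
EXPECTED_DIRECTION = {
    "McCC": -1, "LLOC": -1, "NLE": -1, "NUMPAR": -1, "CC": -1,
    "CLOC": +1, "CD": +1, "AD": +1,
    "NOA": -1, "CBO": -1, "NII": -1,
    "Minor": -1, "Major": -1, "Critical": -1,
}


def is_positive_change(metric: str, before: float, after: float) -> bool:
    """Return True if the metric moved in its expected direction."""
    delta = after - before
    # An unchanged value (delta == 0) is not counted as a positive change.
    return delta * EXPECTED_DIRECTION[metric] > 0
```

For example, `is_positive_change("McCC", 12, 9)` is true because complexity dropped, while the same drop in `CLOC` would not count as positive.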
4.5 Analysis Procedure
5 Results
5.1 Confirmatory Study
5.1.1 Results H1: Intended Quality Improvements are Smaller than Non-perfective and Non-corrective Changes
Metric | Perfective p-val | Perfective d | Corrective p-val | Corrective d |
---|---|---|---|---|
#lines added | <0.0001 | 0.20 (s) | <0.0001 | 0.21 (s) |
#lines deleted | <0.0001 | 0.15 (s) | <0.0001 | 0.16 (s) |
#files modified | 0.2081 | – | <0.0001 | 0.22 (s) |
#hunks | <0.0001 | 0.01 (n) | <0.0001 | 0.22 (s) |
5.1.2 Results H2: Intended Quality Improvements Impact Software Quality Metric Values in a Positive Way
Metric | %NZ | %NZ P | %NZ C |
---|---|---|---|
McCC | 51.03 | 31.01 | 57.70 |
LLOC | 74.69 | 60.93 | 77.99 |
NLE | 36.76 | 23.92 | 34.28 |
NUMPAR | 35.93 | 24.44 | 24.98 |
CC | 49.41 | 37.81 | 55.14 |
CLOC | 51.56 | 46.52 | 42.51 |
CD | 76.07 | 66.48 | 77.35 |
AD | 27.19 | 20.63 | 15.82 |
NOA | 10.51 | 6.96 | 3.62 |
CBO | 30.89 | 22.52 | 22.22 |
NII | 27.08 | 17.78 | 21.09 |
Minor | 36.15 | 27.02 | 29.77 |
Major | 19.87 | 13.23 | 14.77 |
Critical | 7.23 | 4.20 | 4.95 |
Metric | Perfective p-val | Perfective d | Corrective p-val | Corrective d |
---|---|---|---|---|
McCC | <0.0001 | 0.39 (m) | 1.0000 | – |
LLOC | <0.0001 | 0.45 (m) | 1.0000 | – |
NLE | <0.0001 | 0.27 (s) | 1.0000 | – |
NUMPAR | <0.0001 | 0.25 (s) | <0.0001 | 0.09 (n) |
CC | 1.0000 | – | <0.0001 | 0.12 (s) |
CLOC | <0.0001 | 0.16 (s) | <0.0001 | 0.05 (n) |
CD | 1.0000 | – | <0.0001 | 0.16 (s) |
AD | <0.0001 | 0.02 (n) | <0.0001 | 0.08 (n) |
NOA | <0.0001 | 0.08 (n) | <0.0001 | 0.07 (n) |
CBO | <0.0001 | 0.19 (s) | <0.0001 | 0.06 (n) |
NII | <0.0001 | 0.19 (s) | <0.0001 | 0.02 (n) |
Minor | <0.0001 | 0.19 (s) | <0.0001 | 0.05 (n) |
Major | <0.0001 | 0.12 (s) | <0.0001 | 0.05 (n) |
Critical | <0.0001 | 0.05 (n) | <0.0001 | 0.03 (n) |
5.2 Summary RQ1
5.3 Exploratory Study
Metric | All | Perfective | Corrective |
---|---|---|---|
McCC | 21.78 | 18.78 | 33.23 |
LLOC | 186.98 | 163.75 | 264.18 |
NLE | 9.60 | 8.33 | 14.00 |
NUMPAR | 16.06 | 15.00 | 22.00 |
CC | 0.04 | 0.04 | 0.05 |
CLOC | 46.25 | 55.00 | 54.00 |
CD | 0.25 | 0.32 | 0.25 |
AD | 0.50 | 0.67 | 0.46 |
NOA | 1.00 | 1.00 | 1.00 |
CBO | 9.67 | 8.00 | 14.00 |
NII | 8.00 | 8.50 | 9.50 |
Minor | 7.00 | 6.00 | 9.67 |
Major | 2.00 | 1.25 | 3.00 |
Critical | 0.00 | 0.00 | 0.00 |
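To relate the table above to the statements about file size and complexity, the relative difference of each category against the "All" column can be computed. This is a small worked sketch using three rows copied from the table; the helper `relative_difference` and the choice of the "All" column as baseline are assumptions made for illustration, not part of the study.

```python
# Values copied from the table above (columns All, Perfective, Corrective).
FILE_METRICS = {
    "McCC": (21.78, 18.78, 33.23),
    "LLOC": (186.98, 163.75, 264.18),
    "NLE": (9.60, 8.33, 14.00),
}


def relative_difference(value: float, baseline: float) -> float:
    """Return (value - baseline) / baseline."""
    return (value - baseline) / baseline


for metric, (all_files, perfective, corrective) in FILE_METRICS.items():
    print(
        f"{metric}: perfective {relative_difference(perfective, all_files):+.1%}, "
        f"corrective {relative_difference(corrective, all_files):+.1%}"
    )
```

For all three metrics the perfective value lies below the "All" value and the corrective value above it.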
Metric | Perfective p-val | Perfective d | Corrective p-val | Corrective d |
---|---|---|---|---|
McCC | <0.0001 | 0.05 (n) | <0.0001 | 0.08 (n) |
LLOC | <0.0001 | 0.05 (n) | <0.0001 | 0.05 (n) |
NLE | <0.0001 | 0.04 (n) | <0.0001 | 0.07 (n) |
NUMPAR | 0.6367 | – | 0.0218 | – |
CC | <0.0001 | 0.01 (n) | 0.0011 | – |
CLOC | <0.0001 | 0.12 (s) | <0.0001 | 0.06 (n) |
CD | <0.0001 | 0.15 (s) | <0.0001 | 0.15 (s) |
AD | <0.0001 | 0.17 (s) | <0.0001 | 0.15 (s) |
NOA | 0.5109 | – | <0.0001 | 0.02 (n) |
CBO | <0.0001 | 0.09 (n) | <0.0001 | 0.07 (n) |
NII | <0.0001 | 0.05 (n) | <0.0001 | 0.04 (n) |
Minor | <0.0001 | 0.04 (n) | <0.0001 | 0.02 (n) |
Major | <0.0001 | 0.09 (n) | <0.0001 | 0.04 (n) |
Critical | <0.0001 | 0.05 (n) | <0.0001 | 0.03 (n) |