1 Introduction
2 Related Work
2.1 Self-Admitted Technical Debt
2.2 SATD in Issue Tracking Systems
3 Study Design
- (RQ1) How to accurately identify self-admitted technical debt in issue tracking systems? This research question focuses on the accuracy of automatic SATD identification, and is further refined into three sub-questions:
  - (RQ1.1) Which algorithms have the best accuracy to capture self-admitted technical debt in issue tracking systems? Since we are aiming at accurately identifying SATD within issues, we need to compare the accuracy of different approaches to choose the best one.
  - (RQ1.2) How to improve the accuracy of the machine learning model? To optimize the accuracy of identifying SATD in issue tracking systems, we investigate word embedding refinement, strategies for handling imbalanced data, and hyperparameter tuning.
  - (RQ1.3) How can transfer learning improve the accuracy of identifying self-admitted technical debt in issue tracking systems? Transfer learning uses knowledge gained while solving one task to address a different but related task. We therefore study the influence of leveraging external datasets (from source code comments) on our SATD detector using transfer learning.
- (RQ2) Which keywords are the most informative to identify self-admitted technical debt in issue tracking systems? By extracting keywords from technical debt statements, we can better understand how developers declare technical debt in issue tracking systems. The summarized keywords also help developers understand and identify SATD within issues. Overall, understanding these keywords allows us to explain how the classifier works.
- (RQ3) How generic is the classification approach among projects and issue tracking systems? Different projects use different issue tracking systems (e.g., Jira and Google Monorail) and are maintained by different communities (e.g., Apache and Google). Thus, we need to evaluate how far our results are applicable to different projects and issue tracking systems. This research question is concerned with the generalizability of machine learning approaches.
- (RQ4) How much data is needed for training the machine learning model to accurately identify self-admitted technical debt in issues? Intuitively, training a machine learning classifier on a bigger dataset leads to better accuracy. However, manually annotating SATD in issue tracking systems is time-consuming. We therefore ask RQ4 to determine the most suitable training dataset size, i.e., one that achieves the best classification accuracy with a minimum amount of annotation effort.
3.1 Approach Overview
3.2 Data Collection
| Project | Issue tracker | Languages | SLOC | # Issues | # Analyzed sections | # SATD sections | % SATD sections |
|---|---|---|---|---|---|---|---|
| Camel | Jira | Java | 1,525k | 14,411 | 2,792 | 377 | 13.5% |
| Chromium | Google | C++, C, and JavaScript | 22,472k | 1,079,511 | 3,435 | 264 | 6.7% |
| Gerrit | Google | Java | 455k | 12,711 | 2,812 | 195 | 6.9% |
| Hadoop | Jira | Java | 3,409k | 16,808 | 4,515 | 831 | 18.4% |
| HBase | Jira | Java | 912k | 24,342 | 4,936 | 688 | 13.9% |
| Impala | Jira | C++, Java, and Python | 640k | 9,733 | 1,934 | 355 | 18.4% |
| Thrift | Jira | C++, Java, and C | 294k | 5,196 | 2,756 | 567 | 20.6% |
| Average | | | | | 3,311 | 468 | 14.1% |
| Total | | | | | 23,180 | 3,277 | |
3.3 Filtering Issue Sections
3.4 Issue Section Classification
| Type | Indicator | Definition | # (indicator) | # (type) | % (type) |
|---|---|---|---|---|---|
| Architecture debt | Violation of modularity | Because shortcuts were taken, multiple modules became inter-dependent, while they should be independent. | 46 | 87 | 2.7 |
| | Using obsolete technology | Architecturally-significant technology has become obsolete. | 41 | | |
| Build debt | Over- or under-declared dependencies | Under-declared dependencies: dependencies in upstream libraries are not declared and rely on dependencies in lower-level libraries. Over-declared dependencies: unneeded dependencies are declared. | 25 | 64 | 2.0 |
| | Poor deployment practice | The quality of deployment is low, e.g., compile flags or build targets are not well organized. | 39 | | |
| Code debt | Complex code | Code has accidental complexity and requires extra refactoring to reduce this complexity. | 30 | 1246 | 38.0 |
| | Dead code | Code is no longer used and needs to be removed. | 121 | | |
| | Duplicated code | Code occurs more than once instead of as a single reusable function. | 40 | | |
| | Low-quality code | Code quality is low, for example because it is unreadable, inconsistent, or violates coding conventions. | 856 | | |
| | Multi-thread correctness | Thread-safe code is not correct and may potentially result in synchronization or efficiency problems. | 40 | | |
| | Slow algorithm | A non-optimal algorithm that runs slowly is utilized. | 159 | | |
| Defect debt | Uncorrected known defects | Defects are found by developers, but fixing them is ignored or deferred. | 25 | 25 | 0.8 |
| Design debt | Non-optimal decisions | Non-optimal design decisions are adopted. | 935 | 935 | 28.5 |
| Documentation debt | Low-quality documentation | The documentation has been updated to reflect the changes in the system, but the quality of the updated documentation is low. | 342 | 486 | 14.8 |
| | Outdated documentation | A function or class is added, removed, or modified in the system, but the documentation has not been updated to reflect the change. | 144 | | |
| Requirement debt | Requirements partially implemented | Requirements are implemented, but some are not fully implemented. | 67 | 96 | 2.9 |
| | Non-functional requirements not fully satisfied | Non-functional requirements (e.g., availability, capacity, concurrency, extensibility), as described by scenarios, are not fully satisfied. | 29 | | |
| Test debt | Expensive tests | Tests are expensive, slowing down testing activities. Extra refactoring is needed to simplify the tests. | 28 | 338 | 10.3 |
| | Flaky tests | Tests fail or pass intermittently for the same configuration. | 83 | | |
| | Lack of tests | A function is added, but no tests are added to cover it. | 158 | | |
| | Low coverage | Only part of the source code is executed during testing. | 69 | | |
3.5 Training and Executing Machine Learning Models
3.5.1 Machine Learning Models
- Traditional machine learning approaches (SVM, NBM, kNN, LR, RF): Support Vector Machine (SVM) (Sun et al. 2009), Naive Bayes Multinomial (NBM) (McCallum et al. 1998), k-Nearest Neighbor (kNN) (Tan 2006), Logistic Regression (LR) (Genkin et al. 2007), and Random Forest (RF) (Xu et al. 2012) classifiers are widely used in text classification tasks (Kowsari et al. 2019) due to their good classification accuracy. Moreover, the results of current studies on SATD identification (Maldonado et al. 2017; Huang et al. 2018; Flisar and Podgorelec 2019) show that these approaches achieve good accuracy in classifying SATD in source code comments. Thus, they also have the potential to achieve good accuracy when classifying SATD in issue trackers. We therefore train all of these classifiers with the Sklearn implementation on Bag-of-Words (BoW) features with default settings and compare their accuracy.
- Text Graph Convolutional Network (Text GCN): Text GCN generates a single large graph from the corpus and classifies text by classifying graph nodes with a graph neural network (Yao et al. 2019). This approach achieves promising performance in text classification tasks, outperforming numerous state-of-the-art methods (Yao et al. 2019).
- Text Convolutional Neural Network (Text CNN): Text CNN is a simple one-layer CNN proposed by Kim (2014), which achieved high accuracy compared to the state of the art. We describe this approach in more detail, as it is background knowledge for understanding some of the results in Section 4. The architecture of the model is presented in Fig. 2. The input issue section is first tokenized and converted into a matrix using an n-dimensional word embedding (see Section 3.5.4). For example, in Fig. 2 the input issue section is ‘document should be updated to reflect this’, which is represented as a 7 × 5 matrix because the issue section contains 7 words and the dimensionality of the word embedding is 5. The matrix is then treated as an image, and convolution is performed to extract high-level features. Because each row in the issue section matrix represents a word, the width of a filter must equal the width of the input matrix; only the height of the filter can be adjusted, which is denoted by the region size. Importantly, multiple filters with different region sizes are applied to the issue section matrix to extract multiple features. In Fig. 2, the model adopts three filter region sizes (i.e., 1, 2, and 3) and three filters per region size. Applying the three filter region sizes 1, 2, and 3 to the input issue section produces nine feature maps with sizes 7, 6, and 5. For example, with a filter region size of 1, the convolution is applied to every row (i.e., every word) of the input issue section, producing a feature map of size 7. After that, to make use of the information from each feature map, 1-max-pooling (which computes the maximum value of each feature map) extracts one scalar per feature map. The resulting features are concatenated and flattened to form the penultimate layer. Finally, the output layer calculates the probability of the section being a SATD section using the softmax activation function. This approach has been proven accurate for identifying SATD in source code comments (Ren et al. 2019); thus it also has potential for accurately identifying SATD in issues.
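As an illustration of the convolution and pooling arithmetic described above, the following plain-Python sketch reproduces the running example (a 7-word section, 5-dimensional embeddings, region sizes 1, 2, and 3, three filters per size). The embedding values and filter weights are random placeholders, not trained parameters:

```python
import random

random.seed(0)

EMB_DIM = 5  # dimensionality of the word embedding
words = "document should be updated to reflect this".split()  # 7 words

# Toy 7 x 5 matrix standing in for the embedded issue section.
section = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in words]

def convolve(matrix, filt):
    """Slide a height-h filter over the rows; the filter width always equals
    EMB_DIM, so an n-row input yields a feature map of length n - h + 1."""
    h = len(filt)
    n = len(matrix)
    return [
        sum(matrix[i + r][c] * filt[r][c]
            for r in range(h) for c in range(EMB_DIM))
        for i in range(n - h + 1)
    ]

region_sizes = [1, 2, 3]
filters_per_size = 3

penultimate = []
for h in region_sizes:
    for _ in range(filters_per_size):
        filt = [[random.uniform(-1, 1) for _ in range(EMB_DIM)]
                for _ in range(h)]
        feature_map = convolve(section, filt)   # lengths 7, 6, and 5
        penultimate.append(max(feature_map))    # 1-max-pooling -> one scalar

print(len(penultimate))  # 9
```

The nine pooled maxima correspond to the penultimate layer in Fig. 2 (3 region sizes × 3 filters); a softmax output layer on top of these features would produce the SATD probability.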
3.5.2 Baseline Approaches
- Baseline 1 (Random): This simple baseline assumes that SATD detection is random: it classifies sections as SATD sections randomly, based on the probability of a section being a SATD section. For example, if 3,277 out of 23,180 sections in the training set are SATD sections, we assume the probability of a section being a SATD section is 14.1%. The random approach then classifies each section in the test set as a SATD section with that probability (14.1%).
- Baseline 2 (Keyword): Potdar and Shihab (2014) identified and summarized 62 SATD keywords, such as fixme, ugly, temporary solution, this isn’t quite right, and this can be a mess. These keywords have been used to automatically identify SATD comments (Bavota and Russo 2016). The keyword-based method classifies a section as a SATD section when it contains one or more of these SATD keywords.
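The two baselines can be sketched as follows; note that the keyword list here is a small illustrative subset of the 62 keywords, not the full list:

```python
import random

# Illustrative subset of the 62 SATD keywords from Potdar and Shihab (2014).
SATD_KEYWORDS = ["fixme", "ugly", "temporary solution",
                 "this isn't quite right", "this can be a mess"]

def random_baseline(test_sections, p_satd, seed=0):
    """Baseline 1: classify each section as SATD with probability p_satd
    (e.g. 3,277 / 23,180, i.e. 14.1%, estimated from the training set)."""
    rng = random.Random(seed)
    return [rng.random() < p_satd for _ in test_sections]

def keyword_baseline(test_sections):
    """Baseline 2: classify a section as SATD iff it contains a keyword."""
    return [any(k in s.lower() for k in SATD_KEYWORDS)
            for s in test_sections]

sections = ["FIXME: this can be a mess", "Update the release notes"]
print(keyword_baseline(sections))  # [True, False]
```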
3.5.3 Strategies for Handling Imbalanced Data
- Easy Data Augmentation (EDA): This technique augments text data through synonym replacement, random insertion, random swap, and random deletion (Wei and Zou 2019). To balance the dataset, we generate and add synthetic SATD sections to the training data using the EDA technique.
- Oversampling: This method simply replicates the minority class to re-balance the training data. We replicate the SATD sections to balance SATD and non-SATD sections before training.
- Weighted loss: This method first calculates a weight for each class according to its frequency: the more frequent a class, the lower its weight. The loss of each training sample is then scaled by the weight of its class. Weighted loss thus penalizes misclassified sections from the minority class (i.e., false negatives and false positives) more heavily during training, counteracting the data imbalance. This strategy is widely used for training CNN models on imbalanced datasets (Phan et al. 2017; Ren et al. 2019).
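A minimal sketch of the oversampling and weighted-loss strategies (EDA additionally rewrites the text itself and is omitted here). The toy labels roughly mirror the dataset's 14.1% SATD ratio, and the inverse-frequency weight formula is one common choice; the paper does not specify the exact formula used:

```python
import random
from collections import Counter

def oversample(sections, labels, seed=0):
    """Replicate randomly drawn minority-class (SATD) samples until both
    classes are equally frequent in the training data."""
    rng = random.Random(seed)
    minority = [(s, l) for s, l in zip(sections, labels) if l == 1]
    majority = [(s, l) for s, l in zip(sections, labels) if l == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    return [s for s, _ in combined], [l for _, l in combined]

def class_weights(labels):
    """Inverse-frequency weights: the more frequent a class, the lower its
    weight; each sample's loss is scaled by the weight of its class."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

labels = [1] * 3 + [0] * 17          # ~15% minority, similar to the SATD ratio
sections = [f"s{i}" for i in range(20)]

xs, ys = oversample(sections, labels)
print(Counter(ys))                    # balanced: both classes now have 17 samples
print(class_weights(labels))          # minority weight ~3.33, majority ~0.59
```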
3.5.4 Word Embedding
3.5.5 Evaluation Metrics
3.5.6 Keyword Extraction
4 Results
4.1 (RQ1.1) Which Algorithms Have the Best Accuracy to Capture Self-Admitted Technical Debt in Issue Tracking Systems?
| Type | Classifier | Precision | Recall | F1-score | F1-score Imp. over Random | F1-score Imp. over Keyword |
|---|---|---|---|---|---|---|
| Deep learning | Text CNN (rand) | 0.685 | 0.530 | 0.597 | 4.3× | 13.6× |
| | Text CNN (wiki) | 0.677 | 0.463 | 0.549 | 3.9× | 12.5× |
| | Text CNN (SO) | 0.651 | 0.541 | 0.590 | 4.2× | 13.4× |
| | Text GCN | 0.474 | 0.056 | 0.081 | 0.6× | 1.8× |
| Traditional machine learning | SVM | 0.861 | 0.179 | 0.295 | 2.1× | 6.7× |
| | NBM | 0.520 | 0.539 | 0.529 | 3.8× | 12.0× |
| | kNN | 0.582 | 0.029 | 0.055 | 0.4× | 1.2× |
| | LR | 0.643 | 0.430 | 0.515 | 3.7× | 11.7× |
| | RF | 0.730 | 0.182 | 0.291 | 2.1× | 6.6× |
| Baseline | Random | 0.140 | 0.139 | 0.139 | | |
| | Keyword | 0.515 | 0.023 | 0.044 | | |
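The "Imp. over" columns appear to be the ratio of a classifier's F1-score to a baseline's F1-score, where F1 is the harmonic mean of precision and recall; a quick sanity check against the table values:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Keyword baseline: decent precision but very low recall.
print(round(f1(0.515, 0.023), 3))   # 0.044, matching the table

# Improvement factors for Text CNN (rand) versus the two baselines:
print(round(0.597 / 0.139, 1))      # 4.3x over Random
print(round(0.597 / 0.044, 1))      # 13.6x over Keyword
```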
4.2 (RQ1.2) How to Improve the Accuracy of the Machine Learning Model?
4.2.1 Handling Imbalanced Data
| Method | Word embedding | Precision (average) | Recall (average) | F1-score (average) | F1-score Imp. |
|---|---|---|---|---|---|
| Default | Random | 0.685 | 0.530 | 0.597 | – |
| | Wiki-news | 0.677 | 0.463 | 0.549 | – |
| | StackOverflow-post | 0.651 | 0.541 | 0.590 | – |
| | Average | 0.671 | 0.511 | 0.578 | – |
| EDA | Random | 0.606 | 0.470 | 0.529 | −11.3% |
| | Wiki-news | 0.556 | 0.406 | 0.469 | −14.5% |
| | StackOverflow-post | 0.604 | 0.595 | 0.599 | 1.5% |
| | Average | 0.588 | 0.490 | 0.532 | −7.9% |
| Oversampling | Random | 0.573 | 0.717 | 0.636 | 6.5% |
| | Wiki-news | 0.591 | 0.592 | 0.591 | 7.6% |
| | StackOverflow-post | 0.610 | 0.618 | 0.612 | 3.7% |
| | Average | 0.591 | 0.642 | 0.613 | 6.0% |
| Weighted loss | Random | 0.555 | 0.735 | 0.632 | 5.8% |
| | Wiki-news | 0.583 | 0.617 | 0.599 | 9.1% |
| | StackOverflow-post | 0.591 | 0.640 | 0.613 | 3.8% |
| | Average | 0.576 | 0.664 | 0.614 | 6.2% |
4.2.2 Refining Word Embeddings
| Word embedding | Dimensionality | Precision (average) | Recall (average) | F1-score (average) |
|---|---|---|---|---|
| Random | 300 | 0.555 | 0.735 | 0.632 |
| Wiki-news | 300 | 0.583 | 0.617 | 0.599 |
| StackOverflow-post | 200 | 0.591 | 0.640 | 0.613 |
| Issue-tracker-data | 100 | 0.647 | 0.686 | 0.664 |
| | 200 | 0.662 | 0.680 | 0.670 |
| | 300 | 0.648 | 0.703 | 0.673 |
4.2.3 Tuning CNN Hyperparameters
| Type | Region size | Precision (average) | Recall (average) | F1-score (average) |
|---|---|---|---|---|
| Single | (1) | 0.552 | 0.782 | 0.646 |
| | (3) | 0.638 | 0.707 | 0.670 |
| | (5) | 0.631 | 0.686 | 0.657 |
| | (7) | 0.642 | 0.665 | 0.652 |
| Multiple | (1,2) | 0.643 | 0.715 | 0.676 |
| | (1,2,3) | 0.657 | 0.711 | 0.682 |
| | (2,3,4) | 0.655 | 0.706 | 0.677 |
| | (3,4,5) | 0.648 | 0.703 | 0.673 |
| | (1,2,3,4) | 0.663 | 0.703 | 0.679 |
| | (1,3,5,7) | 0.675 | 0.677 | 0.675 |
| | (2,4,6,8) | 0.674 | 0.665 | 0.669 |
| | (1,2,3,4,5) | 0.678 | 0.685 | 0.680 |
| | (1,2,3,5,7) | 0.681 | 0.682 | 0.680 |
| | (1,3,4,5,7) | 0.667 | 0.686 | 0.676 |
| | (1,3,5,7,9) | 0.668 | 0.681 | 0.673 |
| | (1,2,3,4,5,6) | 0.669 | 0.691 | 0.678 |
| | (1,2,3,4,5,6,7) | 0.669 | 0.680 | 0.673 |
| Number of feature maps | Precision (average) | Recall (average) | F1-score (average) |
|---|---|---|---|
| 50 | 0.640 | 0.713 | 0.674 |
| 100 | 0.657 | 0.711 | 0.682 |
| 200 | 0.685 | 0.689 | 0.686 |
| 400 | 0.675 | 0.699 | 0.685 |
| 600 | 0.677 | 0.671 | 0.683 |
4.2.4 Final Results After Machine Learning Optimization
4.3 (RQ1.3) How Can Transfer Learning Improve the Accuracy of Identifying Self-Admitted Technical Debt in Issue Tracking Systems?
- Parameters are not transferred, but are randomly initialized and fine-tuned during training.
- Parameters are transferred and fine-tuned during training.
- Parameters are transferred and frozen, i.e., they cannot be updated during training.

- Source code comment SATD dataset (CO-SATD) (Maldonado et al. 2017): We chose this dataset because it contains SATD in source code comments, which is highly similar to our issue SATD dataset. It contains 62,566 comments, of which 4,071 are annotated as SATD comments.
- Amazon review (AMZ2) and Yelp review (YELP2) datasets (Zhang et al. 2015): AMZ2 contains 2,000,000 Amazon product reviews per polarity; YELP2 includes 289,900 business reviews per polarity. We selected these two datasets because they are significantly larger than our dataset and are commonly used in text classification tasks (Semwal et al. 2018).
- Jira issues sentiment (JIRA-SEN) and Stack Overflow posts sentiment (SO-SEN) datasets (Ortu et al. 2016; Calefato et al. 2018): Both are relatively small datasets (only containing 926 and 2,728 samples, respectively). We chose them because they are from the software engineering domain.
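The three parameter settings above can be illustrated with a toy gradient-descent step. The parameter values, gradients, and learning rate here are made up for illustration; in the study these settings apply to the Text CNN's weights:

```python
import random

def init_params(n, source=None, seed=0):
    """Setting 1: no transfer -- parameters are randomly initialized.
    Otherwise, copy the parameters learned on the source dataset."""
    if source is not None:
        return list(source)
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]

def sgd_step(params, grads, lr=0.5, frozen=False):
    """One gradient-descent step on the target task; frozen parameters
    (setting 3) receive no updates."""
    if frozen:
        return list(params)
    return [p - lr * g for p, g in zip(params, grads)]

# Pretend these three weights were learned on a source dataset (e.g. CO-SATD).
source_params = [0.5, -0.25, 0.75]
grads = [1.0, 1.0, 1.0]   # toy gradients from one target-task batch

random_init = init_params(3)                                          # setting 1
fine_tuned = sgd_step(init_params(3, source_params), grads)           # setting 2
frozen = sgd_step(init_params(3, source_params), grads, frozen=True)  # setting 3

print(fine_tuned)  # [0.0, -0.75, 0.25] -- moved by the gradient step
print(frozen)      # [0.5, -0.25, 0.75] -- unchanged
```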
| Source dataset | Setting | Precision | Recall | F1-score |
|---|---|---|---|---|
| – | | 0.686 | 0.689 | 0.686 |
| CO-SATD | | 0.674 | 0.671 | 0.672 |
| | | 0.675 | 0.684 | 0.679 |
| AMZ2 | | 0.666 | 0.664 | 0.665 |
| | | 0.681 | 0.664 | 0.672 |
| YELP2 | | 0.670 | 0.677 | 0.673 |
| | | 0.689 | 0.677 | 0.681 |
| JIRA-SEN | | 0.676 | 0.696 | 0.684 |
| | | 0.689 | 0.694 | 0.691 |
| SO-SEN | | 0.682 | 0.686 | 0.683 |
| | | 0.685 | 0.685 | 0.685 |
F1-scores by number of issue sections used for training:

| Source dataset | Setting | 0 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| – | | 0.157 | 0.357 | 0.386 | 0.424 | 0.451 | 0.474 | 0.488 | 0.501 | 0.507 | 0.523 |
| CO-SATD | | 0.260 | 0.395 | 0.402 | 0.412 | 0.425 | 0.435 | 0.442 | 0.445 | 0.458 | 0.461 |
| | | 0.202 | 0.367 | 0.393 | 0.425 | 0.469 | 0.478 | 0.493 | 0.497 | 0.518 | 0.515 |
| AMZ2 | | 0.170 | 0.289 | 0.295 | 0.305 | 0.334 | 0.345 | 0.361 | 0.364 | 0.380 | 0.388 |
| | | 0.145 | 0.349 | 0.386 | 0.400 | 0.438 | 0.451 | 0.461 | 0.475 | 0.487 | 0.497 |
| YELP2 | | 0.051 | 0.288 | 0.290 | 0.300 | 0.323 | 0.329 | 0.352 | 0.358 | 0.378 | 0.389 |
| | | 0.147 | 0.341 | 0.371 | 0.398 | 0.439 | 0.462 | 0.477 | 0.481 | 0.506 | 0.500 |
| JIRA-SEN | | 0.273 | 0.337 | 0.352 | 0.379 | 0.411 | 0.419 | 0.435 | 0.449 | 0.468 | 0.473 |
| | | 0.169 | 0.354 | 0.391 | 0.410 | 0.452 | 0.472 | 0.486 | 0.503 | 0.517 | 0.531 |
| SO-SEN | | 0.256 | 0.296 | 0.301 | 0.330 | 0.356 | 0.371 | 0.384 | 0.393 | 0.418 | 0.428 |
| | | 0.089 | 0.350 | 0.389 | 0.410 | 0.458 | 0.477 | 0.484 | 0.498 | 0.515 | 0.517 |
4.4 (RQ2) Which Keywords Are the Most Informative to Identify Self-Admitted Technical Debt in Issue Tracking Systems?
| Unigram keyword | Bigram keyword | Trigram keyword |
|---|---|---|
| flaky (test) | too much | get rid of |
| leak (code) | not used (code) | not thread safe (requirement) |
| unused (code) | more readable (code) | clean up code (code) |
| unnecessary (code) | more efficient (code) | not done yet (requirement) |
| typo (code/documentation) | dead code (code) | avoid extra seek |
| slow (code) | infinite loop (code) | reduce duplicate code (code) |
| redundant (code) | too long (documentation) | no longer needed (code/documentation) |
| confusing (code) | not implemented (requirement) | not supported yet (requirement) |
| nit | less verbose (code) | documentation doesn’t match (documentation) |
| ugly | more robust (design) | short term solution |
| simplify (code) | speed up (code) | spurious error messages (code) |
| misleading (documentation) | missing documentation (documentation) | it’d be nice |

| Four-gram keyword | Five-gram keyword |
|---|---|
| please add a test (test) | wastes a lot of space |
| would significantly improve performance (code) | there is no unit test (test) |
| makes it much easier | lead to huge memory allocation (design) |
| avoid calling it twice (code) | test doesn’t add much value (test) |
| takes a long time (code) | some holes in the doc (documentation) |
| good to have coverage (test) | by hard coding instead of (code) |
| makes it very hard | should be updated to reflect (documentation) |
| patch doesn’t apply cleanly (code) | more tightly coupled than ideal (design) |
| it’s not perfectly documented (documentation) | any chance of a test (test) |
| need to update documentation (documentation) | should improve a bit by |
| make it less brittle (design) | it’d help code readability if (code) |
| documentation does not mention (documentation) | solution won’t be really satisfactory |
| Type | Indicator | Keywords | Example |
|---|---|---|---|
| Code | Complex Code | simplify, redundant, less verbose | “That can simplify the logic there.” - [HADOOP-10295] |
| | Dead Code | unused, unnecessary, not used, dead code, no longer needed | “I would like to remove this as its no longer needed, and also its code is not complete.” - [Camel-8174] |
| | Duplicated Code | reduce duplicate code | |
| | Low-Quality Code | typo, leak, confusing, more readable, infinite loop, spurious error messages, avoid calling it twice, patch doesn’t apply cleanly, it’d help code readability if | “...to make their code more readable. I would like to see something like this in the API...” - [HBase-1990] |
| | Slow Algorithm | slow, more efficient, speed up, would significantly improve performance, takes a long time | “Rowlocks should use ConcurrentHashMap as it is much more efficient than Collections.synchronizedMap(HashMap)” - [HBase-798] |
| Design | Non-Optimal Decision | more robust, make it less brittle, lead to huge memory allocation, more tightly coupled than ideal | “...didn’t tackle those pieces yet. They also seem more tightly coupled than ideal.” - [HBase-12749] |
| Requirement | Requirement Partially Implemented | not implemented, not done yet, not supported yet | “Not implemented reached in virtual void...” - [Chromium-43196] |
| Documentation | Outdated Documentation | missing documentation, documentation doesn’t match, some holes in the doc, it’s not perfectly documented, need to update documentation, documentation does not mention, should be updated to reflect | “I am using this opportunity to fill in some holes in the doc...” - [Impala-991] |
| | Low-Quality Documentation | typo, confusing, simplify, misleading, too long | “Default searches documentation misleading about single-change search match behaviour in UI” - [Gerrit-8592] |
| Test | Lack of Tests | please add a test, there is no unit test, any chance of a test | “It looks good to me except these. Please add a test case for the code change...” - [Hadoop-12155] |
| | Low Coverage | good to have coverage, test doesn’t add much value | “this test doesn’t add much value, does it?” - [Gerrit-6524] |
| | Flaky Tests | flaky | |
| Camel | Chromium | Gerrit |
|---|---|---|
| leak | leak | confusing |
| typo | flaky | typo |
| confusing | slow | flaky |
| verbose | unnecessary | unused |
| deprecated | simplify | bad |
| dead | redundant | slow |
| slow | typo | truncated |
| unnecessary | truncated | unnecessarily |
| document this | ugly | not implemented yet |
| avoid | not implemented | leak |
| todo | unused | misleading |
| improve documentation | bad | documentation is wrong |
| complicated | confusing | coverage |
| remove ugly warnings | odd | complicated |
| thread safe | the short term | performance degradation |
| reuse | clean up code | documentation doesn’t |
| missing | too verbose | undocumented |
| rid of | expensive | ugly |
| improve exception message if failed | isn’t implemented | reword documentation |
| improve performance | too much | ambiguous |

| Hadoop | HBase | Impala | Thrift |
|---|---|---|---|
| unnecessary | flaky | flaky | unused |
| unused | unused | slow | leak |
| typo | nit | unnecessary | unnecessary |
| redundant | typo | coverage | typo |
| nit | leak | confusing | redundant |
| leak | ugly | simplify | confusing |
| slow | redundant | misleading | simplify |
| flaky | unnecessary | excessive | flaky |
| readability | confusing | overhead | coverage |
| clean up code | too much | avoid | thread-safe |
| complicated | bad | expensive | spurious |
| spurious | slow | improve error message | inconsistent |
| reuse | expensive | redundant | abstract |
| bad | misleading | rework | redundancy |
| ugly | avoid | thread-safe | ugly |
| not used | simplify | reduce duplicate code | outdated |
| cover | overhead | readability | missing |
| rid of | dead lock | difficult | performance regression |
| expensive | readability | wasted space | extra |
| thread-safe | rid of | verbose | unstable |
4.5 (RQ3) How Generic Is the Classification Approach Among Projects and Issue Tracking Systems?
| Project | Precision | Recall | F1-score |
|---|---|---|---|
| Camel | 0.719 | 0.647 | 0.681 |
| Hadoop | 0.618 | 0.761 | 0.682 |
| HBase | 0.651 | 0.648 | 0.649 |
| Impala | 0.697 | 0.721 | 0.709 |
| Thrift | 0.693 | 0.668 | 0.679 |
| Chromium | 0.659 | 0.556 | 0.603 |
| Gerrit | 0.481 | 0.671 | 0.561 |
| Avg. | 0.645 | 0.667 | 0.652 |
Trained on projects using the Google issue tracker:

| Issue tracker | Project (target) | Precision | Precision diff. | Recall | Recall diff. | F1-score | F1-score diff. |
|---|---|---|---|---|---|---|---|
| Jira | Camel | 0.553 | −23.0% | 0.538 | −16.8% | 0.545 | −19.9% |
| | Hadoop | 0.526 | −14.8% | 0.611 | −19.7% | 0.566 | −17.0% |
| | HBase | 0.506 | −22.2% | 0.610 | −5.8% | 0.553 | −14.7% |
| | Impala | 0.559 | −19.7% | 0.619 | −14.1% | 0.588 | −17.0% |
| | Thrift | 0.589 | −15.0% | 0.590 | −11.6% | 0.590 | −13.1% |
| | Avg. | 0.546 | −18.9% | 0.593 | −13.6% | 0.568 | −16.3% |

Trained on projects using the Jira issue tracker:

| Issue tracker | Project (target) | Precision | Precision diff. | Recall | Recall diff. | F1-score | F1-score diff. |
|---|---|---|---|---|---|---|---|
| Google | Chromium | 0.591 | −10.3% | 0.537 | −3.4% | 0.563 | −6.6% |
| | Gerrit | 0.488 | 1.4% | 0.569 | −15.2% | 0.526 | −6.2% |
| | Avg. | 0.539 | −4.4% | 0.553 | −9.3% | 0.544 | −6.4% |
4.6 (RQ4) How Much Data Is Needed for Training the Machine Learning Model to Accurately Identify Self-Admitted Technical Debt in Issues?
5 Discussion
5.1 Differences Between Identifying SATD in Source Code Comments and Issue Tracking Systems
| Source | Avg. length of sections/comments | # of sections/comments | Vocabulary size |
|---|---|---|---|
| Source code comments | 10.9 | 62,275 | 31,728 |
| Issue tracking systems | 35.4 | 23,180 | 37,202 |
“TODO: may not work on all OSes” - [Defect debt from JMeter code comments]

“I do not think it is a critical bug. Deferring it to 0.14.” - [Defect debt from Hadoop issues]

“...thank you for reporting the bug and contributing a patch.” - [Non-defect debt from Hadoop issues]

“TODO support multiple signers” - [Requirement debt from JMeter code comments]

“The backend (in master) has everything in place now to support this, but the frontend still needs to be adapted.” - [Requirement debt from Gerrit issues]

“It would be good to add a script to launch Hive using Ranger authorization.” - [Non-requirement debt from Impala issues]
5.2 Similarity Between SATD Keywords Extracted from Source Code Comments and Issue Tracking Systems
| Unique keyword (issue tracking systems) | Common keyword | Unique keyword (source code comments) |
|---|---|---|
| performance | why | todo |
| clean | improve | fixme |
| typo | leak | hack |
| remove | probably | should |
| flaky | perhaps | workaround |
| unused | better | defer argument checking |
| slow | instead | xxx |
| refactor | wrong | bug |
| warnings | missing | not needed |
| confusing | deprecated | implement |
5.3 Implications for Researchers and Practitioners
- Our work provides a deep learning approach to automatically identify SATD in issue tracking systems. The proposed approach enables researchers to automatically identify SATD within issues and to conduct large-scale studies on the measurement, prioritization, and repayment of SATD in issue tracking systems.
- To enable further research in this area, we make our issue SATD dataset publicly available. The dataset contains 23,180 issue sections, of which 3,277 are classified as SATD issue sections.
- We found that relatively small datasets can achieve decent accuracy in identifying SATD in both source code comments and issues. We thus recommend that researchers explore SATD in other sources (e.g., pull requests and commit messages) and contribute moderate-sized datasets for automatic SATD identification in those sources.
- Our findings suggest that there is some similarity between SATD in issues and in source code comments, but that this similarity is limited. We encourage researchers to study the differences between SATD in these and other sources, e.g., pull requests or commit messages. This could advance the understanding of SATD in the different sources.
- Although our study experimented with the generalizability of our approach across projects and issue tracking systems, its scope is still limited. We therefore recommend that researchers investigate the applicability of our approach to other projects (especially industrial projects) and other issue tracking systems. If possible, we advise them to make their datasets publicly available so they can be used for training new SATD detectors.
- Because of the high diversity of issues and the different forms SATD takes in them, SATD identification within issues is harder than in source code comments. However, further research can potentially improve the F1-score obtained in our study (e.g., by using other machine learning techniques or richer software engineering datasets for transfer learning).
- Our SATD identification approach can help software developers, and especially project managers, to evaluate the quality of their projects. For instance, project managers can use this tool to track SATD in issue tracking systems throughout a project's evolution. If the accumulated SATD reaches a threshold, more effort may need to be spent on paying it back.
- We recommend that tool developers incorporate our SATD identifier into their toolsets and dashboards and experiment with it in practice.
- We encourage practitioners to carefully study the SATD keywords listed in our results. This will help them to understand the nature of SATD in practice, to better formulate it themselves, and to recognize SATD stated by others.
- Our findings can help practitioners better understand the differences between SATD in different sources, e.g., defect debt is admitted differently in source code comments than in issue tracking systems. This can also help practitioners better identify SATD in the different sources.