1 Introduction
other object are added onto the member variables representing the coordinates x, y, z. However, the last line in this block of three similar lines contains an error, as it adds the y coordinate onto the z coordinate. Instead, the last line should be z += other.z.
host does not equal the empty string; in the last position, port_str should have been compared instead.
- RQ 1 Is the last line in a multi-line micro-clone more likely to contain an error?
- RQ 2 Is the last statement in a single-line micro-clone more likely to contain an error?
- RQ 3 What are the reasons for the existence of faulty micro-clones and the last line effect in particular?

- We define and introduce the term micro-clone.
- We introduce techniques for the detection of faulty micro-clones, implemented in the automated static analysis tool (ASAT, Beller 2016) PVS-Studio. Such micro-clones cannot be found with traditional clone detection.
- We manually investigate the error proneness of 263 micro-clones in 219 well-known open-source systems (OSS), based on a total of 1,891 warnings.
- We provide an initial analysis of the underlying psychological mechanisms behind the existence of the last line effect.
- We conduct six interviews with authors of micro-clones in real-world systems.
- We conduct a repository analysis on four well-known OSS projects based on the results of the interviews, investigating abnormally large commit sizes.
2 Study Setup
2.1 Study Design C1: Spread and Prevalence of the Last Line Effect within Micro-Clones
2.2 Study Design C2: Analyzing Reasons Behind the Existence of the Last Line Effect
ag) in every commit of the project’s history.
git blame, to ensure we receive both the refactorings that were applied to it and its true original author.
git blame -e to obtain the developers’ email addresses to contact them. In an attempt to maximize the response rate, we also perform a web search to acquire additional personal information about the developers and to verify that the contact email addresses are still current. We also make clear that we will not disclose their identity, to incentivize honest answers. We then send short personalized emails containing the micro-clone they authored, how it was later modified or fixed, the context of the bug, why we conduct the investigation, and a set of questions on the micro-clone at hand.
2.3 Study Objects
2.4 How to Replicate This Study
findings_old/ and the newer data added for this paper findings_new/. Moreover, it contains the analyzed and categorized micro-clones (analyzed_data.csv) together with an evaluation spreadsheet (evaluation.ods) and the results from the repository analysis of C1 and C2. We also provide the R scripts to replicate the results and graphs in this paper. Finally, we share a draft of the questions we sent to developers.
3 Methods
3.1 Why Current Clone Detectors are not Suitable
3.2 How We Found Faulty Micro-Clones Instead
| PVS Error Code | Description | #within-line clones | #multi-line clones | Σ#/All |
|---|---|---|---|---|
| V501 | There are identical sub-expressions to the left and to the right of the foo operator. | 104 | 108 | 212/217 |
| V517 | The use of if (A) {...} else if (A) {...} pattern was detected. There is a probability of logical error presence. | 0 | 8 | 8/58 |
| V519 | The x variable is assigned values twice successively. Perhaps this is a mistake. | 0 | 23 | 23/117 |
| V523 | The then statement is equivalent to the else statement. | 0 | 5 | 5/47 |
| V524 | It is odd that the body of Foo_1 function is fully equivalent to the body of Foo_2. | 0 | 3 | 3/13 |
| V525 | The code containing the collection of similar blocks. Check items X, Y, Z, ... in lines N1, N2, N3, ... | 1 | 1 | 2/11 |
| V537 | Consider reviewing the correctness of X item’s usage. | 0 | 8 | 8/10 |
| V570 | The variable is assigned to itself. | 1 | 1 | 2/17 |
| V571 | Recurring check. This condition was already verified in previous line. | 0 | 2 | 2/17 |
| V581 | The conditional expressions of the if operators situated alongside each other are identical. | 0 | 2 | 2/13 |
| V583 | The ?: operator, regardless of its conditional expression, always returns one and the same value. | 0 | 1 | 1/7 |
| V656 | Variables are initialized through the call to the same function. It’s probably an error or un-optimized code. | 0 | 4 | 4/8 |
| Σ | | 106 | 166 | 272/535 |
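To make the first row of the table more concrete, the following is a minimal sketch of a V501-style check, written in Python against Python expressions with the standard ast module. It is not PVS-Studio's implementation, which targets C/C++ and whose internals are not described here. The function name find_identical_operands and the sample expressions are hypothetical.

```python
import ast


def find_identical_operands(source: str) -> list[str]:
    """Report expressions whose operands around an operator are identical.

    Illustrative analog of a V501-style warning; NOT PVS-Studio's algorithm.
    """
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BoolOp):
            # `a and b and a`: compare structural dumps of all operands.
            operands = [ast.dump(value) for value in node.values]
            if len(set(operands)) < len(operands):
                findings.append(ast.unparse(node))
        elif isinstance(node, ast.Compare) and len(node.comparators) == 1:
            # `a == a` or `a != a`: left side equals right side.
            if ast.dump(node.left) == ast.dump(node.comparators[0]):
                findings.append(ast.unparse(node))
        elif isinstance(node, ast.BinOp) and isinstance(node.op, ast.Sub):
            # `a - a`: an expression subtracted from itself.
            if ast.dump(node.left) == ast.dump(node.right):
                findings.append(ast.unparse(node))
    return findings


# Hypothetical Python transliteration of the host/port_str comparison.
faulty = 'ok = host != "" and host != ""'
fixed = 'ok = host != "" and port_str != ""'
print(find_identical_operands(faulty))  # one finding: the duplicated comparison
print(find_identical_operands(fixed))   # no findings
```

Comparing structural dumps rather than raw text keeps the check insensitive to whitespace, which matters because micro-clones are often reformatted after pasting.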
3.3 How We Inferred the Origin of an Erroneous Micro-Clone Instance
x, y, z in Example 1 or cardinally:
port_str in the first place and host in the second place in line 3. Hence, we assume that the first instance of the micro-clone host != buzz::STR_EMPTY is the influencing origin and the second instance is the destination.
cx().isRelative in line 1, instead of cy().isRelative, which seems to be influenced by the second line. Natural order, as well as lines 3 and 4, suggest that the micro-clone should start with return cx().isRelative() in line 1 instead.
3.4 How We Put Commit Sizes in Perspective
git log to build a sequenced graph of all commits (excluding merges) in the repository, extracting the number of added and deleted lines in each commit, summing them up as the modified lines and outputting this churn integer for each commit. We then compare the churn of the micro-clone inducing commits to the overall distribution, and specifically to its median. Although our sample size of ten is too small for statistical testing, this allows us to make substantiated statements about a possible size difference between commits. We use the median (and not the arithmetic mean, for example) because our distributions are non-normal, the median is a single real value, and we compare other, singular observations to it.
4 Results
4.1 General Description of Results
| | ... with all findings | ... with faulty micro-clones | ... with last line/stmt. bug (rel. to all, rel. to faulty) |
|---|---|---|---|
| Analysis time | June 2011 to July 2015 | | |
| Analysis software | PVS-Studio versions 4.00 to 5.27 | | |
| # of analyses | 162 | 12 (7 %) | 12 (7 %, 100 %) |
| # of projects | 219 | 106 (49 %) | 97 (45 %, 92 %) |
| # of warnings | 1,891 | 272 (14 %) | 228 (12 %, 84 %) |
| # of unique clones | – | 263 (–) | 228 (–, 87 %) |

| | Multi-line micro-clone | One-line micro-clone |
|---|---|---|
| #errors in last line/stmt. | 117 (74 %) | 95 (90 %) |
| #errors not in last line/stmt. | 41 (26 %) | 10 (10 %) |
| effect size (odds ratio) | 2.9 | 9.5 |
| Σ | 158 (100 %) | 105 (100 %) |
| ΣΣ | 263 | |
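As a quick arithmetic check, the percentages and effect sizes in the table above can be recomputed from the raw counts. The sketch below assumes that the effect size is the ratio of errors in the last line or statement to errors elsewhere. This reading reproduces the reported 2.9 and 9.5, but the formula is an assumption, not one stated in the table. The helper name last_line_stats is hypothetical.

```python
def last_line_stats(last: int, not_last: int) -> tuple[float, float]:
    """Return (percentage of errors in the last position, odds last vs. not last)."""
    total = last + not_last
    share = 100 * last / total
    odds = last / not_last  # assumed definition of the reported effect size
    return share, odds


# Raw counts taken from the table above.
multi_share, multi_odds = last_line_stats(117, 41)
single_share, single_odds = last_line_stats(95, 10)

print(f"multi-line: {multi_share:.0f} % in last line, odds {multi_odds:.1f}")
print(f"one-line:   {single_share:.0f} % in last statement, odds {single_odds:.1f}")
```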
4.2 In-Depth Investigation of Findings
4.2.1 V501 – Identical Sub-Expressions
||), thus representing a tautology. Instead, the Boolean expression fails to take into account the surname (NAME_LAST), an example of the last statement effect in this tricolon.
4.2.2 V517 – Identical if-Conditions
if-statements.
else if condition following the third micro-clone on line 9 is dead code, as it can never be reached. If slot were indeed zero, execution would already enter the first if condition’s body.
4.2.3 V519 – Identical Assignment to Variable
m_ucRed is assigned twice, but the developers forgot to set m_ucBlue.
f->fmt.vbi.samples_per_line again, even though it has just been set in line 1. Since no other method calls have been made in the further control flow of this method, the assignment in line 1 seems to have no effect. However, as the assignment is active for at least one CPU cycle, there might be threads that read its value in the meantime (for example, watchdogs on the buffer state) or there might be other intended side-effects. To be on the conservative side, we compiled the code with release settings, and if the compiler optimized the first assignment away, we were sure that it was indeed an error.
4.2.4 V523 – Equivalent Behavior Despite Branching
if-conditions, these could be simplified by collapsing them into one block, for example in Haiku:
mpa_size should have been set to a different value in the else-branch. The code context of this micro-clone seems highly suspicious, as it mentions in line 3 that “[t]his compression stuff is all wrong,” and the detected erroneous micro-clone fits this comment.
4.2.5 V524 – Equivalent Function Bodies
PerPtrBottomUp.clear(). This also serves as one rare example of a two-instance micro-clone where the origin succeeds the target (\(\delta_{E_{10}} = -1\)).
4.2.6 V537 – Suspicious Use of Variable or Statement
rectf.X:
rectf.X in the second (i.e., last) line of this micro-clone.
4.2.7 V656 – Two Variables Bear Identical Value
maSelection.Max() is assigned not the maximum value of aSelection, but its minimum, clearly representing an error.
4.2.8 Counterexample
data_[M02] from itself. However, they meant to write:
4.3 Statistical Evaluation
4.4 Origin of Micro-Clones
- RQ 3 What are the reasons for the existence of faulty micro-clones and the last line effect in particular?
4.5 Developer Interviews
| Project | Sampled Commit | Local Commit Date | Commit Churn | Median Churn | #Commits | Replies |
|---|---|---|---|---|---|---|
| Chromium | 2db5310 | 2010-09-30 20:53 | 123 | 43 | 639,564 | 4/4 |
| | 6b7fcb4 | 2011-02-23 05:57 | 1,220 | | | |
| | (7b37fbb) | (2011-03-07 16:16) | (1,635) | | | |
| | 47fcb0e | 2012-10-24 3:52 | 1,627 | | | |
| LibreOffice | b90bc10 | 2008-08-19 22:06 | 103,083 | 18 | 438,994 | 0/2 |
| | 44cfc7c (rebase) | 2012-10-09 12:22 | 470 | | | |
| Samba | 781ed1f | 2005-12-09 05:21 | 45 | 16 | 241,276 | 1/1 |
| Mesa 3D | 0ff3b2b | 2010-07-26 23:56 | 108 | 21 | 99,115 | 1/2 |
| | 45124e0 | 2010-12-07 21:37 | 251 | | | |
| libjingle | 562554d | 2010-09-30 20:53 | 110,184 | 212 | 341 | 1/1 |
| Σ | 10 | | | | | 7/10 (6 authors) |
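The Commit Churn and Median Churn columns follow the procedure of Section 3.4: added plus deleted lines per non-merge commit, compared against the repository-wide median. The following is a minimal sketch of that computation, not the scripts used for the study. It assumes the log text was produced with git log --no-merges --numstat --format="commit %H"; the paper only names git log, so these exact options are an assumption. The commit names and numbers in SAMPLE_LOG are made up.

```python
from statistics import median


def churn_per_commit(log_text: str) -> dict[str, int]:
    """Sum added and deleted lines for every commit in numstat-style log text."""
    churn: dict[str, int] = {}
    current = None
    for line in log_text.splitlines():
        if line.startswith("commit "):
            current = line.split()[1]
            churn[current] = 0
        elif current is not None and line.strip():
            parts = line.split("\t")
            # Binary files appear as "-<TAB>-<TAB>path" and carry no line counts.
            if len(parts) >= 3 and parts[0].isdigit() and parts[1].isdigit():
                churn[current] += int(parts[0]) + int(parts[1])
    return churn


# Made-up sample in the assumed output format (hypothetical commits and files).
SAMPLE_LOG = (
    "commit aaa111\n\n10\t2\tsrc/a.cc\n3\t0\tsrc/b.cc\n"
    "commit bbb222\n\n-\t-\tlogo.png\n1\t1\tREADME\n"
    "commit ccc333\n\n100\t50\tsrc/c.cc\n"
)

churn = churn_per_commit(SAMPLE_LOG)
median_churn = median(churn.values())
for commit, size in churn.items():
    print(commit, size, f"{size / median_churn:.1f}x median")
```

Reporting each sampled commit as a multiple of the median makes outliers easy to spot without assuming a normal distribution of commit sizes.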
7b37fbb, the interviewee I1 told us that he merely refactored and did not author this piece of code originally (hence we report six interviews with authors in Table 7). He forwarded us to the real author of the code, whom we also interviewed (6b7fcb4).
!has_mic when he should have typed !has_audio instead. In his experience, this happens a lot when working with code in which one types the same words repetitively. He observed that “I was not under any major stress at the time,” but that “I will note that when working with very large changes it is much easier for something like this to be missed.” He added that the real error was not having a unit test that covers this line and that the reviewer missed the absurdity of the pattern !a && !a, too.
field.type == trans("string") || and copy-and-pastes it several times, ending up in a sequence like:
4.6 Usefulness of Results
pvs-studio bug | issue shows numerous bug fixes in Firefox, libxml, MySQL, Clang, Samba and many other projects based on our findings. As one such example, on October 11th, 2016, in commit caff670, we fixed a micro-clone-related issue that had existed in Samba since 2005.
5 Discussion
5.1 Technical Complexity & Reasons
5.2 Psychological Mechanisms & Reasons
y into z) but the second editing sub-step was not performed, thus producing the error. In principle, line 1 could have been copied twice with the editing steps having been performed on the two lines. However, the presence of a y rather than x in line 3 suggests that line 2 was copied. Section 4.4 shows that in most cases of micro-clones with more than two lines, the previous line was copied. This suggests that in such micro-clones, the sequence of actions was as follows: “[copy, modify, modify], [copy, modify, modify], ...”
5.3 Threats to Validity
5.3.1 Internal Threats
host first and then port_str) imposes a natural logical order for the remainder of the program (first check host, then port_str in line 3). In order to reduce personal bias, we also split the list of findings to triage between the first two authors, and then discussed unclear cases. If we could not reach agreement, we discarded said finding. In this process, we also re-classified all 202 original findings (Beller et al. 2015) and found almost total agreement with our previous assessment. Since flagging erroneous lines is a well-defined task under these circumstances, we are confident that inter-rater reliability is high, which supports the repeatability of our study.
ctrl+c, ctrl+v” was used, 2) in which order micro-clones are created, and 3) in which order micro-clones are read and changed, if developers need to modify them during maintenance. To obtain such information, we would need to study how developers work in-vivo, similar to the WatchDog plugin (Beller et al. 2015, 2015, 2016). To that end, we could reuse parts of CloneBoard, which captures all cut, copy and paste actions in Eclipse (de Wit et al. 2009).