1 Introduction
- They demand a huge amount of manpower.
- Their accuracy depends on the person who made the judgments.
2 Motivating Example
bash-4.2 from the result of CDTs on source code in the C language, and then asked the participants whether these code clones were true code clones, based on their own experience and motivation. Table 1 shows a part of the result. In this table, a code clone set is marked as O if a student thought the clone was a true code clone, or as X if not. We can see from this table that attitudes toward these clones varied from person to person.

| Clone ID | Participant Y | Participant S | Participant M | Participant U |
|---|---|---|---|---|
| 1 | X | O | O | O |
| 2 | X | X | O | O |
| 3 | O | X | X | O |
| 4 | X | O | X | O |
| 5 | X | O | X | O |
| … | … | … | … | … |
| O count | 5 | 24 | 23 | 25 |
| X count | 100 | 81 | 82 | 80 |
3 Fica System
3.1 Overall Workflow of Fica System
3.2 Classification Based on Token Sequence of Clones
- The similarity of token sequences was pointed out by Tiarks et al. (2009) as one of the best characteristics for filtering out unneeded clones, which are referred to as rejected candidates in that study.
- The types of the tokens in the token sequence are identical among all instances of a type-2 clone set.
- The required token sequence is obtainable as a by-product of a token-based CDT.
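The type-2 property in the second point — instances may differ in identifiers and literals yet share one token-type sequence — can be illustrated with a toy tokenizer. This is a minimal sketch, not Fica's actual CDT; the token-type names are hypothetical:

```python
import re

# Toy tokenizer mapping a C-like fragment to a sequence of token *types*.
# Identifiers and numeric literals are abstracted to ID and NUM, which is
# exactly the property all instances of a type-2 clone set share.
TOKEN_SPEC = [
    ("NUM", r"\d+"),
    ("ID",  r"[A-Za-z_]\w*"),
    ("OP",  r"[+\-*/=<>!]=?|[{}();,]"),
]

def token_types(code):
    types = []
    pos = 0
    while pos < len(code):
        if code[pos].isspace():
            pos += 1
            continue
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, code[pos:])
            if m:
                # Keep operators/punctuation literal; abstract IDs and numbers.
                types.append(m.group() if name == "OP" else name)
                pos += m.end()
                break
        else:
            pos += 1  # skip characters the toy grammar does not know

    return types

# Two fragments with different identifiers and literals...
a = token_types("x = y + 1;")
b = token_types("total = count + 42;")
# ...reduce to the same token-type sequence: ID = ID + NUM ;
```

Two code fragments whose `token_types` sequences are equal would belong to the same type-2 clone set under this abstraction.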
3.3 Marking Clones Manually
- Boolean clones are marked according to whether they are true code clones.
- Range clones are marked with a numerical likelihood of being true code clones.
- Tagged clones are marked with one of several tags or categories by the users based on their use, such as refactoring, an issue-tracking ID, etc. For example, clones marked with the tag refactoring are suitable candidates for refactoring, and clones marked with the tag buggy are clones that tend to have bugs.

3.4 Machine Learning
3.5 Cycle of Supervised Learning
4 Machine Learning Method
4.1 Input Format of Fica
A project consists of the original source code and the detected code clones represented as clone sets. The source code fragments are tokenized by the CDT. Because Fica needs both the tokenized source code for calculating similarity and the original source code for presenting back to the user, these two forms are passed to Fica together. A clone set is a set of identical code fragments; by identical, we mean that the tokenized code fragments in a clone set are literally equal to each other. We are targeting type-2 code clones; therefore, all the token types in a clone set are identical. A clone in a clone set is the tokenized code clone fragment in the original source file in the project. The list of token types in the clone set is equivalent to the types of tokens in every clone in the clone set. A token has a type and a position in the original source code. An example of the input of Fica is shown in Fig. 6.
4.2 Calculating Similarity between Clone Sets
We define a term t as an n-gram of a token sequence, a document d as a clone set, and the set of all documents D as a set of clone sets gathered from a single project or from several projects. We calculate the term frequency values of n-grams within a document (clone set) by Eq. 1. The term frequency of term t in document d is the normalized frequency with which term t appears in document d. The result of TF for the above clone set is as follows:
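The TF, IDF, and TF–IDF computation of this section can be sketched in a few lines. This is a minimal illustration, not Fica's implementation: it assumes 3-grams as terms and uses cosine similarity as one plausible vector-space measure between clone sets (the paper's exact similarity, Eq. 5, may differ), with invented token-type names.

```python
import math
from collections import Counter

def ngrams(tokens, n=3):
    """All n-grams (terms) of a token-type sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf(doc_terms):
    """Normalized frequency of each term within one document (clone set), Eq. 1."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    return {t: c / total for t, c in counts.items()}

def idf(all_docs):
    """log(|D| / (1 + df(t))); the +1 avoids division by zero, as in Eq. 2."""
    n_docs = len(all_docs)
    df = Counter(t for doc in all_docs for t in set(doc))
    return {t: math.log(n_docs / (1 + d)) for t, d in df.items()}

def tfidf(doc_terms, idf_map):
    """Combine TF and IDF into a sparse vector for one clone set (Eqs. 3-4)."""
    return {t: f * idf_map.get(t, 0.0) for t, f in tf(doc_terms).items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Three toy clone sets as token-type sequences (hypothetical token names).
a = "IF LPAREN ID EQEQ ID RPAREN LBRACE ID ASSIGN NUM SEMI RBRACE".split()
b = "IF LPAREN ID EQEQ NUM RPAREN LBRACE ID ASSIGN ID SEMI RBRACE".split()
c = "WHILE LPAREN ID LT NUM RPAREN LBRACE ID PLUSPLUS SEMI RBRACE".split()

docs = [ngrams(a), ngrams(b), ngrams(c)]
idf_map = idf(docs)
sim = cosine(tfidf(docs[0], idf_map), tfidf(docs[1], idf_map))
```

Each clone set becomes a sparse vector keyed by n-gram, and pairwise similarity between clone sets is read off from the vectors.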
The inverse document frequency IDF and TF–IDF are calculated by Eqs. 2 and 3. The inverse document frequency of term t in the documents D is the logarithm of the total number of documents divided by the number of documents containing term t. In Eq. 2, we add 1 to the denominator to avoid division by zero. By combining TF and IDF as in Eq. 3, we can then calculate the vector space \( \overrightarrow{\mathrm{TF}-\mathrm{IDF}\left(d,D\right)} \) as in Eq. 4 for each clone set in the overall documents.

4.3 User Profile and Marks on Clone Sets
5 Experiments
5.1 Implementation Details
5.2 Experimental Setup
.c files and ignored all other files as unnecessary, because all four projects were developed in the C language. As parameters for the CDT part of the Fica system, we detected only clone sets containing more than 48 tokens, and reported only clone sets whose fragments come from different files. These two parameters were chosen based on experience: a length threshold lower than 48 tokens for C projects is likely to produce many small false positives, as is reporting clones from within the same source file. Although too many clone sets would be an obstacle to conducting this experiment, those small false positives should also be detectable by Fica. CCFinder, one of the well-known CDTs proposed by Kamiya et al. (2002), uses a length of 30 tokens as its default parameter. This length is widely accepted in both industrial and academic studies, but its definition of a token is more compact. For example, CCFinder treats an IF followed by a LEFT_BRACE as one compressed token, whereas our CDT considers them two tokens. In our experience, a length limit of 48 tokens generates almost the same number of code clones as CCFinder.

| Project and version | # of clone sets | LOC | Tokens | Files | # of participants |
|---|---|---|---|---|---|
| git v1.7.9-rc1 | 78 | 153,388 | 829,930 | 315 | 32 |
| xz 5.0.3 | 36 | 25,873 | 113,894 | 113 | 27 |
| bash 4.2 | 105 | 133,547 | 494,248 | 248 | 25 |
| e2fsprogs 1.41.14 | 62 | 99,129 | 442,978 | 274 | 25 |
| Total | 281 | 411,937 | 1,881,050 | 950 | 32 |
5.3 Code Clone Similarity of Classification Labels by Users
d3.js JavaScript library by Bostock (2012) to generate this FDG. Some other technical details about this graph are listed in Table 3.

| Parameter | Value |
|---|---|
| force.size | 800 |
| force.charge | 100 |
| link.linkDistance | 1/link.value |
| link.linkStrength | link.value/16 + 0.2 |
| link.stroke-width | \( 10\cdot \sqrt{link.value} \) |
| node.r | 13 ⋅ major/(minor/1.5 + major) |
| node.stroke-width | 8 ⋅ minor/(minor/1.5 + major) |
| Project \ Cluster ID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| git | 0.78 | 0.46 | 0.21 | 0.36 | 0.73 | 0.78 | 0.9 | 0.55 |
| xz | 0.64 | 0.46 | 0.37 | 0.66 | 0.58 | 0.77 | 0.45 | 0.0 |
| bash | 0.79 | 0.18 | 0.19 | 0.87 | 0.25 | 0.06 | 0.9 | 0.14 |
| e2fsprogs | 0.76 | 0.74 | 0.42 | 0.33 | 0.71 | 0.76 | 0.79 | 0.79 |
In the git project, the cluster with ID 2 has an average normalized distance of 0.21; this means that its false clone sets lie closer together than those of the cluster with ID 6, whose average normalized distance is 0.9. This result indicates that the probability of false clone sets calculated by Eq. 7 is more accurate than the probability of true clone sets, because false clones are grouped into categories more tightly.

- Homogeneity scores high when each cluster contains only members of a single class.
- Completeness scores high when all members of a given class are assigned to the same cluster.

switch-case statements, or a group of assignments. Meanwhile, the true clones are harder to classify into similar categories. In general, true code clones usually contain rare sequences of tokens. Thus we have Observations 1 and 2, regardless of the subjective judgment of the participants:

- Observation 1 (Categories of False Clones): False code clones are more likely to fall into several categories of meta-clones when compared by the literal similarity of their token types.
- Observation 2 (Diversity of True Clones): True code clones are composed of more diverse token sequences than false clones when compared by literal similarity.
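The homogeneity and completeness scores used above can be computed from conditional entropies: homogeneity is 1 − H(C|K)/H(C) and completeness is 1 − H(K|C)/H(K), where C are the classes and K the clusters. A self-contained sketch with toy labels (not the paper's data):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def cond_entropy(xs, ys):
    """Conditional entropy H(X | Y)."""
    n = len(xs)
    by_y = {}
    for x, y in zip(xs, ys):
        by_y.setdefault(y, []).append(x)
    return sum((len(group) / n) * entropy(group) for group in by_y.values())

def homogeneity(classes, clusters):
    """1 when every cluster contains only members of a single class."""
    hc = entropy(classes)
    return 1.0 if hc == 0 else 1.0 - cond_entropy(classes, clusters) / hc

def completeness(classes, clusters):
    """1 when all members of a given class land in the same cluster."""
    hk = entropy(clusters)
    return 1.0 if hk == 0 else 1.0 - cond_entropy(clusters, classes) / hk

# Each cluster is pure (homogeneity = 1), but class 0 is split across
# clusters 0 and 1, so completeness < 1.
classes  = [0, 0, 1, 1]
clusters = [0, 1, 2, 2]
h = homogeneity(classes, clusters)   # 1.0
c = completeness(classes, clusters)  # 2/3
```

These definitions match the intuitive descriptions in the two bullets: splitting a class across clusters lowers completeness only, while mixing classes inside one cluster lowers homogeneity only.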
5.4 Ranking Clone Sets
5.5 Predictions for Each Project
The bash project shows the most desirable result: most of the prediction accuracies are over 80 % when the training set is larger than 16 %, which is approximately 17 of all 105 clone sets. The results of the git project and the e2fsprogs project largely depend on the user; for example, the results for user H always reach more than 90 %, while the results for users A and C converge to approximately 60 % and even decrease as the training set grows. The reason why the results did not converge to 100 % and the accuracy dropped is discussed in Section 5.8.

- Observation 3 (Minimum Size of Training Set): The minimum required size of the training set grows roughly linearly with the number of categories that the clone sets fall into, which is less than the total number of detected clone sets.

The predictions for user R and user V are always low, which means that less literal similarity is found among the clone sets they marked as true clones. In contrast, the prediction for user H is high for all projects. This result shows the consistency of individual user behavior.
5.6 Cross Project Evaluation
bash project. For example, e2fsprogs/bash is a desirable result and xz/git is an undesirable result. We expect this kind of variation in the results, because each project has its own characteristics for code clones. In practice, a user of the Fica system is expected to cross-analyze similar projects when similar code clone categories are found in both projects.
5.7 Recall and Precision of Fica
The git and bash projects show good results, while the results for xz and e2fsprogs are not as good. These recall and precision charts show a trend similar to the accuracy graph in Fig. 13; that is, when a project has well-clustered clone categories, the result is better.
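Recall and precision over clone-set predictions reduce to simple set arithmetic. A minimal sketch with hypothetical clone-set IDs and marks (not the experiment's actual data):

```python
def precision_recall(predicted, actual):
    """Precision and recall of predicted 'true clone' sets against user marks.

    `predicted` and `actual` are sets of clone-set IDs marked as true clones;
    the IDs here are invented for illustration.
    """
    tp = len(predicted & actual)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall

pred  = {1, 2, 3, 5}          # clone sets the classifier calls true clones
truth = {1, 2, 4, 5, 6}       # clone sets the user marked as true clones
p, r = precision_recall(pred, truth)  # tp = 3, so p = 3/4 and r = 3/5
```

Precision penalizes predicting false clones as true; recall penalizes missing clone sets the user considers true.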
5.8 Reason for Converging Results
xz project in Fig. 17. These three code fragments form two code clone pairs. The code from Fig. 17a in lines 130 to 142 and the code from Fig. 17b are the fragments of the first clone pair, referred to as clone α. The code from Fig. 17a in lines 136 to 145 and the code from Fig. 17c are the fragments of the second clone pair, referred to as clone β. As we can see from the source code, each instance of clone α consists of a complete function body, whereas clone β consists of two partial function bodies. Thus, from the viewpoint of ease of refactoring, clone α is much easier to refactor than clone β. On the other hand, the similarity between clones α and β calculated by Eq. 5 is greater than 43 %, a very high value among all the similarities between clone sets, because these two clones share a large amount of identical code. As a result, seven out of eight users judged clone pair α a true clone and six out of eight users judged clone pair β a false clone, while Fica always placed them in the same group.
e2fsprogs project, shown in Fig. 18. Two code clone pairs are found in Fig. 18. The code from Fig. 18a and that from Fig. 18b form the first code clone pair, referred to as clone γ, and the code from Figs. 18c and d forms the second, referred to as clone δ. Users can tell these two clone pairs are different because clone γ consists of two function bodies and can be merged into one function, while clone δ consists of two lists of function declarations. However, the calculated similarity between clones γ and δ is greater than 6.4 %, which is large enough to affect the overall result. The reason Fica regarded them as similar is that they share many common n-grams, such as STATIC ID ID LPAREN CONST or CONST ID TIMES ID COMMA, which results in a high similarity value.
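The shared-n-gram effect behind this misclassification can be reproduced in a few lines. This toy illustration uses invented token sequences loosely shaped like clones γ and δ (a function body versus a run of declarations, both starting with STATIC ID ID LPAREN CONST), and simple Jaccard overlap of 5-gram sets stands in for the paper's Eq. 5 similarity:

```python
def ngram_set(tokens, n=5):
    """The set of n-grams occurring in a token-type sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Invented token-type sequences: 'gamma' resembles a function definition,
# 'delta' resembles two function declarations with similar parameter lists.
gamma = ("STATIC ID ID LPAREN CONST ID TIMES ID COMMA "
         "CONST ID TIMES ID RPAREN LBRACE ID SEMI RBRACE").split()
delta = ("STATIC ID ID LPAREN CONST ID TIMES ID COMMA INT ID RPAREN SEMI "
         "STATIC ID ID LPAREN CONST ID TIMES ID COMMA CONST ID RPAREN SEMI").split()

g, d = ngram_set(gamma), ngram_set(delta)
shared = g & d                       # n-grams common to both sequences
jaccard = len(shared) / len(g | d)   # crude stand-in for Eq. 5 similarity
```

Even though the two sequences describe structurally different code, the repeated declaration prefixes alone produce a nonzero similarity, which is exactly how literally different clone sets end up grouped together.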