1 Introduction
1.1 Motivation
“When source code is copied and modified, which code similarity detection techniques or tools get the most accurate results?”
1.2 Contributions
2 Background
2.1 Source Code Modifications
2.2 Code Similarity Measurement
2.3 Obfuscation and Deobfuscation
2.4 Program Decompilation
3 Empirical Study
3.1 Experimental Framework
3.2 Tools and Techniques
3.2.1 Obfuscators
Code modifications | Artifice | ProGuard | (De)compilers |
---|---|---|---|
Lexical modification
| |||
✓ | ✓ | ||
✓ | ✓ | ||
✓ | ✓ | ✓ | |
Modification of constant values (Duric and Gasevic 2013) | ✓ | ||
Structural modification
| |||
Split or merge of variable declarations (Duric and Gasevic 2013) | ✓ | ✓ | |
✓ | ✓ | ||
Line insertion/deletion with further edits (Roy and Cordy 2009) | ✓ | ✓ | |
✓ | ✓ | ✓ | |
✓ | ✓ | ||
Changing of data types and modification of data structures (Duric and Gasevic 2013) | ✓ | ||
✓ | |||
✓ |
3.2.2 Compiler and Decompilers
3.2.3 Plagiarism Detectors
3.2.4 Clone Detectors
3.2.5 Compression Tools
3.2.6 Other Techniques
Tool/Technique | Similarity calculation |
---|---|
Clone Det. | |
ccfx (Kamiya et al., 2002) | tokens and suffix tree matching |
deckard (Jiang et al., 2007b) | characteristic vectors of AST optimised by LSH |
iclones (Göde and Koschke 2009) | tokens and generalised suffix tree |
nicad (Roy and Cordy 2008) | TXL and string comparison (LCS) |
simian (Harris 2015) | line-based string comparison |
Plagiarism Det. | |
jplag-java (Prechelt et al., 2002) | tokens, Karp Rabin matching, Greedy String Tiling |
jplag-text (Prechelt et al., 2002) | tokens, Karp Rabin matching, Greedy String Tiling |
plaggie (Ahtiainen et al., 2006) | N/A (not disclosed) |
sherlock (Pike R and Loki 2002) | digital signatures |
simjava (Gitchell and Tran 1999) | tokens and string alignment |
simtext (Gitchell and Tran 1999) | tokens and string alignment |
Compression | |
7zncd | NCD with 7z |
bzip2ncd | NCD with bzip2 |
gzipncd | NCD with gzip |
xzncd | NCD with xz |
icd | Equation 4 |
ncd (Cilibrasi et al., 2015) | ncd tool with bzlib & zlib |
Others | |
bsdiff | Equation 5 |
diff | Equation 5 |
difflib (Python Software Foundation 2016) | Gestalt pattern matching |
fuzzywuzzy (Cohen 2011) | fuzzy string matching |
jellyfish (Turk and Stephens 2016) | approximate and phonetic matching of strings |
ngram (Poulter 2012) | fuzzy search based on n-grams |
cosine (Pedregosa et al., 2011) | cosine similarity from machine learning library |
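
As a rough illustration of the compression-based entries in the table above, the following minimal sketch computes a normalised compression distance (NCD) in Python. It assumes the commonly used formulation NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The function name `ncd`, the sample inputs, and the use of Python's `zlib` and `bz2` modules in place of the command-line compressors are assumptions made for illustration, not the study's tooling.

```python
import bz2
import zlib


def ncd(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """Normalised compression distance between two byte strings.

    `compress` is any function mapping bytes to compressed bytes;
    zlib is used by default, bz2.compress is a drop-in alternative.
    """
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))  # compressed length of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical inputs: a small Java-like fragment and unrelated text.
original = b"""
for (int i = 0; i < n - 1; i++) {
    for (int j = 0; j < n - i - 1; j++) {
        if (a[j] > a[j + 1]) { int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t; }
    }
}
"""
unrelated = b"The quick brown fox jumps over the lazy dog near the quiet river bank at dawn."

print("same file (zlib):      ", ncd(original, original))
print("unrelated text (zlib): ", ncd(original, unrelated))
print("unrelated text (bzip2):", ncd(original, unrelated, compress=bz2.compress))
```

Values near 0 suggest that one input adds little new information to the other, while values near 1 suggest largely unrelated inputs. Swapping the `compress` argument mirrors, in spirit, how the compression-based variants in the table differ only in the compressor used.
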
Tool | Settings | Details | Default (DF) | Range |
---|---|---|---|---|
Clone det. | | | | |
ccfx | b | min no. of tokens | 50 | 3 4 5 10 15 16 17 18 19 20 21 22 23 24 25 30 35 40 45 50 |
ccfx | t | min token kinds | 12 | 1 2 3 .. 14 |
deckard | mintoken | min no. of tokens | 50 | 30, 50 |
deckard | stride | sliding window size | inf | 0, 1, 2, inf |
deckard | similarity | clone similarity | 1.0 | 0.90, 0.95, 1.00 |
iclones | minblock | min token length | 20 | 8 10 20 30 40 50 |
iclones | minclone | min no. of tokens | 100 | 50 60 .. 140 150 |
nicad | UPI | % of unique code | 0.30 | 0.30, 0.50 |
nicad | minline | min no. of lines | 10 | 5, 8, 10 |
nicad | rename | variable renaming | none | blind, consistent |
nicad | abstract | code abstraction | none | none, declaration, statement, expression, condition, literal |
simian | threshold | min no. of lines | 6 | 3 4 5 .. 10 |
simian | options | other options | none | none, ignoreCharacters, ignoreIdentifiers, ignoreLiterals, ignoreVariableNames |
Plagiarism det. | | | | |
jplag-java | t | min no. of tokens | 9 | 1 2 3 .. 12 |
jplag-text | t | min no. of tokens | 9 | 1 2 3 .. 12 |
plaggie | M | min no. of tokens | 11 | 1 2 3 .. 14 |
sherlock | N | chain length | 4 | 1 2 3 .. 8 |
sherlock | Z | zero bits | 3 | 0 1 2 .. 8 |
simjava | r | min run size | N/A | 10 11 12 .. 24 |
simtext | r | min run size | N/A | 4 5 6 .. 12 |
Compression | | | | |
7zncd-BZip2 | mx | compression level | N/A | 1 3 5 7 9 |
7zncd-Deflate | mx | compression level | N/A | 1 3 5 7 9 |
7zncd-Deflate64 | mx | compression level | N/A | 1 3 5 7 9 |
7zncd-LZMA | mx | compression level | N/A | 1 3 5 7 9 |
7zncd-LZMA2 | mx | compression level | N/A | 1 3 5 7 9 |
7zncd-PPMd | mx | compression level | N/A | 1 3 5 7 9 |
bzip2ncd | C | block size | N/A | 1 2 3 .. 9 |
gzipncd | C | compression speed | N/A | 1 2 3 .. 9 |
icd | ma | compression algo. | N/A | BZip2, Deflate, Deflate64, LZMA, LZMA2, PPMd |
icd | mx | compression level | N/A | 1 3 5 7 9 |
ncd-zlib | N/A | | | |
ncd-bzlib | N/A | | | |
xzncd | -N | compression level | 6 | 1 2 3 .. 9, e |
Others | | | | |
bsdiff | N/A | | | |
diff | N/A | | | |
difflib | autojunk | auto. junk heuristic | N/A | true, false |
difflib | whitespace | ignoring white space | N/A | true, false |
fuzzywuzzy | similarity | similarity calculation | N/A | ratio, partial_ratio, token_sort_ratio, token_set_ratio |
jellyfish | distance | edit distance algo. | N/A | jaro_distance, jaro_winkler |
ngram | N/A | | | |
cosine | N/A |
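
To make the two difflib settings in the table above more concrete, here is a minimal sketch of how a similarity score could be obtained from Python's `difflib.SequenceMatcher` with the `autojunk` heuristic switched on or off. The helper name `difflib_similarity` and the way white space is ignored (by removing it before comparison) are assumptions for illustration; the study's exact preprocessing is not described in this section.

```python
from difflib import SequenceMatcher


def difflib_similarity(a: str, b: str, autojunk: bool = True,
                       ignore_whitespace: bool = False) -> float:
    """Return a similarity ratio in [0, 1] between two source strings."""
    if ignore_whitespace:
        # Assumed interpretation of the 'whitespace' option:
        # drop all white space before matching.
        a = "".join(a.split())
        b = "".join(b.split())
    matcher = SequenceMatcher(None, a, b, autojunk=autojunk)
    return matcher.ratio()


# Hypothetical snippets that differ only in spacing.
left = "int  x = 1;"
right = "int x=1;"

print(difflib_similarity(left, right))
print(difflib_similarity(left, right, ignore_whitespace=True))
print(difflib_similarity("int x = 1;", "int y = 1;", autojunk=False))
```

Sweeping `autojunk` and `ignore_whitespace` over `true` and `false` gives the four combinations listed in the Range column for difflib.
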
4 Experimental Scenarios
4.1 Scenario 1 (Pervasive Modifications)
4.1.1 Preparation, Transformation, and Normalisation
No. | File | SLOC | Description |
---|---|---|---|
1 | BubbleSort.javaᵃ | 39 | Bubble Sort implementation |
2 | EightQueens.javaᵇ | 65 | Solution to the Eight Queens problem |
3 | GuessWord.javaᵃ | 115 | A word guessing game |
4 | TowerOfHanoi.javaᵃ | 141 | The Tower of Hanoi game |
5 | InfixConverter.javaᵃ | 95 | Infix to postfix conversion |
6 | Kapreka_Transformation.javaᵃ | 111 | Kaprekar transformation of a number |
7 | MagicSquare.javaᵇ | 121 | Generating a Magic Square of size n |
8 | RailRoadCar.javaᵃ | 71 | Rearranging rail road cars |
9 | SLinkedList.javaᵃ | 110 | Singly linked list implementation |
10 | SqrtAlgorithm.javaᵃ | 118 | Calculating the square root of a number |
4.1.2 Similarity Detection
4.1.3 Analysing the Similarity Reports
4.2 Scenario 2 (Reused Boiler-Plate Code)
4.3 Scenario 3 (Decompilation)
4.4 Scenario 4 (Ranked Results)
4.5 Scenario 5 (Pervasive Modifications + Boiler-Plate Code)
Type | Modification | Source obfuscation | Bytecode obfuscation | Decomp. | Pairs | TP |
---|---|---|---|---|---|---|
\(O\) | Original | | | | 1,089 | 55 |
\(A\) | Artifice | ✓ | | | 2,178 | 110 |
\(K\) | Krakatau | | | ✓ | 2,178 | 110 |
\(P_c\) | Procyon | | | ✓ | 2,178 | 110 |
\(P_gK\) | ProGuard + Krakatau | | ✓ | ✓ | 2,178 | 110 |
\(P_gP_c\) | ProGuard + Procyon | | ✓ | ✓ | 2,178 | 110 |
\(AK\) | Artifice + Krakatau | ✓ | | ✓ | 2,178 | 110 |
\(AP_c\) | Artifice + Procyon | ✓ | | ✓ | 2,178 | 110 |
\(AP_gK\) | Artifice + ProGuard + Krakatau | ✓ | ✓ | ✓ | 2,178 | 110 |
\(AP_gP_c\) | Artifice + ProGuard + Procyon | ✓ | ✓ | ✓ | 2,178 | 110 |
5 Results
5.1 RQ1: Performance Comparison
5.1.1 Pervasively Modified Code
Tool | Settings | T | FP | FN | Acc | Prec | Rec | AUC | F1 | R |
---|---|---|---|---|---|---|---|---|---|---|
Clone det. | | | | | | | | | | |
ccfx (C)ᵃ | b = 5; t = 11 | 36 | 24 | 24 | 0.9952 | 0.9760 | 0.9760 | 0.9995 | 0.9760 | 1 |
deckard (T)ᵃ | mintoken = 30; stride = 2; similarity = 0.95 | 17 | 44 | 227 | 0.9729 | 0.9461 | 0.7730 | 0.9585 | 0.8509 | 6 |
iclones (L)ᵃ | minblock = 10; minclone = 50 | 0 | 36 | 358 | 0.9196 | 0.9048 | 0.4886 | 0.7088 | 0.6345 | 27 |
nicad (L)ᵃ | UPI = 0.50; minline = 8; rename = blind; abstract = literal | 38 | 38 | 346 | 0.9616 | 0.9451 | 0.6540 | 0.8164 | 0.7730 | 23 |
simian (C)ᵃ | threshold = 4; ignoreVariableNames | 5 | 150 | 165 | 0.9685 | 0.8477 | 0.8350 | 0.9262 | 0.8413 | 9 |
Plag. det. | | | | | | | | | | |
jplag-java | t = 7 | 19 | 58 | 196 | 0.9746 | 0.9327 | 0.8040 | 0.9563 | 0.8636 | 3 |
jplag-text | t = 4 | 14 | 66 | 239 | 0.9695 | 0.9202 | 0.7610 | 0.9658 | 0.8331 | 12 |
plaggie | M = 8 | 19 | 83 | 234 | 0.9683 | 0.9022 | 0.7660 | 0.9546 | 0.8286 | 15 |
sherlock | N = 4; Z = 2 | 6 | 142 | 196 | 0.9662 | 0.8499 | 0.8040 | 0.9447 | 0.8263 | 17 |
simjava | r = 16 | 15 | 120 | 152 | 0.9728 | 0.8760 | 0.8480 | 0.9711 | 0.8618 | 5 |
simtext | r = 4 | 14 | 38 | 422 | 0.9540 | 0.9383 | 0.5780 | 0.8075 | 0.7153 | 25 |
Compression | | | | | | | | | | |
7zncd-BZip2 | mx = 1,3,5 | 45 | 64 | 244 | 0.9692 | 0.9220 | 0.7560 | 0.9557 | 0.8308 | 14 |
7zncd-Deflate | mx = 7 | 38 | 122 | 215 | 0.9663 | 0.8655 | 0.7850 | 0.9454 | 0.8233 | 20 |
7zncd-Deflate64 | mx = 7,9 | 38 | 123 | 215 | 0.9662 | 0.8645 | 0.7850 | 0.9453 | 0.8229 | 21 |
7zncd-LZMA | mx = 7,9 | 41 | 115 | 213 | 0.9672 | 0.8725 | 0.7870 | 0.9483 | 0.8275 | 16 |
7zncd-LZMA2 | mx = 7,9 | 41 | 118 | 213 | 0.9669 | 0.8696 | 0.7870 | 0.9482 | 0.8262 | 18 |
7zncd-PPMd | mx = 9 | 42 | 140 | 198 | 0.9662 | 0.8514 | 0.8020 | 0.9467 | 0.8260 | 19 |
bzip2ncd | C = 1..9 | 38 | 62 | 216 | 0.9722 | 0.9267 | 0.7840 | 0.9635 | 0.8494 | 7 |
gzipncd | C = 7 | 31 | 110 | 203 | 0.9687 | 0.8787 | 0.7970 | 0.9556 | 0.8359 | 11 |
icd | ma = LZMA2; mx = 7,9 | 50 | 86 | 356 | 0.9558 | 0.8822 | 0.6440 | 0.9265 | 0.7445 | 24 |
ncd-zlib | N/A | 30 | 104 | 207 | 0.9689 | 0.8841 | 0.7930 | 0.9584 | 0.8361 | 10 |
ncd-bzlib | N/A | 37 | 82 | 206 | 0.9712 | 0.9064 | 0.7940 | 0.9636 | 0.8465 | 8 |
xzncd | -e | 39 | 120 | 203 | 0.9677 | 0.8691 | 0.7970 | 0.9516 | 0.8315 | 13 |
Others | | | | | | | | | | |
bsdiffᵃ | N/A | 71 | 199 | 577 | 0.9224 | 0.6801 | 0.4230 | 0.8562 | 0.5216 | 30 |
diff (C)ᵃ | N/A | 8 | 626 | 184 | 0.9190 | 0.5659 | 0.8160 | 0.9364 | 0.6683 | 26 |
difflib | whitespace = false; autojunk = false | 28 | 12 | 232 | 0.9756 | 0.9846 | 0.7680 | 0.9412 | 0.8629 | 4 |
fuzzywuzzy | token_set_ratio | 85 | 58 | 176 | 0.9766 | 0.9342 | 0.8240 | 0.9772 | 0.8757 | 2 |
jellyfish | jaro_distance | 78 | 340 | 478 | 0.9182 | 0.6056 | 0.5220 | 0.8619 | 0.5607 | 29 |
ngram | N/A | 49 | 110 | 224 | 0.9666 | 0.8758 | 0.7760 | 0.9410 | 0.8229 | 22 |
cosine | N/A | 48 | 292 | 458 | 0.9250 | 0.6499 | 0.5420 | 0.9113 | 0.5911 | 28 |
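
The Acc, Prec, Rec, and F1 columns above can be reproduced from the FP and FN counts using the standard definitions, as the following minimal sketch shows. The totals used here (1,000 similar pairs out of 10,000 pairs) are not stated in this section; they are an assumption inferred from the reported values, which they reproduce for the ccfx and deckard rows. The function name `error_measures` is hypothetical.

```python
def error_measures(fp: int, fn: int, positives: int, total: int) -> dict:
    """Derive accuracy, precision, recall and F1 from FP/FN counts.

    `positives` is the number of truly similar pairs and `total` the
    number of all compared pairs (both assumed, see the text above).
    """
    tp = positives - fn
    tn = total - positives - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "acc": (tp + tn) / total,
        "prec": precision,
        "rec": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# FP and FN taken from the ccfx and deckard rows of the table above.
ccfx = error_measures(fp=24, fn=24, positives=1000, total=10000)
deckard = error_measures(fp=44, fn=227, positives=1000, total=10000)

for name, result in (("ccfx", ccfx), ("deckard", deckard)):
    print(name, {key: round(value, 4) for key, value in result.items()})
```

Under these assumed totals, the rounded output matches the table for both rows, which is a useful sanity check when re-deriving the measures for other tools.
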
5.1.2 Boiler-plate Code
Tool | Settings | T | FP | FN | Acc | Prec | Rec | AUC | F1 | R |
---|---|---|---|---|---|---|---|---|---|---|
Clone det. | | | | | | | | | | |
ccfx (C)ᵃ | b = 15,16,17; t = 12 | 25 | 42 | 15 | 0.9992 | 0.9125 | 0.9669 | 0.9905 | 0.9389 | 7 |
deckard (T)ᵃ | mintoken = 50; stride = 2; similarity = 1.00 | 19 | 27 | 17 | 0.9993 | 0.9417 | 0.9625 | 0.9823 | 0.9520 | 5 |
iclones (L)ᵃ | minblock = 40; minclone = 50 | 19 | 20 | 57 | 0.9989 | 0.9519 | 0.8742 | 0.9469 | 0.9114 | 12 |
nicad (L)ᵃ | UPI = 0.30; minline = 5; rename = consistent; abstract = condition | 22 | 19 | 51 | 0.9990 | 0.9549 | 0.8874 | 0.9694 | 0.9199 | 9 |
simian (L)ᵃ | threshold = 4; ignoreVariableNames | 26 | 20 | 17 | 0.9994 | 0.9561 | 0.9625 | 0.9921 | 0.9593 | 3 |
Plag. det. | | | | | | | | | | |
jplag-java | t = 12 | 29 | 26 | 13 | 0.9994 | 0.9442 | 0.9713 | 0.9895 | 0.9576 | 4 |
jplag-text | t = 9 | 32 | 16 | 12 | 0.9996 | 0.9650 | 0.9735 | 0.9939 | 0.9692 | 1 |
plaggie | M = 14 | 33 | 36 | 37 | 0.9989 | 0.9204 | 0.9183 | 0.9753 | 0.9193 | 10 |
sherlock | N = 5; Z = 0 | 22 | 22 | 54 | 0.9989 | 0.9477 | 0.8808 | 0.9996 | 0.9130 | 11 |
simjava | r = 25 | 46 | 18 | 11 | 0.9996 | 0.9607 | 0.9757 | 0.9987 | 0.9682 | 2 |
simtext | r = 12 | 17 | 73 | 19 | 0.9986 | 0.8560 | 0.9581 | 0.9887 | 0.9042 | 13 |
Compression | | | | | | | | | | |
7zncd-BZip2 | mx = 1,3,5 | 64 | 24 | 118 | 0.9979 | 0.9331 | 0.7395 | 0.9901 | 0.8251 | 26 |
7zncd-Deflate | mx = 7 | 64 | 27 | 97 | 0.9982 | 0.9295 | 0.7859 | 0.9937 | 0.8517 | 24 |
7zncd-Deflate64 | mx = 7 | 64 | 27 | 96 | 0.9982 | 0.9297 | 0.7881 | 0.9957 | 0.8530 | 23 |
7zncd-LZMA | mx = 7,9 | 69 | 11 | 99 | 0.9984 | 0.9699 | 0.7815 | 0.9940 | 0.8655 | 20 |
7zncd-LZMA2 | mx = 7,9 | 69 | 11 | 99 | 0.9984 | 0.9699 | 0.7815 | 0.9939 | 0.8655 | 20 |
7zncd-PPMd | mx = 9 | 68 | 19 | 106 | 0.9981 | 0.9481 | 0.7660 | 0.9948 | 0.8474 | 25 |
bzip2ncd | C = 1,2,3,..,8,9 | 54 | 20 | 94 | 0.9983 | 0.9473 | 0.7925 | 0.9944 | 0.8630 | 22 |
gzipncd | C = 9 | 54 | 25 | 82 | 0.9984 | 0.9369 | 0.8190 | 0.9961 | 0.8740 | 16 |
icdᵇ | ma = LZMA; mx = 1,3 | 84 | 12 | 151 | 0.9976 | 0.9618 | 0.6667 | 0.9736 | 0.7875 | 27 |
ncd-zlib | N/A | 57 | 10 | 91 | 0.9985 | 0.9731 | 0.7991 | 0.9983 | 0.8776 | 14 |
ncd-bzlib | N/A | 52 | 30 | 82 | 0.9983 | 0.9252 | 0.8190 | 0.9943 | 0.8689 | 18 |
xzncd | 2,3 | 64 | 13 | 94 | 0.9984 | 0.9651 | 0.7925 | 0.9942 | 0.8703 | 17 |
xzncd | 6,7,8,9,e | 65 | | | | | | | | |
Others | | | | | | | | | | |
bsdiff | N/A | 90 | 2125 | 212 | 0.9652 | 0.1019 | 0.5320 | 0.9161 | 0.1710 | 29 |
diff (C) | N/A | 29 | 7745 | 5 | 0.8845 | 0.0547 | 0.9890 | 0.9180 | 0.1036 | 30 |
difflib | autojunk = true; whitespace = true | 42 | 30 | 21 | 0.9992 | 0.9351 | 0.9536 | 0.9999 | 0.9443 | 6 |
fuzzywuzzy | ratio | 65 | 30 | 30 | 0.9991 | 0.9338 | 0.9338 | 0.9989 | 0.9338 | 8 |
jellyfish | jaro_distance | 82 | 0 | 162 | 0.9976 | 1.0000 | 0.6424 | 0.9555 | 0.7823 | 28 |
ngram | N/A | 59 | 20 | 84 | 0.9984 | 0.9486 | 0.8146 | 0.9967 | 0.8765 | 15 |
cosine | N/A | 68 | 50 | 68 | 0.9982 | 0.8851 | 0.8499 | 0.9973 | 0.8671 | 19 |
5.1.3 Observations of the Tools’ Performances on the Two Data Sets
5.2 RQ2: Optimal Configurations
5.2.1 Pervasively Modified Code
Error measure | Value | ccfx’s parameters | |
---|---|---|---|
b
|
t
| ||
Precision | 1.000 | 19 | 7 8 9 |
Recall | 0.980 | 5 | 12 |
5.2.2 Boiler-Plate Code
5.3 RQ3: Normalisation by Decompilation
Tool | Settings | T | FP | FN | Acc | Prec | Rec | AUC | F1 | R |
---|---|---|---|---|---|---|---|---|---|---|
Clone det. | | | | | | | | | | |
ccfx (T)ᵃᵇ | b = 5; t = 8 | 50 | 0 | 18 | 0.9982 | 1.0000 | 0.9820 | 0.9991 | 0.9909 | 4 |
deckard (L)ᵃᵇ | mintoken = 30; stride = 1; similarity = 0.95 | 29 | 0 | 84 | 0.9916 | 1.0000 | 0.9160 | 0.9459 | 0.9562 | 11 |
iclones (L)ᵃ | minblock = 8; minclone = 50 | 10 | 0 | 86 | 0.9914 | 1.0000 | 0.9140 | 0.9610 | 0.9551 | 14 |
nicad (T)ᵃᵇ | UPI = 0.30; minline = 8; rename = blind; abstract = literal | 19 | 0 | 106 | 0.9894 | 1.0000 | 0.8940 | 0.9526 | 0.9440 | 24 |
simian (T)ᵃᵇ | threshold = 3; ignoreIdentifiers | 17 | 0 | 0 | 1.0000 | 1.0000 | 1.0000 | 0.9960 | 1.0000 | 1 |
Plagiarism det. | | | | | | | | | | |
jplag-java | t = 4..12,default | 23 | 0 | 0 | 1.0000 | 1.0000 | 1.0000 | 0.9964 | 1.0000 | 1 |
jplag-text | t = 1 | 56 | 16 | 24 | 0.9960 | 0.9839 | 0.9760 | 0.9993 | 0.9799 | 6 |
plaggie | M = 9 | 29 | 0 | 84 | 0.9916 | 1.0000 | 0.9160 | 0.9454 | 0.9562 | 13 |
sherlock | N = 1; Z = 0 | 60 | 34 | 22 | 0.9944 | 0.9664 | 0.9780 | 0.9989 | 0.9722 | 7 |
simjavaᵇ | r = 18 | 17 | 0 | 0 | 1.0000 | 1.0000 | 1.0000 | 0.9998 | 1.0000 | 1 |
simtext | r = 4 | 33 | 33 | 60 | 0.9907 | 0.9661 | 0.9400 | 0.9862 | 0.9529 | 16 |
simtext | r = 5 | 31 | | | | | | | | |
Compression | | | | | | | | | | |
7zncd-BZip2 | mx = 1,3,5 | 49 | 40 | 40 | 0.9920 | 0.9600 | 0.9600 | 0.9983 | 0.9600 | 10 |
7zncd-Deflate | mx = 9 | 46 | 28 | 71 | 0.9901 | 0.9707 | 0.9290 | 0.9978 | 0.9494 | 18 |
7zncd-Deflate64 | mx = 9 | 46 | 28 | 72 | 0.9900 | 0.9707 | 0.9280 | 0.9978 | 0.9489 | 19 |
7zncd-LZMA | mx = 7,9 | 48 | 28 | 72 | 0.9900 | 0.9707 | 0.9280 | 0.9977 | 0.9489 | 19 |
7zncd-LZMA2 | mx = 7,9 | 48 | 28 | 72 | 0.9900 | 0.9707 | 0.9280 | 0.9977 | 0.9489 | 19 |
7zncd-PPMd | mx = 9 | 49 | 40 | 31 | 0.9929 | 0.9604 | 0.9690 | 0.9985 | 0.9647 | 8 |
bzip2ncd | C = 1..9,default | 43 | 40 | 36 | 0.9924 | 0.9602 | 0.9640 | 0.9983 | 0.9621 | 9 |
gzipncd | C = 8,9 | 38 | 28 | 63 | 0.9909 | 0.9710 | 0.9370 | 0.9980 | 0.9537 | 15 |
icdᵇ | ma = LZMA; mx = 7,9 | 54 | 45 | 68 | 0.9887 | 0.9539 | 0.9320 | 0.9921 | 0.9428 | 25 |
ncd-zlib | N/A | 37 | 28 | 72 | 0.9900 | 0.9707 | 0.9280 | 0.9981 | 0.9489 | 19 |
ncd-bzlib | N/A | 42 | 46 | 36 | 0.9918 | 0.9545 | 0.9640 | 0.9984 | 0.9592 | 11 |
xzncd | -1 | 43 | 16 | 83 | 0.9901 | 0.9829 | 0.9170 | 0.9967 | 0.9488 | 23 |
Others | | | | | | | | | | |
bsdiff | N/A | 78 | 0 | 171 | 0.9829 | 1.0000 | 0.8290 | 0.9595 | 0.9065 | 28 |
diff (C) | N/A | 23 | 12 | 186 | 0.9802 | 0.9855 | 0.8140 | 0.9768 | 0.8916 | 29 |
difflib | autojunk = true | 23 | 28 | 66 | 0.9906 | 0.9709 | 0.9340 | 0.9823 | 0.9521 | 17 |
fuzzywuzzy | token_set_ratio | 90 | 0 | 32 | 0.9968 | 1.0000 | 0.9680 | 0.9966 | 0.9837 | 5 |
jellyfish | jaro_winkler | 89 | 40 | 220 | 0.9740 | 0.9512 | 0.7800 | 0.9473 | 0.8571 | 30 |
ngram | N/A | 60 | 48 | 104 | 0.9848 | 0.9492 | 0.8960 | 0.9726 | 0.9218 | 26 |
cosine | N/A | 68 | 98 | 66 | 0.9836 | 0.9050 | 0.9340 | 0.9955 | 0.9193 | 27 |
Test | p-value | Significant? | Effect size (\(A_{12}\)) |
---|---|---|---|
Before-after decompiled by Krakatau | 1.863e-09 | Yes | 0.969 (large) |
Before-after decompiled by Procyon | 1.863e-09 | Yes | 0.937 (large) |
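
For readers unfamiliar with the effect size reported above, the following minimal sketch computes the Vargha-Delaney \(A_{12}\) statistic in its usual form: the probability that a value drawn from the first sample exceeds one drawn from the second, with ties counted as one half. The function name and the sample values are hypothetical; they are not the study's measurements.

```python
def vargha_delaney_a12(xs, ys) -> float:
    """Vargha-Delaney A12: P(x > y) + 0.5 * P(x == y) over all pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


# Hypothetical F-scores after and before a normalisation step.
after = [0.95, 0.96, 0.99]
before = [0.85, 0.86, 0.97]

print(vargha_delaney_a12(after, before))
print(vargha_delaney_a12(before, before))
```

A value of 0.5 indicates no difference between the two samples, and values approaching 1 indicate that the first sample tends to be larger.
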
Tool | Settings | T | FP | FN | Acc | Prec | Rec | AUC | F1 | R |
---|---|---|---|---|---|---|---|---|---|---|
Clone det. | | | | | | | | | | |
ccfx (L)ᵃ | b = 20; t = 1..7 | 11 | 4 | 38 | 0.9958 | 0.9959 | 0.9620 | 0.9970 | 0.9786 | 4 |
deckard (T)ᵃ | mintoken = 30; stride = 1, inf; similarity = 1.00 | 10 | 0 | 32 | 0.9968 | 1.0000 | 0.9680 | 0.9978 | 0.9837 | 2 |
iclones (C)ᵃ | minblock = 10; minclone = 50 | 0 | 18 | 98 | 0.9884 | 0.9804 | 0.9020 | 0.9508 | 0.9396 | 11 |
nicad (W)ᵃ | UPI = 0.30; minline = 10; rename = blind; abstract = condition,literal | 11 | 16 | 100 | 0.9884 | 0.9825 | 0.9000 | 0.9536 | 0.9394 | 12 |
simian (C)ᵃ | threshold = 3; ignoreIdentifiers | 23 | 8 | 70 | 0.9922 | 0.9915 | 0.9300 | 0.9987 | 0.9598 | 8 |
Plagiarism det. | | | | | | | | | | |
jplag-java | t = 8 | 22 | 0 | 72 | 0.9928 | 1.0000 | 0.9280 | 0.9887 | 0.9627 | 7 |
jplag-text | t = 9 | 11 | 16 | 48 | 0.9936 | 0.9835 | 0.9520 | 0.9982 | 0.9675 | 6 |
plaggie | M = 13,14 | 10 | 16 | 80 | 0.9904 | 0.9829 | 0.9200 | 0.9773 | 0.9504 | 9 |
sherlock | N = 1; Z = 0 | 55 | 28 | 16 | 0.9956 | 0.9723 | 0.9840 | 0.9997 | 0.9781 | 5 |
simjava | r = default | 11 | 8 | 0 | 0.9992 | 0.9921 | 1.0000 | 0.9999 | 0.9960 | 1 |
simtext | r = 4 | 15 | 42 | 100 | 0.9858 | 0.9554 | 0.9000 | 0.9686 | 0.9269 | 14 |
simtext | r = default | 0 | | | | | | | | |
Compression | | | | | | | | | | |
7zncd-BZip2 | mx = 1,3,5 | 51 | 30 | 116 | 0.9854 | 0.9672 | 0.8840 | 0.9909 | 0.9237 | 16 |
7zncd-Deflate | mx = 9 | 49 | 25 | 154 | 0.9821 | 0.9713 | 0.8460 | 0.9827 | 0.9043 | 20 |
7zncd-Deflate64 | mx = 9 | 49 | 25 | 154 | 0.9821 | 0.9713 | 0.8460 | 0.9827 | 0.9043 | 20 |
7zncd-LZMA | mx = 7,9 | 52 | 16 | 164 | 0.9820 | 0.9812 | 0.8360 | 0.9843 | 0.9028 | 23 |
7zncd-LZMA2 | mx = 7,9 | 52 | 17 | 164 | 0.9819 | 0.9801 | 0.8360 | 0.9841 | 0.9023 | 24 |
7zncd-PPMd | mx = 9 | 53 | 22 | 122 | 0.9856 | 0.9756 | 0.8780 | 0.9861 | 0.9242 | 15 |
bzip2ncd | C = 1..9,default | 47 | 12 | 140 | 0.9848 | 0.9862 | 0.8600 | 0.9922 | 0.9188 | 18 |
gzipncd | C = 3 | 36 | 40 | 133 | 0.9827 | 0.9559 | 0.8670 | 0.9846 | 0.9093 | 25 |
icd | ma = LZMA; mx = 7,9 | 54 | 37 | 150 | 0.9813 | 0.9583 | 0.8500 | 0.9721 | 0.9009 | |
icd | ma = LZMA2; mx = 7,9 | | | | | | | | | |
ncd-zlib | N/A | 41 | 30 | 158 | 0.9812 | 0.9656 | 0.8420 | 0.9876 | 0.8996 | 26 |
ncd-bzlib | N/A | 47 | 8 | 140 | 0.9852 | 0.9908 | 0.8600 | 0.9923 | 0.9208 | 17 |
xzncd | -e | 49 | 35 | 148 | 0.9817 | 0.9605 | 0.8520 | 0.9860 | 0.9030 | 22 |
Others | | | | | | | | | | |
bsdiff | N/A | 73 | 48 | 236 | 0.9716 | 0.9409 | 0.7640 | 0.9606 | 0.8433 | 29 |
diff (C) | N/A | 23 | 6 | 244 | 0.9750 | 0.9921 | 0.7560 | 0.9826 | 0.8581 | 28 |
difflib | autojunk = true | 26 | 12 | 94 | 0.9894 | 0.9869 | 0.9060 | 0.9788 | 0.9447 | 10 |
fuzzywuzzy | token_set_ratio | 90 | 0 | 36 | 0.9964 | 1.0000 | 0.9640 | 0.9992 | 0.9817 | 3 |
jellyfish | jaro_winkler | 87 | 84 | 270 | 0.9646 | 0.8968 | 0.7300 | 0.9218 | 0.8049 | 30 |
ngram | N/A | 58 | 8 | 192 | 0.9800 | 0.9902 | 0.8080 | 0.9714 | 0.8899 | 27 |
cosine | N/A | 69 | 54 | 74 | 0.9872 | 0.9449 | 0.9260 | 0.9897 | 0.9354 | 12 |
5.4 RQ4: Reuse of Configurations
Tools | \(C_{gen}\) settings | \(C_{gen}\) T | \(C_{gen}\) F-score (generated) | \(C_{gen}\) F-score (SOCO) | \(C_{soco}\) settings | \(C_{soco}\) T | \(C_{soco}\) F-score (SOCO) |
---|---|---|---|---|---|---|---|
ccfx (C) | b = 5; t = 11 | 36 | 0.9760 | 0.8441 | b = {15 16 17}; t = 12 | 25 | 0.9389 |
fuzzywuzzy | token_set_ratio | 85 | 0.8757 | 0.6012 | ratio | 65 | 0.9338 |
jplag-java | t = 7 | 19 | 0.8636 | 0.3168 | t = 12 | 29 | 0.9576 |
difflib | autojunk = false; whitespace = false | 28 | 0.8629 | 0.2113 | autojunk = true; whitespace = true | 42 | 0.9443 |
simjava | r = 16 | 15 | 0.8618 | 0.5888 | r = 25 | 46 | 0.9682 |
deckard (T) | M = 30; S1 = 2; S2 = 0.95 | 17 | 0.8509 | 0.3305 | M = 50; S1 = 1; S2 = 1.0 | 19 | 0.9520 |
bzip2ncd | C = 1..9 | 38 | 0.8494 | 0.3661 | C = 1..9 | 54 | 0.8630 |
ncd-bzlib | N/A | 37 | 0.8465 | 0.3357 | N/A | 52 | 0.8689 |
simian (C) | threshold = 4; I1 | 5 | 0.8413 | 0.6394 | threshold = 4; I1 | 26 | 0.9593 |
ncd-zlib | N/A | 30 | 0.8361 | 0.3454 | N/A | 57 | 0.8776 |
5.5 RQ5: Ranked Results
5.5.1 Precision-at-n
Rank | F-score (pair-based) | prec@n (pair-based) | ARP (query-based) | MAP (query-based) |
---|---|---|---|---|
1 | (0.976) ccfx | (0.976) ccfx | (1.000) ccfx | (1.000) ccfx |
2 | (0.876) fuzzywuzzy | (0.860) simjava | (0.915) fuzzywuzzy | (0.949) fuzzywuzzy |
3 | (0.864) jplag-java | (0.858) fuzzywuzzy | (0.913) ncd-bzlib | (0.943) ncd-bzlib |
4 | (0.863) difflib | (0.842) simian | (0.912) 7zncd-BZip2 | (0.942) bzip2ncd |
5 | (0.862) simjava | (0.836) deckard | (0.909) bzip2ncd | (0.938) 7zncd-BZip2 |
6 | (0.851) deckard | (0.836) jplag-java | (0.900) 7zncd-PPMd | (0.937) gzipncd |
7 | (0.849) bzip2ncd | (0.832) bzip2ncd | (0.900) gzipncd | (0.935) ncd-zlib |
8 | (0.847) ncd-bzlib | (0.828) difflib | (0.898) ncd-zlib | (0.933) jplag-text |
9 | (0.841) simian | (0.826) ncd-bzlib | (0.898) xzncd | (0.930) 7zncd-PPMd |
10 | (0.836) ncd-zlib | (0.820) 7zncd-BZip2 | (0.895) 7zncd-LZMA2 | (0.929) xzncd |
Rank | F-score (pair-based) | prec@n (pair-based) | ARP (query-based) | MAP (query-based) |
---|---|---|---|---|
1 | (0.969) jplag-text | (0.965) jplag-text | (0.998) jplag-java | (0.997) jplag-java |
2 | (0.968) simjava | (0.960) simjava | (0.998) difflib | (0.997) difflib |
3 | (0.959) simian | (0.956) simian | (0.989) ccfx | (0.993) jplag-text |
4 | (0.958) jplag-java | (0.947) deckard | (0.989) simjava | (0.988) simjava |
5 | (0.952) deckard | (0.943) jplag-java | (0.987) gzipncd | (0.987) gzipncd |
6 | (0.944) difflib | (0.938) difflib | (0.986) jplag-text | (0.987) ncd-zlib |
7 | (0.939) ccfx | (0.929) ccfx | (0.985) ncd-zlib | (0.986) sherlock |
8 | (0.934) fuzzywuzzy | (0.929) fuzzywuzzy | (0.984) 7zncd-Deflate | (0.986) 7zncd-Deflate64 |
9 | (0.920) nicad | (0.914) plaggie | (0.984) 7zncd-Deflate64 | (0.986) 7zncd-Deflate |
10 | (0.919) plaggie | (0.901) nicad | (0.983) 7zncd-LZMA | (0.984) fuzzywuzzy |
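
The ranking measures compared in the tables above can be sketched as follows, assuming their standard information-retrieval definitions: precision-at-n over a ranked list, r-precision (precision at the number of relevant items, averaged over queries to give ARP), and average precision (averaged over queries to give MAP). The function names and the toy ranking are hypothetical, and the study's exact definitions may differ in detail.

```python
def precision_at_n(ranked, relevant, n: int) -> float:
    """Fraction of the top-n ranked items that are relevant."""
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / n


def r_precision(ranked, relevant) -> float:
    """Precision at r, where r is the number of relevant items."""
    r = len(relevant)
    return precision_at_n(ranked, relevant, r) if r else 0.0


def average_precision(ranked, relevant) -> float:
    """Mean of the precision values at each rank holding a relevant item."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_over_queries(metric, queries) -> float:
    """Average a per-query metric over (ranked, relevant) pairs."""
    return sum(metric(ranked, relevant) for ranked, relevant in queries) / len(queries)


# Hypothetical query results: items ranked by decreasing similarity.
queries = [
    (["a", "x", "b", "y"], {"a", "b"}),
    (["c", "d", "z"], {"c", "d"}),
]

print("ARP:", mean_over_queries(r_precision, queries))
print("MAP:", mean_over_queries(average_precision, queries))
```

In this toy example the first query places one relevant item at rank 3, which lowers both its r-precision and its average precision, while the second query is ranked perfectly.
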
5.5.2 Average r-Precision
Tool | ccfx | fuzzywuzzy | ncd-bzlib | bzip2ncd | ncd-zlib | deckard | simjava | jplag-java | simian | difflib |
---|---|---|---|---|---|---|---|---|---|---|
ccfx | | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) |
fuzzywuzzy | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
ncd-bzlib | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
bzip2ncd | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
ncd-zlib | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
deckard | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
simjava | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) |
jplag-java | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) |
simian | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) |
difflib | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | |
Tool | jplag-java | difflib | ccfx | simjava | gzipncd | jplag-text | ncd-zlib | deflate | deflate64 | LZMA |
---|---|---|---|---|---|---|---|---|---|---|
jplag-java | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) |
difflib | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) |
ccfx | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
simjava | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
gzipncd | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
jplag-text | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
ncd-zlib | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) |
deflate | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) |
deflate64 | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) |
LZMA | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | |
5.5.3 Mean Average Precision
Tool | jplag-java | difflib | jplag-text | simjava | gzipncd | ncd-zlib | sherlock | deflate64 | deflate | fuzzywuzzy |
---|---|---|---|---|---|---|---|---|---|---|
jplag-java | | \(\square\) | \(\square\) | \(\square\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) |
difflib | \(\square\) | | \(\square\) | \(\square\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) | \(\blacktriangleright\) |
jplag-text | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
simjava | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
gzipncd | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
ncd-zlib | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) | \(\square\) |
sherlock | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) | \(\square\) |
deflate64 | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) | \(\square\) |
deflate | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | | \(\square\) |
fuzzywuzzy | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | \(\square\) | |
5.6 RQ6: Local + Global Code Modifications
Tool (F-score by modification type) | \(O\) | \(A\) | \(K\) | \(P_c\) | \(P_gK\) | \(P_gP_c\) | \(AK\) | \(AP_c\) | \(AP_gK\) | \(AP_gP_c\) |
---|---|---|---|---|---|---|---|---|---|---|
Clone det. | | | | | | | | | | |
ccfx (C)ᵃ | 0.8911 | 0.3714 | 0.0000 | 0.6265 | 0.0000 | 0.1034 | 0.0000 | 0.2985 | 0.0000 | 0.1034 |
deckard (T)ᵃ | 0.9636 | 0.9217 | 0.1667 | 0.3333 | 0.0357 | 0.2286 | 0.1667 | 0.3252 | 0.0357 | 0.2286 |
iclones (L)ᵃ | 0.5000 | 0.0000 | 0.0000 | 0.0357 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
nicad (T)ᵃ | 0.5823 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
simian (L)ᵃ | 0.8350 | 0.1034 | 0.0357 | 0.1356 | 0.0000 | 0.0357 | 0.0000 | 0.0357 | 0.0000 | 0.0357 |
Plagiarism det. | | | | | | | | | | |
jplag-java | 1.0000 | 1.0000 | 0.7429 | 0.9524 | 0.2973 | 0.4533 | 0.7547 | 0.9720 | 0.2973 | 0.4507 |
jplag-text | 0.9815 | 0.6265 | 0.5581 | 0.6304 | 0.3590 | 0.4250 | 0.4906 | 0.5581 | 0.3590 | 0.4304 |
plaggie | 0.9636 | 0.9159 | 0.7363 | 0.9372 | 0.2171 | 0.4626 | 0.7363 | 0.9423 | 0.2171 | 0.4626 |
sherlock | 0.9483 | 0.8298 | 0.7872 | 0.8298 | 0.3061 | 0.3516 | 0.6744 | 0.7826 | 0.3061 | 0.3516 |
simjava | 0.9649 | 0.9815 | 1.0000 | 0.7525 | 0.3188 | 0.3913 | 0.8041 | 0.7525 | 0.3188 | 0.3913 |
simtext | 0.9649 | 0.7191 | 0.1667 | 0.4932 | 0.0357 | 0.1667 | 0.0702 | 0.2258 | 0.0357 | 0.1667 |
Compression | | | | | | | | | | |
7zncd-BZip2 | 0.9273 | 0.7736 | 0.6852 | 0.8649 | 0.2446 | 0.3704 | 0.6423 | 0.7465 | 0.2446 | 0.3704 |
7zncd-Deflate | 0.9483 | 0.7579 | 0.6935 | 0.8406 | 0.2427 | 0.3333 | 0.6360 | 0.7418 | 0.2427 | 0.3333 |
7zncd-Deflate64 | 0.9483 | 0.7579 | 0.6935 | 0.8406 | 0.2427 | 0.3333 | 0.6360 | 0.7373 | 0.2427 | 0.3333 |
7zncd-LZMA | 0.9649 | 0.7967 | 0.7488 | 0.8851 | 0.2663 | 0.3842 | 0.6768 | 0.7665 | 0.2632 | 0.3842 |
7zncd-LZMA2 | 0.9649 | 0.7934 | 0.7536 | 0.8851 | 0.2718 | 0.3923 | 0.6700 | 0.7632 | 0.2697 | 0.4000 |
7zncd-PPMd | 0.9623 | 0.7965 | 0.7628 | 0.8909 | 0.2581 | 0.3796 | 0.6667 | 0.8019 | 0.2581 | 0.3796 |
bzip2ncd | 0.9649 | 0.8305 | 0.8302 | 0.9273 | 0.3590 | 0.4681 | 0.7612 | 0.8448 | 0.3562 | 0.4681 |
gzipncd | 0.9623 | 0.7965 | 0.7628 | 0.8909 | 0.2581 | 0.3796 | 0.6667 | 0.8019 | 0.2581 | 0.3796 |
icd | 0.9216 | 0.5058 | 0.4371 | 0.5623 | 0.2237 | 0.2822 | 0.3478 | 0.4239 | 0.2237 | 0.2822 |
ncd-zlib | 0.9821 | 0.8571 | 0.8246 | 0.9432 | 0.4021 | 0.4920 | 0.7491 | 0.8559 | 0.3963 | 0.4920 |
ncd-bzlib | 0.9649 | 0.8269 | 0.8269 | 0.9273 | 0.3529 | 0.4634 | 0.7500 | 0.8448 | 0.3500 | 0.4719 |
xzncd | 0.9734 | 0.8416 | 0.7925 | 0.9198 | 0.3133 | 0.4615 | 0.7035 | 0.8148 | 0.3133 | 0.4615 |
Others | | | | | | | | | | |
bsdiff | 0.4388 | 0.2280 | 0.1529 | 0.2005 | 0.1151 | 0.1350 | 0.1276 | 0.1596 | 0.1152 | 0.1353 |
diff (C) | 0.2835 | 0.2374 | 0.1585 | 0.2000 | 0.1296 | 0.1248 | 0.1530 | 0.1786 | 0.1302 | 0.1249 |
difflib | 0.9821 | 0.9550 | 0.8952 | 0.9565 | 0.4790 | 0.5087 | 0.8688 | 0.9381 | 0.4606 | 0.5091 |
fuzzywuzzy | 1.0000 | 0.9821 | 0.9259 | 0.9636 | 0.4651 | 0.5116 | 0.9074 | 0.9541 | 0.4557 | 0.5116 |
jellyfish | 0.9273 | 0.7253 | 0.6400 | 0.6667 | 0.2479 | 0.3579 | 0.5513 | 0.5000 | 0.2479 | 0.3662 |
ngram | 1.0000 | 0.9464 | 0.8952 | 0.9346 | 0.4110 | 0.4490 | 0.8785 | 0.8908 | 0.4054 | 0.4578 |
cosine | 0.9074 | 0.6847 | 0.7123 | 0.6800 | 0.3500 | 0.3596 | 0.5823 | 0.5287 | 0.3500 | 0.3596 |
| 0.6847 | 0.7123 | 0.6800 | 0.3500 | 0.3596 | 0.5823 | 0.5287 | 0.3500 | 0.3596 |