Skip to main content
Top
Published in: Software Quality Journal 4/2009

01-12-2009

Clone detection via structural abstraction

Authors: William S. Evans, Christopher W. Fraser, Fei Ma

Published in: Software Quality Journal | Issue 4/2009

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

This paper describes the design, implementation, and application of a new algorithm to detect cloned code. It operates on the abstract syntax trees formed by many compilers as an intermediate representation. It extends prior work by identifying clones even when arbitrary subtrees have been changed. These subtrees may represent structural rather than simply lexical code differences. In several hundred thousand lines of Java and C# code, 20–50% of the clones that we find involve these structural changes, which are not accounted for by previous methods. Our method also identifies cloning in declarations, so it is somewhat more general than conventional procedural abstraction.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Footnotes
1
The notation emphasizes the fact that each hole may be filled with a different pattern.
 
2
This does not consider the cost of the r − 1 call instructions that replace r − 1 of the occurrences.
 
4
Let n, c, t, and ℓ be the number of nodes, characters, tokens, and lines in a file. For Java, n ≈ 0.55 c ≈ 4.0 t ≈ 13.5 ℓ. For C#, n ≈ 0.39 c ≈ 1.45 t ≈ 14.9 ℓ.
 
5
The entire clone comprises 231 nodes (21 source lines), contains one hole, and the clone occurs twice.
 
9
A trademark of Semantic Designs Inc.
 
10
They actually reported the fraction of sampled clone pairs that were unacceptable.
 
Literature
go back to reference Badros, G. J. (2000). Java ML: A markup language for Java source code. Computer Networks (Amsterdam, Netherlands), 33(1–6), 159–177. Badros, G. J. (2000). Java ML: A markup language for Java source code. Computer Networks (Amsterdam, Netherlands), 33(1–6), 159–177.
go back to reference Baker, B. S. (1995). On finding duplication and near-duplication in large software systems. In Proceedings of the IEEE working conference on reverse engineering (pp. 86–95). Baker, B. S. (1995). On finding duplication and near-duplication in large software systems. In Proceedings of the IEEE working conference on reverse engineering (pp. 86–95).
go back to reference Baker, B. S. (1997). Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM Journal on Computing, 26(5), 1343–1362.MATHCrossRefMathSciNet Baker, B. S. (1997). Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM Journal on Computing, 26(5), 1343–1362.MATHCrossRefMathSciNet
go back to reference Baker, B. S. (2007). Finding clones with Dup: Analysis of an experiment. IEEE Transactions on Software Engineering, 33(9), 608–621.CrossRef Baker, B. S. (2007). Finding clones with Dup: Analysis of an experiment. IEEE Transactions on Software Engineering, 33(9), 608–621.CrossRef
go back to reference Baker, B. S., & Manber, U. (1998). Deducing similarities in Java sources from bytecodes. In Proceedings of USENIX annual technical conference (pp. 179–190). Baker, B. S., & Manber, U. (1998). Deducing similarities in Java sources from bytecodes. In Proceedings of USENIX annual technical conference (pp. 179–190).
go back to reference Baxter, I. D., Yahin, A., Moura, L., Sant’Anna, M., & Bier, L. (1998). Clone detection using abstract syntax trees. In Proceedings of the international conference on software maintenance (pp. 368–377). Baxter, I. D., Yahin, A., Moura, L., Sant’Anna, M., & Bier, L. (1998). Clone detection using abstract syntax trees. In Proceedings of the international conference on software maintenance (pp. 368–377).
go back to reference Bellon, S. (2002). Vergleich von techniken zur erkennung duplizierten quellcodes. Master’s thesis, University of Stuttgart. Thesis number 1998. Bellon, S. (2002). Vergleich von techniken zur erkennung duplizierten quellcodes. Master’s thesis, University of Stuttgart. Thesis number 1998.
go back to reference Bellon, S., Koschke, R., Antoniol, G., Krinke, J., & Merlo, E. (2007). Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering, 33(9), 577–591.CrossRef Bellon, S., Koschke, R., Antoniol, G., Krinke, J., & Merlo, E. (2007). Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering, 33(9), 577–591.CrossRef
go back to reference Cheung, W., Evans, W., & Moses, J. (2003). Predicated instructions for code compaction. In Proceedings of the 7th international workshop on software and compilers for embedded systems (pp. 17–32). Cheung, W., Evans, W., & Moses, J. (2003). Predicated instructions for code compaction. In Proceedings of the 7th international workshop on software and compilers for embedded systems (pp. 17–32).
go back to reference Chi, Y., Nijssen, S., Muntz, R. R., & Kok, J. N. (2005). Frequent subtree mining—an overview. Fundamenta Informaticae, 66(1–2), 161–198.MATHMathSciNet Chi, Y., Nijssen, S., Muntz, R. R., & Kok, J. N. (2005). Frequent subtree mining—an overview. Fundamenta Informaticae, 66(1–2), 161–198.MATHMathSciNet
go back to reference Church, K., & Helfman, J. (1993). Dotplot: A program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics, 2(2), 153–174.CrossRef Church, K., & Helfman, J. (1993). Dotplot: A program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics, 2(2), 153–174.CrossRef
go back to reference Cooper, K. D., & McIntosh, N. (1999). Enhanced code compression for embedded RISC processors. In ACM conference on programming language design and implementation (pp. 139–149). Cooper, K. D., & McIntosh, N. (1999). Enhanced code compression for embedded RISC processors. In ACM conference on programming language design and implementation (pp. 139–149).
go back to reference Debray, S. K., Evans, W., Muth, R., & de Sutter, B. (2000). Compiler techniques for code compaction. ACM Transactions on Programming Languages and Systems, 22(2), 378–415.CrossRef Debray, S. K., Evans, W., Muth, R., & de Sutter, B. (2000). Compiler techniques for code compaction. ACM Transactions on Programming Languages and Systems, 22(2), 378–415.CrossRef
go back to reference Ducasse, S., Rieger, M., & Demeyer, S. (1999). A language independent approach for detecting duplicated code. In Proceedings of the IEEE international conference on software maintenance (ICSM) (pp. 109–118). Ducasse, S., Rieger, M., & Demeyer, S. (1999). A language independent approach for detecting duplicated code. In Proceedings of the IEEE international conference on software maintenance (ICSM) (pp. 109–118).
go back to reference Evans, W., Fraser, C. W., & Ma, F. (2007). Clone detection via structural abstraction. In Proceedings of the IEEE working conference on reverse engineering (pp. 150–159). Evans, W., Fraser, C. W., & Ma, F. (2007). Clone detection via structural abstraction. In Proceedings of the IEEE working conference on reverse engineering (pp. 150–159).
go back to reference Fraser, C., Myers, E., & Wendt, A. (1984). Analyzing and compressing assembly code. In Proceedings of the ACM SIGPLAN symposium on compiler construction (Vol. 19, pp. 117–121). Fraser, C., Myers, E., & Wendt, A. (1984). Analyzing and compressing assembly code. In Proceedings of the ACM SIGPLAN symposium on compiler construction (Vol. 19, pp. 117–121).
go back to reference Griswold, R. E., & Griswold, M. T. (1996). The Icon programming language. Peer-to-Peer Communications. Griswold, R. E., & Griswold, M. T. (1996). The Icon programming language. Peer-to-Peer Communications.
go back to reference Griswold, W. G., & Notkin, D. (1993). Automated assistance for program restructuring. ACM Transactions on Software Engineering and Methodology, 2(3), 228–279.CrossRef Griswold, W. G., & Notkin, D. (1993). Automated assistance for program restructuring. ACM Transactions on Software Engineering and Methodology, 2(3), 228–279.CrossRef
go back to reference Hanson, D. R., & Proebsting, T. A. (2004). A research C# compiler. Software-Practice and Experience, 34(13), 1211–1224.CrossRef Hanson, D. R., & Proebsting, T. A. (2004). A research C# compiler. Software-Practice and Experience, 34(13), 1211–1224.CrossRef
go back to reference Jiang, L., Misherghi, G., Su, Z., & Glondu, S. (2007). DECKARD: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th international conference on software engineering (pp. 96–105). Jiang, L., Misherghi, G., Su, Z., & Glondu, S. (2007). DECKARD: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th international conference on software engineering (pp. 96–105).
go back to reference Kamiya, T., Kusumoto, S., & Inoue, K. (2002). CCFinder: A multi-linguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7), 654–670. Kamiya, T., Kusumoto, S., & Inoue, K. (2002). CCFinder: A multi-linguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7), 654–670.
go back to reference Karp, R. M., Miller, R. E., & Rosenberg, A. L. (1972). Rapid identification of repeated patterns in strings, trees, and arrays. In Proceedings of the ACM symposium on theory of computing (pp. 125–136). Karp, R. M., Miller, R. E., & Rosenberg, A. L. (1972). Rapid identification of repeated patterns in strings, trees, and arrays. In Proceedings of the ACM symposium on theory of computing (pp. 125–136).
go back to reference Komondoor, R., & Horwitz, S. (2001). Using slicing to identify duplication in source code. In Proceedings of the eighth international symposium on static analysis (pp. 40–56). Komondoor, R., & Horwitz, S. (2001). Using slicing to identify duplication in source code. In Proceedings of the eighth international symposium on static analysis (pp. 40–56).
go back to reference Kontogiannis, K. A., DeMori, R., Merlo, E., Galler, M., & Bernstein, M. (1996). Pattern matching for clone and concept detection. Automated Software Engineering, 3, 77–108.CrossRefMathSciNet Kontogiannis, K. A., DeMori, R., Merlo, E., Galler, M., & Bernstein, M. (1996). Pattern matching for clone and concept detection. Automated Software Engineering, 3, 77–108.CrossRefMathSciNet
go back to reference Koschke, R., Falke, R., & Frenzel, P. (2006). Clone detection using abstract syntax suffix trees. In Proceedings of the IEEE working conference on reverse engineering (pp. 253–262). Koschke, R., Falke, R., & Frenzel, P. (2006). Clone detection using abstract syntax suffix trees. In Proceedings of the IEEE working conference on reverse engineering (pp. 253–262).
go back to reference Li, Z., Lu, S., Myagmar, S., & Zhou, Y. (2006). CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Transactions on Software Engineering, 32(3), 176–192.CrossRef Li, Z., Lu, S., Myagmar, S., & Zhou, Y. (2006). CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Transactions on Software Engineering, 32(3), 176–192.CrossRef
go back to reference Ma, F. (2006). On the study of tree pattern matching algorithms and applications. Master’s thesis, Department of Computer Science, University of British Columbia. Ma, F. (2006). On the study of tree pattern matching algorithms and applications. Master’s thesis, Department of Computer Science, University of British Columbia.
go back to reference Mayrand, J., Leblanc, C., & Merlo, E. (1996). Experiment on the automatic detection of function clones in a software system using metrics. In Proceedings of the IEEE international conference on software maintenance (pp. 244–253). Mayrand, J., Leblanc, C., & Merlo, E. (1996). Experiment on the automatic detection of function clones in a software system using metrics. In Proceedings of the IEEE international conference on software maintenance (pp. 244–253).
go back to reference Seal, D. (Ed.) (2001). ARM architecture reference manual (2nd ed.). Addison-Wesley. Seal, D. (Ed.) (2001). ARM architecture reference manual (2nd ed.). Addison-Wesley.
go back to reference Sutter, B. D., Bus, B. D., & Bosschere, K. D. (2002). Sifting out the mud: Low level C++ code reuse. In Proceedings of the 17th ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications (pp. 275–291). Sutter, B. D., Bus, B. D., & Bosschere, K. D. (2002). Sifting out the mud: Low level C++ code reuse. In Proceedings of the 17th ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications (pp. 275–291).
go back to reference Yang, W. (1991). Identifying syntactic differences between two programs. Software-Practice and Experience, 21(7), 739–755.CrossRef Yang, W. (1991). Identifying syntactic differences between two programs. Software-Practice and Experience, 21(7), 739–755.CrossRef
go back to reference Zaki, M. J. (2002). Efficiently mining frequent trees in a forest. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80). Zaki, M. J. (2002). Efficiently mining frequent trees in a forest. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80).
Metadata
Title
Clone detection via structural abstraction
Authors
William S. Evans
Christopher W. Fraser
Fei Ma
Publication date
01-12-2009
Publisher
Springer US
Published in
Software Quality Journal / Issue 4/2009
Print ISSN: 0963-9314
Electronic ISSN: 1573-1367
DOI
https://doi.org/10.1007/s11219-009-9074-y

Other articles of this Issue 4/2009

Software Quality Journal 4/2009 Go to the issue

EditorialNotes

In this issue

Premium Partner