Skip to main content
Erschienen in: Software Quality Journal 4/2009

01.12.2009

Clone detection via structural abstraction

verfasst von: William S. Evans, Christopher W. Fraser, Fei Ma

Erschienen in: Software Quality Journal | Ausgabe 4/2009

Einloggen

Aktivieren Sie unsere intelligente Suche um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

This paper describes the design, implementation, and application of a new algorithm to detect cloned code. It operates on the abstract syntax trees formed by many compilers as an intermediate representation. It extends prior work by identifying clones even when arbitrary subtrees have been changed. These subtrees may represent structural rather than simply lexical code differences. In several hundred thousand lines of Java and C# code, 20–50% of the clones that we find involve these structural changes, which are not accounted for by previous methods. Our method also identifies cloning in declarations, so it is somewhat more general than conventional procedural abstraction.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Fußnoten
1
The notation emphasizes the fact that each hole may be filled with a different pattern.
 
2
This does not consider the cost of the r − 1 call instructions that replace r − 1 of the occurrences.
 
4
Let n, c, t, and ℓ be the number of nodes, characters, tokens, and lines in a file. For Java, n ≈ 0.55 c ≈ 4.0 t ≈ 13.5 ℓ. For C#, n ≈ 0.39 c ≈ 1.45 t ≈ 14.9 ℓ.
 
5
The entire clone comprises 231 nodes (21 source lines), contains one hole, and the clone occurs twice.
 
9
A trademark of Semantic Designs Inc.
 
10
They actually reported the fraction of sampled clone pairs that were unacceptable.
 
Literatur
Zurück zum Zitat Badros, G. J. (2000). Java ML: A markup language for Java source code. Computer Networks (Amsterdam, Netherlands), 33(1–6), 159–177. Badros, G. J. (2000). Java ML: A markup language for Java source code. Computer Networks (Amsterdam, Netherlands), 33(1–6), 159–177.
Zurück zum Zitat Baker, B. S. (1995). On finding duplication and near-duplication in large software systems. In Proceedings of the IEEE working conference on reverse engineering (pp. 86–95). Baker, B. S. (1995). On finding duplication and near-duplication in large software systems. In Proceedings of the IEEE working conference on reverse engineering (pp. 86–95).
Zurück zum Zitat Baker, B. S. (1997). Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM Journal on Computing, 26(5), 1343–1362.MATHCrossRefMathSciNet Baker, B. S. (1997). Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM Journal on Computing, 26(5), 1343–1362.MATHCrossRefMathSciNet
Zurück zum Zitat Baker, B. S. (2007). Finding clones with Dup: Analysis of an experiment. IEEE Transactions on Software Engineering, 33(9), 608–621.CrossRef Baker, B. S. (2007). Finding clones with Dup: Analysis of an experiment. IEEE Transactions on Software Engineering, 33(9), 608–621.CrossRef
Zurück zum Zitat Baker, B. S., & Manber, U. (1998). Deducing similarities in Java sources from bytecodes. In Proceedings of USENIX annual technical conference (pp. 179–190). Baker, B. S., & Manber, U. (1998). Deducing similarities in Java sources from bytecodes. In Proceedings of USENIX annual technical conference (pp. 179–190).
Zurück zum Zitat Baxter, I. D., Yahin, A., Moura, L., Sant’Anna, M., & Bier, L. (1998). Clone detection using abstract syntax trees. In Proceedings of the international conference on software maintenance (pp. 368–377). Baxter, I. D., Yahin, A., Moura, L., Sant’Anna, M., & Bier, L. (1998). Clone detection using abstract syntax trees. In Proceedings of the international conference on software maintenance (pp. 368–377).
Zurück zum Zitat Bellon, S. (2002). Vergleich von techniken zur erkennung duplizierten quellcodes. Master’s thesis, University of Stuttgart. Thesis number 1998. Bellon, S. (2002). Vergleich von techniken zur erkennung duplizierten quellcodes. Master’s thesis, University of Stuttgart. Thesis number 1998.
Zurück zum Zitat Bellon, S., Koschke, R., Antoniol, G., Krinke, J., & Merlo, E. (2007). Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering, 33(9), 577–591.CrossRef Bellon, S., Koschke, R., Antoniol, G., Krinke, J., & Merlo, E. (2007). Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering, 33(9), 577–591.CrossRef
Zurück zum Zitat Cheung, W., Evans, W., & Moses, J. (2003). Predicated instructions for code compaction. In Proceedings of the 7th international workshop on software and compilers for embedded systems (pp. 17–32). Cheung, W., Evans, W., & Moses, J. (2003). Predicated instructions for code compaction. In Proceedings of the 7th international workshop on software and compilers for embedded systems (pp. 17–32).
Zurück zum Zitat Chi, Y., Nijssen, S., Muntz, R. R., & Kok, J. N. (2005). Frequent subtree mining—an overview. Fundamenta Informaticae, 66(1–2), 161–198.MATHMathSciNet Chi, Y., Nijssen, S., Muntz, R. R., & Kok, J. N. (2005). Frequent subtree mining—an overview. Fundamenta Informaticae, 66(1–2), 161–198.MATHMathSciNet
Zurück zum Zitat Church, K., & Helfman, J. (1993). Dotplot: A program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics, 2(2), 153–174.CrossRef Church, K., & Helfman, J. (1993). Dotplot: A program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics, 2(2), 153–174.CrossRef
Zurück zum Zitat Cooper, K. D., & McIntosh, N. (1999). Enhanced code compression for embedded RISC processors. In ACM conference on programming language design and implementation (pp. 139–149). Cooper, K. D., & McIntosh, N. (1999). Enhanced code compression for embedded RISC processors. In ACM conference on programming language design and implementation (pp. 139–149).
Zurück zum Zitat Debray, S. K., Evans, W., Muth, R., & de Sutter, B. (2000). Compiler techniques for code compaction. ACM Transactions on Programming Languages and Systems, 22(2), 378–415.CrossRef Debray, S. K., Evans, W., Muth, R., & de Sutter, B. (2000). Compiler techniques for code compaction. ACM Transactions on Programming Languages and Systems, 22(2), 378–415.CrossRef
Zurück zum Zitat Ducasse, S., Rieger, M., & Demeyer, S. (1999). A language independent approach for detecting duplicated code. In Proceedings of the IEEE international conference on software maintenance (ICSM) (pp. 109–118). Ducasse, S., Rieger, M., & Demeyer, S. (1999). A language independent approach for detecting duplicated code. In Proceedings of the IEEE international conference on software maintenance (ICSM) (pp. 109–118).
Zurück zum Zitat Evans, W., Fraser, C. W., & Ma, F. (2007). Clone detection via structural abstraction. In Proceedings of the IEEE working conference on reverse engineering (pp. 150–159). Evans, W., Fraser, C. W., & Ma, F. (2007). Clone detection via structural abstraction. In Proceedings of the IEEE working conference on reverse engineering (pp. 150–159).
Zurück zum Zitat Fraser, C., Myers, E., & Wendt, A. (1984). Analyzing and compressing assembly code. In Proceedings of the ACM SIGPLAN symposium on compiler construction (Vol. 19, pp. 117–121). Fraser, C., Myers, E., & Wendt, A. (1984). Analyzing and compressing assembly code. In Proceedings of the ACM SIGPLAN symposium on compiler construction (Vol. 19, pp. 117–121).
Zurück zum Zitat Griswold, R. E., & Griswold, M. T. (1996). The Icon programming language. Peer-to-Peer Communications. Griswold, R. E., & Griswold, M. T. (1996). The Icon programming language. Peer-to-Peer Communications.
Zurück zum Zitat Griswold, W. G., & Notkin, D. (1993). Automated assistance for program restructuring. ACM Transactions on Software Engineering and Methodology, 2(3), 228–279.CrossRef Griswold, W. G., & Notkin, D. (1993). Automated assistance for program restructuring. ACM Transactions on Software Engineering and Methodology, 2(3), 228–279.CrossRef
Zurück zum Zitat Hanson, D. R., & Proebsting, T. A. (2004). A research C# compiler. Software-Practice and Experience, 34(13), 1211–1224.CrossRef Hanson, D. R., & Proebsting, T. A. (2004). A research C# compiler. Software-Practice and Experience, 34(13), 1211–1224.CrossRef
Zurück zum Zitat Jiang, L., Misherghi, G., Su, Z., & Glondu, S. (2007). DECKARD: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th international conference on software engineering (pp. 96–105). Jiang, L., Misherghi, G., Su, Z., & Glondu, S. (2007). DECKARD: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th international conference on software engineering (pp. 96–105).
Zurück zum Zitat Kamiya, T., Kusumoto, S., & Inoue, K. (2002). CCFinder: A multi-linguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7), 654–670. Kamiya, T., Kusumoto, S., & Inoue, K. (2002). CCFinder: A multi-linguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7), 654–670.
Zurück zum Zitat Karp, R. M., Miller, R. E., & Rosenberg, A. L. (1972). Rapid identification of repeated patterns in strings, trees, and arrays. In Proceedings of the ACM symposium on theory of computing (pp. 125–136). Karp, R. M., Miller, R. E., & Rosenberg, A. L. (1972). Rapid identification of repeated patterns in strings, trees, and arrays. In Proceedings of the ACM symposium on theory of computing (pp. 125–136).
Zurück zum Zitat Komondoor, R., & Horwitz, S. (2001). Using slicing to identify duplication in source code. In Proceedings of the eighth international symposium on static analysis (pp. 40–56). Komondoor, R., & Horwitz, S. (2001). Using slicing to identify duplication in source code. In Proceedings of the eighth international symposium on static analysis (pp. 40–56).
Zurück zum Zitat Kontogiannis, K. A., DeMori, R., Merlo, E., Galler, M., & Bernstein, M. (1996). Pattern matching for clone and concept detection. Automated Software Engineering, 3, 77–108.CrossRefMathSciNet Kontogiannis, K. A., DeMori, R., Merlo, E., Galler, M., & Bernstein, M. (1996). Pattern matching for clone and concept detection. Automated Software Engineering, 3, 77–108.CrossRefMathSciNet
Zurück zum Zitat Koschke, R., Falke, R., & Frenzel, P. (2006). Clone detection using abstract syntax suffix trees. In Proceedings of the IEEE working conference on reverse engineering (pp. 253–262). Koschke, R., Falke, R., & Frenzel, P. (2006). Clone detection using abstract syntax suffix trees. In Proceedings of the IEEE working conference on reverse engineering (pp. 253–262).
Zurück zum Zitat Li, Z., Lu, S., Myagmar, S., & Zhou, Y. (2006). CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Transactions on Software Engineering, 32(3), 176–192.CrossRef Li, Z., Lu, S., Myagmar, S., & Zhou, Y. (2006). CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Transactions on Software Engineering, 32(3), 176–192.CrossRef
Zurück zum Zitat Ma, F. (2006). On the study of tree pattern matching algorithms and applications. Master’s thesis, Department of Computer Science, University of British Columbia. Ma, F. (2006). On the study of tree pattern matching algorithms and applications. Master’s thesis, Department of Computer Science, University of British Columbia.
Zurück zum Zitat Mayrand, J., Leblanc, C., & Merlo, E. (1996). Experiment on the automatic detection of function clones in a software system using metrics. In Proceedings of the IEEE international conference on software maintenance (pp. 244–253). Mayrand, J., Leblanc, C., & Merlo, E. (1996). Experiment on the automatic detection of function clones in a software system using metrics. In Proceedings of the IEEE international conference on software maintenance (pp. 244–253).
Zurück zum Zitat Seal, D. (Ed.) (2001). ARM architecture reference manual (2nd ed.). Addison-Wesley. Seal, D. (Ed.) (2001). ARM architecture reference manual (2nd ed.). Addison-Wesley.
Zurück zum Zitat Sutter, B. D., Bus, B. D., & Bosschere, K. D. (2002). Sifting out the mud: Low level C++ code reuse. In Proceedings of the 17th ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications (pp. 275–291). Sutter, B. D., Bus, B. D., & Bosschere, K. D. (2002). Sifting out the mud: Low level C++ code reuse. In Proceedings of the 17th ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications (pp. 275–291).
Zurück zum Zitat Yang, W. (1991). Identifying syntactic differences between two programs. Software-Practice and Experience, 21(7), 739–755.CrossRef Yang, W. (1991). Identifying syntactic differences between two programs. Software-Practice and Experience, 21(7), 739–755.CrossRef
Zurück zum Zitat Zaki, M. J. (2002). Efficiently mining frequent trees in a forest. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80). Zaki, M. J. (2002). Efficiently mining frequent trees in a forest. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80).
Metadaten
Titel
Clone detection via structural abstraction
verfasst von
William S. Evans
Christopher W. Fraser
Fei Ma
Publikationsdatum
01.12.2009
Verlag
Springer US
Erschienen in
Software Quality Journal / Ausgabe 4/2009
Print ISSN: 0963-9314
Elektronische ISSN: 1573-1367
DOI
https://doi.org/10.1007/s11219-009-9074-y

Weitere Artikel der Ausgabe 4/2009

Software Quality Journal 4/2009 Zur Ausgabe

EditorialNotes

In this issue