Skip to main content
Erschienen in: Empirical Software Engineering 6/2008

01.12.2008

Empirical evaluation of clone detection using syntax suffix trees

verfasst von: Raimar Falke, Pierre Frenzel, Rainer Koschke

Erschienen in: Empirical Software Engineering | Ausgabe 6/2008

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Reusing software through copying and pasting is a continuous plague in software development despite the fact that it creates serious maintenance problems. Various techniques have been proposed to find duplicated redundant code (also known as software clones). A recent study has compared these techniques and shown that token-based clone detection based on suffix trees is fast but yields clone candidates that are often not syntactic units. Current techniques based on abstract syntax trees—on the other hand—find syntactic clones but are considerably less efficient. This paper describes how we can make use of suffix trees to find syntactic clones in abstract syntax trees. This new approach is able to find syntactic clones in linear time and space. The paper reports the results of a large case study in which we empirically compare the new technique to other techniques using the Bellon benchmark for clone detectors. The Bellon benchmark consists of clone pairs validated by humans for eight software systems written in C or Java from different application domains. The new contributions of this paper over the conference paper are the additional analysis of Java programs, the exploration of an alternative path that uses parse trees instead of abstract syntax trees, and the investigation of the impact on recall and precision when clone analyses insist on consistent parameter renaming.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Fußnoten
2
A trie is an ordered tree data structure that is used to store an associative array where the keys are strings.
 
3
An alternative to suffix trees are suffix arrays, which offer the advantage of less space consumption (Manber and Myers 1991) but at the cost of more runtime.
 
4
Please be reminded that (a, b, l) denotes a clone pair starting at token index a and b, respectively, with length l.
 
6
Personal communication; March 2007.
 
7
Personal communication at Dagstuhl, July 2006.
 
Literatur
Zurück zum Zitat Antoniol G, Casazza G, Penta MD, Merlo E (2001) Modeling clones evolution through time series. In: International conference on software maintenance. IEEE CS Press, pp 273–280 Antoniol G, Casazza G, Penta MD, Merlo E (2001) Modeling clones evolution through time series. In: International conference on software maintenance. IEEE CS Press, pp 273–280
Zurück zum Zitat Antoniol G, Villano U, Merlo E, Penta M (2002) Analyzing cloning evolution in the linux kernel. Inf Softw Technol 44(13):755–765CrossRef Antoniol G, Villano U, Merlo E, Penta M (2002) Analyzing cloning evolution in the linux kernel. Inf Softw Technol 44(13):755–765CrossRef
Zurück zum Zitat Bailey J, Burd E (2002) Evaluating clone detection tools for use during preventative maintenance. In: SCAM Bailey J, Burd E (2002) Evaluating clone detection tools for use during preventative maintenance. In: SCAM
Zurück zum Zitat Baker BS (1992) A program for identifying duplicated code. In: Computer science and statistics 24: Proceedings of the 24th symposium on the interface Baker BS (1992) A program for identifying duplicated code. In: Computer science and statistics 24: Proceedings of the 24th symposium on the interface
Zurück zum Zitat Baker BS (1995) On finding duplication and near-duplication in large software systems. In: Working conference on reverse engineering. IEEE CS Press Baker BS (1995) On finding duplication and near-duplication in large software systems. In: Working conference on reverse engineering. IEEE CS Press
Zurück zum Zitat Baker BS (1996) Parameterized pattern matching: algorithms and applications. JCSS Baker BS (1996) Parameterized pattern matching: algorithms and applications. JCSS
Zurück zum Zitat Baker BS (2007) Finding clones with Dup: analysis of an experiment. IEEE Trans Softw Eng 33(9):608–621 (September)CrossRef Baker BS (2007) Finding clones with Dup: analysis of an experiment. IEEE Trans Softw Eng 33(9):608–621 (September)CrossRef
Zurück zum Zitat Baker BS, Giancarlo R (2002) Sparse dynamic programming for longest common subsequence from fragments. J Algorithms 42(2):231–254 (February)MATHCrossRefMathSciNet Baker BS, Giancarlo R (2002) Sparse dynamic programming for longest common subsequence from fragments. J Algorithms 42(2):231–254 (February)MATHCrossRefMathSciNet
Zurück zum Zitat Balazinska M, Merlo E, Dagenais M, Lague B, Kontogiannis K (1999) Measuring clone based reengineering opportunities. In: IEEE symposium on software metrics. IEEE CS Press, pp 292–303 (November) Balazinska M, Merlo E, Dagenais M, Lague B, Kontogiannis K (1999) Measuring clone based reengineering opportunities. In: IEEE symposium on software metrics. IEEE CS Press, pp 292–303 (November)
Zurück zum Zitat Balazinska M, Merlo E, Dagenais M, Lague B, Kontogiannis K (2000) Advanced clone-analysis to support object-oriented system refactoring. In: Working conference on reverse engineering, IEEE CS Press, pp 98–107 (October) Balazinska M, Merlo E, Dagenais M, Lague B, Kontogiannis K (2000) Advanced clone-analysis to support object-oriented system refactoring. In: Working conference on reverse engineering, IEEE CS Press, pp 98–107 (October)
Zurück zum Zitat Baxter ID, Yahin A, Moura L, Sant’Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: ICSM Baxter ID, Yahin A, Moura L, Sant’Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: ICSM
Zurück zum Zitat Bellon S (2002) Vergleich von Techniken zur Erkennung duplizierten Quellcodes. Master’s thesis, University of Stuttgart, Germany Bellon S (2002) Vergleich von Techniken zur Erkennung duplizierten Quellcodes. Master’s thesis, University of Stuttgart, Germany
Zurück zum Zitat Bellon S, Koschke R, Antoniol G, Krinke J, Merlo E (2007) Comparison and evaluation of clone detection tools. IEEE Trans Softw Eng 33(9):577–591 (September)CrossRef Bellon S, Koschke R, Antoniol G, Krinke J, Merlo E (2007) Comparison and evaluation of clone detection tools. IEEE Trans Softw Eng 33(9):577–591 (September)CrossRef
Zurück zum Zitat Bentley JL, Ottmann TA (1979) Algorithms for reporting and counting geometric intersections. IEEE Trans Comput C-28:643–647CrossRef Bentley JL, Ottmann TA (1979) Algorithms for reporting and counting geometric intersections. IEEE Trans Comput C-28:643–647CrossRef
Zurück zum Zitat Bruntink M, van Deursen A, Tourwe T, van Engelen R (2004) An evaluation of clone detection techniques for crosscutting concerns. In: International conference on software maintenance, pp 200–209 Bruntink M, van Deursen A, Tourwe T, van Engelen R (2004) An evaluation of clone detection techniques for crosscutting concerns. In: International conference on software maintenance, pp 200–209
Zurück zum Zitat Bruntink M, van Engelen R, Tourwe T (2005) On the use of clone detection for identifying crosscutting concern code. IEEE Trans Softw Eng 31(10):804–818CrossRef Bruntink M, van Engelen R, Tourwe T (2005) On the use of clone detection for identifying crosscutting concern code. IEEE Trans Softw Eng 31(10):804–818CrossRef
Zurück zum Zitat Chou A, Yang J, Chelf B, Hallem S, Engler DR (2001) An empirical study of operating system errors. In: Symposium on operating systems principles, pp 73–88 Chou A, Yang J, Chelf B, Hallem S, Engler DR (2001) An empirical study of operating system errors. In: Symposium on operating systems principles, pp 73–88
Zurück zum Zitat Cordy JR (2003) Comprehending reality—practical barriers to industrial adoption of software maintenance automation. In: International workshop on program comprehension, IEEE CS Press Cordy JR (2003) Comprehending reality—practical barriers to industrial adoption of software maintenance automation. In: International workshop on program comprehension, IEEE CS Press
Zurück zum Zitat Cordy JR, Dean TR, Synytskyy N (2004) Practical language-independent detection of near-miss clones. In: CASCON, IBM Press Cordy JR, Dean TR, Synytskyy N (2004) Practical language-independent detection of near-miss clones. In: CASCON, IBM Press
Zurück zum Zitat Di Lucca G, Di Penta M, Fasolino, A (2002) An approach to identify duplicated web pages. In: COMPSAC Di Lucca G, Di Penta M, Fasolino, A (2002) An approach to identify duplicated web pages. In: COMPSAC
Zurück zum Zitat Ducasse S, Rieger M, Demeyer S (1999) A language independent approach for detecting duplicated code. In: ICSM Ducasse S, Rieger M, Demeyer S (1999) A language independent approach for detecting duplicated code. In: ICSM
Zurück zum Zitat Fowler M (1999) Refactoring: improving the design of existing code. Addison Wesley, Boston, MA, USA Fowler M (1999) Refactoring: improving the design of existing code. Addison Wesley, Boston, MA, USA
Zurück zum Zitat Gitchell D, Tran N (1999) Sim: a utility for detecting similarity in computer programs. In: SIGCSE, ACM Press Gitchell D, Tran N (1999) Sim: a utility for detecting similarity in computer programs. In: SIGCSE, ACM Press
Zurück zum Zitat Godfrey M, Tu Q (2001) Growth, evolution and structural change in open source software. In: Workshop on principles of software evolution (September) Godfrey M, Tu Q (2001) Growth, evolution and structural change in open source software. In: Workshop on principles of software evolution (September)
Zurück zum Zitat Higo Y, Ueda Y, Kamiya T, Kusumoto S, Inoue K (2002) On software maintenance process improvement based on code clone analysis. In: International conference on product focused software process improvement. Lecture notes in computer science, vol 2559. Springer Higo Y, Ueda Y, Kamiya T, Kusumoto S, Inoue K (2002) On software maintenance process improvement based on code clone analysis. In: International conference on product focused software process improvement. Lecture notes in computer science, vol 2559. Springer
Zurück zum Zitat Johnson JH (1993) Identifying redundancy in source code using fingerprints. In: CASCON, IBM Press Johnson JH (1993) Identifying redundancy in source code using fingerprints. In: CASCON, IBM Press
Zurück zum Zitat Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multi-linguistic token-based code clone detection system for large scale source code. IEEE Trans Softw Eng 28(7):654–670CrossRef Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multi-linguistic token-based code clone detection system for large scale source code. IEEE Trans Softw Eng 28(7):654–670CrossRef
Zurück zum Zitat Kapser C, Godfrey M (2003a) A taxonomy of clones in source code: the reengineers most wanted list. In: Working conference on reverse engineering. IEEE CS Press Kapser C, Godfrey M (2003a) A taxonomy of clones in source code: the reengineers most wanted list. In: Working conference on reverse engineering. IEEE CS Press
Zurück zum Zitat Kapser C, Godfrey MW (2003b) Toward a taxonomy of clones in source code: a case study. In: Evolution of large scale industrial software architectures Kapser C, Godfrey MW (2003b) Toward a taxonomy of clones in source code: a case study. In: Evolution of large scale industrial software architectures
Zurück zum Zitat Kapser C, Godfrey M (2005) Improved tool support for the investigation of duplication in software. Proceedings of the 21st IEEE international conference on software maintenance Kapser C, Godfrey M (2005) Improved tool support for the investigation of duplication in software. Proceedings of the 21st IEEE international conference on software maintenance
Zurück zum Zitat Kapser C, Godfrey MW (2006) “Clones considered harmful” considered harmful. In: Working conference on reverse engineering Kapser C, Godfrey MW (2006) “Clones considered harmful” considered harmful. In: Working conference on reverse engineering
Zurück zum Zitat Kim M, Bergman L, Lau T, Notkin D (2004) An ethnographic study of copy and paste programming practices in OOPL. In: International symposium on empirical software engineering. IEEE CS Press, pp 83–92 Kim M, Bergman L, Lau T, Notkin D (2004) An ethnographic study of copy and paste programming practices in OOPL. In: International symposium on empirical software engineering. IEEE CS Press, pp 83–92
Zurück zum Zitat Kim M, Sazawal V, Notkin D, Murphy GC (2005) An empirical study of code clone genealogies. In: European software engineering conference and foundations of software engineering (ESEC/FSE) Kim M, Sazawal V, Notkin D, Murphy GC (2005) An empirical study of code clone genealogies. In: European software engineering conference and foundations of software engineering (ESEC/FSE)
Zurück zum Zitat Komondoor R, Horwitz S (2001) Using slicing to identify duplication in source code. In: Proc. int. symposium on static analysis Komondoor R, Horwitz S (2001) Using slicing to identify duplication in source code. In: Proc. int. symposium on static analysis
Zurück zum Zitat Kontogiannis K, Mori RD, Merlo E, Galler M, Bernstein M (1996) Pattern matching for clone and concept detection. Autom Softw Eng 3(1/2):79–108 Kontogiannis K, Mori RD, Merlo E, Galler M, Bernstein M (1996) Pattern matching for clone and concept detection. Autom Softw Eng 3(1/2):79–108
Zurück zum Zitat Koschke R, Girard JF, Würthner M (1998) Intermediate representations for reverse engineering. In: Working conference on reverse engineering. IEEE CS Press, pp 241–250 Koschke R, Girard JF, Würthner M (1998) Intermediate representations for reverse engineering. In: Working conference on reverse engineering. IEEE CS Press, pp 241–250
Zurück zum Zitat Koschke R, Falke R, Frenzel P (2006) Clone detection using abstract syntax suffix trees. In: Working conference on reverse engineering. IEEE CS Press, pp 253–262 Koschke R, Falke R, Frenzel P (2006) Clone detection using abstract syntax suffix trees. In: Working conference on reverse engineering. IEEE CS Press, pp 253–262
Zurück zum Zitat Krinke J (2001) Identifying similar code with program dependence graphs. In: WCRE Krinke J (2001) Identifying similar code with program dependence graphs. In: WCRE
Zurück zum Zitat Lague B, Proulx D, Mayrand J, Merlo E, Hudepohl J (1997) Assessing the benefits of incorporating function clone detection in a development process. In: International conference on software maintenance, pp 314–321 Lague B, Proulx D, Mayrand J, Merlo E, Hudepohl J (1997) Assessing the benefits of incorporating function clone detection in a development process. In: International conference on software maintenance, pp 314–321
Zurück zum Zitat Laguë B, Proulx D, Mayrand J, Merlo EM, Hudepohl J (1997) Assessing the benefits of incorporating function clone detection in a development process. In: ICSM Laguë B, Proulx D, Mayrand J, Merlo EM, Hudepohl J (1997) Assessing the benefits of incorporating function clone detection in a development process. In: ICSM
Zurück zum Zitat Lanubile F, Mallardo T (2003) Finding function clones in web applications. In: Conference on software maintenance and reengineering Lanubile F, Mallardo T (2003) Finding function clones in web applications. In: Conference on software maintenance and reengineering
Zurück zum Zitat Leitao AM (2003) Detection of redundant code using R2D2. In: Workshop source code analysis and manipulation. IEEE CS Press Leitao AM (2003) Detection of redundant code using R2D2. In: Workshop source code analysis and manipulation. IEEE CS Press
Zurück zum Zitat Li Z, Lu S, Myagmar S, Zhou Y (2004) Cp-miner: a tool for finding copy-paste and related bugs in operating system code. In: Operating system design and implementation, pp 289–302 Li Z, Lu S, Myagmar S, Zhou Y (2004) Cp-miner: a tool for finding copy-paste and related bugs in operating system code. In: Operating system design and implementation, pp 289–302
Zurück zum Zitat Li Z, Lu S, Myagmar S, Zhou Y (2006) Copy-paste and related bugs in large-scale software code. IEEE Trans Softw Eng 32(3):176–192 (March)CrossRef Li Z, Lu S, Myagmar S, Zhou Y (2006) Copy-paste and related bugs in large-scale software code. IEEE Trans Softw Eng 32(3):176–192 (March)CrossRef
Zurück zum Zitat Manber U, Myers G (1991) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22(5):935–948 (October)CrossRefMathSciNet Manber U, Myers G (1991) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22(5):935–948 (October)CrossRefMathSciNet
Zurück zum Zitat Marcus A, Maletic J (2001) Identification of high-level concept clones in source code. In: Conference on automated software engineering Marcus A, Maletic J (2001) Identification of high-level concept clones in source code. In: Conference on automated software engineering
Zurück zum Zitat Mayrand J, Leblanc C, Merlo EM (1996) Experiment on the automatic detection of function clones in a software system using metrics. In: ICSM. IEEE Computer Society Press Mayrand J, Leblanc C, Merlo EM (1996) Experiment on the automatic detection of function clones in a software system using metrics. In: ICSM. IEEE Computer Society Press
Zurück zum Zitat Monden A, Nakae D, Kamiya T, Sato S, Matsumoto K (2002) Software quality analysis by code clones in industrial legacy software. In: IEEE symposium on software metrics, pp 87–94 Monden A, Nakae D, Kamiya T, Sato S, Matsumoto K (2002) Software quality analysis by code clones in industrial legacy software. In: IEEE symposium on software metrics, pp 87–94
Zurück zum Zitat Prechelt L, Malpohl G, Philippsen M (2000) Jplag: finding plagiarisms among a set of programs. Technical report, University of Karlsruhe, Department of Informatics Prechelt L, Malpohl G, Philippsen M (2000) Jplag: finding plagiarisms among a set of programs. Technical report, University of Karlsruhe, Department of Informatics
Zurück zum Zitat Rieger M (2005) Effective clone detection without language barriers. Dissertation, University of Bern, Switzerland Rieger M (2005) Effective clone detection without language barriers. Dissertation, University of Bern, Switzerland
Zurück zum Zitat Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: Proceedings of the SIGMOD, pp 76–85 Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: Proceedings of the SIGMOD, pp 76–85
Zurück zum Zitat Shannon CE (1984) A mathematical theory of communication. Bell Syst Tech J 27(379–423):623–656MathSciNet Shannon CE (1984) A mathematical theory of communication. Bell Syst Tech J 27(379–423):623–656MathSciNet
Zurück zum Zitat Van Rysselberghe F, Demeyer S (2004) Evaluating clone detection techniques from a refactoring perspective. In: Conference on automated software engineering Van Rysselberghe F, Demeyer S (2004) Evaluating clone detection techniques from a refactoring perspective. In: Conference on automated software engineering
Zurück zum Zitat Walter V, Seipel D, von Gudenberg JW, Fischer G (2004) Clone detection in source code by frequent itemset techniques. In: Workshop source code analysis and manipulation Walter V, Seipel D, von Gudenberg JW, Fischer G (2004) Clone detection in source code by frequent itemset techniques. In: Workshop source code analysis and manipulation
Zurück zum Zitat Yang W (1991) Identifying syntactic differences between two programs. Software Pract Ex 21(7):739–755CrossRef Yang W (1991) Identifying syntactic differences between two programs. Software Pract Ex 21(7):739–755CrossRef
Metadaten
Titel
Empirical evaluation of clone detection using syntax suffix trees
verfasst von
Raimar Falke
Pierre Frenzel
Rainer Koschke
Publikationsdatum
01.12.2008
Verlag
Springer US
Erschienen in
Empirical Software Engineering / Ausgabe 6/2008
Print ISSN: 1382-3256
Elektronische ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-008-9073-9

Weitere Artikel der Ausgabe 6/2008

Empirical Software Engineering 6/2008 Zur Ausgabe