2021 | Book

Code Clone Analysis

Research, Tools, and Practices

Editors: Dr. Katsuro Inoue, Chanchal K. Roy

Publisher: Springer Singapore


About this book

This is the first book organized around code clone analysis. To cover the broad studies of code clone analysis, this book selects past research results that are important to the progress of the field and updates them with new results and future directions.

The first chapter provides an introduction for readers who are inexperienced in the foundation of code clone analysis, defines clones and related terms, and discusses the classification of clones. The chapters that follow are categorized into three main parts to present 1) major tools for code clone analysis, 2) fundamental topics such as evaluation benchmarks, clone visualization, code clone searches, and code similarities, and 3) applications to actual problems. Each chapter includes a valuable reference list that will help readers to achieve a comprehensive understanding of this diverse field and to catch up with the latest research results.

Code clone analysis relies heavily on computer science theories such as pattern matching algorithms, computer language, and software metrics. Consequently, code clone analysis can be applied to a variety of real-world tasks in software development and maintenance such as bug finding and program refactoring. This book will also be useful in designing an effective curriculum that combines theory and application of code clone analysis in university software engineering courses.

Table of Contents

Correction to: NiCad: A Modern Clone Detector
Manishankar Mondal, Chanchal K. Roy, James R. Cordy

Introduction to Code Clone

Introduction to Code Clone Analysis
A code clone is a code snippet that has an identical or similar counterpart in the same or a different software system. The existence of code clones is an issue in software maintenance and a clue to understanding the structure and evolution of software systems. A large number of studies on code clones have been performed, and many tools for code clone analysis have been developed. In this chapter, we explain some of the terms that are important for understanding code clones, such as definition, type, analysis granularity, and analysis domain. We also outline the approaches and applications of code clone analysis.
Katsuro Inoue

Code Clone Analysis Tools

CCFinderX: An Interactive Code Clone Analysis Environment
CCFinderX is a successor tool that advances the concepts of CCFinder (2002) and Gemini (2002), and is a standalone environment for code clone detection and analysis. CCFinderX is designed as an interactive analysis environment that allows users to switch between views of scatter plot, file, metrics, and source text, for applying to large code bodies. This chapter describes the features of these views, such as the display content, operations for coordination between views, metrics for files and clone classes, and the revamped code clone detection algorithm.
Toshihiro Kamiya
NiCad: A Modern Clone Detector
Code clones are exactly or nearly similar code fragments in the code-base of a software system. Studies have revealed that such code fragments can have mixed impact (both positive and negative) on software evolution and maintenance. In order to reduce the negative impact of code clones and benefit from their advantages, researchers have suggested a number of different clone management techniques. Clone management begins with clone detection. Clone detection has thus been a hot research topic, resulting in many different clone detectors [26, 32] that have been used in a range of applications, including clone analysis, refactoring, and tracking. One of those that has been widely used and investigated is NiCad [3, 27]. What follows is a brief overview of the NiCad detection mechanism and its application in various studies of code clones.
Manishankar Mondal, Chanchal K. Roy, James R. Cordy
SourcererCC: Scalable and Accurate Clone Detection
Clone detection is an active area of research. However, there is a marked lack of clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones where significant editing activities may take place in the cloned code. SourcererCC was developed as an attempt to fill this gap. It is a widely used token-based clone detector that targets three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. In the evaluation experiments, SourcererCC demonstrated both high recall and precision, and the ability to scale to a large inter-project repository (250MLOC) even using a standard workstation. This chapter reflects on some of the principal design decisions behind the success of SourcererCC and also presents an architecture to scale it horizontally.
Hitesh Sajnani, Vaibhav Saini, Chanchal K. Roy, Cristina Lopes
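The idea of querying an inverted index for candidate clones and then judging each candidate by token overlap can be illustrated with a minimal sketch. This is not SourcererCC's implementation: the function names, the regular-expression tokenizer, the overlap measure, and the 0.8 threshold are all assumptions chosen for illustration, and the token-ordering heuristics that shrink the index are omitted.

```python
import re
from collections import Counter, defaultdict


def tokenize(code):
    """Hypothetical tokenizer: identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def overlap_similarity(bag_a, bag_b):
    """Shared tokens (multiset intersection) divided by the larger bag size."""
    larger = max(sum(bag_a.values()), sum(bag_b.values()))
    if larger == 0:
        return 0.0
    return sum((bag_a & bag_b).values()) / larger


def find_clone_pairs(blocks, threshold=0.8):
    """Return (earlier_block, later_block, similarity) for pairs above threshold.

    `blocks` maps a block name to its source text. Every token is indexed
    here; a real tool would index only a subset to keep the index small.
    """
    bags = {name: Counter(tokenize(src)) for name, src in blocks.items()}
    index = defaultdict(set)  # token -> names of already indexed blocks
    pairs = []
    for name, bag in bags.items():
        # Candidates are the indexed blocks sharing at least one token.
        candidates = set()
        for token in bag:
            candidates |= index[token]
        for other in sorted(candidates):
            similarity = overlap_similarity(bag, bags[other])
            if similarity >= threshold:
                pairs.append((other, name, similarity))
        for token in bag:
            index[token].add(name)
    return pairs
```

Calling `find_clone_pairs` on a dictionary of named code blocks returns only the pairs whose overlap reaches the threshold, and blocks that share no token are never compared at all.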
Oreo: Scaling Clone Detection Beyond Near-Miss Clones
With recent advancements in the field of code clone detection, researchers have made it possible to scale to large datasets. The scope of scalable and accurate clone detection, however, was limited to Type-1, Type-2, and near-miss Type-3 clones. Most clone detectors fail to detect clones beyond the near-miss Type-3 category as it becomes hard to detect such clones in a scalable manner. There are two main challenges in identifying clones beyond the Type-3 category: (1) syntactical similarity is low between such complex clones and (2) comparing code snippets leads to prohibitive quadratic comparisons, which causes candidate explosion and leads to scalability issues. Oreo introduces a novel semantic filter named Action filter, which filters out a large number of code pairs that do not share semantic similarities, thereby addressing the candidate explosion issue. Moreover, the candidates that pass this filter have high semantic similarity, which leads to the detection of complex and semantically similar clones. As many semantically similar candidates may not be clones, Oreo uses a deep learning model to validate the structural similarity between the semantically similar candidates, which leads to greater accuracy in clone detection. Oreo demonstrated a broader range of clone detection, high recall, precision, speed, and the ability to scale to a large inter-project repository (250MLOC) using a standard workstation. This chapter aims to describe the design decisions and concepts which enabled Oreo to take scalable and accurate clone detection beyond the near-miss clones.
Vaibhav Saini, Farima Farmahinifarahani, Hitesh Sajnani, Cristina Lopes
CCLearner: Clone Detection via Deep Learning
To facilitate clone maintenance, various automated tools were proposed to detect code clones by identifying similar token sequences or similar program syntactic structures in source code. They achieved different trade-offs between precision and recall. Inspired by prior work, we developed a new approach, CCLearner, a solely token-based clone detection approach using deep learning. Given known clone pairs and non-clone pairs, CCLearner extracts features from each code pair and leverages the features to train a classifier. The classifier is then used to compare methods pair-by-pair in a given codebase to detect clones. We evaluated CCLearner by reusing an existing benchmark of real clone code, BigCloneBench. We split the benchmark such that some data was used for classifier training, and some data was used for testing. With the testing data, we evaluated CCLearner’s effectiveness of clone detection, and also assessed three existing popular clone detection tools: SourcererCC, NiCad, and Deckard. CCLearner outperformed existing tools by achieving a better trade-off between precision and recall. To further investigate whether other machine learning algorithms can perform comparably to deep learning, we replaced deep learning with five alternative machine learning algorithms in CCLearner, and observed that CCLearner worked best when using deep learning.
Liuqing Li, He Feng, Na Meng, Barbara Ryder

Research Basis of Code Clone

BigCloneBench
Many clone detection tools and techniques have been created to tackle various use-cases, including syntactical clone detection, semantic clone detection, inter-project clone detection, large-scale clone detection and search, and so on. While a few clone benchmarks are available, none target this breadth of usage. BigCloneBench is a clone benchmark designed to evaluate clone detection tools across a variety of use-cases. It was built by mining a large inter-project source repository for functions implementing known functionalities. This produced a large benchmark of inter-project and intra-project semantic clones across the full spectrum of syntactical similarity. The benchmark is augmented with an evaluation framework named BigCloneEval which simplifies tool evaluation studies and allows the user to slice the benchmark based on the clone properties in order to evaluate for a particular use-case. We have used BigCloneBench in a number of studies that demonstrate its value, as well as show where it has been used by the research community. In this chapter, we discuss the clone benchmarking theory and the existing benchmarks, describe the BigCloneBench creation process, and overview the BigCloneEval evaluation procedure. We conclude by summarizing BigCloneBench’s usage in the literature, and present ideas for future improvements and expansion of the benchmark.
Jeffrey Svajlenko, Chanchal K. Roy
Visualization of Clones
Identifying similar code fragments, referred to as code clones, is beneficial in software re-engineering and maintenance. Various visualization techniques have been developed to present cloning information for programmers in a more useful and comprehensible manner. This chapter provides a summary of the state of the art in visualizing software clones, along with a classification of visualizations according to the supported user goals and the information needed to achieve those goals. It also presents an assessment of clone visualizations on the basis of clone relations and clone granularity.
Muhammad Hammad, Hamid Abdul Basit, Stan Jarzabek, Rainer Koschke
Source Code Clone Search
Identifying similarities in source code is the main challenge for reuse, plagiarism, and code clone detection. Code clone search has emerged as a new research branch in clone detection, aiming to provide similarity search functionality for code snippets. While clone search shares its fundamentals with clone detection, both its objective and requirements differ significantly. Clone search focuses on search engines that are designed to find clones of a single input code snippet (i.e., query) from a large set of code snippets (i.e., corpus). Scalability, short response time, and the ability to rank result sets are among the major challenges that a clone search engine has to deal with. In this chapter, we identify and define major concepts related to clone search. We then present a framework that summarizes the architecture of a clone search engine and enables us to provide a systematic view of the internals of such an engine. Finally, we discuss how to benchmark and evaluate the performance of clone search engines. The discussion includes a set of measures that are helpful in evaluating clone search engines.
Iman Keivanloo, Juergen Rilling
Code Similarity in Clone Detection
Clone detection is one application of measuring the similarity of code. However, clone and plagiarism detectors use very different representations of source code and different techniques to identify similar code fragments. This chapter investigates the impact of source code representation (i.e. tokenisation and renaming of identifiers and literals) and the impact of similarity measurements (e.g. Jaccard index or Kondrak’s distance over n-grams) for measuring source code similarity on two known datasets. A comparison using average precision at k with dedicated clone and plagiarism detectors shows that simple similarity measurements like Kondrak’s distance using n-grams over tokenised source code usually outperform specialised tools for the detection of similar, cloned, plagiarised or duplicated code.
Jens Krinke, Chaiyong Ragkhitwetsagul
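As a rough illustration of one measurement named in the abstract above, the Jaccard index over n-grams of tokenised source code can be sketched as follows. The tokenizer, the default n of 3, and all function names are assumptions made for this example rather than the chapter's actual setup, and no identifier or literal renaming is applied.

```python
import re


def tokenize(code):
    """Hypothetical tokenizer: identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngrams(tokens, n=3):
    """Set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    if not set_a and not set_b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(set_a & set_b) / len(set_a | set_b)


def code_similarity(code_a, code_b, n=3):
    """Jaccard index over token n-grams of two code fragments."""
    return jaccard(ngrams(tokenize(code_a), n), ngrams(tokenize(code_b), n))
```

For instance, `code_similarity("a = b + c", "a = b + d")` compares three trigrams on each side, two of which are shared, giving 2 / 4 = 0.5.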
Is Late Propagation a Harmful Code Clone Evolutionary Pattern? An Empirical Study
Two similar code segments, or clones, form a clone pair within a software system. The changes to the clones over time create a clone evolution history. Late propagation is a specific pattern of clone evolution. In late propagation, one clone in the clone pair is modified, causing the clone pair to become inconsistent. The code segments are then re-synchronized in a later revision. Existing work has established late propagation as a clone evolution pattern, and suggested that the pattern is related to a high number of faults. In this chapter, we replicate and extend the work by Barbour et al. (27th IEEE International Conference on Software Maintenance (ICSM), IEEE, 2011 [1]) by examining the characteristics of late propagation in 10 long-lived open-source software systems using the iClones clone detection tool. We identify eight types of late propagation and investigate their fault-proneness. Our results confirm that late propagation is the more harmful clone evolution pattern and that some specific cases of late propagation are more harmful than others. We trained machine learning models using 18 clone evolution related features to predict the evolution of late propagation and achieved high precision within the range of 0.91–0.94 and AUC within the range of 0.87–0.91.
Osama Ehsan, Liliane Barbour, Foutse Khomh, Ying Zou
A Summary on the Stability of Code Clones and Current Research Trends
Code clones are exactly or nearly similar code pieces in the source code files of a software system. They are mainly created by the frequent copy/paste activities of programmers during development. Many studies have examined the impacts of code clones on software evolution and maintenance. We performed a comprehensive study on clone stability in order to understand whether clone or non-clone code in a software system is more change-prone. Intuitively, code pieces with higher change-proneness (lower stability) will require higher maintenance effort and cost during software evolution. According to our study, code clones are more change-prone than non-clone code in general and thus, code clones are likely to require a higher maintenance effort and cost. We suggest that code clones should be managed with proper tool support so that we can get rid of their negative impacts and benefit from their positive impacts. This document provides a brief summary of our study on clone stability. It also discusses the studies that were done mostly after the publication of our study. Finally, it mentions some possible future work on the basis of the findings of the existing studies.
Manishankar Mondal, Chanchal K. Roy, Kevin A. Schneider

Applying Clone Technology in Practice

Identifying Refactoring-Oriented Clones and Inferring How They Can Be Merged
Our research group has been working on code clones for more than 20 years. In this chapter, I review our work on merging clones published in 2008 (Higo et al. in J Soft Mainten Evolut 20:435–461, 2008 [3]), introduce two subsequent studies, and discuss prospects for future research.
Yoshiki Higo
Clone Evolution and Management
Programmers tend to write code clones unintentionally, which can be easily avoided. Clone change management is a crucial issue in open-source software (OSS) and industrial software development (e.g., development of social infrastructure, financial systems, and medical equipment). When industrial software developers have to fix a defect, they must find the code clones of the code fragment that contains it. To date, several studies have been conducted on the analysis of clone evolution using OSS. However, to the best of our knowledge, only a few studies have reported on the application of a clone change notification system to the industrial software development process. In this chapter, we first introduce a system that notifies developers about the creation of code clones. Then, we report on our experience with the system after a 40-day application of it in a corporation’s software development process. In the industrial application, a developer successfully identified ten unintentionally created clones that should be merged. Moreover, we introduce the improvements that have been made since we released the initial version of the notification system. We also demonstrate a usage scenario of the current version. The current version of Clone Notifier and its video are available at: https://github.com/s-tokui/CloneNotifier.
Norihiro Yoshida, Eunjong Choi
Sometimes, Cloning Is a Sound Design Decision!
The practice of copy-paste-edit, also known as code cloning, has always been popular with software developers; however, evidence suggests that code cloning also carries risks: code bloat, creeping system fragility and design drift, increased bugginess, and inconsistent maintenance are all possible side effects of code cloning. Early research into this practice often tacitly assumed that it was always problematic, and sought to identify instances of it (“clone detection”) for later elimination. However, our studies of how cloning has been practised in the development of several large open-source systems suggested a more nuanced view might be appropriate: we found that code cloning seems to be practised for a variety of reasons, and sometimes with principled engineering goals in mind. That is, the idea that “code cloning is uniformly harmful to software system quality” is itself harmful. We argue instead that code clone instances should be evaluated along a number of criteria (such as developer intent, likely risk, and mitigation strategies) before any refactoring action is taken. Also, after some years of reflection on our original studies, we further suggest that instead of concentrating only on source code and other technical artifacts, there is much to be gained by shifting our focus to studying how developers perceive and practice code cloning.
Michael W. Godfrey, Cory J. Kapser
IWSC(D): From Research to Practice: A Personal Historical Retelling
The two authors of this chapter were among the founding participants of the First International Workshop on Software Clone Detection (IWSCD), nowadays known as the International Workshop on Software Clones (IWSC). This chapter briefly summarizes the history of this community-building workshop from its early days until today. IWSC(D) has had an impact not only on research but also on practice. Indeed, the authors have also developed clone detection tools, among other static program analyses to assess the internal quality of programs, that are used in the software industry by thousands of developers. The foundations of these tools were laid in software clone research, which highlights both the relevance of this topic for industry and the impact that research is capable of achieving. This historical retelling will not only be a summary of almost 20 years of the history of our primary community event, trying to be as accurate and complete as possible, but also provide the personal perspectives of the two authors along with some anecdotes.
Rainer Koschke, Stefan Bellon