research-article

Oreo: detection of clones in the twilight zone

Authors:
Vaibhav Saini, University of California at Irvine, USA
Farima Farmahinifarahani, University of California at Irvine, USA
Yadong Lu, University of California at Irvine, USA
Pierre Baldi, University of California at Irvine, USA
Cristina V. Lopes, University of California at Irvine, USA

ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, October 2018, Pages 354–365. https://doi.org/10.1145/3236024.3236026

Published: 26 October 2018

Related Artifact: Reusable Package for Article: Oreo: Detection of Clones in the Twilight Zone. September 2018. Software. https://doi.org/10.5281/zenodo.1317760

ABSTRACT

Source code clones are categorized into four types of increasing difficulty of detection, ranging from purely textual (Type-1) to purely semantic (Type-4). Most clone detectors reported in the literature work well up to Type-3, which accounts for syntactic differences. In between Type-3 and Type-4, however, there lies a spectrum of clones that, although still exhibiting some syntactic similarities, are extremely hard to detect: the Twilight Zone. Most clone detectors reported in the literature fail to operate in this zone. We present Oreo, a novel approach to source code clone detection that not only detects Type-1 to Type-3 clones accurately, but is also capable of detecting harder-to-detect clones in the Twilight Zone. Oreo is built using a combination of machine learning, information retrieval, and software metrics. We evaluate the recall of Oreo on BigCloneBench, and perform manual evaluation for precision. Oreo has both high recall and precision. More importantly, it pushes the boundary in detection of clones with moderate to weak syntactic similarity in a scalable manner.
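
To make the clone taxonomy concrete, the following is a minimal sketch and not Oreo's method (which, per the abstract, combines machine learning, information retrieval, and software metrics). It assumes a naive token-overlap detector and three hand-written Java fragments. The helper names `tokens` and `jaccard`, the `KEYWORDS` set, and the fragments themselves are illustrative assumptions, not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny keyword list for the Java fragments below.
KEYWORDS = {"int", "return", "for", "while"}

def tokens(code, normalize=False):
    """Tokenize code; optionally replace identifiers and numbers with placeholders."""
    toks = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)
    if not normalize:
        return toks
    out = []
    for t in toks:
        if t in KEYWORDS:
            out.append(t)          # keep language keywords
        elif re.fullmatch(r"[A-Za-z_]\w*", t):
            out.append("ID")       # any identifier
        elif t.isdigit():
            out.append("NUM")      # any integer literal
        else:
            out.append(t)          # punctuation and operators
    return out

def jaccard(a, b):
    """Jaccard similarity over token multisets (bags of tokens)."""
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 1.0

original = "int sum(int[] a) { int s = 0; for (int i = 0; i < a.length; i++) { s += a[i]; } return s; }"
# Type-1 style: same text, different whitespace.
type1 = "int sum(int[] a) {\n  int s = 0;\n  for (int i = 0; i < a.length; i++) { s += a[i]; }\n  return s;\n}"
# Type-2 style: identifiers renamed.
type2 = "int total(int[] xs) { int t = 0; for (int k = 0; k < xs.length; k++) { t += xs[k]; } return t; }"
# Weaker syntactic similarity: same behavior, different loop and update statements.
twilight = "int sum(int[] a) { int s = 0; int i = 0; while (i < a.length) { s = s + a[i]; i = i + 1; } return s; }"

sim_type1 = jaccard(tokens(original), tokens(type1))
sim_type2_raw = jaccard(tokens(original), tokens(type2))
sim_type2_norm = jaccard(tokens(original, True), tokens(type2, True))
sim_twilight = jaccard(tokens(original, True), tokens(twilight, True))
```

In this toy setting, whitespace changes leave the token bags identical and identifier normalization makes the renamed fragment match exactly, while the rewritten loop only partially overlaps even after normalization. Such partial overlap is the kind of moderate to weak syntactic similarity the abstract refers to, and a fixed similarity threshold on it trades recall against precision.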

Index Terms

Oreo: detection of clones in the twilight zone
1. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Maintaining software

Published in
ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
October 2018
987 pages
ISBN:9781450355735
DOI:10.1145/3236024
General Chair: Gary T. Leavens, University of Central Florida, USA
Program Chairs: Alessandro Garcia, PUC-Rio, Brazil; Corina S. Păsăreanu, NASA Ames Research Center, USA
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Clone detection
Machine Learning
Software Metrics
Acceptance Rates
Overall Acceptance Rate: 112 of 543 submissions, 21%