ABSTRACT
Source code clones are categorized into four types of increasing detection difficulty, ranging from purely textual (Type-1) to purely semantic (Type-4). Most clone detectors reported in the literature work well up to Type-3, which accounts for syntactic differences. Between Type-3 and Type-4, however, lies a spectrum of clones that still exhibit some syntactic similarity yet are extremely hard to detect: the Twilight Zone. Most reported clone detectors fail to operate in this zone. We present Oreo, a novel approach to source code clone detection that not only detects Type-1 to Type-3 clones accurately, but is also capable of detecting harder-to-detect clones in the Twilight Zone. Oreo is built using a combination of machine learning, information retrieval, and software metrics. We evaluate the recall of Oreo on BigCloneBench and perform a manual evaluation for precision. Oreo has both high recall and high precision. More importantly, it pushes the boundary in detecting clones with moderate to weak syntactic similarity in a scalable manner.
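To make the four clone types concrete, the sketch below pairs one small function with a hypothetical variant of each type, following the commonly used definitions (Type-1: differs only in whitespace and comments; Type-2: renamed identifiers; Type-3: added or changed statements; Type-4: same behavior, different syntax). The similarity measure is a naive token-set Jaccard overlap chosen purely for illustration; it is not Oreo's method, and the names `tokens`, `jaccard`, `total`, and `add_all` are assumptions made up for this example.

```python
import re


def tokens(src: str) -> set:
    """Return the set of identifier, number, and punctuation tokens in src."""
    without_comments = re.sub(r"#.*", "", src)  # drop line comments first
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", without_comments))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


ORIGINAL = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

# Type-1: identical apart from whitespace and comments.
TYPE_1 = """
def total(values):
    result = 0   # running sum
    for v in values:
        result += v
    return result
"""

# Type-2: same structure, identifiers renamed.
TYPE_2 = """
def add_all(items):
    acc = 0
    for x in items:
        acc += x
    return acc
"""

# Type-3: renamed identifiers plus an added statement.
TYPE_3 = """
def add_all(items):
    acc = 0
    for x in items:
        if x is None:
            continue
        acc += x
    return acc
"""

# Type-4: sums a list as well, but written recursively.
TYPE_4 = """
def add_all(items):
    if not items:
        return 0
    return items[0] + add_all(items[1:])
"""

if __name__ == "__main__":
    for name, variant in [("Type-1", TYPE_1), ("Type-2", TYPE_2),
                          ("Type-3", TYPE_3), ("Type-4", TYPE_4)]:
        print(f"{name}: {jaccard(ORIGINAL, variant):.2f}")
```

In this toy setting the overlap with the original shrinks from Type-1 to Type-4, which hints at why detectors that rely on surface similarity lose traction as clones move toward the semantic end of the spectrum.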
ESEC/FSE ’18, November 4–9, 2018, Lake Buena Vista, FL, USA. V. Saini, F. Farmahinifarahani, Y. Lu, P. Baldi, and C. V. Lopes.
Index Terms
- Oreo: detection of clones in the twilight zone