DOI: 10.1145/3236024.3236026

Oreo: detection of clones in the twilight zone

Published: 26 October 2018

ABSTRACT

Source code clones are categorized into four types of increasing difficulty of detection, ranging from purely textual (Type-1) to purely semantic (Type-4). Most clone detectors reported in the literature work well up to Type-3, which accounts for syntactic differences. Between Type-3 and Type-4, however, lies a spectrum of clones that still exhibit some syntactic similarity yet are extremely hard to detect: the Twilight Zone. Most reported clone detectors fail to operate in this zone. We present Oreo, a novel approach to source code clone detection that not only detects Type-1 to Type-3 clones accurately, but is also capable of detecting harder-to-detect clones in the Twilight Zone. Oreo is built using a combination of machine learning, information retrieval, and software metrics. We evaluate the recall of Oreo on BigCloneBench and perform a manual evaluation for precision. Oreo has both high recall and high precision. More importantly, it pushes the boundary in detection of clones with moderate to weak syntactic similarity in a scalable manner.
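To make the notion of "moderate to weak syntactic similarity" concrete, the following minimal sketch compares Java fragments with a simple bag-of-tokens overlap score. This is an illustrative assumption, not Oreo's actual similarity measure or pipeline; the function names `tokens` and `overlap_similarity` and the three fragments are hypothetical. It shows only that a clone with renamed identifiers keeps most of its tokens, while a functionally equivalent rewrite shares far fewer.

```python
import re
from collections import Counter


def tokens(code: str) -> Counter:
    """Split code into identifiers, numbers and single punctuation characters."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def overlap_similarity(a: str, b: str) -> float:
    """Bag-of-tokens overlap: shared tokens divided by the larger token count."""
    ta, tb = tokens(a), tokens(b)
    shared = sum((ta & tb).values())  # multiset intersection
    return shared / max(sum(ta.values()), sum(tb.values()))


# Hypothetical Java fragments, used purely for illustration.
original = "int sum(int[] a) { int s = 0; for (int x : a) { s += x; } return s; }"
# Same structure with renamed identifiers (a syntactically close clone).
renamed = "int total(int[] v) { int t = 0; for (int e : v) { t += e; } return t; }"
# Same behavior, different implementation (syntactically distant clone).
rewritten = "int sum(int[] a) { return java.util.Arrays.stream(a).sum(); }"

print("renamed:  ", overlap_similarity(original, renamed))
print("rewritten:", overlap_similarity(original, rewritten))
```

Under this toy measure the renamed fragment scores higher than the rewritten one, even though both compute the same result. Clones of the second kind, which retain only part of the original's syntax, are the ones the abstract places in the Twilight Zone.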

ESEC/FSE ’18, November 4–9, 2018, Lake Buena Vista, FL, USA. V. Saini, F. Farmahinifarahani, Y. Lu, P. Baldi, and C. V. Lopes

    • Published in

      ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
      October 2018
      987 pages
      ISBN: 9781450355735
      DOI: 10.1145/3236024

      Copyright © 2018 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate: 112 of 543 submissions, 21%
