DOI: 10.1145/3273045.3273047
research-article

Clone-Slicer: Detecting Domain Specific Binary Code Clones through Program Slicing

Published: 15 January 2018

ABSTRACT

In this paper, we present Clone-Slicer, a domain-specific code clone detector for binary executables that integrates program slicing with a deep-learning-based binary code clone modeling framework to increase the number of code clones detected. In particular, we chose pointer analysis for memory safety as our example domain to demonstrate the usefulness of our approach. We evaluated our approach using real-world applications from the SPEC 2006 benchmark suite. Our results show that Clone-Slicer detects up to 43.64% more code clones than prior work and cuts the time-to-solution (the time spent verifying memory bound safety) by 32.96% compared to Clone-Hunter. As future work, we plan to apply Clone-Slicer to other domains and tasks, such as vulnerable program path discovery, and to further improve its code clone detection capability through advanced clustering algorithms. We will also study the cost-benefit tradeoffs of using such advanced algorithms.

Published in

FEAST '18: Proceedings of the 2018 Workshop on Forming an Ecosystem Around Software Transformation
October 2018
39 pages
ISBN: 9781450359979
DOI: 10.1145/3273045

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States
