ABSTRACT
In this paper, we presented a novel framework, Clone-Slicer, a domain-specific code clone detector for binary executables, that integrates program slicing and a deep learning based binary code clone modeling framework to improve the number of code clone detected. In particular, we chose pointer analysis for memory safety as our example domain to demonstrate the usefulness of our approach. We evaluated our approach using real-world applications from SPEC 2006 benchmark suite. Our results show Clone-Slicer is able to detect up to 43.64% code clones compared to prior work and further cut the time-to-solution (the time spent to verify memory bound safety) for Clone-Slicer by 32.96% compared to Clone-Hunter. As future work, we plan to apply Clone-Slicer to different domains and tasks, such as vulnerable program path discovery, and further improve the capability for code clone detection through advanced clustering algorithms. We will also study the cost-benefit tradeoffs of using such advanced algorithms.
- 2006. SPEC CPU 2006. https://www.spec.org/cpu2006/.Google Scholar
- 2016. IDA Pro disassembler. https://www.hex-rays.com/products/ida/.Google Scholar
- Sheeva Afshan, Phil McMinn, and Mark Stevenson. 2013. Evolving readable string test inputs using a natural language model to reduce human oracle cost. In Software Testing, Verification and Validation (ICST), 2013 IEEE Sixth International Conference on. IEEE, 352--361. Google ScholarDigital Library
- Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. 2015. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 38--49. Google ScholarDigital Library
- Brenda S Baker. 1995. On finding duplication and near-duplication in large software systems. In Reverse Engineering, 1995., Proceedings of 2ndWorking Conference on. IEEE, 86--95. Google ScholarDigital Library
- Brenda S Baker. 1997. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. Comput. 26, 5 (1997), 1343--1362. Google ScholarDigital Library
- Ira D Baxter, Christopher Pidgeon, and Michael Mehlich. 2004. DMS/spl reg: program transformations for practical scalable software evolution. In Software Engineering, 2004. ICSE 2004. Proceedings. 26th International Conference on. IEEE, 625--634. Google ScholarDigital Library
- Christopher M Bishop. 2006. Machine learning and pattern recognition. Information Science and Statistics. Springer, Heidelberg (2006). Google ScholarDigital Library
- Juan Caballero, Gustavo Grieco, Mark Marron, and Antonio Nappa. 2012. Undangle: early detection of dangling pointers in use-after-free and double-free vulnerabilities. In Proceedings of the 2012 International Symposium on Software Testing and Analysis. ACM, 133--143. Google ScholarDigital Library
- Shuo Chen, Jun Xu, Nithin Nakka, Zbigniew Kalbarczyk, and Ravishankar K Iyer. 2005. Defeating memory corruption attacks via pointer taintedness detection. In Dependable Systems and Networks, 2005. DSN 2005. Proceedings. International Conference on. IEEE, 378--387. Google ScholarDigital Library
- Mauro Conti, Stephen Crane, Lucas Davi, Michael Franz, Per Larsen, Marco Negro, Christopher Liebchen, Mohaned Qunaibit, and Ahmad-Reza Sadeghi. 2015. Losing control: On the effectiveness of control-flow integrity under stack attacks. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 952--963. Google ScholarDigital Library
- Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. 2004. Localitysensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry. ACM, 253--262. Google ScholarDigital Library
- Stéphane Ducasse, Matthias Rieger, and Serge Demeyer. 1999. A language independent approach for detecting duplicated code. In Software Maintenance, 1999.(ICSM'99) Proceedings. IEEE International Conference on. IEEE, 109--118. Google ScholarDigital Library
- Christine Franks, Zhaopeng Tu, Premkumar Devanbu, and Vincent Hellendoorn. 2015. Cacheca: A cache language model based code suggestion tool. In Proceedings of the 37th International Conference on Software Engineering-Volume 2. IEEE Press, 705--708. Google ScholarDigital Library
- Mark Gabel and Zhendong Su. 2010. A study of the uniqueness of source code. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering. ACM, 147--156. Google ScholarDigital Library
- Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. 1999. Similarity search in high dimensions via hashing. In Vldb, Vol. 99. 518--529. Google ScholarDigital Library
- Christoph Goller and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Neural Networks, 1996., IEEE International Conference on, Vol. 1. IEEE, 347--352.Google ScholarCross Ref
- Vincent J Hellendoorn and Premkumar Devanbu. 2017. Are deep neural networks the best choice for modeling source code?. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 763--773. Google ScholarDigital Library
- Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 837--847. Google ScholarDigital Library
- Yikun Hu, Yuanyuan Zhang, Juanru Li, and Dawu Gu. 2017. Binary code clone detection across architectures and compiling configurations. In Proceedings of the 25th International Conference on Program Comprehension. IEEE Press, 88--98. Google ScholarDigital Library
- Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th international conference on Software Engineering. IEEE Computer Society, 96--105. Google ScholarDigital Library
- Dan Jurafsky and James H Martin. 2009. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. , 1024 pages. Google ScholarDigital Library
- Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. 2002. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering 28, 7 (2002), 654--670. Google ScholarDigital Library
- Miryung Kim, Vibha Sazawal, David Notkin, and Gail Murphy. 2005. An empirical study of code clone genealogies. In ACM SIGSOFT Software Engineering Notes, Vol. 30. ACM, 187--196. Google ScholarDigital Library
- Raghavan Komondoor and Susan Horwitz. 2001. Using slicing to identify duplication in source code. In International Static Analysis Symposium. Springer, 40--56. Google ScholarDigital Library
- Kostas A Kontogiannis, Renator DeMori, Ettore Merlo, Michael Galler, and Morris Bernstein. 1996. Pattern matching for clone and concept detection. Automated Software Engineering 3, 1--2 (1996), 77--108.Google ScholarDigital Library
- Peng Li, Yang Liu, and Maosong Sun. 2013. Recursive autoencoders for ITG-based translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 567--577.Google Scholar
- Yongbo Li, Fan Yao, Tian Lan, and Guru Venkataramani. 2016. Sarre: semanticsaware rule recommendation and enforcement for event paths on android. IEEE Transactions on Information Forensics and Security 11, 12 (2016), 2748--2762. Google ScholarDigital Library
- Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. 2006. CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Transactions on software Engineering 32, 3 (2006), 176--192. Google ScholarDigital Library
- Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and Jan Cernocky. 2011. Rnnlm-recurrent neural network language modeling toolkit. In Proc. of the 2011 ASRU Workshop. 196--201.Google Scholar
- Santosh Nagarakatte, Jianzhou Zhao, Milo MK Martin, and Steve Zdancewic. 2009. SoftBound: Highly compatible and complete spatial memory safety for C. ACM Sigplan Notices 44, 6 (2009), 245--258. Google ScholarDigital Library
- Jannik Pewny, Behrad Garmany, Robert Gawlik, Christian Rossow, and Thorsten Holz. 2015. Cross-architecture bug search in binary executables. In Security and Privacy (SP), 2015 IEEE Symposium on. IEEE, 709--724. Google ScholarDigital Library
- Andreas Sæbjørnsen, Jeremiah Willcock, Thomas Panas, Daniel Quinlan, and Zhendong Su. 2009. Detecting code clones in binary executables. In Proceedings of the eighteenth international symposium on Software testing and analysis. ACM, 117--128. Google ScholarDigital Library
- Fermin J Serna. 2012. The info leak era on software exploitation. Black Hat USA (2012).Google Scholar
- Yan Shoshitaishvili, RuoyuWang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, et al. 2016. Sok:(state of) the art of war: Offensive techniques in binary analysis. In Security and Privacy (SP), 2016 IEEE Symposium on. IEEE, 138--157.Google ScholarCross Ref
- Richard Socher, Alex Perelygin, JeanWu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing. 1631--1642.Google Scholar
- Open Source. 2016. Dyninst: An application program interface (api) for runtime code generation. Online, http://www. dyninst. org (2016).Google Scholar
- Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics. Association for Computational Linguistics, 384--394. Google ScholarDigital Library
- Guru Venkataramani, Ioannis Doudalis, Yan Solihin, and Milos Prvulovic. 2008. Flexitaint: A programmable accelerator for dynamic taint propagation. In High Performance Computer Architecture, 2008. HPCA 2008. IEEE 14th International Symposium on. IEEE, 173--184.Google ScholarCross Ref
- Guru Venkataramani, Ioannis Doudalis, Yan Solihin, and Milos Prvulovic. 2009. MemTracker: An accelerator for memory debugging and monitoring. ACM Transactions on Architecture and Code Optimization (TACO) 6, 2 (2009), 5. Google ScholarDigital Library
- Vera Wahler, Dietmar Seipel, J Wolff, and Gregor Fischer. 2004. Clone detection in source code by frequent itemset techniques. In Source Code Analysis and Manipulation, 2004. Fourth IEEE International Workshop on. IEEE, 128--135. Google ScholarDigital Library
- Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 87--98. Google ScholarDigital Library
- Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 363--376. Google ScholarDigital Library
- Hongfa Xue, Yurong Chen, Fan Yao, Yongbo Li, Tian Lan, and Guru Venkataramani. 2017. Simber: Eliminating redundant memory bound checks via statistical inference. In IFIP International Conference on ICT Systems Security and Privacy Protection. Springer, 413--426.Google ScholarCross Ref
- Hongfa Xue, Guru Venkataramani, and Tian Lan. 2018. Clone-hunter: accelerated bound checks elimination via binary code clone detection. In Proceedings of the 2nd ACM SIGPLAN InternationalWorkshop on Machine Learning and Programming Languages. ACM, 11--19. Google ScholarDigital Library
- Fan Yao, Yongbo Li, Yurong Chen, Hongfa Xue, Tian Lan, and Guru Venkataramani. 2017. Statsym: vulnerable path discovery through statistics-guided symbolic execution. In Dependable Systems and Networks (DSN), 2017 47th Annual IEEE/IFIP International Conference on. IEEE, 109--120.Google ScholarCross Ref
Index Terms
- Clone-Slicer: Detecting Domain Specific Binary Code Clones through Program Slicing
Recommendations
CloneCognition: machine learning based code clone validation tool
ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software EngineeringA code clone is a pair of similar code fragments, within or between software systems. To detect each possible clone pair from a software system while handling the complex code structures, the clone detection tools undergo a lot of generalization of the ...
Predicting Buggy Code Clones through Machine Learning
CASCON '22: Proceedings of the 32nd Annual International Conference on Computer Science and Software EngineeringCode clones (similar code fragments in a code-base} often have negative impacts on the maintenance and evolution of software systems. According to the existing studies, code clones may contain bugs or inconsistencies that can cause an increased ...
Clone removal: fact or fiction?
IWSC '10: Proceedings of the 4th International Workshop on Software ClonesDespite ongoing research in the field of code duplication, clone research has not yet investigated when and how developers remove clones. We think knowing how developers select candidates for removal and what techniques they use to eliminate duplication ...
Comments