
Clone-Slicer: Detecting Domain Specific Binary Code Clones through Program Slicing

Authors:
Hongfa Xue, George Washington University, Washington, DC, USA
Guru Venkataramani, George Washington University, Washington, DC, USA
Tian Lan, George Washington University, Washington, DC, USA

FEAST '18: Proceedings of the 2018 Workshop on Forming an Ecosystem Around Software Transformation, October 2018, Pages 27–33. https://doi.org/10.1145/3273045.3273047

Published: 15 January 2018


ABSTRACT

In this paper, we present Clone-Slicer, a novel domain-specific code clone detector for binary executables. It integrates program slicing with a deep learning based binary code clone modeling framework to increase the number of code clones detected. We chose pointer analysis for memory safety as our example domain to demonstrate the usefulness of the approach, and we evaluated it on real-world applications from the SPEC 2006 benchmark suite. Our results show that Clone-Slicer detects up to 43.64% more code clones than prior work and reduces the time-to-solution (the time spent verifying memory bound safety) by 32.96% compared to Clone-Hunter. As future work, we plan to apply Clone-Slicer to other domains and tasks, such as vulnerable program path discovery, and to further improve code clone detection through advanced clustering algorithms. We will also study the cost-benefit tradeoffs of using such algorithms.
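To illustrate why slicing can expose more clones, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical three-address instruction format (`Instr`), handles only data dependences in straight-line code, and uses a plain opcode-sequence comparison as a stand-in for the deep learning based clone model described above. The helper names `backward_slice` and `opcode_signature` are illustrative assumptions.

```python
from typing import List, NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """A toy three-address instruction (hypothetical format, not a real ISA)."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def backward_slice(code: List[Instr], targets: Set[str]) -> List[Instr]:
    """Keep only the instructions that the target variables depend on.

    Walks the code backwards, tracking the set of variables whose
    definitions are still needed. Data dependences only; no control flow.
    """
    relevant = set(targets)
    kept = []
    for instr in reversed(code):
        if instr.dest in relevant:
            kept.append(instr)
            # This definition is satisfied; its inputs become relevant.
            relevant.discard(instr.dest)
            relevant.update(instr.srcs)
    kept.reverse()
    return kept


def opcode_signature(code: List[Instr]) -> Tuple[str, ...]:
    """Stand-in similarity feature: the sequence of opcodes."""
    return tuple(instr.op for instr in code)


# Two fragments that compute a pointer the same way but contain
# different unrelated instructions.
frag_a = [
    Instr("load", "base", ("arr",)),
    Instr("mul", "off", ("i", "four")),
    Instr("add", "cnt", ("cnt", "one")),   # unrelated to the pointer
    Instr("add", "ptr", ("base", "off")),
]
frag_b = [
    Instr("load", "base", ("buf",)),
    Instr("xor", "tmp", ("tmp", "tmp")),   # unrelated to the pointer
    Instr("mul", "off", ("j", "four")),
    Instr("add", "ptr", ("base", "off")),
    Instr("sub", "len", ("len", "one")),   # unrelated to the pointer
]

# Slice both fragments with respect to the pointer of interest.
slice_a = backward_slice(frag_a, {"ptr"})
slice_b = backward_slice(frag_b, {"ptr"})

print("whole fragments match:", opcode_signature(frag_a) == opcode_signature(frag_b))
print("pointer slices match: ", opcode_signature(slice_a) == opcode_signature(slice_b))
```

In this toy setting the whole fragments differ, while their pointer-related slices coincide, which is the intuition behind restricting clone detection to a domain of interest such as pointer analysis.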


Index Terms

1. Security and privacy
  1. Software and application security
    1. Software reverse engineering
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Formal software verification


Published in
FEAST '18: Proceedings of the 2018 Workshop on Forming an Ecosystem Around Software Transformation
October 2018
39 pages
ISBN: 9781450359979
DOI: 10.1145/3273045
Program Chairs: Yan Shoshitaishvili (Arizona State University), Mayur Naik (University of Pennsylvania)
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
binary analysis
code clones
machine learning
program slicing
Qualifiers
- research-article