ABSTRACT
Background: Program similarity is a fundamental concept, central to the solution of software engineering tasks such as software plagiarism, clone identification, code refactoring and code search. Accurate similarity estimation between programs requires an in-depth understanding of their structure, semantics and flow. A control flow graph (CFG), is a graphical representation of a program which captures its logical control flow and hence its semantics. A common approach is to estimate program similarity by analysing CFGs using graph similarity measures, e.g. graph edit distance (GED). However, graph edit distance is an NP-hard problem and computationally expensive, making the application of graph similarity techniques to complex software programs impractical. Aim: This study intends to examine the effectiveness of graph neural networks to estimate program similarity, by analysing the associated control flow graphs. Method: We introduce funcGNN1, which is a graph neural network trained on labeled CFG pairs to predict the GED between unseen program pairs by utilizing an effective embedding vector. To our knowledge, this is the first time graph neural networks have been applied on labeled CFGs for estimating the similarity between highlevel language programs. Results: We demonstrate the effectiveness of funcGNN to estimate the GED between programs and our experimental analysis demonstrates how it achieves a lower error rate (1.94 x10-3), with faster (23 times faster than the quickest traditional GED approximation method) and better scalability compared with state of the art methods. Conclusion: funcGNN posses the inductive learning ability to infer program structure and generalise to unseen programs. The graph embedding of a program proposed by our methodology could be applied to several related software engineering problems (such as code plagiarism and clone identification) thus opening multiple research directions.
- Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740 ( 2017).Google Scholar
- Yunsheng Bai, Hao Ding, Song Bian, Ting Chen, Yizhou Sun, and Wei Wang. 2019. Simgnn: A neural network approach to fast graph similarity computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 384--392.Google ScholarDigital Library
- Hal L Berghel and David L Sallach. 1984. Measurements of program similarity in identical task environments. ACM SIGPLAN Notices 19, 8 (1984), 65--76.Google ScholarDigital Library
- David B Blumenthal and Johann Gamper. 2018. On the exact computation of the graph edit distance. Pattern Recognition Letters (2018).Google Scholar
- Sebastien Bougleux, Luc Brun, Vincenzo Carletti, Pasquale Foggia, Benoit Gaüzère, and Mario Vento. 2017. Graph edit distance as a quadratic assignment problem. Pattern Recognition Letters 87 (2017), 38--46.Google ScholarDigital Library
- Horst Bunke. 1997. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters 18, 8 (1997), 689--694.Google ScholarDigital Library
- Horst Bunke and Kim Shearer. 1998. A graph distance metric based on the maximal common subgraph. Pattern recognition letters 19, 3-4 (1998), 255--259.Google Scholar
- Silvio Cesare and Yang Xiang. 2011. Malware variant detection using similarity search over sets of control flow graphs. In 2011IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, 181--189.Google ScholarDigital Library
- William W Cohen, Pradeep Ravikumar, Stephen E Fienberg, et al. 2003. A Comparison of String Distance Metrics for Name-Matching Tasks.. In IIWeb, Vol. 2003. 73--78.Google ScholarDigital Library
- Chandan Dasgupta. 2010. That is not my program: Investigating the relation between program comprehension and program authorship. In Proceedings of the 48th Annual Southeast Regional Conference. 1--4.Google ScholarDigital Library
- Jinan AW Faidhi and Stuart K Robinson. 1987. An empirical approach for detecting program similarity and plagiarism within a university programming environment. Computers & Education 11, 1 (1987), 11--19.Google ScholarDigital Library
- Stefan Fankhauser, Kaspar Riesen, and Horst Bunke. 2011. Speeding up graph edit distance computation through fast bipartite matching. In International Workshop on Graph-Based Representations in Pattern Recognition. Springer, 102--111.Google ScholarDigital Library
- Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. 2016. Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 480--491.Google ScholarDigital Library
- Andreas Fischer, Kaspar Riesen, and Horst Bunke. 2017. Improved quadratic time approximation of graph edit distance by combining Hausdorff matching and greedy assignment. Pattern Recognition Letters 87 (2017), 55--62.Google ScholarDigital Library
- Robert Geist, A. Jefferson Offutt, and Frederick C Harris Jr. 1992. Estimation and enhancement of real-time software reliability through mutation analysis. IEEE Trans. Comput. 5 (1992), 550--558.Google ScholarDigital Library
- Marco Gori, Gabriele Monfardini, and Franco Scarselli. 2005. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 2. IEEE, 729--734.Google ScholarCross Ref
- Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in neural information processing systems. 1024--1034.Google Scholar
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In European conference on computer vision. Springer, 630--645.Google ScholarCross Ref
- Kenneth Hoste, Aashish Phansalkar, Lieven Eeckhout, Andy Georges, Lizy K John, and Koen De Bosschere. 2006. Performance prediction based on inherent program similarity. In 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 114--122.Google ScholarDigital Library
- John Hunt. 2019. Futures. In Advanced Guide to Python 3 Programming. Springer, 395--405.Google Scholar
- Vaibhavi Kalgutkar, Ratinder Kaur, Hugo Gonzalez, Natalia Stakhanova, and Alina Matyukhina. 2019. Code authorship attribution: Methods and challenges. ACM Computing Surveys (CSUR) 52, 1 (2019), 1--36.Google ScholarDigital Library
- Upulee Kanewala, James M Bieman, and Asa Ben-Hur. 2016. Predicting metamorphic relations for testing scientific software: a machine learning approach using graph kernels. Software testing, verification and reliability 26, 3 (2016), 245--269.Google Scholar
- Abhishek Karnik, Suchandra Goswami, and Ratan Guha. 2007. Detecting obfuscated viruses using cosine similarity analysis. In First Asia International Conference on Modelling & Simulation (AMS'07). IEEE, 165--170.Google ScholarDigital Library
- Iman Keivanloo, Juergen Rilling, and Ying Zou. 2014. Spotting working code examples. In Proceedings of the 36th International Conference on Software Engineering. 664--675.Google ScholarDigital Library
- Mohamed A Khamsi and William A Kirk. 2011. An introduction to metric spaces and fixed point theory. Vol. 53. John Wiley & Sons.Google Scholar
- Yesol Kim, Jonghyuk Park, Seong-je Cho, Yunmook Nah, Sangchul Han, and Minkyu Park. 2015. Machine learning-based software classification scheme for efficient program similarity analysis. In Proceedings of the 2015 Conference on research in adaptive and convergent systems. 114--118.Google ScholarDigital Library
- Thomas NKipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).Google Scholar
- Jens Krinke. 2006. Mining control flow graphs for crosscutting concerns. In 2006 13th Working Conference on Reverse Engineering. IEEE, 334--342.Google ScholarDigital Library
- Giri Panamoottil Krishnan and Nikolaos Tsantalis. 2014. Unification and refactoring of clones. In 2014 Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). IEEE, 104--113.Google ScholarCross Ref
- Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.Google ScholarCross Ref
- Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. 2019. Graph matching networks for learning the similarity of graph structured objects. arXiv preprint arXiv:1904.12787 (2019).Google Scholar
- Jonathan I Maletic and Andrian Marcus. 2000. Using latent semantic analysis to identify similarities in source code to support program understanding. In Proceedings 12th IEEE Internationals Conference on Tools with Artificial Intelligence. ICTAI 2000. IEEE, 46--53.Google ScholarCross Ref
- Niccolò Marastoni, Roberto Giacobazzi, and Mila Dalla Preda. 2018. A deep learning approach to program similarity. In Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis. 26--35.Google ScholarDigital Library
- Aravind Nair, Karl Meinke, and Sigrid Eldh. 2019. Leveraging mutants for automatic prediction of metamorphic relations using machine learning. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation. 1--6.Google ScholarDigital Library
- Animesh Nandi, Atri Mandal, Shubham Atreja, Gargi B Dasgupta, and Subhrajit Bhattacharya. 2016. Anomaly detection using program control flow graph mining from execution logs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 215--224.Google ScholarDigital Library
- Michel Neuhaus and Horst Bunke. 2006. A random walk kernel derived from graph edit distance. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 191--199.Google ScholarDigital Library
- Michel Neuhaus and Horst Bunke. 2007. A quadratic programming approach to the graph edit distance problem. In International Workshop on Graph-Based Representations in Pattern Recognition. Springer, 92--102.Google ScholarCross Ref
- Michel Neuhaus, Kaspar Riesen, and Horst Bunke. 2006. Fast suboptimal algorithms for the computation of graph edit distance. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 163--172.Google ScholarDigital Library
- Haoran Niu, Iman Keivanloo, and Ying Zou. 2017. Learning to rank code examples for code search engines. Empirical Software Engineering 22, 1 (2017), 259--291.Google ScholarDigital Library
- Anh Viet Phan, Minh Le Nguyen, and Lam Thu Bui. 2017. Convolutional neural networks over control flow graphs for software defect prediction. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 45--52.Google Scholar
- Aashish Phansalkar, Ajay Joshi, Lieven Eeckhout, and Lizy Kurian John. 2005. Measuring program similarity: Experiments with SPEC CPU benchmark suites. In IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. IEEE, 10--20.Google ScholarDigital Library
- Kaspar Riesen and Horst Bunke. 2009. Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision computing 27, 7 (2009), 950--959.Google Scholar
- Kaspar Riesen, Andreas Fischer, and Horst Bunke. 2014. Improving approximate graph edit distance using genetic algorithms. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 63--72.Google ScholarDigital Library
- Eric Sven Ristad and Peter N Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 5 (1998), 522--532.Google ScholarDigital Library
- Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2008), 61--80.Google ScholarDigital Library
- Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems. 926--934.Google Scholar
- Raja Vallee-Rai and Laurie J Hendren. 1998. Jimple: Simplifying Java bytecode for analyses and transformations. (1998).Google Scholar
- SVN Vishwanathan, Karsten M Borgwardt, Nicol N Schraudolph, et al. 2006. Fast computation of graph kernels. In NIPS, Vol. 19. 131--138.Google Scholar
- Milena Vujoševic-Janičic, Mladen Nikolic, Dušan Tošic, and Viktor Kuncak. 2013. Software verification and graph similarity for automated evaluation of students' assignments. Information and Software Technology 55, 6 (2013), 1004--1016.Google ScholarDigital Library
- Andrew Walenstein, Mohammad El-Ramly, James R Cordy, William S Evans, Kiarash Mahdavi, Markus Pizka, Ganesan Ramalingam, and Jürgen Wolff von Gudenberg. 2007. Similarity in programs. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.Google Scholar
- Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).Google Scholar
- Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 363--376.Google ScholarDigital Library
- Zhiping Zeng, Anthony KH Tung, Jianyong Wang, Jianhua Feng, and Lizhu Zhou. 2009. Comparing stars: On approximating graph edit distance. Proceedings of the VLDB Endowment 2, 1 (2009), 25--36.Google ScholarDigital Library
- Fangfang Zhang, Dinghao Wu, Peng Liu, and Sencun Zhu. 2014. Program logic based software plagiarism detection. In 2014 IEEE 25th International Symposium on Software Reliability Engineering. IEEE, 66--77.Google ScholarDigital Library
- Gang Zhao and Jeff Huang. 2018. Deepsim: deep learning code functional similarity. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 141--151.Google ScholarDigital Library
- Minhaz Fahim Zibran and Chanchal Kumar Roy. 2013. Conflict-aware optimal scheduling of prioritised code clone refactoring. IET software 7, 3 (2013), 167--186.Google Scholar
Index Terms
- funcGNN: A Graph Neural Network Approach to Program Similarity
Recommendations
Graph Similarity Using Tree Edit Distance
Structural, Syntactic, and Statistical Pattern RecognitionAbstractGraph similarity is the process of finding similarity between two graphs. Graph edit distance is one of the key techniques to find the similarity between two graphs. The main disadvantage of graph edit distance is that it is computationally ...
Efficient graph edit distance computation using isomorphic vertices
Highlights- Reducing the search space of GED computation by exploiting isomorphic vertices.
- Developing algorithms for identifying isomorphic vertices.
- Proposing GED computation algorithm based on isomorphic vertices.
- Finding that real ...
AbstractIn this paper, we study the problem of graph edit distance (GED) computation. We empirically observe that real graph data contain many isomorphic substructures, which incur redundant computation. Based on this observation, we aim at reducing the ...
Comments