skip to main content
10.1145/3382494.3410675acmconferencesArticle/Chapter ViewAbstractPublication PagesesemConference Proceedingsconference-collections
research-article
Open Access

funcGNN: A Graph Neural Network Approach to Program Similarity

Authors Info & Claims
Published:23 October 2020Publication History

ABSTRACT

Background: Program similarity is a fundamental concept, central to the solution of software engineering tasks such as software plagiarism, clone identification, code refactoring and code search. Accurate similarity estimation between programs requires an in-depth understanding of their structure, semantics and flow. A control flow graph (CFG), is a graphical representation of a program which captures its logical control flow and hence its semantics. A common approach is to estimate program similarity by analysing CFGs using graph similarity measures, e.g. graph edit distance (GED). However, graph edit distance is an NP-hard problem and computationally expensive, making the application of graph similarity techniques to complex software programs impractical. Aim: This study intends to examine the effectiveness of graph neural networks to estimate program similarity, by analysing the associated control flow graphs. Method: We introduce funcGNN1, which is a graph neural network trained on labeled CFG pairs to predict the GED between unseen program pairs by utilizing an effective embedding vector. To our knowledge, this is the first time graph neural networks have been applied on labeled CFGs for estimating the similarity between highlevel language programs. Results: We demonstrate the effectiveness of funcGNN to estimate the GED between programs and our experimental analysis demonstrates how it achieves a lower error rate (1.94 x10-3), with faster (23 times faster than the quickest traditional GED approximation method) and better scalability compared with state of the art methods. Conclusion: funcGNN posses the inductive learning ability to infer program structure and generalise to unseen programs. The graph embedding of a program proposed by our methodology could be applied to several related software engineering problems (such as code plagiarism and clone identification) thus opening multiple research directions.

References

  1. Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740 ( 2017).Google ScholarGoogle Scholar
  2. Yunsheng Bai, Hao Ding, Song Bian, Ting Chen, Yizhou Sun, and Wei Wang. 2019. Simgnn: A neural network approach to fast graph similarity computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 384--392.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Hal L Berghel and David L Sallach. 1984. Measurements of program similarity in identical task environments. ACM SIGPLAN Notices 19, 8 (1984), 65--76.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. David B Blumenthal and Johann Gamper. 2018. On the exact computation of the graph edit distance. Pattern Recognition Letters (2018).Google ScholarGoogle Scholar
  5. Sebastien Bougleux, Luc Brun, Vincenzo Carletti, Pasquale Foggia, Benoit Gaüzère, and Mario Vento. 2017. Graph edit distance as a quadratic assignment problem. Pattern Recognition Letters 87 (2017), 38--46.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Horst Bunke. 1997. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters 18, 8 (1997), 689--694.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Horst Bunke and Kim Shearer. 1998. A graph distance metric based on the maximal common subgraph. Pattern recognition letters 19, 3-4 (1998), 255--259.Google ScholarGoogle Scholar
  8. Silvio Cesare and Yang Xiang. 2011. Malware variant detection using similarity search over sets of control flow graphs. In 2011IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, 181--189.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. William W Cohen, Pradeep Ravikumar, Stephen E Fienberg, et al. 2003. A Comparison of String Distance Metrics for Name-Matching Tasks.. In IIWeb, Vol. 2003. 73--78.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chandan Dasgupta. 2010. That is not my program: Investigating the relation between program comprehension and program authorship. In Proceedings of the 48th Annual Southeast Regional Conference. 1--4.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Jinan AW Faidhi and Stuart K Robinson. 1987. An empirical approach for detecting program similarity and plagiarism within a university programming environment. Computers & Education 11, 1 (1987), 11--19.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Stefan Fankhauser, Kaspar Riesen, and Horst Bunke. 2011. Speeding up graph edit distance computation through fast bipartite matching. In International Workshop on Graph-Based Representations in Pattern Recognition. Springer, 102--111.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. 2016. Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 480--491.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Andreas Fischer, Kaspar Riesen, and Horst Bunke. 2017. Improved quadratic time approximation of graph edit distance by combining Hausdorff matching and greedy assignment. Pattern Recognition Letters 87 (2017), 55--62.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Robert Geist, A. Jefferson Offutt, and Frederick C Harris Jr. 1992. Estimation and enhancement of real-time software reliability through mutation analysis. IEEE Trans. Comput. 5 (1992), 550--558.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Marco Gori, Gabriele Monfardini, and Franco Scarselli. 2005. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 2. IEEE, 729--734.Google ScholarGoogle ScholarCross RefCross Ref
  17. Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in neural information processing systems. 1024--1034.Google ScholarGoogle Scholar
  18. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In European conference on computer vision. Springer, 630--645.Google ScholarGoogle ScholarCross RefCross Ref
  19. Kenneth Hoste, Aashish Phansalkar, Lieven Eeckhout, Andy Georges, Lizy K John, and Koen De Bosschere. 2006. Performance prediction based on inherent program similarity. In 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 114--122.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. John Hunt. 2019. Futures. In Advanced Guide to Python 3 Programming. Springer, 395--405.Google ScholarGoogle Scholar
  21. Vaibhavi Kalgutkar, Ratinder Kaur, Hugo Gonzalez, Natalia Stakhanova, and Alina Matyukhina. 2019. Code authorship attribution: Methods and challenges. ACM Computing Surveys (CSUR) 52, 1 (2019), 1--36.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Upulee Kanewala, James M Bieman, and Asa Ben-Hur. 2016. Predicting metamorphic relations for testing scientific software: a machine learning approach using graph kernels. Software testing, verification and reliability 26, 3 (2016), 245--269.Google ScholarGoogle Scholar
  23. Abhishek Karnik, Suchandra Goswami, and Ratan Guha. 2007. Detecting obfuscated viruses using cosine similarity analysis. In First Asia International Conference on Modelling & Simulation (AMS'07). IEEE, 165--170.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Iman Keivanloo, Juergen Rilling, and Ying Zou. 2014. Spotting working code examples. In Proceedings of the 36th International Conference on Software Engineering. 664--675.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Mohamed A Khamsi and William A Kirk. 2011. An introduction to metric spaces and fixed point theory. Vol. 53. John Wiley & Sons.Google ScholarGoogle Scholar
  26. Yesol Kim, Jonghyuk Park, Seong-je Cho, Yunmook Nah, Sangchul Han, and Minkyu Park. 2015. Machine learning-based software classification scheme for efficient program similarity analysis. In Proceedings of the 2015 Conference on research in adaptive and convergent systems. 114--118.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Thomas NKipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).Google ScholarGoogle Scholar
  28. Jens Krinke. 2006. Mining control flow graphs for crosscutting concerns. In 2006 13th Working Conference on Reverse Engineering. IEEE, 334--342.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Giri Panamoottil Krishnan and Nikolaos Tsantalis. 2014. Unification and refactoring of clones. In 2014 Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). IEEE, 104--113.Google ScholarGoogle ScholarCross RefCross Ref
  30. Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.Google ScholarGoogle ScholarCross RefCross Ref
  31. Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. 2019. Graph matching networks for learning the similarity of graph structured objects. arXiv preprint arXiv:1904.12787 (2019).Google ScholarGoogle Scholar
  32. Jonathan I Maletic and Andrian Marcus. 2000. Using latent semantic analysis to identify similarities in source code to support program understanding. In Proceedings 12th IEEE Internationals Conference on Tools with Artificial Intelligence. ICTAI 2000. IEEE, 46--53.Google ScholarGoogle ScholarCross RefCross Ref
  33. Niccolò Marastoni, Roberto Giacobazzi, and Mila Dalla Preda. 2018. A deep learning approach to program similarity. In Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis. 26--35.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Aravind Nair, Karl Meinke, and Sigrid Eldh. 2019. Leveraging mutants for automatic prediction of metamorphic relations using machine learning. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation. 1--6.Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Animesh Nandi, Atri Mandal, Shubham Atreja, Gargi B Dasgupta, and Subhrajit Bhattacharya. 2016. Anomaly detection using program control flow graph mining from execution logs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 215--224.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Michel Neuhaus and Horst Bunke. 2006. A random walk kernel derived from graph edit distance. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 191--199.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Michel Neuhaus and Horst Bunke. 2007. A quadratic programming approach to the graph edit distance problem. In International Workshop on Graph-Based Representations in Pattern Recognition. Springer, 92--102.Google ScholarGoogle ScholarCross RefCross Ref
  38. Michel Neuhaus, Kaspar Riesen, and Horst Bunke. 2006. Fast suboptimal algorithms for the computation of graph edit distance. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 163--172.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Haoran Niu, Iman Keivanloo, and Ying Zou. 2017. Learning to rank code examples for code search engines. Empirical Software Engineering 22, 1 (2017), 259--291.Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Anh Viet Phan, Minh Le Nguyen, and Lam Thu Bui. 2017. Convolutional neural networks over control flow graphs for software defect prediction. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 45--52.Google ScholarGoogle Scholar
  41. Aashish Phansalkar, Ajay Joshi, Lieven Eeckhout, and Lizy Kurian John. 2005. Measuring program similarity: Experiments with SPEC CPU benchmark suites. In IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. IEEE, 10--20.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Kaspar Riesen and Horst Bunke. 2009. Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision computing 27, 7 (2009), 950--959.Google ScholarGoogle Scholar
  43. Kaspar Riesen, Andreas Fischer, and Horst Bunke. 2014. Improving approximate graph edit distance using genetic algorithms. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 63--72.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Eric Sven Ristad and Peter N Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 5 (1998), 522--532.Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2008), 61--80.Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems. 926--934.Google ScholarGoogle Scholar
  47. Raja Vallee-Rai and Laurie J Hendren. 1998. Jimple: Simplifying Java bytecode for analyses and transformations. (1998).Google ScholarGoogle Scholar
  48. SVN Vishwanathan, Karsten M Borgwardt, Nicol N Schraudolph, et al. 2006. Fast computation of graph kernels. In NIPS, Vol. 19. 131--138.Google ScholarGoogle Scholar
  49. Milena Vujoševic-Janičic, Mladen Nikolic, Dušan Tošic, and Viktor Kuncak. 2013. Software verification and graph similarity for automated evaluation of students' assignments. Information and Software Technology 55, 6 (2013), 1004--1016.Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Andrew Walenstein, Mohammad El-Ramly, James R Cordy, William S Evans, Kiarash Mahdavi, Markus Pizka, Ganesan Ramalingam, and Jürgen Wolff von Gudenberg. 2007. Similarity in programs. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.Google ScholarGoogle Scholar
  51. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).Google ScholarGoogle Scholar
  52. Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 363--376.Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Zhiping Zeng, Anthony KH Tung, Jianyong Wang, Jianhua Feng, and Lizhu Zhou. 2009. Comparing stars: On approximating graph edit distance. Proceedings of the VLDB Endowment 2, 1 (2009), 25--36.Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Fangfang Zhang, Dinghao Wu, Peng Liu, and Sencun Zhu. 2014. Program logic based software plagiarism detection. In 2014 IEEE 25th International Symposium on Software Reliability Engineering. IEEE, 66--77.Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Gang Zhao and Jeff Huang. 2018. Deepsim: deep learning code functional similarity. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 141--151.Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Minhaz Fahim Zibran and Chanchal Kumar Roy. 2013. Conflict-aware optimal scheduling of prioritised code clone refactoring. IET software 7, 3 (2013), 167--186.Google ScholarGoogle Scholar

Index Terms

  1. funcGNN: A Graph Neural Network Approach to Program Similarity

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          ESEM '20: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)
          October 2020
          412 pages
          ISBN:9781450375801
          DOI:10.1145/3382494

          Copyright © 2020 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 23 October 2020

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

          ESEM '20 Paper Acceptance Rate26of123submissions,21%Overall Acceptance Rate130of594submissions,22%

          Upcoming Conference

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader