research-article

Open Access

funcGNN: A Graph Neural Network Approach to Program Similarity

Authors:
Aravind Nair

KTH Royal Institute of Technology, Stockholm, Sweden

KTH Royal Institute of Technology, Stockholm, Sweden
View Profile

,
Avijit Roy

KTH Royal Institute of Technology, Stockholm, Sweden

KTH Royal Institute of Technology, Stockholm, Sweden
View Profile

,
Karl Meinke

KTH Royal Institute of Technology, Stockholm, Sweden

KTH Royal Institute of Technology, Stockholm, Sweden
View Profile

ESEM '20: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)October 2020Article No.: 10Pages 1–11https://doi.org/10.1145/3382494.3410675

Published:23 October 2020Publication History

ESEM '20: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)

Pages 1–11

ABSTRACT

Background: Program similarity is a fundamental concept, central to the solution of software engineering tasks such as software plagiarism, clone identification, code refactoring and code search. Accurate similarity estimation between programs requires an in-depth understanding of their structure, semantics and flow. A control flow graph (CFG), is a graphical representation of a program which captures its logical control flow and hence its semantics. A common approach is to estimate program similarity by analysing CFGs using graph similarity measures, e.g. graph edit distance (GED). However, graph edit distance is an NP-hard problem and computationally expensive, making the application of graph similarity techniques to complex software programs impractical. Aim: This study intends to examine the effectiveness of graph neural networks to estimate program similarity, by analysing the associated control flow graphs. Method: We introduce funcGNN1, which is a graph neural network trained on labeled CFG pairs to predict the GED between unseen program pairs by utilizing an effective embedding vector. To our knowledge, this is the first time graph neural networks have been applied on labeled CFGs for estimating the similarity between highlevel language programs. Results: We demonstrate the effectiveness of funcGNN to estimate the GED between programs and our experimental analysis demonstrates how it achieves a lower error rate (1.94 x10-3), with faster (23 times faster than the quickest traditional GED approximation method) and better scalability compared with state of the art methods. Conclusion: funcGNN posses the inductive learning ability to infer program structure and generalise to unseen programs. The graph embedding of a program proposed by our methodology could be applied to several related software engineering problems (such as code plagiarism and clone identification) thus opening multiple research directions.

References

Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740 ( 2017).Google Scholar
Yunsheng Bai, Hao Ding, Song Bian, Ting Chen, Yizhou Sun, and Wei Wang. 2019. Simgnn: A neural network approach to fast graph similarity computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 384--392.Google ScholarDigital Library
Hal L Berghel and David L Sallach. 1984. Measurements of program similarity in identical task environments. ACM SIGPLAN Notices 19, 8 (1984), 65--76.Google ScholarDigital Library
David B Blumenthal and Johann Gamper. 2018. On the exact computation of the graph edit distance. Pattern Recognition Letters (2018).Google Scholar
Sebastien Bougleux, Luc Brun, Vincenzo Carletti, Pasquale Foggia, Benoit Gaüzère, and Mario Vento. 2017. Graph edit distance as a quadratic assignment problem. Pattern Recognition Letters 87 (2017), 38--46.Google ScholarDigital Library
Horst Bunke. 1997. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters 18, 8 (1997), 689--694.Google ScholarDigital Library
Horst Bunke and Kim Shearer. 1998. A graph distance metric based on the maximal common subgraph. Pattern recognition letters 19, 3-4 (1998), 255--259.Google Scholar
Silvio Cesare and Yang Xiang. 2011. Malware variant detection using similarity search over sets of control flow graphs. In 2011IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, 181--189.Google ScholarDigital Library
William W Cohen, Pradeep Ravikumar, Stephen E Fienberg, et al. 2003. A Comparison of String Distance Metrics for Name-Matching Tasks.. In IIWeb, Vol. 2003. 73--78.Google ScholarDigital Library
Chandan Dasgupta. 2010. That is not my program: Investigating the relation between program comprehension and program authorship. In Proceedings of the 48th Annual Southeast Regional Conference. 1--4.Google ScholarDigital Library
Jinan AW Faidhi and Stuart K Robinson. 1987. An empirical approach for detecting program similarity and plagiarism within a university programming environment. Computers & Education 11, 1 (1987), 11--19.Google ScholarDigital Library
Stefan Fankhauser, Kaspar Riesen, and Horst Bunke. 2011. Speeding up graph edit distance computation through fast bipartite matching. In International Workshop on Graph-Based Representations in Pattern Recognition. Springer, 102--111.Google ScholarDigital Library
Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. 2016. Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 480--491.Google ScholarDigital Library
Andreas Fischer, Kaspar Riesen, and Horst Bunke. 2017. Improved quadratic time approximation of graph edit distance by combining Hausdorff matching and greedy assignment. Pattern Recognition Letters 87 (2017), 55--62.Google ScholarDigital Library
Robert Geist, A. Jefferson Offutt, and Frederick C Harris Jr. 1992. Estimation and enhancement of real-time software reliability through mutation analysis. IEEE Trans. Comput. 5 (1992), 550--558.Google ScholarDigital Library
Marco Gori, Gabriele Monfardini, and Franco Scarselli. 2005. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 2. IEEE, 729--734.Google ScholarCross Ref
Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in neural information processing systems. 1024--1034.Google Scholar
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In European conference on computer vision. Springer, 630--645.Google ScholarCross Ref
Kenneth Hoste, Aashish Phansalkar, Lieven Eeckhout, Andy Georges, Lizy K John, and Koen De Bosschere. 2006. Performance prediction based on inherent program similarity. In 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 114--122.Google ScholarDigital Library
John Hunt. 2019. Futures. In Advanced Guide to Python 3 Programming. Springer, 395--405.Google Scholar
Vaibhavi Kalgutkar, Ratinder Kaur, Hugo Gonzalez, Natalia Stakhanova, and Alina Matyukhina. 2019. Code authorship attribution: Methods and challenges. ACM Computing Surveys (CSUR) 52, 1 (2019), 1--36.Google ScholarDigital Library
Upulee Kanewala, James M Bieman, and Asa Ben-Hur. 2016. Predicting metamorphic relations for testing scientific software: a machine learning approach using graph kernels. Software testing, verification and reliability 26, 3 (2016), 245--269.Google Scholar
Abhishek Karnik, Suchandra Goswami, and Ratan Guha. 2007. Detecting obfuscated viruses using cosine similarity analysis. In First Asia International Conference on Modelling & Simulation (AMS'07). IEEE, 165--170.Google ScholarDigital Library
Iman Keivanloo, Juergen Rilling, and Ying Zou. 2014. Spotting working code examples. In Proceedings of the 36th International Conference on Software Engineering. 664--675.Google ScholarDigital Library
Mohamed A Khamsi and William A Kirk. 2011. An introduction to metric spaces and fixed point theory. Vol. 53. John Wiley & Sons.Google Scholar
Yesol Kim, Jonghyuk Park, Seong-je Cho, Yunmook Nah, Sangchul Han, and Minkyu Park. 2015. Machine learning-based software classification scheme for efficient program similarity analysis. In Proceedings of the 2015 Conference on research in adaptive and convergent systems. 114--118.Google ScholarDigital Library
Thomas NKipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).Google Scholar
Jens Krinke. 2006. Mining control flow graphs for crosscutting concerns. In 2006 13th Working Conference on Reverse Engineering. IEEE, 334--342.Google ScholarDigital Library
Giri Panamoottil Krishnan and Nikolaos Tsantalis. 2014. Unification and refactoring of clones. In 2014 Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). IEEE, 104--113.Google ScholarCross Ref
Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.Google ScholarCross Ref
Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. 2019. Graph matching networks for learning the similarity of graph structured objects. arXiv preprint arXiv:1904.12787 (2019).Google Scholar
Jonathan I Maletic and Andrian Marcus. 2000. Using latent semantic analysis to identify similarities in source code to support program understanding. In Proceedings 12th IEEE Internationals Conference on Tools with Artificial Intelligence. ICTAI 2000. IEEE, 46--53.Google ScholarCross Ref
Niccolò Marastoni, Roberto Giacobazzi, and Mila Dalla Preda. 2018. A deep learning approach to program similarity. In Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis. 26--35.Google ScholarDigital Library
Aravind Nair, Karl Meinke, and Sigrid Eldh. 2019. Leveraging mutants for automatic prediction of metamorphic relations using machine learning. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation. 1--6.Google ScholarDigital Library
Animesh Nandi, Atri Mandal, Shubham Atreja, Gargi B Dasgupta, and Subhrajit Bhattacharya. 2016. Anomaly detection using program control flow graph mining from execution logs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 215--224.Google ScholarDigital Library
Michel Neuhaus and Horst Bunke. 2006. A random walk kernel derived from graph edit distance. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 191--199.Google ScholarDigital Library
Michel Neuhaus and Horst Bunke. 2007. A quadratic programming approach to the graph edit distance problem. In International Workshop on Graph-Based Representations in Pattern Recognition. Springer, 92--102.Google ScholarCross Ref
Michel Neuhaus, Kaspar Riesen, and Horst Bunke. 2006. Fast suboptimal algorithms for the computation of graph edit distance. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 163--172.Google ScholarDigital Library
Haoran Niu, Iman Keivanloo, and Ying Zou. 2017. Learning to rank code examples for code search engines. Empirical Software Engineering 22, 1 (2017), 259--291.Google ScholarDigital Library
Anh Viet Phan, Minh Le Nguyen, and Lam Thu Bui. 2017. Convolutional neural networks over control flow graphs for software defect prediction. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 45--52.Google Scholar
Aashish Phansalkar, Ajay Joshi, Lieven Eeckhout, and Lizy Kurian John. 2005. Measuring program similarity: Experiments with SPEC CPU benchmark suites. In IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. IEEE, 10--20.Google ScholarDigital Library
Kaspar Riesen and Horst Bunke. 2009. Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision computing 27, 7 (2009), 950--959.Google Scholar
Kaspar Riesen, Andreas Fischer, and Horst Bunke. 2014. Improving approximate graph edit distance using genetic algorithms. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 63--72.Google ScholarDigital Library
Eric Sven Ristad and Peter N Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 5 (1998), 522--532.Google ScholarDigital Library
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2008), 61--80.Google ScholarDigital Library
Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems. 926--934.Google Scholar
Raja Vallee-Rai and Laurie J Hendren. 1998. Jimple: Simplifying Java bytecode for analyses and transformations. (1998).Google Scholar
SVN Vishwanathan, Karsten M Borgwardt, Nicol N Schraudolph, et al. 2006. Fast computation of graph kernels. In NIPS, Vol. 19. 131--138.Google Scholar
Milena Vujoševic-Janičic, Mladen Nikolic, Dušan Tošic, and Viktor Kuncak. 2013. Software verification and graph similarity for automated evaluation of students' assignments. Information and Software Technology 55, 6 (2013), 1004--1016.Google ScholarDigital Library
Andrew Walenstein, Mohammad El-Ramly, James R Cordy, William S Evans, Kiarash Mahdavi, Markus Pizka, Ganesan Ramalingam, and Jürgen Wolff von Gudenberg. 2007. Similarity in programs. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.Google Scholar
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).Google Scholar
Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 363--376.Google ScholarDigital Library
Zhiping Zeng, Anthony KH Tung, Jianyong Wang, Jianhua Feng, and Lizhu Zhou. 2009. Comparing stars: On approximating graph edit distance. Proceedings of the VLDB Endowment 2, 1 (2009), 25--36.Google ScholarDigital Library
Fangfang Zhang, Dinghao Wu, Peng Liu, and Sencun Zhu. 2014. Program logic based software plagiarism detection. In 2014 IEEE 25th International Symposium on Software Reliability Engineering. IEEE, 66--77.Google ScholarDigital Library
Gang Zhao and Jeff Huang. 2018. Deepsim: deep learning code functional similarity. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 141--151.Google ScholarDigital Library
Minhaz Fahim Zibran and Chanchal Kumar Roy. 2013. Conflict-aware optimal scheduling of prioritised code clone refactoring. IET software 7, 3 (2013), 167--186.Google Scholar

Index Terms

funcGNN: A Graph Neural Network Approach to Program Similarity
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
      1. Semantic networks
  2. Machine learning
    1. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Similarity measures

Recommendations

Graph Similarity Using Tree Edit Distance
Structural, Syntactic, and Statistical Pattern Recognition
Abstract
Graph similarity is the process of finding similarity between two graphs. Graph edit distance is one of the key techniques to find the similarity between two graphs. The main disadvantage of graph edit distance is that it is computationally ...
Read More
Efficient graph edit distance computation using isomorphic vertices
Highlights
- Reducing the search space of GED computation by exploiting isomorphic vertices.
- Developing algorithms for identifying isomorphic vertices.
- Proposing GED computation algorithm based on isomorphic vertices.
- Finding that real ...
Abstract
In this paper, we study the problem of graph edit distance (GED) computation. We empirically observe that real graph data contain many isomorphic substructures, which incur redundant computation. Based on this observation, we aim at reducing the ...
Read More
On the minimum common supergraph of two graphs
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in

ESEM '20: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)
October 2020
412 pages
ISBN:9781450375801
DOI:10.1145/3382494

Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 October 2020
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Attention Mechanism
Control Flow Graph
Graph Edit Distance
Graph Embedding
Graph Neural Network
Graph Similarity
Machine Learning
ProgramSimilarity
Software Engineering
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
ESEM '20 Paper Acceptance Rate26of123submissions,21%Overall Acceptance Rate130of594submissions,22%
More
Upcoming Conference
ESEM '24

Sponsor:

sigsoft

ACM / IEEE International Symposium on Empirical Software Engineering and Measurement

October 24 - 25, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 14
  Total Citations
  View Citations
- 1,581
  Total Downloads
- Downloads (Last 12 months)432
- Downloads (Last 6 weeks)52
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

funcGNN: A Graph Neural Network Approach to Program Similarity

ESEM '20: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)

ABSTRACT

References

Cited By

Index Terms

Recommendations

Graph Similarity Using Tree Edit Distance

Efficient graph edit distance computation using isomorphic vertices

On the minimum common supergraph of two graphs

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

funcGNN: A Graph Neural Network Approach to Program Similarity

ESEM '20: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)

ABSTRACT

References

Cited By

Index Terms

Recommendations

Graph Similarity Using Tree Edit Distance

Efficient graph edit distance computation using isomorphic vertices

On the minimum common supergraph of two graphs

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media