
Modular Tree Network for Source Code Representation Learning

Published: 26 September 2020

Abstract

Learning representations of source code is a foundation of many program analysis tasks. In recent years, neural networks have shown success in this area, but most existing models do not make full use of the unique structural information of programs. Although abstract syntax tree (AST)-based neural models can handle the tree structure of source code, they cannot capture the rich variety of substructure types in programs. In this article, we propose a modular tree network that dynamically composes different neural network units into tree structures based on the input AST. Unlike previous tree-structured neural network models, a modular tree network can capture the semantic differences among types of AST substructures. We evaluate our model on two tasks: program classification and code clone detection. Our model achieves the best performance compared with state-of-the-art approaches on both tasks, demonstrating the advantage of leveraging more elaborate structural information of source code.
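The core idea described above — choosing a composition unit per AST node type rather than sharing one unit across all nodes — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the node types, toy "modules," and averaging rule are assumptions made for clarity, whereas a real model would use learned cells (e.g., Tree-LSTM variants) selected by node type.

```python
# Minimal sketch of per-node-type modular composition over an AST.
# Each "module" here is a toy function that averages its children's
# vectors and rescales them; a real model would use learned parameters.

DIM = 4  # embedding dimension (illustrative)

def make_module(scale):
    # Build a toy composition unit for one AST node type.
    def module(child_vecs):
        summed = [sum(v[i] for v in child_vecs) for i in range(DIM)]
        n = len(child_vecs)
        return [scale * x / n for x in summed]
    return module

# One module per AST node type (node types assumed for illustration).
MODULES = {
    "If": make_module(0.5),
    "While": make_module(0.8),
    "BinOp": make_module(1.0),
}
# Leaf token embeddings (assumed values for illustration).
LEAF_EMBEDDING = {"x": [1.0] * DIM, "y": [2.0] * DIM}

def encode(node):
    """Encode an AST given as (type, children) tuples or leaf tokens.

    The composition unit is dispatched on the node's type, so different
    substructures are processed by different modules -- the key idea of
    a modular tree network.
    """
    if isinstance(node, str):                  # leaf: look up embedding
        return LEAF_EMBEDDING[node]
    node_type, children = node
    child_vecs = [encode(c) for c in children] # bottom-up recursion
    return MODULES[node_type](child_vecs)      # type-specific module

# Example: a subtree like `if (...) { x + y }` modeled as If(BinOp(x, y)).
vec = encode(("If", [("BinOp", ["x", "y"])]))
print(vec)
```

The dispatch on `node_type` is what distinguishes this from a standard recursive/tree network, which would apply one shared function at every internal node regardless of its syntactic role.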



• Published in

  ACM Transactions on Software Engineering and Methodology, Volume 29, Issue 4
  Continuous Special Section: AI and SE
  October 2020, 307 pages
  ISSN: 1049-331X
  EISSN: 1557-7392
  DOI: 10.1145/3409663
  Editor: Mauro Pezzè
      Copyright © 2020 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 January 2020
• Revised: 1 May 2020
• Accepted: 1 June 2020
• Published: 26 September 2020

Published in TOSEM Volume 29, Issue 4


      Qualifiers

      • research-article
      • Research
      • Refereed
