Empirical Software Engineering

Publication date: 01-11-2023

CoCoAST: Representing Source Code via Hierarchical Splitting and Reconstruction of Abstract Syntax Trees

Authors: Ensheng Shi, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun

Published in: Empirical Software Engineering | Issue 6/2023

Abstract

Recently, machine learning techniques, especially deep learning techniques, have made substantial progress on code intelligence tasks such as code summarization, code search, and clone detection. A key challenge is how to represent source code so as to effectively capture its syntactic, structural, and semantic information. Recent studies show that information extracted from abstract syntax trees (ASTs) is conducive to code representation learning. However, existing approaches fail to fully capture the rich information in ASTs because of their large size and depth. In this paper, we propose a novel model, CoCoAST, that hierarchically splits and reconstructs ASTs to comprehensively capture the syntactic and semantic information of code without losing AST structural information. First, we hierarchically split a large AST into a set of subtrees and use a recursive neural network to encode the subtrees. Then, we aggregate the subtree embeddings by reconstructing the split ASTs to obtain the representation of the complete AST. Finally, we combine the AST representation, which carries the syntactic and structural information, with the source code embedding, which represents the lexical information, to obtain the final neural code representation. We have applied our source code representation to two common program comprehension tasks, code summarization and code search. Extensive experiments have demonstrated the superiority of CoCoAST. To facilitate reproducibility, our data and code are available at https://github.com/s1530129650/CoCoAST.
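The split-then-reconstruct idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the splitting rule (`SPLIT_AT`, cutting at function, loop, and branch nodes), the `Subtree` class, and all function names are hypothetical, the input is Python rather than the languages used in the paper, and a plain count of node types stands in for the learned recursive neural network encoder. It only shows that cutting a large AST into nested subtrees and aggregating bottom-up along the recorded nesting can cover every node of the original tree.

```python
import ast
from collections import Counter
from dataclasses import dataclass, field

# Assumed splitting rule (hypothetical): these node types start a new subtree.
SPLIT_AT = (ast.FunctionDef, ast.For, ast.While, ast.If)


@dataclass
class Subtree:
    root: str                                       # node type at the subtree root
    types: Counter = field(default_factory=Counter) # node types kept in this subtree
    children: list = field(default_factory=list)    # nested subtrees (structure tree)


def split(node):
    """Cut the AST at SPLIT_AT nodes and return the root Subtree."""
    sub = Subtree(root=type(node).__name__)

    def visit(n):
        sub.types[type(n).__name__] += 1
        for child in ast.iter_child_nodes(n):
            if isinstance(child, SPLIT_AT):
                sub.children.append(split(child))   # child opens a new subtree
            else:
                visit(child)                        # child stays in this subtree

    visit(node)
    return sub


def encode(sub):
    """Stand-in for a learned subtree encoder: a bag of node-type counts."""
    return Counter(sub.types)


def reconstruct(sub):
    """Aggregate subtree encodings bottom-up along the structure tree."""
    total = encode(sub)
    for child in sub.children:
        total += reconstruct(child)
    return total


def count_subtrees(sub):
    return 1 + sum(count_subtrees(c) for c in sub.children)


SOURCE = """
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s += x
    return s
"""

tree = ast.parse(SOURCE)
root = split(tree)
full = reconstruct(root)
print(count_subtrees(root), "subtrees;", sum(full.values()), "nodes covered")
```

In this sketch the aggregated counts equal the node-type counts of the unsplit tree, which is the toy analogue of reconstructing the complete AST representation from its subtrees. A real model would replace `encode` with a trainable network and the counter sum with a learned aggregation.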

The full AST is omitted due to space limits; it can be found in Appendix C.

We present only the top-down skeleton and partial rules due to space limitations. The full set of rules and the tool implementation are provided in Appendix D.

A "rare" token refers to a token that occurs infrequently in the training dataset.

See training time details in Appendix Tables 14 and 15.

See Appendix Table 17.

Title: CoCoAST: Representing Source Code via Hierarchical Splitting and Reconstruction of Abstract Syntax Trees
Authors: Ensheng Shi, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun
Publication date: 01-11-2023
Publisher: Springer US
Published in: Empirical Software Engineering / Issue 6/2023
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI: https://doi.org/10.1007/s10664-023-10378-9
