Published in: Empirical Software Engineering 6/2023

1 November 2023

CoCoAST: Representing Source Code via Hierarchical Splitting and Reconstruction of Abstract Syntax Trees

Authors: Ensheng Shi, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun


Abstract

Recently, machine learning techniques, especially deep learning techniques, have made substantial progress on code intelligence tasks such as code summarization, code search, and clone detection. A key challenge is how to represent source code so that its syntactic, structural, and semantic information is captured effectively. Recent studies show that information extracted from abstract syntax trees (ASTs) is conducive to code representation learning. However, existing approaches fail to fully capture the rich information in ASTs because of the large size and depth of ASTs. In this paper, we propose a novel model, CoCoAST, that hierarchically splits and reconstructs ASTs to comprehensively capture the syntactic and semantic information of code without losing AST structural information. First, we hierarchically split a large AST into a set of subtrees and use a recursive neural network to encode the subtrees. Then, we aggregate the subtree embeddings by reconstructing the split ASTs to obtain the representation of the complete AST. Finally, we combine the AST representation, which carries the syntactic and structural information, with the source code embedding, which represents the lexical information, to obtain the final neural code representation. We have applied our source code representation to two common program comprehension tasks, code summarization and code search. Extensive experiments have demonstrated the superiority of CoCoAST. To facilitate reproducibility, our data and code are available at https://github.com/s1530129650/CoCoAST.
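The split-then-reconstruct idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Python's standard `ast` module as the parser, splits only at nested statement blocks (`body`, `orelse`, `finalbody`), and replaces the recursive neural network with a toy bag-of-node-types summary. The names `Subtree`, `scan`, `split`, and `encode` are hypothetical and chosen for illustration only.

```python
import ast
from collections import Counter
from dataclasses import dataclass, field

# Assumption: a "split point" is any nested statement block. The paper's
# actual splitting rules are not reproduced on this page.
BLOCK_FIELDS = ("body", "orelse", "finalbody")


@dataclass
class Subtree:
    """One piece of the split AST, linked to the pieces nested inside it."""
    label: str                                   # e.g. "For.body"
    own: list                                    # node types kept in this piece
    children: list = field(default_factory=list)


def scan(node, own, blocks):
    """Collect node types that stay in the current piece (own) and the
    nested statement blocks that are split off (blocks)."""
    own.append(type(node).__name__)
    for name, value in ast.iter_fields(node):
        if name in BLOCK_FIELDS and isinstance(value, list):
            if value:
                blocks.append((f"{type(node).__name__}.{name}", value))
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, ast.AST):
                    scan(item, own, blocks)
        elif isinstance(value, ast.AST):
            scan(value, own, blocks)


def split(label, stmts):
    """Hierarchical split: each nested block becomes a child subtree, so the
    parent/child links preserve the structure of the original AST."""
    own, blocks = [], []
    for stmt in stmts:
        scan(stmt, own, blocks)
    return Subtree(label, own, [split(lbl, blk) for lbl, blk in blocks])


def encode(piece):
    """Toy stand-in for a learned encoder: summarize a piece by its node-type
    counts, then aggregate child summaries bottom-up along the split links."""
    summary = Counter(piece.own)
    for child in piece.children:
        summary += encode(child)
    return summary


SOURCE = '''
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s += x
    return s
'''

func = ast.parse(SOURCE).body[0]
root = split("root", [func])          # hierarchical splitting
representation = encode(root)         # aggregation over the reconstructed tree
print(root.children[0].label, len(root.children[0].own))
print(sum(representation.values()), "nodes summarized")
```

Because every AST node lands in exactly one piece, aggregating along the parent/child links covers the whole tree while each individual piece stays small. A learned model would replace the `Counter` with subtree embeddings and a trainable aggregation step.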


Appendices
Accessible only with authorization
Footnotes
1
The full AST is omitted due to space limits; it can be found in Appendix C.

2
We present only the top-down skeleton and a subset of the rules due to space limitations. The full set of rules and the tool implementation are provided in Appendix D.

3
A "rare" token refers to a token that occurs infrequently in the training dataset.

4
See training time details in Appendix Tables 14 and 15.

5
See Appendix Table 17.

Metadata
Title
CoCoAST: Representing Source Code via Hierarchical Splitting and Reconstruction of Abstract Syntax Trees
Authors
Ensheng Shi
Yanlin Wang
Lun Du
Hongyu Zhang
Shi Han
Dongmei Zhang
Hongbin Sun
Publication date
1 November 2023
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 6/2023
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-023-10378-9
