Arabian Journal for Science and Engineering

12-09-2021 | Research Article-Computer Engineering and Computer Science

Effect of Identifier Tokenization on Automatic Source Code Documentation

Authors: Sawan Rai, Ramesh Chandra Belwal, Atul Gupta

Published in: Arabian Journal for Science and Engineering | Issue 2/2022

Abstract

In software development, source code documents play an essential role in program comprehension and software maintenance. Natural language descriptions and identifier names are the main parts of a source code document. Generating source code documents automatically saves developers working hours, and automatic source code documentation is a rapidly growing research area. Researchers have proposed various template-based, information retrieval (IR)-based, and learning-based techniques for automatic source code documentation. However, little work has examined preprocessing and its effect on automatic source code documentation. Tokenization is one of the essential preprocessing steps. We found important flaws in the basic tokenization steps that could affect automatic source code documentation performance. Therefore, we propose an updated tokenization approach that removes these flaws. We performed method name prediction and comment generation studies to analyze the effect of the updated tokenization approach. We found that the updated tokenization improved the performance of automatic source code documentation: name prediction and comment generation improved by more than 2.5% and 3.5%, respectively, in terms of F1 score.
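To make the terms in the abstract concrete, the following is a minimal sketch of what basic identifier tokenization and a subtoken-level F1 score commonly look like. It is an illustration under stated assumptions, not the authors' method: the abstract does not describe the flaws they found or their updated approach, and the function names `split_identifier` and `subtoken_f1`, the splitting rules, and the F1 definition are all hypothetical choices made here.

```python
import re


def split_identifier(name):
    """Split an identifier into lowercase subtokens (illustrative rules only).

    Assumed conventions: underscores and other non-word characters separate
    parts; within a part, camelCase boundaries, acronyms, and digit runs
    each become their own subtoken.
    """
    tokens = []
    for part in re.split(r"[_\W]+", name):
        # Alternatives, in order: an acronym followed by a capitalized word,
        # a capitalized or lowercase word, a trailing acronym, a digit run.
        tokens.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
        )
    return [t.lower() for t in tokens]


def subtoken_f1(predicted, reference):
    """F1 over subtokens of a predicted name against a reference name.

    Assumption: subtokens are matched as a multiset, ignoring order. The
    exact metric definition used in the paper is not given in the abstract.
    """
    pred = split_identifier(predicted)
    ref = split_identifier(reference)
    remaining = list(ref)
    true_positives = 0
    for token in pred:
        if token in remaining:
            remaining.remove(token)  # each reference subtoken matches once
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(split_identifier("getHTTPResponseCode"))
    print(split_identifier("max_value2"))
    print(subtoken_f1("getResponseCode", "getHTTPResponseCode"))
```

Under these assumed rules, a predicted name that misses one of four reference subtokens has perfect precision but a recall of 0.75, which is how a change in tokenization alone can shift a reported F1 score.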

Title: Effect of Identifier Tokenization on Automatic Source Code Documentation
Authors: Sawan Rai, Ramesh Chandra Belwal, Atul Gupta
Publication date: 12-09-2021
Publisher: Springer Berlin Heidelberg
Published in: Arabian Journal for Science and Engineering / Issue 2/2022
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI: https://doi.org/10.1007/s13369-021-06149-7
