Published in: Arabian Journal for Science and Engineering 2/2022

12-09-2021 | Research Article-Computer Engineering and Computer Science

Effect of Identifier Tokenization on Automatic Source Code Documentation

Authors: Sawan Rai, Ramesh Chandra Belwal, Atul Gupta


Abstract

In software development, source code documentation plays an essential role in program comprehension and software maintenance. Natural language descriptions and identifier names are its main parts. Generating this documentation automatically saves developers working hours, and automatic source code documentation is a rapidly growing research area. Researchers have proposed various template-based, information retrieval (IR)-based, and learning-based techniques for it. However, little work has examined preprocessing and its effect on automatic source code documentation. Tokenization is one of the essential preprocessing steps. We found important flaws in the basic tokenization steps that could affect the performance of automatic source code documentation. We therefore propose an updated tokenization approach that removes these flaws. To analyze its effect, we performed method name prediction and comment generation studies. We found that the updated tokenization improved the performance of automatic source code documentation: name prediction and comment generation improved by more than 2.5% and 3.5%, respectively, in terms of F1 score.
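
To make the terms in the abstract concrete, the following is a minimal sketch of what basic identifier tokenization and an F1 comparison can look like. It is not the authors' updated approach, which the abstract does not describe. The function names `split_identifier` and `subtoken_f1` are hypothetical, and computing F1 over subtokens is an assumption borrowed from common practice in method name prediction.

```python
import re
from collections import Counter


def split_identifier(name: str) -> list[str]:
    """Split an identifier into lowercase subtokens.

    Illustrative baseline only: splits on underscores, other
    non-word characters, case changes, and digit runs.
    """
    parts = []
    for chunk in re.split(r"[_\W]+", name):
        # Acronym before a capitalized word, a (capitalized) word,
        # a trailing acronym, or a run of digits.
        parts.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", chunk)
        )
    return [p.lower() for p in parts if p]


def subtoken_f1(predicted: str, reference: str) -> float:
    """F1 over the subtokens of a predicted and a reference name (assumed metric)."""
    pred = Counter(split_identifier(predicted))
    ref = Counter(split_identifier(reference))
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(split_identifier("parseHTTPResponse"))
    print(split_identifier("get_user_id2"))
    print(subtoken_f1("getUserName", "getName"))
```

In this sketch, a predicted name that shares two of its three subtokens with a two-subtoken reference has precision 2/3 and recall 1, which shows how the choice of splitting rules feeds directly into the reported score.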


Metadata
Title
Effect of Identifier Tokenization on Automatic Source Code Documentation
Authors
Sawan Rai
Ramesh Chandra Belwal
Atul Gupta
Publication date
12-09-2021
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 2/2022
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-021-06149-7
