ABSTRACT
A domain glossary that organizes domain-specific concepts together with their aliases and relations is essential for knowledge acquisition and software development. Existing approaches identify domain-specific terms in software documentation using linguistic heuristics or term-frequency-based statistics, and their accuracy is often low. In this paper, we propose a learning-based approach for the automatic construction of a domain glossary from source code and software documentation. The approach uses a set of high-quality seed terms, identified from code identifiers and natural language concept definitions, to train a domain-specific prediction model that recognizes glossary terms based on the lexical and semantic context of the sentences mentioning domain-specific concepts. It then merges the aliases of each concept under its canonical name, selects a set of explanation sentences for each concept, and identifies "is a", "has a", and "related to" relations between the concepts. We apply our approach to the deep learning and Hadoop domains and harvest 5,382 and 2,069 concepts together with 16,962 and 6,815 relations, respectively. Our evaluation validates the accuracy of the extracted domain glossary and its usefulness for the fusion and acquisition of knowledge from different documents of different projects.
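To make two of the steps above concrete, the following minimal sketch shows one way to derive candidate terms from code identifiers and to merge trivially different aliases under a canonical name. It is an illustration under simplifying assumptions, not the implementation described in the paper: the function names (`split_identifier`, `alias_key`, `merge_aliases`) are hypothetical, the alias rule only ignores case and separators, and the canonical name is simply the first surface form seen.

```python
import re
from collections import defaultdict


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Alternatives, in order: an acronym run that precedes a capitalized
        # word, a (possibly capitalized) lowercase word, a remaining run of
        # capitals, or a run of digits.
        words += re.findall(
            r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part
        )
    return [w.lower() for w in words]


def alias_key(term):
    """Normalize a surface form so spellings differing only in case or separators collide."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


def merge_aliases(terms):
    """Group surface forms by normalized key.

    Assumption for this sketch: the first form seen in a group is treated
    as the canonical name, and the rest are recorded as its aliases.
    """
    groups = defaultdict(list)
    for term in terms:
        forms = groups[alias_key(term)]
        if term not in forms:
            forms.append(term)
    return {forms[0]: forms[1:] for forms in groups.values()}


if __name__ == "__main__":
    print(split_identifier("JobTracker"))
    print(split_identifier("HDFSBlock"))
    print(merge_aliases(["name node", "NameNode", "namenode", "data node"]))
```

A realistic pipeline would need a stronger alias criterion (for example abbreviations and morphological variants) and a principled choice of canonical name; the sketch only shows the shape of the data flowing between the two steps.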
ESEC/FSE '19, August 26–30, 2019, Tallinn, Estonia. Chong Wang, Xin Peng, Mingwei Liu, Zhenchang Xing, Xuefang Bai, Bing Xie, and Tuo Wang.
Index Terms
- A learning-based approach for automatic construction of domain glossary from source code and documentation