research-article

A learning-based approach for automatic construction of domain glossary from source code and documentation

Authors:
Chong Wang, Fudan University, China
Xin Peng, Fudan University, China
Mingwei Liu, Fudan University, China
Zhenchang Xing, Australian National University, Australia
Xuefang Bai, Fudan University, China
Bing Xie, Peking University, China
Tuo Wang, Fudan University, China

ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, August 2019, Pages 97–108. https://doi.org/10.1145/3338906.3338963

Published: 12 August 2019

ABSTRACT

A domain glossary that organizes domain-specific concepts together with their aliases and relations is essential for knowledge acquisition and software development. Existing approaches identify domain-specific terms from software documentation using linguistic heuristics or term-frequency-based statistics, and their accuracy is often low. In this paper, we propose a learning-based approach for the automatic construction of a domain glossary from source code and software documentation. The approach uses a set of high-quality seed terms, identified from code identifiers and natural language concept definitions, to train a domain-specific prediction model that recognizes glossary terms based on the lexical and semantic context of the sentences mentioning domain-specific concepts. It then merges the aliases of the same concept into a canonical name, selects a set of explanation sentences for each concept, and identifies "is a", "has a", and "related to" relations between the concepts. We apply our approach to the deep learning and Hadoop domains and harvest 5,382 and 2,069 concepts together with 16,962 and 6,815 relations, respectively. Our evaluation validates the accuracy of the extracted domain glossaries and their usefulness for the fusion and acquisition of knowledge from different documents of different projects.
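The abstract does not say how candidate terms are drawn from code identifiers or how aliases are merged, so the following is only a minimal sketch under assumed heuristics, not the approach evaluated in the paper. It splits identifiers into words and groups surface forms that differ only in case, hyphenation, a trailing plural "s", or acronym versus expanded form. The function names `split_identifier`, `normalize`, and `merge_aliases` are hypothetical and chosen for illustration.

```python
import re
from collections import defaultdict


def split_identifier(identifier: str) -> list[str]:
    """Split a camelCase, PascalCase, or snake_case identifier into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Alternatives, in order: an acronym followed by a capitalized word,
        # a (possibly capitalized) lowercase word, a trailing acronym, digits.
        words.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
        )
    return [w.lower() for w in words]


def normalize(term: str) -> str:
    """Map a surface form to a comparison key (assumed heuristic)."""
    words = [w for w in re.split(r"[\s\-_]+", term.lower().strip()) if w]
    # Crude singularization: drop one trailing "s" from the last word.
    # This is deliberately naive and mishandles words such as "bias".
    if words:
        last = words[-1]
        if len(last) > 3 and last.endswith("s") and not last.endswith("ss"):
            words[-1] = last[:-1]
    return " ".join(words)


def merge_aliases(terms: list[str]) -> dict[str, set[str]]:
    """Group surface forms under one canonical (normalized, expanded) name."""
    groups: dict[str, set[str]] = defaultdict(set)
    for term in terms:
        groups[normalize(term)].add(term)

    # Link an acronym key (e.g. "rnn") to a multi-word key with the same initials.
    initials = {}
    for key in groups:
        words = key.split()
        if len(words) > 1:
            initials["".join(w[0] for w in words)] = key

    merged: dict[str, set[str]] = defaultdict(set)
    for key, members in groups.items():
        merged[initials.get(key, key)] |= members
    return dict(merged)


if __name__ == "__main__":
    print(split_identifier("HDFSDataNode"))
    print(merge_aliases(["RNN", "RNNs", "recurrent neural network", "data-nodes"]))
```

A learned model, as the abstract proposes, would replace these hand-written rules. The sketch only makes concrete what "seed terms from code identifiers" and "merging aliases to canonical names" could mean at the string level.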

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Documentation

Published in
ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
August 2019
1264 pages
ISBN: 9781450355728
DOI: 10.1145/3338906
General Chairs: Marlon Dumas (University of Tartu, Estonia), Dietmar Pfahl (University of Tartu, Estonia)
Program Chairs: Sven Apel (Saarland University, Germany), Alessandra Russo (Imperial College, UK)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
concept
documentation
domain glossary
knowledge
learning
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 112 of 543 submissions, 21%