DOI: 10.1145/3338906.3338963
research-article

A learning-based approach for automatic construction of domain glossary from source code and documentation

Published: 12 August 2019

ABSTRACT

A domain glossary that organizes domain-specific concepts and their aliases and relations is essential for knowledge acquisition and software development. Existing approaches use linguistic heuristics or term-frequency-based statistics to identify domain-specific terms from software documentation, so their accuracy is often low. In this paper, we propose a learning-based approach for automatic construction of a domain glossary from source code and software documentation. The approach uses a set of high-quality seed terms identified from code identifiers and natural language concept definitions to train a domain-specific prediction model that recognizes glossary terms based on the lexical and semantic context of the sentences mentioning domain-specific concepts. It then merges the aliases of the same concept under its canonical name, selects a set of explanation sentences for each concept, and identifies "is a", "has a", and "related to" relations between the concepts. We apply our approach to the deep learning domain and the Hadoop domain and harvest 5,382 and 2,069 concepts together with 16,962 and 6,815 relations, respectively. Our evaluation validates the accuracy of the extracted domain glossary and its usefulness for the fusion and acquisition of knowledge from different documents of different projects.
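
The shape of the glossary described above (concepts with canonical names, merged aliases, explanation sentences, and typed relations) can be illustrated with a small data-structure sketch. This is not the authors' implementation: the class names `Concept` and `Glossary`, the case-insensitive alias lookup, and the example concepts are all assumptions made for illustration. Only the three relation types come from the abstract.

```python
from dataclasses import dataclass, field

# The three relation kinds named in the abstract.
RELATION_TYPES = {"is a", "has a", "related to"}


@dataclass
class Concept:
    """Hypothetical glossary entry: canonical name, aliases, explanations."""
    canonical: str
    aliases: set = field(default_factory=set)
    explanations: list = field(default_factory=list)


class Glossary:
    """Hypothetical container; the lookup strategy is an assumption."""

    def __init__(self):
        self.concepts = {}      # canonical name -> Concept
        self.alias_index = {}   # lower-cased surface form -> canonical name
        self.relations = set()  # (source, relation type, target) triples

    def add_concept(self, canonical, aliases=()):
        concept = self.concepts.setdefault(canonical, Concept(canonical))
        for form in (canonical, *aliases):
            # Every surface form, including the canonical one, resolves
            # to the same canonical name.
            self.alias_index[form.lower()] = canonical
            if form != canonical:
                concept.aliases.add(form)
        return concept

    def resolve(self, term):
        """Return the canonical name for a term, or None if unknown."""
        return self.alias_index.get(term.lower())

    def relate(self, source, relation, target):
        if relation not in RELATION_TYPES:
            raise ValueError(f"unsupported relation type: {relation!r}")
        src, dst = self.resolve(source), self.resolve(target)
        if src is None or dst is None:
            raise KeyError("both concepts must already be in the glossary")
        self.relations.add((src, relation, dst))


# Illustrative entries only; they are not taken from the paper's results.
glossary = Glossary()
glossary.add_concept("recurrent neural network", aliases=["RNN"])
glossary.add_concept("long short-term memory", aliases=["LSTM"])
glossary.relate("LSTM", "is a", "RNN")
```

Storing relations against canonical names rather than surface forms means that a relation found through one alias is shared by every other alias of the same concept, which is the practical effect of the alias-merging step.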

ESEC/FSE ’19, August 26–30, 2019, Tallinn, Estonia. Chong Wang, Xin Peng, Mingwei Liu, Zhenchang Xing, Xuefang Bai, Bing Xie, and Tuo Wang.

        Published in

        ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
        August 2019
        1264 pages
        ISBN: 9781450355728
        DOI: 10.1145/3338906

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 112 of 543 submissions, 21%
