skip to main content
10.1145/1321631.1321709acmconferencesArticle/Chapter ViewAbstractPublication PagesaseConference Proceedingsconference-collections
poster

Mining concepts from code with probabilistic topic models

Published:05 November 2007Publication History

ABSTRACT

We develop and apply statistical topic models to software as a means of extracting concepts from source code. The effectiveness of the technique is demonstrated on 1,555 projects from SourceForge and Apache consisting of 113,000 files and 19 million lines of code. In addition to providing an automated, unsupervised, solution to the problem of summarizing program functionality, the approach provides a probabilistic framework with which to analyze and visualize source file similarity. Finally, we introduce an information-theoretic approach for computing tangling and scattering of extracted concepts, and present preliminary results

References

  1. S. Bajracharya, T. Ngo, E. Linstead, Y. Dou, P. Rigor, P. Baldi, and C. Lopes. Sourcerer: a search engine for open source code supporting structure-based search. In OOPSLA Companion, pages 681--682, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, January 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.Google ScholarGoogle ScholarCross RefCross Ref
  4. G. Kiczales, J. Lamping, A. Menhdhekar, C. Maeda, C. Lopes, J. Loingtier, and J. Irwin. Aspect-oriented programming. In M. Akşit and S. Matsuoka, editors, Proceedings European Conference on Object-Oriented Programming, volume 1241, pages 220--242. Springer-Verlag, Berlin, Heidelberg, and New York, 1997.Google ScholarGoogle ScholarCross RefCross Ref
  5. A. Kuhn, S. Ducasse, and T. Girba. Semantic clustering: Identifying topics in source code. Information and Software Technology (to appear), 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. E. Linstead, P. Rigor, S. Bajracharya, C. Lopes, and P. Baldi. Mining eclipse developer contributions via author-topic models. MSR 2007: Proceedings of the Fourth International Workshop on Mining Software Repositories, 0:30, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Marcus, A. Sergeyev, V. Rajlich, and J. Maletic. An information retrieval approach to concept location in source code. In Proceedings of the 11th Working Conference on Reverse Engineering (WCRE 2004), pages 214--223, Nov. 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. Newman and S. Block. Probabilistic topic decomposition of an eighteenth-century american newspaper. J. Am. Soc. Inf. Sci. Technol., 57(6):753--767, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Ugurel, R. Krovetz, and C. L. Giles. What's the code?: automatic classification of source code archives. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 632--638, New York, NY, USA, 2002. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Mining concepts from code with probabilistic topic models

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      ASE '07: Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering
      November 2007
      590 pages
      ISBN:9781595938824
      DOI:10.1145/1321631

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 5 November 2007

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • poster

      Acceptance Rates

      Overall Acceptance Rate82of337submissions,24%

      Upcoming Conference

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader