poster

Mining concepts from code with probabilistic topic models

Authors: Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, Pierre Baldi

Affiliation (all authors): University of California, Irvine, Irvine, CA

ASE '07: Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, November 2007, Pages 461–464
https://doi.org/10.1145/1321631.1321709

Published: 05 November 2007

ABSTRACT

We develop and apply statistical topic models to software as a means of extracting concepts from source code. The effectiveness of the technique is demonstrated on 1,555 projects from SourceForge and Apache consisting of 113,000 files and 19 million lines of code. In addition to providing an automated, unsupervised solution to the problem of summarizing program functionality, the approach provides a probabilistic framework with which to analyze and visualize source file similarity. Finally, we introduce an information-theoretic approach for computing tangling and scattering of extracted concepts, and present preliminary results.
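
The abstract does not spell out how tangling and scattering are computed. One plausible reading, offered here only as a sketch under stated assumptions, is to take Shannon entropy over a file-by-topic weight matrix: the scattering of a topic is the entropy of its mass across files, and the tangling of a file is the entropy of its mass across topics. The matrix values, the function names, and the choice of base-2 logarithms below are all illustrative assumptions, not details taken from the paper.

```python
import math

def normalize(weights):
    """Scale non-negative weights to sum to 1 (all-zero input stays all-zero)."""
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else [0.0] * len(weights)

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def scattering(file_topic, topic):
    """Entropy of one topic's mass across files (higher = spread over more files)."""
    column = [row[topic] for row in file_topic]
    return entropy(normalize(column))

def tangling(file_topic, file_index):
    """Entropy of one file's mass across topics (higher = more concepts mixed)."""
    return entropy(normalize(file_topic[file_index]))

# Hypothetical file-by-topic weights (rows: files, columns: topics),
# standing in for the output of a fitted topic model.
file_topic = [
    [1.0, 0.0, 0.0],  # file 0 is entirely about topic 0
    [0.0, 0.5, 0.5],  # files 1 and 2 each mix topics 1 and 2 equally
    [0.0, 0.5, 0.5],
]

for t in range(3):
    print(f"topic {t} scattering: {scattering(file_topic, t):.3f} bits")
for f in range(3):
    print(f"file {f} tangling: {tangling(file_topic, f):.3f} bits")
```

In this toy matrix, topic 0 lives in a single file and so has zero scattering, while topics 1 and 2 are split evenly over two files and score one bit each. A real analysis would substitute the document-topic estimates produced by the topic model for the hand-written matrix.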

Index Terms

Mining concepts from code with probabilistic topic models
1. Computing methodologies
  1. Artificial intelligence

Published in
ASE '07: Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering
November 2007
590 pages
ISBN: 9781595938824
DOI: 10.1145/1321631
General Chair: Kurt Stirewalt (Michigan State University, USA)
Program Chairs: Alexander Egyed (Teknowledge Corporation, USA), Bernd Fischer (University of Southampton, UK)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
mining software
program understanding
topic models
Qualifiers
- poster
Acceptance Rates
Overall Acceptance Rate: 82 of 337 submissions, 24%
Article Metrics
- Total Citations: 66
- Total Downloads: 1,072
- Downloads (Last 12 months): 20
- Downloads (Last 6 weeks): 1