ABSTRACT
We develop and apply statistical topic models to software as a means of extracting concepts from source code. The effectiveness of the technique is demonstrated on 1,555 projects from SourceForge and Apache consisting of 113,000 files and 19 million lines of code. In addition to providing an automated, unsupervised, solution to the problem of summarizing program functionality, the approach provides a probabilistic framework with which to analyze and visualize source file similarity. Finally, we introduce an information-theoretic approach for computing tangling and scattering of extracted concepts, and present preliminary results
- S. Bajracharya, T. Ngo, E. Linstead, Y. Dou, P. Rigor, P. Baldi, and C. Lopes. Sourcerer: a search engine for open source code supporting structure-based search. In OOPSLA Companion, pages 681--682, 2006. Google ScholarDigital Library
- D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, January 2003. Google ScholarDigital Library
- S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.Google ScholarCross Ref
- G. Kiczales, J. Lamping, A. Menhdhekar, C. Maeda, C. Lopes, J. Loingtier, and J. Irwin. Aspect-oriented programming. In M. Akşit and S. Matsuoka, editors, Proceedings European Conference on Object-Oriented Programming, volume 1241, pages 220--242. Springer-Verlag, Berlin, Heidelberg, and New York, 1997.Google ScholarCross Ref
- A. Kuhn, S. Ducasse, and T. Girba. Semantic clustering: Identifying topics in source code. Information and Software Technology (to appear), 2006. Google ScholarDigital Library
- E. Linstead, P. Rigor, S. Bajracharya, C. Lopes, and P. Baldi. Mining eclipse developer contributions via author-topic models. MSR 2007: Proceedings of the Fourth International Workshop on Mining Software Repositories, 0:30, 2007. Google ScholarDigital Library
- A. Marcus, A. Sergeyev, V. Rajlich, and J. Maletic. An information retrieval approach to concept location in source code. In Proceedings of the 11th Working Conference on Reverse Engineering (WCRE 2004), pages 214--223, Nov. 2004. Google ScholarDigital Library
- D. Newman and S. Block. Probabilistic topic decomposition of an eighteenth-century american newspaper. J. Am. Soc. Inf. Sci. Technol., 57(6):753--767, 2006. Google ScholarDigital Library
- S. Ugurel, R. Krovetz, and C. L. Giles. What's the code?: automatic classification of source code archives. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 632--638, New York, NY, USA, 2002. ACM Press. Google ScholarDigital Library
Index Terms
- Mining concepts from code with probabilistic topic models
Recommendations
Topic sentiment mixture: modeling facets and opinions in weblogs
WWW '07: Proceedings of the 16th international conference on World Wide WebIn this paper, we define the problem of topic-sentiment analysis on Weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. The proposed Topic-Sentiment Mixture (TSM) model can reveal the latent ...
Modeling online reviews with multi-grain topic models
WWW '08: Proceedings of the 17th international conference on World Wide WebIn this paper we present a novel framework for extracting the ratable aspects of objects from online user reviews. Extracting such aspects is an important challenge in automatically mining product opinions from the web and in generating opinion-based ...
Topic model tutorial: A basic introduction on latent dirichlet allocation and extensions for web scientists
WebSci '16: Proceedings of the 8th ACM Conference on Web ScienceIn this tutorial, we teach the intuition and the assumptions behind topic models. Topic models explain the co-occurrences of words in documents by extracting sets of semantically related words, called topics. These topics are semantically coherent and ...
Comments