short-paper

Measuring LDA topic stability from clusters of replicated runs

Authors:
Mika V. Mantyla, University of Oulu, Finland
Maelick Claes, University of Oulu, Finland
Umar Farooq, University of Oulu, Finland
ESEM '18: Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, October 2018, Article No.: 49, Pages 1–4, https://doi.org/10.1145/3239235.3267435

Published: 11 October 2018

ABSTRACT

Background: The volume of unstructured textual data is growing rapidly, and Latent Dirichlet Allocation (LDA) topic modeling is a popular method for analyzing it. Past work suggests that the instability of LDA topics may lead to systematic errors. Aim: We propose a method that combines replicated LDA runs with clustering and provides a stability metric for the resulting topics. Method: We generate k LDA topics and replicate this process n times, resulting in n*k topics. We then use K-medoids to cluster the n*k topics into k clusters. The k clusters now represent the original LDA topics, and we present them like normal LDA topics by showing their ten most probable words. For the clusters, we try multiple stability metrics, out of which we recommend Rank-Biased Overlap, which shows the stability of the topics inside each cluster. Results: We provide an initial validation in which our method is applied to 270,000 Mozilla Firefox commit messages with k=20 and n=20. We show how our topic stability metrics relate to the contents of the topics. Conclusions: Advances in text mining enable us to analyze large masses of text in software engineering, but non-deterministic algorithms such as LDA may lead to non-replicable conclusions. Our approach makes LDA stability transparent and complements, rather than replaces, the many prior works that focus on LDA parameter tuning.
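The clustering and stability steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each replicated run has already produced its topics as ranked word lists, and it makes several choices the abstract does not specify: the clustering distance is taken as 1 minus the extrapolated Rank-Biased Overlap, the K-medoids variant is a simple alternating scheme with farthest-first initialisation, the stability score is the mean pairwise overlap inside a cluster, and the toy word lists are invented. The names `rbo_ext`, `k_medoids`, and `cluster_stability` are hypothetical helpers.

```python
from itertools import combinations


def rbo_ext(s, t, p=0.9):
    """Extrapolated Rank-Biased Overlap of two duplicate-free rankings,
    evaluated down to the depth of the shorter list."""
    depth = min(len(s), len(t))
    if depth == 0:
        return 0.0
    seen_s, seen_t = set(), set()
    weighted_sum, agreement = 0.0, 0.0
    for d in range(1, depth + 1):
        seen_s.add(s[d - 1])
        seen_t.add(t[d - 1])
        agreement = len(seen_s & seen_t) / d  # overlap of the two top-d prefixes
        weighted_sum += agreement * p ** d
    return agreement * p ** depth + (1 - p) / p * weighted_sum


def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Simple alternating K-medoids on a precomputed distance matrix."""
    n = len(dist)
    medoids = [0]  # farthest-first initialisation (an illustrative assumption)
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            updated.append(min(members,
                               key=lambda i: sum(dist[i][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)


def cluster_stability(topics, labels, k, p=0.9):
    """Mean pairwise RBO per cluster; None for clusters with under two topics."""
    scores = []
    for c in range(k):
        members = [t for t, lab in zip(topics, labels) if lab == c]
        pairs = list(combinations(members, 2))
        scores.append(sum(rbo_ext(a, b, p) for a, b in pairs) / len(pairs)
                      if pairs else None)
    return scores


# Toy input: n = 3 replicated runs with k = 2 topics each (invented words).
topics = [
    ["bug", "fix", "crash", "test"], ["build", "make", "config", "file"],
    ["fix", "bug", "crash", "test"], ["build", "config", "make", "file"],
    ["bug", "fix", "test", "crash"], ["make", "build", "file", "config"],
]
k = 2
dist = [[1.0 - rbo_ext(a, b) for b in topics] for a in topics]
medoids, labels = k_medoids(dist, k)
scores = cluster_stability(topics, labels, k)
for c in range(k):
    print("cluster", c, "medoid topic:", topics[medoids[c]], "stability:", scores[c])
```

In this sketch the medoid of each cluster plays the role of the representative topic, and a low score flags a cluster whose member topics rank their words differently across runs. With real data the ranked lists would come from the replicated LDA models, and the depth of the lists and the persistence parameter `p` would need to be chosen deliberately.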

Published in
ESEM '18: Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement
October 2018
487 pages
ISBN:9781450358231
DOI:10.1145/3239235
General Chair: Markku Oivo, University of Oulu, Finland
Program Chairs: Daniel Méndez, Technical University of Munich, Germany; Audris Mockus, University of Tennessee
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clustering
commit messages
latent dirichlet allocation
rank-biased overlap
replication
similarity
stability
Qualifiers
- short-paper
Acceptance Rates
Overall Acceptance Rate: 130 of 594 submissions, 22%