ABSTRACT
Background: Unstructured and textual data is growing rapidly, and Latent Dirichlet Allocation (LDA) topic modeling is a popular method for analyzing it. Past work suggests that the instability of LDA topics may lead to systematic errors. Aim: We propose a method that relies on replicated LDA runs, clusters the resulting topics, and provides a stability metric for each topic. Method: We generate k LDA topics and replicate this process n times, resulting in n*k topics. We then use K-medoids to cluster the n*k topics into k clusters. The k clusters now represent the original LDA topics, and we present them like normal LDA topics by showing their ten most probable words. For the clusters, we try multiple stability metrics, out of which we recommend Rank-Biased Overlap, which shows the stability of the topics inside each cluster. Results: We provide an initial validation in which our method is applied to 270,000 Mozilla Firefox commit messages with k=20 and n=20. We show how our topic stability metrics are related to the contents of the topics. Conclusions: Advances in text mining enable us to analyze large masses of text in software engineering, but non-deterministic algorithms, such as LDA, may lead to unreplicable conclusions. Our approach makes LDA stability transparent and complements, rather than replaces, the many prior works that focus on LDA parameter tuning.
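To make the stability metric concrete, the following is a minimal sketch of how Rank-Biased Overlap (RBO) could score one cluster of topics, where each topic is represented by its ranked list of most probable words. It is an illustration under stated assumptions, not the authors' implementation: the function names `rbo_truncated` and `cluster_stability`, the persistence parameter `p=0.9`, the rescaling that makes identical lists score 1.0, the use of the mean pairwise score, and the example word lists are all hypothetical. The replicated LDA runs and the K-medoids clustering are assumed to have been done already.

```python
from itertools import combinations


def rbo_truncated(a, b, p=0.9):
    """Rank-Biased Overlap of two ranked lists, truncated at the shorter
    list's length. Assumes 0 < p < 1 and no repeated items within a list.

    The division by (1 - p**depth) is an assumption of this sketch: it
    rescales the truncated sum so that identical lists score exactly 1.0.
    """
    depth = min(len(a), len(b))
    if depth == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    weighted_agreement = 0.0
    for d in range(1, depth + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        # Fraction of items shared by the two top-d prefixes.
        agreement = len(seen_a & seen_b) / d
        # Top ranks get geometrically larger weights than lower ranks.
        weighted_agreement += p ** (d - 1) * agreement
    return weighted_agreement * (1 - p) / (1 - p ** depth)


def cluster_stability(topics, p=0.9):
    """Mean pairwise RBO over all topics in one cluster (an assumed
    way of aggregating; a single-topic cluster is scored 1.0)."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 1.0
    return sum(rbo_truncated(a, b, p) for a, b in pairs) / len(pairs)


# Invented top-word lists standing in for topics from replicated runs.
stable_cluster = [
    ["bug", "fix", "crash", "test", "build"],
    ["bug", "fix", "test", "crash", "build"],
    ["fix", "bug", "crash", "test", "build"],
]
unstable_cluster = [
    ["bug", "fix", "crash", "test", "build"],
    ["merge", "backout", "branch", "tree", "sync"],
    ["css", "layout", "frame", "bug", "paint"],
]

print("stable cluster:  ", round(cluster_stability(stable_cluster), 3))
print("unstable cluster:", round(cluster_stability(unstable_cluster), 3))
```

A cluster whose topics agree on their top-ranked words scores close to 1, while a cluster that mixes unrelated word lists scores close to 0. Because RBO weights the top ranks most heavily, disagreement among the most probable words lowers the score more than disagreement further down the list.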