DOI: 10.1145/3239235.3267435
short-paper

Measuring LDA topic stability from clusters of replicated runs

Published: 11 October 2018

ABSTRACT

Background: The amount of unstructured textual data is increasing rapidly, and Latent Dirichlet Allocation (LDA) topic modeling is a popular method for analyzing it. Past work suggests that the instability of LDA topics may lead to systematic errors. Aim: We propose a method that replicates LDA runs, clusters the resulting topics, and provides a stability metric for the topics. Method: We generate k LDA topics and replicate this process n times, resulting in n*k topics. We then use K-medoids to cluster the n*k topics into k clusters. The k clusters now represent the original LDA topics, and we present them like normal LDA topics by showing the ten most probable words. For the clusters, we try multiple stability metrics, of which we recommend Rank-Biased Overlap, to show the stability of the topics inside the clusters. Results: We provide an initial validation in which our method is applied to 270,000 Mozilla Firefox commit messages with k=20 and n=20. We show how our topic stability metrics are related to the contents of the topics. Conclusions: Advances in text mining enable us to analyze large masses of text in software engineering, but non-deterministic algorithms, such as LDA, may lead to unreplicable conclusions. Our approach makes LDA stability transparent and is complementary to, rather than an alternative to, many prior works that focus on LDA parameter tuning.
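
The clustering and stability steps described in the Method can be illustrated with a minimal sketch. It is not the authors' implementation: toy top-word lists stand in for real LDA output, the distance 1 - RBO, the extrapolated RBO variant with p = 0.9, the simple alternating K-medoids loop seeded with the first run's topics, and the use of mean pairwise RBO as the per-cluster stability score are all assumptions made for illustration. The names `rbo_ext`, `k_medoids`, and `cluster_stability` are hypothetical.

```python
from itertools import combinations


def rbo_ext(a, b, p=0.9):
    """Extrapolated Rank-Biased Overlap of two equal-length, duplicate-free rankings."""
    if len(a) != len(b) or not a:
        raise ValueError("rankings must be non-empty and of equal length")
    seen_a, seen_b = set(), set()
    overlap = 0      # size of the intersection of the two depth-d prefixes
    weighted = 0.0   # running sum of (overlap / d) * p**d
    for d, (x, y) in enumerate(zip(a, b), start=1):
        if x == y:
            overlap += 1
        else:
            overlap += (x in seen_b) + (y in seen_a)
        seen_a.add(x)
        seen_b.add(y)
        weighted += overlap / d * p ** d
    k = len(a)
    return overlap / k * p ** k + (1 - p) / p * weighted


def k_medoids(items, k, dist, max_iter=100):
    """Alternating K-medoids; assumption: the first k items seed the medoids."""
    n = len(items)
    d = [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]
    medoids = list(range(k))
    for _ in range(max_iter):
        # Assign every item to its nearest medoid.
        labels = [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the others.
            updated.append(min(members, key=lambda m: sum(d[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, labels


def cluster_stability(cluster_topics, p=0.9):
    """Mean pairwise RBO inside one cluster (assumed stability score)."""
    pairs = list(combinations(cluster_topics, 2))
    if not pairs:
        return 1.0
    return sum(rbo_ext(a, b, p) for a, b in pairs) / len(pairs)


# Toy stand-in for n = 3 replicated runs with k = 2 topics each (top-4 words).
runs = [
    [["bug", "fix", "crash", "test"], ["build", "merge", "update", "file"]],
    [["fix", "bug", "crash", "test"], ["build", "merge", "update", "file"]],
    [["bug", "fix", "test", "crash"], ["merge", "build", "file", "update"]],
]
topics = [topic for run in runs for topic in run]  # the n*k topics

medoids, labels = k_medoids(topics, k=2, dist=lambda a, b: 1.0 - rbo_ext(a, b))
stability = {
    c: cluster_stability([t for t, lab in zip(topics, labels) if lab == c])
    for c in range(2)
}
for c in range(2):
    print(c, topics[medoids[c]], round(stability[c], 3))
```

With real data, each inner list would hold the most probable words of one topic from one LDA run, and the medoid of each cluster would be the topic presented to the reader alongside its stability score.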


  • Published in
    ESEM '18: Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement
    October 2018
    487 pages
    ISBN: 9781450358231
    DOI: 10.1145/3239235

    Copyright © 2018 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • short-paper

    Acceptance Rates

    Overall Acceptance Rate: 130 of 594 submissions, 22%
