1 Introduction
-
RQ1. Which topic modeling techniques have been used and for what purpose? There are different topic modeling techniques (see Section 2), each with their own limitations and constraints (Chen et al. 2016). This RQ aims at understanding which topic modeling techniques have been used (e.g., LDA, LSI) and for what purpose studies applied such techniques (e.g., to support software maintenance tasks). Furthermore, we analyze the types of contributions in studies that used topic modeling (e.g., a new approach as a solution proposal, or an exploratory study).
-
RQ2. What are the inputs into topic modeling? Topic modeling techniques accept different types of textual documents and require the configuration of parameters (see Section 2.1). Carefully choosing parameters (such as the number of topics to be generated) is essential for obtaining valuable and reliable topics (Agrawal et al. 2018; Treude and Wagner 2019). This RQ aims at analysing types of textual data (e.g., source code), actual documents (e.g., a Java class or an individual Java method) and configured parameters used for topic modeling to address software engineering problems.
-
RQ3: How are data pre-processed for topic modeling? Topic modeling requires that the analyzed text is pre-processed (e.g., by removing stop words) to improve the quality of the produced output (Aggarwal and Zhai 2012; Bi et al. 2018). This RQ aims at analysing how previous studies pre-processed textual data for topic modeling, including the steps for cleaning and transforming text. This will help us understand if there are specific pre-processing steps for a certain topic modeling technique or types of textual data.
-
RQ4. How are generated topics named? This RQ aims at analyzing if and how topics (word clusters) were named in studies. Giving meaningful names to topics may be difficult but may be required to help humans comprehend topics. For example, naming topics can provide a high-level view of topics discussed by developers in Stack Overflow (a Q&A website) (Barua et al. 2014) or by end users of mobile apps in tweets (Mezouar et al. 2018). Analysts (e.g., developers interested in what topics are discussed on Stack Overflow or in app reviews) can then look at the name of the topic (i.e., its “label”) rather than the cluster of words. These labels or names must capture the overarching meaning of all words in a topic. We describe different approaches to naming topics generated by a topic model, such as manual or automated labeling of clusters with names based on the most frequent words of a topic (Hindle et al. 2013).
2 Topic Modeling
-
Word w: a string of one or more alphanumeric characters (e.g., “software” or “management”);
-
Document d: a set of n words (e.g., a text snippet with five words: w1 to w5);
-
Corpus C: a set of t documents (e.g., nine text snippets: d1 to d9);
-
Vocabulary V: a set of m unique words that appear in a corpus (e.g., m = 80 unique words across nine documents);
-
Term-document matrix A: an m by t matrix whose Ai,j entry is the weight (according to some weighting function, such as term-frequency) of word wi in document dj. For example, given a matrix A with three words and three documents, the entry A1,1 = 5 indicates that “code” appears five times in d1, etc.;
-
Topic z: a collection of terms that co-occur frequently in the documents of a corpus. Considering probabilistic topic models (e.g., LDA), z refers to an m-length vector of probabilities over the vocabulary of a corpus. For example, in a vector z1 = (code: 0.35; test: 0.17; bug: 0.08), the value 0.35 indicates that when a word is picked from topic z1, there is a 35% chance of drawing the word “code”, etc.;
-
Topic-term matrix ϕ (or T): a k by m matrix with k as the number of topics and ϕi,j the probability of word wj in topic zi. Row i of ϕ corresponds to zi. For example, given a matrix ϕ, a value of 0.005 in the first column indicates that the word “code” appears with a probability of 0.5% in topic z3, etc.;
-
Topic membership vector 𝜃d: for document di, a k-length vector of probabilities of the k topics. For example, given a vector \(\theta _{d_{i}} = (z_{1}: 0.25; z_{2}: 0.10; z_{3}: 0.08)\), the value 0.25 indicates that there is a 25% chance of selecting topic z1 in di;
-
Document-topic matrix 𝜃 (or D): a t by k matrix with 𝜃i,j as the probability of topic zj in document di. Row i of 𝜃 corresponds to \(\theta _{d_{i}}\). For example, given a matrix 𝜃, the value 0.10 in the first column indicates that document d2 contains topic z1 with a probability of 10%, etc.
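To make these definitions concrete, the following is a minimal sketch (assuming scikit-learn is available; the corpus and the resulting values are illustrative, not those from the examples above) that derives A, ϕ and 𝜃 from a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [                      # corpus C with t = 4 documents
    "code test bug fix",
    "bug report test case",
    "user interface design",
    "interface design code",
]

vectorizer = CountVectorizer()
A = vectorizer.fit_transform(corpus)   # term counts; scikit-learn stores documents x terms,
                                       # i.e., the transpose of the m x t matrix A above
k = 2                                  # number of topics
lda = LatentDirichletAllocation(n_components=k, random_state=0)
theta = lda.fit_transform(A)           # document-topic matrix (t x k); each row sums to 1
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # topic-term matrix (k x m)
```

Each row of `phi` is a topic z, i.e., a probability distribution over the m vocabulary words, and each row of `theta` is a topic membership vector 𝜃d.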
2.1 Data Input
2.2 Modeling
2.3 Output
3 Related Work
3.1 Previous Literature Reviews
| (Sun et al. 2016) | (Chen et al. 2016) | This study |
---|---|---|---|
Reviewed time range | 2003-2015 | 1999-2014 | 2009-2020 |
Search venues | 4 journals, 9 conferences | 6 journals, 9 conferences | 5 journals, 5 conferences |
Papers analysed | 38 | 167 | 111 |
Analysed data items | |||
Topic modeling technique | ✓ | ✓ | ✓ |
Supported tasks | Specific (e.g., feature localization) | Specific and high-level (e.g., feature localization (specific) under concept localization (high-level)) | High-level (e.g., documentation) |
Type of contribution | – | – | ✓ |
Tools used | – | ✓ | – |
Types of data and documents | – | – | ✓ |
Parameters used | – | Number of topics | Number of topics Hyperparameters |
Data pre-processing | – | General analysis | Detailed analysis |
Topic naming | – | – | ✓ |
Evaluation of topic models | – | ✓ | – |
3.2 Meta-studies on Topic Modeling
4 Research Method
4.1 Search Procedure
4.2 Study Selection Criteria
4.3 Data Extraction and Synthesis
-
RQ1: Regarding the data item “Technique”, we identified the topic modeling techniques applied in papers. For the data item “Supported tasks”, we assigned to each paper one software engineering task. Tasks emerged during the analysis of papers (see more details in Section 5.2.2). We also identified the general study outcome in relation to its goal (data item “Type of contribution”). When analyzing the type of contribution, we also checked whether papers included a comparison of topic modeling techniques (e.g., to select the best technique to be included in a newly proposed approach). Based on these data items we checked which techniques were the most popular, whether techniques were based on other techniques or used together, and for what purpose topic modeling was used.
-
RQ2: We identified types of data (data item “Type of data”) in selected papers as listed in Section 5.3.1. Considering that some papers addressed one, two or three different types of data, we counted the frequency of types of data and related them to the corresponding documents. Regarding “Document”, we identified the textual document and (if reported in the paper) its length. For the data item “Parameters”, we identified whether papers described modeling parameters and if so, which values were assigned to them.
-
RQ3: Considering that some papers may have not mentioned any pre-processing, we first checked which papers described data pre-processing. Then, we listed all pre-processing steps found and counted their frequencies.
-
RQ4: Considering the papers that described topic naming, we analyzed how generated topics were named (see Section 5.5). We used three types of approaches to describe how topics were named: (a) Manual - manual analysis and labeling of topics; (b) Automated - automated approaches to assign names to topics; and (c) Manual & Automated - a mix of manual and automated approaches to analyse and name topics. We also described the procedures performed to name topics.
Item | Description | RQ |
---|---|---|
Year | Publication year | n/a |
Author(s) | List of all authors | n/a |
Title | Title of paper | n/a |
Venue | Publication venue | n/a |
Technique | Topic modeling technique used | RQ1 |
Supported tasks | Development tasks supported by topic modeling (e.g., to predict defects) | RQ1 |
Type of contribution | General outcome of study (e.g., a new approach or an empirical exploration) | RQ1 |
Type of data | Type of data used for topic modeling (e.g., source code and commit messages) | RQ2 |
Document | Documents in corpus, i.e., “instances” of type of data (e.g., Java methods) | RQ2 |
Parameters | Topic modeling parameters and their values (e.g., number of topics) | RQ2 |
Pre-processing | Pre-processing of textual data (e.g., tokenization and stop words removal) | RQ3 |
Topic naming | How topics were named (e.g., manual labeling by domain experts) | RQ4 |
5 Results
5.1 Overview
Venue | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ASE | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
EMSE | 2 | 0 | 1 | 1 | 3 | 5 | 2 | 3 | 4 | 4 | 4 | 3 | 32 |
ESEC FSE | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 1 | 1 | 1 | 7 |
ESEM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 5 |
ICSE | 0 | 1 | 0 | 1 | 2 | 2 | 0 | 1 | 3 | 1 | 1 | 1 | 13 |
IST | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 4 | 3 | 2 | 3 | 2 | 17 |
JSS | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 4 | 2 | 3 | 0 | 12 |
MSR | 1 | 0 | 2 | 0 | 2 | 2 | 2 | 2 | 0 | 1 | 1 | 3 | 16 |
TOSEM | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 3 |
TSE | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 4 |
Total | 3 | 2 | 4 | 3 | 9 | 12 | 7 | 15 | 16 | 15 | 14 | 11 | 111 |
5.2 RQ1: Topic Models Used
5.2.1 Topic Modeling Techniques
Technique | Comparison to LDA | Proposed by | Papers |
---|---|---|---|
Labeled LDA (LLDA) | Supervised variant of LDA that constrains topics to a set of pre-defined labels | (Ramage et al. 2009) | |
Label-to-Hierarchy model (L2H) | Builds concept hierarchy from a set of documents, where each document contains multiple labels; learns from label co-occurrence and word usage to discover a hierarchy of topics associated with user-generated labels | (Nguyen et al. 2014) | (Chen et al. 2019) |
Semi-supervised LDA | Uses samples of labeled documents to train model; relies on similarity between the unclassified documents and the labeled documents | (Fu et al. 2015) | (Fu et al. 2015) |
Twitter-LDA | Short-text topic modeling for tweets; considers each tweet as a document that contains a single topic | (Zhao et al. 2011) | (Hu et al. 2019) |
BugScout-LDA | Uses two implementations of LDA (one implementation to model topics from source code and another one to model topics in bug reports) to recommend a short list of candidate buggy files for a given bug report | (Nguyen et al. 2011) | (Nguyen et al. 2011) |
O-LDA | Method for feature location that applies strategies for filtering data used as input to LDA and strategies for filtering the output (words in topics to describe domain knowledge) | (Liu et al. 2017) | (Liu et al. 2017) |
DAT-LDA | Extended LDA to infer topic probability distributions from multiple data sources (Mashup description text, Web APIs and tags) to support Mashup service discovery | (Cao et al. 2017) | (Cao et al. 2017) |
LDA-GA | Determines the near-optimal configuration for LDA using genetic algorithms | (Panichella et al. 2013) | |
Aspect and Sentiment Unification Model (ASUM) | Finds topics in textual data, reflecting both aspect (i.e., a word that expresses a feeling, e.g., “disappointed”) and sentiment (i.e., a word that conveys sentiment, e.g., “positive” or “negative”) | (Jo and Oh 2011) | |
Citation Influence Topic Model (CITM) | Determines the citation influences of a citing paper in a document network based on two corpora: (a) incoming links of publications (cited papers), and (b) outgoing links of publications (citing papers); a paper can select words from topics of its own topics or from topics found in cited papers | (Dietz et al. 2007) | (Hu and Wong 2013) |
Collaborative Topic Modeling (CTM) | Creates recommendations for users based on the topic modeling of two types of data: (a) libraries of users, and (b) content of publications; for each user, finds both old papers that are important to other similar users and newly written papers that are related to that user interests | (Wang and Blei 2011) | (Sun et al. 2017) |
Discriminative Probability Latent Semantic Analysis (DPLSA) | Supervised approach that recommends components for bug reports; receives assigned bug reports for training and generates a number of topics that is the same as the number of components | (Yan et al. 2016a) | |
Multi-feature Topic Model (MTM) | Supervised approach that considers features (product and component information) of bug reports; emphasizes occurrence of words in bug reports that have the same combination of product and component | (Xia et al. 2017b) | (Xia et al. 2017b) |
Relational Topic Model (RTM) | Defines probability distribution of topics among documents, but also derives semantic relationships between documents | (Chang and Blei 2009) | |
T-Model | Detects duplicate bug reports | (Nguyen et al. 2012) | (Nguyen et al. 2012) |
Temporal LDA | Extends LDA to model document streams considering a time window | (Damevski et al. 2018) | (Damevski et al. 2018) |
TopicSum | Estimates content distribution for summary extraction. Different to LDA, it generates a collection of document sets: background (background distribution over vocabulary words); content (significant content to be summarized); and docspecific (local words to a single document that do not appear across several documents) | (Haghighi and Vanderwende 2009) | (Fowkes et al. 2016) |
Adaptively Online LDA (AOLDA) | Adaptively combines the topics of previous versions of an app to generate topic distributions of current versions | (Gao et al. 2018) | (Gao et al. 2018) |
Hierarchical Dirichlet Process (HDP) | Implements a non-parametric Bayesian approach which iteratively groups words based on a probability distribution (i.e., the number of topics is not known a priori) | (Teh et al. 2006) | (Palomba et al. 2017) |
Maximum-likelihood Representation LDA (MLE-LDA) | Represents a vocabulary-dimensional probability vector directly by its first order distribution | (Rao and Kak 2011) | (Rao and Kak 2011) |
Query likelihood LDA (QL-LDA) | Combines Dirichlet smoothing (a technique to address overfitting) with LDA | (Wei and Croft 2006) | (Binkley et al. 2015) |
-
One paper (Rosenberg and Moonen 2018) compared LSI with two other dimensionality reduction techniques: Principal Component Analysis (PCA) (Wold et al. 1987) and Non-Negative Matrix Factorization (NMF) (Lee and Seung 1999). The authors applied these models to automatically group log messages of continuous deployment runs that failed for the same reasons.
-
Four papers applied LDA and LSI at the same time to compare the performance of these models to the Vector Space Model (VSM) (Salton et al. 1975), an algebraic model for information extraction. These studies supported documentation (De Lucia et al. 2014), bug handling (Thomas et al. 2013; Tantithamthavorn et al. 2018), and maintenance tasks (Abdellatif et al. 2019).
-
Regarding the other two papers, Binkley et al. (2015) compared LSI to Query likelihood LDA (QL-LDA) and other information extraction techniques to identify the best model for locating features in source code; and Liu et al. (2020) compared LSI and LDA to the Generative Vector Space Model (GVSM), a deep learning technique, to select the best-performing model for tracing documentation to source code in multilingual projects.
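As a rough illustration of what such comparisons involve, the sketch below (assuming scikit-learn; TruncatedSVD over tf-idf weights is a common stand-in for LSI, and the documents are invented) fits both models on the same corpus so that their document representations can be compared downstream:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = [
    "open file read error",
    "read write file code",
    "user login page error",
    "login page button click",
]
k = 2  # number of topics / latent dimensions

# LSI: truncated SVD over a tf-idf weighted term-document matrix
lsi_docs = TruncatedSVD(n_components=k, random_state=0).fit_transform(
    TfidfVectorizer().fit_transform(docs))

# LDA: probabilistic topics inferred from raw term counts
lda_docs = LatentDirichletAllocation(n_components=k, random_state=0).fit_transform(
    CountVectorizer().fit_transform(docs))
```

Both yield one k-dimensional vector per document; LDA rows are probability distributions, while LSI coordinates may be negative, which is one reason the two families are evaluated with different retrieval metrics.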
5.2.2 Supported Tasks
Supported task | LDA | LDA-based | LSI | LDA-based, (LDA or LSI) | LDA, LSI | Total |
---|---|---|---|---|---|---|
Architecting | – | – | 10 | |||
Bug handling | (Nguyen et al. 2012; Noei et al. 2019; Hindle et al. 2015; Le et al. 2017; Choetkiertikul et al. 2017; Zhang et al. 2016; Martin et al. 2015; Murali et al. 2017; Ahasanuzzaman et al. 2019; Nayebi et al. 2018; Lukins et al. 2010; Chen et al. 2017; Naguib et al. 2013; Zhao et al. 2020; Zhao et al. 2016; Zaman et al. 2011; Mezouar et al. 2018; Silva et al. 2016) | – | 33 | |||
Coding | (Fowkes et al. 2016) | – | – | – | 6 | |
Documentation | QL-LDA, LSI (Binkley et al. 2015) | 19 | ||||
Maintenance | – | (Abdellatif et al. 2019) | 12 | |||
Refactoring | (Canfora et al. 2014) | – | – | – | 3 | |
Requirements | (Jiang et al. 2019) | ASUM (Galvis Carreno and Winbladh 2012) | (Blasco et al. 2020) | – | (Ali et al. 2015) | 4 |
Testing | – | – | – | – | 3 | |
Exploratory studies | (Chatterjee et al. 2019; Bajaj et al. 2014; Layman et al. 2016; Bajracharya and Lopes 2009; Xia et al. 2017a; Pagano and Maalej 2013; Ye et al. 2017; Bajracharya and Lopes 2012; Bagherzadeh and Khatchadourian 2019; Ahmed and Bagherzadeh 2018; Barua et al. 2014; Rosen and Shihab 2016; Zou et al. 2017; Han et al. 2020; Abdellatif et al. 2020; Haque and Ali Babar 2020; Tiarks and Maalej 2014; El Zarif et al. 2020; Noei et al. 2018) | – | – | – | 21 |
-
Architecting: tasks related to architecture decision making, such as selection of cloud or mash-up services (e.g., Belle et al. (2016));
-
Bug handling: bug-related tasks, such as assigning bugs to developers, prediction of defects, finding duplicate bugs, or characterizing bugs (e.g., Naguib et al. (2013));
-
Coding: tasks related to coding, e.g., detection of similar functionalities in code, reuse of code artifacts, prediction of developer behaviour (e.g., Damevski et al. (2018));
-
Documentation: support software documentation, e.g., by localizing features in documentation or generating documentation automatically (e.g., Souza et al. (2019));
-
Maintenance: software maintenance-related activities, such as checking the consistency of software versions, or investigating changes to or the use of a system (e.g., Silva et al. (2019));
-
Refactoring: support refactoring, such as identifying refactoring opportunities and removing bad smells from source code (e.g., Bavota et al. (2014b));
-
Requirements: related to software requirements evolution or recommendation of new features (e.g., Galvis Carreno and Winbladh (2012));
-
Testing: related to identification or prioritization of test cases (e.g., Thomas et al. (2014)).
5.2.3 Types of Contribution
-
Approach: a study develops an approach (e.g., technique, tool, or framework) to support software engineering activities based on or with the support of topic models. For example, Murali et al. (2017) developed a framework that applies LDA to Android API methods to discover types of API usage errors, while Le et al. (2017) developed a technique (APRILE+) for bug localization which combines LDA with a classifier and an artificial neural network.
-
Exploration: a study applies topic modeling as the technique to analyze textual data collected in an empirical study (in contrast to for example open coding). Studies that contributed an exploration did not propose an approach as described in the previous item, but focused on getting insights from data. For example, Barua et al. (2014) applied LDA to Stack Overflow posts to discover what software engineering topics were frequently discussed by developers; Noei et al. (2018) explored the evolution of mobile applications by applying LDA to app descriptions, release notes, and user reviews.
-
Comparison: the study (which can also contribute an “Approach” or an “Exploration”) compares topic models to other approaches. For example, Xia et al. (2017b) compared their bug triaging approach (based on the so-called Multi-feature Topic Model (MTM)) with similar approaches that apply machine learning (Bugzie (Tamrawi et al. 2011)) and SVM-LDA (combining a classifier with LDA (Somasundaram and Murphy 2012)). On the other hand, De Lucia et al. (2014) compared LDA and LSI to define guidelines on how to build effective automatic text labeling techniques for program comprehension.
5.3 RQ2: Topic Model Inputs
5.3.1 Types of Data
Type of data | Description | Number of papers |
---|---|---|
“Lessons learned” as free text | Lessons learned from issues and risks of a software project (e.g., record of lessons learned from an issue of the OpenOffice project) | 1 |
URL content | Text of a URL (e.g., URLs in a Cloud service priority queue) | 1 |
Transcripts | Transcripts of audio or video recordings | 3 |
Developer documentation | Documentation used by developers (e.g., Web API documentation) | 4 |
Search query | Keywords in web search queries (e.g., “software development” used in Google search) | 4 |
Log information | Log events of a software, such as registries of updates in a code repository | 5 |
Commit messages | Comments of developers when committing changes to a code repository | 10 |
End user communication | App reviews of end users in app stores | 12 |
End user documentation | Apps and features descriptions, requirement documents, or API tutorials | 15 |
Issue/bug reports | Reports of bugs, change requests and/or issues of a software project | 22 |
Developer communication | Developer discussions such as Q&A websites, e-mails, and instant messaging | 20 |
Source code | Scripts, methods and classes of a software | 37 |
5.3.2 Documents
-
In general, papers described the length of documents in number of words, see Table 7.2 On the other hand, two papers (Moslehi et al. 2016, 2020) described their documents’ length in minutes of screencast transcriptions (videos of one to ten minutes, with no information about the size of the transcripts). Sixteen papers mentioned the actual length of the documents, see Table 7. Of these sixteen, ten described document length when describing the data used for topic modeling, four discussed it while describing results, and one mentioned it as a metric for comparing different data sources;
-
Most papers (80 out of 111) did not mention document length and did not acknowledge any limitations or the impact of document length on topics.
-
Fifteen papers did not mention the actual document length, but at some point acknowledged the influence of document length on topic modeling. For example, Abdellatif et al. (2019) mentioned that the documents in their data set were “not long”. Similarly, Yan et al. (2016b) did not mention the length of the bug reports used but discussed the impact of the vocabulary size of their corpus on results. Moslehi et al. (2018) mentioned document length as a limitation and acknowledged that using LDA on short documents was a threat to construct validity. According to these authors, using techniques specific to short documents could have improved the outcomes of their topic modeling.
Document | Length | Topic model | Hyperparameters | Number of topics | Papers |
---|---|---|---|---|---|
An individual commit message | 9 to 20 words | LDA | - | 10 | (Canfora et al. 2014) |
An individual blog post | 273 words average | LDA | - | 50 | (Pagano and Maalej 2013) |
An individual Q&A post | 500 words average | LDA | α = 50/k, β = 0.01 | 40 | (Barua et al. 2014) |
50 to 400 words | LLDA; L2H | α = 10, β = 1000 | 3 | (Chen et al. 2019) | |
An individual user review | 65 to 155 words | Twitter-LDA | - | 10 | (Hu et al. 2019) |
28 to 97 words | LDA | - | 85, 170 | (Nayebi et al. 2018) | |
An individual bug report | 404 words average | LDA | α = 50/k, β = 0.01 | 20 to 100 [steps of 10], 125 to 225 [steps of 25] | (Layman et al. 2016) |
127 words (Eclipse data) and 146 words (Mozilla data) average | LDA; LSI | - | 32, 64, 128, 256 | (Tantithamthavorn et al. 2018)∗ | |
A combination of log messages | 95 words (test data) and 221 words (validation data) average | LDA | α = 50/k, β = 0.1 | 9 | (Pettinato et al. 2019) |
An individual requirement document | 3,800 words average | LDA | α = 0.1, β = 0.1 | 20 | (Hindle et al. 2015) |
An individual fragment of API tutorials | 100 to 300 words | LDA | α = 0.1, β = 0.1 | - | (Jiang et al. 2017) |
A combination of tutorials of an app store | 3,231 words average | LDA | - | 20, 50 | (Tiarks and Maalej 2014) |
A combination of classes from a directory | 4,153 words in 922 documents (total) | LSI | - | - | (Tairas and Gray 2009) |
An individual method | 14 words (Eclipse data) and 35 words (Mozilla data) average | LDA; LSI | - | 32, 64, 128, 256 | (Tantithamthavorn et al. 2018)∗ |
An individual screencast transcript | 1 to 10 minutes | LDA | α = 50/k, β = 0.01 | 20; 55, 80, 130 | (Moslehi et al. 2016, 2020) |
5.3.3 Model Parameters
-
Eighteen of the 111 papers did not mention parameters (e.g., number of topics k, hyperparameters α and β). Thirteen of these papers used LDA or an LDA-based technique, four used LSI, and Liu et al. (2020) used LDA and LSI.
-
The remaining 93 papers mentioned at least one parameter. The most frequently discussed parameters were k, α and β:
-
Fifty-eight papers mentioned actual values for k, α and β;
-
Two papers mentioned actual values for α and β, but no values for k;
-
Twenty-nine papers included actual values for k but not for α and β;
-
Thirty-two (out of 58) papers mentioned other parameters in addition to k, α and β. For example, Chen et al. (2019) applied L2H (in comparison to LLDA), which uses the hyperparameters γ1 and γ2;
-
One paper (Rosenberg and Moonen 2018) that applied LSI mentioned the parameter “similarity threshold” rather than k, α and β.
-
α based on k: The most frequent setting (29 papers) was α = 50/k and β = 0.01 (i.e., α depended on the number of topics, a strategy suggested by Steyvers and Griffiths (2010) and Wallach et al. (2009)). These values are a default setting in Gibbs Sampling implementations for LDA such as Mallet.3
-
Fixed α and β: Five papers fixed both hyperparameters at 0.01, as suggested by Hoffman et al. (2010). Another eight papers fixed both at 0.1, a default setting in the Stanford Topic Modeling Toolbox (TMT);4 three other papers fixed α = 0.1 and β = 1 (these three studies applied RTM).
-
Varying α or β: Four papers tested different values for α, where two of these papers also tested different values for β; and one paper varied β but fixed a value for α.
-
Optimized parameters: Four papers obtained optimized values for hyperparameters (Sun et al. 2015; Catolino et al. 2019; Yang et al. 2017; Zhang et al. 2018). These papers applied LDA-GA (as proposed by Panichella et al. (2013)), which uses genetic algorithms to find the best values for LDA hyperparameters. Regarding the actual values chosen for optimized hyperparameters, Catolino et al. (2019) did not mention the values for hyperparameters; Sun et al. (2015) and Yang et al. (2017) mentioned only the values used for k; and Zhang et al. (2018) described the values for k, α and β.
-
Although the remaining 66 (out of 90) papers mentioned a single value used for k, most of them acknowledged that they had tried several numbers of topics or used the number of topics suggested by other studies.
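The most common configuration reported above can be sketched as follows (a minimal sketch assuming scikit-learn, where α corresponds to `doc_topic_prior` and β to `topic_word_prior`; the corpus and k are illustrative). Gibbs-sampling toolkits expose the same priors under different names (e.g., `alpha` and `eta` in gensim):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["fix null pointer bug", "add unit test", "update user docs", "refactor test code"]
X = CountVectorizer().fit_transform(docs)

k = 4                                    # number of topics
lda = LatentDirichletAllocation(
    n_components=k,
    doc_topic_prior=50 / k,              # alpha = 50/k
    topic_word_prior=0.01,               # beta = 0.01
    random_state=0,
).fit(X)
```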
5.4 RQ3: Pre-processing Steps
Pre-processing step | Description | Number of papers |
---|---|---|
Resolving negations | Negations refer to negative sentences with positive meaning, such as “No problem”; used depending on the context of study (e.g., the paper in which we found this step removed negations in user reviews) | 2 |
Expanding contractions | Normalizing contracted terms into expanded forms (e.g., “couldn’t” into “could not”) | 3 |
Resolving synonyms | Replacing words with similar meaning with a common representative word (e.g., “bug”, “error”, and “glitch” can be synonyms for “exception”) | 3 |
Identifying n-grams | Words may have a more concrete meaning when used together; n-grams are a sequence of n words; e.g., bi-gram (n-gram of two words) software development can be more informative than the words “software” and “development” separately | 6 |
Correcting typos | Replacing misspelled words with the correct ones | 7 |
Splitting document | Breaking a long document into shorter documents (e.g., splitting long project specifications and wiki pages) | 7 |
Lemmatizing | Reducing words to their lemmas based on the words’ part of speech (e.g., words “is” and “are” can be resolved as “be”) | 11 |
Tokenizing | Breaking up text in document into individual tokens (e.g., using white space and punctuation as token delimiters) | 17 |
Lowercasing | Entire document is converted to lowercase characters regardless of the spelling in the original document | 20 |
Splitting words | Splitting two or more words with no separating spaces or punctuation (e.g., many papers that analyze source code separated camel cases like “processFile” into “process” and “File”) | 33 |
Stemming | Normalizing words into their single forms by identifying and removing prefixes, suffixes and pluralisation (e.g., “development”, “developer”, “developing” become “develop”) | 61 |
Removing noise | Noise is any text that will interfere in the topic modeling (e.g., slowing down the processing or resulting in meaningless topics); due to the different types of noise removal, we discuss noise removal separately in Table 9 | 76 |
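Several of the steps above (tokenizing, splitting camel-case identifiers, lowercasing, stop-word removal, and stemming) can be combined into a pipeline. The following is a stdlib-only sketch; the stop-word list and the suffix-stripping stemmer are deliberately crude stand-ins for real resources such as Porter’s stemmer:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in"}  # tiny illustrative list

def split_camel_case(token):
    """Split identifiers such as 'processFile' into ['process', 'File']."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", token).split()

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer (e.g., Porter)."""
    for suffix in ("ing", "ment", "er", "s"):
        if word.endswith(suffix) and not word.endswith("ss") and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9]*", text)          # tokenize, drop punctuation/numbers
    words = [w for t in tokens for w in split_camel_case(t)]    # split camel-case words
    words = [w.lower() for w in words]                          # lowercase
    words = [w for w in words if w not in STOP_WORDS]           # remove stop words
    return [stem(w) for w in words]                             # stem

print(preprocess("The developers are testing processFile"))
# → ['developer', 'test', 'process', 'file']
```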
Noisy content | Number of papers |
---|---|
Empty documents | 1 |
Long paragraphs | 1 |
Extra white space | 1 |
Short documents | 2 |
Words shorter than four, three or two letters | 2 |
URLs | 4 |
Least frequent terms | 8 |
Most frequent terms | 8 |
Code snippets | 9 |
HTML tags | 9 |
Non-informative content | 11 |
Numbers | 17 |
Programming language keywords | 23 |
Symbols and special characters | 20 |
Punctuation | 21 |
Stop words | 75 |
5.5 RQ4: Topic Naming
-
Automated: Assigning names to word clusters without human intervention;
-
Manual: Manually checking the meaning and combination of words in a cluster to “deduce” a name, sometimes validated with expert judgment;
-
Manual & Automated: Mix of manual and automated; e.g., topics are manually labeled for one set of clusters to then train a classifier for naming another set of clusters.
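The simplest automated strategy, naming a topic after its most probable words, can be sketched as follows (the topic and its probabilities are illustrative, reusing the example topic from Section 2):

```python
def name_topic(topic_word_probs, top_n=2):
    """Label a topic with its top_n most probable words, joined by '-'."""
    ranked = sorted(topic_word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return "-".join(word for word, _ in ranked[:top_n])

print(name_topic({"code": 0.35, "test": 0.17, "bug": 0.08}))  # → code-test
```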
Procedure | Description | References (Manual) | References (Automated) | References (Manual & Automated) | Total |
---|---|---|---|---|---|
Deducing a name based on words in clusters | Assign names to topics based on an understanding of the most frequent words in topics (in one paper (Pettinato et al. 2019), the authors asked domain experts to validate the names) | (Bajaj et al. 2014; Layman et al. 2016; Bagherzadeh and Khatchadourian 2019; Ahmed and Bagherzadeh 2018; Pagano and Maalej 2013; Noei et al. 2019; Hindle et al. 2015; Barua et al. 2014; Rosen and Shihab 2016; Pettinato et al. 2019; Yang et al. 2017; Aggarwal and Zhai 2012; Ray et al. 2014; Haque and Ali Babar 2020; Gorla et al. 2014; Tiarks and Maalej 2014; El Zarif et al. 2020; Mezouar et al. 2018; Han et al. 2020; Abdellatif et al. 2020; Bajracharya and Lopes 2009) | – | – | 21 |
Naming based on most frequent word(s) in cluster | The most frequent word, or a combination of frequent words, in the topic was used as the name of that topic | (Panichella et al. 2013) | – | 3 |
Assigning predefined names to clusters | A list of predefined names is mapped to topics based on their similarity with the most frequent words in clusters | 12 |
6 Discussion
6.1 RQ1: Topic Modeling Techniques
6.1.1 Summary of Findings
6.1.2 Comparative Studies
Paper | Supported task | Techniques compared | Type of data | Dataset | Type of contribution | Metrics | Best performing technique |
---|---|---|---|---|---|---|---|
(De Lucia et al. 2014) | Documentation | LDA, LSI, VSM | Source code | JHotDraw and eXVantage | Exploration | Term entropy; Average overlap | VSM |
(Tantithamthavorn et al. 2018) | Bug handling | LDA, LSI, VSM | Source code; Issue/bug report | Eclipse and Mozilla | Exploration | Top-k accuracy | VSM |
(Abdellatif et al. 2019) | Maintenance | LDA, LSI, VSM | Issue/bug report | Data records from an Industry partner | Exploration | Top-k accuracy; Mean average precision (MAP) | VSM |
(Liu et al. 2020) | Documentation | LDA, LSI, GVSM-based techniques | Commit messages; Issue/bug report | 17 open source projects | Exploration | Average precision (AP) | GVSM-based techniques |
(Binkley et al. 2015) | Documentation | LSI, VSM, VSM-WS, QL-lin, QL-Dir, QL-LDA | Source code | ArgoUML 0.22, Eclipse 3.0, JabRef 2.6, jEdit 4.3 and muCommander 0.8.5 | Exploration | Mean Reciprocal Rank (MRR) | QL-LDA |
(Rao and Kak 2011) | Bug handling | MLE-LDA; LDA; UM; VSM; LSA; CBDM | Source code | iBUGS benchmark dataset | Exploration | MAP; SCORE | UM |
(Rosenberg and Moonen 2018) | Maintenance | LSI, PCA, NMF | Log information | Cisco Systems Norway log base | Exploration | Adjusted mutual information (AMI); Effort reduction; Homogeneity; Completeness | NMF |
(Silva et al. 2016) | Bug handling | LDA; XScan | Source code | Rhino and jEdit | Exploration | Precision; Recall; F-measure | XScan |
(Luo et al. 2016) | Testing | Call-graph-based; String-distance-based; LDA; Greedy techniques; Adaptive random testing | Test cases | 30 open source Java programs | Exploration | Average percentage of faults detected (APFD) | Call-graph-based |
(Thomas et al. 2013) | Bug handling | LDA, LSI, VSM | Source code; Issue/bug report | Eclipse, Jazz and Mozilla | Approach | Top-k accuracy | VSM |
Paper | Supported task | Approaches compared | Type of data | Dataset | Type of contribution | Metrics | Best performing approach |
---|---|---|---|---|---|---|---|
(Naguib et al. 2013) | Bug handling | LDA; LDA-SVM | Issue/bug report | Atlas, Eclipse BIRT and Unicase | Approach | Actual assignee hit Ratio; Top-k hit | LDA |
(Murali et al. 2017) | Bug handling | Salento (LDA + Probabilistic Behavior Model and Artificial Neural Networks); Non-Bayesian method | Software documentation | Android APIs: alert dialogs, bluetooth sockets and cryptographic ciphers | Approach | Precision; Recall; Anomaly score | Salento |
(Xia et al. 2017b) | Bug handling | TopicMiner (MTM); Bugzie; LDA-KL; SVM-LDA; LDA-Activity | Issue/bug report | GCC, OpenOffice, Netbeans, Eclipse and Mozilla | Approach | Top-k accuracy | TopicMiner |
(Thomas et al. 2014) | Testing | LDA; Call-graph-based; String-distance-based; Adaptive random testing | Source code | Software-artifact Infrastructure Repository (SIR) | Approach | APFD; Mann-Whitney-Wilcoxon test; A measure | LDA |
(Jiang et al. 2019) | Requirements | SAFER (LDA + Clustering technique); KNN+; CLAP | Software documentation | 100 Google Play apps | Approach | Hit ratio; Normalized Discounted Cumulative Gain (NDCG) | SAFER |
(Cao et al. 2017) | Architecting | DAT-LDA + Clustering technique; WTCluster; WT-LDA; CDSR; OD-DMSC; CDA-DMSC; CDT-DMSC | Software documentation | 6629 mashup services from ProgrammableWeb | Approach | Precision; Recall; F-Measure; Purity; Term entropy | DAT-LDA + Clustering technique |
(Yan et al. 2016b) | Bug handling | DPLSA; LDA-KL; LDA-SVM | Issue/bug report | Eclipse, Bugzilla, Mylyn, GCC and Firefox | Approach | Recall @k; Perplexity | DPLSA |
(Zhang et al. 2016) | Bug handling | LDA + Clustering technique; INSPect; NB Multinomial; DRETOM; DREX; DevRec | Issue/bug report | GCC, OpenOffice, Eclipse, NetBeans and Mozilla | Approach | Precision; Recall; F-measure; MRR | LDA + Clustering technique |
(Demissie et al. 2020) | Architecting | PREV (LDA + Clustering and Classification techniques); Covert; IccTA | Software documentation | 11,796 Google Plays apps | Approach | Precision; Recall | PREV |
(Blasco et al. 2020) | Requirements | CODFREL (LSI + Evolutionary algorithm); Regular-LSI | Source code | Kromaia video game data | Approach | Precision; Recall; F-measure | CODFREL |
Paper | Supported task | Techniques compared | Type of data | Dataset | Type of contribution | Metrics | Outcome of comparison |
---|---|---|---|---|---|---|---|
Biggers et al. (2014) | Documentation | LDA (settings tested: hyperparameters α and β, document, number of topics and query (i.e., a string formulated manually or automatically by an end user or developer)) | Source code | ArgoUML, JabRef, jEdit, muCommander, Mylyn, Rhino | Exploration | Effectiveness measure | Recommendation for values of LDA hyperparameters and number of topics considering the number of documents used |
Poshyvanyk et al. (2012) | Documentation | LSI-based technique (settings tested: number of documents, number of attributes, stemming of corpus and queries) | Source code | ArgoUML, Freenet, iBatis, JMeter, Mylyn and Rhino | Approach | Precision; Recall; Effectiveness; Minimal browsing area (MBA); Maximum possible precision gain (MPG) | Configuration settings for the proposed technique based on the characteristics of the corpora used |
Chen et al. (2014) | Bug handling | AR-Miner: Expectation Maximization for Naive Bayes (EMNB) + LDA; EMNB + ASUM | End user communication | Apps SwiftKey Keyboard, Facebook, Temple Run 2, Tap Fish | Approach | Precision; Recall; F-measure; NDCG | EMNB + LDA |
Fowkes et al. (2016) | Coding | TASSAL + LDA; TASSAL + VSM | Source code | Six open source Java projects | Approach | Area Under the Curve (AUC) | TASSAL + LDA |
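Two of the evaluation metrics recurring in the comparisons above, Top-k accuracy and Mean Reciprocal Rank (MRR), can be sketched as follows. Each ranking is a list of candidates ordered by a model's score (e.g., recommended bug assignees); the data below is a toy example, not from any surveyed study:

```python
def top_k_accuracy(rankings, truths, k):
    """Fraction of queries whose correct answer appears in the top k."""
    hits = sum(1 for r, t in zip(rankings, truths) if t in r[:k])
    return hits / len(truths)

def mrr(rankings, truths):
    """Mean of 1/rank of the correct answer (0 when it is absent)."""
    rr = []
    for r, t in zip(rankings, truths):
        rr.append(1 / (r.index(t) + 1) if t in r else 0.0)
    return sum(rr) / len(rr)

rankings = [["alice", "bob", "carol"], ["bob", "alice", "carol"]]
truths = ["bob", "carol"]
print(top_k_accuracy(rankings, truths, k=2))  # 0.5 (bob in top 2, carol not)
print(mrr(rankings, truths))                  # (1/2 + 1/3) / 2 ≈ 0.4167
```

MAP and APFD follow the same pattern of averaging a per-query rank-based score, with precision averaged over relevant positions (MAP) or fault positions normalized by suite length (APFD).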
6.2 RQ2: Inputs to Topic Models
6.2.1 Summary of Findings
6.2.2 Documents and Parameters for Topic Models
-
Papers that used LDA-GA, an LDA-based technique that optimizes hyperparameters with genetic algorithms, applied it to data from developer documentation or source code;
-
LDA was used with all three types of hyperparameter settings across studies. The most common setting was α based on k for developer communication and source code;
-
Most of the LDA-based techniques applied fixed values for α and β.
Types of Data | α based on k | Fixed α and β | Varying α or β | Optimized parameters |
---|---|---|---|---|
Commit messages | DPLSA: 1, Semi-supervised LDA: 1 | LDA: 1, RTM: 1 | – | – |
Developer communication | LDA: 8, LLDA; L2H: 1 | LDA: 3 | – | – |
End user communication | LDA: 1, LDA; ASUM: 1, LLDA: 1, AOLDA: 1 | LDA: 1 | – | – |
Issue/bug report | LDA: 3, LDA; LSI: 1, DPLSA: 1, MTM: 1 | LDA: 3, RTM: 1, LDA; LLDA: 1 | LDA: 1, MLE-LDA: 1 | – |
Log information | LDA: 2 | – | – | – |
Search query | – | LDA: 2 | – | – |
End user documentation | LDA: 3 | LDA: 3 | LDA: 1 | – |
Developer documentation | – | DAT-LDA: 1 | – | LDA-GA: 1 |
Source code | LDA: 6, LDA; LSI: 1, RTM: 3 | LDA: 3, BugScout: 1, QL-LDA; LSI: 2, LDA; LSI: 1 | LDA: 2, MLE-LDA: 1 | LDA-GA: 2 |
“Lessons learned” | – | – | – | – |
Transcript | LDA: 3 | – | – | – |
URL content | – | LDA: 1 | – | – |
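The "α based on k" setting in the table above typically follows the heuristic α = 50/k (Griffiths and Steyvers 2004). A minimal sketch; the β value and the constants in the "fixed" branch are illustrative placeholders, not values from any particular surveyed paper:

```python
def lda_hyperparameters(k, scheme="based_on_k"):
    """Return (alpha, beta) for an LDA model with k topics."""
    if scheme == "based_on_k":
        return 50 / k, 0.01   # alpha shrinks as the number of topics grows
    if scheme == "fixed":
        return 0.1, 0.01      # constants, independent of k
    raise ValueError(f"unknown scheme: {scheme}")

print(lda_hyperparameters(100))          # (0.5, 0.01)
print(lda_hyperparameters(10, "fixed"))  # (0.1, 0.01)
```

Lower α pushes each document toward fewer dominant topics, which is why α is often tied to k: with many topics, a document should still concentrate on a handful of them.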
6.2.3 Supported Tasks, Types of Data and Types of Contribution
-
Source code was a frequent type of data in papers; consequently, it appeared for almost all supported tasks, except in exploratory studies;
-
Considering exploratory studies, most papers used developer communication (13 out of 21), followed by search queries and end user communication (three papers each);
-
Papers that supported bug handling mostly used issue/bug reports, source code and end user communication;
-
Log information was used by papers that supported maintenance, bug handling, and coding;
-
Considering the papers that supported documentation, three used transcript texts from speech;
-
Of the four papers that used developer documentation as the type of data, two supported architecting tasks and the other two supported documentation tasks.
-
Regarding the type of data, URLs and transcripts were only used in studies that contributed an approach.
Supported Tasks

Types of data | Architecting | Bug handling | Coding | Documentation | Maintenance | Refactoring | Requirements | Testing | Exploratory studies |
---|---|---|---|---|---|---|---|---|---|
Commit messages | Exploration: 1 | Approach: 3 Exploration [C]: 1 | – | Approach: 1 Exploration [C]: 1 | Approach: 1 | Exploration: 1 | – | – | Exploration: 1 |
Developer communication | – | Approach: 1 | – | Approach: 5 | Approach: 1 | – | – | – | Exploration: 13 |
End user communication | – | Approach: 4 Exploration: 2 | – | – | Approach: 1 Exploration: 1 | – | Approach: 1 | – | Exploration: 3 |
Issue/bug report | Exploration: 1 Exploration [C]: 1 | Approach: 6 Exploration: 2 Approach [C]: 5 Exploration [C]: 2 | – | Approach: 2 Exploration [C]: 1 | Exploration [C]: 1 | – | – | – | Exploration: 1 |
Log information | – | Approach: 1 | Approach: 1 | – | Approach: 1 Exploration: 1 Exploration [C]: 1 | – | – | – | – |
Search query | – | – | – | Approach: 1 | – | – | – | – | Exploration: 3 |
End user documentation | Approach: 2 Approach [C]: 1 | Exploration: 1 Approach [C]: 1 | Exploration: 1 | Approach: 4 | Approach: 1 | – | Approach [C]: 1 | Approach: 1 | Exploration: 2 |
Developer documentation | Approach: 1 Approach [C]: 1 | – | – | Approach: 2 | – | – | – | – | – |
Source code | Approach: 2 Exploration: 2 | Approach: 4 Exploration: 2 Approach [C]: 1 Exploration [C]: 3 | Approach: 2 Exploration: 1 Approach [C]: 1 | Approach: 5 Exploration [C]: 3 | Approach: 1 Exploration: 3 | Approach: 2 | Approach: 1 Approach [C]: 1 | Approach [C]: 1 Exploration [C]: 1 | – |
“Lessons learned” | – | – | – | – | Exploration [C]: 1 | – | – | – | – |
Transcript | – | – | – | Approach: 3 | – | – | – | – | – |
URL content | Approach: 1 | – | – | – | – | – | – | – | – |
6.3 RQ3: Data Pre-processing
6.3.1 Summary of Findings
6.3.2 Pre-processing Different Types of Data
-
For developer communication there were specific types of noisy content that were removed: URLs, HTML tags and code snippets. This might be because most of the papers used Q&A posts as documents, which frequently contain hyperlinks and code examples;
-
Removing non-informative content was frequently applied to end user communication and end user documentation;
-
Expanding contracted terms (e.g., “didn’t” to “did not”) was applied to end user communication and issue/bug reports;
-
Removing empty documents and eliminating extra white spaces were applied only to end user communication. Empty documents occurred in this type of data because no content was left after the removal of stop words (Chen et al. 2014);
-
For source code there was a specific type of noise to be removed: programming-language-specific keywords (e.g., “public”, “class”, “extends”, “if”, and “while”).
Type of data

Pre-processing steps | Commit messages | Developer communication | Developer documentation | End user communication | End user documentation | Issue/bug report | “Lessons learned” | Log information | Search query | Source code | Transcript | URL content |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Resolving negations | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Correcting typos | 0 | 0 | 0 | 6 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Expanding contractions | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Resolving synonyms | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Splitting sentences or a document into n documents | 3 | 1 | 0 | 1 | 3 | 3 | 0 | 0 | 0 | 1 | 0 | 0 |
Lemmatizing | 1 | 2 | 0 | 5 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 0 |
Identifying n-grams | 0 | 3 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Lowercasing | 1 | 1 | 0 | 5 | 1 | 3 | 0 | 2 | 1 | 5 | 1 | 1 |
Tokenizing | 1 | 1 | 0 | 2 | 2 | 5 | 0 | 2 | 1 | 4 | 0 | 0 |
Splitting words | 4 | 0 | 0 | 0 | 2 | 8 | 0 | 0 | 2 | 24 | 1 | 0 |
Stemming | 5 | 8 | 3 | 9 | 8 | 14 | 1 | 1 | 1 | 21 | 2 | 1 |
Removing empty documents | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Removing long paragraphs | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Removing short documents | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Removing extra white space | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Removing non-informative content | 1 | 1 | 0 | 4 | 4 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
Removing words shorter than four, three or two letters | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
Removing least frequent terms | 0 | 2 | 0 | 2 | 1 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
Removing most frequent terms | 0 | 2 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
Removing code snippets | 1 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
Removing HTML tags | 1 | 6 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Removing programming language keywords | 1 | 3 | 0 | 0 | 0 | 4 | 0 | 0 | 1 | 19 | 0 | 0 |
Removing symbols and special characters | 2 | 3 | 0 | 2 | 2 | 3 | 0 | 0 | 2 | 6 | 2 | 1 |
Removing punctuation | 2 | 4 | 0 | 2 | 3 | 4 | 0 | 2 | 0 | 5 | 2 | 1 |
Removing stop words | 6 | 16 | 2 | 10 | 8 | 15 | 1 | 3 | 0 | 23 | 2 | 1 |
Removing URLs | 1 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Removing numbers | 1 | 4 | 0 | 1 | 3 | 4 | 0 | 1 | 0 | 5 | 2 | 0 |
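Several of the steps counted above can be combined into a small pipeline. A minimal sketch for source-code documents covering URL removal, identifier (camelCase) splitting, tokenizing, lowercasing, and removal of programming language keywords and stop words; the keyword and stop word lists are illustrative, not exhaustive:

```python
import re

JAVA_KEYWORDS = {"public", "class", "extends", "if", "while", "void", "return"}
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is"}

def preprocess(text):
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)  # split camelCase identifiers
    tokens = re.findall(r"[A-Za-z]+", text.lower())      # tokenize and lowercase
    return [t for t in tokens if t not in JAVA_KEYWORDS | STOP_WORDS]

doc = "public class HttpRequestParser extends BaseParser"
print(preprocess(doc))  # ['http', 'request', 'parser', 'base', 'parser']
```

Stemming or lemmatizing, the most frequent steps in the table, would be applied to the returned tokens as a final stage (e.g., with an off-the-shelf Porter stemmer).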
6.4 RQ4: Assigning Names to Topics
-
Studies that modeled topics from developer documentation, transcripts and URLs did not mention topic naming. Studies that contributed both an exploration and a comparison also did not mention topic naming;
-
Topics were mostly named in studies that used data from developer communication (ten occurrences) and in exploratory studies (22 occurrences).
Topic naming procedure

Types of data | Based on word clusters | Most frequent words | Predefined names |
---|---|---|---|
Commit messages | Manual: 1, Automated & Manual: 1 | – | Automated: 2 |
Developer communication | Manual: 9 | Automated: 1 | Automated: 1 Manual: 1 |
End user communication | Manual: 2 | Manual: 1 | Automated: 2 Manual: 1 |
End user documentation | Manual: 5 | – | – |
Issue/bug report | Manual: 3, Automated & Manual: 1 | – | Automated: 1 |
Log information | Manual: 1 | – | – |
Search query | Manual: 1 | – | Manual: 1 |
Source code | – | Automated: 1 Manual: 1 | Manual: 1 |
Topic naming procedure

Types of contribution | Based on word clusters | Most frequent words | Predefined names |
---|---|---|---|
Approach | Manual: 5, Automated & Manual: 2 | Automated: 1, Manual: 1 | Automated: 4 |
Approach & Comparison | – | – | Automated: 1 |
Exploration | Manual: 16 | Manual: 1 | Automated: 1, Manual: 4 |
-
Twelve papers (Bagherzadeh and Khatchadourian 2019; Ahmed and Bagherzadeh 2018; Martin et al. 2015; Hindle et al. 2013; Pagano and Maalej 2013; Zou et al. 2017; Pettinato et al. 2019; Layman et al. 2016; Ray et al. 2014; Tiarks and Maalej 2014; Mezouar et al. 2018; Abdellatif et al. 2020) acknowledged that how topics were named could be a threat to validity. For example, Layman et al. (2016) mentioned that they did not evaluate the accuracy of the manual topic naming, which was based on their expertise.
-
One paper (Pettinato et al. 2019) acknowledged that another topic naming approach could have been applied to their data: an automated extraction of topic names could replace manual labeling.
-
Some of the generated topics will not be relevant (e.g., clusters filled with common terms may not address any particular subject) and topics may be duplicated. This means that not all topics have to be named and used for analysis;
-
Domain experts can label topics better than non-experts, because they are more familiar with domain-specific keywords that may appear in word clusters;
-
It is important to consider the relationship between the generated topics and the original data. Hindle et al. (2015) argued that “the content of the topic can be interpreted in many different ways and LDA does not look for the same patterns that people do”.
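The automated variant of naming by most frequent words, discussed in this RQ, can be sketched as joining a topic's top-n highest-probability words into a label. The word probabilities below are hypothetical:

```python
def label_from_top_words(topic, n=3):
    """Build a topic label from the n highest-probability words."""
    top = sorted(topic, key=topic.get, reverse=True)[:n]
    return "/".join(top)

# Hypothetical topic: word -> probability in the topic-word distribution.
topic = {"test": 0.21, "junit": 0.17, "assert": 0.12, "mock": 0.05}
print(label_from_top_words(topic))  # test/junit/assert
```

Manual naming follows the same starting point (the top words), but a human then deduces a more abstract label such as "testing" instead of concatenating the words.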
6.5 Implications
-
Understand which topic modeling techniques to use for what purpose. Researchers and practitioners who are going to select and apply a topic modeling technique (for example, to refactor legacy systems) may consider the experiences of other studies with similar objectives.
-
Pre-processing based on the type of data to be modeled. Pre-processing steps depend on the type of data analyzed (e.g., removing HTML tags in developer communication, mainly Q&A posts). Researchers and practitioners who, for example, intend to model topics from source code may consider the same pre-processing steps that other studies applied to source code.
-
Understand how to name topics. Researchers and practitioners may check how other studies named topics to get insights on how to give meaning to their own topics.
-
Appropriateness of topic modeling. Although we found that most papers applied LDA “as is”, it may not be the best approach for other studies or for practical application. LDA is popular because it is an unsupervised model, i.e., it does not require previous knowledge about the data (e.g., pre-defined classes for model training), it is statistically more rigorous than other techniques (e.g., LSI), and it discovers latent relationships (i.e., topics) between documents in a large textual corpus (Griffiths and Steyvers 2004). However, LDA is an unstable and non-deterministic model: generated topics cannot be replicated by others, even if the same model inputs (data pre-processing and configuration of parameters) are used. Furthermore, LDA performs poorly on short documents (Lin et al. 2014).
-
Meaningful topics. Topic models should discover semantically meaningful topics. Chang et al. (2009) argue about the importance of the interpretability of topics generated by probabilistic topic modeling techniques such as LDA. To create meaningful and replicable topics with LDA, Mantyla et al. (2018) highlight the importance of stabilizing the topic model (e.g., through tuning (Agrawal et al. 2018)) and advocate the use of stability metrics (e.g., rank-biased overlap - RBO (Mantyla et al. 2018)).
-
Research opportunities. Researchers interested in investigating topic modeling in software engineering may consider developing guidelines on how to use topic modeling depending on the type of data, goals, etc. Further studies may also explore approaches for naming topics (e.g., based on domain experts), the evaluation of the semantic accuracy of generated topics (e.g., how meaningful the topics are and whether the context of documents has to be considered), and metrics to measure the performance of topic models supporting different software engineering tasks.
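The rank-biased overlap (RBO) stability metric advocated by Mantyla et al. (2018) can be sketched for comparing the top-word lists of two LDA runs. This is the truncated form of RBO (without the extrapolation term), and the word lists are toy data:

```python
def rbo(list1, list2, p=0.9):
    """Truncated rank-biased overlap between two ranked lists.

    p in (0, 1) weights agreement at the top ranks more heavily.
    """
    depth = min(len(list1), len(list2))
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(list1[:d]) & set(list2[:d]))  # agreement at depth d
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score

run_a = ["error", "log", "fail", "trace"]   # top words of a topic, run 1
run_b = ["error", "fail", "log", "debug"]   # top words of the same topic, run 2
print(round(rbo(run_a, run_b), 3))  # 0.281 — higher means more stable top words
```

Comparing RBO across repeated runs with different random seeds quantifies the non-determinism of LDA discussed above: stable topics keep high RBO between runs, unstable ones do not.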
6.6 Threats to Validity
7 Conclusions
-
LDA and LDA-based techniques are the most frequently used topic modeling techniques;
-
Topic modeling was mostly used to develop techniques for handling bugs (e.g., to predict defects). Exploratory studies that use topic modeling as a data analysis technique were also frequent;
-
Most papers modeled topics from source code (using methods as documents);
-
Most papers used LDA “as is” and without adapting values of hyperparameters (α and β);
-
Most papers describe pre-processing. Some pre-processing steps depend on the type of textual data used (e.g., removal of URLs and HTML tags), while others are common in NLP techniques (e.g., stop word removal or stemming);
-
Only 36 (out of 111) papers named the topics. When naming topics, papers mostly adopted manual approaches, such as deducing names (or assigning predefined names) based on the meaning of frequent words in that topic.