Bug localization using latent Dirichlet allocation
Introduction
Bug localization is a software maintenance task in which a developer uses information about a bug present in a software system to locate the portion of the source code that must be modified to correct the bug. In particular, a developer formulates a query using information in a bug report and uses the query to search the software system in order to determine which parts of the code must be changed to “fix” the bug.
Due to the size and complexity of modern software, effectively automating this task can reduce maintenance costs by reducing developer effort. Techniques for automating bug localization take as input information about a subject software system and produce as output a list of entities such as classes, methods, or statements. This list indicates elements of the source code that likely need modification to correct the bug in question. Static bug localization techniques gather information from the source code (or a model of the code), whereas dynamic techniques gather information from execution traces of the system. Static techniques have some advantages over dynamic techniques, e.g., static techniques do not require a working subject software system. Thus, static techniques can be applied at any stage of the software development or maintenance processes. Additionally, unlike most dynamic techniques, static techniques do not require one or more test cases that trigger the bug.
Some recent static techniques for automating bug localization have been built around modern information retrieval (IR) models such as latent semantic indexing (LSI) [1], [27] and its probabilistic extension, pLSI [13]. In this paper, we describe a static technique for automating bug localization that is based on another IR model, latent Dirichlet allocation (LDA), whose properties, which include modularity and extensibility, provide advantages over both LSI and pLSI. LDA has previously been shown to be a promising technique for a variety of source code retrieval tasks, such as mining concepts from source code [17], [21]. In addition, our previous work [19] showed LDA to be a viable technique for bug localization and provided an initial assessment of the accuracy of an LDA-based approach to source code retrieval for bug localization. This paper briefly reports the results of our previous study [19] and expands it significantly by measuring the accuracy of LDA-based bug localization when applied to a total of 322 bugs across multiple iterations of two software systems. This study is much larger than previous LSI-based bug localization studies [27], [28], [29] and illustrates the scalability of our LDA-based technique.
In addition to examining the scalability of our LDA-based bug localization approach, we also examine whether our approach is adversely affected by the stability of the software. IR-based software engineering techniques using LSI or LDA operate on informal semantic tokens (comments and identifiers) present in the code. This dependency could present a problem when applying these IR-based techniques to changing software. For example, design-level changes such as class name changes could potentially affect the accuracy of IR-based software engineering techniques. Also, the study in [9] found that newly added code in a changing software system is sparsely commented; thus, newly added classes, as well as classes with many internal changes, likely have less semantic information available to be used by IR-based software engineering techniques. Thus we argue that changeable or unstable code very often could result in poor or missing semantic tokens, which therefore could affect the results of our approach. For this reason we examine the relationship between our approach and code instability.
In the past, various stability metrics have been shown to accurately reflect the stability of software across multiple development iterations [1], [16]. In this paper, we compare these metrics to the results of our approach to determine whether or not the instability of the software results in poorer operation of our LDA-based technique.
In summary, our primary goal is to thoroughly explore our static LDA-based approach to bug localization, and to examine its advantages, disadvantages, and boundaries. To do this, our paper:
- °
Discusses the advantages and disadvantages of an LDA-based approach compared to other information retrieval approaches.
- °
Demonstrates the improved accuracy of our LDA-based bug localization technique compared to a previous LSI-based bug localization technique over the same data examined in the earlier LSI-based study.
- °
Illustrates the scalability of our LDA-based bug localization technique to large software packages.
- °
Determines whether the accuracy of our LDA-based bug localization technique, as one example of an IR-based bug localization technique, is sensitive to software stability.
The remainder of this paper is organized as follows: Section 2 is related work. Section 3 describes our research approach. Section 4 is results, and Section 5 presents conclusions.
Section snippets
Related work
This related work section is in two parts, IR-based source code retrieval and software metrics. First, we describe IR-based source code retrieval techniques.
LDA-based bug localization approach
Two steps are necessary to construct an LDA model of a software system: (1) build a document collection from the source code and (2) perform an LDA analysis on the document collection. These steps are detailed below and illustrated in Fig. 1.
Eclipse results
We ran the same queries for LDA as previously published for LSI [28] on the Eclipse LDA model. For each bug examined, Table 4 lists the software version in which the bug occurred, the bug number and bug title taken from the Eclipse bug repository, and the query used in both our LDA analysis and the earlier published LSI analysis [28]. Table 5 presents the results of the LDA query – the name of the first relevant method returned by the query and its LDA rank – along with the LSI results for
Conclusions and future research
The case studies presented in this paper show LDA can successfully be applied to source code retrieval for the purpose of bug localization. Furthermore, the results suggest an LDA-based approach is more effective than approaches using LSI alone for this task.
The first case study examined the same bugs in Mozilla and Eclipse previously analyzed using LSI [28]. The LSI analysis resulted in only three out of the eight total bugs analyzed (37.5%) with the first relevant method ranked in the top 10
Acknowledgements
This work was funded in part by the National Aeronautics and Space Administration under Grants NAG5-12725 and NCC8-200, and by the National Science Foundation under Grants CCF-0915403 and CCF-0915559.
References (35)
- et al.
An empirical study of system design instability metric and design evolution in an agile software process
J. Syst. Softw.
(2005) - et al.
An empirical study of object-oriented system evolution
Informat. Softw. Technol.
(2000) - G. Antoniol, K. Ayari, M. Di Penta. Is it a bug or an enhancement? A text-based approach to classify change requests,...
- et al.
The TAME project: towards improvement-oriented software environments
IEEE Trans. Softw. Eng.
(1988) - et al.
Latent Dirichlet allocation
J. Mach. Learn. Res.
(2003) - et al.
A metrics suite for object-oriented design
IEEE Trans. Softw. Eng.
(1994) - et al.
Indexing by latent semantic analysis
J. Am. Soc. Informat. Sci.
(1990) - et al.
Do crosscutting concerns cause defects?
IEEE Trans. Softw. Eng.
(2008) - Eclipse Home Page....
- B. Fluri, M, Wursch, H.C. Gall, Do code and comments co-evolve? On the relation between source code and comment...