Bug localization using latent Dirichlet allocation

https://doi.org/10.1016/j.infsof.2010.04.002Get rights and content

Abstract

Context

Some recent static techniques for automatic bug localization have been built around modern information retrieval (IR) models such as latent semantic indexing (LSI). Latent Dirichlet allocation (LDA) is a generative statistical model that has significant advantages, in modularity and extensibility, over both LSI and probabilistic LSI (pLSI). Moreover, LDA has been shown effective in topic model based information retrieval. In this paper, we present a static LDA-based technique for automatic bug localization and evaluate its effectiveness.

Objective

We evaluate the accuracy and scalability of the LDA-based technique and investigate whether it is suitable for use with open-source software systems of varying size, including those developed using agile methods.

Method

We present five case studies designed to determine the accuracy and scalability of the LDA-based technique, as well as its relationships to software system size and to source code stability. The studies examine over 300 bugs across more than 25 iterations of three software systems.

Results

The results of the studies show that the LDA-based technique maintains sufficient accuracy across all bugs in a single iteration of a software system and is scalable to a large number of bugs across multiple revisions of two software systems. The results of the studies also indicate that the accuracy of the LDA-based technique is not affected by the size of the subject software system or by the stability of its source code base.

Conclusion

We conclude that an effective static technique for automatic bug localization can be built around LDA. We also conclude that there is no significant relationship between the accuracy of the LDA-based technique and the size of the subject software system or the stability of its source code base. Thus, the LDA-based technique is widely applicable.

Introduction

Bug localization is a software maintenance task in which a developer uses information about a bug present in a software system to locate the portion of the source code that must be modified to correct the bug. In particular, a developer formulates a query using information in a bug report and uses the query to search the software system in order to determine which parts of the code must be changed to “fix” the bug.

Due to the size and complexity of modern software, effectively automating this task can reduce maintenance costs by reducing developer effort. Techniques for automating bug localization take as input information about a subject software system and produce as output a list of entities such as classes, methods, or statements. This list indicates elements of the source code that likely need modification to correct the bug in question. Static bug localization techniques gather information from the source code (or a model of the code), whereas dynamic techniques gather information from execution traces of the system. Static techniques have some advantages over dynamic techniques, e.g., static techniques do not require a working subject software system. Thus, static techniques can be applied at any stage of the software development or maintenance processes. Additionally, unlike most dynamic techniques, static techniques do not require one or more test cases that trigger the bug.

Some recent static techniques for automating bug localization have been built around modern information retrieval (IR) models such as latent semantic indexing (LSI) [1], [27] and its probabilistic extension, pLSI [13]. In this paper, we describe a static technique for automating bug localization that is based on another IR model, latent Dirichlet allocation (LDA), whose properties, which include modularity and extensibility, provide advantages over both LSI and pLSI. LDA has previously been shown to be a promising technique for a variety of source code retrieval tasks, such as mining concepts from source code [17], [21]. In addition, our previous work [19] showed LDA to be a viable technique for bug localization and provided an initial assessment of the accuracy of an LDA-based approach to source code retrieval for bug localization. This paper briefly reports the results of our previous study [19] and expands it significantly by measuring the accuracy of LDA-based bug localization when applied to a total of 322 bugs across multiple iterations of two software systems. This study is much larger than previous LSI-based bug localization studies [27], [28], [29] and illustrates the scalability of our LDA-based technique.

In addition to examining the scalability of our LDA-based bug localization approach, we also examine whether our approach is adversely affected by the stability of the software. IR-based software engineering techniques using LSI or LDA operate on informal semantic tokens (comments and identifiers) present in the code. This dependency could present a problem when applying these IR-based techniques to changing software. For example, design-level changes such as class name changes could potentially affect the accuracy of IR-based software engineering techniques. Also, the study in [9] found that newly added code in a changing software system is sparsely commented; thus, newly added classes, as well as classes with many internal changes, likely have less semantic information available to be used by IR-based software engineering techniques. Thus we argue that changeable or unstable code very often could result in poor or missing semantic tokens, which therefore could affect the results of our approach. For this reason we examine the relationship between our approach and code instability.

In the past, various stability metrics have been shown to accurately reflect the stability of software across multiple development iterations [1], [16]. In this paper, we compare these metrics to the results of our approach to determine whether or not the instability of the software results in poorer operation of our LDA-based technique.

In summary, our primary goal is to thoroughly explore our static LDA-based approach to bug localization, and to examine its advantages, disadvantages, and boundaries. To do this, our paper:

  • °

    Discusses the advantages and disadvantages of an LDA-based approach compared to other information retrieval approaches.

  • °

    Demonstrates the improved accuracy of our LDA-based bug localization technique compared to a previous LSI-based bug localization technique over the same data examined in the earlier LSI-based study.

  • °

    Illustrates the scalability of our LDA-based bug localization technique to large software packages.

  • °

    Determines whether the accuracy of our LDA-based bug localization technique, as one example of an IR-based bug localization technique, is sensitive to software stability.

The remainder of this paper is organized as follows: Section 2 is related work. Section 3 describes our research approach. Section 4 is results, and Section 5 presents conclusions.

Section snippets

Related work

This related work section is in two parts, IR-based source code retrieval and software metrics. First, we describe IR-based source code retrieval techniques.

LDA-based bug localization approach

Two steps are necessary to construct an LDA model of a software system: (1) build a document collection from the source code and (2) perform an LDA analysis on the document collection. These steps are detailed below and illustrated in Fig. 1.

Eclipse results

We ran the same queries for LDA as previously published for LSI [28] on the Eclipse LDA model. For each bug examined, Table 4 lists the software version in which the bug occurred, the bug number and bug title taken from the Eclipse bug repository, and the query used in both our LDA analysis and the earlier published LSI analysis [28]. Table 5 presents the results of the LDA query – the name of the first relevant method returned by the query and its LDA rank – along with the LSI results for

Conclusions and future research

The case studies presented in this paper show LDA can successfully be applied to source code retrieval for the purpose of bug localization. Furthermore, the results suggest an LDA-based approach is more effective than approaches using LSI alone for this task.

The first case study examined the same bugs in Mozilla and Eclipse previously analyzed using LSI [28]. The LSI analysis resulted in only three out of the eight total bugs analyzed (37.5%) with the first relevant method ranked in the top 10

Acknowledgements

This work was funded in part by the National Aeronautics and Space Administration under Grants NAG5-12725 and NCC8-200, and by the National Science Foundation under Grants CCF-0915403 and CCF-0915559.

References (35)

  • M. Alshayeb et al.

    An empirical study of system design instability metric and design evolution in an agile software process

    J. Syst. Softw.

    (2005)
  • W. Li et al.

    An empirical study of object-oriented system evolution

    Informat. Softw. Technol.

    (2000)
  • G. Antoniol, K. Ayari, M. Di Penta. Is it a bug or an enhancement? A text-based approach to classify change requests,...
  • V.R. Basili et al.

    The TAME project: towards improvement-oriented software environments

    IEEE Trans. Softw. Eng.

    (1988)
  • D.M. Blei et al.

    Latent Dirichlet allocation

    J. Mach. Learn. Res.

    (2003)
  • S.R. Chidamber et al.

    A metrics suite for object-oriented design

    IEEE Trans. Softw. Eng.

    (1994)
  • S.C. Deerwester et al.

    Indexing by latent semantic analysis

    J. Am. Soc. Informat. Sci.

    (1990)
  • M. Eaddy et al.

    Do crosscutting concerns cause defects?

    IEEE Trans. Softw. Eng.

    (2008)
  • Eclipse Home Page....
  • B. Fluri, M, Wursch, H.C. Gall, Do code and comments co-evolve? On the relation between source code and comment...
  • GibbsLDA++....
  • M. Girolami, A. Kabán, On an equivalence between PLSI and LDA, in: Proc. 22nd Annu. ACM SIGIR Int. Conf. on Research...
  • T.L. Griffiths et al.

    Finding scientific topics

    Proc. Nat. Acad. Sci.

    (2004)
  • T. Hofmann, Probabilistic latent semantic indexing, in: Proc. 22nd Annu. ACM SIGIR Int. Conf. on Research and...
  • A. Kontostathis, Essential dimensions of latent semantic indexing (LSI), in: Proc. 40th Hawaii International Conference...
  • A. Kuhn, S. Ducasse, T. Gîrba, Enriching reverse engineering with semantic clustering, in: Proc. 12th Working Conf. on...
  • E. Linstead, P. Rigor, S. Bajracharya, C. Lopes, P. Baldi, Mining concepts from code with probabilistic topic models,...
  • Cited by (0)

    1

    Tel.: +1 256 348 4740; fax: +1 256 348 0219.

    2

    Tel.: +1 256 824 6291; fax: +1 256 824 6039.

    View full text