Elsevier

Data & Knowledge Engineering

Volume 116, July 2018, Pages 159-176
Automatic query reformulations for feature location in a model-based family of software products

https://doi.org/10.1016/j.datak.2018.06.001

Highlights

  • The influence of query reformulation in feature location for models is evaluated.

  • Similarity to a reformulated query guides an evolutionary algorithm for feature location.

  • Reformulation fails to boost the quality of the solution in models as it does in code.

Abstract

No maintenance activity can be completed without Feature Location (FL), which is the task of finding the set of software artifacts that realize a particular functionality. Despite the importance of FL, the vast majority of work has focused on retrieving code, whereas other software artifacts such as models have been neglected. Furthermore, locating a piece of information from a query in a large repository is a challenging task, as it requires knowledge of the vocabulary used in the software artifacts. This can be alleviated by automatically reformulating the query (adding or removing terms). In this paper, we test four existing query reformulation techniques, which perform the best for FL in code but have never been used for FL in models. Specifically, we test these techniques in two industrial domains: a model-based family of firmware for induction hobs, and a model-based family of PLC software to control trains. We compare the results provided by our FL approach using the original query and the reformulated queries by means of statistical analysis. Our results show that reformulated queries do not improve the performance in models, which could point towards a new direction in the creation or reconsideration of these techniques to be applied in models.

Introduction

Feature location (FL) is the process of finding the set of software artifacts that realize a particular functionality of a software system. No maintenance activity can be completed without first locating the software artifact (e.g., code) that is relevant to the specific functionality [1]. Since FL is one of the main activities performed during software evolution [1] and up to 80% of a system's lifetime is spent on its maintenance and evolution [2], there is a great demand for FL approaches that can help developers find relevant software artifacts in a family of software products.

Many FL approaches use Information Retrieval (IR) techniques [1,3] such as Latent Semantic Indexing (LSI) [4], Latent Dirichlet Allocation (LDA) [5], and the Vector Space Model [6], and involve the formulation of a query in natural language (e.g., by the developer). These techniques are statistical methods that find the software artifacts relevant to a feature by analyzing and retrieving words that are similar to a query provided by a user. For example, during FL, a developer formulates a query that describes the feature to be located in the code. The query is then run by the IR technique, and a ranked list of software artifacts (e.g., classes or methods) is retrieved.
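The query-then-rank workflow described above can be illustrated with a minimal Vector Space Model sketch. It is only an illustration under stated assumptions, not the retrieval engine used in the paper: the artifact names and their texts are invented toy data, the tokenizer is a plain whitespace split, and the TF-IDF weighting is one common variant among several.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop-word removal)."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return (idf, vectors) for a dict mapping artifact id -> text."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    # Smoothed inverse document frequency (one common variant).
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Rank artifacts by cosine similarity to the query, best match first."""
    idf, vectors = tf_idf_vectors(docs)
    # Query terms that never occur in the corpus carry no weight.
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items()}
    scored = [(doc_id, cosine(q_vec, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical artifacts and texts, invented purely for illustration.
ARTIFACTS = {
    "PowerManager": "power manager adjust inverter power level",
    "InductorControl": "inductor hotplate cookware detect",
    "DoorController": "door open close train button",
}

ranking = rank("inverter power", ARTIFACTS)
for artifact, score in ranking:
    print(f"{artifact}: {score:.3f}")
```

In this toy corpus only one artifact shares vocabulary with the query, so it is ranked first and the others receive a score of zero. This also shows why a query written in a different vocabulary than the artifacts retrieves poorly.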

The performance of the retrieval depends greatly on the textual query and its relationship to the text contained in the software artifacts [7]. Hence, formulating a good query requires knowledge of the vocabulary of the software artifacts to be searched. This knowledge can be difficult to acquire in industrial environments that accumulate a vast amount of software over the years, which often emerges ad hoc using software reuse techniques such as duplication (the “clone-and-own” approach) instead of formalizing the variability among the family of software products. Moreover, in these industrial environments, software maintenance tasks are performed by people who did not participate in the development, so the vocabulary that they use in the textual query to locate features during maintenance tasks can differ from the vocabulary that was used during the development. These differences between the query and the text contained in the software artifacts degrade the performance of the retrieval.

To overcome these differences between the query and the text contained in the software artifacts, other FL approaches [7,8,9] refine the query using automatic reformulation techniques: expansion or reduction. A short query that retrieves irrelevant results will likely need an expansion strategy (i.e., adding terms) to improve its performance, whereas a verbose query may need a reduction strategy (i.e., removing terms), since performance tends to deteriorate with long queries [10].
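The two reformulation directions can be sketched in a few lines. This is a simplified illustration, not one of the four techniques evaluated in the paper: the reduction rule (dropping terms that occur in too many documents), the `max_doc_ratio` threshold, and the frequency-based expansion over top-ranked documents are all assumptions chosen for brevity, and the sample texts are invented.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def reduce_query(query, docs, max_doc_ratio=0.25):
    """Drop query terms that occur in more than max_doc_ratio of the documents.

    Terms found almost everywhere discriminate poorly between artifacts.
    The 0.25 default is an illustrative assumption.
    """
    doc_terms = [set(tokenize(text)) for text in docs]
    kept = []
    for term in tokenize(query):
        doc_freq = sum(1 for terms in doc_terms if term in terms)
        if doc_freq / len(doc_terms) <= max_doc_ratio:
            kept.append(term)
    return " ".join(kept)


def expand_query(query, top_docs, n_terms=2):
    """Append the n_terms most frequent terms of top_docs that the query lacks.

    top_docs stands for the highest-ranked results of an initial retrieval,
    which are simply assumed to be relevant.
    """
    query_terms = tokenize(query)
    counts = Counter(
        term
        for text in top_docs
        for term in tokenize(text)
        if term not in query_terms
    )
    added = [term for term, _ in counts.most_common(n_terms)]
    return " ".join(query_terms + added)


# Invented sample texts, for illustration only.
corpus = [
    "the inverter converts the power",
    "the inductor heats the cookware",
    "the door of the train opens",
    "the button closes the door",
]
reduced = reduce_query("the inverter power", corpus)

top_ranked = [
    "inverter power channel",
    "power channel inductor",
    "power provider",
]
expanded = expand_query("inverter", top_ranked, n_terms=2)

print(reduced)
print(expanded)
```

Reduction shortens a verbose query by discarding the ubiquitous term, while expansion lengthens a terse one with vocabulary taken from the artifacts themselves.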

To date, the vast majority of work in FL has focused on improving retrieval performance in code using automatic query reformulation strategies (by better performance, we mean retrieving the relevant software artifacts closer to the top of the list of results). Nevertheless, other software artifacts such as models have been neglected, even though models are the cornerstone of Model-Driven Development approaches, where they are used to generate code.

To address this gap, we evaluate whether automatic query reformulation strategies can improve the results of FL in models. The contribution of this paper is twofold.

  1. We test four automatic query reformulation techniques (Query reduction, Rocchio query expansion, RSV query expansion and Dice query expansion), which perform best for FL in code [7] and have never been used to locate features in models. Specifically, we test these techniques in two industrial domains: the model-based product family of the BSH group (www.bsh-group.com) and the model-based product family of CAF (www.caf.net/en).

     The BSH group is one of the largest manufacturers of home appliances in Europe. Its induction division has been producing Induction Hobs (sold under the brands of Bosch and Siemens) for the last 15 years. CAF produces a family of PLC software to control trains, which it has been developing for more than 25 years.

  2. We compare the results provided by our FL approach using the query as it is (baseline) with the results provided by the four reformulation techniques.

     The results of this paper suggest that current automatic query reformulation techniques should be reconsidered before being applied to models, since we found that using the query as it is leads to better results in models than applying the query expansion/reduction reformulations. We hope that these results help FL users who work with models to overcome the inertia of applying query reformulation techniques as they would to locate features in code. Moreover, these results could contribute towards a new direction in the creation of new query reformulation techniques, or the modification of existing ones, to improve the location of features in models.

The rest of the paper is structured as follows: Section 2 provides the required background on the automatic query reformulation techniques being compared. Section 3 presents the approach to perform feature location in models. Next, Section 4 presents the evaluation performed, and Section 5 shows the results. Section 6 discusses the results. Section 7 describes the threats to validity. Section 8 reviews the related work. Finally, Section 9 concludes the paper.

Section snippets

Query reformulation techniques

Researchers in the field of FL have proposed a large variety of approaches for automatic query reformulations for an initial query. These approaches belong to one of the following two categories [11]: query expansion approaches and query reduction approaches.

Next, we briefly introduce these categories, with emphasis on the automatic reformulation strategies that we use for FL in models.
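For orientation, two of the expansion strategies named in this paper build on well-known formulas from the IR literature. The textbook forms are given below as a sketch. The exact variants and parameter values used in the evaluation are not shown in this excerpt, so the symbols here ($\alpha$, $\beta$, $\gamma$, $D_r$, $D_{nr}$, $df$) are standard notation rather than the authors' own.

```latex
% Rocchio relevance feedback (textbook form).
% q_0: original query vector; D_r / D_nr: sets of relevant / non-relevant
% document vectors; alpha, beta, gamma: tunable weights.
\[
\vec{q}_{\mathrm{new}} = \alpha\,\vec{q}_{0}
  + \frac{\beta}{|D_r|} \sum_{\vec{d} \in D_r} \vec{d}
  - \frac{\gamma}{|D_{nr}|} \sum_{\vec{d} \in D_{nr}} \vec{d}
\]

% Dice coefficient between terms u and v.
% df_u, df_v: number of documents containing u, v;
% df_{u \wedge v}: number of documents containing both.
\[
\mathrm{Dice}(u, v) = \frac{2\, df_{u \wedge v}}{df_{u} + df_{v}}
\]
```

In a fully automatic setting there are no user relevance judgments, so $D_r$ is commonly taken to be the top-ranked documents of an initial retrieval and $\gamma$ is often set to zero.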

Our approach to locate features in a model-based product family

The objective of our approach is to obtain the model fragment in a product family that corresponds to a feature description provided as input. The upper part of Fig. 1 shows a simplified subset of the Induction Hobs Domain Specific Language (IHDSL) used by our industrial partner BSH to specify induction hobs. We use a simplified subset as a running example throughout the rest of the paper in order to improve legibility (since the IHDSL is composed of 46 meta-classes, 74 references among them and more

Evaluation

We conducted an evaluation to compare the results provided by our approach using the original query with those obtained using the four automatic query reformulation techniques that were presented in Section 2 (Query reduction, Rocchio query expansion, RSV query expansion and Dice query expansion) in two industrial model-based product families from two of our partners: BSH, the leading manufacturer of home appliances in Europe; and CAF, an international provider of railway solutions.

Results

In this section, we present the results obtained for each case study in the baseline and the four automatic query reformulations to answer each research question.

Discussion

Although query reformulation techniques have traditionally been applied to code (where satisfactory results have been obtained), their application to models for feature location is novel. Since models and code are considerably different, it is reasonable to think that these techniques are not going to work. However, there are experiences of text-based techniques traditionally applied to code that have provided satisfactory results in models. For example, natural language processing techniques [

Threats to validity

We use the classification of threats to validity of [37,38], which distinguishes four aspects of validity, to acknowledge the limitations of our evaluation.

Related work

Many feature location approaches have been proposed to address more than twenty tasks in software engineering (e.g., concept/feature/concern location, code retrieval and reuse, etc.) by finding relevant code taking textual information as input [1]. For example, Cavalcanti et al. [39] used IR techniques to assign change requests in software maintenance or evolution tasks based on context information. Kimmig et al. [40] proposed an approach for translating NL queries to concrete parameters

Concluding remarks

We have tested four existing automatic query reformulation techniques that add terms to or remove terms from an incoming feature description to check whether these techniques improve the performance of locating features in a model-based family of software products as they do in other software artifacts such as code. To this end, we have used our FL approach, which provides the model fragment from a given product family that realizes the incoming feature description. The requisites to apply our FL approach

Acknowledgements

This work has been partially supported by the Ministry of Economy and Competitiveness (MINECO) through the Spanish National R+D+i Plan and ERDF funds under the project Model-Driven Variability Extraction for Software Product Line Adoption (TIN2015-64397-R).

References (50)

  • S. García et al. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf. Sci. (2010)

  • B. Dit et al. Feature location in source code: a taxonomy and survey. J. Softw. Evol. Process (2013)

  • M.M. Lehman et al. A Paradigm for the Behavioural Modelling of Software Processes Using System Dynamics (2001)

  • V. Alves et al. An exploratory study of information retrieval techniques in domain analysis

  • S. Deerwester et al. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. (1990)

  • D.M. Blei et al. Latent Dirichlet allocation. J. Mach. Learn. Res. (2003)

  • G. Salton et al. Introduction to Modern Information Retrieval (1986)

  • S. Haiduc et al. Automatic query reformulations for text retrieval in software engineering

  • E. Hill et al. Improving source code search with natural language phrasal representations of method signatures

  • J. Yang et al. Inferring semantically related words from software context

  • G. Kumaran et al. Reducing long queries using query quality predictors

  • X.A. Lu et al. Query expansion/reduction and its impact on retrieval effectiveness

  • G. Gay et al. On the use of relevance feedback in IR-based concept location

  • C. Carpineto et al. A survey of automatic query expansion in information retrieval. ACM Comput. Surv. (2012)

  • G. Sridhara et al. Identifying word relations in software: a comparative study of semantic similarity tools

  • G. Salton. The SMART Retrieval System: Experiments in Automatic Document Processing (1971)

  • B. Sisman et al. Assisting code search with automatic query reformulation for bug localization

  • E. Hill et al. Automatically capturing source code context of NL-queries for software maintenance and reuse

  • L. Davis (1991)

  • Ø. Haugen et al. Adding standardized variability to domain specific languages

  • A.S. Sayyad et al. Scalable product line configuration: a straw to break the camel's back

  • A. Arcuri et al. Parameter tuning or default values? An empirical investigation in search-based software engineering. Empir. Softw. Eng. (2013)

  • A. Kotelyanskii et al. Parameter tuning for search-based test-data generation revisited: support for previous results

  • T.K. Landauer et al. An introduction to latent semantic analysis. Discourse Process (1998)

  • M. Revelle et al. Using data fusion and web mining to support feature location in software

    Francisca Pérez is an assistant professor in the SVIT Research Group (https://svit.usj.es) at San Jorge University. Her research interests include model-driven development, variability modeling, feature location, end-user development, and collaborative modeling. Pérez received a PhD in computer science from the Universitat Politècnica de València. Contact her at [email protected].

    Jaime Font is an assistant professor in the SVIT Research Group at San Jorge University. His research interests include reverse engineering, variability modeling, and feature location. Font received a PhD in computer science from the University of Oslo. Contact him at [email protected].

    Lorena Arcega is a PhD student in computer science at the University of Oslo and a researcher in the SVIT Research Group at San Jorge University. Her research interests are software evolution, variability modeling and models at run-time. Arcega received an MSc in Advanced Software Technologies from San Jorge University. Contact her at [email protected].

    Carlos Cetina is an associate professor at San Jorge University and the head of the SVIT Research Group. His research focuses on software product lines, variability modeling, feature location, and model-driven development. Cetina received a PhD in computer science from the Universitat Politècnica de València. More information about his background can be found at his website: carloscetina.com.
