1 Introduction

Software Defect Prediction (SDP) is a critical activity in software engineering and one of the most helpful activities during the testing phase of the System Development Life Cycle (SDLC). However, predicting defective modules is not a straightforward task [1].

Software defects are errors, flaws, mistakes, faults, or bugs in software. They may arise from a lack of developer experience, misunderstood requirements, or an uncontrolled development process, and they produce failures or unexpected results [2].

Software quality and reliability must meet user requirements within a constrained timespan, which calls for identifying and predicting defects in the early stages of the SDLC. SDP models therefore help quality assurance teams allocate resources to the most defect-prone modules [2,3,4].

Generally, there are three approaches to building SDP models:

  • Within-project SDP: the SDP model is built by gathering historical data from a software project (training phase) and predicts defects in the same project (testing phase), i.e. training and testing are applied to the same project. However, it cannot be applied to projects that have no historical data, so acceptable accuracy cannot be achieved for them [5].

  • Cross-projects SDP for similar datasets: used when a project's historical data is absent or insufficient to train and build an SDP model. The model is trained and developed on one project and applied to other (cross) projects. The drawback is that it requires projects with the same features and metrics [5].

  • Cross-projects SDP for heterogeneous (dissimilar) datasets: the SDP model predicts defects across projects with disparate datasets [5].

Recent research uses data mining methodologies that depend on machine learning as important models. Many techniques have been used in SDP, such as Support Vector Machine (SVM) [6], Naïve Bayes (NB) [7], Boosting [8], C4.5 [9], and Bagging [10]. Nevertheless, SDP models still suffer from a very important and challenging issue: prediction accuracy [11, 12].

This research uses the Spotted Hyena Optimizer (SHO) [13] as a classifier for predicting software defects across projects that share the same metrics but come from different datasets. SHO is developed and trained on one project and applied to predict defects in other projects (cross projects) with the same features and metrics. To locate the fittest classification rules, the experiments apply confidence (CONF) and support (SUP) as a multi-objective function on one project with historical data to build the SDP model, and then apply the resulting classification rules to projects that lack sufficient historical data, such as new projects. This assists the software engineering industry in improving quality within limited time and effort during development.

In this research, SHO is used for the first time as a classifier with a multi-objective fitness function combining CONF and SUP. The algorithm achieves better accuracy than the most significant data mining techniques in the field. SHO is a feasible meta-heuristic in terms of complexity and efficiency compared with traditional algorithms and has been used to obtain acceptable solutions for various design problems [13]. Previous research has paid little attention to the accuracy of traditional cross-project SDP models for similar datasets. Therefore, the experimental study uses SHO as a classifier with CONF and SUP as a multi-objective fitness function on one project with historical data to build the SDP model; this step yields the fittest classification rules, which are then used on projects without sufficient historical data, such as new projects.

The rest of this research is organized as follows: Sect. 2 describes recent techniques for predicting software defects. Section 3 discusses the SHO algorithm and its mathematical model. Section 4 presents a brief discussion of the proposed classifier. Section 5 shows the discussion and experimental study of the proposed algorithm. Finally, Sect. 6 lists the conclusion and future work.

2 Literature survey

Manual testing consumes 27% of development effort [14]; furthermore, it cannot detect all software defects. Within-project SDP also suffers from limited classification accuracy, because new projects lack the historical data needed to build the SDP model. Hence, cross-projects SDP is considered one of the most significant activities in producing defect-free software products, and it reduces the time and effort of testing teams. This section discusses the common related work on cross-projects SDP with similar datasets.

It is commonly preferred in SDP to learn from the locally accessible data of a software project (within-project SDP). This local data is obtained from the project's historical versions. A within-project SDP model is constructed by gathering historical data from a software system (training phase) and predicting defects in the same system (testing phase). This approach faces the challenge of predicting defects in new projects with little historical data, for which acceptable accuracy cannot be achieved. Local data may also be unavailable to the organization's team, for example because of technology changes or because no previously developed project has similar features [1]. Cross-project software defect prediction overcomes this problem: it is used when a project's historical data is absent or insufficient to train and build an SDP model. The SDP model is trained and developed on one project and applied to other projects, provided the projects have the same features and metrics [5]. The cross-projects SDP model is a classification model composed of a set of prediction rules gathered during the training phase.

Steffen Herbold [15] presented a benchmark of 26 cross-projects software defect prediction methodologies based on cost metrics. The benchmark demonstrated that predicting everything as defective was, on average, better than cross-project defect prediction under cost considerations. Moreover, the research demonstrated that the ranking of methodologies under cost metrics was uncorrelated with the ranking based on metrics that do not consider costs.

Fei Wu et al. [16] presented an effective, unified solution to both semi-supervised cross-project and semi-supervised within-project defect prediction problems, proposing a cost-sensitive kernelized semi-supervised dictionary learning technique.

Yun Zhang et al. [17] investigated seven composite techniques that combine several machine learning classifiers to improve cross-projects defect prediction. To evaluate the composite algorithms, they ran experiments on 10 open-source software systems from the PROMISE repository [18], comparing the composite techniques with a combined SDP model whose meta-classifier was logistic regression, using F-measure and cost metrics for evaluation. The experiments show that the proposed algorithms perform better than the compared algorithm.

Peng He et al. [19] developed TDSelector, which weights training data by defect count and similarity. They used logistic regression as the classifier and analyzed how several combinations of defect normalization and similarity affect prediction performance, comparing the approach with two other methods. The experiments were applied to 14 projects gathered from public repositories.

Chao Ni et al. [20] proposed a cluster-based feature-selection strategy that uses clusters of hybrid data to reduce distribution differences. It comprises two stages: a density-based clustering strategy clusters the features, and a ranking strategy selects them. For cross-project defect prediction, the research designed three heuristic ranking methods in the second stage. The experiments were applied to real-world projects.

Thomas Zimmermann et al. [21] studied cross-projects software defect prediction models on an extensive scale, running 622 cross-project predictions for 12 real applications. The results demonstrated that cross-project defect prediction did not lead to perfect or accurate predictions.

As indicated by previous work, there is a growing need to predict software defects when historical data is scarce. Moreover, this prediction is needed early in the software development life cycle because of the high cost of maintenance. Therefore, the proposed research aims to enhance cross-projects SDP, which still suffers from challenges and drawbacks such as prediction accuracy [12] and the problem of predicting defects in new projects with little historical data. The proposed cross-project SDP model uses the SHO algorithm [13] as a classifier. SHO has not been used as a classifier before, especially in this field, and CONF and SUP have not been used as a multi-objective fitness function with SHO for classification; as a result, the performance and accuracy of the algorithm are increased. In this research, SHO converges within around 4% of the total number of iterations, so the best-fit rule is reached in less time. The following section explains the spotted hyena optimizer algorithm.

3 Spotted hyena optimizer (SHO)

Meta-heuristic algorithms fall into three categories: physics-based, evolutionary, and swarm-based [22]. SHO [13] is a meta-heuristic algorithm motivated by the behavior of the spotted hyena. SHO has been evaluated on one unconstrained and five constrained engineering design problems: displacement of a loaded structure, speed reducer, welded beam, pressure vessel, compression spring, and rolling element bearing [13, 36]. SHO has also been used to classify heart disease [37]. The principal idea of SHO is the social relationship among spotted hyenas and their collective behavior. SHO mathematically models three stages of the spotted hyena's behavior: searching for, encircling, and attacking prey. SHO outperforms other meta-heuristic algorithms. The following subsections summarize the mathematical model of encircling, hunting, and exploiting the prey by the spotted hyenas.

3.1 Encircling the prey

Dhiman et al. [13] mathematically modeled the spotted hyenas' hunting hierarchy. They consider the present best solution to be the target prey, which is assumed to be near the optimum because the search space is not known a priori. After the best search agent is identified, the other search agents attempt to update their positions toward it. The following equations model encircling the prey:

$$\vec{D}_{h} = \left| {\vec{B} \times \vec{P}_{p} \left( x \right) - \vec{P}\left( x \right)} \right|$$
(1)
$$\vec{P}\left( {x + 1} \right) = \vec{P}_{p} \left( x \right) - \vec{E} \times \vec{D}_{h}$$
(2)

where \({\vec{\text{D}}}_{\text{h}}\) is the distance between the spotted hyena and the prey, x represents the current iteration, \({\vec{\text{B}}}\) and \({\vec{\text{E}}}\) are coefficient vectors, \({\vec{\text{P}}}_{\text{p}}\) is the position vector of the prey, and \({\vec{\text{P}}}\) is the position vector of a spotted hyena. The bars |·| denote the absolute value, and × denotes element-wise vector multiplication. The vectors \({\vec{\text{B}}}\) and \({\vec{\text{E}}}\) are computed as follows:

$$\vec{B} = 2 \times \vec{rd}_{1}$$
(3)
$$\vec{E} = 2\vec{h} \times \vec{rd}_{2} - \vec{h}$$
(4)
$$\vec{h} = 5 - \left( {iteration \times \left( {5 \div Max_{iteration} } \right)} \right)$$
(5)

where iteration = 1, 2, 3, …, \({\text{Max}}_{\text{iteration}}\). To properly switch between exploration and exploitation, \(\vec{h}\) decreases linearly from 5 to 0 over \({\text{Max}}_{\text{iteration}}\), and \(\vec{rd}_{1}\), \(\vec{rd}_{2}\) are random vectors in [0, 1].
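
As a concrete illustration, the following minimal Python sketch computes the coefficient vectors and one encircling step of Eqs. (1)-(5). The function and variable names are our own illustrative choices, not taken from [13].

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def coefficient_vectors(iteration, max_iteration, dim):
    """Eqs. (3)-(5): h decreases linearly from 5 to 0 over the iterations."""
    h = 5.0 - iteration * (5.0 / max_iteration)   # Eq. (5)
    B = 2.0 * rng.random(dim)                     # Eq. (3), rd1 in [0, 1]
    E = 2.0 * h * rng.random(dim) - h             # Eq. (4), rd2 in [0, 1]
    return B, E

def encircle(P, P_prey, B, E):
    """Eqs. (1)-(2): move a hyena at P toward the prey (current best) P_prey."""
    D_h = np.abs(B * P_prey - P)                  # Eq. (1): distance vector
    return P_prey - E * D_h                       # Eq. (2): updated position
```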

3.2 Hunting the prey

Spotted hyenas usually live and hunt in groups, relying on a network of trusted companions and on their ability to recognize the location of prey. To describe this behavior mathematically, the best search agent is assumed to know the location of the prey, and the other individuals form a cluster, moving toward this best individual.

The following equations present the mathematical model of hunting prey:

$$\vec{D}_{h} = \left| {\vec{B} \times \vec{P}_{h} - \vec{P}_{k} } \right|$$
(6)
$$\vec{P}_{k} = \vec{P}_{h} - \vec{E} \times \vec{D}_{h}$$
(7)
$$\vec{C}_{h} = \vec{P}_{k} + \vec{P}_{k + 1} + \cdots + \vec{P}_{k + N}$$
(8)

where \(\vec{P}_{h}\) is the position of the first best hyena and \(\vec{P}_{k}\) denotes the positions of the other hyenas. Here, N represents the number of hyenas, which is computed as follows:

$$N = count_{nos} \left( \vec{P}_{h} ,\vec{P}_{h + 1} ,\vec{P}_{h + 2} , \ldots ,\left( \vec{P}_{h} + \vec{M} \right) \right)$$
(9)

where \({\vec{\text{M}}}\) is a random vector in [0.5, 1], nos represents the number of solutions, and \({\vec{\text{C}}}_{\text{h}}\) is the cluster of the N optimal solutions.
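
Continuing the sketch above, the hunting stage of Eqs. (6)-(9) can be rendered as follows. Eq. (9) leaves the counting rule partly implicit, so this sketch adopts one plausible reading: N counts the candidate solutions whose fitness lies within a random band \(\vec{M}\) of the best hyena's fitness.

```python
def hunt(population, fitness_values, B, E):
    """Eqs. (6)-(8): move all agents toward the best hyena P_h and
    return the cluster C_h of the N best candidate solutions (Eq. 9)."""
    best = int(np.argmax(fitness_values))
    P_h = population[best]
    moved = P_h - E * np.abs(B * P_h - population)   # Eqs. (6)-(7), row-wise
    M = rng.uniform(0.5, 1.0)                        # random factor in [0.5, 1]
    # assumed reading of Eq. (9): solutions within M of the best fitness
    N = int(np.sum(fitness_values >= fitness_values[best] - M))
    order = np.argsort(fitness_values)[::-1][:N]
    return moved[order]                              # cluster C_h, Eq. (8)
```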

3.3 Exploiting the prey

To model attacking the prey, the value of \(\vec{h}\) is decreased; the variation in \(\vec{E}\) decreases accordingly, since \(\vec{h}\) declines from 5 to 0 over the iterations. The following equation represents the model of attacking the prey:

$$\vec{P}\left( {x + 1} \right) = \vec{C}_{h} \div N$$
(10)

where \({\vec{\text{P}}}\left( {{\text{x}} + 1} \right)\) saves the best solution and updates the positions of the other agents accordingly. The next section explains how the SHO algorithm is used as a classifier to find the most accurate classification rules in cross-project SDP for similar datasets.
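
Eq. (10) then sets the next position to the centroid of the cluster. The short sketch below completes the picture by chaining the three stages into one iteration; this is one plausible way to organize the loop, not the authors' exact implementation.

```python
def exploit(cluster):
    """Eq. (10): the next position is the centroid of the cluster C_h."""
    return np.sum(cluster, axis=0) / len(cluster)

def sho_step(population, fitness_fn, iteration, max_iteration):
    """One SHO iteration: evaluate, hunt (cluster), exploit, and re-encircle."""
    fitness_values = np.array([fitness_fn(p) for p in population])
    B, E = coefficient_vectors(iteration, max_iteration, population.shape[1])
    centroid = exploit(hunt(population, fitness_values, B, E))
    # every agent encircles the cluster centroid (assumed update order)
    return np.array([encircle(p, centroid, B, E) for p in population])
```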

4 Proposing SHO as a classifier

In new projects, cross-projects software defect prediction is viewed as one of the essential tasks in software engineering because of the shortage of historical data [23, 24]. Without it, acceptable accuracy cannot be achieved, and real-life problems such as efficiency and complexity may grow; in particular, within-project SDP cannot reach acceptable accuracy for new projects. Sections 1 and 2 briefly discuss how within-project and cross-projects SDP depend on machine learning. Hence, there is an increasing need to obtain near-optimal solutions through meta-heuristic techniques [22]. To face this challenge, this research uses the SHO algorithm as a classifier for predicting defects when a project's historical data is absent or insufficient to train and build an SDP model.

The following figure shows the data flow and the essential processes of the SHO classifier through cross-projects SDP:

  • The instances are built from software archives such as version control. Each instance represents a class, file, package, or method that is either defective or not.

  • This research discretizes the dataset with the RapidMiner 5.3 tool [25], because exact matching between dataset instances and individuals of the population is very difficult. The SUP and CONF degrees then help the multi-objective function assess candidate classification rules (an illustrative discretization sketch follows this list).

  • SHO, used as a classifier, repeatedly searches for the fittest rules (classification rules) over a random subset of one project's dataset (the training dataset).

  • For new projects with similar datasets but no available historical data, the classification rules produced in the training phase are used to predict the defects (Fig. 1).

    Fig. 1: Cross-projects SDP model utilizing SHO as a classifier

  • According to the results, accuracy, F-measure, specificity, recall, and precision are reported and compared with other data mining techniques such as Artificial Neural Network (ANN), Support Vector Machine, Naïve Bayes, Bagging, Random Forest, K-Nearest Neighbors (K-NN), and C4.5.
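
The discretization itself is done in RapidMiner; as an illustrative stand-in only, the sketch below shows equal-width binning, a common discretization scheme. The bin count and function name are assumptions, not settings reported in the paper.

```python
import numpy as np

def equal_width_discretize(column, n_bins=4):
    """Map a numeric feature column to integer bin labels 0..n_bins-1."""
    lo, hi = float(np.min(column)), float(np.max(column))
    edges = np.linspace(lo, hi, n_bins + 1)
    # digitizing against the interior edges yields labels 0..n_bins-1
    return np.digitize(column, edges[1:-1])

# Example: discretize a lines-of-code feature
loc = np.array([12, 45, 3, 230, 88, 160])
print(equal_width_discretize(loc))  # [0 0 0 3 1 2]
```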

The following subsections explain the multi-objective fitness function, phases, and flow chart.

4.1 Multi-objective fitness function

The objective function [26] evaluates how fit a candidate solution is for the problem at hand. In the experiments, confidence and support [27] together form the multi-objective function used to find the best classification rules, as explained in this subsection. First, the SUP of a rule is computed from the number of instances that satisfy the rule. It can be expressed as follows:

$$SUP = \left( {COUNT\_SS/R} \right) \times 100$$
(11)

Second, CONF is calculated. It is the ratio of the number of instances that satisfy the entire rule (antecedent and consequence) to the number of instances that satisfy only the antecedent. It can be expressed as follows:

$$CONF = \left( {COUNT\_SS/COUNT\_CC} \right) \times 100$$
(12)

where \(COUNT\_SS\) is the number of instances that satisfy the entire rule, R is the total number of instances in the dataset, and \(COUNT\_CC\) is the number of instances that satisfy the antecedent of the rule.

Finally, the multi-objective fitness function (FT) for each rule is calculated as follows:

$$FT = W_{1} \times SUP + W_{2} \times CONF$$
(13)

where \(W_{1}\) and \(W_{2}\) are weights given to the SUP and CONF functions according to their relative significance.
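
The sketch below evaluates Eqs. (11)-(13) for a single rule over a discretized dataset. The rule representation (a dict of feature tests plus a predicted class label) and the equal weights are illustrative assumptions; the paper does not specify its encoding or weight values.

```python
def rule_fitness(rule, dataset, w1=0.5, w2=0.5):
    """FT of Eqs. (11)-(13) for one rule over a list of (features, cls) pairs."""
    antecedent, label = rule          # e.g. ({feature_index: bin_value}, 1)
    R = len(dataset)
    count_cc = 0                      # instances satisfying the antecedent
    count_ss = 0                      # instances satisfying the entire rule
    for features, cls in dataset:
        if all(features[i] == v for i, v in antecedent.items()):
            count_cc += 1
            if cls == label:
                count_ss += 1
    sup = (count_ss / R) * 100                               # Eq. (11)
    conf = (count_ss / count_cc) * 100 if count_cc else 0.0  # Eq. (12)
    return w1 * sup + w2 * conf                              # Eq. (13)

# Example: rule "if feature 0 falls in bin 3, predict defective (1)"
data = [((3, 1), 1), ((3, 0), 0), ((0, 2), 0)]
print(rule_fitness(({0: 3}, 1), data))  # SUP = 33.3, CONF = 50.0 -> FT ~ 41.7
```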

4.2 Flowchart and phases of the SHO as a classifier

This subsection presents the phases of the SHO classifier through cross-projects SDP together with the multi-objective fitness function.

As indicated in Fig. 2:

Fig. 2: Flow chart of the SHO classifier through cross-projects SDP

  • Create a population of spotted hyenas and pick the initial parameters.

  • Apply the combination of SUP and CONF, used as a multi-objective function, to each individual spotted hyena to assess the candidate classification rules.

  • Then, the position of each hyena is updated by learning from the spotted hyena with the maximum multi-objective fitness (SUP and CONF).

  • After that, SHO repeatedly searches for the best spotted hyena (the classification rules) over a random subset of a project's dataset (training phase). The resulting classification rules are then used as input to the prediction and classification processes on other, new projects that lack historical data.

This research follows the steps indicated in Algorithm 1 to explain the stages of the SHO classifier through cross-projects SDP. Moreover, it answers the question of how to predict whether instances of new projects, or projects with a shortage of historical data (cross-projects SDP), are defective, using the SUP and CONF concepts as a multi-objective fitness function.

First, the SHO classifier sets the initial parameters as indicated in Algorithm 1 (pop = population size = 500, F = number of features or metrics = 22, and MaxIter = maximum number of iterations = 50). Then the multi-objective fitness function is calculated for each spotted hyena agent; during optimization, the SUP and CONF of the classification rules are used as given in Eqs. (11)-(13). This lets SHO act as a classifier by searching for suitable classification rules among the initial spotted hyenas (random rules). SHO iterates to find the best classification rules, after which the best solutions are clustered. The position of each spotted hyena is then updated, checking whether any agent goes beyond the search boundary and correcting it, and the new parameter values and multi-objective fitness of each updated agent are recalculated. Finally, SHO returns the best classification rules, which are used for predicting defects in new projects or projects with a shortage of historical data; hence the term cross-projects SDP.

Algorithm 1: Phases of the SHO classifier through cross-projects SDP
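
Reusing the sketches from Sects. 3 and 4.1, the training loop of Algorithm 1 could look roughly as follows. The rule encoding (a real vector whose rounded entries select a bin test per feature, with out-of-range entries meaning "feature unused") and the boundary handling are our own assumptions; the paper's Algorithm 1 may differ in detail.

```python
def train_sho_classifier(dataset, pop=500, n_features=22, max_iter=50, n_bins=4):
    """Search for the classification rule with the highest FT (Eqs. 11-13)."""
    def decode(vec):
        antecedent = {j: int(round(v)) for j, v in enumerate(vec)
                      if 0 <= round(v) <= n_bins - 1}
        return (antecedent, 1)   # predict 'defective'; a second run covers class 0

    def ft(vec):
        return rule_fitness(decode(vec), dataset)

    population = rng.uniform(-1.0, n_bins, size=(pop, n_features))
    for it in range(max_iter):
        population = sho_step(population, ft, it, max_iter)
        # keep agents inside the search boundary, as Algorithm 1 requires
        population = np.clip(population, -1.0, n_bins)
    return decode(max(population, key=ft))
```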

5 Experimental study

This section discusses the study of SHO as a classifier in the cross-projects SDP approach, using SUP and CONF as a multi-objective fitness function.

5.1 Experiment setup

The experiments in this research used the following tools and environment:

  • RapidMiner 5 tool [25] is used for the discretization process.

  • MATLAB [28] is used as the simulation tool during implementation.

  • The PROMISE [18] and OPENML [29] open-source dataset repositories.

  • WEKA 3.6 tool [30] for comparisons.

  • A PC with an Intel(R) Core(TM) i5 CPU and 4 GB of RAM.

KC1, JM1, and KC2 are NASA metrics datasets related to software defects; they come from storage-management software for receiving and processing ground data [29]. They include the various features used in the experiments, taken from the OPENML [29] and PROMISE [18] archives. Table 1 lists the dataset parameters: software components, software features, number of defects, and percentage of defects. The common metrics of the datasets are shown in Table 2. These metrics are useful because they can generate highly accurate defect predictors, are easy to use since they can be collected cheaply and automatically (e.g. lines of code, LOC), and are widely used, as many researchers rely on static features to guide software quality prediction [18, 29, 33, 34]. Using these datasets, the SHO classifier is trained to extract suitable classification rules for predicting defects in new projects or projects without sufficient historical data (cross-projects SDP). Classification accuracy and precision, the most popular measurements, are compared against other techniques in the WEKA 3.6 data mining tool.

Table 1 Details of the datasets used through cross-projects SDP
Table 2 Feature details of the datasets used through cross-projects SDP
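
For reference, these datasets are distributed as ARFF files by the PROMISE and OpenML archives; a sketch of loading one with SciPy follows. The local file name and the byte-string label encoding are assumptions that may vary between mirrors.

```python
from scipy.io import arff
import numpy as np

data, meta = arff.loadarff("kc1.arff")          # hypothetical local path
features = np.array([list(row)[:-1] for row in data], dtype=float)
labels = np.array([row[-1] == b"true" for row in data], dtype=int)
print(features.shape, labels.mean())            # instance count and defect ratio
```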

The experiments rely on the confusion matrix [31]. As shown in Table 3, it is a table routinely used to describe the performance of a classification model, holding the predicted and actual class labels. Table 3 shows the four possible outcomes of the SHO classifier:

Table 3 Confusion matrix model

  • True Positives (TP): defective modules correctly classified as defective.

  • False Positives (FP): non-defective modules incorrectly classified as defective.

  • True Negatives (TN): non-defective modules correctly classified as non-defective.

  • False Negatives (FN): defective modules incorrectly classified as non-defective.

Different measurements used by data mining techniques, such as F-measure, sensitivity (S), recall, precision (P), specificity (SP), and accuracy (ACC) [2], are calculated from the confusion matrix.

$$ACC = \left( TP + TN \right) / \left( TP + TN + FP + FN \right)$$
(14)

Another measurement is specificity (SP), which is expressed as follows:

$$SP = TN / \left( TN + FP \right)$$
(15)

The next measurement is sensitivity (S). It can be expressed as follows:

$$S = TP/ \left( {TP + FN} \right)$$
(16)

Then the precision (P) is calculated as follows:

$$P = TP / \left( TP + FP \right)$$
(17)

The F-measure (F_M) is calculated as follows:

$$F\_M = \left( {2 \times P \times S} \right)/ \left( {P + S} \right)$$
(18)

Finally, the weighted average (W_A) of each measurement is calculated as in the following equation:

$$W\_A \text{ of } F\_M = \left( \left( F\_M_{c1} \times N_{c1} \right) + \left( F\_M_{c2} \times N_{c2} \right) \right) / N_{c1 + c2}$$
(19)

where \(F\_M_{c1}\) and \(F\_M_{c2}\) are the F-measures of class 1 and class 2 respectively, \(N_{c1}\) and \(N_{c2}\) are the numbers of instances in class 1 and class 2, and \(N_{c1 + c2}\) is the total number of instances in the dataset.
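
Eqs. (14)-(19) translate directly into code. The sketch below (illustrative names) computes the per-class measures from the four confusion-matrix counts and the weighted average of any measure:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Eqs. (14)-(18) from the confusion-matrix counts of Table 3."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # Eq. (14) accuracy
    sp = tn / (tn + fp)                     # Eq. (15) specificity
    s = tp / (tp + fn)                      # Eq. (16) sensitivity (= recall)
    p = tp / (tp + fp)                      # Eq. (17) precision
    f_m = (2 * p * s) / (p + s)             # Eq. (18) F-measure
    return {"ACC": acc, "SP": sp, "S": s, "P": p, "F_M": f_m}

def weighted_average(m_c1, m_c2, n_c1, n_c2):
    """Eq. (19): class-size-weighted average of a per-class measure."""
    return (m_c1 * n_c1 + m_c2 * n_c2) / (n_c1 + n_c2)
```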

The convergence rate [32] is another common measure of an optimization algorithm such as SHO. Convergence characterizes the sequence of solutions obtained over the iterations until a suitable point is reached in less time.

5.2 Experimental results

This section reports the precision and classification accuracy of SHO through cross-projects SDP.

  • The SHO algorithm is applied and trained as a classifier on one dataset to extract the best classification rules.

  • The SHO algorithm is executed for 15 runs using a percentage-split technique: 60% of the training dataset for training, with random instances of the other datasets for testing (see the sketch after this list).

  • Cross-project defects (in the other datasets mentioned above) are predicted using the classification rules extracted in the previous step.

  • The convergence of SHO in finding the classification model characterizes the relationship between the iterations and the values of the multi-objective fitness function.
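
A hedged sketch of this protocol, reusing the earlier training sketch: 15 independent runs, a 60% split of the training project, and random instances of another project for testing. The helper names and the "match-the-antecedent" prediction rule are assumptions.

```python
def evaluate_cross_project(train_dataset, test_dataset, runs=15, split=0.6):
    """Train SHO on 60% of one project; test the rule on another project."""
    accuracies = []
    for _ in range(runs):
        idx = rng.permutation(len(train_dataset))
        cut = int(split * len(train_dataset))
        train = [train_dataset[i] for i in idx[:cut]]
        antecedent, label = train_sho_classifier(train)
        correct = 0
        for features, cls in test_dataset:
            matches = all(features[i] == v for i, v in antecedent.items())
            predicted = label if matches else 1 - label
            correct += int(predicted == cls)
        accuracies.append(correct / len(test_dataset))
    return accuracies
```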

5.2.1 Training case 1: KC1

“KC1” contains 2109 instances, 22 attributes, and 326 defects, i.e. 15.5% of the instances are defective. The SHO classifier is trained to extract the best classification rules, using SUP and CONF as the multi-objective fitness function. The classification rules are then used to predict software defects on random instances of the other projects' datasets (JM1, KC2), i.e. cross-projects. The confusion matrices for predicting random instances of JM1 and KC2 are shown in Tables 4 and 5 respectively. These tables summarize the performance of the SHO classifier trained on the KC1 dataset and tested on random instances of the other datasets (JM1, KC2). The weighted average of each performance measure, such as precision and accuracy, is then calculated.

Table 4 Confusion matrix of training on KC1 and predicting in JM1
Table 5 Confusion matrix of training on KC1 and predicting in KC2

These measurements of the SHO classifier are compared with other data mining techniques via the WEKA 3.6 tool, as shown in Tables 6 and 7: SVM, NB, ANN, C4.5, Principal Component Analysis (PCA) for feature reduction followed by ANN, K-NN, and Random Forest.

Table 6 Performance measure of SHO classifier and other techniques (train KC1, predict JM1)
Table 7 Performance measure of SHO classifier and other techniques (train KC1, predict KC2)

In Tables 6 and 7, the SHO classifier's weighted averages of accuracy, specificity, precision, recall, false-positive rate, and F-measure are derived from Tables 4 and 5 using Eqs. (14)-(19). These values result from training the SHO classifier and applying the resulting best classification rules to random instances of the JM1 and KC2 datasets respectively.

Figures 3 and 4 show the comparison results for precision and accuracy when SHO is trained on KC1 and the resulting classification rules are applied to the JM1 and KC2 datasets respectively. The results report precision values of 82.2 and 91.6 and accuracy values of 82.4 and 90.7 for JM1 and KC2 respectively. Therefore, the SHO classifier is the best in terms of precision and accuracy through cross-projects SDP. In this case, the SHO algorithm converges to a suitable point (the best rule) after 30% of the iterations.

Fig. 3: Comparison between SHO classifier and other techniques (train KC1, predict JM1)

Fig. 4: Comparison between SHO classifier and other techniques (train KC1, predict KC2)

5.2.2 Training case 2: JM1

“JM1” contains 10,885 instances, 22 attributes, and 2106 defects, i.e. 19.34% of the instances are defective. Tables 8 and 9 show the confusion matrices obtained by training the SHO classifier on the JM1 dataset and applying the resulting classification rules to KC1 and KC2 respectively.

Table 8 Confusion matrix of training on JM1 and predicting in KC1
Table 9 Confusion matrix of training on JM1 and predicting in KC2

The confusion matrices for the KC1 and KC2 datasets in Tables 8 and 9 lead to Tables 10 and 11 respectively. These tables and Figs. 5 and 6 present the performance measures explained above: precision values of 87.0 and 92.7 and accuracy values of 84.6 and 92.0 for KC1 and KC2 respectively when trained on JM1. Therefore, the SHO classifier is the best in terms of precision and accuracy through cross-projects SDP. SHO converges after around 8% of the iterations, a considerably fast rate for reaching the best-fit rules.

Table 10 Performance measure of SHO classifier and other techniques (train JM1, predict KC1)
Table 11 Performance measure of SHO classifier and other techniques (train JM1, predict KC2)
Fig. 5: Comparison between SHO classifier and other techniques (train JM1, predict KC1)

Fig. 6: Comparison between SHO classifier and other techniques (train JM1, predict KC2)

5.2.3 Training case 3: KC2

“KC2” contains 522 instances, 22 attributes, and 107 defects, i.e. 20.5% of the instances are defective. Tables 12 and 13 show the confusion matrices obtained by training the SHO classifier on the KC2 dataset and applying the resulting classification rules to KC1 and JM1 respectively.

Table 12 Confusion matrix of training on KC2 and predicting in KC1
Table 13 Confusion matrix of training on KC2 and predicting in JM1

Similarly, Tables 14 and 15 and Figs. 7 and 8 present the performance measures: precision values of 84.8 and 85.1 and accuracy values of 86.6 and 81.8 for KC1 and JM1 respectively when trained on KC2. Therefore, the SHO classifier is the best in terms of precision and accuracy through cross-projects SDP. SHO converges after around 8% of the iterations, so the best-fit rule is reached in less time.

Table 14 Performance measure of SHO classifier and other techniques (train KC2, predict KC1)
Table 15 Performance measure of SHO classifier and other techniques (train KC2, predict JM1)
Fig. 7: Comparison between SHO classifier and other techniques (train KC2, predict KC1)

Fig. 8: Comparison between SHO classifier and other techniques (train KC2, predict JM1)

Table 16 presents the descriptive statistics of the SHO classifier in terms of accuracy. The SHO algorithm is executed 15 times using the percentage-split technique on the training dataset (60%) with random instances of the other datasets for testing (Table 16).

Table 16 Descriptive statistics of the SHO as a classifier in terms of accuracy

Table 17 reports the standard deviation of each algorithm's accuracy. The SHO classifier has the lowest error most of the time. If two algorithms tie in accuracy, the error rate breaks the tie: the lower the error, the higher the accuracy.

Table 17 Standard deviation for algorithms’ accuracy

5.3 Experimental discussion

According to the previous section, the features of the datasets used in this research are extracted from:

  • The McCabe source-code feature extractor, which argues that code with complicated pathways is more error-prone; its metrics therefore reflect the pathways within a code module [33].

  • The Halstead source-code feature extractor, which argues that code that is hard to read is more likely to be fault-prone; it estimates reading complexity by counting the number of concepts in a module, e.g. the number of unique operators [34].

These features describe code properties related to software quality. The McCabe and Halstead measures are module-based, where a module (a function or method) is the smallest unit of functionality [35]. These features are studied because they are useful, easy to use, and widely used.

The experiments (Cases 1, 2, and 3) show that many McCabe and Halstead features have a strong impact on generating highly accurate defect predictors, such as McCabe's line count of code (LOC), cyclomatic complexity v(g), essential complexity ev(g), and design complexity iv(g); Halstead's effort (e), time estimator (t), line count (LOCode), and count of lines of comments (LOComment); and the total operators (Total_op). Other features do not have the same degree of impact, such as Halstead's volume (v), total operands (total_opnd), unique operators (unique_op), and unique operands (unique_opnd).

The experiments (Cases 1, 2, and 3) show that the SHO classifier through cross-projects SDP outperforms the other techniques in precision and accuracy, with averages of 87.3 and 86.4 respectively.

General limitations of meta-heuristic search algorithms are convergence speed and long computational time; SHO proved able to overcome these limitations. Figure 9 shows the convergence of the SHO algorithm in finding the SDP classification model. The figure characterizes the relationship between the algorithm iterations and the values of the multi-objective fitness function during one experimental run. The SHO algorithm converges after 4 out of 50 iterations, a very fast convergence rate, i.e. it reaches a suitable point (the best rule) in less time. The SHO algorithm is executed for 15 runs using the percentage-split technique on the training dataset (60%) with random instances of the other datasets for testing.

Fig. 9: The convergence rate of SHO

6 Conclusion and future work

Recently, SDP has emerged as an important issue in the software engineering industry. Various techniques have been presented for enhancing SDP, but enhancement faces a shortage of historical data. This research proposed a feasible solution for cross-projects SDP using the SHO algorithm as a classifier. Classification accuracy is determined by training the SHO classifier on one project's dataset and applying the resulting classification rules to different projects. SUP and CONF are used as a multi-objective fitness function to identify suitable classification rules. Experimental results indicate that the SHO classifier outperforms other techniques, with an average accuracy of 86.4%. Moreover, precision, specificity, sensitivity, F-measure, and recall are calculated for the SHO classifier and compared with the other techniques in WEKA 3.6. These experiments demonstrate that the SHO classifier can efficiently predict software defects across projects.

In the future, feature-reduction algorithms could be applied before the SHO classifier; they would extract and select the most significant features, enhancing the classification process. Using the classifier in cross-projects SDP for heterogeneous (dissimilar) datasets is another fascinating direction. Although SHO overcomes some limitations of meta-heuristic algorithms, further investigation is required for other issues such as trapping in local optima, tuning many parameters, difficult encoding schemes, and good performance being limited to real or binary search spaces.