Abstract

Ensemble learning employs multiple individual classifiers and combines their predictions, which can achieve better performance than a single classifier. Considering that different base classifiers contribute differently to the final classification result, this paper assigns greater weights to the classifiers with better performance and proposes a weighted voting approach based on differential evolution. After optimizing the weights of the base classifiers by differential evolution, the proposed method combines the results of the classifiers according to the weighted voting combination rule. Experimental results show that the proposed method not only improves classification accuracy but also has strong generalization ability and universality.

1. Introduction

Ensemble learning is a new direction in machine learning, in which a number of individual classifiers are trained and some of them are selected for the ensemble. It has been shown that the combination of multiple classifiers can be more effective than any individual one [1].

From a technical point of view, ensemble learning is implemented mainly in two steps: training weak base classifiers and selectively combining the member classifiers into a stronger classifier. Usually the members of an ensemble are constructed in one of two ways. One is to apply a single learning algorithm, and the other is to use different learning algorithms over a dataset [2]. Then, the base classifiers are combined to form a decision classifier. Generally, to obtain a good ensemble, the base learners should be as accurate as possible and as diverse as possible. How to choose an ensemble of accurate and diverse base learners is therefore a major concern of many researchers [3].

In recent years, more and more researchers have become concerned with ensemble learning [4]. There are many effective ensemble methods, such as boosting [5], bagging [6], and stacking [7]. Boosting is a method of producing highly accurate prediction rules by combining many "weak" rules which may be only moderately accurate. There are many boosting algorithms; the main variation among them is their method of weighting training samples and hypotheses. AdaBoost is very popular and perhaps the most significant historically, as it was the first algorithm that could adapt to the weak learners. Bagging trains a number of base learners, each from a different bootstrap sample, by calling a base learning algorithm. A bootstrap sample is obtained by subsampling the training dataset with replacement, where the size of the sample is the same as that of the training dataset. In a typical implementation of stacking, a number of first-level individual learners are generated from the training dataset by employing different learning algorithms. Those individual learners are then combined by a second-level learner, which is called a metalearner.
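
As a concrete illustration of the bootstrap-sampling step of bagging, the following Python sketch (not part of the original study; the names are hypothetical) draws a sample of the same size as the training set, with replacement:

```python
import random

def bootstrap_sample(dataset):
    # Sample len(dataset) items uniformly with replacement, so each
    # base learner of the bagging ensemble sees a different view of the data.
    n = len(dataset)
    return [dataset[random.randrange(n)] for _ in range(n)]

# e.g. five bootstrap samples for five bagged base learners
training_set = list(range(100))  # placeholder training examples
samples = [bootstrap_sample(training_set) for _ in range(5)]
```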

Among the most popular combination schemes, majority voting and weighted voting are widely used for classification. Simple majority voting is a decision rule that selects one of many alternatives based on the predicted classes with the most votes. Majority voting does not require any parameter tuning once the individual classifiers have been trained [8, 9]. In the case of weighted voting, the voting weights should vary among the different output classes in each classifier: the weight should be high for a particular output class for which the classifier performs well. So, it is a crucial issue to select appropriate weights of votes for all the classes per classifier [2]. The weighting problem can be viewed as an optimization problem and can therefore be solved by taking advantage of artificial intelligence techniques such as genetic algorithms (GA) and particle swarm optimization (PSO). The existing literature shows the benefits of these methods for improving classification performance [2].
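
As a minimal sketch of the two combination rules discussed above (illustrative Python, not tied to any particular toolkit), majority voting simply counts the predicted labels, while weighted voting sums the per-classifier weights assigned to each label:

```python
from collections import Counter

def majority_vote(predictions):
    # predictions: one predicted class label per base classifier
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    # weights[i] is the weight of classifier i (e.g. found by GA, PSO, or DE)
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

print(majority_vote(["A", "B", "B"]))                   # -> "B"
print(weighted_vote(["A", "B", "B"], [0.9, 0.2, 0.3]))  # -> "A"
```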

Differential evolution (DE) is a simple, efficient, population-based evolutionary algorithm for global numerical optimization [10]. Due to its simple structure, ease of use, and robustness, DE has been successfully applied in many fields, including data mining, pattern recognition, digital filter design, and multiobjective optimization [11–13]. This paper describes a weighted voting ensemble learning scheme in which the weight values of each base classifier are optimized by the DE algorithm.

This paper is divided into five sections. Section 2 introduces differential evolution. The proposed approach is presented in Section 3. Empirical studies, results, and discussions are presented in Section 4. Conclusions and future work are presented in Section 5.

2. Differential Evolution

The differential evolution algorithm was proposed by Storn and Price [10]. DE optimizes a problem by maintaining a population of candidate solutions, creating new candidate solutions by combining existing ones according to its simple formulae, and then keeping whichever candidate solution has the best score or fitness on the optimization problem at hand. DE starts with an initial population of $N$ individuals $\{X_{i,G} \mid i = 1, 2, \ldots, N\}$, where the index $i$ denotes the $i$th solution of the population at generation $G$. An individual is defined as a $D$-dimensional vector $X_{i,G} = (x_{i,G}^{1}, x_{i,G}^{2}, \ldots, x_{i,G}^{D})$. There are three main operations of DE that are repeated till the stopping criterion is met. They are briefly described below.

Mutation. The mutation operation creates a donor (mutant) vector $V_{i,G}$ corresponding to each population member or target vector $X_{i,G}$ in the current generation. The most frequently referred mutation strategies are presented below [14]:

DE/rand/1:
$$V_{i,G} = X_{r_{1},G} + F \cdot (X_{r_{2},G} - X_{r_{3},G}). \tag{1}$$

DE/best/1:
$$V_{i,G} = X_{\mathrm{best},G} + F \cdot (X_{r_{1},G} - X_{r_{2},G}). \tag{2}$$

DE/current-to-best/1:
$$V_{i,G} = X_{i,G} + F \cdot (X_{\mathrm{best},G} - X_{i,G}) + F \cdot (X_{r_{1},G} - X_{r_{2},G}). \tag{3}$$

DE/best/2:
$$V_{i,G} = X_{\mathrm{best},G} + F \cdot (X_{r_{1},G} - X_{r_{2},G}) + F \cdot (X_{r_{3},G} - X_{r_{4},G}). \tag{4}$$

DE/rand/2:
$$V_{i,G} = X_{r_{1},G} + F \cdot (X_{r_{2},G} - X_{r_{3},G}) + F \cdot (X_{r_{4},G} - X_{r_{5},G}). \tag{5}$$

The indexes $r_{1}, r_{2}, r_{3}, r_{4}, r_{5}$ represent random and mutually different integers generated within the range $[1, N]$ and also different from the index $i$. $F$ is a mutation scaling factor within the range $[0, 2]$, usually less than 1. The vector $X_{\mathrm{best},G}$ is the individual with the best fitness in generation $G$.

Crossover. After mutation, a crossover operation is performed between the target vector $X_{i,G}$ and its corresponding mutant vector $V_{i,G}$ to form the trial vector $U_{i,G} = (u_{i,G}^{1}, u_{i,G}^{2}, \ldots, u_{i,G}^{D})$. For each of the $D$ variables,
$$u_{i,G}^{j} = \begin{cases} v_{i,G}^{j}, & \text{if } \mathrm{rand}_{j}(0,1) \le \mathrm{CR} \text{ or } j = j_{\mathrm{rand}}, \\ x_{i,G}^{j}, & \text{otherwise}, \end{cases} \tag{6}$$
where CR is a crossover control parameter, called the crossover rate, within the range $[0, 1]$; $\mathrm{rand}_{j}(0,1)$ is a uniformly distributed random number drawn anew for each $j$th component of the $i$th vector; and $j_{\mathrm{rand}} \in \{1, 2, \ldots, D\}$ is a randomly chosen index, which ensures that $U_{i,G}$ gets at least one component from $V_{i,G}$.

Selection. After reproduction of the trial individual $U_{i,G}$, the selection operation compares it to its corresponding target individual $X_{i,G}$ and decides whether the target or the trial individual survives to the next generation $G + 1$. The selection operation is described as
$$X_{i,G+1} = \begin{cases} U_{i,G}, & \text{if } f(U_{i,G}) \le f(X_{i,G}), \\ X_{i,G}, & \text{otherwise}, \end{cases} \tag{7}$$
where $f$ is the objective function to be optimized (written here for minimization), and the selection ensures that each member of the next generation is the fitter of the target and trial individuals. From (7), we can see that if the trial individual $U_{i,G}$ is better than the target individual $X_{i,G}$, namely, $f(U_{i,G}) \le f(X_{i,G})$, then it replaces the target individual in the next generation $G + 1$; otherwise the target individual is retained.
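
To make the three operations concrete, the following NumPy sketch performs one DE generation with DE/rand/1 mutation, binomial crossover, and greedy selection on a minimization problem. It is an illustrative implementation only, not the code used in this paper.

```python
import numpy as np

def de_generation(pop, fitness, objective, F=0.5, CR=0.9, rng=None):
    """One DE generation: DE/rand/1 mutation (1), binomial crossover (6), selection (7)."""
    if rng is None:
        rng = np.random.default_rng()
    N, D = pop.shape
    new_pop, new_fit = pop.copy(), fitness.copy()
    for i in range(N):
        # Mutation: r1, r2, r3 mutually different and different from i
        r1, r2, r3 = rng.choice([j for j in range(N) if j != i], size=3, replace=False)
        v = pop[r1] + F * (pop[r2] - pop[r3])
        # Crossover: binomial recombination with a guaranteed component j_rand
        j_rand = rng.integers(D)
        mask = rng.random(D) <= CR
        mask[j_rand] = True
        u = np.where(mask, v, pop[i])
        # Selection: keep the trial vector if it is at least as good
        fu = objective(u)
        if fu <= fitness[i]:
            new_pop[i], new_fit[i] = u, fu
    return new_pop, new_fit

# Usage: minimize the sphere function in 5 dimensions
rng = np.random.default_rng(0)
pop = rng.random((20, 5))
sphere = lambda x: float(np.sum(x ** 2))
fit = np.array([sphere(x) for x in pop])
for _ in range(100):
    pop, fit = de_generation(pop, fit, sphere, rng=rng)
print(fit.min())
```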

To improve optimization performance, DE algorithms are continually being developed, and many different strategies for performing crossover and mutation have been proposed [15–18].

3. Our Proposed Approach

This section describes the proposed weighted voting ensemble learning method based on differential evolution (DEWVote). In our proposed method, we select $D$ base classifiers and find the proper weights of all the base classifiers, depending on their prediction confidence, through the DE algorithm. The whole procedure is summarized in Figure 1.

3.1. Selection and Training of Base Classifiers

The use of an ensemble of classifiers has gained wide acceptance in the machine learning and statistics communities due to significant improvements in accuracy. The individual classifiers should be as diverse as possible. In well-known ensemble techniques such as bagging and boosting, such diversity is achieved by manipulating the training examples in order to generate multiple hypotheses. In our proposed approach, we select five base classifiers to learn: C4.5, Naive Bayes, Bayes Nets, $k$-nearest neighbor ($k$-NN), and ZeroR.
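
The experiments in this paper use the Weka implementations of these learners. Purely as an illustration, roughly equivalent scikit-learn estimators could be assembled as follows (this mapping is an assumption of the sketch, not part of the original setup; DummyClassifier with the "most_frequent" strategy plays the role of ZeroR, and scikit-learn has no built-in Bayes-network classifier):

```python
from sklearn.tree import DecisionTreeClassifier      # rough stand-in for C4.5
from sklearn.naive_bayes import GaussianNB           # Naive Bayes
from sklearn.neighbors import KNeighborsClassifier   # k-NN
from sklearn.dummy import DummyClassifier            # ZeroR analogue

# A Bayes Nets stand-in would require an external library such as pgmpy.
base_classifiers = [
    DecisionTreeClassifier(),
    GaussianNB(),
    KNeighborsClassifier(n_neighbors=3),
    DummyClassifier(strategy="most_frequent"),
]
```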

3.2. DE-Based Model for Parameters Selection

In this section, we are concerned with parameter selection for the proposed DEWVote. The parameters to be optimized in DEWVote are the weights of each base classifier in the ensemble. Different parameter settings have a heavy impact on the performance of DEWVote, and we use differential evolution to search for the optimal weights.

DE starts from a random initial population of solution candidates that is then improved using the evolution operations. In general, we employ a predefined maximum number of iterations as the stopping criterion of DE. The other control parameters of DE are the mutation scaling factor $F$, the crossover rate CR, and the population size $N$. The process of the DE-based parameter selection for DEWVote is shown in Algorithm 1 with the following explanations.

Input: The control parameters of DE: mutation factor $F$, crossover rate CR, and population size $N$.
(1)  Initialization(); {Generate a uniformly distributed random population of $N$ individuals
     $\{W_{i,G} \mid i = 1, 2, \ldots, N\}$, where $W_{i,G} = (w_{i,G}^{1}, w_{i,G}^{2}, \ldots, w_{i,G}^{j}, \ldots, w_{i,G}^{D})$ is a
     vector representing the weights of the $D$ base classifiers.}
(2)  Set the generation iterator $G = 0$.
(3)  while the stopping criterion is not satisfied do
(4)    for $i = 1$ to $N$ do
(5)      Select random indexes $r_{1}$, $r_{2}$, and $r_{3}$ to be different from each other and from the index $i$.
(6)      Compute a mutant vector $V_{i,G}$ using (1).
(7)      Generate a random index $j_{\mathrm{rand}}$.
(8)      for $j = 1$ to $D$ do
(9)        Decide the trial individual $U_{i,G}$ using (6).
(10)     end for
(11)     Compute the fitness of the vectors $U_{i,G}$ and $W_{i,G}$ using 10-fold cross validation, and
         update the vector of the next generation ($W_{i,G+1}$) using (7).
(12)   end for
(13)   Update the generation iterator $G = G + 1$.
(14) end while
Output: The optimal weights $(w_{1}^{*}, w_{2}^{*}, \ldots, w_{j}^{*}, \ldots, w_{D}^{*})$ for DEWVote.
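
Algorithm 1 can also be realized with an off-the-shelf DE optimizer. The sketch below is only an illustration of the idea: `dewvote_cv_accuracy` is a hypothetical fitness function (the 10-fold cross-validation accuracy of the weighted-vote ensemble, cf. step (11) and equation (8) below), the weight range [0, 1] is an assumption, and SciPy's `differential_evolution` minimizes, so the accuracy is negated.

```python
from scipy.optimize import differential_evolution

def dewvote_cv_accuracy(weights):
    """Hypothetical fitness: 10-fold CV accuracy of the weighted-vote ensemble
    built from the D base classifiers with the given weights (see (8) below)."""
    raise NotImplementedError  # sketched after equation (8)

D = 5                                   # number of base classifiers
bounds = [(0.0, 1.0)] * D               # assumed weight range
result = differential_evolution(
    lambda w: -dewvote_cv_accuracy(w),  # SciPy minimizes, so negate the accuracy
    bounds, maxiter=100, popsize=20, mutation=0.5, recombination=0.9)
optimal_weights = result.x
```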

Initialization. Initialize a population of $N$ individuals $\{W_{i,0} \mid i = 1, 2, \ldots, N\}$. An individual is defined as a $D$-dimensional vector $W_{i,0} = (w_{i,0}^{1}, w_{i,0}^{2}, \ldots, w_{i,0}^{D})$, which represents the weights of the base classifiers, where $D$ is the number of base classifiers. Each component of an individual is generated from the uniform distribution over the prescribed range.

Fitness Evaluation. Train DEWVote using each individual (weight) vector, and evaluate the corresponding 10-fold cross-validation accuracy as the fitness function.

Given the number of categories $m$ and $D$ base classifiers to vote, the prediction of weighted voting for each sample $x$ is described as
$$H(x) = \arg\max_{j \in \{1, \ldots, m\}} \sum_{i=1}^{D} w_{i} \, h_{i}^{j}(x), \tag{8}$$
where $h_{i}^{j}(x) \in \{0, 1\}$ is a binary variable. If the $i$th base classifier classifies sample $x$ into the $j$th category, then $h_{i}^{j}(x) = 1$; otherwise, $h_{i}^{j}(x) = 0$. $w_{i}$ is the weight of the $i$th base classifier in the ensemble, which is optimized by the DE algorithm in Algorithm 1.

Then, the accuracy is defined as
$$\text{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}. \tag{9}$$

After obtaining the best individual by differential evolution, namely, the optimal weight vector $W^{*} = (w_{1}^{*}, w_{2}^{*}, \ldots, w_{D}^{*})$, we generate the ensemble classifier and classify the test datasets using (8).
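
For concreteness, a NumPy sketch of the weighted-voting rule (8) and the accuracy (9) is given below. The base classifiers are assumed to expose a scikit-learn-style `predict` method; that interface, and the variable names, are assumptions of this illustration rather than part of the original Weka-based implementation.

```python
import numpy as np

def weighted_vote_predict(fitted_classifiers, weights, X, classes):
    # Equation (8): H(x) = argmax_j sum_i w_i * h_i^j(x),
    # where h_i^j(x) = 1 iff classifier i assigns x to class j.
    votes = np.zeros((len(X), len(classes)))
    for clf, w in zip(fitted_classifiers, weights):
        pred = clf.predict(X)
        for j, c in enumerate(classes):
            votes[:, j] += w * (pred == c)
    return np.asarray(classes)[np.argmax(votes, axis=1)]

def accuracy(y_pred, y_true):
    # Equation (9): fraction of correctly classified samples
    return float(np.mean(np.asarray(y_pred) == np.asarray(y_true)))

# Usage with the DE-optimized weights, e.g.:
# y_pred = weighted_vote_predict(fitted_classifiers, optimal_weights, X_test, classes)
# print(accuracy(y_pred, y_test))
```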

4. Experimental Results and Analysis

In this section, we present and discuss in detail the results obtained from the experiments carried out in this research.

We run our experiments in the Weka framework [19] using 15 datasets to test the performance of the proposed method. These datasets are from the UCI Machine Learning Repository [20], and information about them is summarized in Table 1. In the DE algorithm, the choice of parameters can have a large impact on optimization performance, and selecting DE parameters that yield good performance has therefore been the subject of much research. For simplicity, we fix the mutation factor $F$, the crossover rate CR, the population size $N$, and the maximum number of iterations throughout the experiments.

We first compare the performance of four base classifiers: C4.5, Naive Bayes, Bayes Nets, and $k$-nearest neighbor ($k$-NN).

C4.5 is an algorithm, developed by Quinlan [21], used to generate a decision tree. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy.

A Naive Bayes classifier [22] is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. Bayes' theorem provides a way of calculating the posterior probability. A Naive Bayes classifier assumes that the effect of the value of a predictor on a given class is independent of the values of the other predictors.

Bayes Nets, or Bayesian networks [23], are graphical representations of probabilistic relationships among a set of random variables. A Bayesian network is an annotated directed acyclic graph (DAG) that encodes a joint probability distribution. The nodes of the graph correspond to the random variables, and the links of the graph correspond to direct influence of one variable on another.

$k$-NN is a type of instance-based learning, or lazy learning. In $k$-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its $k$ nearest neighbors ($k$ is a positive integer, typically small). If $k = 1$, then the object is simply assigned to the class of its single nearest neighbor.

To obtain a better measure of predictive accuracy, we compare these methods using 10-fold cross-validation. The cross-validation accuracy is the average of the ten estimates. In each fold, nine-tenths of the samples are selected as the training set, and the remaining tenth is the testing set. This process is repeated 10 times so that every sample appears in both the training set and the testing set. Table 2 shows the average accuracy values of the four single methods.
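
As a reference for the evaluation protocol, a minimal sketch of (stratified) 10-fold cross-validation is shown below; it uses scikit-learn purely for illustration and is not the Weka procedure used in the experiments.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_val_accuracy(classifier, X, y, n_splits=10):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    fold_accs = []
    for train_idx, test_idx in skf.split(X, y):
        classifier.fit(X[train_idx], y[train_idx])    # nine-tenths for training
        pred = classifier.predict(X[test_idx])        # remaining tenth for testing
        fold_accs.append(float(np.mean(pred == y[test_idx])))
    return float(np.mean(fold_accs))                  # average of the ten estimates
```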

From Table 2, we can see that each method outperforms the other single methods on some datasets. Comparatively, C4.5 achieves higher accuracy than the other methods on 8 of the 15 datasets. It is noted that these base classifiers are quite diverse.

To obtain a better measure of predictive accuracy, we also compare several ensemble methods using 10-fold cross-validation: bagging, AdaBoost, majority voting, and our DEWVote approach. In the DEWVote approach, we select five classifiers as base learners: C4.5, Naive Bayes, Bayes Nets, $k$-nearest neighbor ($k$-NN), and ZeroR [19]. ZeroR is the simplest classification method; it relies only on the target and ignores all predictors, simply predicting the majority category (class). Although ZeroR has no predictive power, it is useful for establishing a baseline performance as a benchmark for the other classification methods. Majority voting uses the same base classifiers as our approach. A Naive Bayes classifier is employed as the base learning algorithm of bagging and AdaBoost; Naive Bayes classifiers are generated multiple times by each ensemble method's own mechanism, and the generated classifiers are then combined to form an ensemble.

We report the mean 10-fold cross-validation accuracies for the 15 datasets; the results of the ensembles are shown in Table 3. DEWVote achieves higher accuracy than the other ensemble methods on all datasets except Diabetes and Segment-challenge, where majority voting outperforms the other ensemble methods. It is of note that majority voting achieves higher accuracy than bagging and AdaBoost. Comparatively speaking, DEWVote and majority voting obtain better performance on the majority of datasets, whereas bagging and boosting obtain better performance than majority voting on the Vote dataset.

5. Conclusions

In this paper we give a novel approach to optimizing the weights of base classifiers by differential evolution and present a weighted voting ensemble learning classifier. The proposed approach adopts an ensemble learning strategy and selects several base learners, which are as diverse from each other as possible, to combine into an ensemble classifier. The weight of each base learner is obtained by the differential evolution algorithm.

We have compared the performance with three classical ensemble methods, as well as with four base classifiers. Experimental results have confirmed that our approach consistently outperforms the previous approaches. DEWVote searches for the weights through iterative operations, so it has a higher computational cost than the other ensemble methods. In our future work, we will concentrate on reducing this computational cost.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is partly supported by the National Natural Science Foundation of China (no. 61373127), the China Postdoctoral Science Foundation (no. 20110491530), and the University Scientific Research Project of the Liaoning Education Department of China (no. 2011186).