3.1 Overview of granular computing
Granular computing is an emerging approach to information processing. As stressed in Yao (2005), it is applied with two main aims: one is to adopt structured thinking at the philosophical level, and the other is to conduct structured problem solving at the practical level. As described in Hu and Shi (2009), granular computing generally involves the decomposition of a whole into several parts. In practice, this means decomposing a complex problem into several simpler sub-problems.
The fundamentals of granular computing generally involve information granulation, which includes probabilistic sets, fuzzy sets and rough sets. Deterministic sets can be seen as special cases of all three of the above types of sets. In particular, a probabilistic set reduces to a deterministic set when each element has a 100% chance of belonging to the set. Also, a fuzzy set reduces to a deterministic set when each element has a full membership to the set, i.e. the fuzzy membership degree is 100%. Similarly, a rough set reduces to a deterministic set when each element unconditionally belongs to the set. The above description indicates that deterministic sets are used in the context of deterministic logic, whereas the other three types of sets are used in the context of non-deterministic logic.
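These special-case relationships can be stated compactly as follows (our notation, using standard conventions for each formalism, not taken from the cited works):
$$\begin{aligned}&\text {probabilistic: } \Pr (x \in A) = 1 \text { for all } x \in A \Rightarrow A \text { is deterministic};\\&\text {fuzzy: } \mu _A(x) = 1 \text { for all } x \in A \Rightarrow A \text { is deterministic};\\&\text {rough: } \underline{A} = \overline{A} \Rightarrow A \text { is deterministic}, \end{aligned}$$
where \(\mu _A\) denotes the fuzzy membership function, and \(\underline{A}\) and \(\overline{A}\) denote the lower and upper approximations of a rough set, whose equality means the boundary region is empty.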
In the context of probabilistic sets, as described in Liu et al. (2016b), each set employs a chance space which can be partitioned into a number of subspaces. Each of these subspaces can be viewed as a granule that can be randomly selected towards enabling an event to occur. In this context, all these granules make up the whole chance space. As also described in Liu et al. (2016b), each element in a probabilistic set is granted a probability of getting a full membership to the set. In the context of granular computing, the probability can be viewed as the percentage of the granules that make up the chance space. For example, if an element is given a probability of 80% of getting a full membership to a set, then the element would be assigned 80% of the granules that enable the full membership to be granted.
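This reading of the example can be made concrete with a minimal Python sketch (our own illustration; the function name and the granule count of 100 are arbitrary choices, not taken from the cited works):

```python
import random

def probabilistic_membership(prob, n_granules=100):
    """Partition the chance space into n_granules granules, assign
    prob * n_granules of them to the element, and grant full
    membership iff a randomly selected granule is one of them."""
    favourable = round(prob * n_granules)      # granules enabling membership
    selected = random.randrange(n_granules)    # pick one granule at random
    return selected < favourable

# Element with an 80% chance of full membership (as in the text):
grants = sum(probabilistic_membership(0.8) for _ in range(10_000))
print(grants / 10_000)  # approximately 0.8
```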
In the context of fuzzy sets, as described in Liu et al. (2016b), each element in a fuzzy set has a certain degree of membership to the set, i.e. an element belongs to a set to a certain extent. In the context of granular computing, the membership can be partitioned into a number of parts. Each part of the membership can be viewed as a granule. For example, if an element is given a membership degree of 80% to a set, then the element would be assigned 80% of the granules that make up a full membership to the set. This is analogous to a club that offers different levels of membership, providing its members with different levels of access to its resources and facilities.
In the context of rough sets, as described in Liu et al. (2016b), each rough set employs a boundary region to allow some elements to be included conditionally on the basis of insufficient information. In other words, all the elements within the boundary region are only given conditional memberships to the set. This is because these elements have only partially met the conditions for being members of the set. Once the conditions have been fully met, these elements would be granted full memberships to the set. In the context of granular computing, the condition for an element to belong to the set can be partitioned into a number of subconditions. Each of these subconditions can be viewed as a granule. As defined in Liu et al. (2016b), possibility measures the extent to which the condition is met. For example, if the possibility that an element belongs to a set is 80%, then the element would be assigned 80% of the granules, each of which contributes a partial fulfilment of the membership condition.
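The following small Python sketch illustrates this view (the helper name `possibility` and the predicates are hypothetical): each subcondition is one granule, and the possibility is the fraction of granules the element satisfies.

```python
def possibility(subconditions, element):
    """Fraction of granules (subconditions) the element satisfies.

    A possibility of 1.0 means the membership condition is fully met,
    while 0 < possibility < 1 places the element in the boundary
    region, i.e. it has only a conditional membership."""
    met = sum(1 for cond in subconditions if cond(element))
    return met / len(subconditions)

# Hypothetical example: five subconditions, of which x = 7 meets four (80%)
conds = [lambda x: x > 0, lambda x: x < 10, lambda x: x % 2 == 1,
         lambda x: x != 3, lambda x: x > 100]
print(possibility(conds, 7))  # 0.8
```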
In practice, the concepts of granular computing have been applied broadly in many areas such as artificial intelligence (Wilke and Portmann 2016; Skowron et al. 2016), computational intelligence (Dubois and Prade 2016; Kreinovich 2016; Livi and Sadeghian 2016), and machine learning (Min and Xu 2016; Peters and Weber 2016; Antonelli et al. 2016). In addition, ensemble learning is also considered a granular computing approach, since it involves the decomposition of a data set into a number of overlapping samples in the training stage and the combination of predictions by different classifiers towards classifying a test instance. A similar perspective has also been pointed out in Hu and Shi (2009). The next section presents in detail how the concepts of granular computing can be used towards the reduction of bias in voting in the context of ensemble learning.
3.2 Probabilistic voting
Probabilistic voting (Liu et al. 2016a) is considered to be inspired by nature and biology in the context of granular computing, since the voting is made on the basis of the hypothesis that the class with the highest frequency or weight only has the best chance of being selected towards classifying an unseen instance. In other words, it is not guaranteed that the class with the highest frequency or weight will definitely be selected to be assigned to the unseen instance. In this paper, probabilistic voting is used for both Bagging and Boosting towards improving the overall classification accuracy. In particular, majority voting (involved in Bagging) and weighted voting (involved in Boosting) are both replaced with probabilistic voting. The procedure of probabilistic voting is illustrated below, followed by a code sketch:
Step 1:
calculating the weight \(W_i\) for each single class i.
Step 2:
calculating the total weight W over all classes.
Step 3:
calculating the percentage of weight \(P_i\) for each single class i, i.e. \(P_i= W_i \div W\).
Step 4:
randomly selecting a single class i with the probability \(P_i\) towards classifying an unseen instance.
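The four steps can be implemented directly. The following is a minimal Python sketch (the function name `probabilistic_vote` and the dictionary representation of class weights are our own illustration, not an implementation from the cited works):

```python
import random

def probabilistic_vote(class_weights, rng=random):
    """Steps 1-4 of probabilistic voting: given the weight W_i of each
    class i, normalise the weights into probabilities P_i and randomly
    select one class with probability P_i (a roulette wheel spin)."""
    total = sum(class_weights.values())          # Step 2: total weight W
    spin = rng.uniform(0, total)                 # one spin of the wheel
    cumulative = 0.0
    for label, weight in class_weights.items():  # Steps 3-4: walk the wheel
        cumulative += weight
        if spin <= cumulative:
            return label                         # selected with chance W_i / W
    return label                                 # guard for float rounding
```

In Python 3.6+, the same selection can equivalently be made with `random.choices(list(class_weights), weights=list(class_weights.values()))[0]`.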
The following example, relating to Bayes' theorem, is used to illustrate the above procedure:
Inputs (binary): \(x_1, x_2, x_3\)
Output (binary): y
Probabilistic correlation:
$$\begin{aligned}&P(y=0|x_1=0)=0.4, P(y=1|x_1=0)=0.6, \\&P(y=0|x_1=1)=0.5, P(y=1|x_1=1)=0.5;\\&P(y=0|x_2=0)=0.7, P(y=1|x_2=0)=0.3, \\&P(y=0|x_2=1)=0.6, P(y=1|x_2=1)=0.4;\\&P(y=0|x_3=0)=0.5, P(y=1|x_3=0)=0.5, \\&P(y=0|x_3=1)=0.8, P(y=1|x_3=1)=0.2; \end{aligned}$$
Given \(x_1=0, x_2=1, x_3=1\), what is the value of y?
Following Step 1, the weight \(W_i\) for each single value of y is:
$$\begin{aligned}&P(y=0|x_1=0, x_2=1, x_3=1)= P(y=0|x_1=0)\times P(y=0|x_2=1)\\&\quad \times P(y=0|x_3=1)= 0.4 \times 0.6 \times 0.8= 0.192\\&P(y=1|x_1=0, x_2=1, x_3=1)= P(y=1|x_1=0)\times P(y=1|x_2=1)\\&\quad \times P(y=1|x_3=1)= 0.6 \times 0.4 \times 0.2= 0.048 \end{aligned}$$
Following Step 2, the total weight W is \(0.192 + 0.048 = 0.24\).
Following Step 3, the percentage of weight \(P_i\) for each single value of y is:
Percentage for \(y=0\): \(P_0= 0.192\div 0.24= 80\%\)
Percentage for \(y=1\): \(P_1= 0.048\div 0.24= 20\%\)
Following Step 4, \(y=0\) (80% chance) or \(y=1\) (20% chance).
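Running the sketch given after the procedure on this example (hypothetical driver code) reproduces the 80%/20% split empirically:

```python
weights = {0: 0.192, 1: 0.048}   # Step 1 weights W_0 and W_1 from above
counts = {0: 0, 1: 0}
for _ in range(10_000):
    counts[probabilistic_vote(weights)] += 1
print(counts)  # roughly {0: 8000, 1: 2000}, i.e. an 80%/20% split
```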
In the above illustration, both majority voting and weighted voting would result in 0 being assigned to y, due to its higher frequency or weight shown in Step 4. In particular, in the context of majority voting, Step 4 would indicate that the frequency of y equalling 0 is 80% and that of y equalling 1 is 20%. Also, in the context of weighted voting, Step 4 would indicate that, over the total weight, the percentage of the weight for y to equal 0 is 80% and the percentage for y to equal 1 is 20%. Therefore, both types of voting would deterministically assign y the value 0. However, in the context of probabilistic voting, Step 4 indicates that y could be assigned either 0 (with an 80% chance) or 1 (with a 20% chance). In this way, the bias in voting can be effectively reduced, towards improving the overall classification accuracy in ensemble learning.
The probabilistic voting approach illustrated above is very similar to natural selection, which is one step of the procedure of genetic algorithms (Man et al. 1996), i.e. the probability of a class being selected is analogous to the fitness of an individual involved in natural selection. In particular, the way of selecting a class in Step 4 of the above procedure is inspired by Roulette Wheel Selection (Lipowski and Lipowska 2012).
In the context of granular computing, the frequency of a class can be viewed as an information granule that enables the class to be selected for being assigned to a test instance. Similarly, the weight of a class can be viewed as a part of the information granules that enable the class to be selected towards classifying an unseen instance. From this point of view, the class with the highest frequency of being predicted by the base classifiers has been assigned the most information granules that enable it to be selected for a test instance. Similarly, the class with the highest weight has been assigned the highest percentage of the information granules that enable it to be selected towards classifying an unseen instance. More details on information granules can be found in Pedrycz and Chen (2011, 2015a, b).
As mentioned in Sect. 1, for classifying test instances, the Bagging method is biased to always select the most frequently occurring class and the Boosting method is biased to always select the most highly weighted class. This is due to the assumption that all the independent predictions by the base classifiers provide a complete and highly trusted set of information granules, each of which votes for one class and against all the other classes. However, it is difficult to guarantee that a set of granules is complete, because the training and validation sets are very likely to be incomplete in practice. Also, it is commonly known that a training set may be imbalanced, due to insufficient collection of data, which is likely to result in one class being assigned many more information granules than the other classes. In addition, a learning algorithm may not be suitable for learning a model on a particular sample set. In this case, the information granules provided by the predictions of the models learned by that algorithm would be much less trusted.
In the context of machine learning, it has been argued in Liu et al. (2016a) that voting based on heuristics such as frequency or weight is biased. In particular, as mentioned in Sect. 1, the probabilistic voting approach has been applied to two popular single learning algorithms (Naive Bayes and K Nearest Neighbour) for the reduction of bias in voting, and the experimental results were encouraging. Since this type of voting is involved in ensemble learning as well, probabilistic voting could also lead to improved results in this context, as sketched below.
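As a hedged illustration of how this could look for Bagging, the sketch below replaces the final majority vote with probabilistic voting; the function `bagging_probabilistic_vote`, the choice of scikit-learn decision trees as base classifiers, and all parameter values are our own assumptions rather than the authors' implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def bagging_probabilistic_vote(X_train, y_train, x_test, n_models=25, seed=0):
    """Bagging in which the usual majority vote over the base
    classifiers' predictions is replaced by probabilistic voting:
    each class's vote frequency becomes its selection probability."""
    votes = {}
    for i in range(n_models):
        # Bootstrap sample, then train one base classifier on it
        Xs, ys = resample(X_train, y_train, replace=True, random_state=i)
        model = DecisionTreeClassifier(random_state=i).fit(Xs, ys)
        pred = model.predict(x_test.reshape(1, -1))[0]
        votes[pred] = votes.get(pred, 0) + 1            # Step 1: frequencies
    labels = list(votes)
    probs = np.array([votes[l] for l in labels], dtype=float)
    probs /= probs.sum()                                # Steps 2-3: normalise
    rng = np.random.default_rng(seed)
    return rng.choice(labels, p=probs)                  # Step 4: sample, not argmax
```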
In ensemble learning, Bagging needs to draw a number of samples of the original data on a random basis, and Boosting needs to iteratively evaluate the weights of training instances. The nature of the Bagging method may result in poor samples of training data being drawn, in terms of incompleteness and imbalance. The nature of the Boosting method may result in poor evaluation of training instances in terms of their weights. If the employed learning algorithms are not suited to the sampled data for Bagging or the weighted data for Boosting, then the frequency or the weight of classes would be much less trusted for classifying test instances. Therefore, the majority voting involved in Bagging and the weighted voting involved in Boosting are considered to be biased. This is very similar to the human reasoning approach whereby people generally make decisions and judgements based on their previous experience, without the guarantee that the decisions and judgements are absolutely right (Liu et al. 2016a). However, the frequency or the weight of a class can fairly be used to reflect the chance of the class being selected towards classifying a test instance, especially when the above conjecture concerning low-quality training data cannot be proved in a reasonable way. The impact of probabilistic voting on the Bagging and Boosting approaches is investigated experimentally in the following section.