Introduction
Background
Research motivations
Literature review
Data filtering
- The decision boundaries are smooth and clear;
- It is easier for the classifiers to discriminate between the classes;
- The training set becomes smaller while the genuinely important data remain in it;
- The accuracy of the model improves;
- Computational costs are reduced.

The potential drawbacks of filtering are:

- Making the training data less expressive;
- Decreasing the training set size.
Related studies
Methods
Datasets
- German credit (22 features, 1000 entries, 70% positive entries), will be denoted as dataset A. The dataset was created by Professor Dr. Hans Hofmann of the Institute of Statistics, Hamburg University, and provides bank credit attributes for 1000 credits.
- Banknote authentication (4 features, 1372 entries, 56% positive entries), will be denoted as dataset B. The dataset was provided by Helene Dörksen (University of Applied Sciences, Ostwestfalen-Lippe). Data were extracted from images taken of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was employed; the final images have 400 × 400 pixels. Owing to the object lens and the distance to the investigated object, grayscale images with a resolution of about 660 dpi were obtained.
- Haberman (3 features, 306 entries, 74% positive entries), will be denoted as dataset C [12]. The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
- Ionosphere (34 features, 351 entries, 64% positive labels), will be denoted as dataset D [24]. The dataset represents the classification of radar returns from the ionosphere, collected by a system in Goose Bay, Labrador. The targets were free electrons in the ionosphere. “Good” radar returns are those showing evidence of some type of structure in the ionosphere; “bad” returns are those that do not, and their signals pass through the ionosphere.
- Seismic bumps (18 features, 2584 entries, 93% positive labels), will be denoted as dataset E. The dataset was provided by Marek Sikora and Lukasz Wrobel from the Institute of Computer Science, Silesian University of Technology. The data describe the problem of forecasting high-energy seismic bumps (energy above \(10^4\) J) in a coal mine and come from two longwalls located in a Polish coal mine.
- WDBC (30 features, 569 entries, 63% positive labels), will be denoted as dataset F. The dataset was provided by Dr. William H. Wolberg, General Surgery Dept., University of Wisconsin, Clinical Sciences Center, Madison.
Data analysis and selecting filtering parameters
- \(P(i) < threshold\) and \(l(i) = 1\);
- \(P(i) \ge threshold\) and \(l(i) = 0\).

A minimal code sketch of this removal rule is given after the parameter table below.
Parameter | German | Banknote authn. | Haberman | Ionosphere | Seismic bumps | WDBC
---|---|---|---|---|---|---
Neighbour count | 25 | 20 | 20 | 8 | 10 | 20
Power | −0.35 | 0 | 0 | −4.8 | −0.8 | −2.1
Threshold | 0.48 | 0.75 | 0.45 | 0.11 | 0.62 | 0.49
Outlier rate | 0.23 | 0 | 0.24 | 0.07 | 0.06 | 0.02
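To make the rule concrete, below is a minimal sketch of a distance-weighted k-nearest-neighbour estimate of \(P(i)\) followed by the removal rule above. This is an illustration under assumptions, not the authors' implementation: the function name `knn_outlier_filter` is hypothetical, and interpreting the Power parameter as the exponent of the distance weights (with 0 meaning uniform weights) is our reading of the parameter table.

```python
import numpy as np

def knn_outlier_filter(X, y, n_neighbors=25, power=-0.35, threshold=0.48):
    """Flag training points whose distance-weighted kNN estimate of the
    positive-class probability P(i) contradicts their label l(i)."""
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        nn = np.argsort(d)[:n_neighbors]   # k nearest neighbours
        # Power 0 gives uniform weights; negative powers emphasise close
        # neighbours (zero distances would need special handling)
        w = d[nn] ** power
        p = np.sum(w * y[nn]) / np.sum(w)  # weighted share of positive labels
        # Removal rule from the two conditions above
        if (p < threshold and y[i] == 1) or (p >= threshold and y[i] == 0):
            keep[i] = False
    return keep

# Example with the German credit parameters from the table:
# mask = knn_outlier_filter(X_train, y_train, n_neighbors=25,
#                           power=-0.35, threshold=0.48)
# X_filtered, y_filtered = X_train[mask], y_train[mask]
```

Under this reading, the outlier rate in the table is simply the share of removed points, i.e. `1 - mask.mean()`.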
Base classifier development
- Decision Tree (minimum number of observations required to split a node: 10; empirically evaluated probabilities for each class). A decision tree builds classification or regression models in the form of a tree structure. It utilizes an if-then rule set that is mutually exclusive and exhaustive for classification. The rules are learned sequentially from the training data, one at a time; each time a rule is learned, the tuples covered by it are removed. This process continues on the training set until a termination condition is met. The tree is constructed in a top-down, recursive, divide-and-conquer manner. A decision tree can easily be over-fitted, generating too many branches that may reflect anomalies due to noise or outliers. An over-fitted model performs very poorly on unseen data even though it gives an impressive performance on the training data. This can be avoided by pre-pruning, which halts tree construction early, or post-pruning, which removes branches from the fully grown tree.
- Logistic Regression classifier (nominal model type). Logistic regression is a statistical method for predicting binary classes: the outcome or target variable is dichotomous, meaning there are only two possible classes. It computes the probability of an event occurring and is a special case of linear regression in which the target variable is categorical; it uses the log of odds as the dependent variable. Logistic regression predicts the probability of occurrence of a binary event by means of a logit function. Two key properties of the model:
- The dependent variable follows a Bernoulli distribution.
- Estimation is performed via maximum likelihood.
- Naïve Bayes (normal distribution for each feature). Naive Bayes is a probabilistic classifier inspired by Bayes' theorem under the assumption that the attributes are conditionally independent given the class. Classification is conducted by deriving the maximum posterior probability with this assumption applied to Bayes' theorem. The assumption greatly reduces the computational cost, since only the class distribution needs to be counted. Even though the assumption is invalid in most cases, because the attributes are usually statistically dependent, Naive Bayes has been able to perform quite well. It can, however, suffer from the zero-probability problem: when the conditional probability is zero for a particular attribute value, the classifier fails to give a valid prediction. This has to be fixed explicitly, for example with a Laplacian estimator.
- Support Vector Machine (radial basis kernel function; kernel scale automatically evaluated for each dataset except German, for which the kernel scale is 1.8). The objective of the support vector machine algorithm is to find a hyperplane in the transformed feature space that distinctly separates the data points of the different classes. The goal is a hyperplane with the maximum margin, i.e. the maximum distance between the data points of both classes; maximizing the margin means that future data points can be classified with more confidence. Support vectors are the data points closest to the hyperplane; they determine its position and orientation, and deleting them would change the position of the hyperplane.
- Neural Network (one hidden layer of 10 neurons; gradient-descent learning with an adaptive learning rate; hyperbolic-tangent transfer function in the first, hidden layer and a linear transfer function in the second, output layer). An artificial neural network is a set of connected input/output artificial neurons in which each connection has an associated weight. During the learning phase, the network adjusts the weights so as to predict the correct class label of the input vectors. A disadvantage of this classifier is the poor interpretability of the model compared to models such as decision trees, owing to the unknown symbolic meaning behind the learned weights. However, neural networks have performed impressively in most real-world applications: they have a high tolerance to noisy data and are able to classify untrained patterns. Artificial neural networks usually perform better with numerical inputs and outputs.
- Random Forest (number of trees: 60; method: classification). A random forest is in itself a homogeneous combiner consisting of a large number of individual decision trees that operate as an ensemble. Each individual tree outputs a class prediction, and the class with the most votes becomes the prediction of the random forest. Uncorrelated models can produce ensemble predictions that are more accurate than any of the individual predictions, because the trees cancel out one another's individual mistakes.
- DES-LA combiner (described in the next subsection). The training data are the same for all classifiers, whether or not filtering is used. A sketch of how these base classifiers might be configured is given below.
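As an illustration only, the settings listed above could map onto scikit-learn as in the following sketch. The mapping is an assumption (the original experiments appear to use a different toolbox, given the "kernel scale" terminology), and parameters such as `min_samples_split=10` and the single hidden layer of 10 tanh neurons are our reading of the settings above.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

base_classifiers = {
    # Decision tree: at least 10 observations required to split a node
    "DT": DecisionTreeClassifier(min_samples_split=10),
    # Binary (Bernoulli) target, fitted by maximum likelihood
    "LR": LogisticRegression(max_iter=1000),
    # Gaussian Naive Bayes: one normal distribution per feature and class
    "NB": GaussianNB(),
    # RBF kernel; gamma="scale" stands in for the automatic kernel scale
    "SVM": SVC(kernel="rbf", gamma="scale", probability=True),
    # One hidden layer of 10 tanh neurons, gradient descent with adaptive
    # learning rate; scikit-learn fixes the output activation, so the
    # "linear second layer" is only approximated here
    "NN": MLPClassifier(hidden_layer_sizes=(10,), activation="tanh",
                        solver="sgd", learning_rate="adaptive",
                        max_iter=2000),
    # 60 trees, classification mode
    "RF": RandomForestClassifier(n_estimators=60),
}
```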
DES-LA combiner
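DES-LA dynamically selects, for each test point, the base classifier that is most accurate in that point's neighbourhood (local accuracy). The sketch below illustrates this idea in the spirit of classical local-accuracy selection (Woods et al.); the neighbourhood size `k` and the use of a held-out validation set are assumptions, not details taken from this paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def des_la_predict(classifiers, X_val, y_val, X_test, k=7):
    """For every test point, use the classifier with the highest
    accuracy on the point's k nearest validation neighbours."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_val)
    # Correctness of every classifier on the validation set, shape (n_clf, n_val)
    correct = np.array([clf.predict(X_val) == y_val for clf in classifiers])
    # Test predictions of every classifier, shape (n_clf, n_test)
    preds = np.array([clf.predict(X_test) for clf in classifiers])
    _, idx = nn.kneighbors(X_test)
    out = np.empty(len(X_test), dtype=preds.dtype)
    for i, region in enumerate(idx):
        local_acc = correct[:, region].mean(axis=1)  # local accuracy per classifier
        out[i] = preds[np.argmax(local_acc), i]      # locally best classifier decides
    return out
```

The design choice here is that competence is estimated locally: a classifier that is weak globally can still win in regions of the feature space where it happens to be reliable.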
Performance indicator measures
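The indicators reported in the tables below are accuracy (ACC), sensitivity (Sens), specificity (Spec), area under the ROC curve (AUC), and the Brier score (BS). As a generic sketch (not the authors' evaluation code), they can be computed with standard library calls:

```python
from sklearn.metrics import (accuracy_score, recall_score,
                             roc_auc_score, brier_score_loss)

def performance_indicators(y_true, y_pred, y_prob):
    """y_prob is the predicted probability of the positive class."""
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "Sens": recall_score(y_true, y_pred, pos_label=1),  # true positive rate
        "Spec": recall_score(y_true, y_pred, pos_label=0),  # true negative rate
        "AUC": roc_auc_score(y_true, y_prob),
        "BS": brier_score_loss(y_true, y_prob),             # lower is better
    }
```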
Experimental results and discussion
Impact of filtering on single classifier performance
Classifier performance without filtering:

Classifier | German | Banknote authn. | Haberman | Ionosphere | Seismic bumps | WDBC
---|---|---|---|---|---|---
DT | 0.692 | 0.981 | 0.686 | 0.879 | 0.898 | 0.919 |
LR | 0.76 | 0.99 | 0.742 | 0.868 | 0.931 | 0.954 |
NB | 0.724 | 0.841 | 0.747 | 0.821 | 0.855 | 0.934 |
SVM | 0.761 | 0.999 | 0.716 | 0.937 | 0.933 | 0.974 |
NN | 0.741 | 0.978 | 0.731 | 0.871 | 0.933 | 0.955 |
RF | 0.762 | 0.993 | 0.686 | 0.932 | 0.91 | 0.961 |
DES-LA | 0.742 | 0.997 | 0.737 | 0.934 | 0.927 | 0.964 |
Classifier performance with filtering:

Classifier | German | Banknote authn. | Haberman | Ionosphere | Seismic bumps | WDBC
---|---|---|---|---|---|---
DT | 0.746 | 0.981 | 0.747 | 0.894 | 0.933 | 0.931 |
LR | 0.76 | 0.99 | 0.744 | 0.878 | 0.934 | 0.961 |
NB | 0.757 | 0.841 | 0.752 | 0.814 | 0.927 | 0.931 |
SVM | 0.761 | 0.999 | 0.736 | 0.929 | 0.934 | 0.969 |
NN | 0.747 | 0.979 | 0.742 | 0.863 | 0.934 | 0.948 |
RF | 0.743 | 0.992 | 0.748 | 0.922 | 0.934 | 0.948 |
DES-LA | 0.768 | 0.996 | 0.747 | 0.928 | 0.934 | 0.963 |
Change of each performance indicator due to filtering, by classifier:

Metric | DT (%) | LR (%) | NB (%) | SVM (%) | NN (%) | RF (%) | DES-LA (%)
---|---|---|---|---|---|---|---
ACC | 2.97 | 0.38 | 1.65 | 0.11 | 0.07 | 0.35 | 0.57 |
Sens | 6.14 | 0.85 | 3.07 | 0.26 | 1.41 | 2.59 | 1.12 |
Spec | 7.12 | 0.74 | 7.92 | − 0.15 | − 2.98 | 5.42 | − 1.08 |
AUC | − 1.53 | 4.53 | 3.76 | − 1.54 | − 2.56 | 4.23 | 0.17 |
BS | − 1.32 | 1.61 | 0.05 | 0.11 | 0.81 | 1.00 | − 0.13 |
Dataset | Accuracy gap to best classifier, without filtering (%) | Accuracy gap to best classifier, with filtering (%) | Accuracy increase (%)
---|---|---|---
German | 7 | 2.2 | 5.4 |
Banknote authn. | 1.8 | 1.8 | 0 |
Haberman | 6.1 | 0.5 | 6.2 |
Ionosphere | 5.8 | 3.5 | 1.5
Seismic bumps | 3.5 | 0.1 | 3.5 |
WDBC | 5.5 | 3.8 | 1.2 |
Ranking distribution change due to filtering and feature selection
Real case test: large default payments dataset
- X1: Amount of the given credit, which includes both the individual consumer credit and his/her family (supplementary) credit
- X2: Gender (1 = male; 2 = female)
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
- X4: Marital status (1 = married; 2 = single; 3 = others)
- X5: Age (year)
- X6–X11: History of past payment. We denote the tracked payment records from September 2005 back to April 2005 by X6–X11, respectively. The measurement scale for the repayment status is: −1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- X12–X17: Amount of bill statement. We denote the bill statement amounts from September 2005 back to April 2005 by X12–X17, respectively.
- X18–X23: Amount of previous payment. We denote the amounts paid from September 2005 back to April 2005 by X18–X23, respectively.
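For orientation, here is a hedged sketch of how this schema might be loaded and grouped with pandas. The file name is a placeholder, and the mapping of X1–X23 onto the UCI column names (`LIMIT_BAL`, `PAY_0`…`PAY_6`, `BILL_AMT1`…`BILL_AMT6`, `PAY_AMT1`…`PAY_AMT6`) reflects the public "default of credit card clients" dataset, which this description matches.

```python
import pandas as pd

# Placeholder file name for the UCI "default of credit card clients" data;
# the file has a title row, so the real header sits on the second row
df = pd.read_excel("default_of_credit_card_clients.xls", header=1)

# Approximate mapping of the X1-X23 convention above onto UCI column names
feature_groups = {
    "X1": ["LIMIT_BAL"],                                   # amount of given credit
    "X2-X5": ["SEX", "EDUCATION", "MARRIAGE", "AGE"],
    "X6-X11": [f"PAY_{i}" for i in (0, 2, 3, 4, 5, 6)],    # repayment status, Sept..April 2005
    "X12-X17": [f"BILL_AMT{i}" for i in range(1, 7)],      # bill statements, Sept..April 2005
    "X18-X23": [f"PAY_AMT{i}" for i in range(1, 7)],       # previous payments, Sept..April 2005
}
```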
Results on the default payments dataset without filtering:

Classifier | Decision Tree | Logistic regression | Naive Bayes | SVM | Neural Network | Random Forest | DES-LA
---|---|---|---|---|---|---|---
Accuracy | 0.657 | 0.737 | 0.774 | 0.816 | 0.793 | 0.817 | 0.816 |
Sensitivity | 0.781 | 0.926 | 0.837 | 0.956 | 0.985 | 0.943 | 0.947 |
Specificity | 0.217 | 0.073 | 0.551 | 0.323 | 0.118 | 0.371 | 0.354 |
AUC | 0.5 | 0.499 | 0.736 | 0.703 | 0.686 | 0.762 | 0.749 |
Brier Score | 0.302 | 0.196 | 0.184 | 0.152 | 0.155 | 0.139 | 0.141 |
Results on the default payments dataset with filtering:

Classifier | Decision Tree | Logistic regression | Naive Bayes | SVM | Neural Network | Random Forest | DES-LA
---|---|---|---|---|---|---|---
Accuracy | 0.813 | 0.818 | 0.79 | 0.815 | 0.794 | 0.817 | 0.835 |
Sensitivity | 0.951 | 0.958 | 0.875 | 0.956 | 0.985 | 0.954 | 0.95 |
Specificity | 0.327 | 0.324 | 0.491 | 0.322 | 0.124 | 0.334 | 0.365 |
AUC | 0.688 | 0.713 | 0.712 | 0.714 | 0.683 | 0.729 | 0.769 |
Brier Score | 0.179 | 0.162 | 0.192 | 0.15 | 0.17 | 0.161 | 0.159 |