1 Introduction
2 Related work
3 Semi-supervised self-training with decision trees
3.1 Semi-supervised setting
3.2 The self-training algorithm
4 Self-training with improved probability estimates
4.1 C4.4: No-pruning and Laplacian correction
4.2 NBTree
4.3 Grafted decision tree
4.4 Combining no-pruning, Laplace correction and grafting
4.5 Global distance-based measure
5 Self-training for ensembles of decision trees
5.1 Random forest
5.2 The random subspaces method
5.3 The ensemble self-training algorithm
6 Experiments
6.1 Statistical test
6.2 UCI datasets
Dataset (classes) | Attributes | Size | Perc. |
---|---|---|---|
Breast-cancer (1,2) | 10 | 286 | 70 |
Bupa (1,2) | 6 | 345 | 58 |
Car (1,2) | 7 | 1,594 | 76 |
Cmc (1,3) | 10 | 1,140 | 55 |
Colic (1,2) | 22 | 368 | 63 |
Diabetes (1,2) | 6 | 768 | 65 |
Heart statlog (1,2) | 13 | 270 | 55 |
Hepatitis (1,2) | 19 | 155 | 79 |
Ionosphere (1,2) | 34 | 351 | 36 |
Liver (1,2) | 7 | 345 | 58 |
Sonar (1,2) | 61 | 208 | 53 |
Tic-tac-toe (1,2) | 9 | 958 | 65 |
Vote (1,2) | 16 | 435 | 61 |
Wave (1,2) | 41 | 3,345 | 51 |
6.3 Web-pages datasets
Dataset | Attributes | Size | Perc. |
---|---|---|---|
Aesthetic | 192 | 60 | 50 |
Recency | 192 | 60 | 50 |
7 Results
7.1 Self-training with a single classifier
Dataset | DT | ST-DT | C4.4 | ST-C4.4 | GT | ST-GT | C4G | ST-C4G | NB | ST-NB |
---|---|---|---|---|---|---|---|---|---|---|
Breast-cancer | 68.00 | 69.00 | 66.25 | 67.00 | 68.00 | 70.01 | 66.25 | 70.12 | 72.50 | 75.75 |
Bupa | 58.62 | 57.09 | 58.62 | 58.68 | 58.62 | 59.25 | 58.62 | 61.40 | 58.20 | 58.20 |
Car | 86.08 | 86.04 | 85.48 | 87.48 | 86.08 | 87.28 | 85.48 | 88.28 | 85.08 | 87.68 |
Cmc | 57.00 | 58.25 | 56.75 | 59.05 | 57.00 | 59.13 | 56.75 | 60.12 | 54.25 | 58.00 |
Colic | 72.83 | 72.36 | 70.56 | 73.70 | 72.84 | 74.80 | 70.56 | 75.03 | 74.60 | 76.71 |
Diabetes | 67.82 | 67.83 | 67.51 | 69.18 | 68.46 | 69.40 | 68.14 | 71.79 | 71.14 | 72.59 |
Heart | 67.27 | 67.27 | 68.63 | 70.50 | 67.27 | 69.10 | 68.63 | 72.12 | 71.81 | 73.85 |
Hepatitis | 76.00 | 75.60 | 76.00 | 76.40 | 76.00 | 76.60 | 76.00 | 80.60 | 78.40 | 82.40 |
Ionosphere | 70.47 | 70.67 | 70.47 | 71.46 | 70.37 | 71.56 | 70.37 | 73.72 | 79.97 | 82.57 |
Liver | 56.80 | 56.60 | 57.00 | 60.80 | 57.00 | 59.80 | 57.00 | 59.98 | 57.00 | 59.90 |
Sonar | 63.40 | 63.40 | 63.40 | 63.76 | 63.40 | 64.92 | 63.40 | 65.40 | 59.60 | 63.60 |
Tic-tac-toe | 66.40 | 68.20 | 63.40 | 68.80 | 66.40 | 70.10 | 63.80 | 69.60 | 65.20 | 68.60 |
Vote | 89.08 | 89.08 | 89.05 | 90.30 | 89.08 | 89.80 | 88.05 | 90.48 | 90.00 | 92.74 |
Wave | 83.10 | 83.63 | 82.85 | 85.13 | 84.10 | 85.25 | 83.60 | 86.25 | 84.75 | 88.00 |
7.1.1 Self-training with J48 decision tree learner
7.1.2 Self-training with C4.4, grafting, and NBTree
7.1.3 Statistical analysis
Datasets | Decision tree | C4.4 | J48graft | C4.4graft | NBTree |
---|---|---|---|---|---|
Breast-cancer | 4 | 5 | 3 | 2 | 1 |
Bupa | 5 | 3 | 2 | 1 | 4 |
Car | 5 | 3 | 4 | 1 | 2 |
Cmc | 5 | 3 | 2 | 1 | 4 |
Colic | 5 | 4 | 3 | 2 | 1 |
Diabetes | 5 | 4 | 3 | 2 | 1 |
Heart | 5 | 3 | 4 | 2 | 1 |
Hepatitis | 5 | 4 | 3 | 2 | 1 |
Ionosphere | 5 | 4 | 3 | 2 | 1 |
Liver | 5 | 3 | 4 | 1 | 2 |
Sonar | 5 | 3 | 2 | 1 | 4 |
Tic-tac-toe | 5 | 3 | 1 | 2 | 4 |
Vote | 5 | 3 | 4 | 2 | 1 |
Wave | 5 | 4 | 3 | 2 | 1 |
Average rank | 4.93 | 3.50 | 2.93 | 1.64 | 1.93 |
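
The ranks above appear to come from ordering the self-training accuracies of Section 7.1 within each dataset, with rank 1 for the highest accuracy. The sketch below reproduces the first row of the rank table and two of the average ranks under that assumption. It also checks that the Z value reported for C4.4graft in the next table is consistent with the usual rank-difference statistic, \(z = (R_{control} - R_j)/\sqrt{k(k+1)/(6N)}\), with k = 5 classifiers and N = 14 datasets. The function names `rank_row` and `z_statistic`, the mapping of ST-DT, ST-GT, ST-C4G and ST-NB to the rank-table columns, and the use of this formula are assumptions, not statements from the paper.

```python
import math


def rank_row(accuracies):
    """Rank classifiers on one dataset: 1 = highest accuracy; ties share the mean rank."""
    order = sorted(accuracies, key=accuracies.get, reverse=True)
    ranks, i = {}, 0
    while i < len(order):
        j = i
        # Extend j over any run of classifiers tied with position i.
        while j + 1 < len(order) and accuracies[order[j + 1]] == accuracies[order[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2  # positions are 1-based
        for name in order[i:j + 1]:
            ranks[name] = mean_rank
        i = j + 1
    return ranks


def z_statistic(rank_control, rank_other, k, n):
    """Rank-difference statistic for k classifiers compared over n datasets (assumed formula)."""
    return (rank_control - rank_other) / math.sqrt(k * (k + 1) / (6 * n))


# Self-training accuracies from the first dataset row of the Section 7.1 table
# (assumed mapping: ST-DT, ST-C4.4, ST-GT, ST-C4G, ST-NB).
first_row = {"Decision tree": 69.00, "C4.4": 67.00, "J48graft": 70.01,
             "C4.4graft": 70.12, "NBTree": 75.75}
first_row_ranks = rank_row(first_row)

# Rank columns copied from the rank table above.
decision_tree_ranks = [4] + [5] * 13
c44graft_ranks = [2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2]
avg_dt = sum(decision_tree_ranks) / len(decision_tree_ranks)
avg_c44graft = sum(c44graft_ranks) / len(c44graft_ranks)

z_c44graft = z_statistic(avg_dt, avg_c44graft, k=5, n=14)
print(first_row_ranks, round(avg_dt, 2), round(avg_c44graft, 2), z_c44graft)
```

Using the unrounded average ranks (69/14 and 23/14) matters here: the reported Z for C4.4graft is only recovered to all printed digits without intermediate rounding.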
i | Classifier | Z | P-value | \(\alpha /(k-i)\) |
---|---|---|---|---|
1 | C4.4graft | 5.498051603 | 0.00000004 | 0.0125 |
2 | NBTree | 4.780914437 | 0.00000174 | 0.016666667 |
3 | J48graft | 3.82473155 | 0.00013092 | 0.025 |
4 | C4.4 | 2.031888636 | 0.0421658 | 0.05 |
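
The \(\alpha /(k-i)\) column has the form of a Holm-style step-down procedure: hypotheses are ordered by increasing p-value, and the i-th one is rejected while its p-value stays below \(\alpha /(k-i)\). The sketch below recomputes the p-values and thresholds from the Z column. It assumes \(\alpha = 0.05\), k = 5, and two-sided normal p-values. These assumptions are consistent with the printed numbers but are not stated in the table itself. The helper names `two_sided_p` and `holm_step_down` are illustrative.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def holm_step_down(z_by_name, k, alpha=0.05):
    """Return (i, name, p, threshold, rejected) rows, ordered by increasing p-value."""
    ordered = sorted((two_sided_p(z), name) for name, z in z_by_name.items())
    rows, still_rejecting = [], True
    for i, (p, name) in enumerate(ordered, start=1):
        threshold = alpha / (k - i)
        # Once one hypothesis is retained, all later ones are retained too.
        still_rejecting = still_rejecting and p < threshold
        rows.append((i, name, p, threshold, still_rejecting))
    return rows


# Z values copied from the table above.
z_values = {
    "C4.4graft": 5.498051603,
    "NBTree": 4.780914437,
    "J48graft": 3.82473155,
    "C4.4": 2.031888636,
}

holm_rows = holm_step_down(z_values, k=5)
for i, name, p, threshold, rejected in holm_rows:
    print(f"{i} {name:10s} p={p:.8f} threshold={threshold:.9f} rejected={rejected}")
```

Under these assumptions all four hypotheses are rejected, since each p-value lies below its threshold.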
7.2 Self-training with single classifier and global distance-based measure
Supervised learning: DT, C4.4, GT, C4G, NB. Self-training: ST-DT, ST-C4.4, ST-GT, ST-C4G, ST-NB.

Dataset | DT | C4.4 | GT | C4G | NB | ST-DT | ST-C4.4 | ST-GT | ST-C4G | ST-NB |
---|---|---|---|---|---|---|---|---|---|---|
Breast-cancer | 68.00 | 66.25 | 68.00 | 66.25 | 72.50 | 70.07 | 70.12 | 70.25 | 71.27 | 76.14 |
Bupa | 58.62 | 58.62 | 58.62 | 58.62 | 58.20 | 61.60 | 61.60 | 61.80 | 62.37 | 61.76 |
Car | 86.08 | 85.48 | 86.08 | 85.48 | 85.08 | 87.04 | 87.56 | 87.28 | 89.01 | 88.04 |
Cmc | 57.00 | 56.75 | 57.00 | 56.75 | 54.25 | 59.00 | 60.01 | 60.77 | 61.01 | 60.00 |
Colic | 72.83 | 70.56 | 72.84 | 70.56 | 74.60 | 74.29 | 75.17 | 75.49 | 76.25 | 77.34 |
Diabetes | 67.82 | 67.51 | 68.46 | 68.14 | 71.14 | 69.80 | 70.80 | 70.90 | 71.20 | 72.40 |
Heart | 67.27 | 68.63 | 67.27 | 68.63 | 71.81 | 68.50 | 71.61 | 70.27 | 71.63 | 74.00 |
Hepatitis | 76.00 | 76.00 | 76.00 | 76.00 | 78.40 | 77.12 | 77.29 | 78.71 | 80.97 | 82.40 |
Ionosphere | 70.47 | 70.47 | 70.37 | 70.37 | 79.79 | 72.62 | 71.61 | 72.29 | 73.43 | 82.15 |
Liver | 56.80 | 57.00 | 57.00 | 57.00 | 57.00 | 57.28 | 60.88 | 61.10 | 60.80 | 58.32 |
Sonar | 63.40 | 63.40 | 63.40 | 63.40 | 59.60 | 64.12 | 65.00 | 65.10 | 66.43 | 64.31 |
Tic-tac-toe | 66.40 | 63.40 | 66.40 | 63.80 | 65.20 | 67.40 | 66.02 | 68.40 | 67.33 | 67.17 |
Vote | 89.08 | 89.05 | 89.08 | 88.05 | 90.00 | 90.43 | 91.37 | 92.12 | 92.18 | 92.15 |
Wave | 83.10 | 82.85 | 84.10 | 83.60 | 84.75 | 84.63 | 85.13 | 85.67 | 86.75 | 88.00 |
Aesthetics | 53.47 | 54.12 | 54.40 | 54.48 | 54.40 | 57.70 | 57.87 | 56.70 | 59.22 | 60.45 |
Recency | 65.27 | 66.01 | 66.70 | 67.50 | 67.00 | 68.08 | 70.71 | 70.27 | 72.47 | 74.36 |
7.3 Self-training with an ensemble of trees
Dataset | RFG | ST-RFG | RREP | ST-RREP | RG | ST-RG | RNB | ST-RNB |
---|---|---|---|---|---|---|---|---|
Breast-cancer | 66.25 | 68.75 | 68.50 | 69.50 | 68.50 | 71.50 | 74.50 | 75.50 |
Bupa | 57.34 | 60.32 | 56.45 | 57.72 | 55.04 | 59.38 | 58.40 | 58.40 |
Car | 87.00 | 88.32 | 77.60 | 80.40 | 80.60 | 83.02 | 78.00 | 80.80 |
Cmc | 60.25 | 63.25 | 58.75 | 59.63 | 59.50 | 63.70 | 57.25 | 59.38 |
Colic | 75.00 | 74.90 | 67.32 | 71.65 | 77.50 | 79.60 | 76.77 | 79.26 |
Diabetes | 69.56 | 71.66 | 70.66 | 70.75 | 67.82 | 70.56 | 70.04 | 72.29 |
Heart | 74.99 | 76.22 | 72.04 | 74.58 | 73.18 | 76.09 | 70.91 | 73.40 |
Hepatitis | 80.00 | 80.80 | 80.00 | 80.00 | 80.40 | 81.80 | 79.60 | 82.00 |
Ionosphere | 80.00 | 82.00 | 71.20 | 73.80 | 73.04 | 77.76 | 78.31 | 81.10 |
Liver | 56.60 | 58.14 | 56.00 | 58.40 | 61.40 | 63.00 | 56.80 | 58.00 |
Sonar | 63.60 | 67.20 | 59.20 | 60.60 | 63.40 | 64.80 | 59.80 | 61.20 |
Tic-tac-toe | 70.00 | 71.40 | 67.20 | 67.60 | 69.60 | 69.60 | 68.20 | 70.40 |
Vote | 91.25 | 93.25 | 88.78 | 90.25 | 89.00 | 93.00 | 88.78 | 92.50 |
Wave | 86.00 | 88.50 | 85.75 | 86.75 | 87.25 | 89.50 | 88.75 | 89.75 |
7.4 Self-training with ensemble classifier and distance-based measure
Supervised learning: RFG, RREP, RG, RNB. Self-training: ST-RFG, ST-RREP, ST-RG, ST-RNB.

Dataset | RFG | RREP | RG | RNB | ST-RFG | ST-RREP | ST-RG | ST-RNB |
---|---|---|---|---|---|---|---|---|
Breast-cancer | 66.25 | 68.50 | 68.50 | 74.50 | 69.80 | 70.50 | 71.95 | 76.93 |
Bupa | 57.34 | 56.45 | 55.04 | 58.40 | 61.60 | 60.76 | 61.06 | 60.04 |
Car | 87.00 | 77.60 | 80.60 | 78.00 | 89.27 | 81.40 | 84.29 | 81.18 |
Cmc | 60.25 | 58.75 | 59.50 | 57.25 | 64.00 | 60.00 | 64.12 | 60.10 |
Colic | 75.00 | 67.32 | 77.50 | 76.77 | 75.21 | 71.15 | 80.60 | 79.32 |
Diabetes | 69.56 | 70.66 | 67.82 | 70.04 | 71.80 | 70.98 | 71.80 | 72.00 |
Heart | 74.99 | 72.04 | 73.18 | 70.91 | 77.01 | 74.15 | 76.62 | 74.27 |
Hepatitis | 80.00 | 80.00 | 80.40 | 79.60 | 81.81 | 81.25 | 82.15 | 83.00 |
Ionosphere | 80.00 | 71.20 | 73.04 | 78.31 | 82.34 | 74.79 | 78.56 | 81.91 |
Liver | 56.60 | 56.00 | 61.40 | 56.80 | 59.00 | 57.92 | 63.60 | 58.52 |
Sonar | 63.60 | 59.20 | 63.40 | 59.80 | 69.07 | 64.15 | 69.23 | 63.14 |
Tic-tac-toe | 70.00 | 67.20 | 69.60 | 68.20 | 72.01 | 69.34 | 70.67 | 71.33 |
Vote | 91.25 | 88.78 | 89.00 | 88.78 | 93.25 | 91.53 | 93.17 | 93.79 |
Wave | 86.00 | 85.75 | 87.25 | 88.75 | 88.15 | 86.97 | 89.50 | 89.86 |
Aesthetics | 58.55 | 60.88 | 63.88 | 60.41 | 61.91 | 61.04 | 68.91 | 65.01 |
Recency | 68.49 | 70.37 | 70.37 | 71.29 | 71.87 | 72.30 | 78.74 | 75.57 |
7.5 Sensitivity to the number of trees
7.6 Sensitivity to the amount of unlabeled data
7.7 Sensitivity to the threshold parameter
8 Multiclass classification
Dataset | # Samples | # Attributes | # Classes |
---|---|---|---|
Balance | 625 | 4 | 3 |
Car | 1,728 | 6 | 4 |
Cmc | 1,473 | 9 | 3 |
Iris | 150 | 4 | 3 |
Vehicle | 846 | 19 | 4 |

Self-training with ensemble classifiers

Datasets | DT | ST-DT | RFG | ST-RFG | RG | ST-RG | RNB | ST-RNB |
---|---|---|---|---|---|---|---|---|
Balance | 63.59 | 63.11 | 67.96 | 69.41 | 68.44 | 70.38 | 66.99 | 66.99 |
Car | 76.94 | 77.29 | 79.23 | 81.34 | 72.71 | 73.50 | 72.00 | 75.18 |
Cmc | 42.77 | 44.42 | 45.46 | 47.32 | 46.28 | 50.00 | 44.70 | 48.02 |
Iris | 77.08 | 77.08 | 77.08 | 81.25 | 89.58 | 95.83 | 91.67 | 95.83 |
Vehicle | 53.99 | 54.71 | 59.42 | 61.96 | 59.42 | 64.13 | 64.85 | 65.22 |

Self-training with single classifiers

Datasets | DT | ST-DT | NB | ST-NB | C4G | ST-C4G |
---|---|---|---|---|---|---|
Balance | 63.59 | 63.11 | 67.47 | 71.84 | 65.53 | 67.96 |
Car | 76.94 | 77.29 | 76.76 | 78.52 | 74.64 | 76.23 |
Cmc | 42.77 | 44.42 | 42.36 | 45.04 | 42.96 | 45.46 |
Iris | 77.08 | 77.08 | 89.58 | 91.67 | 75.00 | 75.00 |
Vehicle | 53.99 | 54.71 | 61.59 | 62.68 | 53.99 | 55.70 |

Classes (Balance) | DT P | DT R | DT AUC | ST-DT P | ST-DT R | ST-DT AUC | NB P | NB R | NB AUC | ST-NB P | ST-NB R | ST-NB AUC | C4G P | C4G R | C4G AUC | ST-C4G P | ST-C4G R | ST-C4G AUC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Class 1 | 0.74 | 0.63 | 0.80 | 0.68 | 0.71 | 0.77 | 0.63 | 0.87 | 0.74 | 0.68 | 0.85 | 0.85 | 0.71 | 0.76 | 0.76 | 0.75 | 0.74 | 0.79 |
Class 2 | 0.07 | 0.13 | 0.42 | 0.07 | 0.13 | 0.44 | 0.0 | 0.0 | 0.56 | 0.0 | 0.0 | 0.47 | 0.0 | 0.0 | 0.44 | 0.04 | 0.06 | 0.60 |
Class 3 | 0.73 | 0.73 | 0.76 | 0.78 | 0.64 | 0.74 | 0.79 | 0.62 | 0.77 | 0.77 | 0.71 | 0.83 | 0.70 | 0.72 | 0.76 | 0.69 | 0.66 | 0.79 |

Classes (Balance) | RFG P | RFG R | RFG AUC | ST-RFG P | ST-RFG R | ST-RFG AUC | RG P | RG R | RG AUC | ST-RG P | ST-RG R | ST-RG AUC | RNB P | RNB R | RNB AUC | ST-RNB P | ST-RNB R | ST-RNB AUC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Class 1 | 0.71 | 0.77 | 0.77 | 0.72 | 0.75 | 0.83 | 0.68 | 0.78 | 0.84 | 0.68 | 0.83 | 0.85 | 0.76 | 0.59 | 0.73 | 0.78 | 0.60 | 0.86 |
Class 2 | 0.0 | 0.0 | 0.46 | 0.09 | 0.13 | 0.60 | 0.0 | 0.0 | 0.53 | 0.0 | 0.0 | 0.55 | 0.0 | 0.0 | 0.47 | 0.0 | 0.0 | 0.44 |
Class 3 | 0.80 | 0.74 | 0.81 | 0.79 | 0.71 | 0.85 | 0.72 | 0.71 | 0.86 | 0.78 | 0.70 | 0.87 | 0.62 | 0.86 | 0.72 | 0.65 | 0.86 | 0.85 |

Classes (Cmc) | DT P | DT R | DT AUC | ST-DT P | ST-DT R | ST-DT AUC | NB P | NB R | NB AUC | ST-NB P | ST-NB R | ST-NB AUC | C4G P | C4G R | C4G AUC | ST-C4G P | ST-C4G R | ST-C4G AUC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Class 1 | 0.47 | 0.63 | 0.62 | 0.49 | 0.75 | 0.61 | 0.46 | 0.45 | 0.54 | 0.49 | 0.6 | 0.57 | 0.48 | 0.56 | 0.58 | 0.54 | 0.47 | 0.57 |
Class 2 | 0.32 | 0.27 | 0.64 | 0.37 | 0.60 | 0.68 | 0.35 | 0.40 | 0.64 | 0.47 | 0.32 | 0.67 | 0.29 | 0.27 | 0.58 | 0.37 | 0.32 | 0.64 |
Class 3 | 0.38 | 0.27 | 0.56 | 1.0 | 0.02 | 0.60 | 0.42 | 0.39 | 0.54 | 0.39 | 0.41 | 0.56 | 0.43 | 0.36 | 0.61 | 0.42 | 0.52 | 0.61 |

Classes (Cmc) | RFG P | RFG R | RFG AUC | ST-RFG P | ST-RFG R | ST-RFG AUC | RG P | RG R | RG AUC | ST-RG P | ST-RG R | ST-RG AUC | RNB P | RNB R | RNB AUC | ST-RNB P | ST-RNB R | ST-RNB AUC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Class 1 | 0.50 | 0.48 | 0.60 | 0.50 | 0.70 | 0.61 | 0.48 | 0.63 | 0.62 | 0.48 | 0.86 | 0.65 | 0.49 | 0.55 | 0.59 | 0.56 | 0.58 | 0.61 |
Class 2 | 0.36 | 0.31 | 0.67 | 0.37 | 0.18 | 0.56 | 0.37 | 0.23 | 0.66 | 0.58 | 0.18 | 0.69 | 0.42 | 0.28 | 0.59 | 0.43 | 0.31 | 0.65 |
Class 3 | 0.45 | 0.51 | 0.60 | 0.47 | 0.39 | 0.59 | 0.45 | 0.40 | 0.60 | 0.57 | 0.29 | 0.62 | 0.40 | 0.42 | 0.57 | 0.44 | 0.5 | 0.58 |