Published in: International Journal of Machine Learning and Cybernetics 1/2014

01.02.2014 | Original Article

The effect of varying levels of class distribution on bagging for different algorithms: An empirical study

Authors: Guohua Liang, Xingquan Zhu, Chengqi Zhang



Abstract

Many real-world applications involve highly imbalanced class distributions. Learning from imbalanced class distributions is considered one of the ten most challenging problems in data mining research, and it has increasingly captured the attention of both academia and industry. In this work, we study the effect of different levels of imbalanced class distribution on bagging predictors, using under-sampling techniques. Despite the popularity of bagging in many real-world applications, some questions have not been clearly answered in the existing research, for example, how varying the level of class distribution affects different bagging predictors, and whether bagging remains superior to single learners as the level of class distribution changes. Most classification learning algorithms are designed to maximize the overall accuracy rate and assume that training instances are uniformly distributed; however, overall accuracy does not reflect correct prediction on the minority class, which is typically the class of interest to users. The overall accuracy metric is therefore ineffective for evaluating classifier performance on extremely imbalanced data. This study investigates the effect of varying levels of class distribution on different bagging predictors, using the area under the receiver operating characteristic (ROC) curve (AUC) as the performance metric and applying an under-sampling technique to 14 data sets with imbalanced class distributions. Our experimental results indicate that Decision Table (DTable) and RepTree are the learning algorithms with the best bagging AUC performance. The AUC performance of bagging predictors is statistically superior to that of single learners, with the exceptions of Support Vector Machines (SVM) and Decision Stump (DStump).
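The procedure the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the paper's actual setup: it uses synthetic data and scikit-learn's `BaggingClassifier` with a decision tree base learner (rather than the 14 benchmark data sets and WEKA learners such as DTable and RepTree used in the study). It under-samples the majority class to several target class distributions, then compares a single learner against its bagged ensemble by AUC, the metric the paper argues for over accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic imbalanced data: roughly 5% minority class.
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05],
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def undersample(X, y, minority_ratio, rng):
    """Randomly drop majority-class instances until the minority class
    makes up `minority_ratio` of the training set."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    n_maj = int(len(min_idx) * (1 - minority_ratio) / minority_ratio)
    keep = np.concatenate([min_idx, rng.choice(maj_idx, n_maj, replace=False)])
    return X[keep], y[keep]

for ratio in (0.1, 0.3, 0.5):  # varying levels of class distribution
    Xb, yb = undersample(X_tr, y_tr, ratio, rng)
    single = DecisionTreeClassifier(random_state=0).fit(Xb, yb)
    bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                               random_state=0).fit(Xb, yb)
    # AUC is computed on the untouched (still imbalanced) test set.
    auc_single = roc_auc_score(y_te, single.predict_proba(X_te)[:, 1])
    auc_bagged = roc_auc_score(y_te, bagged.predict_proba(X_te)[:, 1])
    print(f"minority={ratio:.0%}  AUC single={auc_single:.3f}  "
          f"bagged={auc_bagged:.3f}")
```

Note that only the training set is rebalanced; evaluating AUC on the original test distribution keeps the comparison faithful to the deployment setting, which is the point of using AUC rather than accuracy here.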

Metadata
Title
The effect of varying levels of class distribution on bagging for different algorithms: An empirical study
Authors
Guohua Liang
Xingquan Zhu
Chengqi Zhang
Publication date
01.02.2014
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 1/2014
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-012-0125-5
