
2016 | OriginalPaper | Book chapter

Imbalance Effects on Classification Using Binary Logistic Regression

Written by: Hezlin Aryani Abd Rahman, Bee Wah Yap

Published in: Soft Computing in Data Science

Publisher: Springer Singapore


Abstract

Class imbalance in the data affects the performance of classifiers. In predictive analytics, logistic regression is a statistical technique often used as a benchmark when other classifiers, such as Naïve Bayes, decision trees, artificial neural networks and support vector machines, are applied to a classification problem. This study investigates, via simulation, the effect of the imbalance ratio in the response variable on the parameter estimates of binary logistic regression. Datasets were simulated with controlled imbalance ratios (IR) from 1 % to 50 % and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using the mean square error (MSE). The simulation results provide evidence that the imbalance ratio affects the parameter estimates: severe imbalance (IR = 1 %, 2 %, 5 %) yields higher MSE. Additionally, the effects of high imbalance (IR ≤ 5 %) are more severe when the sample size is small (n = 100 and n = 500). Further investigation using real datasets from the UCI repository (Bupa Liver, n = 345, and Diabetes Messidor, n = 1151) confirmed the effect of the imbalance ratio on the parameter estimates and the odds ratios, which can lead to misleading results.
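The simulation design described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the single standard-normal covariate, the true slope of 1.0, the replicate count, the Newton-Raphson fitting routine, and the rule of discarding replicates that fail to converge (for example, when no minority cases are drawn) are all assumptions made here for illustration. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function (only exponentiates non-positive values).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibrate_intercept(target_rate, beta1, rng, m=5000):
    # Assumption: choose the intercept by bisection so that the average event
    # probability over a reference sample of x ~ N(0, 1) matches the target IR.
    xs = [rng.gauss(0.0, 1.0) for _ in range(m)]
    lo, hi = -15.0, 15.0
    for _ in range(40):
        mid = (lo + hi) / 2.0
        rate = sum(sigmoid(mid + beta1 * x) for x in xs) / m
        if rate < target_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def simulate(n, beta0, beta1, rng):
    # Draw one dataset from the model logit P(y = 1 | x) = beta0 + beta1 * x.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(beta0 + beta1 * x) else 0 for x in xs]
    return xs, ys


def fit_logistic(xs, ys, iters=25, tol=1e-8):
    # Maximum likelihood by Newton-Raphson; returns (b0, b1), or None when the
    # fit does not converge (e.g. a single class or separated data).
    if sum(ys) in (0, len(ys)):
        return None
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            return None
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            return b0, b1
    return None


def slope_mse(n, ir, beta1=1.0, reps=200, seed=0):
    # MSE of the estimated slope over converged replicates, plus how many
    # replicates were actually used.
    rng = random.Random(seed)
    beta0 = calibrate_intercept(ir, beta1, rng)
    sq_errors = []
    for _ in range(reps):
        xs, ys = simulate(n, beta0, beta1, rng)
        fit = fit_logistic(xs, ys)
        if fit is not None:
            sq_errors.append((fit[1] - beta1) ** 2)
    if not sq_errors:
        return float("nan"), 0
    return sum(sq_errors) / len(sq_errors), len(sq_errors)


if __name__ == "__main__":
    for n in (100, 500):
        for ir in (0.01, 0.05, 0.50):
            mse, used = slope_mse(n, ir, reps=100)
            print(f"n={n:4d}  IR={ir:.2f}  MSE(slope)={mse:.4f}  replicates used={used}")
```

Because non-converged replicates are dropped, the MSE at very low IR and small n is computed on fewer datasets and probably understates how unstable the estimates are; the count of replicates used is printed alongside for that reason.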

Metadata
Title
Imbalance Effects on Classification Using Binary Logistic Regression
Written by
Hezlin Aryani Abd Rahman
Bee Wah Yap
Copyright year
2016
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-10-2777-2_12
