23-02-2023

Finding the Proverbial Needle: Improving Minority Class Identification Under Extreme Class Imbalance

Authors: Trent Geisler, Herman Ray, Ying Xie

Published in: Journal of Classification | Issue 1/2023

Abstract

Imbalanced learning problems typically consist of data with skewed class distributions, coupled with large misclassification costs for the rare events. For binary classification, logistic regression is a common supervised learning technique chosen to perform this task. Unfortunately, the model performs poorly on classification tasks when class distributions are highly imbalanced. To improve generalization in this setting, we implement a novel instance-level weighting methodology for the minority class in the loss function. We build our method from a recently published, locally weighted log-likelihood objective function, where each of the minority class weights is learned from the data. We improve upon this previous approach by creating a convex and hyperparameter-free loss function that improves generalization performance for datasets exhibiting extreme class imbalance.

Footnotes
1
The model training process is synonymous with the optimization process: the parameter estimates are updated iteratively to find the minimum overall loss value. Because the loss function is convex, a wide range of optimization algorithms can be used during training. In this particular instance, we minimize the loss function with the Newton-CG method from the optimize module of Python's SciPy library (Virtanen et al., 2020).
 
2
The data that support the findings of this study are available at https://github.com/x46182/proverbial_needle_manuscript.
 
3
The threshold value where sensitivity is equal to specificity is also evaluated.
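
Footnotes 1 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the loss below is an ordinary instance-weighted logistic negative log-likelihood with fixed class-balance weights standing in for the learned minority-class weights, the data are synthetic, and all function and variable names (`loss`, `grad`, `hess`, `equal_sens_spec_threshold`, `true_beta`) are assumptions made for illustration. Only the use of `scipy.optimize.minimize` with `method="Newton-CG"` follows footnote 1.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit


def loss(beta, X, y, w):
    """Instance-weighted logistic negative log-likelihood (stand-in loss)."""
    z = X @ beta
    # Per-row loss is log(1 + exp(z)) - y * z, computed stably.
    return np.sum(w * (np.logaddexp(0.0, z) - y * z))


def grad(beta, X, y, w):
    """Gradient of the loss; Newton-CG requires it."""
    p = expit(X @ beta)
    return X.T @ (w * (p - y))


def hess(beta, X, y, w):
    """Hessian X^T diag(w p (1 - p)) X; positive semidefinite when w >= 0."""
    p = expit(X @ beta)
    return (X * (w * p * (1.0 - p))[:, None]).T @ X


def equal_sens_spec_threshold(scores, y, grid=np.linspace(0.0, 1.0, 1001)):
    """Grid threshold where sensitivity and specificity are closest."""
    pred = scores[None, :] >= grid[:, None]      # shape (n_grid, n_samples)
    sens = pred[:, y == 1].mean(axis=1)          # true positive rate
    spec = (~pred[:, y == 0]).mean(axis=1)       # true negative rate
    i = np.argmin(np.abs(sens - spec))
    return grid[i], sens[i], spec[i]


# Synthetic imbalanced data (hypothetical; not the study's datasets).
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-3.0, 2.0, -1.0])
y = (rng.random(n) < expit(X @ true_beta)).astype(float)

# Fixed weights: up-weight the minority class by the class ratio (assumption).
w = np.where(y == 1.0, (y == 0.0).sum() / (y == 1.0).sum(), 1.0)

result = minimize(loss, x0=np.zeros(3), args=(X, y, w),
                  method="Newton-CG", jac=grad, hess=hess)

scores = expit(X @ result.x)
threshold, sens, spec = equal_sens_spec_threshold(scores, y)
print("estimated coefficients:", result.x)
print("threshold:", threshold, "sensitivity:", sens, "specificity:", spec)
```

Because the Hessian is positive semidefinite for non-negative weights, this stand-in loss is convex, which is the property footnote 1 relies on when it notes that many optimizers would work. The grid search for the threshold is a simple choice for the sketch; scanning the sorted predicted scores would give the crossing point exactly.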
 
Literature
Beyan, C., & Fisher, R. (2015). Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognition, 48(5), 1653–1672.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Branco, P., Torgo, L., & Ribeiro, R. P. (2016). A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (CSUR), 49(2), 1–50.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine quality. UCI Machine Learning Repository.
Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on machine learning (pp. 233–240).
Denil, M., & Trappenberg, T. (2010). Overlap versus imbalance. Canadian conference on artificial intelligence (pp. 220–231).
Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., & Herrera, F. (2018). Learning from imbalanced data sets Vol. 11. New York: Springer.
Geisler, T., Priestley, J., Ray, H., & Xie, Y. (2022). Improving minority class detection in imbalanced learning applications with a logistic regression novel loss function. Manuscript Under Development.
Geisler, T., Ray, H., & Xie, Y. (2022). Novel neural network loss for extreme class imbalance. Manuscript Under Development.
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications, 73, 220–239.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: a systematic study. Intelligent Data Analysis, 6(5), 429–449.
King, G., & Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 9(2), 137–163.
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical models Vol. 5. Boston: McGraw-Hill Irwin.
Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).
Loyola-González, O., Martínez-Trinidad, J. F., Carrasco-Ochoa, J. A., & García-Borroto, M. (2016). Study of the impact of resampling methods for contrast pattern based classifiers in imbalanced databases. Neurocomputing, 175, 935–947.
López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113–141.
Maalouf, M., Homouz, D., & Trafalis, T. B. (2018). Logistic regression in large rare events and imbalanced data: a performance comparison of prior correction and weighting methods. Computational Intelligence, 34(1), 161–174.
Maloof, M. A. (2003). Learning when data sets are imbalanced and when costs are unequal and unknown. ICML-2003 workshop on learning from imbalanced data sets II (Vol. 2 pp. 2–1).
Manski, C. F., & Lerman, S. R. (1977). The estimation of choice probabilities from choice based samples. Econometrica: Journal of the Econometric Society, 1977–1988.
Roberts, A. W. (1993). Convex functions. Handbook of convex geometry (pp. 1081–1104). Amsterdam: Elsevier.
Stefanowski, J. (2016). Dealing with data difficulty factors while learning from imbalanced data. Challenges in computational statistics and data mining (pp. 333–363). Springer.
Wallace, B. C., Small, K., Brodley, C. E., & Trikalinos, T. A. (2011). Class imbalance, redux. 2011 IEEE 11th international conference on data mining (pp. 754–763).
Yang, X., Song, Q., & Wang, Y. (2007). A weighted support vector machine for data classification. International Journal of Pattern Recognition and Artificial Intelligence, 21(05), 961–976.
Zhang, L., Priestley, J., & Ni, X. (2018). Influence of the event rate on discrimination abilities of bankruptcy prediction models. arXiv:1803.03756.
Metadata
Title
Finding the Proverbial Needle: Improving Minority Class Identification Under Extreme Class Imbalance
Authors
Trent Geisler
Herman Ray
Ying Xie
Publication date
23-02-2023
Publisher
Springer US
Published in
Journal of Classification / Issue 1/2023
Print ISSN: 0176-4268
Electronic ISSN: 1432-1343
DOI
https://doi.org/10.1007/s00357-023-09431-5
