23.02.2023

Finding the Proverbial Needle: Improving Minority Class Identification Under Extreme Class Imbalance

Written by: Trent Geisler, Herman Ray, Ying Xie

Published in: Journal of Classification | Issue 1/2023


Abstract

Imbalanced learning problems typically consist of data with skewed class distributions, coupled with large misclassification costs for the rare events. For binary classification, logistic regression is a common supervised learning technique. Unfortunately, the model performs poorly on classification tasks when class distributions are highly imbalanced. To improve generalization in this setting, we implement a novel instance-level weighting methodology for the minority class in the loss function. We build our method on a recently published, locally weighted log-likelihood objective function, in which each minority class weight is learned from the data. We improve upon this previous approach by creating a convex and hyperparameter-free loss function that improves generalization performance for datasets exhibiting extreme class imbalance.
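
As a rough illustration of what instance-level weighting of the minority class in a logistic regression loss can look like, the generic form below may help. It is a sketch under stated assumptions and not the loss proposed in the article: the weights w_i are taken as fixed, nonnegative constants, whereas the abstract says the minority class weights are learned from the data. The symbols x_i, y_i, β, w_i, and p_i are notation introduced here for illustration.

```latex
% Assumed notation: y_i \in \{0,1\} with y_i = 1 the minority class,
% x_i the feature vector, \beta the coefficients, w_i \ge 0 a fixed weight.
\[
  p_i = \frac{1}{1 + e^{-x_i^{\top}\beta}}, \qquad
  \mathcal{L}(\beta) = -\sum_{i=1}^{n}
    \Bigl[\, w_i \, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Bigr].
\]
```

With fixed nonnegative weights, each term is convex in β, so the sum is convex as well. Setting every w_i = 1 recovers the ordinary logistic regression negative log-likelihood.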

Footnotes
1
Model training is an optimization process in which the parameter estimates are updated to find the minimum overall loss value. Since we have a convex loss function, we can use a wide range of optimization algorithms during training. In this particular instance, we use the optimize module of Python's SciPy library to minimize the loss function with the Newton-CG method (Virtanen et al., 2020).
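
The optimization step described in this footnote can be sketched as follows. This is a minimal example, assuming a generic instance-weighted logistic loss with fixed weights on the minority class; it is not the article's loss function, and the names `weighted_nll`, `weighted_nll_grad`, and `fit` are hypothetical. Only `scipy.optimize.minimize` with `method="Newton-CG"` comes from the footnote itself.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit


def weighted_nll(beta, X, y, w):
    """Negative log-likelihood with weights w applied to minority (y = 1) rows.

    Assumption: w holds fixed, nonnegative weights. The article learns its
    weights from the data, which this sketch does not attempt to reproduce.
    """
    z = X @ beta
    # -log(p) = log(1 + exp(-z)) and -log(1 - p) = log(1 + exp(z)),
    # both evaluated stably with logaddexp.
    return np.sum(w * y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))


def weighted_nll_grad(beta, X, y, w):
    """Gradient of weighted_nll with respect to beta."""
    p = expit(X @ beta)
    return X.T @ ((1.0 - y) * p - w * y * (1.0 - p))


def fit(X, y, w):
    """Minimize the weighted loss with Newton-CG, starting from zeros."""
    beta0 = np.zeros(X.shape[1])
    result = minimize(
        weighted_nll,
        beta0,
        args=(X, y, w),
        jac=weighted_nll_grad,  # Newton-CG requires the gradient
        method="Newton-CG",
    )
    return result.x
```

Newton-CG needs the gradient, which is why `jac` is supplied. When no Hessian is passed, SciPy approximates Hessian-vector products from the gradient by finite differences.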
 
2
The data that support the findings of this study are available at https://github.com/x46182/proverbial_needle_manuscript.
 
3
The threshold value where sensitivity is equal to specificity is also evaluated.
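
One way to locate such a threshold is to scan the observed scores and keep the cutoff at which sensitivity and specificity are closest. The function below is a minimal sketch under that assumption; the name `equal_sens_spec_threshold` is hypothetical, and the article may determine the threshold differently. It expects both classes to be present in `y_true`.

```python
import numpy as np


def equal_sens_spec_threshold(y_true, scores):
    """Return (threshold, gap) where |sensitivity - specificity| is smallest.

    A row is predicted positive when its score is >= the threshold.
    Candidate thresholds are the distinct observed scores.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best_t, best_gap = None, np.inf
    for t in np.unique(scores):
        pred = scores >= t
        sens = np.mean(pred[y_true])     # TP / (TP + FN)
        spec = np.mean(~pred[~y_true])   # TN / (TN + FP)
        gap = abs(sens - spec)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t, best_gap
```

With a finite sample the two rates rarely coincide exactly, so the function also returns the remaining gap.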
 
References
Beyan, C., & Fisher, R. (2015). Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognition, 48(5), 1653–1672.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Branco, P., Torgo, L., & Ribeiro, R. P. (2016). A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (CSUR), 49(2), 1–50.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine quality. UCI Machine Learning Repository.
Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on machine learning (pp. 233–240).
Denil, M., & Trappenberg, T. (2010). Overlap versus imbalance. Canadian conference on artificial intelligence (pp. 220–231).
Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., & Herrera, F. (2018). Learning from imbalanced data sets Vol. 11. New York: Springer.
Geisler, T., Priestley, J., Ray, H., & Xie, Y. (2022). Improving minority class detection in imbalanced learning applications with a logistic regression novel loss function. Manuscript Under Development.
Geisler, T., Ray, H., & Xie, Y. (2022). Novel neural network loss for extreme class imbalance. Manuscript Under Development.
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications, 73, 220–239.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: a systematic study. Intelligent Data Analysis, 6(5), 429–449.
King, G., & Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 9(2), 137–163.
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical models Vol. 5. Boston: McGraw-Hill Irwin.
Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).
Loyola-González, O., Martínez-Trinidad, J. F., Carrasco-Ochoa, J. A., & García-Borroto, M. (2016). Study of the impact of resampling methods for contrast pattern based classifiers in imbalanced databases. Neurocomputing, 175, 935–947.
López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113–141.
Maalouf, M., Homouz, D., & Trafalis, T. B. (2018). Logistic regression in large rare events and imbalanced data: a performance comparison of prior correction and weighting methods. Computational Intelligence, 34(1), 161–174.
Maloof, M. A. (2003). Learning when data sets are imbalanced and when costs are unequal and unknown. ICML-2003 workshop on learning from imbalanced data sets II (Vol. 2 pp. 2–1).
Manski, C. F., & Lerman, S. R. (1977). The estimation of choice probabilities from choice based samples. Econometrica: Journal of the Econometric Society, 1977–1988.
Roberts, A. W. (1993). Convex functions. Handbook of convex geometry (pp. 1081–1104). Amsterdam: Elsevier.
Stefanowski, J. (2016). Dealing with data difficulty factors while learning from imbalanced data. Challenges in computational statistics and data mining (pp. 333–363). Springer.
Wallace, B. C., Small, K., Brodley, C. E., & Trikalinos, T. A. (2011). Class imbalance, redux. 2011 IEEE 11th international conference on data mining (pp. 754–763).
Yang, X., Song, Q., & Wang, Y. (2007). A weighted support vector machine for data classification. International Journal of Pattern Recognition and Artificial Intelligence, 21(05), 961–976.
Zhang, L., Priestley, J., & Ni, X. (2018). Influence of the event rate on discrimination abilities of bankruptcy prediction models. arXiv:1803.03756.
Metadata
Title
Finding the Proverbial Needle: Improving Minority Class Identification Under Extreme Class Imbalance
Authors
Trent Geisler
Herman Ray
Ying Xie
Publication date
23.02.2023
Publisher
Springer US
Published in
Journal of Classification / Issue 1/2023
Print ISSN: 0176-4268
Electronic ISSN: 1432-1343
DOI
https://doi.org/10.1007/s00357-023-09431-5