Sampling Bias and Class Imbalance in Maximum-likelihood Logistic Regression

Oommen, Thomas; Baise, Laurie G.; Vogel, Richard M.

doi:10.1007/s11004-010-9311-8

Published: 15 October 2010

Mathematical Geosciences, Volume 43, pages 99–120 (2011)

Thomas Oommen¹,
Laurie G. Baise¹ &
Richard M. Vogel¹

Abstract

Logistic regression is a widely used statistical method for relating a binary response variable to a set of explanatory variables, and maximum likelihood is the most commonly used method for parameter estimation. A maximum-likelihood logistic regression (MLLR) model predicts the probability of the event from binary data defining the event. Currently, MLLR models are used in a myriad of fields, including geosciences, natural hazard evaluation, medical diagnosis, homeland security, and finance. In such applications, the empirical sample data often exhibit class imbalance, where one class is represented by a large number of events while the other is represented by only a few. In addition, the data exhibit sampling bias, which occurs when the class distribution in the sample differs from the actual class distribution in the population. Previous studies have evaluated how class imbalance and sampling bias affect the predictive capability of asymptotic classification algorithms such as MLLR, yet no definitive conclusions have been reached.
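
For orientation, the standard form of the model described above can be written as follows. The notation (y_i for the binary response, x_i for the explanatory variables, beta_0 and beta for the coefficients, n for the sample size) is introduced here for illustration and is not taken from the article itself.

```latex
% Probability of the event for observation i under the logistic model
p_i = P(y_i = 1 \mid \mathbf{x}_i)
    = \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_i\right)\right]}

% Log-likelihood maximized by the MLLR estimator
\ell(\beta_0, \boldsymbol{\beta})
    = \sum_{i=1}^{n} \left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]
```

Because the intercept beta_0 absorbs the overall event rate in the fitting sample, a sample whose class distribution differs from that of the population would be expected to shift the predicted probabilities, which is the effect the abstract refers to as sampling bias.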

We hypothesize that the predictive capability of the model is related to the sampling bias associated with the data, so that the MLLR model has perfect predictability when the data have no sampling bias. We test this hypothesis using two simulated datasets with class distributions of 50:50 and 80:20, respectively. We construct a suite of controlled experiments by extracting multiple samples with varying class imbalance and sampling bias from the two simulated datasets and fitting MLLR models to each of these samples. The experiments suggest that it is more important to develop a sample that has the same class distribution as the original population than to ensure that the classes are balanced. Furthermore, reducing sampling bias by either over-sampling or under-sampling can improve the predictive capability of an MLLR model.
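
The kind of controlled experiment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code or data: the single predictor, the coefficients in `true_beta`, the population size, the sample size, and the helper names `fit_logistic_ml` and `draw_sample` are all hypothetical choices made here. The sketch simulates a population with roughly an 80:20 class distribution, draws samples with different class distributions, fits a maximum-likelihood logistic regression to each, and compares the mean predicted event probability over the population with the actual event rate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_ml(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    Xd = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xd @ beta)
        grad = Xd.T @ (y - p)                        # score vector
        hess = Xd.T @ (Xd * (p * (1 - p))[:, None])  # information matrix
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def draw_sample(X, y, n, event_fraction, rng):
    """Draw n rows without replacement with a chosen share of events (y == 1)."""
    n_events = int(round(n * event_fraction))
    idx1 = rng.choice(np.flatnonzero(y == 1), n_events, replace=False)
    idx0 = rng.choice(np.flatnonzero(y == 0), n - n_events, replace=False)
    idx = np.concatenate([idx1, idx0])
    return X[idx], y[idx]

rng = np.random.default_rng(0)

# Hypothetical population: one predictor, coefficients chosen (an assumption)
# so that events make up roughly 20% of the population.
N = 100_000
x = rng.normal(size=N)
true_beta = np.array([-1.8, 1.5])
y = (rng.random(N) < sigmoid(true_beta[0] + true_beta[1] * x)).astype(float)
pop_rate = y.mean()

# Samples that match the population class distribution (no sampling bias),
# are balanced 50:50, or over-represent events 80:20.
designs = {"unbiased": pop_rate, "balanced 50:50": 0.5, "events 80:20": 0.8}

gaps = {}
for label, frac in designs.items():
    xs, ys = draw_sample(x, y, n=2000, event_fraction=frac, rng=rng)
    beta_hat = fit_logistic_ml(xs, ys)
    mean_pred = sigmoid(beta_hat[0] + beta_hat[1] * x).mean()
    gaps[label] = abs(mean_pred - pop_rate)
    print(f"{label:>15}: sample event share={frac:.2f}, "
          f"mean predicted prob={mean_pred:.3f}, population rate={pop_rate:.3f}")
```

In this sketch the gap between the mean predicted probability and the population event rate serves only as a simple stand-in for predictive capability; the article's own evaluation metrics are not described in the abstract and may differ.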

Author information

Thomas Oommen
Present address: Dept. of Geological Engineering, Michigan Tech., Houghton, MI, 49931, USA

Authors and Affiliations

Department of Civil and Environmental Engineering, Tufts University, 113 Anderson Hall, Medford, MA, 02155, USA
Thomas Oommen, Laurie G. Baise & Richard M. Vogel

Corresponding author

Correspondence to Thomas Oommen.

About this article

Cite this article

Oommen, T., Baise, L.G. & Vogel, R.M. Sampling Bias and Class Imbalance in Maximum-likelihood Logistic Regression. Math Geosci 43, 99–120 (2011). https://doi.org/10.1007/s11004-010-9311-8

Received: 30 October 2009
Accepted: 20 September 2010
Published: 15 October 2010
Issue Date: January 2011
DOI: https://doi.org/10.1007/s11004-010-9311-8
