Sampling Bias and Class Imbalance in Maximum-likelihood Logistic Regression

Mathematical Geosciences

Abstract

Logistic regression is a widely used statistical method for relating a binary response variable to a set of explanatory variables, and maximum likelihood is the most commonly used method for estimating its parameters. A maximum-likelihood logistic regression (MLLR) model predicts the probability of an event from binary data defining that event. MLLR models are currently used in many fields, including geosciences, natural hazard evaluation, medical diagnosis, homeland security, and finance. In such applications, the empirical sample data often exhibit class imbalance, where one class is represented by a large number of events while the other is represented by only a few. The data also exhibit sampling bias, which occurs when the class distribution in the sample differs from the actual class distribution in the population. Previous studies have evaluated how class imbalance and sampling bias affect the predictive capability of asymptotic classification algorithms such as MLLR, yet no definitive conclusions have been reached.
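For readers who want the model in symbols, the standard textbook form of MLLR can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the article: y_i is the binary response, x_i the vector of explanatory variables, and (beta_0, beta) the intercept and coefficients.

```latex
\begin{align}
  p_i &= \Pr(y_i = 1 \mid \mathbf{x}_i)
       = \frac{1}{1 + \exp\!\left[-(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_i)\right]}, \\
  \ell(\beta_0, \boldsymbol{\beta})
      &= \sum_{i=1}^{n} \left[\, y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \,\right], \\
  (\hat{\beta}_0, \hat{\boldsymbol{\beta}})
      &= \arg\max_{\beta_0,\, \boldsymbol{\beta}} \; \ell(\beta_0, \boldsymbol{\beta}).
\end{align}
```

In this notation, writing \bar{y} for the fraction of events in the sample and \tau for the fraction of events in the population (both symbols are also assumptions), class imbalance means \bar{y} is far from 1/2, and sampling bias means \bar{y} differs from \tau.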

We hypothesize that the predictive capability of the model is related to the sampling bias associated with the data, so that the MLLR model has perfect predictability when the data have no sampling bias. We test this hypothesis using two simulated datasets with class distributions of 50:50 and 80:20, respectively. We construct a suite of controlled experiments by extracting multiple samples with varying class imbalance and sampling bias from the two simulated datasets and fitting an MLLR model to each of these samples. The experiments suggest that it is more important to develop a sample that has the same class distribution as the original population than to ensure that the classes are balanced. Furthermore, when sampling bias is reduced by either over-sampling or under-sampling, both sampling techniques can improve the predictive capability of an MLLR model.
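The kind of controlled experiment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code or data: the population has a single Gaussian explanatory variable (mean +1 for events, -1 for non-events, unit variance), events are taken to be the 20% class of an 80:20 population, the sizes and seed are arbitrary, and the model is fitted with a hand-written Newton-Raphson routine (`fit_mllr` and the other function names are hypothetical). Under these assumptions the true log-odds of an event are ln(0.2/0.8) + 2x, which gives a reference point for the fitted coefficients.

```python
import math
import random


def make_population(n, event_fraction, rng):
    """Simulate cases (x, y): x ~ N(+1, 1) for events, N(-1, 1) for non-events."""
    n_events = round(n * event_fraction)
    events = [(rng.gauss(1.0, 1.0), 1) for _ in range(n_events)]
    non_events = [(rng.gauss(-1.0, 1.0), 0) for _ in range(n - n_events)]
    return events, non_events


def draw_sample(events, non_events, n, event_fraction, rng):
    """Draw n cases without replacement with the requested event fraction."""
    k = round(n * event_fraction)
    return rng.sample(events, k) + rng.sample(non_events, n - k)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_mllr(data, iterations=25):
    """Maximum-likelihood fit of logit(p) = b0 + b1 * x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w             # negative Hessian (information matrix)
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def mean_predicted(b0, b1, cases):
    """Average predicted event probability over a set of cases."""
    return sum(sigmoid(b0 + b1 * x) for x, _ in cases) / len(cases)


rng = random.Random(0)  # arbitrary seed, assumption for reproducibility
events, non_events = make_population(20000, 0.20, rng)
population = events + non_events

results = {}
for label, fraction in [("unbiased", 0.20), ("balanced", 0.50)]:
    sample = draw_sample(events, non_events, 2000, fraction, rng)
    b0, b1 = fit_mllr(sample)
    results[label] = (b0, b1, mean_predicted(b0, b1, population))
    print(f"{label:9s} sample: b0={b0:+.2f}  b1={b1:+.2f}  "
          f"mean predicted P(event) in population={results[label][2]:.3f}")
```

In this toy setup, the sample that matches the population's 20% event fraction should recover an intercept near ln(0.25) and a mean predicted probability near 0.20, while the balanced sample should recover a similar slope but an intercept near zero, so it overstates the event probability when applied to the population. That is consistent with the abstract's point that matching the population's class distribution matters more than balancing the classes, though a single-variable simulation like this one is only suggestive.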



Author information

Corresponding author

Correspondence to Thomas Oommen.

About this article

Cite this article

Oommen, T., Baise, L.G. & Vogel, R.M. Sampling Bias and Class Imbalance in Maximum-likelihood Logistic Regression. Math Geosci 43, 99–120 (2011). https://doi.org/10.1007/s11004-010-9311-8


  • DOI: https://doi.org/10.1007/s11004-010-9311-8
