Abstract
Logistic regression is a widely used statistical method to relate a binary response variable to a set of explanatory variables and maximum likelihood is the most commonly used method for parameter estimation. A maximum-likelihood logistic regression (MLLR) model predicts the probability of the event from binary data defining the event. Currently, MLLR models are used in a myriad of fields including geosciences, natural hazard evaluation, medical diagnosis, homeland security, finance, and many others. In such applications, the empirical sample data often exhibit class imbalance, where one class is represented by a large number of events while the other is represented by only a few. In addition, the data also exhibit sampling bias, which occurs when there is a difference between the class distribution in the sample compared to the actual class distribution in the population. Previous studies have evaluated how class imbalance and sampling bias affect the predictive capability of asymptotic classification algorithms such as MLLR, yet no definitive conclusions have been reached.
We hypothesize that the predictive capability of the model is related to the sampling bias associated with the data so that the MLLR model has perfect predictability when the data have no sampling bias. We test our hypotheses using two simulated datasets with class distributions that are 50:50 and 80:20, respectively. We construct a suite of controlled experiments by extracting multiple samples with varying class imbalance and sampling bias from the two simulated datasets and fitting MLLR models to each of these samples. The experiments suggest that it is important to develop a sample that has the same class distribution as the original population rather than ensuring that the classes are balanced. Furthermore, when sampling bias is reduced either by using over-sampling or under-sampling, both sampling techniques can improve the predictive capability of an MLLR model.
Similar content being viewed by others
References
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley series in probability and statistics. Wiley, New York
Agterberg FP (1974) Automatic contouring of geological maps to detect target areas for mineral exploration. Math Geol 6:373–395
Atkinson PM, Massari R (1998) Generalised linear modelling of susceptibility to landsliding in the central Apennines, Italy. Comput Geosci 24:373–385
Bent GC, Steeves PA (2006) A revised logistic regression equation and an automated procedure for mapping the probability of a stream flowing perennially in Massachusetts. US Geological Survey Scientific Investigations Report 2006-5031, 1 CD-ROM
Bonham-Carter GF, Chung CF (1989) Integration of mineral resource data for Kasmere lake area, Northwest Manitoba, with emphasis on uranium. Comput Geosci 15(1):25–45
Boyacioglu MA, Kara Y, Baykan OK (2009) Predicting bank financial failures using neural networks, support vector machines and multivariate statistical methods: A comparative analysis in the sample of savings deposit insurance fund (SDIF) transferred banks in Turkey. Expert Syst Appl 36:3355–3366
Burez J, Van den Poel D (2008) Separating financial from commercial customer churn: A modeling step towards resolving the conflict between the sales and credit department. Expert Syst Appl 35:497–514
Cao K, Yang X, Tian J, Zhang YY, Li P, Tao XQ (2009) Fingerprint matching based on neighboring information and penalized logistic regression. Adv Biom 5558:617–626
Carrara (1983) Multivariate models for landslide hazard evaluation. Math Geol 15(3):403–426
Caumon G, Ortiz JM, Rabeau O (2006) Comparative study of three data-driven mineral potential mapping techniques. In: Int assoc for mathematical geology, XIth international congress, Belgium, S13-05
Chung CF, Fabbri AG (2003) Validation of spatial prediction models for landslide hazard mapping. Nat Hazards 30:451–472
Correia LCL, Rocha MS, Esteves JP (2009) HDL-cholesterol level provides additional prognosis in acute coronary syndromes. Int J Cardiol 136:307–14
Cosslett SR (1981a) Maximum-likelihood estimator for choice-based samples. Econometrica 49:1289–1316
Cosslett SR (1981b) Efficient estimation of discrete-choice models. MIT Press, Cambridge
Cox DR (1970) Analysis of binary data. Methuen, London
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27:861–874
Garcia V, Mollineda RA, Sanchez JS (2008) On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Anal Appl 11:269–280
Gu Q, Cai ZH, Zhu L, Huang B (2008) Data mining on imbalanced data sets. In: International conference on advanced computer theory and engineering, pp 1020–1024
Hirji KF, Mehta CR, Patel NR (1987) Computing distributions for exact logistic regression. J Am Stat Assoc 82:1110–1117
Imbens GW (1992) An efficient method of moments estimator for discrete choice models with choice-based sampling. Econometrica 60:1187–1214
Juang CH, Chen CJ, Jiang T (2001) Probabilistic framework for liquefaction potential by shear wave velocity. J Geotech Geoenviron Eng 127:670–678
Juang CH, Jiang T, Andrus RD (2002) Assessing probability-based methods for liquefaction potential evaluation. J Geotech Geoenviron Eng 128:580–589
King G, Zeng L (2001) Explaining rare events in international relations. Int Organ 55:693–715
Lai SY, Chang WJ, Lin PS (2006) Logistic regression model for evaluating soil liquefaction probability using CPT data. J Geotech Geoenviron Eng 132:694–704
Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern B, Cybern 39:539–50
Lopez L, Sanchez JL (2009) Discriminant methods for radar detection of hail. In: 4th European conference on severe storms, vol 93, pp 358–368
Mehta CR, Patel NR (1995) Exact logistic regression: Theory and examples. Stat Med 14:2143–2160
Moss RES, Seed RB, Kayen RE, Stewart JP, Kiureghian AD, Cetin KO (2006) CPT-based probabilistic and deterministic assessment of in situ seismic soil liquefaction potential. J Geotech Geoenviron Eng 132(8):1032–1051
Olson SA, Brouillette MC (2006) A logistic regression equation for estimating the probability of a stream in Vermont having intermittent flow: US Geological Survey Scientific Investigations Report 2006–5217
Oommen T, Baise LG, Vogel R (2010) Validation and application of empirical liquefaction models. J Geotech Geoenviron Eng. doi:10.1061/(ASCE)GT.1943-5606.0000395
Page RL, Ellison CG, Lee J (2009) Does religiosity affect health risk behaviors in pregnant and postpartum women? Matern Child Health J 13:621–632
Preisler HK, Brillinger DR, Burgan RE, Benoit JW (2004) Probability based models for estimation of wildfire risk. Int J Wildland Fire 13:133–142
R Development Core Team (2009) R: A language and environment for statistical computing. R Foundation for statistical computing, Vienna
Seiffert C, Khoshgoftaar TM, Van Hulse J (2009) Hybrid sampling for imbalanced data. Integr Comput -Aided Eng 16:193–210
Sun YM, Wong AKC, Kamel MS (2009) Classification of imbalanced data: A review. Int J Pattern Recognit Artif Intell 23:687–719
Tang YC, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced cClassification. IEEE Trans Syst Man Cybern Part B, Cybern 39:281–288
Tasker GD (1989) Regionalization of low flow characteristics using logistic and GLS regression. In: Kavvas ML (ed) New directions for surface water modeling. IAHS Publication, vol 181, pp 323–331
Toner M, Keddy P (1997) River hydrology and riparian wetlands: A predictive model for ecological assembly. Ecol Appl 7:236–246
van Rijsbergen C (1979) Information retrieval. Butterworths, London
Weiss GM, Provost F (2003) Learning when training data are costly: The effect of class distribution on tree induction. J Artif Intell Res 19:315–354
Williams DP, Myers V, Silvious MS (2009) Mine classification with imbalanced data. IEEE Geosci Remote Sens Lett 6:528–532
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Oommen, T., Baise, L.G. & Vogel, R.M. Sampling Bias and Class Imbalance in Maximum-likelihood Logistic Regression. Math Geosci 43, 99–120 (2011). https://doi.org/10.1007/s11004-010-9311-8
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11004-010-9311-8