Statistical Methods for Imbalanced Data in Ecological and Biological Studies

Authors: Osamu Komori, Prof. Shinto Eguchi

Publisher: Springer Japan

Book series: SpringerBriefs in Statistics



About this book

This book takes a new approach in two respects: it provides a comprehensive review of recent work on the challenging problems that imbalanced data cause in prediction and classification, and it introduces several of the latest statistical methods for dealing with these problems. The book discusses the imbalance of data from two points of view. The first is quantitative imbalance, meaning that the sample size in one population greatly outnumbers that in another population. It includes presence-only data as an extreme case, where the presence of a species is confirmed but the information on its absence is uncertain; this situation is especially common in ecology when predicting habitat distribution. The second is qualitative imbalance, meaning that the data distribution of one population can be well specified whereas that of the other is highly heterogeneous. A typical case is the existence of outliers commonly observed in gene expression data, and another is the heterogeneous characteristics often observed in the case group in case-control studies. Extensions of the logistic regression model, Maxent, and AdaBoost for imbalanced data are discussed, providing a new framework for improving prediction, classification, and the performance of variable selection. The weight functions introduced in the methods play an important role in alleviating the imbalance of data. This book also offers a new perspective on these problems and shows some applications of the recently developed statistical methods to real data sets.

Table of Contents

Frontmatter

Chapter 1. Introduction to Imbalanced Data

Abstract

An imbalance of sample sizes among class labels makes it difficult to obtain high classification accuracy in many scientific fields, including medical diagnosis, bioinformatics, biology, and fisheries management. This difficulty is referred to as the “class imbalance problem” and is considered to be among the 10 most important problems in data mining research. The topic has also been widely discussed in several machine learning workshops. The critical feature of the imbalance problem is that it significantly degrades the performance of standard classification methods, which implicitly assume balanced class distributions and equal misclassification costs for each class. Hence, new strategies are required to mitigate such imbalances, based on resampling techniques, modification of the classification algorithms, adjustment of weights for class distributions, and so on.

Osamu Komori, Shinto Eguchi
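
The degradation described in the abstract can be illustrated with a small sketch. It is not taken from the book: the 95/5 class split, the trivial "always predict the majority class" classifier, and the helper names `class_weights`, `accuracy`, and `balanced_accuracy` are all assumptions chosen for illustration. The inverse-frequency weights shown are one common way of adjusting weights for class distributions, not necessarily the scheme the authors use.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights n / (k * n_c) for each class c (illustrative choice)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def accuracy(y_true, y_pred):
    """Fraction of observations classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical data: 95 majority-class labels and 5 rare-class labels.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that ignores the rare class entirely.
y_pred = [0] * 100

weights = class_weights(y_true)
acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(weights, acc, bal_acc)
```

On this toy data the trivial classifier reaches an accuracy of 0.95 but a balanced accuracy of only 0.5, and the inverse-frequency weight of the rare class (10.0) is far larger than that of the majority class (about 0.53).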

Chapter 2. Weighted Logistic Regression

Abstract

We consider an asymmetric logistic regression model as an example of a weighted logistic regression model, where the weights in the estimating equation vary according to the explanatory variables, thereby alleviating the imbalance of effective sample sizes between class labels \(y=0\) and \(y=1\). This model is extended to have a doubly robust property based on a propensity score, so that it has consistent estimators. We illustrate the utility of both models using the RAM and FAO data from fishery science.

Osamu Komori, Shinto Eguchi
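
As a rough illustration of how weights in the estimating equation can rebalance effective sample sizes, the sketch below fits a logistic regression by gradient ascent on a weighted log-likelihood. It is a simplification and not the authors' asymmetric model: the weights here are fixed per class (inverse class frequency) rather than varying with the explanatory variables, and the simulated data, the function names `fit_weighted_logistic` and `recall_rare`, and the step-size settings are all assumptions.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, lr=0.5, n_iter=5000):
    """Gradient ascent on sum_i w_i * [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    Xb = np.column_stack([np.ones(len(X)), X])     # prepend an intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))       # fitted probabilities
        # Weighted estimating equation: sum_i w_i (y_i - p_i) x_i = 0 at the optimum.
        beta += lr * Xb.T @ (w * (y - p)) / w.sum()
    return beta

# Simulated imbalanced data (assumption): 950 observations with y=0, 50 with y=1.
rng = np.random.default_rng(0)
n0, n1 = 950, 50
X = np.concatenate([rng.normal(0.0, 1.0, n0), rng.normal(2.0, 1.0, n1)])[:, None]
y = np.concatenate([np.zeros(n0), np.ones(n1)])

w_plain = np.ones_like(y)                                        # ordinary logistic regression
w_bal = np.where(y == 1, len(y) / (2 * n1), len(y) / (2 * n0))   # inverse class frequency

beta_plain = fit_weighted_logistic(X, y, w_plain)
beta_bal = fit_weighted_logistic(X, y, w_bal)

def recall_rare(beta):
    """Share of y=1 observations classified as 1 at the 0.5 probability threshold."""
    Xb = np.column_stack([np.ones(len(X)), X])
    pred = (Xb @ beta) > 0
    return pred[y == 1].mean()

print(beta_plain, recall_rare(beta_plain))
print(beta_bal, recall_rare(beta_bal))
```

In this setup the class weights mainly raise the fitted intercept, which moves the decision boundary toward the majority class and increases recall for the rare class \(y=1\).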

Chapter 3. \(\beta\)-Maxent

Abstract

Maxent is very popular for estimating species distributions using environmental variables such as temperature, precipitation, elevation, and soil category, all of which are closely related to the habitat of the species of interest. It is designed to estimate the probability distribution that has maximum entropy subject to the condition that the sample means of the environmental variables are equal to the population means. Maxent can deal with presence-only data, for which records of where the species was observed are available but records of its absence are not. Hence, this kind of data can be regarded as the extreme case of imbalanced data, where observations belonging to one class (\(y=0\) or \(y=1\)) are totally missing. We investigate Maxent from the viewpoint of divergence and extend it by introducing \(\beta \)-divergence, a variant of the more general class of U-divergence.

Osamu Komori, Shinto Eguchi
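
The moment-matching condition in the abstract can be made concrete with a minimal sketch of classical, unregularized Maxent on a finite set of grid cells. It relies on the standard result that the maximum-entropy distribution under mean constraints has the exponential form \(p(x) \propto \exp(\lambda^\top f(x))\). This is not the \(\beta\)-Maxent extension of the chapter, and the simulated features, the sample of presence records, and the names `gibbs` and `fit_maxent` are assumptions for illustration.

```python
import numpy as np

def gibbs(F, lam):
    """Distribution over cells with p(x) proportional to exp(lam . f(x))."""
    logits = F @ lam
    p = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return p / p.sum()

def fit_maxent(F, presence_idx, lr=0.5, n_iter=5000):
    """Fit lam so that the model feature means match the presence-sample means."""
    target = F[presence_idx].mean(axis=0)       # sample means at presence records
    lam = np.zeros(F.shape[1])
    for _ in range(n_iter):
        # Gradient of the mean log-likelihood: sample means minus model means.
        lam += lr * (target - gibbs(F, lam) @ F)
    return lam, gibbs(F, lam)

# Hypothetical landscape: 200 cells with two scaled environmental variables.
rng = np.random.default_rng(1)
F = rng.uniform(-1.0, 1.0, size=(200, 2))
true_p = gibbs(F, np.array([1.5, -0.5]))
# Presence-only records: cell indices sampled from the true distribution.
presence_idx = rng.choice(len(F), size=300, p=true_p)

lam, p = fit_maxent(F, presence_idx)
print(lam)
print(p @ F, F[presence_idx].mean(axis=0))   # model means vs. sample means
```

No absence records enter the fit: only the presence sample and the environmental variables over all cells are used, which is what makes the method applicable to presence-only data.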

Chapter 4. Generalized T-Statistic

Abstract

We discuss a statistical method for the classification problem with two groups \(y=0\) and \(y=1\). We envisage a situation in which the conditional distribution of \(y=0\) is well specified by a normal distribution, but the conditional distribution of \(y=1\) (rare observations in imbalanced data sets) is not well modeled by any specific distribution. Typically in a case-control study, the distribution in the control group can be assumed to be normal via an appropriate data transformation, whereas the distribution in the case group may depart from normality. In this situation, the maximum t-statistic for linear discrimination, or equivalently Fisher’s linear discriminant function, may not be optimal. We propose a class of generalized t-statistics and study their asymptotic consistency and normality. The optimal generalized t-statistic in the sense of asymptotic variance is derived in a semi-parametric manner, and its statistical performance is confirmed in several numerical experiments.

Osamu Komori, Shinto Eguchi
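
The equivalence mentioned in the abstract between the maximum t-statistic and Fisher's linear discriminant can be checked numerically. The sketch below covers only this classical baseline, not the generalized t-statistic proposed in the chapter. The simulated control and case groups and the function names `t_statistic` and `fisher_direction` are assumptions for illustration.

```python
import numpy as np

def t_statistic(w, X0, X1):
    """Pooled-variance two-sample t-statistic of the linear scores w'x."""
    s0, s1 = X0 @ w, X1 @ w
    n0, n1 = len(s0), len(s1)
    pooled = ((n0 - 1) * s0.var(ddof=1) + (n1 - 1) * s1.var(ddof=1)) / (n0 + n1 - 2)
    return (s1.mean() - s0.mean()) / np.sqrt(pooled * (1 / n0 + 1 / n1))

def fisher_direction(X0, X1):
    """Fisher's direction: pooled covariance inverse times the mean difference."""
    n0, n1 = len(X0), len(X1)
    S = ((n0 - 1) * np.cov(X0, rowvar=False)
         + (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    return np.linalg.solve(S, X1.mean(axis=0) - X0.mean(axis=0))

# Simulated groups (assumption): 500 normal controls and 30 heterogeneous cases,
# half of which are shifted in the first variable.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(500, 3))
X1 = np.vstack([rng.normal(size=(15, 3)),
                rng.normal(loc=[3.0, 0.0, 0.0], size=(15, 3))])

w_star = fisher_direction(X0, X1)
t_star = t_statistic(w_star, X0, X1)

# No other direction attains a larger t-statistic than Fisher's direction.
t_random = [t_statistic(rng.normal(size=3), X0, X1) for _ in range(1000)]
print(t_star, max(t_random))
```

Fisher's direction maximizes this t-statistic exactly, by the Cauchy-Schwarz inequality. That says nothing about optimality when the case group departs from normality, which is the gap the generalized t-statistics are meant to address.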

Chapter 5. Machine Learning Methods for Imbalanced Data

Abstract

We discuss high-dimensional data analysis in the framework of pattern recognition and machine learning, including single-component analysis and clustering analysis. Several boosting methods for tackling imbalances in sample sizes are investigated.

Osamu Komori, Shinto Eguchi

Backmatter

Title: Statistical Methods for Imbalanced Data in Ecological and Biological Studies
Authors: Osamu Komori, Prof. Shinto Eguchi
Publisher: Springer Japan
Electronic ISBN: 978-4-431-55570-4
Print ISBN: 978-4-431-55569-8
DOI: https://doi.org/10.1007/978-4-431-55570-4