
2019 | Book

Statistical Methods for Imbalanced Data in Ecological and Biological Studies

About this book

This book provides a comprehensive review of recent work on the challenging problems that imbalanced data cause in prediction and classification, and it introduces several of the latest statistical methods for dealing with these problems. The book discusses the imbalance of data from two points of view. The first is quantitative imbalance, meaning that the sample size in one population greatly exceeds that in another. It includes presence-only data as an extreme case, where the presence of a species is confirmed but information on its absence is uncertain; such data are especially common in ecology when predicting habitat distribution. The second is qualitative imbalance, meaning that the data distribution of one population can be well specified whereas that of the other is highly heterogeneous. A typical case is the presence of outliers, commonly observed in gene expression data; another is the heterogeneity often observed in the case group in case-control studies. Extensions of the logistic regression model, Maxent, and AdaBoost for imbalanced data are discussed, providing a new framework for improving prediction, classification, and variable selection performance. The weight functions introduced in these methods play an important role in alleviating the imbalance of data. The book also offers a new perspective on these problems and shows applications of the recently developed statistical methods to real data sets.

Table of Contents

Frontmatter
Chapter 1. Introduction to Imbalanced Data
Abstract
An imbalance of sample sizes among class labels makes it difficult to obtain high classification accuracy in many scientific fields, including medical diagnosis, bioinformatics, biology, and fisheries management. This difficulty is referred to as the “class imbalance problem” and is considered to be among the 10 most important problems in data mining research. The topic has also been widely discussed in several machine learning workshops. The critical feature of the imbalance problem is that it significantly degrades the performance of standard classification methods, which implicitly assume balanced class distributions and equal misclassification costs for each class. Hence, new strategies are required for mitigating such imbalances, based on resampling techniques, modification of the classification algorithms, adjustment of weights for class distributions, and so on.
Osamu Komori, Shinto Eguchi
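
As a small illustration of why the Chapter 1 abstract says that standard methods degrade under class imbalance, the sketch below scores a trivial classifier that always predicts the majority class. The 95/5 split, the function names, and the use of balanced accuracy as a contrasting metric are assumptions chosen for illustration; they are not taken from the book.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of observations classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity, so each class counts equally."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


# Hypothetical data: 5 rare cases (y=1) among 100 observations.
y_true = [1] * 5 + [0] * 95
# A classifier that ignores the rare class entirely.
y_majority = [0] * 100

acc = accuracy(y_true, y_majority)
bal_acc = balanced_accuracy(y_true, y_majority)
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
```

Plain accuracy rewards this classifier even though it never detects a rare case, which is one reason class-aware metrics and weighting schemes are used with imbalanced data.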
Chapter 2. Weighted Logistic Regression
Abstract
We consider an asymmetric logistic regression model as an example of a weighted logistic regression model, where the weights in the estimating equation vary according to the explanatory variables, thereby alleviating the imbalance of effective sample sizes between class labels \(y=0\) and \(y=1\). This model is extended to have a double robust property based on a propensity score, so that it has consistent estimators. We illustrate the utility of both models using the RAM and FAO data from fishery science.
Osamu Komori, Shinto Eguchi
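
To make the idea of weights in the estimating equation concrete, the following is a minimal sketch of a generic class-weighted logistic regression fitted by Newton-Raphson. It is not the asymmetric model of Chapter 2, whose weights vary with the explanatory variables; here the weights are simply inverse class frequencies. The simulated data, the function names, and the weighting scheme are all assumptions made for illustration.

```python
import numpy as np


def fit_weighted_logistic(X, y, w, n_iter=25):
    """Solve the weighted score equation sum_i w_i (y_i - p_i) x_i = 0."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
        grad = X1.T @ (w * (y - p))
        hess = (X1 * (w * p * (1.0 - p))[:, None]).T @ X1
        beta = beta + np.linalg.solve(hess, grad)  # Newton-Raphson step
    return beta


def recall(beta, X, y):
    """Share of y=1 observations predicted as 1 at probability threshold 0.5."""
    X1 = np.column_stack([np.ones(len(X)), X])
    pred = (X1 @ beta) > 0
    return pred[y == 1].mean()


# Hypothetical imbalanced sample: 950 observations with y=0, 50 with y=1.
rng = np.random.default_rng(0)
n0, n1 = 950, 50
X = np.concatenate([rng.normal(0.0, 1.0, n0), rng.normal(1.5, 1.0, n1)])[:, None]
y = np.concatenate([np.zeros(n0), np.ones(n1)])

# Unit weights versus weights that give both classes equal total weight.
w_plain = np.ones(n0 + n1)
w_balanced = np.where(y == 1, (n0 + n1) / (2 * n1), (n0 + n1) / (2 * n0))

beta_plain = fit_weighted_logistic(X, y, w_plain)
beta_balanced = fit_weighted_logistic(X, y, w_balanced)

print("recall, unit weights:    ", recall(beta_plain, X, y))
print("recall, balanced weights:", recall(beta_balanced, X, y))
```

Upweighting the rare class mainly shifts the intercept upward, so more rare cases cross the 0.5 threshold. The price is more false positives, which is the trade-off any weighting scheme has to manage.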
Chapter 3. β-Maxent
Abstract
Maxent is very popular for estimating species distributions using environmental variables such as temperature, precipitation, elevation, and soil category, all of which are closely related to the habitat of the species of interest. It is designed for estimating a probability distribution that has maximum entropy subject to the condition that the sample means of environmental variables are equal to the population means. Maxent can deal with presence-only data, for which records of where the species is present are available but records of its absence are not. Hence, this kind of data can be regarded as the extreme case of imbalanced data, where observations belonging to one class (\(y=0\) or \(y=1\)) are totally missing. We investigate Maxent from the viewpoint of divergence and extend it by introducing \(\beta \)-divergence, a variant of the more general class of U-divergence.
Osamu Komori, Shinto Eguchi
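
The moment-matching condition in the Chapter 3 abstract can be sketched on a small grid of sites. The example below fits ordinary, unregularized Maxent, not the β-Maxent extension: it searches for a Gibbs distribution p(x) ∝ exp(λ·f(x)) over all sites whose expected environmental features equal the feature means at the presence records. The simulated features, the number of sites, the step size, and the function names are assumptions for illustration only.

```python
import numpy as np


def gibbs(features, lam):
    """Gibbs distribution over sites: p(x) proportional to exp(lam . f(x))."""
    logits = features @ lam
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def fit_maxent(features, presence_idx, n_iter=3000, lr=0.3):
    """Gradient ascent on the mean log-likelihood of the presence records.

    The gradient is (presence feature means) - (model feature means), so
    the fixed point satisfies the Maxent moment constraints.
    """
    target = features[presence_idx].mean(axis=0)
    lam = np.zeros(features.shape[1])
    for _ in range(n_iter):
        lam += lr * (target - gibbs(features, lam) @ features)
    return lam


# Hypothetical landscape: 200 sites with two standardized environmental features.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 2))

# Hypothetical presence-only records drawn from a known Gibbs distribution.
true_p = gibbs(features, np.array([1.0, -0.5]))
presence_idx = rng.choice(200, size=300, p=true_p)

lam_hat = fit_maxent(features, presence_idx)
p_hat = gibbs(features, lam_hat)

print("presence feature means:", features[presence_idx].mean(axis=0))
print("model feature means:   ", p_hat @ features)
```

Note that no absence records enter the fit: the sites without presence records serve only as background over which the distribution is normalized.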
Chapter 4. Generalized T-Statistic
Abstract
We discuss a statistical method for the classification problem with two groups \(y=0\) and \(y=1\). We envisage a situation in which the conditional distribution of \(y=0\) is well specified by a normal distribution, but the conditional distribution of \(y=1\) (rare observations in imbalanced data sets) is not well modeled by any specific distribution. Typically in a case-control study, the distribution in the control group can be assumed to be normal via an appropriate data transformation, whereas the distribution in the case group may depart from normality. In this situation, the maximum t-statistic for linear discrimination, or equivalently Fisher’s linear discriminant function, may not be optimal. We propose a class of generalized t-statistics and study their asymptotic consistency and normality. The optimal generalized t-statistic in the sense of asymptotic variance is derived in a semi-parametric manner, and its statistical performance is confirmed in several numerical experiments.
Osamu Komori, Shinto Eguchi
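
The equivalence mentioned in the Chapter 4 abstract, between maximizing the t-statistic over linear scores and Fisher’s linear discriminant, can be checked numerically. The sketch below covers only this baseline, not the generalized t-statistics the chapter proposes. It assumes the pooled-variance form of the two-sample t-statistic, and the simulated control and case groups and all function names are hypothetical.

```python
import numpy as np


def pooled_cov(X0, X1):
    """Pooled within-group covariance matrix of the two samples."""
    n0, n1 = len(X0), len(X1)
    S0 = np.cov(X0, rowvar=False)
    S1 = np.cov(X1, rowvar=False)
    return ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)


def t_statistic(w, X0, X1):
    """Pooled two-sample t-statistic of the linear scores x @ w."""
    n0, n1 = len(X0), len(X1)
    diff = (X1.mean(axis=0) - X0.mean(axis=0)) @ w
    s2 = w @ pooled_cov(X0, X1) @ w
    return diff / np.sqrt(s2 * (1.0 / n0 + 1.0 / n1))


def fisher_direction(X0, X1):
    """Fisher's linear discriminant direction: S_pooled^{-1} (mean1 - mean0)."""
    mean_diff = X1.mean(axis=0) - X0.mean(axis=0)
    return np.linalg.solve(pooled_cov(X0, X1), mean_diff)


# Hypothetical data: 500 roughly normal controls and 30 heterogeneous cases
# drawn from a two-component mixture.
rng = np.random.default_rng(0)
X0 = rng.standard_normal((500, 3))
X1 = np.vstack([rng.normal(0.5, 1.0, (20, 3)), rng.normal(3.0, 2.0, (10, 3))])

w_fisher = fisher_direction(X0, X1)
t_fisher = t_statistic(w_fisher, X0, X1)
t_axes = [t_statistic(e, X0, X1) for e in np.eye(3)]

print("t-statistic along Fisher direction:", t_fisher)
print("t-statistics along single variables:", t_axes)
```

In-sample, no other linear score attains a larger pooled t-statistic than the Fisher direction. The abstract's point is that this criterion can still be a poor choice when the case group departs from normality, as in the mixture simulated here.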
Chapter 5. Machine Learning Methods for Imbalanced Data
Abstract
We discuss high-dimensional data analysis in the framework of pattern recognition and machine learning, including single-component analysis and clustering analysis. Several boosting methods for tackling imbalances in sample sizes are investigated.
Osamu Komori, Shinto Eguchi
Backmatter
Metadata
Title
Statistical Methods for Imbalanced Data in Ecological and Biological Studies
Authors
Osamu Komori
Prof. Shinto Eguchi
Copyright year
2019
Publisher
Springer Japan
Electronic ISBN
978-4-431-55570-4
Print ISBN
978-4-431-55569-8
DOI
https://doi.org/10.1007/978-4-431-55570-4