Published in:

2021 | OriginalPaper | Chapter

Outlier detection in Bioinformatics with Mixtures of Gaussian and heavy-tailed distributions

Author : Alexandra Posekany

Published in: Data Science – Analytics and Applications

Publisher: Springer Fachmedien Wiesbaden

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Starting from approaches in Bioinformatics, we will investigate aspects of Bayesian robustness ideas and compare them to methods from classical robust statistics. Bayesian robustness branches into three aspects, robustifying the prior, the likelihood or the loss function. Our focus will be the the likelihood itself. For computational convenience, normal likelihoods are the standard for many basic analyses ranging from simple mean estimation to regression or discriminatory models. However, similar to classical analyses non-normal data cause problems in the estimation process and are often covered with complex models for the overestimated variance or shrink- age. Most prominently, Bayesian non-parametrics approach this challenge with infinite mixtures of distributions. However, infinite mixture models do not allow an identification of outlying values in “near-Gaussian” scenarios being almost too flexible for such a purpose. The goal of our works is to allow for a robust estimation of parameters of the “main part of the data”, while being able to identify the outlying part of the data and providing a posterior probability for not fitting the main likelihood model. For this purpose, we propose to mix a Gaussian likelihood with heavy-tailed or skewed distributions of a similar structure which can hierarchically be related to the normal distribution in order to allow a consistent estimation of parameters and efficient simulation. We present an application of this approach in Bioinformatics for the robust estimation of genetic array data by mixing Gaussian and student’s t distributions with various degrees of freedom. To this effect, we employ microarray data as a case study for this behaviour, as they are well-known for their complicated, over-dispersed noise behaviour. Our secondary goal is to present a methodology, which helps not only to identify noisy genes but also to recognise whether single arrays are responsible for this behaviour. Although Bioinformatics dropped array technology in favour of sequencing in research, the medical diagnostics has picked up the methodology and thus require appropriate error estimators.

Springer Professional

Outlier detection in Bioinformatics with Mixtures of Gaussian and heavy-tailed distributions

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner