2023 | Book

# Statistical Learning in Genetics

## An Introduction Using R

Author: Daniel Sorensen

Publisher: Springer International Publishing

Book Series: Statistics for Biology and Health


This book provides an introduction to computer-based methods for the analysis of genomic data. Breakthroughs in molecular and computational biology have contributed to the emergence of vast data sets, where millions of genetic markers for each individual are coupled with medical records, generating an unparalleled resource for linking human genetic variation to human biology and disease. Similar developments have taken place in animal and plant breeding, where genetic marker information is combined with production traits. An important task for the statistical geneticist is to adapt, construct and implement models that can extract information from these large-scale data. An initial step is to understand the methodology that underlies the probability models and to learn the modern computer-intensive methods required for fitting these models. The objective of this book, suitable for readers who wish to develop analytic skills to perform genomic research, is to provide guidance to take this first step.

This book is addressed to numerate biologists who typically lack the formal mathematical background of the professional statistician. For this reason, considerably more detail is offered in explanations and derivations. It is written in a concise style, and examples are used profusely. A large proportion of the examples involve programming with the open-source package R. The R code needed to solve the exercises is provided. The R Markdown interface allows students to run the code on their own computers, contributing to a better understanding of the underlying theory.

Part I presents methods of inference based on likelihood and Bayesian methods, including computational techniques for fitting likelihood and Bayesian models. Part II discusses prediction for continuous and binary data using both frequentist and Bayesian approaches. Some of the models used for prediction are also used for gene discovery. The challenge is to find promising genes without incurring a large proportion of false positive results. Therefore, Part II includes a detour on False Discovery Rate assuming frequentist and Bayesian perspectives. The last chapter of Part II provides an overview of a selected number of non-parametric methods. Part III consists of exercises and their solutions.

Daniel Sorensen holds PhD and DSc degrees from the University of Edinburgh and is an elected Fellow of the American Statistical Association. He was professor of Statistical Genetics at Aarhus University where, at present, he is professor emeritus.


Abstract

Suppose there is a set of data consisting of observations in humans on forced expiratory volume (FEV, a measure of lung function; lung function is a predictor of health, and low lung function is a risk factor for mortality), or on the presence or absence of heart disease, and that there are questions that could be answered using these data.

Abstract

A central problem in statistics is the estimation of parameters that index a probability model proposed to describe aspects of the state of nature. In the classical approach to inference, these parameters have a “true” but unknown value and, given the model, can be estimated from a set of observations. A firmly entrenched inferential approach in statistics is the method of maximum likelihood, proposed and named by Fisher (Philos. Trans. R. Soc. Lond. A 222:309–368, 1922), although, as is often the case in science, the subject had been in the air long before Fisher, disguised in the terminology of inverse probability. An excellent account is given in Hald (A History of Mathematical Statistics from 1750 to 1930. Wiley, 1998).

Estimation using the likelihood function proceeds by solving for \(\theta \) the equation \(S(\theta )=0\) where \(S(\theta )\) is the score.
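
As an illustrative sketch (not code from the book), the score equation can be solved numerically. For a binomial sample with x successes in n trials, the score is \(S(\theta )=x/\theta -(n-x)/(1-\theta )\), and Newton-Raphson iteration recovers the closed-form maximum likelihood estimate \(x/n\); the function names and starting values below are assumptions for the example:

```r
# Sketch: maximum likelihood for a binomial proportion by solving
# S(theta) = 0 with Newton-Raphson.
# Log-likelihood: l(theta) = x*log(theta) + (n - x)*log(1 - theta)
score <- function(theta, x, n) x / theta - (n - x) / (1 - theta)
# Derivative of the score (minus the observed information)
score_prime <- function(theta, x, n) -x / theta^2 - (n - x) / (1 - theta)^2

newton_mle <- function(x, n, theta = 0.5, tol = 1e-10) {
  repeat {
    step <- score(theta, x, n) / score_prime(theta, x, n)
    theta <- theta - step
    if (abs(step) < tol) return(theta)
  }
}

theta_hat <- newton_mle(x = 7, n = 20)
print(theta_hat)  # ~0.35, the closed-form solution x/n
```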

Abstract

In classical likelihood, an important goal is to learn about a parameter \(\theta \) regarded as a fixed unknown quantity. This is accomplished by collecting data y assumed to be a realisation from a probability model \(p(y|\theta )\) indexed by \(\theta \). This probability model gives rise to the likelihood, a function of \(\theta \) conditional on the realised y, from which the maximum likelihood estimate \(\hat {\theta }\) is obtained. The ML estimator is a random variable (a function of y), whose distribution is characterised by conceptual replications of y.
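
A minimal sketch of this view (an assumed example, not from the text): repeated draws of y from \(p(y|\theta )\) induce a sampling distribution for the ML estimator. For a normal model with known unit variance the MLE is the sample mean, and simulated replications reproduce the theoretical variance \(1/n\):

```r
# Sketch: sampling distribution of the ML estimator via conceptual
# replications of y. Model: y_i ~ N(theta, 1), i = 1..n; MLE = mean(y).
set.seed(1)
n <- 50; theta <- 2
reps <- 5000
theta_hat <- replicate(reps, mean(rnorm(n, mean = theta, sd = 1)))
# Asymptotic theory: theta_hat ~ N(theta, 1/n)
c(mean = mean(theta_hat), var = var(theta_hat), theory_var = 1 / n)
```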

Abstract

This chapter illustrates applications of McMC in a Bayesian context. The treatment is mostly schematic; the objective is to present the mechanics of McMC in different modelling scenarios. Many of the examples, discussed in connection with the implementation of maximum likelihood (using Newton-Raphson and EM), are revisited from a Bayesian McMC perspective. These include the analysis of ABO blood group data, the binary regression, the genomic model, the two-component mixture model, and the Bayesian analysis of truncated data. Further examples are discussed in the second part of the book on Prediction and in the Exercise sections, including their solutions, at the end of the book.
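
To give a flavour of the mechanics (a schematic sketch under assumed data, not the book's own code), a random-walk Metropolis sampler for the posterior of a binomial proportion with a uniform prior can be written in a few lines:

```r
# Sketch: Metropolis sampling from p(theta | x) proportional to
# theta^x * (1 - theta)^(n - x), a Beta(x + 1, n - x + 1) posterior.
set.seed(2)
x <- 7; n <- 20
log_post <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  x * log(theta) + (n - x) * log(1 - theta)
}
n_iter <- 20000
draws <- numeric(n_iter)
theta <- 0.5
for (i in seq_len(n_iter)) {
  prop <- theta + runif(1, -0.1, 0.1)  # random-walk proposal
  if (log(runif(1)) < log_post(prop) - log_post(theta)) theta <- prop
  draws[i] <- theta
}
# Posterior mean should approach (x + 1)/(n + 2) = 8/22
mean(draws[-(1:2000)])  # discard burn-in
```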

Abstract

This chapter provides an overview of prediction with examples taken from quantitative genetics. The first part summarises best prediction and best linear prediction and offers a brief tour of the standard linear least squares regression.
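
As a small illustration of the least squares part (an assumed example with simulated data), the regression coefficients can be obtained directly from the normal equations \(X'Xb=X'y\) and checked against R's lm():

```r
# Sketch: ordinary least squares via the normal equations vs lm().
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)                        # design matrix with intercept
b_hat <- solve(t(X) %*% X, t(X) %*% y)  # normal equations
coef_lm <- coef(lm(y ~ x))
cbind(normal_eq = b_hat, lm = coef_lm)  # the two agree
```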

Abstract

Expression (6.51) indicates how prediction ability is governed by bias and variance. As models become more complex, local noise can be captured, but coefficient estimates suffer from higher variance as more terms are included in the model. In the context of the traditional regression model \(y=Xb+e\), \(e\sim N\left ( 0,I\sigma ^{2}\right )\), when the number of covariates p (the number of columns of X) is large relative to the number of records/individuals n (the number of rows of X), the matrix X may become rank-deficient or nearly so (X may not be of full column rank, or may be close to not being so). In this case, even when \(p<n\), it is difficult to separate the effects of individual covariates.
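
The instability can be seen in a small simulation (an assumed example; the dimensions, the collinear column and the ridge penalty value are illustrative): with p close to n and two nearly collinear columns, the ordinary least squares coefficients inflate, while adding a ridge penalty \(\lambda I\) keeps the system well conditioned:

```r
# Sketch: OLS vs ridge when X is nearly rank-deficient.
set.seed(4)
n <- 30; p <- 25
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.01)   # nearly collinear columns
b <- c(1, -1, rep(0, p - 2))
y <- X %*% b + rnorm(n)
ols   <- solve(t(X) %*% X) %*% t(X) %*% y               # unstable
ridge <- solve(t(X) %*% X + 0.5 * diag(p)) %*% t(X) %*% y
# The OLS coefficient vector has a much larger norm than the ridge one
c(ols_norm = sqrt(sum(ols^2)), ridge_norm = sqrt(sum(ridge^2)))
```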

Abstract

A classical single hypothesis test proceeds by specifying \(\alpha \), the probability of a significant result given that the null hypothesis \(\left (H=0\right )\) is true. This is also known as the probability of a false discovery or, more commonly, of a type I error.
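
The contrast with false discovery rate control can be sketched in a few lines of R (an assumed simulation, not the chapter's own example; the mixture of null and non-null tests is illustrative), using the Benjamini-Hochberg adjustment available through p.adjust():

```r
# Sketch: per-test alpha vs Benjamini-Hochberg FDR control.
set.seed(5)
# 900 true nulls (uniform p-values) and 100 signals (small p-values)
p_values <- c(runif(900), rbeta(100, 1, 50))
truth <- rep(c(FALSE, TRUE), c(900, 100))   # TRUE = non-null
raw_disc <- p_values < 0.05
bh_disc  <- p.adjust(p_values, method = "BH") < 0.05
# False discovery proportion under each rule
fdp <- function(disc) sum(!truth & disc) / max(1, sum(disc))
c(raw_fdp = fdp(raw_disc), bh_fdp = fdp(bh_disc))
```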

Abstract

Many of the results derived under the assumption that observations are continuously distributed extend to dichotomous and categorical responses, although some technical details specific to discontinuous data must be observed. The chapter starts by illustrating the behaviour of training and validating mean squared error applied to binary records, using operational logistic regression models with an increasing number of covariates.
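
A rough sketch of that behaviour (an assumed simulation; sample sizes, the two informative covariates and the model sizes are illustrative choices, not the chapter's): as noise covariates are added to a logistic regression, the training fit keeps improving while validation error need not:

```r
# Sketch: training vs validating error for logistic regressions of
# increasing size. Only the first two covariates carry signal.
set.seed(6)
n <- 200; p <- 30
X <- matrix(rnorm(2 * n * p), 2 * n, p)
eta <- 1.5 * X[, 1] - 1.5 * X[, 2]
y <- rbinom(2 * n, 1, plogis(eta))
train <- 1:n; valid <- (n + 1):(2 * n)
errs <- t(sapply(c(2, 10, 30), function(k) {
  d <- data.frame(y = y, X[, 1:k, drop = FALSE])
  fit <- glm(y ~ ., data = d[train, ], family = binomial)
  pv <- predict(fit, newdata = d[valid, ], type = "response")
  c(k = k, deviance = fit$deviance,
    train_mse = mean((y[train] - fitted(fit))^2),
    valid_mse = mean((y[valid] - pv)^2))
}))
round(errs, 3)  # training deviance falls monotonically with k
```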

Abstract

Aspects of Bayesian prediction have been addressed in previous chapters. In particular, Chaps. 7 and 9 show a Bayesian implementation of the spike and slab model for continuous and binary records, respectively, and illustrate how the marginal posterior distribution of validating mean squared errors can easily be computed in an McMC environment (pages 331 and 386).

Abstract

Throughout this book, a phrase like “assume the data have been generated by the following probability model” has been used abundantly. Indeed, the standard parametric assumption is that the observed data represent one realisation from some given probability model, and the goal can be to infer the parameters of the model. Alternatively, from a classical frequentist setting, conditionally on estimated parameters, the goal may be to predict future observations.

Abstract

In a binomial experiment with n trials and probability of success \(\theta \), x successes are observed. The setup could represent a trial designed to estimate the proportion of individuals in a population that suffer from a particular disease.
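
For such a setup (with counts assumed here purely for illustration), the maximum likelihood estimate is \(x/n\), and an approximate Wald interval follows from the asymptotic variance \(\theta (1-\theta )/n\):

```r
# Sketch: estimating a disease proportion from a binomial sample.
x <- 18; n <- 200                 # illustrative counts
theta_hat <- x / n                # MLE of the proportion
se <- sqrt(theta_hat * (1 - theta_hat) / n)
ci <- theta_hat + c(-1, 1) * qnorm(0.975) * se  # Wald 95% interval
round(c(estimate = theta_hat, lower = ci[1], upper = ci[2]), 3)
```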

Abstract

Using the transformed parameter (13.8) results in a more symmetric likelihood function, as displayed in Fig. 13.1. This in turn has consequences for the quality of inferences.
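
Transformation (13.8) is not reproduced here; as an assumed illustration of the general idea, a logit reparameterisation \(\phi =\log (\theta /(1-\theta ))\) typically symmetrises a binomial likelihood. One crude measure of asymmetry is the difference in log-likelihood one standard error above versus below the mode, on each scale:

```r
# Sketch: likelihood asymmetry on the theta scale vs the logit scale
# for an illustrative binomial sample (x successes in n trials).
x <- 3; n <- 20
loglik_theta <- function(theta) x * log(theta) + (n - x) * log(1 - theta)
loglik_phi   <- function(phi) loglik_theta(plogis(phi))
theta_hat <- x / n
phi_hat <- qlogis(theta_hat)
se_theta <- sqrt(theta_hat * (1 - theta_hat) / n)   # SE on theta scale
se_phi   <- 1 / sqrt(n * theta_hat * (1 - theta_hat))  # SE on logit scale
asym_theta <- abs(loglik_theta(theta_hat + se_theta) -
                  loglik_theta(theta_hat - se_theta))
asym_phi <- abs(loglik_phi(phi_hat + se_phi) -
                loglik_phi(phi_hat - se_phi))
# The logit scale is markedly more symmetric (smaller asymmetry)
c(theta_scale = asym_theta, logit_scale = asym_phi)
```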