2014 | Book

# Design, Analysis, and Interpretation of Genome-Wide Association Scans

Author: Daniel O. Stram

Publisher: Springer New York

Book Series : Statistics for Biology and Health

This book presents the statistical aspects of designing, analyzing and interpreting the results of genome-wide association scans (GWAS) for genetic causes of disease using unrelated subjects. Particular detail is given to the practical aspects of employing the bioinformatics and data handling methods necessary to prepare data for statistical analysis. The goal in writing this book is to give statisticians, epidemiologists, and students in these fields the tools to design a powerful genome-wide study based on current technology, and to show them how to analyze the data from the resulting study.

Design and Analysis of Genome-Wide Association Studies provides a compendium of well-established statistical methods based upon single SNP associations. It also provides an introduction to more advanced statistical methods and issues. Because technology such as large-scale SNP arrays is changing quickly, the text emphasizes statistical concepts that apply to the problem of finding disease associations irrespective of the technology, so that its lessons carry over to future use with sequencing data. The author includes current bioinformatics tools while outlining the additional issues and tools that will arise with the extensive databases from future large-scale sequencing projects.

Abstract

This chapter provides an elementary introduction to some of the basic biology and technology that underlie genetic association studies that rely on dense genotyping of nominally unrelated individuals to discover genetic variants related to risk of disease and other outcomes, phenotypes, or traits. This chapter discusses relevant aspects of DNA and RNA architecture and the coding of amino acids, describes chromosomal organization, gives an overview of the most common types of sequence variation, and provides an overview of genotyping methods. It introduces concepts, databases, analysis programs, and example data that will be used in later portions of the book.

Abstract

An understanding of the genetics of current world populations provides the conceptual basis upon which today’s genetic association studies rest. This chapter focuses specifically on gaining a basic grounding in three general topics:

1. Linkage disequilibrium (LD), the nonrandom associations of alleles. A discussion on how linkage disequilibrium varies between populations due to multiple factors, such as random drift in allele frequencies in isolated populations, population migration, admixture, and population expansion.

2. Population heterogeneity. A discussion of the effects of population heterogeneity, including population stratification, admixture, and relatedness between subjects; specifically, the distribution of marker alleles and their apparent association with each other and with causal variants, using marker data for the empirical estimation of relatedness and kinship coefficients and of identity by descent probabilities.

3. The common disease-common variant hypothesis. Arguments in favor of the common disease-common variant hypothesis and a discussion of distributions of allele frequencies for both marker alleles and causal variants.
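As a concrete illustration of topic (1), the standard pairwise LD measures for two diallelic loci can be computed directly from haplotype and allele frequencies. The sketch below is not taken from the book; the function name `ld_measures` and the example frequencies are assumptions made for illustration.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two diallelic loci.

    p_ab : frequency of the haplotype carrying allele A at locus 1
           and allele B at locus 2
    p_a, p_b : marginal frequencies of alleles A and B
    """
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom <= 0:
        raise ValueError("both loci must be polymorphic")

    # D is the excess of the AB haplotype over its expectation
    # under independence of the two loci.
    d = p_ab - p_a * p_b

    # D' rescales D by its maximum possible magnitude given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max

    # r^2 is the squared correlation between the allele indicators.
    r2 = d * d / denom
    return d, d_prime, r2


# Hypothetical frequencies, chosen only for illustration.
d, d_prime, r2 = ld_measures(p_ab=0.3, p_a=0.4, p_b=0.5)
```

With these illustrative inputs, D is 0.1, D' is 0.5, and r² is 1/6; when two loci have the same allele frequency and the alleles always occur together, r² equals 1.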

In addition to these concepts, several important tools for the investigation of linkage disequilibrium and of population heterogeneity are introduced: specifically data from the HapMap project and data manipulation and LD visualization tools which help to explore these data effectively. Principal components analysis (PCA) of large-scale genetic data for the purpose of examining population substructure and admixture is introduced and illustrated using a download of phase 3 HapMap data for 11 population samples; an example of some simple PLINK commands and a corresponding R script is provided. These illustrate the selection (in PLINK) of a subset of SNPs to be used in the computation and display, by R, of leading principal components that characterize global population structure.
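The chapter's own example uses PLINK and R. As a rough stand-in, the following Python sketch shows the core computation behind a principal components display: standardize each SNP by its estimated allele frequency, then take the leading singular vectors. The data are simulated, not HapMap data, and `genotype_pcs` and all simulation settings are assumptions made for illustration.

```python
import numpy as np


def genotype_pcs(genotypes, n_components=2):
    """Leading principal component scores of a genotype matrix.

    genotypes : (n_individuals, n_snps) array of allele counts 0/1/2
    """
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0            # estimated allele frequencies
    keep = (freq > 0) & (freq < 1)         # drop monomorphic SNPs
    g, freq = g[:, keep], freq[keep]
    # Center each SNP and scale by its binomial standard deviation.
    z = (g - 2 * freq) / np.sqrt(2 * freq * (1 - freq))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Simulate two hypothetical populations whose allele frequencies differ.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 500
freq_pop1 = rng.uniform(0.1, 0.9, n_snps)
freq_pop2 = np.clip(freq_pop1 + rng.normal(0, 0.15, n_snps), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_pop1, (n_per_pop, n_snps)),
    rng.binomial(2, freq_pop2, (n_per_pop, n_snps)),
])
pcs = genotype_pcs(geno)  # rows 0-49: population 1, rows 50-99: population 2
```

In this simulation the first component separates the two groups. With real data, the SNPs would normally first be thinned to reduce LD, which is the role the chapter assigns to PLINK.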

Many parts of the broader field of population genetics are entirely ignored here. Notably we do not discuss, except parenthetically, natural selection as a force determining allele frequency distributions within modern populations or the differences seen between modern populations. Differences between populations in allele frequencies for marker alleles or causal variants are largely assumed here to be due to random drift or founder effects and population expansion.

Implicitly we are restricting interest, and this is part of topic (3), to common genetic causes of disease. The mere fact that the alleles are common indicates that the reproductive fitness of carriers of the alleles is not greatly impacted; even when looking at diseases that do affect reproductive fitness (such as fatal childhood diseases, early adult onset mental illnesses), the selective pressure against alleles that cause modest increases in risk of such disease may be very minor if the diseases themselves are rare.

Abstract

This chapter focuses on techniques commonly used in GWAS studies to estimate single SNP marker associations in samples of unrelated individuals. When the phenotype is discrete (disease/no disease), case–control methods based on conditional and unconditional logistic regression are typically utilized. Maximum likelihood estimation for generalized linear models is reviewed, and the score, Wald, and likelihood ratio tests are defined and discussed. The analysis of data from nuclear family-based designs is also briefly introduced. Issues regarding confounding, measurement error, effect mediation, and interactions are described. Control for multiple comparisons is reviewed with an emphasis placed on the behavior of the Bonferroni criterion for multiple correlated tests. The effects on statistical estimation and inference of the loss of independence between outcomes are characterized for a specific model of loss of independence, which is relevant to the presence of hidden population structure or relatedness. These last results build on a basic theme described in Chap. 2 and are then carried forward in Chap. 4.
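The Bonferroni criterion mentioned above amounts to dividing the desired family-wise error rate by the number of tests. A minimal sketch follows; the function names and the figure of one million tests are illustrative assumptions, not values from the book.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise
    type I error rate at or below alpha."""
    return alpha / n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Flag the p-values that pass the Bonferroni criterion."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# With one million tests (an illustrative count) and alpha = 0.05,
# the per-test threshold is 5e-8.
per_test = bonferroni_threshold(0.05, 1_000_000)
```

The bound holds whether or not the tests are independent. When tests are positively correlated, as they are for SNPs in LD, the criterion tends to be conservative.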

Abstract

Chapter 2 discussed both relatedness of study participants and hidden population structure in terms of the correlations induced between the number of copies, $n_{iA}$ and $n_{jA}$, of a diallelic genetic variant carried by two individuals i and j. In Chap. 3 we discussed the requirement for association studies of unrelated subjects that the outcomes of interest, $Y_{i}$, be independent between study subjects. In this chapter we will expand on this initial discussion (1) to examine the impact of non-independence on the distribution of statistical tests for the influence of alleles (here a and A) on phenotype or disease risk, and (2) how non-independence between individuals’ outcomes can arise as a direct result of correlation among the genotypes of study subjects due to hidden strata or relatedness or due to other factors (e.g., cultural/behavioral) that act as confounders of genetic associations. The chapter introduces several basic approaches for dealing with population structure in single marker association analyses and shows how all these methods deal, at least in part, with the fundamental problem of the analysis of correlated phenotypes. At the heart of these methods is the empirical estimation of a relationship matrix (more precisely a covariance structure matrix) that describes the relative relatedness of individuals. The statistical methods for dealing with covariances in estimation of single marker effects fall into three categories: fixed effects models utilizing adjustment for eigenvectors (“principal components”) of this matrix; random effects methods dealing explicitly with the relationship matrix as a covariance matrix of random effects in extended generalized linear modeling; and retrospective methods, which invert the usual generalized linear modeling procedures so that the conditional distribution of the genetic markers given the phenotypes (rather than the reverse) is used for inference in genetic association studies. Our discussion of all these approaches is unified around the theme of dealing with false-positive associations that are due to unrecognized inflation of the variance of estimators relied upon in traditional regression methods when correlated data are analyzed. Finally the relative performance of the various methods is described in various settings.
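One common way to estimate such a relationship matrix empirically is to average the cross-products of standardized genotypes over SNPs. The sketch below uses that estimator on simulated data. It is an assumed, generic form rather than necessarily the estimator the book uses, and `relationship_matrix` and the simulation settings are hypothetical.

```python
import numpy as np


def relationship_matrix(genotypes):
    """Empirical relationship matrix K = Z Z^T / M, where Z holds
    genotypes standardized by estimated allele frequency and M is
    the number of polymorphic SNPs."""
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0
    keep = (freq > 0) & (freq < 1)         # drop monomorphic SNPs
    g, freq = g[:, keep], freq[keep]
    z = (g - 2 * freq) / np.sqrt(2 * freq * (1 - freq))
    return z @ z.T / z.shape[1]


# Simulated unrelated individuals plus one duplicated sample.
rng = np.random.default_rng(1)
freqs = rng.uniform(0.1, 0.9, 300)
geno = rng.binomial(2, freqs, (40, 300))
geno[1] = geno[0]                          # hypothetical duplicate of sample 0

K = relationship_matrix(geno)

# Leading eigenvectors of K, the quantities used as covariates in the
# fixed effects ("principal components") approach.
eigvals, eigvecs = np.linalg.eigh(K)       # eigenvalues in ascending order
top_pcs = eigvecs[:, ::-1][:, :2]
```

Here the off-diagonal entry for the duplicated pair matches the corresponding diagonal entry, while entries for unrelated pairs scatter around zero. The same matrix could serve as the covariance of random effects in the second category of methods.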

Abstract

This chapter discusses extending association analyses to include a larger set of hypotheses beyond just the single markers that have been genotyped in a particular study. This chapter first reviews haplotype frequency estimation and imputation. It gives the details of EM estimation and haplotype imputation for a small number of SNPs using data from unrelated subjects and then considers the extension of this method to larger numbers of SNPs. The partition-ligation EM algorithm is detailed as a method of obtaining haplotype count estimates for individuals in an association study for a moderate number of SNPs.
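For a small number of SNPs, EM estimation of haplotype frequencies can be sketched by enumerating every haplotype pair compatible with each unphased genotype. The code below assumes Hardy-Weinberg equilibrium and is only practical for a handful of SNPs; it does not implement partition-ligation, and the function names and toy data are assumptions made for illustration.

```python
from itertools import product


def compatible_pairs(genotype):
    """Unordered haplotype pairs whose allele counts sum to the genotype."""
    haps = list(product((0, 1), repeat=len(genotype)))
    pairs = []
    for i, h1 in enumerate(haps):
        for h2 in haps[i:]:
            if all(a + b == g for a, b, g in zip(h1, h2, genotype)):
                pairs.append((h1, h2))
    return pairs


def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    """EM estimates of haplotype frequencies from unphased genotypes."""
    haps = list(product((0, 1), repeat=len(genotypes[0])))
    freqs = {h: 1.0 / len(haps) for h in haps}   # uniform starting values
    pair_lists = [compatible_pairs(g) for g in genotypes]
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for pairs in pair_lists:
            # E-step: posterior probability of each compatible pair;
            # heterozygous pairs can arise in two orders.
            weights = [(1 if h1 == h2 else 2) * freqs[h1] * freqs[h2]
                       for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by chromosomes sampled.
        new_freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        change = max(abs(new_freqs[h] - freqs[h]) for h in haps)
        freqs = new_freqs
        if change < tol:
            break
    return freqs


# Toy data for two SNPs; only the double heterozygote (1, 1) is ambiguous.
data = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
estimates = em_haplotype_freqs(data)
```

In this toy data set the unambiguous genotypes support only haplotypes 00 and 11, so EM resolves the double heterozygotes as 00/11 and the estimates approach 0.5 for each of those two haplotypes. The posterior pair weights computed in the E-step are also what an expected (imputed) haplotype count for each subject would be built from.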

The problem of haplotype-specific risk estimation and incorporation of SNP haplotype analysis into generalized linear regression models is considered in some detail; first a simple substitution of imputed for true haplotypes into association testing is described. Since this method ignores uncertainties in the estimation of the haplotype frequencies that underlie the imputation of haplotypes, the chapter also considers simultaneous maximum likelihood estimation of all parameters (risk estimates and haplotype frequency estimates), so that haplotype uncertainty is formally taken account of in the construction of hypothesis tests and confidence intervals. For case–control data, a full likelihood-based approach also must take into account ascertainment of cases and controls as haplotype frequencies in the sample may not reflect haplotype frequencies in the population, specifically for haplotypes that are associated with risk.

The substitution of expected (imputed) haplotype count for unobserved true haplotype count in regression analysis is a special case of the “expectation-substitution” method described in the statistical literature on the subject of regression parameter estimation when measurement errors occur in the explanatory variables. It is known to have reasonable statistical properties in many analyses, especially for forming tests of the null hypothesis (of no haplotype-specific effects). However under the alternative hypothesis, the validity of the expectation-substitution method can be questioned, and joint estimation as well as ascertainment correction (for case–control sampling) may be considered. This and other problems in haplotype-specific risk estimation are discussed.

Finally some special requirements of imputing SNPs or haplotypes in heterogeneous populations are described.

Abstract

This chapter also (as does Chap. 5) discusses an extension of association analyses to include a larger set of hypotheses beyond just the single markers that have been genotyped in a particular study. Imputed SNP analysis is in certain respects identical to haplotype analysis since the imputed SNPs are on specific haplotypes or haplotype combinations. Testing imputed SNPs as well as genotyped SNPs is thus simply a more focused kind of haplotype analysis and serves to extend the set of hypotheses that are tested to encompass known but ungenotyped variants. Imputed SNP analysis plays a special role during a post-GWAS phase when during meta-analysis many studies are combined in efforts to find associations that are too small to be detectable in any one study. Imputation is necessarily relied upon when (as is generally the case) not all studies used the same genotyping platform or chip version.

This chapter discusses the basic statistical method, namely the Hidden Markov Model (HMM), that is used for fast and very large-scale SNP imputation in a number of high-performance programs. A brief introduction to HMM methods is provided, and the basic principles behind estimating the parameters of an HMM are illustrated with R code. The basics of a particular algorithm, patterned loosely after that implemented in the program MACH, are described.
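The book illustrates HMM estimation with R code. As a language-neutral reminder of the central computation, the sketch below gives the scaled forward algorithm for the likelihood of an observed sequence under a generic discrete HMM. It is not the MACH algorithm, and the two-state parameter values are hypothetical.

```python
import numpy as np


def forward_loglik(init, trans, emit, obs):
    """Log-likelihood of an observation sequence under a discrete HMM.

    init  : (S,) initial state probabilities
    trans : (S, S) transition matrix, trans[i, j] = P(next = j | current = i)
    emit  : (S, V) emission matrix, emit[s, v] = P(observe v | state s)
    obs   : sequence of observed symbol indices
    """
    alpha = init * emit[:, obs[0]]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = (alpha @ trans) * emit[:, obs[t]]
        # Rescale at every step to avoid numerical underflow.
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha = alpha / scale
    return loglik


# Hypothetical two-state model, for illustration only.
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit = np.array([[0.9, 0.1],
                 [0.3, 0.7]])
ll = forward_loglik(init, trans, emit, [0, 1, 1])
```

In imputation settings the hidden states typically index the reference haplotypes being copied and the observations are alleles at genotyped markers. Parameter estimation wraps forward and backward passes of this kind inside an iterative scheme.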

Since nearly all large-scale SNP imputation methods require that phased haplotypes be provided for SNPs to be imputed (or measured for the purpose of imputation), a discussion of the use of phasing algorithms is also provided, with the details also modeled after the MACH program.

The use of imputed SNPs as independent variables in regression analysis is introduced with the discussion mostly focused on the same approach (expectation substitution) used in haplotype analysis. Use of imputed SNPs in association analyses for a single study is described, although the use of imputed SNPs in meta-analysis is deferred until Chap. 8.

Abstract

The subject of this chapter is sample size and power calculations for studies of genetic associations, for both case–control studies and prospective studies of a quantitative phenotype or trait. This chapter starts with a comparison of three different methods for calculating power for the simplest single marker studies and then goes on to consider the general statistical approach to power calculations that is embodied in the very widely used QUANTO program (Gauderman, W., & Morrison, J. (2006). QUANTO 1.1: A computer program for power and sample size calculations for genetic-epidemiology studies. http://hydra.usc.edu/gxe) for a variety of study designs, including those which are impervious to population stratification, specifically sibling-matched and parent-affected-offspring studies. The chapter then discusses additional considerations affecting sample size and power including (1) control for multiple comparisons, (2) control for population stratification by the principal components methods discussed in Chap. 4, (3) multi-staged genotyping designs, (4) fine-mapping of associations using multi-marker analyses, (5) power calculations for haplotype analysis and imputed marker analysis, and (6) reuse of existing data for new studies.
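A generic normal-approximation power calculation for a two-sided single-marker test can be sketched as follows. This is a textbook approximation under the assumption that the estimate is normally distributed with a known standard error. It is neither QUANTO's method nor one of the three methods the chapter compares, and `power_two_sided` and the numeric inputs are illustrative.

```python
from statistics import NormalDist


def power_two_sided(effect, se, alpha=0.05):
    """Approximate power of a two-sided z-test of effect = 0.

    effect : true value of the parameter (e.g., a log odds ratio)
    se     : standard error of its estimate at the planned sample size
    alpha  : per-test type I error rate
    """
    std_normal = NormalDist()
    crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / se
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - crit) + std_normal.cdf(-shift - crit)


# The same hypothetical effect evaluated at a nominal level and at a
# genome-wide, Bonferroni-style level.
power_nominal = power_two_sided(effect=0.2, se=0.05, alpha=0.05)
power_gwas = power_two_sided(effect=0.2, se=0.05, alpha=5e-8)
```

Comparing the two results shows how strongly control for multiple comparisons, item (1) in the list above, reduces power for a fixed effect size and sample size.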

Abstract

The term post-GWAS analyses here refers to two somewhat distinct general topics. The first is a compendium of analyses that are typically performed after one or more GWAS studies of a particular disease have been completed. These analyses include pooled analysis or meta-analysis used to combine the results of two or more studies, typically with the help of large-scale SNP imputation as discussed in Chap. 6. Additional analyses include replication, in studies of other racial/ethnic groups, of results often found first in Europeans. Discussion of this topic is broadened to include what has been called multiethnic fine mapping. Adjustment for local ancestry in studies of admixed groups, as an aid to fine mapping within a single group, is discussed as well. Heritability estimation using GWAS data is also considered.
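The simplest way to combine per-study results for one SNP is fixed-effect, inverse-variance-weighted meta-analysis. The sketch below shows that standard calculation; whether the book uses exactly this form is an assumption, and the function name and inputs are hypothetical.

```python
from math import sqrt


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis.

    betas : per-study effect estimates for one SNP (same allele coding)
    ses   : the corresponding standard errors
    Returns (pooled estimate, pooled standard error, z statistic).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = sqrt(1.0 / total_weight)
    return beta, se, beta / se


# Two hypothetical studies with equal precision.
beta, se, z = fixed_effect_meta([0.2, 0.4], [0.1, 0.1])
```

With equal standard errors the pooled estimate is the simple average, 0.3, and the pooled standard error shrinks by a factor of √2. In practice the studies must first be aligned to the same reference allele, and imputation supplies estimates for SNPs that a given study did not genotype.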

A second set of topics is termed post-GWAS because they relate to issues raised by a new technology, namely, next-generation whole genome sequencing (WGS), which is currently being evaluated for large-scale association studies. The main raison d’être for WGS is to allow for interrogation of rare variation that cannot be measured on GWAS SNP arrays used to date. This chapter covers some of the statistical topics related to the assessment of the role of rare variation, especially composite groups of rare variation related to each other through their mode of action, pathway membership, physical location in or between genes, etc.
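One simple way to form a composite of rare variation is a burden score: for each individual, count the rare alleles carried across the variants in a group such as a gene or pathway. The sketch below illustrates that idea only. The frequency cutoff, the function name, and the assumption that each counted allele is the rare one are illustrative choices, not the book's specification.

```python
def burden_scores(genotypes, allele_freqs, maf_cutoff=0.01):
    """Per-individual count of rare alleles across a group of variants.

    genotypes    : list of rows, one per individual, of allele counts 0/1/2
    allele_freqs : frequency of the counted allele at each variant
                   (assumed here to be the rare allele)
    maf_cutoff   : variants with frequency below this value count as rare
    """
    rare = [j for j, f in enumerate(allele_freqs) if f < maf_cutoff]
    return [sum(row[j] for j in rare) for row in genotypes]


# Three hypothetical individuals typed at three variants; only the
# first two variants fall below the cutoff.
scores = burden_scores(
    genotypes=[[0, 1, 2], [1, 0, 0], [0, 0, 1]],
    allele_freqs=[0.005, 0.002, 0.3],
)
```

The resulting score can then be used as a single explanatory variable in a regression model, which trades the ability to distinguish individual variants for a test with one degree of freedom.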