Application of quantile regression to recent genetic and -omic studies

Briollais, Laurent; Durrieu, Gilles

doi:10.1007/s00439-014-1440-6


Review Paper
Published: 26 April 2014

Human Genetics, Volume 133, pages 951–966 (2014)

Laurent Briollais¹ &
Gilles Durrieu²


Abstract

This paper reviews recent applications of quantile regression in genetic and emerging -omic studies. It begins with general background on this statistical approach, following the seminal paper of Koenker and Bassett (Econometrica 46:33–50, 1978). The applications described are diverse: genetic association studies, penetrance estimation, gene expression, CGH array experiments, RNA-seq experiments, methylation data and proteomics. The paper also introduces recent extensions of quantile regression, with a particular focus on Copula quantile regression, an approach we recently proposed for sib-pair analysis. A real data example from eQTL analysis is then presented, and the $R$ code that runs the analyses is provided. Finally, we conclude with a presentation of statistical software and some general remarks on the potential and interest of quantile regression in modern biological experiments.


References

Beyerlein A, von Kries R, Ness AR, Ong KK (2011) Genetic markers of obesity risk: stronger associations with body composition in overweight compared to normal-weight children. PLoS ONE 6(4):e19057
Bilias Y, Chen S, Ying Z (2000) Simple resampling methods for censored regression quantiles. J Econ 99:373–386
Bolstad BM, Irizarry RA, Astrand M, Speed TP (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2):185–193
Boscovich RJ (1757) De litteraria expeditione per pontificiam ditionem, et synopsis amplioris operis, ac habentur plura eius ex exemplaria etiam sensorum impressa. Bononiensi Scientiarum et Artium Instituto Atque Academia Commentarii 4:353–396
Bouyé E, Salmon M (2002) Dynamic copula quantile regressions and tail area dynamic dependence in forex markets. Eur J Fin 15(7):721–750
Callister SJ, Barry RC, Adkins JN, Johnson ET, Qian W, Webb-Robertson B-JM, Smith RD, Lipton MS (2006) Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. J Proteome Res 5(2):277–286
Cardoso J, Molenaar L, de Menezes RX, van Leerdam M, Rosenberg C, Möslein G, Sampson J, Morreau H, Boer JM, Fodde R (2006) Chromosomal instability in myh- and apc-mutant adenomatous polyps. Cancer Res 66(5):2514–2519
Dodge Y, Jurečková J (1995) Estimation of quantile density function based on regression quantiles. Stat Probab Lett 23:73–78
Durrieu G, Briollais L (2009) Sequential design for microarray experiments. J Am Stat Assoc 104(104):650–660
Edgeworth F (1888) On a new method of reducing observations relating to several quantities. Philos Mag 25:184–191
Eilers PHC, de Menezes RX (2005) Quantile smoothing of array cgh data. Bioinformatics 21(7):1146–1153
Falconer DS, McKay TFC (1996) Introduction to quantitative genetics, 4th edn. Longmans Green, Harlow
Gao X, Huang J (2010) A robust penalized method for the analysis of noisy dna copy number data. BMC Genom 11:517
Gu C, Todorov AA, Rao DC (1997) Genome screening using extremely discordant and extremely concordant sib pairs. Genet Epidemiol 14:791–796
Gutenbrunner CJ, Jurečková J, Koenker R, Portnoy S (1993) Tests of linear hypotheses based on regression rank scores. J Non Parametr Stat 2:307–333
Hansen KD, Irizarry RA (2012) Removing technical variability in rna-seq data using conditional quantile normalization. Biostatistics 13(2):204–216
Haring R, Wallaschofski H, Teumer A, Kroemer H, Taylor AE, Shackleton CHL, Nauck M, Volker U, Homuth G, Arlt W (2013) A sult2a1 genetic variant identified by gwas as associated with low serum dheas does not impact on the actual dhea/dheas ratio. J Mol Endocrinol 50:73–77
Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2:3–19
He X, Shao Q (1996) A general bahadur representation of m-estimators and its application to linear regression with non stochastic designs. Ann Stat 24:2608–2630
Hecker LA, Edwards AO, Ryu E, Tosakulwong N, Baratz KH, Brown WL, Issa PC, Scholl HP, Pollok-Kopp B, Schmid-Kubista KE, Balley KR, Oppermann M (2009) Genetic control of the alternative pathway of complement in humans and age-related macular degeneration. Human Mol Genet 19:209–215
Ho JWK, Stefani M, Remedios CGR, Charleston MA (2009) A model selection approach to discover age-dependent gene expression patterns using quantile regression models. BMC Genom 10(3):1–18
Huang BE, Lin DY (2007) Efficient association mapping of quantitative trait loci using selective genotyping. Am J Human Genet 80:567–576
Huang L, Zhu W, Saunders CP, MacLeod JN, Zhou M, Stromberg AJ, Bathke AC (2008) A novel application of quantile regression for identification of biomarkers exemplified by equine cartilage microarray data. BMC Bioinform 9:1–8
Khmaladze E (1981) Martingale approach in the theory of goodness-of-fit tests. Theory Probab Appl 26:240–257
Kocherginsky M, He X, Mu Y (2005) Practical confidence intervals for regression quantiles. J Comput Graph Stat 14:41–55
Koenker R (1994) Confidence intervals for regression quantiles. Springer, New-York
Koenker R (1996) Rank tests for linear models. Springer, New-York
Koenker R (2005) Quantile regression. Cambridge University Press, New-York
Koenker R (2008) Censored quantile regression redux. J Stat Softw 27:1–14
Koenker R, Park BJ (1996) An interior point algorithm for nonlinear quantile regression. J Econ 71:265–283
Koenker R, Xiao Z (2002) Inference on the quantile regression process. Econometrica 70:1583–1612
Koenker RW, Bassett G (1978) Regression quantiles. Econometrica 46:33–50
Kottas A, Gelfand AE (2001) Bayesian semiparametric median regression modeling. J Am Stat Assoc 96:1458–1468
Li D, Lewinger JP, Gauderman WJ, Murcray CE, Conti D (2011) Using extreme phenotype sampling to identify the rare causal variants of quantitative traits in association studies. Genet Epidemiol 35(8):790–799
Li Y, Zhu J (2007) Analysis of array cgh data for cancer studies using fused quantile regression. Bioinformatics 23(18):2470–2476
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke AG, Clark M, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM (2009) Finding the missing heritability of complex diseases. Nature 461(7265):747–753
Morley M, Molony CM, Weber T, Devlin JL, Ewens KG, Spielman RS, Cheung VG (2004) Genetic analysis of genome-wide variation in human gene expression. Nature 430:743–747
Nelsen RB (1998) An introduction to copulas. Springer, New-York
Olivieri O, Martinelli N, Sandri M, Bassi A, Guarini P, Trabetti E, Pizzolo F, Girelli D, Friso S, Pignatti PF, Corrocher R (2005) Apolipoprotein C-III, n-3 polyunsaturated fatty acids, and insulin-resistant t455c apoc3 gene polymorphism in heart disease patients: Example of gene-diet interaction. Clin Chem 51(2):360–367
Parzen MI, Wei L, Ying Z (1994) A resampling method based on pivotal estimating functions. Biometrika 81:341–350
Peng L, Huang Y (2008) Survival analysis with quantile regression models. J Am Stat Assoc 103:637–649
Pinkel D, Albertson DG (2005) Comparative genomic hybridization. Annu Rev Genom Human Genet 6:331–354
Portnoy S (2003) Censored quantile regression. J Am Stat Assoc 98:1001–1012
Rippe RC, Meulman JJ, Eilers PH (2012) Visualization of genomic changes by segmented smoothing using $l_0$ penalty. PLoS ONE 7:e38230
Risch N, Zhang H (1995) Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268:1584–1589
Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, GuhaThakurta D, Sieberts SK, Monks S, Reitman M, Zhang C, Lum PY, Leonardson A, Thieringer R, Metzger JM, Yang L, Castle J, Zhu H, Kash SFH, Drake TA, Sachs A, Lusis AJ (2005) An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics 37:710–717
Scher AI, Terwindt GM, Verschuren WM, Kruit MC, Blom HJ, Kowa H, Frants RR, van den Maagdenberg AM, van Buchem M, Ferrari MD, Launer LJ (2006) Migraine and mthfr c677t genotype in a population-based sample. Ann Neurol 59(2):372–375
Scholkopf B, Smola A (2002) Statistical learning theory. MIT Press, New-York
Simon RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y (2003) Design and analysis of DNA microarray investigations. Springer, New York
Sklar A (1959) Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique de l’Université de Paris 8:229–231
Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG (2001) Assembly of microarrays for genome-wide measurement of dna copy number. Nature Genet 29(3):263–264
Sohn I, Kim S, Hwang C, Lee JW (2008a) New normalization methods using support vector machine quantile regression approach in microarray analysis. Comput Stat Data Anal 52:4104–4115
Sohn I, Kim S, Hwang C, Lee JW, Shim J (2008b) Support vector machine quantile regression for detecting differentially expressed genes in microarray analysis. Methods Inf Med 5:459–467
Sun S, Chen Z, Yan PS, Huang Y-W, Huang THM, Lin S (2011) Identifying hypermethylated cpg islands using a quantile regression model. BMC Bioinform 12:54
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B 67(1):91–108
Vapnik VN (1998) Statistical learning theory, New-York
Vinciotti V, Yu K (2009) M-quantile regression analysis of temporal gene expression data. Stat Appl Genet Mol Biol 8(1):1–20
Wang H, He X (2007) Detecting differential expressions in genechip microarray studies: a quantile approach. J Am Stat Assoc 102:104–112
Wang H, He X (2008) An enhanced quantile approach for assessing differential gene expressions. Biometrics 64:449–457
Wang K, Li W-D, Zhang CK, Wang Z, Glessner JT, Grant SFA, Zhao H, Hakonarson H, Price RA (2011) A genome-wide association study on obesity and obesity-related traits. PLoS ONE 7(2):e18939
Williams PT (2012) Quantile-specific penetrance of genes affecting lipoproteins, adiposity and height. PLoS One 7(1):e28764
Wu Z, Aryee MJ (2010) Subset quantile normalization using negative control features. J Comput Biol 17(10):1385–1395
Yoon D, Lee E-K, Park T (2007) Robust imputation method for missing values in microarray data. BMC Bioinform 8(Suppl. 2):S6
Yu K, Moyeed RA (2001) Bayesian quantile regression. Stat Probab Lett 54(4):437–447


Acknowledgments

This work was partly funded by a MITACS grant from Dr. Briollais. We thank Mohamedou Sow for running the Copula quantile regression example. The authors are also grateful to the editor, associate editor and the two referees for their suggestions and comments, which greatly improved this article.

Author information

Authors and Affiliations

Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, 700, University Avenue, Toronto, ON, M5G 1X5, Canada
Laurent Briollais
Laboratoire de Mathématiques de Bretagne Atlantique LMBA, UMR CNRS 6205 et Université de Bretagne Sud, Campus de Tohannic, 56017, Vannes, France
Gilles Durrieu


Corresponding author

Correspondence to Laurent Briollais.

Appendix

Supplemental material: R code for the real data example

The code illustrates the application of quantile regression (QR) and Copula quantile regression to the CEPH data introduced in Sect. 4. Following the approach of Haseman and Elston (1972), the goal is to regress the squared difference between two sibs’ trait values on the proportion of alleles shared identical by descent at the SNP marker considered. The trait is the gene expression of each of the three transcripts studied in Table 1.

The data set containing the variables is ‘bb’. We used the variables ‘trait1$X217225’ and ‘trait2$X217225’, which are the gene expression trait values of sib1 and sib2, respectively. The variable ‘IBD’ is the probability of alleles shared identical by descent for the pair of relatives at the SNP marker.

The following line of code runs QR treating the errors as independent and identically distributed (se=‘iid’). The quantile level is given by ‘tau’; the case ‘tau=0.5’ corresponds to median regression. The command ‘summary’ prints the parameter estimates, standard errors and p values.

The following line of code runs QR using a Huber sandwich variance estimate for the parameters of interest (se=‘nid’).
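To make the estimation target concrete, here is a minimal Python sketch of the Haseman-Elston style quantile regression described above. It is an illustration only and is not the authors' $R$ code: the function names and the toy numbers are hypothetical, and standard errors (the ‘iid’ and ‘nid’ options) are not computed. It relies on the fact that, with one covariate, a minimiser of the Koenker-Bassett check loss can be found among the lines passing through two observations, so a small data set can be solved by enumeration.

```python
from itertools import combinations


def squared_sib_differences(trait_sib1, trait_sib2):
    """Haseman-Elston response: squared trait difference for each sib pair."""
    return [(a - b) ** 2 for a, b in zip(trait_sib1, trait_sib2)]


def check_loss(residual, tau):
    """Koenker-Bassett check (pinball) loss for a single residual."""
    return residual * tau if residual >= 0 else residual * (tau - 1)


def fit_simple_qr(x, y, tau):
    """Exact tau-th quantile regression of y on a single covariate x.

    Enumerates every line through two observations with distinct x
    values and keeps the one with the smallest total check loss.
    Cost grows as n^3, so this is only meant for small illustrations.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # two points with equal x do not define a regression line
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(check_loss(yk - intercept - slope * xk, tau)
                   for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Hypothetical toy data (not the CEPH data): IBD sharing and sib trait values.
ibd = [0.0, 0.5, 1.0, 0.0, 0.5, 1.0, 0.5]
sib1 = [3.0, 2.5, 2.0, 1.0, 1.5, 2.2, 4.0]
sib2 = [1.0, 1.0, 1.5, 3.1, 2.6, 2.0, 0.5]

response = squared_sib_differences(sib1, sib2)
for tau in (0.25, 0.5, 0.75):
    intercept, slope = fit_simple_qr(ibd, response, tau)
    print(f"tau={tau}: intercept={intercept:.3f}, slope={slope:.3f}")
```

Under linkage, the squared sib-pair difference is expected to decrease as IBD sharing increases, so a negative slope at a given quantile is the pattern of interest.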

The code below is the function for the Copula quantile regression using a Gaussian Copula (see Eq. 7 of “Appendix 2”). In this function, ‘ecdf’ computes an empirical cumulative distribution function, ‘pnorm’ is the normal cumulative distribution function, ‘qnorm’ is the normal quantile function (the inverse of ‘pnorm’) and ‘rho’ is the correlation coefficient of the Copula function.

The following line of code runs the Copula quantile regression function given above. The function ‘nlrq’ performs non-linear quantile regression and ‘start’ gives the starting value of the parameter ‘rho’, i.e. the correlation, of the Gaussian Copula.

Supplemental material: recent extensions of QR

Beyond the more standard approaches for QR presented in the previous sections, recent extensions of the methodology have also been proposed including support vector machine (SVM) quantile regression, the Copula quantile regression and nonlinear quantile regression.

Appendix 1: Support vector machine quantile regression

The SVM algorithm developed by Vapnik (1998) is based on statistical learning theory. In classification problems, the objective is to determine an optimal hyperplane that separates the classes, which is equivalent to maximizing the margin between classes. In support vector machine regression (SVR) (Scholkopf and Smola 2002; Sohn et al. 2008a, b), the goal is to determine a hyperplane that fits the data well. The basic idea of SVR is to find a model function $f({\varvec{x}})$ representing the relationship between several characteristics associated with environmental and genomic information and a target such as microarray gene expression. Sohn et al. (2008b) applied SVM quantile regression (SVMQR) to cDNA microarray expression data to identify differentially expressed genes. They used three microarray data sets: experiments on a diet-induced obese (DIO) mouse model, data on E. coli, and data on high-density lipoprotein (HDL)-deficient mice. For each of these data sets, the authors compared two treatment groups (high- vs. low-fat diet in the first experiment, drug A vs. drug B in the second experiment and apoAI knockout mice vs. wild-type mice in the third experiment). The authors concluded that the SVMQR approach was superior to classical methods for identifying differentially expressed genes. In a companion paper, Sohn et al. (2008a) proposed using SVMQR for print-tip normalization of the expression data. Their goal was to adjust for systematic variation due to dye biases, where those biases depend on the overall spot intensity and/or the spatial location within the array. Using the same microarray data sets, the authors showed that their approach outperformed methods based on Loess adjustment.

Appendix 2: Copula quantile regression

Characterizing the dependence between random variables at a given quantile is an important problem, especially if the distributions of the variables are heavy-tailed. A QR modeling approach based on a Copula function can handle the dependency between two variables, e.g. $X$ and $Y$. The idea is that the joint distribution $F_{X,Y}$ can be decomposed into the marginal distributions of $X$ and $Y$, denoted, respectively, by $F_X$ and $F_Y$, and a dependence function specified by the Copula (Bouyé and Salmon 2002). The form of the quantile relationship implied by the copula can then be deduced. According to Sklar’s theorem (Sklar 1959), there exists a unique bivariate copula ${\varvec{C}}: [0,1]^2 \rightarrow [0,1]$ satisfying

$$\begin{aligned} F_{X,Y}(x, y)={\varvec{C}}\left( F_X(x), F_Y(y)\right) , \end{aligned}$$

where $F_X$ and $F_Y$ are continuous and represent the marginal distribution functions of $X$ and $Y$, respectively. Different families of Copulas correspond to different types of dependence structure (see Nelsen (1998) for a general introduction to Copulas). With Gaussian Copulas, the dependence is measured by a correlation $\rho$. The bivariate Gaussian Copula has the form

$$\begin{aligned} C(x,y; \rho )= \Phi _2 \left( \Phi ^{[-1]}(x), \Phi ^{[-1]}(y); \rho \right) \end{aligned}$$

where $\Phi _2$ corresponds to the bivariate Gaussian distribution, $\Phi$ to the univariate distribution and $\Phi ^{[-1]}$ to the pseudo-inverse of $\Phi$. So, by Sklar’s theorem, for marginal distribution functions $F_X$ and $F_Y$, the joint distribution

$$\begin{aligned} F _{X,Y}(x, y) = \Phi _2 \left( \Phi ^{[-1]}(F_X(x)), \Phi ^{[-1]}(F_Y(y)); \rho \right) \end{aligned}$$

is a bivariate distribution function with marginals $F_X$ and $F_Y$ and the Copula that connects $F_{X,Y}(x,y)$ to $F_X(x)$ and $F_Y(y)$ is the Gaussian Copula. For a parametric copula ${\varvec{C}}(.,.;\rho )$, the $\theta$-th copula quantile curve of $y$ conditional on $x$ is defined by

$$\begin{aligned} \theta = C^{\star }(F_X(x),F_Y(y) ; \rho ) \end{aligned}$$

(5)

where $C^{\star }(x,y;\rho ) = \frac{\partial }{\partial x} {\varvec{C}}(x,y; \rho )$. The relationship between $X$ and $Y$ can be expressed using (5) by

$$\begin{aligned} y = q(x,\theta ; \rho ) \end{aligned}$$

(6)

where $q(x,\theta ; \rho ) = F_Y^{[-1]}\left( D(F_X(x),\theta ; \rho )\right)$ with $D$ the partial inverse in the second argument of $C^{\star }$ and $F_Y^{[-1]}$ the pseudo-inverse of $F_Y$. The relationship (6) can also be expressed using uniform margins as

$$\begin{aligned} v = r(u,\theta ; \rho ) \end{aligned}$$

(7)

with $u=F_X(x)$ and $v=F_Y(y)$. By (5) and (7), the $\theta$-th Gaussian copula quantile is defined by

$$\begin{aligned} \theta = \Phi \left( \frac{\Phi ^{[-1]}(v) -\rho \,\Phi ^{[-1]} (u)}{\sqrt{1-\rho ^2}} \right) , \end{aligned}$$

and the relationship between $y$ and $x$ is

$$\begin{aligned} y=F_Y^{[-1]} \left( \Phi \left( \rho \,\Phi ^{[-1]}(F_X(x)) + \sqrt{1-\rho ^2}\,\Phi ^{[-1]}(\theta ) \right) \right) . \end{aligned}$$

(8)
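The step from (5) to the Gaussian copula quantile can be filled in with a standard property of the bivariate normal distribution. The following is a sketch of that argument; the auxiliary variables $Z_1$, $Z_2$ and the uniform variables $U=F_X(X)$, $V=F_Y(Y)$ are introduced here for illustration.

$$\begin{aligned} &Z_1 = \Phi ^{[-1]}(U),\quad Z_2 = \Phi ^{[-1]}(V),\quad Z_2 \mid Z_1 = z_1 \sim N\left( \rho z_1,\, 1-\rho ^2\right) ,\\ &C^{\star }(u,v;\rho ) = P(V \le v \mid U = u) = \Phi \left( \frac{\Phi ^{[-1]}(v) - \rho \,\Phi ^{[-1]}(u)}{\sqrt{1-\rho ^2}} \right) . \end{aligned}$$

Setting this expression equal to $\theta$ and solving for $v$ gives $v = \Phi \left( \rho \,\Phi ^{[-1]}(u) + \sqrt{1-\rho ^2}\,\Phi ^{[-1]}(\theta )\right)$, which becomes (8) after substituting $u=F_X(x)$ and applying $F_Y^{[-1]}$.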

The $\theta$-th copula QR $q(x_i,\theta ,\rho )$ is defined as any solution to the minimization problem

$$\begin{aligned} \min _{\rho } \left( \sum _{i\in {\mathcal F}_{\theta }} \theta |y_i - q(x_i,\theta ; \rho )| + \sum _{i \in {\mathcal F}_{1-\theta }} (1-\theta ) |y_i - q(x_i,\theta ; \rho )|\right) \end{aligned}$$

with ${\mathcal F}_{\theta } = \{i : y_i \ge q(x_i,\theta ;\rho )\}$ and ${\mathcal F}_{1-\theta }$ its complement.

Using Copula quantile regression, the dependence $\rho$ can be estimated at various quantiles. Moreover, depending on the choice of the Copula function, the relationship between the two random variables can be nonlinear, which offers some flexibility in modeling the dependence between them. An application of this method is given in Sect. 5.
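The Gaussian copula quantile curve (8) and the minimization over $\rho$ can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the empirical distribution function is rescaled by $n+1$ to keep its values strictly inside $(0,1)$, $F_Y^{[-1]}$ is replaced by an empirical quantile, and a plain grid search over $\rho$ stands in for the non-linear solver (‘nlrq’) used in the paper.

```python
from bisect import bisect_right
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # .cdf is Phi, .inv_cdf is its inverse


def gaussian_copula_quantile(u, theta, rho):
    """v = r(u, theta; rho): Gaussian copula quantile curve, uniform margins."""
    z = rho * STD_NORMAL.inv_cdf(u) + sqrt(1 - rho ** 2) * STD_NORMAL.inv_cdf(theta)
    return STD_NORMAL.cdf(z)


def conditional_level(u, v, rho):
    """theta = C*(u, v; rho), the inverse relation of the curve above."""
    z = (STD_NORMAL.inv_cdf(v) - rho * STD_NORMAL.inv_cdf(u)) / sqrt(1 - rho ** 2)
    return STD_NORMAL.cdf(z)


def ecdf_scaled(sample):
    """Empirical CDF divided by n + 1 (assumption: avoids values 0 and 1)."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda t: min(max(bisect_right(ordered, t), 1), n) / (n + 1)


def empirical_quantile(sample):
    """Pseudo-inverse of the empirical CDF of the sample."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda p: ordered[min(max(ceil(n * p) - 1, 0), n - 1)]


def fit_copula_qr(x, y, theta, grid_size=199):
    """Grid-search estimate of rho minimising the check loss at level theta."""
    f_x = ecdf_scaled(x)
    q_y = empirical_quantile(y)
    u = [f_x(xi) for xi in x]

    def loss(rho):
        total = 0.0
        for ui, yi in zip(u, y):
            r = yi - q_y(gaussian_copula_quantile(ui, theta, rho))
            total += r * theta if r >= 0 else r * (theta - 1)
        return total

    # Candidate correlations from -0.99 to 0.99 (endpoints +-1 are excluded).
    grid = [-0.99 + 1.98 * k / (grid_size - 1) for k in range(grid_size)]
    return min(grid, key=loss)
```

Calling `fit_copula_qr(x, y, theta)` for several values of `theta` gives one estimate of $\rho$ per quantile level, which is how quantile-specific dependence can be compared across the distribution.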

Appendix 3: Nonlinear quantile regression

Different types of nonlinear QR approaches have been proposed. Bouyé and Salmon (2002) extended the work of Koenker and Bassett (1978) and proposed a nonlinear QR based on a Copula function. QR has also been applied to censored survival (duration) data, where it offers a more flexible alternative to the Cox proportional hazards model for some applications. Developments on censored QR were presented in Portnoy (2003) and Peng and Huang (2008). In particular, the Portnoy and Peng-Huang estimators can be regarded as regression-based generalizations of, respectively, the Kaplan-Meier estimator of the survival function and the Nelson-Aalen estimator of the cumulative hazard function for randomly censored observations. Implementations of these censored QR estimators for the $R$ language are available in the quantreg package (Koenker 2008) through the function “crq”. Koenker and Park (1996) proposed a procedure based on interior point methods to compute QR estimates for problems in which the response function is nonlinear in the parameters. As an example, Ho et al. (2009) presented a model selection approach to discover genes with linear or nonlinear age-dependent gene expression patterns from microarray data.


Briollais, L., Durrieu, G. Application of quantile regression to recent genetic and -omic studies. Hum Genet 133, 951–966 (2014). https://doi.org/10.1007/s00439-014-1440-6


Received: 16 September 2013
Accepted: 10 March 2014
Published: 26 April 2014
Issue Date: August 2014
DOI: https://doi.org/10.1007/s00439-014-1440-6
