Abstract
This paper reviews recent applications of quantile regression to genetics and the emerging -omic studies. It begins with a general background on this statistical approach, following the seminal paper of Koenker and Bassett (Econometrica 46:33–50, 1978). The applications described are diverse: genetic association studies, penetrance estimation, gene expression, CGH array experiments, RNAseq experiments, methylation data and proteomics. The paper also introduces recent extensions of quantile regression, with a particular focus on Copula quantile regression, an approach we recently proposed for sib-pair analysis. A real data example from eQTL analysis is then presented, and the \(R\) code that runs the analyses is provided. Finally, we conclude with a presentation of available statistical software and some general statements about the potential and interest of quantile regression in modern biological experiments.
References
Beyerlein A, Kries VR, Ness AR, Ong KK (2011) Genetic markers of obesity risk: stronger associations with body composition in overweight compared to normal-weight children. PLoS ONE 6(4):e19057
Bilias Y, Chen S, Ying Z (2000) Simple resampling methods for censored regression quantiles. J Econ 99:373–386
Bolstad BM, Irizarry RA, Astrand M, Speed TP (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2):185–193
Boscovich RJ (1757) De litteraria expeditione per pontificiam ditionem, et synopsis amplioris operis, ac habentur plura eius ex exemplaria etiam sensorum impressa. Bononiensi Scientiarum et Artium Instituto Atque Academia Commentarii 4:353–396
Bouyé E, Salmon M (2002) Dynamic copula quantile regressions and tail area dynamic dependence in forex markets. Eur J Fin 15(7):721–750
Callister SJ, Barry RC, Adkins JN, Johnson ET, Qian W, Webb-Robertson B-JM, Smith RD, Lipton MS (2006) Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. J Proteome Res 5(2):277–286
Cardoso J, Molenaar L, de Menezes RX, van Leerdam M, Rosenberg C, Möslein G, Sampson J, Morreau H, Boer JM, Fodde R (2006) Chromosomal instability in myh- and apc-mutant adenomatous polyps. Cancer Res 66(5):2514–2519
Dodge Y, Jurečková J (1995) Estimation of quantile density function based on regression quantiles. Stat Probab Lett 23:73–78
Durrieu G, Briollais L (2009) Sequential design for microarray experiments. J Am Stat Assoc 104(104):650–660
Edgeworth F (1888) On a new method of reducing observations relating to several quantities. Philos Mag 25:184–191
Eilers PHC, de Menezes RX (2005) Quantile smoothing of array cgh data. Bioinformatics 21(7):1146–1153
Falconer DS, McKay TFC (1996) Introduction to quantitative genetics, 4th edn. Longmans Green, Harlow
Gao X, Huang J (2010) A robust penalized method for the analysis of noisy dna copy number data. BMC Genom 11:517
Gu C, Todorov AA, Rao DC (1997) Genome screening using extremely discordant and extremely concordant sib pairs. Genet Epidemiol 14:791–796
Gutenbrunner CJ, Jurečková J, Koenker R, Portnoy S (1993) Tests of linear hypotheses based on regression rank scores. J Non Parametr Stat 2:307–333
Hansen KD, Irizarry RA (2012) Removing technical variability in rna-seq data using conditional quantile normalization. Biostatistics 13(2):204–216
Haring R, Wallaschofski H, Teumer A, Kroemer H, Taylor AE, Shackleton CHL, Nauck M, Volker U, Homuth G, Arlt W (2013) A sult2a1 genetic variant identified by gwas as associated with low serum dheas does not impact on the actual dhea/dheas ratio. J Mol Endocrinol 50:73–77
Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2:3–19
He X, Shao Q (1996) A general bahadur representation of m-estimators and its application to linear regression with non stochastic designs. Ann Stat 24:2608–2630
Hecker LA, Edwards AO, Ryu E, Tosakulwong N, Baratz KH, Brown WL, Issa PC, Scholl HP, Pollok-Kopp B, Schmid-Kubista KE, Balley KR, Oppermann M (2009) Genetic control of the alternative pathway of complement in humans and age-related macular degeneration. Human Mol Genet 19:209–215
Ho JWK, Stefani M, Remedios CGR, Charleston MA (2009) A model selection approach to discover age-dependent gene expression patterns using quantile regression models. BMC Genom 10(3):1–18
Huang BE, Lin DY (2007) Efficient association mapping of quantitative trait loci using selective genotyping. Am J Human Genet 80:567–576
Huang L, Zhu W, Saunders CP, MacLeod JN, Zhou M, Stromberg AJ, Bathke AC (2008) A novel application of quantile regression for identification of biomarkers exemplified by equine cartilage microarray data. BMC Bioinform 9:1–8
Khmaladze E (1981) Martingale approach in the theory of goodness-of-fit tests. Theory Probab Appl 26:240–257
Kocherginsky M, He X, Mu Y (2005) Practical confidence intervals for regression quantiles. J Comput Graph Stat 14:41–55
Koenker R (1994) Confidence intervals for regression quantiles. Springer, New-York
Koenker R (1996) Rank tests for linear models. Springer, New-York
Koenker R (2005) Quantile regression. Cambridge University Press, New-York
Koenker R (2008) Censored quantile regression redux. J Stat Softw 27:1–14
Koenker R, Park BJ (1996) An interior point algorithm for nonlinear quantile regression. J Econ 71:265–283
Koenker R, Xiao Z (2002) Inference on the quantile regression process. Econometrica 70:1583–1612
Koenker RW, Bassett G (1978) Regression quantiles. Econometrica 46:33–50
Kottas A, Gelfand AE (2001) Bayesian semiparametric median regression modeling. J Am Stat Assoc 96:1458–1468
Li D, Lewinger JP, Gauderman WJ, Murcray CE, Conti D (2011) Using extreme phenotype sampling to identify the rare causal variants of quantitative traits in association studies. Genet Epidemiol 35(8):790–799
Li Y, Zhu J (2007) Analysis of array cgh data for cancer studies using fused quantile regression. Bioinformatics 23(18):2470–2476
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke AG, Clark M, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM (2009) Finding the missing heritability of complex diseases. Nature 461(7265):747–753
Morley M, Molony CM, Weber T, Devlin JL, Ewens KG, Spielman RS, Cheung VG (2004) Genetic analysis of genome-wide variation in human gene expression. Nature 430:743–747
Nelsen RB (1998) An introduction to copulas. Springer, New-York
Olivieri O, Martinelli N, Sandri M, Bassi A, Guarini P, Trabetti E, Pizzolo F, Girelli D, Friso S, Pignatti PF, Corrocher R (2005) Apolipoprotein c-iii, n-3 polyunsaturated fatty acids, and insulin-resistant t455c apoc3 gene polymorphism in heart disease patients: example of gene-diet interaction. Clin Chem 51(2):360–367
Parzen MI, Wei L, Ying Z (1994) A resampling method based on pivotal estimating functions. Biometrika 81:341–350
Peng L, Huang Y (2008) Survival analysis with quantile regression models. J Am Stat Assoc 103:637–649
Pinkel D, Albertson DG (2005) Comparative genomic hybridization. Annu Rev Genom Human Genet 6:331–354
Portnoy S (2003) Censored quantile regression. J Am Stat Assoc 98:1001–1012
Rippe RC, Meulman JJ, Eilers PH (2012) Visualization of genomic changes by segmented smoothing using \(l_0\) penalty. PLoS ONE 7:e38230
Risch N, Zhang H (1995) Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268:1584–1589
Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, GuhaThakurta D, Sieberts SK, Monks S, Reitman M, Zhang C, Lum PY, Leonardson A, Thieringer R, Metzger JM, Yang L, Castle J, Zhu H, Kash SFH, Drake TA, Sachs A, Lusis AJ (2005) An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics 37:710–717
Scher AI, Terwindt GM, Verschuren WM, Kruit MC, Blom HJ, Kowa H, Frants RR, van den Maagdenberg AM, van Buchem M, Ferrari MD, Launer LJ (2006) Migraine and mthfr c677t genotype in a population-based sample. Ann Neurol 59(2):372–375
Scholkopf B, Smola A (2002) Statistical learning theory. MIT Press, New-York
Simon RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y (2003) Design and analysis of DNA microarray investigations. Springer, New York
Sklar A (1959) Fonctions de répartition à n dimensions et leurs marges. Publications de l’institut de Statistique de l’Université de Paris 8:229–231
Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG (2001) Assembly of microarrays for genome-wide measurement of dna copy number. Nature Genet 29(3):263–264
Sohn I, Kim S, Hwang C, Lee JW (2008a) New normalization methods using support vector machine quantile regression approach in microarray analysis. Comput Stat Data Anal 52:4104–4115
Sohn I, Kim S, Hwang C, Lee JW, Shim J (2008b) Support vector machine quantile regression for detecting differentially expressed genes in microarray analysis. Methods Inf Med 5:459–467
Sun S, Chen Z, Yan PS, Huang Y-W, Huang THM, Lin S (2011) Identifying hypermethylated cpg islands using a quantile regression model. BMC Bioinform 12:54
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B 67(1):91–108
Vapnik VN (1998) Statistical learning theory, New-York
Vinciotti V, Yu K (2009) M-quantile regression analysis of temporal gene expression data. Stat Appl Genet Mol Biol 8(1):1–20
Wang H, He X (2007) Detecting differential expressions in genechip microarray studies: a quantile approach. J Am Stat Assoc 102:104–112
Wang H, He X (2008) An enhanced quantile approach for assessing differential gene expressions. Biometrics 64:449–457
Wang K, Li W-D, Zhang CK, Wang Z, Glessner JT, Grant SFA, Zhao H, Hakonarson H, Price RA (2011) A genome-wide association study on obesity and obesity-related traits. PLoS ONE 6(4):e18939
Williams PT (2012) Quantile-specific penetrance of genes affecting lipoproteins, adiposity and height. PLoS One 7(1):e28764
Wu Z, Aryee MJ (2010) Subset quantile normalization using negative control features. J Comput Biol 17(10):1385–1395
Yoon D, Lee E-K, Park T (2007) Robust imputation method for missing values in microarray data. BMC Bioinform 8(Suppl. 2):S6
Yu K, Moyeed RA (2001) Bayesian quantile regression. Stat Probab Lett 54(4):437–447
Acknowledgments
This work was partly funded by a MITACS grant from Dr. Briollais. We thank Mohamedou Sow for running the Copula quantile regression example. The authors are also grateful to the editor, associate editor and the two referees for their suggestions and comments, which greatly improved this article.
Appendix
Supplemental material: R code for the real data example
The code illustrates the application of quantile regression and Copula quantile regression to the CEPH data introduced in Sect. 4. Following the approach of Haseman and Elston (1972), the goal is to regress the squared difference between two sibs’ trait values, where the trait is the gene expression for each of the three transcripts studied in Table 1, on the proportion of alleles shared identical by descent at the SNP marker considered.
The database including the variables is ‘bb’. We used the variables: ‘trait1$X217225’ and ‘trait2$X217225’, which are the gene expression trait values of sib1 and sib2, respectively. The variable ‘IBD’ is the probability of alleles shared identical by descent for the pair of relatives at the SNP marker.
The following line of code runs QR assuming independent and identically distributed errors (se=‘iid’). The value of the quantile is given by ‘tau’. The case ‘tau=0.5’ corresponds to median regression. The command ‘summary’ prints the parameter estimates, standard errors and p values.
The following line of code runs QR using a Huber sandwich variance estimate for the parameters of interest (se=‘nid’).
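The two calls described above can be sketched as follows. This is a minimal illustration rather than the original listing: it assumes the quantreg package is installed and uses simulated data, and the names `sib_data`, `sq_diff` and `ibd` are hypothetical stand-ins for the database ‘bb’, the squared sib-pair trait difference and the variable ‘IBD’.

```r
library(quantreg)

# Simulated stand-in for the sib-pair data (hypothetical; not the CEPH data).
set.seed(123)
n <- 2000
ibd <- sample(c(0, 0.5, 1), n, replace = TRUE, prob = c(0.25, 0.5, 0.25))
# Squared trait difference whose spread shrinks as IBD sharing increases.
sq_diff <- (3 - 2 * ibd) * rchisq(n, df = 1)
sib_data <- data.frame(sq_diff = sq_diff, ibd = ibd)

# Median regression (tau = 0.5) of the squared difference on IBD sharing.
fit <- rq(sq_diff ~ ibd, tau = 0.5, data = sib_data)

# Standard errors assuming i.i.d. errors.
sum_iid <- summary(fit, se = "iid")
# Huber sandwich standard errors, allowing non-identically distributed errors.
sum_nid <- summary(fit, se = "nid")

print(sum_iid)
print(sum_nid)
```

Changing ‘tau’ (for example to 0.1 or 0.9) refits the same model at another quantile, so the effect of IBD sharing can be compared across the distribution of the squared difference.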
The code below is the function for the Copula quantile regression using a Gaussian Copula (See Eq. 7 of “Appendix 2”). In this function, ‘ecdf’ computes an empirical cumulative distribution function, ‘pnorm’ is the normal cumulative distribution function, ‘qnorm’ is the normal quantile function (the inverse of the cumulative distribution function) and ‘rho’ is the correlation coefficient of the Copula function.
The following line of code runs the Copula quantile regression function given above. The function ‘nlrq’ performs non-linear quantile regression and ‘start’ gives the starting value of the parameter ‘rho’, i.e. the correlation, of the Gaussian Copula.
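A Gaussian Copula quantile regression of this kind can be sketched in base R as follows. This is an illustration under stated assumptions, not the original function: the helper names `check_loss`, `gauss_copula_quantile` and `fit_copula_qr` are hypothetical, rescaled ranks are used in place of ‘ecdf’ so that ‘qnorm’ never receives exactly 1, and the one-dimensional optimizer `optimize` is swapped in for ‘nlrq’ to keep the sketch free of package dependencies.

```r
# Check (pinball) loss of residuals r at quantile tau.
check_loss <- function(r, tau) sum(r * (tau - (r < 0)))

# tau-th Gaussian-copula quantile curve of y given x with empirical margins:
#   v = pnorm(rho * qnorm(u) + sqrt(1 - rho^2) * qnorm(tau)),  y = F_Y^{-1}(v)
gauss_copula_quantile <- function(x, y, tau, rho) {
  u <- rank(x) / (length(x) + 1)               # pseudo-observations in (0, 1)
  v <- pnorm(rho * qnorm(u) + sqrt(1 - rho^2) * qnorm(tau))
  as.numeric(quantile(y, probs = v))           # empirical inverse of F_Y
}

# Estimate rho by minimizing the check loss over (-1, 1).
fit_copula_qr <- function(x, y, tau) {
  objective <- function(rho) {
    check_loss(y - gauss_copula_quantile(x, y, tau, rho), tau)
  }
  optimize(objective, interval = c(-0.99, 0.99))$minimum
}

# Simulated bivariate Gaussian pair with correlation 0.6 (hypothetical data).
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 0.6 * x + sqrt(1 - 0.6^2) * rnorm(n)

rho_hat <- fit_copula_qr(x, y, tau = 0.5)
print(rho_hat)
```

Calling `fit_copula_qr` with other values of `tau` gives quantile-specific estimates of the dependence parameter.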
Supplemental material: recent extensions of QR
Beyond the more standard approaches for QR presented in the previous sections, recent extensions of the methodology have also been proposed including support vector machine (SVM) quantile regression, the Copula quantile regression and nonlinear quantile regression.
Appendix 1: Support vector machine quantile regression
The SVM algorithm developed by Vapnik (1998) is based on statistical learning theory. In classification problems, the objective is to determine an optimal hyperplane that separates classes, which is equivalent to maximizing the margin between classes. In support vector machine regression (SVR) (Scholkopf and Smola 2002; Sohn et al. 2008a, b), the goal is to determine a hyperplane that fits the data as closely as possible. The basic idea of SVR is to find a model function \(f({\varvec{x}})\) representing the relationship between several characteristics associated with environmental and genomic information and a target such as microarray gene expression. Sohn et al. (2008b) applied SVM quantile regression (SVMQR) to cDNA microarray expression data to identify differentially expressed genes. They used three microarray data sets: experiments on a diet-induced obese (DIO) mouse model, data on E. coli, and data on a high-density lipoprotein (HDL)-deficient mouse. For each of these data sets, the authors compared two treatment groups (high- vs. low-fat diet in the first experiment, drug A vs. drug B in the second experiment, and apoAI knockout mice vs. wild-type mice in the third experiment). The authors concluded that the SVMQR approach was superior to classical methods for identifying differentially expressed genes. In a companion paper, Sohn et al. (2008a) proposed using SVMQR for print-tip normalization of the expression data. Their goal was to adjust for systematic variations due to dye biases, where those biases depend on the overall spot intensity and/or the spatial location within the array. Using the same microarray data sets, the authors showed that their approach outperformed methods based on Loess adjustment.
Appendix 2: Copula quantile regression
The problem of characterizing the dependence between random variables at a given quantile is an important issue, especially if the distributions of the variables are heavy-tailed. A QR modeling approach based on a Copula function can handle the dependency between two variables, e.g. \(X\) and \(Y\). The idea is that the joint distribution \(F_{X,Y}\) can be decomposed into the marginal distributions of \(X\) and \(Y\), denoted respectively by \(F_X\) and \(F_Y\), and a dependence function specified by the Copula function (Bouyé and Salmon 2002). The form of the linear quantile relationship implied by the copula is deduced. According to Sklar’s theorem (Sklar 1959), there exists a unique bivariate copula \({\varvec{C}}: [0,1]^2 \rightarrow [0,1]\) satisfying
\[F_{X,Y}(x,y) = {\varvec{C}}\left( F_X(x), F_Y(y)\right) ,\]
where \(F_X\) and \(F_Y\) are continuous and represent the marginal distribution functions of \(X\) and \(Y\), respectively. Different families of Copula correspond to different types of dependence structure (see Nelsen (1998) for a general introduction to Copulas). With Gaussian Copulas, the dependence is measured by a correlation \(\rho\). The bivariate Gaussian Copula has the form
\[{\varvec{C}}(u,v;\rho ) = \Phi _2\left( \Phi ^{[-1]}(u), \Phi ^{[-1]}(v); \rho \right) ,\]
where \(\Phi _2\) corresponds to the bivariate Gaussian distribution, \(\Phi\) to the univariate distribution and \(\Phi ^{[-1]}\) to the pseudo-inverse of \(\Phi\). So, by Sklar’s theorem, for marginal distribution functions \(F_X\) and \(F_Y\), the joint distribution
\[F_{X,Y}(x,y) = \Phi _2\left( \Phi ^{[-1]}(F_X(x)), \Phi ^{[-1]}(F_Y(y)); \rho \right)\]
is a bivariate distribution function with marginals \(F_X\) and \(F_Y\), and the Copula that connects \(F_{X,Y}(x,y)\) to \(F_X(x)\) and \(F_Y(y)\) is the Gaussian Copula. For a parametric copula \({\varvec{C}}(.,.;\rho )\), the \(\theta\)-th copula quantile curve of \(y\) conditional on \(x\) is defined by
\[\theta = C^{\star }\left( F_X(x), F_Y(y); \rho \right) , \tag{5}\]
where \(C^{\star }(x,y;\rho ) = \frac{\partial }{\partial x} {\varvec{C}}(x,y; \rho )\). The relationship between \(X\) and \(Y\) can be expressed using (5) by
\[y = q(x,\theta ; \rho ), \tag{6}\]
where \(q(x,\theta ; \rho ) = F_Y^{[-1]}\left( D(F_X(x),\theta ; \rho )\right)\) with \(D\) the partial inverse in the second argument of \(C^{\star }\) and \(F_Y^{[-1]}\) the pseudo-inverse of \(F_Y\). The relationship (6) can also be expressed using uniform margins as
\[v = D(u,\theta ; \rho ), \tag{7}\]
with \(u=F_X(x)\) and \(v=F_Y(y)\). By (5) and (7), the \(\theta\)-th Gaussian copula quantile is defined by
\[\theta = \Phi \left( \frac{\Phi ^{[-1]}(v) - \rho \, \Phi ^{[-1]}(u)}{\sqrt{1-\rho ^2}}\right)\]
and the relationship between \(y\) and \(x\) is
\[y = F_Y^{[-1]}\left( \Phi \left( \rho \, \Phi ^{[-1]}(F_X(x)) + \sqrt{1-\rho ^2}\, \Phi ^{[-1]}(\theta )\right) \right) .\]
The \(\theta\)-th copula QR \(q(x_i,\theta ;\rho )\) is defined as any solution to the minimization problem
\[\min _{\rho } \left( \sum _{i \in {\mathcal F}_{\theta }} \theta \, |y_i - q(x_i,\theta ;\rho )| + \sum _{i \in {\mathcal F}_{1-\theta }} (1-\theta )\, |y_i - q(x_i,\theta ;\rho )| \right) ,\]
with \({\mathcal F}_{\theta } = \{i : y_i \ge q(x_i,\theta ;\rho )\}\) and \({\mathcal F}_{1-\theta }\) its complement.
Using Copula quantile regression, the dependence \(\rho\) can be estimated at various quantiles. Moreover, depending on the choice of the Copula function, the relationship between the two random variables can be nonlinear, which offers some flexibility in modeling the dependence between these two variables. An application of this method is given in Sect. 5.
Appendix 3: Nonlinear quantile regression
Different types of nonlinear QR approaches have been proposed. Bouyé and Salmon (2002) extended the work of Koenker and Bassett (1978) and proposed a nonlinear QR based on a Copula function. QR has also been applied to censored survival (duration) data, where it offers a more flexible alternative to the Cox proportional hazards model for some applications. Developments on censored QR were presented in Portnoy (2003) and Peng and Huang (2008). In particular, the Portnoy and Peng-Huang estimators can both be considered regression-based generalizations of, respectively, the Kaplan-Meier estimator and the Nelson-Aalen estimator of the cumulative hazard function for randomly censored observations. Software implementations of these censored QR estimators for the \(R\) language are available in the quantreg package of Koenker (2008) through the function “crq”. Koenker and Park (1996) proposed a procedure based on interior point methods to compute QR estimates for problems in which the response function is nonlinear in the parameters. As an example, Ho et al. (2009) presented a model selection approach to discover genes with linear or nonlinear age-dependent gene expression patterns from microarray data.
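As a rough illustration of QR with a response function that is nonlinear in its parameters, the sketch below fits a logistic median curve with the ‘nlrq’ function of the quantreg package. The data are simulated and the variable names `age` and `expr` are hypothetical; this is not the model of Ho et al. (2009), only an example of the kind of age-dependent curve such methods can fit.

```r
library(quantreg)

# Simulated age-dependent expression following a logistic curve
# (hypothetical data; true parameters Asym = 10, xmid = 12, scal = 2).
set.seed(42)
n <- 300
age <- seq(0, 25, length.out = n)
expr <- 10 / (1 + exp((12 - age) / 2)) + rnorm(n, sd = 0.3)
dat <- data.frame(age = age, expr = expr)

# Nonlinear median regression; the self-starting model SSlogis from the
# stats package supplies the initial parameter values.
fit <- nlrq(expr ~ SSlogis(age, Asym, xmid, scal), data = dat, tau = 0.5)

est <- unname(coef(fit))
print(est)
```

Refitting with a different ‘tau’ gives the corresponding conditional quantile curve, which can reveal whether the age effect differs between low and high expression levels.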
Cite this article
Briollais, L., Durrieu, G. Application of quantile regression to recent genetic and -omic studies. Hum Genet 133, 951–966 (2014). https://doi.org/10.1007/s00439-014-1440-6
DOI: https://doi.org/10.1007/s00439-014-1440-6