Application of quantile regression to recent genetic and -omic studies

  • Review Paper
  • Human Genetics

Abstract

This paper reviews recent applications of quantile regression in genetic and emerging -omic studies. It begins with a general background on this statistical approach, following the seminal paper of Koenker and Bassett (Econometrica 46:33–50, 1978). The applications described are diverse and include genetic association studies, penetrance estimation, gene expression, CGH array experiments, RNAseq experiments, methylation data and proteomics. The paper also introduces recent extensions of quantile regression, with a particular focus on Copula quantile regression, an approach we recently proposed for sib-pair analysis. A real data example from eQTL analysis is then presented, and the \(R\) code that runs the analyses is provided. Finally, we conclude with an overview of statistical software and some general remarks on the potential and value of quantile regression in modern biological experiments.

References

  • Beyerlein A, von Kries R, Ness AR, Ong KK (2011) Genetic markers of obesity risk: stronger associations with body composition in overweight compared to normal-weight children. PLoS ONE 6(4):e19057

  • Bilias Y, Chen S, Ying Z (2000) Simple resampling methods for censored regression quantiles. J Econ 99:373–386

  • Bolstad BM, Irizarry RA, Astrand M, Speed TP (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2):185–193

  • Boscovich RJ (1757) De litteraria expeditione per pontificiam ditionem, et synopsis amplioris operis, ac habentur plura eius ex exemplaria etiam sensorum impressa. Bononiensi Scientiarum et Artium Instituto Atque Academia Commentarii 4:353–396

  • Bouyé E, Salmon M (2002) Dynamic copula quantile regressions and tail area dynamic dependence in forex markets. Eur J Fin 15(7):721–750

  • Callister SJ, Barry RC, Adkins JN, Johnson ET, Qian W, Webb-Robertson B-JM, Smith RD, Lipton MS (2006) Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. J Proteome Res 5(2):277–286

  • Cardoso J, Molenaar L, de Menezes RX, van Leerdam M, Rosenberg C, Möslein G, Sampson J, Morreau H, Boer JM, Fodde R (2006) Chromosomal instability in myh- and apc-mutant adenomatous polyps. Cancer Res 66(5):2514–2519

  • Dodge Y, Jurečková J (1995) Estimation of quantile density function based on regression quantiles. Stat Probab Lett 23:73–78

  • Durrieu G, Briollais L (2009) Sequential design for microarray experiments. J Am Stat Assoc 104(104):650–660

  • Edgeworth F (1888) On a new method of reducing observations relating to several quantities. Philos Mag 25:184–191

  • Eilers PHC, de Menezes RX (2005) Quantile smoothing of array cgh data. Bioinformatics 21(7):1146–1153

  • Falconer DS, Mackay TFC (1996) Introduction to quantitative genetics, 4th edn. Longmans Green, Harlow

  • Gao X, Huang J (2010) A robust penalized method for the analysis of noisy dna copy number data. BMC Genom 11:517

  • Gu C, Todorov AA, Rao DC (1997) Genome screening using extremely discordant and extremely concordant sib pairs. Genet Epidemiol 14:791–796

  • Gutenbrunner CJ, Jurečková J, Koenker R, Portnoy S (1993) Tests of linear hypotheses based on regression rank scores. J Non Parametr Stat 2:307–333

  • Hansen KD, Irizarry RA (2012) Removing technical variability in rna-seq data using conditional quantile normalization. Biostatistics 13(2):204–216

  • Haring R, Wallaschofski H, Teumer A, Kroemer H, Taylor AE, Shackleton CHL, Nauck M, Volker U, Homuth G, Arlt W (2013) A sult2a1 genetic variant identified by gwas as associated with low serum dheas does not impact on the actual dhea/dheas ratio. J Mol Endocrinol 50:73–77

  • Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2:3–19

  • He X, Shao Q (1996) A general bahadur representation of m-estimators and its application to linear regression with non stochastic designs. Ann Stat 24:2608–2630

  • Hecker LA, Edwards AO, Ryu E, Tosakulwong N, Baratz KH, Brown WL, Issa PC, Scholl HP, Pollok-Kopp B, Schmid-Kubista KE, Bailey KR, Oppermann M (2009) Genetic control of the alternative pathway of complement in humans and age-related macular degeneration. Human Mol Genet 19:209–215

  • Ho JWK, Stefani M, Remedios CGR, Charleston MA (2009) A model selection approach to discover age-dependent gene expression patterns using quantile regression models. BMC Genom 10(3):1–18

  • Huang BE, Lin DY (2007) Efficient association mapping of quantitative trait loci using selective genotyping. Am J Human Genet 80:567–576

  • Huang L, Zhu W, Saunders CP, MacLeod JN, Zhou M, Stromberg AJ, Bathke AC (2008) A novel application of quantile regression for identification of biomarkers exemplified by equine cartilage microarray data. BMC Bioinform 9:1–8

  • Khmaladze E (1981) Martingale approach in the theory of goodness-of-fit tests. Theory Probab Appl 26:240–257

  • Kocherginsky M, He X, Mu Y (2005) Practical confidence intervals for regression quantiles. J Comput Graph Stat 14:41–55

  • Koenker R (1994) Confidence intervals for regression quantiles. Springer, New-York

  • Koenker R (1996) Rank tests for linear models. Springer, New-York

  • Koenker R (2005) Quantile regression. Cambridge University Press, New-York

  • Koenker R (2008) Censored quantile regression redux. J Stat Softw 27:1–14

  • Koenker R, Park BJ (1996) An interior point algorithm for nonlinear quantile regression. J Econ 71:265–283

  • Koenker R, Xiao Z (2002) Inference on the quantile regression process. Econometrica 70:1583–1612

  • Koenker RW, Bassett G (1978) Regression quantiles. Econometrica 46:33–50

  • Kottas A, Gelfand AE (2001) Bayesian semiparametric median regression modeling. J Am Stat Assoc 96:1458–1468

  • Li D, Lewinger JP, Gauderman WJ, Murcray CE, Conti D (2011) Using extreme phenotype sampling to identify the rare causal variants of quantitative traits in association studies. Genet Epidemiol 35(8):790–799

  • Li Y, Zhu J (2007) Analysis of array cgh data for cancer studies using fused quantile regression. Bioinformatics 23(18):2470–2476

  • Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke AG, Clark M, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM (2009) Finding the missing heritability of complex diseases. Nature 461(7265):747–753

  • Morley M, Molony CM, Weber T, Devlin JL, Ewens KG, Spielman RS, Cheung VG (2004) Genetic analysis of genome-wide variation in human gene expression. Nature 430:743–747

  • Nelsen RB (1998) An introduction to copulas. Springer, New-York

  • Olivieri O, Martinelli N, Sandri M, Bassi A, Guarini P, Trabetti E, Pizzolo F, Girelli D, Friso S, Pignatti PF, Corrocher R (2005) Apolipoprotein C-III, n-3 polyunsaturated fatty acids, and insulin-resistant t455c apoc3 gene polymorphism in heart disease patients: example of gene-diet interaction. Clin Chem 51(2):360–367

  • Parzen MI, Wei L, Ying Z (1994) A resampling method based on pivotal estimating functions. Biometrika 81:341–350

  • Peng L, Huang Y (2008) Survival analysis with quantile regression models. J Am Stat Assoc 103:637–649

  • Pinkel D, Albertson DG (2005) Comparative genomic hybridization. Annu Rev Genom Human Genet 6:331–354

  • Portnoy S (2003) Censored quantile regression. J Am Stat Assoc 98:1001–1012

  • Rippe RC, Meulman JJ, Eilers PH (2012) Visualization of genomic changes by segmented smoothing using \(l_0\) penalty. PLoS ONE 7:e38230

  • Risch N, Zhang H (1995) Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268:1584–1589

  • Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, GuhaThakurta D, Sieberts SK, Monks S, Reitman M, Zhang C, Lum PY, Leonardson A, Thieringer R, Metzger JM, Yang L, Castle J, Zhu H, Kash SFH, Drake TA, Sachs A, Lusis AJ (2005) An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics 37:710–717

  • Scher AI, Terwindt GM, Verschuren WM, Kruit MC, Blom HJ, Kowa H, Frants RR, van den Maagdenberg AM, van Buchem M, Ferrari MD, Launer LJ (2006) Migraine and mthfr c677t genotype in a population-based sample. Ann Neurol 59(2):372–375

  • Scholkopf B, Smola A (2002) Statistical learning theory. MIT Press, New-York

  • Simon RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y (2003) Design and analysis of DNA microarray investigations. Springer, New York

  • Sklar A (1959) Fonctions de répartition à n dimensions et leurs marges. Publications de l’institut de Statistique de l’Université de Paris 8:229–231

  • Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG (2001) Assembly of microarrays for genome-wide measurement of dna copy number. Nature Genet 29(3):263–264

  • Sohn I, Kim S, Hwang C, Lee JW (2008a) New normalization methods using support vector machine quantile regression approach in microarray analysis. Comput Stat Data Anal 52:4104–4115

  • Sohn I, Kim S, Hwang C, Lee JW, Shim J (2008b) Support vector machine quantile regression for detecting differentially expressed genes in microarray analysis. Methods Inf Med 5:459–467

  • Sun S, Chen Z, Yan PS, Huang Y-W, Huang THM, Lin S (2011) Identifying hypermethylated cpg islands using a quantile regression model. BMC Bioinform 12:54

  • Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B 67(1):91–108

  • Vapnik VN (1998) Statistical learning theory, New-York

  • Vinciotti V, Yu K (2009) M-quantile regression analysis of temporal gene expression data. Stat Appl Genet Mol Biol 8(1):1–20

  • Wang H, He X (2007) Detecting differential expressions in genechip microarray studies: a quantile approach. J Am Stat Assoc 102:104–112

  • Wang H, He X (2008) An enhanced quantile approach for assessing differential gene expressions. Biometrics 64:449–457

  • Wang K, Li W-D, Zhang CK, Wang Z, Glessner JT, Grant SFA, Zhao H, Hakonarson H, Price RA (2011) A genome-wide association study on obesity and obesity-related traits. PLoS ONE 6(4):e18939

  • Williams PT (2012) Quantile-specific penetrance of genes affecting lipoproteins, adiposity and height. PLoS One 7(1):e28764

  • Wu Z, Aryee MJ (2010) Subset quantile normalization using negative control features. J Comput Biol 17(10):1385–1395

  • Yoon D, Lee E-K, Park T (2007) Robust imputation method for missing values in microarray data. BMC Bioinform 8(Suppl. 2):S6

  • Yu K, Moyeed RA (2001) Bayesian quantile regression. Stat Probab Lett 54(4):437–447

Acknowledgments

This work was partly funded by a MITACS grant from Dr. Briollais. We thank Mohamedou Sow for running the Copula quantile regression example. The authors are also grateful to the editor, associate editor and the two referees for their suggestions and comments, which greatly improved this article.

Author information

Corresponding author

Correspondence to Laurent Briollais.

Appendix

Supplemental material: R code for the real data example

The code illustrates the application of quantile regression and Copula quantile regression to the CEPH data introduced in Sect. 4. Following the approach of Haseman and Elston (1972), the goal is to regress the squared difference between two sibs’ trait values on the proportion of alleles shared identical by descent at the SNP marker considered, where the trait is the gene expression of each of the three transcripts studied in Table 1.

The data set containing the variables is ‘bb’. We used the variables ‘trait1$X217225’ and ‘trait2$X217225’, which are the gene expression trait values of sib1 and sib2, respectively. The variable ‘IBD’ is the estimated proportion of alleles shared identical by descent by the pair of relatives at the SNP marker.

The following line of code runs QR treating the observations as independent and identically distributed (se=‘iid’). The quantile level is given by ‘tau’; the case ‘tau=0.5’ corresponds to median regression. The command ‘summary’ prints the parameter estimates, standard errors and p values.

figure e
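As a rough illustration of what such a fit computes, the sketch below (written in Python rather than R, and not the authors’ listing) builds the Haseman-Elston response from hypothetical sib-pair values and finds the median regression line by exhaustive search. The toy data, the function names `check_loss` and `fit_quantile_line`, and the brute-force strategy are all assumptions made for this illustration; a real analysis would rely on the linear-programming solver behind ‘rq’ in quantreg, which also supplies the standard errors.

```python
from itertools import combinations


def check_loss(residual, tau):
    """Check (pinball) loss of Koenker and Bassett for a single residual."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def fit_quantile_line(x, y, tau):
    """Fit y ~ intercept + slope * x at quantile level tau by exhaustive search.

    With two parameters, some minimizer of the summed check loss passes
    through two observations, so every pair of points with distinct x
    values is tried. Cost is O(n^3): an illustration for small samples only.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # same x: no finite slope through this pair
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(check_loss(yk - intercept - slope * xk, tau)
                   for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


# Hypothetical toy data (not the CEPH data): expression values for nine sib pairs
trait1 = [10, 12, 15, 8, 11, 9, 7, 6, 10]
trait2 = [5, 5, 6, 5, 6, 3, 7, 5, 6]
ibd = [0.0, 0.0, 0.0, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0]

# Haseman-Elston response: squared difference between the two sibs' trait values
sq_diff = [(a - b) ** 2 for a, b in zip(trait1, trait2)]

intercept, slope = fit_quantile_line(ibd, sq_diff, tau=0.5)
print(f"median regression: intercept={intercept}, slope={slope}")
```

A negative slope means that sib pairs sharing more alleles identical by descent tend to have more similar expression values, which is the signal the Haseman-Elston approach looks for.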

The following line of code runs QR using a Huber sandwich variance estimate for the parameters of interest (se=‘nid’).

figure f

The code below is the function for the Copula quantile regression using a Gaussian Copula (see Eq. 7 of “Appendix 2”). In this function, ‘ecdf’ computes an empirical cumulative distribution function, ‘pnorm’ is the normal cumulative distribution function, ‘qnorm’ is the normal quantile function (the inverse of ‘pnorm’) and ‘rho’ is the correlation coefficient of the Copula function.

figure g
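The ingredients named above can be sketched as follows, in Python rather than R and without any claim to match the authors’ function. `NormalDist().cdf` and `NormalDist().inv_cdf` play the roles of ‘pnorm’ and ‘qnorm’. The helper names `make_ecdf` and `gaussian_copula_quantile`, the toy values, and the choice to rescale the empirical distribution function by n + 1 (so that it never returns exactly 1, where the normal quantile is infinite) are assumptions of this sketch.

```python
from bisect import bisect_right
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # .cdf plays the role of R's pnorm, .inv_cdf of qnorm


def make_ecdf(sample):
    """Return an empirical CDF rescaled by n + 1 so that its values at the
    sample points stay strictly inside (0, 1), where inv_cdf is finite."""
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(value):
        return bisect_right(ordered, value) / (n + 1)

    return ecdf


def gaussian_copula_quantile(u, theta, rho):
    """theta-th conditional quantile of V given U = u, on the uniform scale."""
    z = rho * STD_NORMAL.inv_cdf(u) + sqrt(1.0 - rho ** 2) * STD_NORMAL.inv_cdf(theta)
    return STD_NORMAL.cdf(z)


# Hypothetical trait values for the first sib in six pairs
x = [2.1, 3.4, 1.7, 4.0, 2.9, 3.1]
F_x = make_ecdf(x)
for xi in sorted(x):
    u = F_x(xi)
    v = gaussian_copula_quantile(u, theta=0.5, rho=0.6)
    print(f"x={xi}: u={u:.3f}, median of v given u={v:.3f}")
```

With `rho=0` the function returns `theta` whatever the value of `u`, which is a quick sanity check that independence produces a flat quantile curve.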

The following line of code runs the Copula quantile regression function given above. The function ‘nlrq’ performs nonlinear quantile regression and ‘start’ gives the starting value of the parameter ‘rho’, i.e. the correlation of the Gaussian Copula.

figure h

Supplemental material: recent extensions of QR

Beyond the more standard approaches for QR presented in the previous sections, recent extensions of the methodology have also been proposed including support vector machine (SVM) quantile regression, the Copula quantile regression and nonlinear quantile regression.

Appendix 1: Support vector machine quantile regression

The SVM algorithm developed by Vapnik (1998) is based on statistical learning theory. In classification problems, the objective is to determine an optimal hyperplane that separates the classes, which is equivalent to maximizing the margin between classes. In support vector machine regression (SVR) (Scholkopf and Smola 2002; Sohn et al. 2008a, b), the goal is instead to determine a hyperplane that fits the data well. The basic idea of SVR is to find a model function \(f({\varvec{x}})\) representing the relationship between several characteristics, associated with environmental and genomic information, and a target such as microarray gene expression.

Sohn et al. (2008b) applied SVM quantile regression (SVMQR) to cDNA microarray expression data to identify differentially expressed genes. They used three microarray data sets: an experiment on a diet-induced obese (DIO) mouse model, data on E. coli, and an experiment on high-density lipoprotein (HDL)-deficient mice. For each of these data sets, the authors compared two treatment groups (high- vs. low-fat diet in the first experiment, drug A vs. drug B in the second, and apoAI knockout vs. wild-type mice in the third). The authors concluded that the SVMQR approach was superior to classical methods for identifying differentially expressed genes. In a companion paper, Sohn et al. (2008a) proposed using SVMQR for print-tip normalization of the expression data. Their goal was to adjust for systematic variation due to dye biases, where those biases depend on the overall spot intensity and/or the spatial location within the array. Using the same microarray data sets, the authors showed that their approach outperformed methods based on Loess adjustment.

Appendix 2: Copula quantile regression

Characterizing the dependence between random variables at a given quantile is an important issue, especially if the distributions of the variables are heavy-tailed. A QR modeling approach based on a Copula function can handle the dependence between two variables, e.g. \(X\) and \(Y\). The idea is that the joint distribution \(F_{X,Y}\) can be decomposed into the marginal distributions of \(X\) and \(Y\), denoted, respectively, by \(F_X\) and \(F_Y\), and a dependence function specified by the Copula function (Bouyé and Salmon 2002). The form of the quantile relationship implied by the copula, which need not be linear, can then be deduced. According to Sklar’s theorem (Sklar 1959), there exists a unique bivariate copula \({\varvec{C}}: [0,1]^2 \rightarrow [0,1]\) satisfying

$$\begin{aligned} F_{X,Y}(x, y)={\varvec{C}}\left( F_X(x), F_Y(y)\right) , \end{aligned}$$

where \(F_X\) and \(F_Y\) are continuous and represent the marginal distribution functions of \(X\) and \(Y\), respectively. Different families of Copulas correspond to different types of dependence structure (see Nelsen 1998 for a general introduction to Copulas). With Gaussian Copulas, the dependence is measured by a correlation \(\rho\). The bivariate Gaussian Copula has the form

$$\begin{aligned} C(x,y; \rho )= \Phi _2 \left( \Phi ^{[-1]}(x), \Phi ^{[-1]}(y); \rho \right) \end{aligned}$$

where \(\Phi _2\) corresponds to the bivariate standard Gaussian distribution function with correlation \(\rho\), \(\Phi\) to the univariate standard Gaussian distribution function and \(\Phi ^{[-1]}\) to the pseudo-inverse of \(\Phi\). So, by Sklar’s theorem, for marginal distribution functions \(F_X\) and \(F_Y\), the joint distribution

$$\begin{aligned} F _{X,Y}(x, y) = \Phi _2 \left( \Phi ^{[-1]}(F_X(x)), \Phi ^{[-1]}(F_Y(y)); \rho \right) \end{aligned}$$

is a bivariate distribution function with marginals \(F_X\) and \(F_Y\) and the Copula that connects \(F_{X,Y}(x,y)\) to \(F_X(x)\) and \(F_Y(y)\) is the Gaussian Copula. For a parametric copula \({\varvec{C}}(.,.;\rho )\), the \(\theta\)-th copula quantile curve of \(y\) conditional on \(x\) is defined by

$$\begin{aligned} \theta = C^{\star }(F_X(x),F_Y(y) ; \rho ) \end{aligned}$$
(5)

where \(C^{\star }(x,y;\rho ) = \frac{\partial }{\partial x} {\varvec{C}}(x,y; \rho )\). The relationship between \(X\) and \(Y\) can be expressed using (5) by

$$\begin{aligned} y = q(x,\theta ; \rho ) \end{aligned}$$
(6)

where \(q(x,\theta ; \rho ) = F_Y^{[-1]}\left( D(F_X(x),\theta ; \rho )\right)\) with \(D\) the partial inverse in the second argument of \(C^{\star }\) and \(F_Y^{[-1]}\) the pseudo-inverse of \(F_Y\). The relationship (6) can also be expressed using uniform margins as

$$\begin{aligned} v = r(u,\theta ; \rho ) \end{aligned}$$
(7)

with \(u=F_X(x)\) and \(v=F_Y(y)\). By (5) and (7), the \(\theta\)-th Gaussian copula quantile is defined by

$$\begin{aligned} \theta = \Phi \left( \frac{\Phi ^{[-1]}(v) -\rho \,\Phi ^{[-1]} (u)}{\sqrt{1-\rho ^2}} \right) , \end{aligned}$$

and the relationship between \(y\) and \(x\) is

$$\begin{aligned} y=F_Y^{[-1]} \left( \Phi \left( \rho \,\Phi ^{[-1]}(F_X(x)) + \sqrt{1-\rho ^2}\,\Phi ^{[-1]}(\theta ) \right) \right) . \end{aligned}$$
(8)
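The step from the Gaussian copula quantile equation to Eq. (8) is a direct inversion, written out here for convenience with \(u=F_X(x)\) and \(v=F_Y(y)\). It assumes \(|\rho |<1\) and \(0<\theta <1\), so that every term is finite. Applying \(\Phi ^{[-1]}\) to both sides and solving for \(v\) gives

$$\begin{aligned} \Phi ^{[-1]}(\theta )&= \frac{\Phi ^{[-1]}(v) -\rho \,\Phi ^{[-1]}(u)}{\sqrt{1-\rho ^2}} \\ \Phi ^{[-1]}(v)&= \rho \,\Phi ^{[-1]}(u) + \sqrt{1-\rho ^2}\,\Phi ^{[-1]}(\theta ) \\ v&= \Phi \left( \rho \,\Phi ^{[-1]}(u) + \sqrt{1-\rho ^2}\,\Phi ^{[-1]}(\theta ) \right) , \end{aligned}$$

and applying \(F_Y^{[-1]}\) to both sides, with \(u\) replaced by \(F_X(x)\), recovers Eq. (8). In particular, \(\rho =0\) gives \(v=\theta\), so the quantile curve is flat in \(x\), while for \(\theta =0.5\) the last term vanishes and \(v=\Phi \left( \rho \,\Phi ^{[-1]}(u)\right)\).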

The \(\theta\)-th copula QR \(q(x_i,\theta ,\rho )\) is defined as any solution to the minimization problem

$$\begin{aligned} \min _{\rho } \left( \sum _{i\in {\mathcal F}_{\theta }} \theta |y_i - q(x_i,\theta ; \rho )| + \sum _{i \in {\mathcal F}_{1-\theta }} (1-\theta ) |y_i - q(x_i,\theta ; \rho )|\right) \end{aligned}$$

with \({\mathcal F}_{\theta } = \{i : y_i \ge q(x_i,\theta ;\rho )\}\) and \({\mathcal F}_{1-\theta }\) its complement.
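The minimization above can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not the authors’ procedure. The margins \(F_X\) and \(F_Y\) are replaced by empirical versions with a half-count offset, the search over \(\rho\) is a plain grid search rather than the interior point method behind ‘nlrq’, and the data and all function names (`ecdf_at`, `empirical_quantile`, `copula_quantile`, `copula_qr_objective`, `fit_rho`) are hypothetical.

```python
from bisect import bisect_right
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def ecdf_at(ordered, value):
    """Empirical CDF with a half-count offset; at sample points it lies in (0, 1)."""
    return (bisect_right(ordered, value) - 0.5) / len(ordered)


def empirical_quantile(ordered, p):
    """Pseudo-inverse matching ecdf_at: the order statistic whose cell contains p."""
    index = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[index]


def copula_quantile(x_value, theta, rho, x_sorted, y_sorted):
    """q(x, theta; rho) of Eq. (8), with empirical margins in place of F_X and F_Y.

    x_value must be one of the observed x values so that ecdf_at stays in (0, 1).
    """
    u = ecdf_at(x_sorted, x_value)
    z = rho * STD_NORMAL.inv_cdf(u) + sqrt(1.0 - rho ** 2) * STD_NORMAL.inv_cdf(theta)
    return empirical_quantile(y_sorted, STD_NORMAL.cdf(z))


def copula_qr_objective(rho, theta, x, y):
    """Summed check loss: theta * |r| for r >= 0 and (1 - theta) * |r| otherwise."""
    x_sorted, y_sorted = sorted(x), sorted(y)
    total = 0.0
    for xi, yi in zip(x, y):
        r = yi - copula_quantile(xi, theta, rho, x_sorted, y_sorted)
        total += theta * r if r >= 0 else (theta - 1.0) * r
    return total


def fit_rho(theta, x, y, grid_size=199):
    """Grid search for rho on [-0.99, 0.99]; ties go to the smallest rho."""
    grid = [-0.99 + 1.98 * k / (grid_size - 1) for k in range(grid_size)]
    return min(grid, key=lambda rho: copula_qr_objective(rho, theta, x, y))


# Hypothetical data: y is a monotone but nonlinear function of x
x = list(range(1, 21))
y = [value ** 3 for value in x]
for theta in (0.25, 0.5, 0.75):
    print(theta, round(fit_rho(theta, x, y), 2))
```

Because the empirical quantile is a step function, the objective is piecewise constant in \(\rho\) and the grid search returns only one of possibly many minimizers. A smooth parametric or interpolated margin would make the estimate less coarse.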

Using Copula quantile regression, the dependence \(\rho\) can be estimated at various quantiles. Moreover, depending on the choice of the Copula function, the relationship between the two random variables can be nonlinear, which offers some flexibility in modeling the dependence between these two variables. An application of this method is given in Sect. 5.

Appendix 3: Nonlinear quantile regression

Different types of nonlinear QR approaches have been proposed. Bouyé and Salmon (2002) extended the work of Koenker and Bassett (1978) and proposed a nonlinear QR based on a Copula function. QR has also been applied to censored survival (duration) data, where it offers a more flexible alternative to the Cox proportional hazards model for some applications. Developments on censored QR were presented in Portnoy (2003) and Peng and Huang (2008). In particular, the Portnoy and Peng-Huang estimators can be considered regression-based generalizations of, respectively, the Kaplan-Meier estimator of the survival function and the Nelson-Aalen estimator of the cumulative hazard function for randomly censored observations. Implementations of these censored QR estimators for the \(R\) language are available in the quantreg package (Koenker 2008) through the function “crq”. Koenker and Park (1996) proposed a procedure based on interior point methods to compute QR estimates for problems in which the response function is nonlinear in the parameters. As an example, Ho et al. (2009) presented a model selection approach to discover genes with linear or nonlinear age-dependent gene expression patterns from microarray data.

Cite this article

Briollais, L., Durrieu, G. Application of quantile regression to recent genetic and -omic studies. Hum Genet 133, 951–966 (2014). https://doi.org/10.1007/s00439-014-1440-6
