Abstract
We review variable selection and variable screening in high-dimensional linear models. A major focus is an empirical comparison of various estimation methods with respect to true and false positive selection rates, based on 128 different sparse scenarios built from semi-real data (real data covariables but synthetic regression coefficients and noise). Furthermore, we present theoretical bounds on the bias of subsequent least squares estimation that uses the variables selected in the first stage; these bounds have direct implications for the construction of p-values for regression coefficients.
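The two-stage procedure described in the abstract — screen variables in a high-dimensional linear model, then refit by least squares on the selected set and count true/false positives against the truly active coefficients — can be sketched as follows. This is a minimal illustration, not the paper's actual experimental setup: the data here are fully synthetic (the paper uses real covariables with synthetic coefficients and noise), and the screening rule used is marginal correlation ranking in the spirit of sure independence screening; the sample sizes, sparsity level, and screening size `d` are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a semi-real scenario: n observations, p covariables,
# a sparse coefficient vector with s truly active variables, Gaussian noise.
n, p, s = 100, 1000, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
active = rng.choice(p, size=s, replace=False)
beta[active] = 1.0
y = X @ beta + rng.standard_normal(n)

# Stage 1 (screening): rank variables by absolute marginal correlation with
# the response and keep the top d, as in sure independence screening.
d = 20
corr = np.abs(X.T @ (y - y.mean())) / n
selected = np.argsort(corr)[-d:]

# Stage 2: ordinary least squares refit on the screened variables only.
beta_hat = np.linalg.lstsq(X[:, selected], y, rcond=None)[0]

# Evaluate screening quality against the truly active set.
tp = len(set(selected) & set(active))  # true positives
fp = d - tp                            # false positives
print(f"true positives: {tp} of {s}, false positives: {fp}")
```

Averaging such true/false positive counts over many simulated scenarios gives selection-rate comparisons of the kind the paper reports; the bias bounds concern how the data-driven choice of `selected` distorts the second-stage least squares estimate `beta_hat`.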
Cite this article
Bühlmann, P., Mandozzi, J. High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput Stat 29, 407–430 (2014). https://doi.org/10.1007/s00180-013-0436-3