
2011 | Book

Statistics for High-Dimensional Data

Methods, Theory and Applications

Authors: Peter Bühlmann, Sara van de Geer

Publisher: Springer Berlin Heidelberg

Book Series: Springer Series in Statistics


About this book

Modern statistics deals with large and complex data sets, and consequently with models containing a large number of parameters. This book presents a detailed account of recently developed approaches, including the Lasso and versions of it for various models, boosting methods, undirected graphical modeling, and procedures controlling false positive selections.
A special characteristic of the book is that it contains comprehensive mathematical theory on high-dimensional statistics combined with methodology, algorithms and illustrations with real data examples. This in-depth approach highlights the methods’ great potential and practical applicability in a variety of settings. As such, it is a valuable resource for researchers, graduate students and experts in statistics, applied mathematics and computer science.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
High-dimensional statistics refers to statistical inference when the number of unknown parameters is of much larger order than sample size. We present some introductory motivation and a rough picture of high-dimensional statistics.
Peter Bühlmann, Sara van de Geer
Chapter 2. Lasso for linear models
Abstract
The Lasso, proposed by Tibshirani (1996), is an acronym for Least Absolute Shrinkage and Selection Operator. Among the main reasons why it has become very popular for high-dimensional estimation problems are its statistical accuracy for prediction and variable selection, coupled with its computational feasibility. Furthermore, since the Lasso is a penalized likelihood approach, the method is rather general and can be used in a broad variety of models. In the simple case of a linear model with orthonormal design, the Lasso equals the soft thresholding estimator introduced and analyzed by Donoho and Johnstone (1994). The Lasso for linear models is the core example for developing the methodology of ℓ1-penalization in high-dimensional settings. We discuss in this chapter some fundamental methodological and computational aspects of the Lasso. We also present the adaptive Lasso, an important two-stage procedure which addresses some bias problems of the Lasso. The methodological steps are supported by describing various theoretical results which will be fully developed in Chapters 6 and 7.
Peter Bühlmann, Sara van de Geer
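The soft thresholding connection mentioned in the abstract is easy to make concrete. Below is a minimal numpy sketch (illustrative, not code from the book): with an orthonormal design, the Lasso solution is obtained by soft thresholding the correlation estimate \( X^\top y / n \). The objective scaling \( \|y - X\beta\|_2^2 / (2n) + \lambda \|\beta\|_1 \) is one common convention; other scalings shift the meaning of \( \lambda \).

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft thresholding: sign(z) * max(|z| - lam, 0), applied componentwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_orthonormal(X, y, lam):
    """Lasso solution under an orthonormal design (X^T X / n = identity):
    the objective ||y - X beta||_2^2 / (2n) + lam * ||beta||_1 decouples
    per coordinate and is minimized by soft thresholding X^T y / n."""
    n = X.shape[0]
    return soft_threshold(X.T @ y / n, lam)
```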
Chapter 3. Generalized linear models and the Lasso
Abstract
Generalized linear models build a unified framework containing many extensions of a linear model. Important examples include logistic regression for binary responses, Poisson regression for count data or log-linear models for contingency tables. Penalizing the negative log-likelihood with the ℓ1-norm, still called the Lasso, is in many examples conceptually similar to the case with squared error loss in linear regression due to the convexity of the negative log-likelihood. This implies that the statistical properties as well as the computational complexity of algorithms are attractive. A noticeable difference, however, occurs with log-linear models for large contingency tables where the computation is in general much more demanding. We present in this chapter the models and estimators, while computational algorithms and theory are described in more detail in Chapters 4 and 6, respectively.
Peter Bühlmann, Sara van de Geer
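For a sense of what ℓ1-penalized likelihood looks like in practice, here is a small sketch using scikit-learn's logistic regression (my illustration, not the book's software; the simulated data and the choice C=0.1 are arbitrary, and scikit-learn parameterizes the penalty strength by C, roughly an inverse of \( \lambda \)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# l1-penalized logistic regression: the negative log-likelihood is penalized
# by the l1-norm of the coefficients, which sets most of them exactly to zero.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))        # n = 100 samples, p = 500 variables
beta = np.zeros(500); beta[:3] = 2.0       # sparse truth: 3 active variables
y = (X @ beta + rng.standard_normal(100) > 0).astype(int)

fit = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000).fit(X, y)
print("selected variables:", np.flatnonzero(fit.coef_))
```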
Chapter 4. The group Lasso
Abstract
In many applications, the high-dimensional parameter vector carries a structure. Among the simplest is a group structure where the parameter is partitioned into disjoint pieces. This occurs when dealing with factor variables or in connection with basis expansions in high-dimensional additive models as discussed in Chapters 5 and 8. The goal is high-dimensional estimation in linear or generalized linear models that are sparse with respect to whole groups. The group Lasso, proposed by Yuan and Lin (2006), achieves such group sparsity. We discuss in this chapter methodological aspects, and we develop the details of efficient computational algorithms, which are more subtle than for non-group problems.
Peter Bühlmann, Sara van de Geer
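The groupwise analogue of soft thresholding is the key computational building block. A minimal sketch (my illustration, under the simplifying assumption of a groupwise orthonormal design, where this is the exact block update; in general it is the inner step of block coordinate descent):

```python
import numpy as np

def group_soft_threshold(z, lam, groups):
    """Groupwise soft thresholding: shrink each block z_g toward zero by the
    factor (1 - lam * sqrt(|g|) / ||z_g||_2)_+ , so entire groups are set to
    zero at once. `groups` is a list of index arrays partitioning the
    coordinates; sqrt(|g|) is the usual group-size weight."""
    z = np.asarray(z, dtype=float)
    beta = np.zeros_like(z)
    for g in groups:
        norm = np.linalg.norm(z[g])
        if norm > 0:
            beta[g] = max(0.0, 1.0 - lam * np.sqrt(len(g)) / norm) * z[g]
    return beta
```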
Chapter 5. Additive models and many smooth univariate functions
Abstract
Additive models build a nonparametric extension of linear models and as such, they exhibit a substantial degree of flexibility. While the most important effects may still be detected by a linear model, substantial improvements are potentially possible by using the more flexible additive model class. At first sight, it seems very ambitious to fit additive models with high-dimensional covariates but sparsity implies feasible computations and good statistical properties. Besides encouraging sparsity, it is important to control smoothness as well. This can be achieved by a sparsity-smoothness penalty function. The combination of sparsity and smoothness is crucial for mathematical theory as well as for better performance on data. We discuss in this chapter methodology and computational aspects which are related to the group Lasso presented in Chapter 4.
Peter Bühlmann, Sara van de Geer
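One concrete route from additive models to the group Lasso of Chapter 4 is a basis expansion per covariate, with each covariate's basis coefficients forming one group. The sketch below (my illustration; it uses crude polynomial bases where one would use B-splines in practice, and it covers only the sparsity part of the sparsity-smoothness penalty, not the smoothness part):

```python
import numpy as np

def additive_design(X, degree=4):
    """Expand each covariate in a small polynomial basis and record the group
    index sets, so that a group Lasso on the expanded design yields a sparse
    additive model: whole covariates (groups) are kept or dropped."""
    n, p = X.shape
    blocks, groups, start = [], [], 0
    for j in range(p):
        B = np.column_stack([X[:, j] ** d for d in range(1, degree + 1)])
        B = (B - B.mean(axis=0)) / B.std(axis=0)   # center/scale basis columns
        blocks.append(B)
        groups.append(np.arange(start, start + degree))
        start += degree
    return np.hstack(blocks), groups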
Chapter 6. Theory for the Lasso
Abstract
We study the Lasso, i.e., ℓ1-penalized empirical risk minimization, for general convex loss functions. The aim is to show that the Lasso penalty enjoys good theoretical properties, in the sense that its prediction error is of the same order of magnitude as the prediction error one would have if one knew a priori which variables are relevant. The chapter starts out with squared error loss with fixed design, because there the derivations are the simplest. For more general loss, we defer the probabilistic arguments to Chapter 14. We allow for misspecification of the (generalized) linear model, and will consider an oracle that represents the best approximation within the model of the truth. An important quantity in the results will be the so-called compatibility constant, which we require to be non-zero. The latter requirement is called the compatibility condition, a condition with eigenvalue-flavor to it. Our bounds (for prediction error, etc.) are given in explicit (non-asymptotic) form.
Peter Bühlmann, Sara van de Geer
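For orientation, here is the flavor of the compatibility condition and the resulting oracle bound, in simplified paraphrase (constants and side conditions are stated precisely in the chapter itself). The compatibility condition asks for a constant \( \phi_0 > 0 \) such that

\[ \|\beta_{S_0}\|_1^2 \;\le\; \frac{s_0 \, \|X\beta\|_2^2 / n}{\phi_0^2} \qquad \text{for all } \beta \text{ with } \|\beta_{S_0^c}\|_1 \le 3\,\|\beta_{S_0}\|_1 , \]

and under it, with tuning parameter \( \lambda \) of order \( \sqrt{\log p / n} \), the Lasso satisfies with high probability a bound of the form

\[ \|X(\hat\beta - \beta^0)\|_2^2 / n \;+\; \lambda \|\hat\beta - \beta^0\|_1 \;\le\; \frac{4\,\lambda^2 s_0}{\phi_0^2} , \]

where \( s_0 = |S_0| \) is the sparsity of the true coefficient vector: the prediction error scales as if the \( s_0 \) relevant variables were known in advance, up to the \( \log p \) factor.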
Chapter 7. Variable selection with the Lasso
Abstract
We use the Lasso, its adaptive or its thresholded variant, as a procedure for variable selection. This essentially means that for \( S_0 := \{j : \beta^0_j \neq 0\} \) being the true active set, we look for a Lasso procedure delivering an estimator \( \hat{S} \) of \( S_0 \) such that \( \hat{S} = S_0 \) with large probability. However, it is clear that very small coefficients \( |\beta^0_j| \) cannot be detected by any method. Moreover, irrepresentable conditions show that the Lasso, or any weighted variant, typically selects too many variables. In other words, unless one imposes very strong conditions, false positives cannot be avoided either. We shall therefore aim at estimators with oracle prediction error, yet having not too many false positives. The latter is considered as achieved when \( |\hat{S} \setminus S_*| = O(|S_*|) \), where \( S_* \subset S_0 \) is the set of coefficients the oracle would select. We will show that the adaptive Lasso procedure, and also thresholding the initial Lasso, reaches this aim, assuming sparse eigenvalues or, alternatively, so-called "beta-min" conditions.
Peter Bühlmann, Sara van de Geer
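The thresholded Lasso mentioned above is mechanically very simple. A bare-bones sketch (my illustration; the tuning parameters lam and tau are assumptions here, chosen in practice by cross-validation or guided by the theory of this chapter):

```python
import numpy as np
from sklearn.linear_model import Lasso

def thresholded_lasso_support(X, y, lam, tau):
    """Estimate the active set S_0: fit the Lasso, then discard variables
    whose estimated coefficient is small, |beta_j| <= tau. Thresholding
    removes many of the false positives an initial Lasso fit tends to keep."""
    fit = Lasso(alpha=lam, max_iter=10000).fit(X, y)
    return set(np.flatnonzero(np.abs(fit.coef_) > tau))
```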
Chapter 8. Theory for ℓ1/ℓ2-penalty procedures
Abstract
We study four procedures for regression models with group structure in the parameter vector. The first two are for models with univariate response variable. They are the so-called group Lasso (see Chapter 4), and the smoothed group Lasso for the high-dimensional additive model (see Chapter 5). We also discuss multivariate extensions, namely for the linear model with time-varying coefficients, for multivariate regression, and multitask learning.
Peter Bühlmann, Sara van de Geer
Chapter 9. Non-convex loss functions and ℓ1-regularization
Abstract
Much of the theory and computational algorithms for ℓ1-penalized methods in the high-dimensional context has been developed for convex loss functions, e.g., the squared error loss for linear models (Chapters 2 and 6) or the negative log-likelihood in a generalized linear model (Chapters 3 and 6). However, there are many models where the negative log-likelihood is a non-convex function. Important examples include mixture models and linear mixed effects models, which we describe in more detail. Both of them address, in different ways, the issue of modeling a grouping structure among the observations, a quite common feature in complex situations. We discuss in this chapter how to deal with non-convex but smooth ℓ1-penalized likelihood problems. Regarding computation, we can typically only find a local optimum of the corresponding non-convex optimization problem, whereas the theory is given for the estimator defined by a global optimum. Particularly in high-dimensional problems, it is difficult to compute a global optimum, and it would be desirable to have some theoretical properties of estimators arising from "reasonable" local optima. However, this is largely an unanswered problem.
Peter Bühlmann, Sara van de Geer
Chapter 10. Stable solutions
Abstract
Estimation of discrete structure, such as in variable selection or graphical modeling, is notoriously difficult, especially for high-dimensional data. Subsampling or bootstrapping have the potential to substantially increase the stability of high-dimensional selection algorithms and to quantify their uncertainties. Stability via subsampling or bootstrapping was introduced by Breiman (1996) in the context of prediction. Here, the focus is different: the resampling scheme can provide finite sample control for certain error rates of false discoveries and hence a transparent principle to choose a proper amount of regularization for structure estimation. We discuss methodology and theory for very general settings which include variable selection in linear or generalized linear models or graphical modeling from Chapter 13. For the special case of variable selection in linear models, the theoretical properties developed here for consistent selection using stable solutions based on subsampling or bootstrapping require slightly stronger assumptions and are less refined than, say, for the adaptive or thresholded Lasso.
Peter Bühlmann, Sara van de Geer
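The subsampling idea is simple to state in code. A bare-bones sketch (my illustration; the defaults n_subsamples=100 and pi_thr=0.6 are arbitrary, and the chapter's finite sample error control additionally requires choosing the regularization and the threshold as its theory prescribes):

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, lam, n_subsamples=100, pi_thr=0.6, seed=None):
    """Refit the Lasso on random subsamples of size n/2 and keep the
    variables whose selection frequency across subsamples exceeds pi_thr."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        fit = Lasso(alpha=lam, max_iter=10000).fit(X[idx], y[idx])
        counts[fit.coef_ != 0] += 1
    freq = counts / n_subsamples
    return np.flatnonzero(freq >= pi_thr), freq
```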
Chapter 11. P-values for linear models and beyond
Abstract
In the classical low-dimensional setup, error control of false selections based on p-values is a widely used standard in many areas of science. In the high-dimensional setting, assigning significance is challenging. Most computationally efficient selection algorithms cannot guard against inclusion of noise variables, and hence some thresholding operation on the basis of p-values for individual regression coefficients is desirable. We describe in this chapter a rather generic way to achieve this goal. Using multiple sample splitting, which is very simple to implement and bears similarities to subsampling and stability selection from Chapter 10, we show that such a random sampling scheme yields asymptotically valid p-values for controlling the familywise error or false discovery rate in high-dimensional linear or generalized linear models.
Peter Bühlmann, Sara van de Geer
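Here is one split of the sample splitting scheme, sketched for a linear model (my illustration; the chapter's full procedure repeats this over many random splits and aggregates the resulting p-values, which avoids the "p-value lottery" of a single split):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

def single_split_pvalues(X, y, lam, seed=None):
    """Select variables with the Lasso on one half of the data, then compute
    classical OLS t-test p-values on the other half for the selected
    variables, Bonferroni-adjusted by the number of selected variables."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    idx = rng.permutation(n)
    i1, i2 = idx[: n // 2], idx[n // 2 :]
    S = np.flatnonzero(Lasso(alpha=lam, max_iter=10000).fit(X[i1], y[i1]).coef_)
    pvals = np.ones(p)  # non-selected variables get p-value 1
    if S.size and S.size < i2.size:
        Xs, ys = X[i2][:, S], y[i2]
        beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        resid = ys - Xs @ beta
        df = i2.size - S.size
        sigma2 = resid @ resid / df
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
        t = beta / se
        pvals[S] = np.minimum(2 * stats.t.sf(np.abs(t), df) * S.size, 1.0)
    return pvals
```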
Chapter 12. Boosting and greedy algorithms
Abstract
Boosting algorithms or greedy methods are computationally fast and often powerful for high-dimensional data problems. They have been mainly developed for classification and regression. Regularization arises in the form of algorithmic constraints rather than explicit penalty terms. Interestingly, both of these regularization concepts are sometimes close to being equivalent, as shown for linear models by Efron et al. (2004). We present boosting and related algorithms, including a brief discussion of forward variable selection and orthogonal matching pursuit. The exposition in this chapter focuses mainly on methodology and on describing simple computational ideas. Mathematical theory is developed for special cases like one-dimensional function estimation or high-dimensional linear models.
Peter Bühlmann, Sara van de Geer
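Componentwise L2Boosting for a linear model illustrates how regularization enters through the algorithm itself. A minimal sketch (my illustration; the step size nu=0.1 and number of steps are assumptions, and early stopping plays the role of the penalty parameter):

```python
import numpy as np

def l2_boosting(X, y, n_steps=200, nu=0.1):
    """Componentwise L2Boosting: at each step, fit the current residuals by
    the single best predictor in squared error and move a small step nu in
    that direction. Early stopping (n_steps) acts as the regularizer."""
    n, p = X.shape
    beta, resid = np.zeros(p), y.astype(float).copy()
    col_ss = (X ** 2).sum(axis=0)            # per-column sums of squares
    for _ in range(n_steps):
        coef = X.T @ resid / col_ss          # univariate LS coefficient per column
        j = np.argmax(coef ** 2 * col_ss)    # column giving largest RSS reduction
        beta[j] += nu * coef[j]
        resid -= nu * coef[j] * X[:, j]
    return beta
```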
Chapter 13. Graphical modeling
Abstract
Graphical models are very useful to describe conditional independences among a set of random variables. We focus on undirected graphs only and their interpretation as conditional independence graphs. For undirected Gaussian graphical models, ℓ1-regularization methods can be used in a similar fashion as for linear models. Closely linked to the estimation of the underlying undirected graph is the problem of covariance estimation which we briefly discuss as well. Besides ℓ1-penalization, we describe a completely different approach using the framework of so-called faithful distributions which also has implications on variable selection for regression. The chapter contains methodology, algorithms and mathematical theory.
Peter Bühlmann, Sara van de Geer
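One way ℓ1-regularization carries over from linear models to graphs is nodewise regression (neighborhood selection): regress each variable on all others with the Lasso and read edges off the nonzero coefficients. A sketch (my illustration; the alternative ℓ1 route is to penalize the precision matrix directly, as in the graphical Lasso):

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_graph(X, lam):
    """Neighborhood selection: for each node j, Lasso-regress X_j on all
    other variables; place an edge {j, k} when either regression selects the
    other node (the 'OR' rule). Returns the estimated adjacency matrix."""
    p = X.shape[1]
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        coef = Lasso(alpha=lam, max_iter=10000).fit(X[:, others], X[:, j]).coef_
        adj[j, others] = coef != 0
    return adj | adj.T   # OR rule combining the two directed decisions
```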
Chapter 14. Probability and moment inequalities
Abstract
The chapter is a mini-guide to empirical process theory, in particular probability and moment inequalities for the empirical process indexed by functions. We give an essentially complete (but often rather compact) treatment of the results we need in the previous chapters.
Peter Bühlmann, Sara van de Geer
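As a flavor of the inequalities collected there, a standard example of the type of exponential tail bound used throughout is Hoeffding's inequality for independent bounded variables \( a \le Z_i \le b \) (a classical result, stated here for orientation rather than quoted from the chapter):

\[ \mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^n \bigl(Z_i - \mathbb{E} Z_i\bigr) \right| \ge t \right) \;\le\; 2 \exp\left( - \frac{2 n t^2}{(b - a)^2} \right) . \]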
Backmatter
Metadata
Title
Statistics for High-Dimensional Data
Authors
Peter Bühlmann
Sara van de Geer
Copyright Year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-20192-9
Print ISBN
978-3-642-20191-2
DOI
https://doi.org/10.1007/978-3-642-20192-9
