
## About this Book

The purpose of these lecture notes is to provide an introduction to the general theory of empirical risk minimization with an emphasis on excess risk bounds and oracle inequalities in penalized problems. In recent years, there have been new developments in this area motivated by the study of new classes of methods in machine learning such as large margin classification methods (boosting, kernel machines). The main probabilistic tools involved in the analysis of these problems are concentration and deviation inequalities by Talagrand along with other methods of empirical processes theory (symmetrization inequalities, contraction inequality for Rademacher sums, entropy and generic chaining bounds). Sparse recovery based on ℓ₁-type penalization and low rank matrix recovery based on nuclear norm penalization are other active areas of research, where the main problems can be stated in the framework of penalized empirical risk minimization, and concentration inequalities and empirical processes tools have proved to be very useful.
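Schematically (with notation chosen here for illustration, not quoted from the book), penalized empirical risk minimization selects

```latex
\hat f \in \operatorname*{argmin}_{f \in \mathcal{F}}
  \left[ \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(X_i), Y_i\bigr)
         \;+\; \varepsilon \, \mathrm{pen}(f) \right],
```

where (X₁, Y₁), …, (Xₙ, Yₙ) is a training sample, ℓ is a loss function, and pen(·) is a complexity penalty with regularization parameter ε; the ℓ₁-type and nuclear norm penalties mentioned above are special cases of pen(·).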

## Table of Contents

### Chapter 1. Introduction

Abstract
We start with a brief overview of empirical risk minimization problems and of the role of empirical and Rademacher processes in constructing distribution dependent and data dependent excess risk bounds. We then discuss penalized empirical risk minimization and oracle inequalities and conclude with sparse recovery and low rank matrix recovery problems. Many important aspects of empirical risk minimization are beyond the scope of these notes, in particular, the circle of questions related to approximation theory (see well known papers by Cucker and Smale [47], DeVore et al. [49] and references therein).

### Chapter 2. Empirical and Rademacher Processes

Abstract
It should be mentioned that certain measurability assumptions are required in the study of empirical and Rademacher processes. In particular, under these assumptions

### Chapter 3. Bounding Expected Sup-Norms of Empirical and Rademacher Processes

Abstract
In what follows, we will use a number of bounds on expectations of suprema of empirical and Rademacher processes. Because of symmetrization inequalities, the problems of bounding expected suprema for these two stochastic processes are equivalent. The bounds are usually based on various complexity measures of function classes (such as linear dimension, VC-dimension, shattering numbers, uniform covering numbers, random covering numbers, bracketing numbers, generic chaining complexities, etc.). It would be of interest to develop the bounds with precise dependence on such geometric parameters as the L₂(P)-diameter of the class. Combining the bounds on expected suprema with Talagrand’s concentration inequalities yields exponential inequalities for the tail probabilities of sup-norms.
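The symmetrization step referred to above can be stated as follows (a standard formulation, with notation assumed here): for i.i.d. X₁, …, Xₙ ~ P, independent Rademacher signs εᵢ, and Pₙ the empirical measure,

```latex
\mathbb{E} \sup_{f \in \mathcal{F}} \bigl| (P_n - P) f \bigr|
  \;\le\; 2\, \mathbb{E} \sup_{f \in \mathcal{F}}
  \Bigl| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(X_i) \Bigr|,
```

and a reverse (desymmetrization) bound holds up to absolute constants and a centering term, which is why bounding the expected suprema of the two processes is essentially the same problem.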

### Chapter 4. Excess Risk Bounds

Abstract
We will assume that such a minimizer exists (a simple modification of the results is possible if f̂ₙ is an approximate solution of (4.1)). Our approach to this problem has already been outlined in Chap. 1 and is closely related to the recent work of Massart [106], Koltchinskii and Panchenko [92], Bartlett et al. [15], Bousquet et al. [34], Koltchinskii [83], and Bartlett and Mendelson [17].
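For orientation (a schematic formulation with illustrative notation, not quoted from the book): writing Pₙ for the empirical measure and P for the underlying distribution, the empirical risk minimizer and its excess risk are

```latex
\hat f_n \in \operatorname*{argmin}_{f \in \mathcal{F}} P_n f,
\qquad
\mathcal{E}(\hat f_n) \;=\; P \hat f_n \;-\; \inf_{f \in \mathcal{F}} P f,
```

and the bounds of this chapter control the excess risk of f̂ₙ in terms of the complexity of the class 𝓕.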

### Chapter 5. Examples of Excess Risk Bounds in Prediction Problems

Abstract
Empirical risk minimization with convex loss was studied in detail by Blanchard et al. [26] and Bartlett et al. [16]. In the latter paper, rather subtle bounds relating excess risks with respect to a “surrogate” convex loss and with respect to the binary classification loss were also studied. Earlier, Zhang [154] provided initial versions of such bounds.

### Chapter 6. Penalized Empirical Risk Minimization and Model Selection Problems

Abstract
Birgé and Massart [24] introduced a concept of minimal penalties and advocated an approach to the problem of calibration of data-dependent penalties based on so-called “slope heuristics”.

### Chapter 7. Linear Programming in Sparse Recovery

Abstract
As was pointed out in the Introduction, many important sparse recovery methods are based on empirical risk minimization with a convex loss and a convex complexity penalty. Some interesting algorithms, for instance, the Dantzig selector by Candès and Tao [44], can be formulated as linear programs. In this chapter, we develop error bounds for such algorithms that require certain geometric assumptions on the dictionary. They are expressed in terms of restricted isometry constants and other related characteristics that depend both on the dictionary and on the design distribution. Based on these geometric characteristics, we describe conditions for exact sparse recovery in the noiseless case as well as sparsity oracle inequalities for the Dantzig selector in regression problems with random noise. These results rely on comparison inequalities and exponential bounds for empirical and Rademacher processes.
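In the standard regression setting (a schematic formulation with illustrative notation: design matrix X, response vector y, threshold ε), the Dantzig selector is the optimization problem

```latex
\hat\beta \in \operatorname*{argmin}
  \Bigl\{ \|\beta\|_{\ell_1} \;:\;
          \bigl\| X^{\top} (y - X\beta) \bigr\|_{\infty} \le \varepsilon
  \Bigr\},
```

which minimizes the ℓ₁-norm subject to a uniform bound on the correlations between the dictionary elements and the residuals; after the usual splitting of β into positive and negative parts, both the objective and the constraints are linear, so the problem is a linear program.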

### Chapter 8. Convex Penalization in Sparse Recovery

Abstract
We will discuss the role of penalized empirical risk minimization with convex penalties in sparse recovery problems. This includes the ℓ₁-norm (LASSO) penalty as well as strictly convex and smooth penalties, such as the negative entropy penalty for sparse recovery in convex hulls. The goal is to show that, when the target function can be well approximated by a “sparse” linear combination of functions from a given dictionary, solutions of penalized empirical risk minimization problems with ℓ₁ and some other convex penalties are “approximately sparse” and approximate the target function with an error that depends on the “sparsity”. As a result of this analysis, we derive sparsity oracle inequalities showing the dependence of the excess risk of the empirical solution on the underlying sparsity of the problem. These inequalities also involve various distribution-dependent geometric characteristics of the dictionary (such as restricted isometry constants and alignment coefficients), and the error of sparse recovery crucially depends on the geometry of the dictionary.
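Schematically (illustrative notation, not quoted from the book), for a dictionary h₁, …, h_N the ℓ₁-penalized (LASSO-type) estimator solves

```latex
\hat\lambda \in \operatorname*{argmin}_{\lambda \in \mathbb{R}^N}
  \left[ \frac{1}{n} \sum_{i=1}^{n}
         \ell\Bigl( Y_i,\; \sum_{j=1}^{N} \lambda_j h_j(X_i) \Bigr)
         \;+\; \varepsilon \, \|\lambda\|_{\ell_1} \right],
```

and the sparsity oracle inequalities of this chapter bound the excess risk of the resulting linear combination in terms of the sparsity of well-approximating “oracle” vectors λ.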

### Chapter 9. Low Rank Matrix Recovery: Nuclear Norm Penalization

Abstract
In this chapter, we discuss the problem of estimating a large target matrix based on a finite number of noisy measurements of linear functionals (often, random) of this matrix. The underlying assumption is that the target matrix is of small rank, and the goal is to determine how the estimation error depends on the rank as well as on other important parameters of the problem such as the number of measurements and the variance of the noise. This problem can be viewed as a natural noncommutative extension of the sparse recovery problems discussed in the previous chapters. As a matter of fact, low rank recovery is equivalent to sparse recovery when all the matrices in question are diagonal. There are several important instances of such problems, in particular, matrix completion [41, 45, 70, 123], matrix regression [40, 90, 126] and the problem of density matrix estimation in quantum state tomography [70, 71, 88]. We will study some of these problems using general empirical processes techniques developed in the first several chapters. Noncommutative Bernstein type inequalities established in Sect. 2.4 will play a very special role in our analysis. The main results will be obtained for Hermitian matrices. The so-called “Paulsen dilation” (see Sect. 2.4) can then be used to tackle the case of rectangular matrices. Throughout the chapter, we use the notation introduced in Sect. A.4.
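In the matrix regression setting (schematic, with notation assumed for illustration), the nuclear norm penalized least squares estimator takes the form

```latex
\hat A \in \operatorname*{argmin}_{A}
  \left[ \frac{1}{n} \sum_{i=1}^{n}
         \bigl( Y_i - \langle A, X_i \rangle \bigr)^2
         \;+\; \varepsilon \, \|A\|_{1} \right],
```

where ⟨A, X⟩ = tr(AX*) and ‖A‖₁ denotes the nuclear norm (the sum of the singular values), the noncommutative analogue of the ℓ₁-norm of the diagonal in the sparse recovery problems of the previous chapters.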