2001 | Book

Combinatorial Methods in Density Estimation

Authors: Luc Devroye, Gábor Lugosi

Publisher: Springer New York

Book series: Springer Series in Statistics


About this book

Density estimation has evolved enormously since the days of bar plots and histograms, but researchers and users are still struggling with the problem of selecting bin widths. This text explores a new paradigm for the data-based or automatic selection of the free parameters of density estimates in general, so that the expected error is within a given constant multiple of the best possible error. The paradigm can be used in nearly all density estimates and for most model selection problems, both parametric and nonparametric. It is the first book on this topic. The text is intended for first-year graduate students in statistics and learning theory, and offers a host of opportunities for further research and thesis topics. Each chapter corresponds roughly to one lecture, and is supplemented with many classroom exercises. A one-year course in probability theory at the level of Feller's Volume 1 should be more than adequate preparation. Gábor Lugosi is Professor at Universitat Pompeu Fabra in Barcelona, and Luc Devroye is Professor at McGill University in Montreal. In 1996, the authors, together with László Györfi, published the successful text A Probabilistic Theory of Pattern Recognition with Springer-Verlag. Both authors have made many contributions in the area of nonparametric estimation.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
A random variable X on R^d has a density f if, for all Borel sets A of R^d, ∫_A f(x) dx = P{X ∈ A}. It thus serves as a tool for computing probabilities of sets. As it is a function that reveals the local concentration of probability mass, it may be used to visualize distributions of random variables. The statistician's problem, then, is to estimate f from an i.i.d. sample X_1,…,X_n drawn from f. A density estimate is simply a mapping f_n : R^d × (R^d)^n → R (we write f_n(x; X_1,…,X_n) or f_n(x)). It is the global closeness of f_n to f that interests us. The choice of a density estimate is governed by a number of factors, like consistency, smoothness, ease of computation, interpretability, flexibility, robustness, versatility, and optimality for certain criteria. The early work in the field approached the problem largely as a functional estimation problem: f was treated as any function, and tools from function approximation theory came to the rescue in the analysis of the performance of density estimates; Taylor series expansions played a key role, for example. The view we take in nonparametric density estimation is that f is largely unknown and that no assumptions can be made about its properties. Methods or properties that are valid for all densities are said to be universal. It is quite surprising that there are density estimates that can approximate any density f asymptotically in an appropriate sense; simple histogram and kernel estimates have this property, for example. But when people started studying rates of convergence, or error bounds for particular n, they invariably had to place conditions on the smoothness or boundedness of f. This was frustrating, because it was nearly impossible to test whether these assumptions were satisfied. Just try cooking up a test that decides whether the essential supremum of f is bounded, for example!
Luc Devroye, Gábor Lugosi
Chapter 2. Concentration Inequalities
Abstract
Most of the methodology discussed in this book builds on elements of empirical process theory, certain concentration-of-measure inequalities, and simple combinatorial calculations. The aim of this and the following two chapters is to equip the reader with these simple tools. We keep the material at an elementary level, with additional information added in the exercises.
Luc Devroye, Gábor Lugosi
Chapter 3. Uniform Deviation Inequalities
Abstract
This chapter is devoted to some basic inequalities that bound the maximal difference between probabilities and relative frequencies over a class of events. The bounds will be key tools in our study of density estimates. Let X_1,…,X_n be i.i.d. random variables taking values in R^d with common distribution
$$ \mu(A) = \mathrm{P}\{X_1 \in A\} \qquad (A \subset \mathbf{R}^d). $$
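To make this quantity concrete, here is a minimal sketch of our own (not from the book): for the class of half-lines A = (−∞, t] on the real line, the uniform deviation sup_A |μ_n(A) − μ(A)| reduces to the classical Kolmogorov-Smirnov statistic and can be computed exactly. The standard normal sample is an arbitrary choice.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 1000
x = np.sort(rng.standard_normal(n))

# For half-lines A = (-inf, t], the supremum of |mu_n(A) - mu(A)| over t
# is attained at the sample points: the Kolmogorov-Smirnov statistic.
cdf = norm.cdf(x)                       # mu((-inf, x_(i)])
upper = np.arange(1, n + 1) / n - cdf   # deviation at each order statistic
lower = cdf - np.arange(0, n) / n       # deviation just left of each point
print("sup_A |mu_n(A) - mu(A)| =", max(upper.max(), lower.max()))
```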
Luc Devroye, Gábor Lugosi
Chapter 4. Combinatorial Tools
Abstract
Consider a class 𝒜 of subsets of R^d, and let x_1,…,x_n ∈ R^d be arbitrary points. Recall from the previous chapter that properties of the finite set 𝒜(x_1^n) ⊂ {0,1}^n defined by
$$ \mathcal{A}(x_1^n) = \left\{ b = (b_1,\ldots,b_n) \in \{0,1\}^n : \exists A \in \mathcal{A} \ \text{with} \ b_i = 1_{[x_i \in A]},\ i = 1,\ldots,n \right\} $$
play an essential role in bounding uniform deviations of the empirical measure.
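As a small illustration of our own, for the class of half-lines (−∞, t] the set 𝒜(x_1^n) can be enumerated directly; its cardinality (the shatter coefficient) is n + 1 for n distinct points.

```python
import numpy as np

def trace(points):
    """Enumerate A(x_1^n) for the class of half-lines (-inf, t] on R."""
    points = np.sort(np.asarray(points, dtype=float))
    vectors = set()
    # One threshold below all points, plus one at each point, realizes
    # every achievable 0/1 vector for this class.
    for t in np.concatenate(([points[0] - 1.0], points)):
        vectors.add(tuple((points <= t).astype(int)))
    return vectors

print(len(trace([0.3, 1.7, 2.2, 4.0])))  # n + 1 = 5 achievable vectors
```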
Luc Devroye, Gábor Lugosi
Chapter 5. Total Variation
Abstract
A random variable X has a density f on R^d when, for all Borel sets A ⊂ R^d, P{X ∈ A} = ∫_A f(x) dx. In other words, if A is a small ball about x, its probability is about f(x) times the volume of A. The purpose of density estimation is to estimate an unknown density f from an i.i.d. sample X_1,…,X_n drawn from f. There are other questions one may ask, like the estimation of a functional such as ∫ f^α(x) dx or ∫ ψ(f(x), x) dx for general ψ. These are not dealt with in the text and may indeed require a very different treatment. We are concerned with the density itself as a means of computing probabilities of all sets. A density estimate f_n(x) = f_n(x; X_1,…,X_n) is thus a mapping from (R^d)^{n+1} to R. Its quality is measured by how well it performs the task at hand, estimating probabilities.
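A key standard fact in this setting is Scheffé's identity: the total variation distance sup_A |P_f(A) − P_g(A)| equals half the L_1 distance ∫|f − g|, with the supremum attained on A = {x : f(x) > g(x)}. The following numerical check is our own illustration, with two Gaussians as an arbitrary example.

```python
import numpy as np
from scipy.stats import norm

# Numerical check of Scheffe's identity:
# sup_A |P_f(A) - P_g(A)| = (1/2) * integral |f - g|,
# the sup being attained on A = {x : f(x) > g(x)}.
x = np.linspace(-10.0, 11.0, 200001)
dx = x[1] - x[0]
f = norm.pdf(x, loc=0.0)
g = norm.pdf(x, loc=1.0)
half_l1 = 0.5 * np.abs(f - g).sum() * dx
tv = (f - g)[f > g].sum() * dx
print(half_l1, tv)  # the two values agree up to discretization error
```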
Luc Devroye, Gábor Lugosi
Chapter 6. Choosing a Density Estimate
Abstract
Consider the following simple situation: f_n and g_n are two density estimates, and we must select the better one, that is, achieve argmin(∫|f_n − f|, ∫|g_n − f|). More precisely, given the sample X_1,…,X_n distributed according to density f, we are asked to construct a density estimate φ_n such that
$$ \int |\varphi_n - f| \approx \min\left( \int |f_n - f|, \int |g_n - f| \right). $$
This simple problem turns out to be surprisingly difficult, even if the estimates f_n and g_n are fixed densities, not depending on the data.
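One natural selection rule, and the kind studied in this chapter, compares the two candidates on their Scheffé set A = {f_n > g_n}: keep whichever candidate assigns A a probability closer to its empirical frequency. The sketch below is our own minimal version for two fixed densities; the Gaussian pair and all parameters are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def select(f, g, sample, grid):
    """Choose between two fixed densities f and g: compare each one's
    probability of the Scheffe set A = {f > g} with the empirical
    frequency of A in the sample, and keep the closer one."""
    dx = grid[1] - grid[0]
    fa, ga = f(grid), g(grid)
    A = fa > ga
    p_f = fa[A].sum() * dx                 # integral of f over A
    p_g = ga[A].sum() * dx                 # integral of g over A
    mu_n = np.mean(f(sample) > g(sample))  # empirical measure of A
    return f if abs(p_f - mu_n) <= abs(p_g - mu_n) else g

rng = np.random.default_rng(1)
sample = rng.standard_normal(500)          # data truly drawn from f below
grid = np.linspace(-10.0, 10.0, 40001)
f = lambda t: norm.pdf(t, 0.0)
g = lambda t: norm.pdf(t, 2.0)
print("chose f" if select(f, g, sample, grid) is f else "chose g")
```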
Luc Devroye, Gábor Lugosi
Chapter 7. Skeleton Estimates
Abstract
In this short chapter, we show how to pick a density estimate from a pre-specified class of densities F. In particular, the classes we are interested in are totally bounded, that is, for every ε > 0, there exists a finite number N_ε of densities in F such that the L_1 balls of radius ε centered at these densities cover F; that is, if the chosen densities are G_ε = {g_1,…,g_{N_ε}}, then
$$ F \subseteq \bigcup_{i=1}^{N_\varepsilon} B_{g_i,\varepsilon}, $$
where B_{g,r} = {f : ∫ |f − g| ≤ r}. The smallest such N_ε is called the complexity (or covering number) of F (and thus is a function of ε), and log_2 N_ε is the Kolmogorov entropy of F. G_ε is called a skeleton of F.
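As a hedged worked example of our own, take the scale family F = {U[0,θ] : θ ∈ [1,2]}. For a ≤ b, the L_1 distance between the U[0,a] and U[0,b] densities is 2(1 − a/b) in closed form, so a geometric grid of θ values is an ε-skeleton.

```python
import numpy as np

def l1_uniform(a, b):
    """Closed-form L1 distance between the U[0,a] and U[0,b] densities."""
    a, b = min(a, b), max(a, b)
    return 2.0 * (1.0 - a / b)

def skeleton(theta_min, theta_max, eps):
    """Geometric grid of scale parameters: consecutive points are at L1
    distance exactly eps, so every U[0,theta] in the class lies within
    eps of some grid density (an eps-skeleton)."""
    ratio = 1.0 / (1.0 - eps / 2.0)
    pts = [theta_min]
    while pts[-1] * ratio < theta_max:
        pts.append(pts[-1] * ratio)
    pts.append(theta_max)
    return np.array(pts)

G = skeleton(1.0, 2.0, 0.1)
print(len(G), l1_uniform(G[0], G[1]))  # N_eps and the grid spacing (0.1)
```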
Luc Devroye, Gábor Lugosi
Chapter 8. The Minimum Distance Estimate: Examples
Abstract
Given a class F of densities, we are to construct a density estimate that performs almost as well as the best density in the class. We saw that the skeleton estimate defined in the previous chapter always works when F is totally bounded. In this chapter we analyze the minimum distance estimate described in Section 6.8. Assume that the densities f_θ ∈ F are indexed by a parameter θ ∈ Θ.
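A minimal sketch of the minimum distance idea, in the Yatracos-class form of Section 6.8 as we understand it: pick the θ whose density tracks the empirical measure best over the sets {f_s > f_t}. The Gaussian location family, the grids, and all constants are illustrative assumptions of ours.

```python
import numpy as np
from scipy.stats import norm

def minimum_distance(sample, thetas, grid):
    """Pick the theta whose density minimizes the maximum, over the
    Yatracos sets A = {f_s > f_t}, of |integral_A f_theta - mu_n(A)|."""
    dx = grid[1] - grid[0]
    dg = np.array([norm.pdf(grid, t) for t in thetas])    # densities on grid
    ds = np.array([norm.pdf(sample, t) for t in thetas])  # densities at data
    k = len(thetas)
    sets, mu_n = [], []
    for s in range(k):
        for t in range(k):
            if s != t:
                sets.append(dg[s] > dg[t])            # A on the grid
                mu_n.append(np.mean(ds[s] > ds[t]))   # empirical measure of A
    errs = [max(abs(dg[i][A].sum() * dx - m) for A, m in zip(sets, mu_n))
            for i in range(k)]
    return thetas[int(np.argmin(errs))]

rng = np.random.default_rng(2)
sample = rng.normal(0.7, 1.0, 300)
thetas = np.linspace(-2.0, 2.0, 17)
grid = np.linspace(-12.0, 12.0, 24001)
print(minimum_distance(sample, thetas, grid))  # close to the true 0.7
```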
Luc Devroye, Gábor Lugosi
Chapter 9. The Kernel Density Estimate
Abstract
In this chapter, we get our first taste of real analysis, starting with some results on the approximation of functions in L_1. The literature on this is vast, and a lot of it is ancient. The problem is that f cannot be approximated in L_1 by μ_n, the empirical measure, as the total variation distance between any density f and any atomic measure (like μ_n) is 1. Thus, the approximation itself must have a density. The kernel estimate provides this: it smooths the empirical measure μ_n. This chapter has no combinatorial contributions, but develops just enough of the function approximation material to understand the remaining chapters.
Luc Devroye, Gábor Lugosi
Chapter 10. Additive Estimates and Data Splitting
Abstract
Assume that we are given a class of density estimates parametrized by θ ∈ Θ, such that f_{n,θ} denotes the density estimate with parameter θ. Our goal is to construct a density estimate f_n whose L_1 error is (almost) as small as that of the best estimate among the f_{n,θ}, θ ∈ Θ. Applying the minimum distance estimate of Chapter 6 directly to this class is often problematic because of the dependence between each estimate in the class and the empirical measure μ_n.
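Data splitting breaks that dependence: build the candidate estimates on one part of the sample and select among them using only the held-out part. The sketch below is our own minimal version, assuming Gaussian kernel candidates and a Yatracos-type test as the selector; the split sizes and bandwidths are arbitrary.

```python
import numpy as np

def kde(train, h):
    """Gaussian kernel estimate built from the training half only."""
    def f(x):
        u = (np.atleast_1d(x)[:, None] - train[None, :]) / h
        return np.exp(-0.5 * u * u).sum(axis=1) / (len(train) * h * np.sqrt(2 * np.pi))
    return f

def select(cands, holdout, grid):
    """Choose among the candidates using only held-out points: minimize
    the maximum, over Scheffe sets A = {f_s > f_t} of pairs of candidates,
    of |integral_A f - empirical holdout frequency of A|."""
    dx = grid[1] - grid[0]
    on_grid = [f(grid) for f in cands]
    on_hold = [f(holdout) for f in cands]
    errs = []
    for i in range(len(cands)):
        e = 0.0
        for s in range(len(cands)):
            for t in range(len(cands)):
                if s != t:
                    A = on_grid[s] > on_grid[t]
                    mu = np.mean(on_hold[s] > on_hold[t])
                    e = max(e, abs(on_grid[i][A].sum() * dx - mu))
        errs.append(e)
    return int(np.argmin(errs))

rng = np.random.default_rng(3)
data = rng.standard_normal(500)
train, holdout = data[:400], data[400:]   # the split
grid = np.linspace(-8.0, 8.0, 8001)
hs = (0.05, 0.25, 1.0)
best = select([kde(train, h) for h in hs], holdout, grid)
print("selected h =", hs[best])
```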
Luc Devroye, Gábor Lugosi
Chapter 11. Bandwidth Selection for Kernel Estimates
Abstract
This chapter is about the choice of the bandwidth (or smoothing factor) h ∈ (0, ∞) of the standard kernel estimate
$$ {f_{n,h}}(x) = \frac{1}{{n{h^d}}}\sum\limits_{i = 1}^n {K\left( {\frac{{x - {X_i}}}{h}} \right)} . $$
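In code, for d = 1 and with a Gaussian kernel as an arbitrary default, the estimate is a one-liner; this is a sketch of ours, not the book's implementation.

```python
import numpy as np

def f_nh(x, data, h, K=lambda u: np.exp(-0.5 * u * u) / np.sqrt(2.0 * np.pi)):
    """Kernel estimate f_{n,h}(x) = (1/(n h)) sum_i K((x - X_i)/h),
    written for d = 1; K defaults to the Gaussian kernel."""
    x = np.atleast_1d(x)
    return K((x[:, None] - data[None, :]) / h).sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(4)
data = rng.standard_normal(1000)
print(f_nh(0.0, data, h=0.3))  # near 0.399, the standard normal density at 0
```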
Luc Devroye, Gábor Lugosi
Chapter 12. Multiparameter Kernel Estimates
Abstract
In this chapter we illustrate how the minimum distance estimate described in Chapter 10 may be used to select various parameters simultaneously in an almost optimal manner. The examples are all simple multiparameter versions of the kernel estimate. Once again, the methods applied here are fully combinatorial, as the only thing we need in each case is a suitable upper bound for the shatter coefficient appearing in Theorem 10.3.
Luc Devroye, Gábor Lugosi
Chapter 13. Wavelet Estimates
Abstract
In univariate curve estimation, wavelet systems have become popular. A wavelet system is a countable family of orthonormal functions (in L_2(R)) {φ_k, ψ_{jk} : j ≥ 0, k = 0, ±1, ±2,…} with the following properties (a small numerical check with the Haar system follows the list):
(1)
φ_k(x) = φ(x − k), k = 0, ±1, ±2,…, are translates of a fixed function φ in L_2 (called the father wavelet) such that the φ_k's are orthonormal.
(2)
For j ≥ 0, ψ_{jk}(x) = 2^{j/2} ψ(2^j x − k), where ψ (the mother wavelet) is a function such that its translates ψ(x − k), k = 0, ±1, ±2,…, form an orthonormal basis of the space W_0 defined as the orthogonal complement of V_0 in V_1, where V_0 is the space generated by all translates of φ(x) and V_1 is the space generated by all translates of φ(2x).
(3)
{φ_k, ψ_{jk} : j ≥ 0, k = 0, ±1, ±2,…} is an orthonormal basis of L_2(R).
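The simplest concrete instance is the Haar system, φ = 1_{[0,1)} and ψ = 1_{[0,1/2)} − 1_{[1/2,1)}; the sketch below (our illustration) verifies orthonormality numerically.

```python
import numpy as np

phi = lambda x: ((0.0 <= x) & (x < 1.0)).astype(float)  # Haar father wavelet
psi = lambda x: phi(2 * x) - phi(2 * x - 1)             # Haar mother wavelet

def psi_jk(j, k):
    return lambda x: 2.0 ** (j / 2.0) * psi(2.0 ** j * x - k)

# Check orthonormality numerically on a fine grid.
x = np.linspace(-4.0, 4.0, 800001)
dx = x[1] - x[0]
f, g = psi_jk(1, 0), psi_jk(1, 1)
print(np.sum(f(x) * f(x)) * dx)    # ~1: unit norm
print(np.sum(f(x) * g(x)) * dx)    # ~0: orthogonal translates
print(np.sum(phi(x) * f(x)) * dx)  # ~0: father orthogonal to the psi_jk
```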
 
Luc Devroye, Gábor Lugosi
Chapter 14. The Transformed Kernel Estimate
Abstract
The transformed kernel estimate on the real line was introduced in an attempt to reduce the L_1 error in a relatively cheap manner. The data are first transformed by a strictly monotonically increasing, almost everywhere differentiable transformation T : R → R: Y_1 = T(X_1),…,Y_n = T(X_n). The density of Y_1 is
$$ g(y) = f(T^{-1}(y))\,(T^{-1})'(y), $$
where T^{-1} denotes the inverse of T.
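By the same change of variables, a kernel estimate g_n of the transformed data can be pulled back via f_n(x) = g_n(T(x)) T'(x). A minimal sketch of ours, with the hypothetical choice T = arctan applied to Cauchy data (which makes Y uniform on (−π/2, π/2)):

```python
import numpy as np
from scipy.stats import norm

def transformed_kernel(x, data, T, T_prime, h):
    """Back-transformed kernel estimate: estimate the density g of
    Y = T(X) with a kernel estimate g_n, then return
    f_n(x) = g_n(T(x)) * T'(x)."""
    y = T(data)
    def g_n(t):
        u = (np.atleast_1d(t)[:, None] - y[None, :]) / h
        return norm.pdf(u).sum(axis=1) / (len(y) * h)
    x = np.atleast_1d(x)
    return g_n(T(x)) * T_prime(x)

rng = np.random.default_rng(5)
data = rng.standard_cauchy(2000)
est = transformed_kernel(0.0, data, np.arctan,
                         lambda x: 1.0 / (1.0 + x * x), h=0.1)
print(est, 1.0 / np.pi)  # true Cauchy density at 0 is 1/pi
```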
Luc Devroye, Gábor Lugosi
Chapter 15. Minimax Theory
Abstract
Let us use an extreme example to start off this chapter. Assume that we are shown one data point X drawn from an unknown density f that belongs to a class F, which is given to us. Our estimate of f is g(x, X), where g should be chosen as well as possible. As we do not know f, we naturally assume that after g is picked, our adversary picks the worst f ∈ F.
Luc Devroye, Gábor Lugosi
Chapter 16. Choosing the Kernel Order
Abstract
Using the notation K_h(·) to denote (1/h)K(·/h), let the univariate kernel estimate with bandwidth h > 0 and kernel K be
$$ {f_{n,K,h}}(x) = \frac{1}{n}\sum\limits_{i = 1}^n {{K_h}} (x - {X_i}).$$
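The order of K, the smallest i ≥ 1 with ∫ x^i K(x) dx ≠ 0, is the quantity being chosen in this chapter. A standard fourth-order example, offered purely as an illustration of our own, is K(x) = ½(3 − x²)φ(x) with φ the standard normal density; the sketch checks its moments numerically.

```python
import numpy as np
from scipy.stats import norm

def K4(x):
    """A fourth-order kernel built from the Gaussian density: it
    integrates to 1 and its first three moments vanish."""
    return 0.5 * (3.0 - x * x) * norm.pdf(x)

x = np.linspace(-12.0, 12.0, 240001)
dx = x[1] - x[0]
for i in range(5):
    print(i, (x ** i * K4(x)).sum() * dx)  # 1, 0, 0, 0, then nonzero
```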
Luc Devroye, Gábor Lugosi
Chapter 17. Bandwidth Choice with Superkernels
Abstract
In the previous chapter, we looked at the saturation classes for kernels with many vanishing moments. The best rate Φ(n, K) for these kernels was n^{−s/(2s+1)} with s an even positive integer. So, in theory, if we could find a kernel all of whose moments vanish, then we could hope for a rate Φ(n, K) that is better than n^{−1/2+ε} for all ε > 0. So, is it possible to find such kernels? Can we simultaneously ensure ∫ K = 1, yet ∫ x^i K(x) dx = 0 for all i > 0? The answer is yes. In fact, there are many such kernels, and each of them has, of course, its own best rate Φ(n, K). This could lead to the introduction of yet another hierarchy of saturation classes with best rates, for example, (log n)^s/√n. However, as these saturation classes are surely becoming rather small (all densities in them must be infinitely many times continuously differentiable), this is an almost pointless exercise. Instead, we consider the ultimate kernels, those for which Φ(n, K) = O(1/√n). Clearly, O(1/√n) is the best possible rate, as Lemma 1 shows.
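One classical construction with a Fourier transform identically 1 near the origin (so no moment of any order obstructs the rate) takes the transform to be the trapezoid equal to 1 on [−1, 1], linear on 1 ≤ |t| ≤ 2, and 0 beyond; inverting it gives K(x) = (cos x − cos 2x)/(π x²). The sketch below is our illustration of this construction.

```python
import numpy as np

def superkernel(x):
    """Kernel whose Fourier transform is the trapezoid equal to 1 on
    [-1, 1], linear on 1 <= |t| <= 2, and 0 beyond:
    K(x) = (cos x - cos 2x) / (pi x^2), with K(0) = 3/(2 pi)."""
    x = np.asarray(x, dtype=float)
    small = np.abs(x) < 1e-6
    safe = np.where(small, 1.0, x)  # dummy value to avoid 0/0 at the origin
    val = (np.cos(safe) - np.cos(2.0 * safe)) / (np.pi * safe * safe)
    return np.where(small, 1.5 / np.pi, val)

x = np.linspace(-500.0, 500.0, 1000001)
dx = x[1] - x[0]
print(superkernel(x).sum() * dx)  # ~1: K integrates to 1 (its transform at 0)
```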
Luc Devroye, Gábor Lugosi
Backmatter
Metadata
Title
Combinatorial Methods in Density Estimation
Authors
Luc Devroye
Gábor Lugosi
Copyright year
2001
Publisher
Springer New York
Electronic ISBN
978-1-4613-0125-7
Print ISBN
978-1-4612-6527-6
DOI
https://doi.org/10.1007/978-1-4613-0125-7