
2015 | Book

Lectures on the Nearest Neighbor Method


About this book

This text presents a wide-ranging and rigorous overview of nearest neighbor methods, one of the most important paradigms in machine learning. Now in one self-contained volume, this book systematically covers key statistical, probabilistic, combinatorial and geometric ideas for understanding, analyzing and developing nearest neighbor methods.

Gérard Biau is a professor at Université Pierre et Marie Curie (Paris). Luc Devroye is a professor at the School of Computer Science at McGill University (Montreal).

Table of Contents

Frontmatter

Density estimation

Frontmatter
Chapter 1. Order statistics and nearest neighbors
Abstract
We start with some basic properties of uniform order statistics. For a general introduction to probability, see Grimmett and Stirzaker (2001). Some of the properties of order statistics presented in this chapter are covered by Rényi (1970); Galambos (1978), and Devroye (1986).
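As an illustration of the basic properties collected here, the i-th order statistic \(U_{(i)}\) of n independent uniform [0, 1] random variables has mean i/(n + 1). A minimal simulation check of this fact (illustrative only, not taken from the book; the sample size and seed are arbitrary):

```python
import numpy as np

# Check E[U_(i)] = i/(n+1) for uniform [0,1] order statistics by simulation.
rng = np.random.default_rng(0)
n, reps = 10, 200_000

samples = rng.uniform(size=(reps, n))
order_stats = np.sort(samples, axis=1)        # row-wise order statistics U_(1) <= ... <= U_(n)

empirical = order_stats.mean(axis=0)          # Monte Carlo estimate of E[U_(i)]
theoretical = np.arange(1, n + 1) / (n + 1)
print(np.round(empirical, 3))
print(np.round(theoretical, 3))               # the two rows agree to a few decimals
```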
Gérard Biau, Luc Devroye
Chapter 2. The nearest neighbor distance
Abstract
If \(\mathbf{X}_{i}\) and \(\mathbf{X}_{j}\) are equidistant from x, i.e., if \(\|\mathbf{X}_{i} -\mathbf{x}\| =\| \mathbf{X}_{j} -\mathbf{x}\|\) for some \(i\neq j\), then we have a distance tie. By convention, ties are broken by comparing indices, that is, by declaring that \(\mathbf{X}_{i}\) is closer to x than \(\mathbf{X}_{j}\) whenever i < j.
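In code, this convention amounts to sorting indices lexicographically by (distance to x, index), so that equal distances are resolved in favor of the smaller index. A minimal sketch (not from the book; the toy data are made up):

```python
import numpy as np

def nn_order(X, x):
    """Indices of the rows of X sorted by distance to x, distance ties broken by smaller index."""
    d = np.linalg.norm(X - x, axis=1)
    # np.lexsort uses the last key as the primary sort key: distance first, index breaks ties
    return np.lexsort((np.arange(len(X)), d))

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 0.0]])   # rows 0 and 1 are equidistant from x
x = np.array([0.0, 0.0])
print(nn_order(X, x))   # [0 1 2]: the tie goes to the smaller index
```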
Gérard Biau, Luc Devroye
Chapter 3. The k-nearest neighbor density estimate
Abstract
A random vector X taking values in \(\mathbb{R}^{d}\) has a (probability) density f with respect to the Lebesgue measure if, for all Borel sets \(A \subseteq \mathbb{R}^{d}\), \(\mathbb{P}\{\mathbf{X} \in A\} =\int _{A}f(\mathbf{x})\mbox{ d}\mathbf{x}\). In other words, if A is a small ball about x, the probability that X falls in A is about f(x) times the volume of A. It thus serves as a tool for computing probabilities of sets and, as a function that reveals the local concentration of probability mass, it may be used to visualize distributions of random variables.
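Reading this relation backwards, with A the smallest ball centered at x that contains k of the n sample points, leads to the k-nearest neighbor density estimate studied in this chapter, in its usual form \(f_{n}(\mathbf{x}) = k/(n\,\lambda (B(\mathbf{x},R_{(k)}(\mathbf{x}))))\), where \(R_{(k)}(\mathbf{x})\) denotes the distance from x to its k-th nearest neighbor. A minimal sketch of this estimate (illustrative only; the uniform test data and the choice of k are arbitrary):

```python
import math
import numpy as np

def knn_density(x, X, k):
    """k-NN density estimate: k / (n * volume of the ball reaching the k-th nearest neighbor)."""
    n, d = X.shape
    r_k = np.sort(np.linalg.norm(X - x, axis=1))[k - 1]          # distance to the k-th nearest neighbor
    ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r_k ** d
    return k / (n * ball_volume)

rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 2))                    # true density is 1 on the unit square
print(knn_density(np.array([0.5, 0.5]), X, k=50))  # should be close to 1
```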
Gérard Biau, Luc Devroye
Chapter 4. Uniform consistency
Abstract
This chapter is devoted to the study of the uniform consistency properties of the k-nearest neighbor density estimate \(f_{n}\). Before embarking on the supremum norm convergence, it is useful to understand the behavior of \(f_{n}\) on bounded densities. We denote the essential supremum (with respect to the Lebesgue measure \(\lambda\)) of the density f by
Gérard Biau, Luc Devroye
Chapter 5. Weighted k-nearest neighbor density estimates
Abstract
There are different ways to weight or smooth the k-nearest neighbor density estimate. Some key ideas are surveyed in this chapter, and consistency theorems are stated for some of them.
Gérard Biau, Luc Devroye
Chapter 6. Local behavior
Abstract
No study of a density estimate is complete without a discussion of its local behavior. That is, given a certain amount of smoothness at x, how fast does \(f_{n}(\mathbf{x})\) tend to \(f(\mathbf{x})\)? It is clear that for any sequence of density estimates, and any sequence \(a_{n} \downarrow 0\), however slow, there exists a density f with x a Lebesgue point of f, such that
$$\displaystyle{\limsup _{n\rightarrow \infty }\frac{\mathbb{E}\left \vert f_{n}(\mathbf{x}) - f(\mathbf{x})\right \vert } {a_{n}} \geq 1.}$$
We will not show this, but just point to a similar theorem for the total variation error in density estimation (Devroye, 1987). However, under smoothness conditions, there is hope to get useful rates of convergence.
Gérard Biau, Luc Devroye
Chapter 7. Entropy estimation
Abstract
Differential entropy, or continuous entropy, is a concept in information theory related to the classical (Shannon) entropy (Shannon, 1948). For a random variable with density f on \(\mathbb{R}^{d}\), it is defined by
$$\displaystyle{ \mathcal{E}(f) = -\int _{\mathbb{R}^{d}}f(\mathbf{x})\log f(\mathbf{x})\mbox{ d}\mathbf{x}, }$$
(7.1)
when this integral exists (with the convention \(0\log 0 = 0\)). If d = 1 and f is the uniform density on [0, a], a > 0, then its differential entropy is
$$\displaystyle{\mathcal{E}(f) = -\int _{0}^{a}\frac{1} {a}\log \frac{1} {a}\mbox{ d}x =\log a.}$$
We see that for a < 1, \(\log a < 0\), so that \(\mathcal{E}(f)\) can be negative. The standard exponential has \(\mathcal{E}(f) = 1\), and the standard Gaussian has \(\mathcal{E}(f) =\log \sqrt{2\pi e}\), to give a few examples.
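Since \(\mathcal{E}(f) = -\mathbb{E}\log f(\mathbf{X})\), these closed-form values are easy to check by a plain Monte Carlo average of \(-\log f\) over a sample drawn from f. A quick numerical check (illustrative only; sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

def mc_entropy(sample, logpdf):
    # E(f) = -E[log f(X)], estimated by averaging -log f(X_i) over a sample from f
    return -np.mean(logpdf(sample))

a = 0.5  # uniform on [0, a]: entropy log(a), negative when a < 1
print(mc_entropy(rng.uniform(0, a, n), lambda x: np.full_like(x, -np.log(a))), np.log(a))
print(mc_entropy(rng.exponential(1.0, n), lambda x: -x), 1.0)                  # standard exponential
print(mc_entropy(rng.standard_normal(n), lambda x: -0.5 * np.log(2 * np.pi) - x**2 / 2),
      np.log(np.sqrt(2 * np.pi * np.e)))                                       # standard Gaussian
```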
Gérard Biau, Luc Devroye

Regression estimation

Frontmatter
Chapter 8. The nearest neighbor regression function estimate
Abstract
Let (X, Y ) be a pair of random variables taking values in \(\mathbb{R}^{d} \times \mathbb{R}\). The goal of regression analysis is to understand how the values of the response variable Y depend on the values of the observation vector X. The objective is to find a Borel measurable function g such that \(\vert Y - g(\mathbf{X})\vert\) is small, where “small” could be defined in terms of the \(L_{p}\) risk \(\mathbb{E}\vert Y - g(\mathbf{X})\vert ^{p}\) (p > 0), for example. Of particular interest is the \(L_{2}\) risk of g,
$$\displaystyle{ \mathbb{E}\left \vert Y - g(\mathbf{X})\right \vert ^{2}. }$$
(8.1)
Gérard Biau, Luc Devroye
Chapter 9. The 1-nearest neighbor regression function estimate
Abstract
Our objective in this short chapter is to analyze some elementary consistency properties of the 1-nearest neighbor regression function estimate. This will also offer the opportunity to familiarize the reader with concepts that will be encountered in the next few chapters. Recall that this very simple estimation procedure is defined by setting
$$\displaystyle{r_{n}(\mathbf{x}) = Y _{(1)}(\mathbf{x}),\quad \mathbf{x} \in \mathbb{R}^{d},}$$
where \((\mathbf{X}_{(1)}(\mathbf{x}),Y _{(1)}(\mathbf{x})),\mathop{\ldots },(\mathbf{X}_{(n)}(\mathbf{x}),Y _{(n)}(\mathbf{x}))\) is a reordering of the data according to increasing values of \(\|\mathbf{X}_{i} -\mathbf{x}\|\), and distance ties are broken by looking at indices.
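A minimal sketch of this estimate (not the book's code): predict at x with the response of the single nearest data point, distance ties going to the smaller index as above. The toy regression function and sample size are arbitrary.

```python
import numpy as np

def one_nn_regress(x, X, Y):
    """1-NN regression estimate r_n(x) = Y_(1)(x), distance ties broken by the smaller index."""
    d = np.linalg.norm(X - x, axis=1)
    nearest = np.lexsort((np.arange(len(X)), d))[0]
    return Y[nearest]

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(2000)    # noisy observations of r(x) = sin(3x)
print(one_nn_regress(np.array([0.2]), X, Y), np.sin(0.6))    # estimate vs. r(0.2)
```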
Gérard Biau, Luc Devroye
Chapter 10. \(L_{p}\)-consistency and Stone’s theorem
Abstract
We know that, whenever \(\mathbb{E}Y ^{2} < \infty \), the regression function \(r(\mathbf{x}) = \mathbb{E}[Y \vert \mathbf{X} = \mathbf{x}]\) achieves the minimal value \(L^{\star } = \mathbb{E}\vert Y - r(\mathbf{X})\vert ^{2}\) of the L 2 risk over all square-integrable functions of X. It is also easy to show, using the independence of (X, Y ) and the sample \(\mathcal{D}_{n}\), that the (conditional) L 2 risk \(\mathbb{E}[\vert Y - r_{n}(\mathbf{X})\vert ^{2}\,\vert \,\mathcal{D}_{n}]\) of an estimate r n of r satisfies
$$\displaystyle{\mathbb{E}\big[\left \vert Y - r_{n}(\mathbf{X})\right \vert ^{2}\,\vert \,\mathcal{D}_{ n}\big] = \mathbb{E}\left \vert Y - r(\mathbf{X})\right \vert ^{2} +\int _{ \mathbb{R}^{d}}\left \vert r_{n}(\mathbf{x}) - r(\mathbf{x})\right \vert ^{2}\mu (\mbox{ d}\mathbf{x}),}$$
where μ is the distribution of X. This identity reveals that the L 2 risk of the estimate r n is close to the optimal value if and only if the L 2 error \(\int _{\mathbb{R}^{d}}\vert r_{n}(\mathbf{x}) - r(\mathbf{x})\vert ^{2}\mu (\mbox{ d}\mathbf{x})\) is close to zero. Therefore, the L 2 error (integrated squared error) is a nice criterion to measure the quality of an estimate.
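For reference, the identity follows from the usual orthogonality argument. Conditioning on \(\mathcal{D}_{n}\) and expanding, a sketch of the computation is
$$\displaystyle{\mathbb{E}\big[\left \vert Y - r_{n}(\mathbf{X})\right \vert ^{2}\,\vert \,\mathcal{D}_{n}\big] = \mathbb{E}\left \vert Y - r(\mathbf{X})\right \vert ^{2} + \mathbb{E}\big[\left \vert r(\mathbf{X}) - r_{n}(\mathbf{X})\right \vert ^{2}\,\vert \,\mathcal{D}_{n}\big] + 2\,\mathbb{E}\big[\big(Y - r(\mathbf{X})\big)\big(r(\mathbf{X}) - r_{n}(\mathbf{X})\big)\,\vert \,\mathcal{D}_{n}\big],}$$
where the cross term vanishes after further conditioning on X (since \(\mathbb{E}[Y - r(\mathbf{X})\,\vert \,\mathbf{X}] = 0\) and (X, Y ) is independent of \(\mathcal{D}_{n}\)), and the middle term equals \(\int _{\mathbb{R}^{d}}\vert r_{n}(\mathbf{x}) - r(\mathbf{x})\vert ^{2}\mu (\mbox{ d}\mathbf{x})\).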
Gérard Biau, Luc Devroye
Chapter 11. Pointwise consistency
Abstract
Theorem 11.1 below is a slight extension of a theorem due to Devroye (1981a). It offers sufficient conditions on the probability weight vector guaranteeing that the (raw) nearest neighbor estimate (8.2) satisfies, for all p ≥ 1.
Gérard Biau, Luc Devroye
Chapter 12. Uniform consistency
Abstract
The supremum creates two problems. First, by moving x about \(\mathbb{R}^{d}\), the data ordering changes; we will count the number of possible data permutations in the second section. Second, we need a uniform condition on the “noise” \(Y - r(\mathbf{X})\) so that the averaging done by the weights \(v_{ni}\) is strong enough. This is addressed in the third section.
Gérard Biau, Luc Devroye
Chapter 13. Advanced properties of uniform order statistics
Abstract
Various properties of \(U_{(1)},\mathop{\ldots },U_{(n)}\), uniform [0, 1] order statistics, will be needed in the analysis that follows. These are collected in the present chapter. The first group of properties is directly related to \(U_{(i)}\) (1 ≤ i ≤ n), while the second group deals with random linear combinations of them.
Gérard Biau, Luc Devroye
Chapter 14. Rates of convergence
Abstract
In this chapter, we study the local rate of convergence of \(r_{n}(\mathbf{x})\) to \(r(\mathbf{x})\). We obtain full information on the first asymptotic term of \(r_{n}(\mathbf{x}) - r(\mathbf{x})\), and are rewarded with (i) a central limit theorem for \(r_{n}(\mathbf{x}) - r(\mathbf{x})\), and (ii) a way of helping the user decide how to choose the weights \(v_{ni}\) of the estimate.
Gérard Biau, Luc Devroye
Chapter 15. Regression: the noiseless case
Abstract
Classical function estimation deals with the estimation of a function r on \(\mathbb{R}^{d}\) from a finite number of points \(\mathbf{x}_{1},\mathop{\ldots },\mathbf{x}_{n}\). Some applications are concerned with \(L_{p}\) errors with respect to the Lebesgue measure on compacts. Others have Monte Carlo purposes in mind, wanting to estimate \(\int _{A}r(\mathbf{x})\mbox{ d}\mathbf{x}\) over a compact set A. The model we study here takes a sample \(\mathbf{X}_{1},\mathop{\ldots },\mathbf{X}_{n}\) of i.i.d. random vectors with a density f on A that is not known.
Gérard Biau, Luc Devroye
Chapter 16. The choice of a nearest neighbor estimate
Abstract
Selecting the estimate within a class of estimates that is optimal in a certain sense is perhaps the ultimate goal of nonparametric estimation. It assumes that the class of estimates is sufficiently rich within the universe of all possible estimates. That the class of nearest neighbor regression function estimates is rich follows not only from its universality, but also from the fact that it achieves rates of convergence for various criteria that are close to the best possible over certain classes of distributions of (X, Y ), a property that is studied in minimax theory (Stone, 1980, 1982).
Gérard Biau, Luc Devroye

Supervised classification

Frontmatter
Chapter 17. Basics of classification
Abstract
Supervised classification (also called pattern recognition, discrimination, or class prediction) is a specific regression problem, where the observation X takes values in \(\mathbb{R}^{d}\) and the random response Y takes values in {0, 1}. Given X, one has to guess the value of Y (also termed the label or class), and this guess is called a decision. Pattern recognition is important in different scientific disciplines, such as medicine, biology, finance, and meteorology. In medicine, for example, one needs to evaluate patients according to their disease risk, and the typical questions for classification are: “Is this person infected?,” “Will this patient respond to the treatment?,” or “Will this patient have serious side effects from using the drug?”—in all these cases, a yes/no or 0∕1 decision has to be made.
Gérard Biau, Luc Devroye
Chapter 18. The nearest neighbor rule: fixed k
Abstract
In this chapter, \((\mathbf{X},Y ) \in \mathbb{R}^{d} \times \{ 0,1\}\), and \((\mathbf{X}_{1},Y _{1}),\mathop{\ldots },(\mathbf{X}_{n},Y _{n})\) are reordered according to increasing values of \(\|\mathbf{X}_{i} -\mathbf{x}\|\). Ties are broken as for regression. The reordered sequence is denoted by \((\mathbf{X}_{(1)}(\mathbf{x}),Y _{(1)}(\mathbf{x})),\mathop{\ldots },(\mathbf{X}_{(n)}(\mathbf{x}),Y _{(n)}(\mathbf{x}))\). As usual, we let \(r(\mathbf{x}) = \mathbb{E}[Y \vert \mathbf{X} = \mathbf{x}]\) and recall, since Y ∈ { 0, 1}, that \(r(\mathbf{x}) = \mathbb{P}\{Y = 1\vert \mathbf{X} = \mathbf{x}\}\).
Gérard Biau, Luc Devroye
Chapter 19. The nearest neighbor rule: variable k
Abstract
Given weights \((v_{n1},\mathop{\ldots },v_{nn})\) satisfying \(\sum _{i=1}^{n}v_{ni} = 1\), the nearest neighbor classifier is defined for \(\mathbf{x} \in \mathbb{R}^{d}\) by
$$\displaystyle{ g_{n}(\mathbf{x}) = \left \{\begin{array}{ll} 1&\mbox{ if $\sum _{i=1}^{n}v_{ni}Y _{(i)}(\mathbf{x}) > 1/2$} \\ 0&\mbox{ otherwise.} \end{array} \right. }$$
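A minimal sketch of this rule (not the book's code): with uniform weights \(v_{ni} = 1/k\) for i ≤ k and 0 otherwise, it reduces to the classical majority vote among the k nearest neighbors of Chapter 18. The toy data and the value of k below are arbitrary.

```python
import numpy as np

def weighted_nn_classify(x, X, Y, v):
    """g_n(x) = 1 if sum_i v_ni * Y_(i)(x) > 1/2, else 0, with labels Y in {0, 1}."""
    d = np.linalg.norm(X - x, axis=1)
    order = np.lexsort((np.arange(len(X)), d))   # reorder by distance, ties by smaller index
    return int(np.dot(v, Y[order]) > 0.5)

rng = np.random.default_rng(0)
n, k = 1000, 25
X = rng.uniform(-1, 1, size=(n, 2))
Y = (X[:, 0] + X[:, 1] > 0).astype(int)          # a noiseless toy labeling

v = np.zeros(n)
v[:k] = 1.0 / k                                  # uniform weights: the fixed-k rule of Chapter 18
print(weighted_nn_classify(np.array([0.3, 0.3]), X, Y, v))   # expected output: 1
```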
Gérard Biau, Luc Devroye
Chapter 20. Appendix
Abstract
For any real-valued function g, define \(g^{+} =\max (g,0)\) and \(g^{-} = -\min (g,0)\). These are called the positive and negative parts of g, respectively, and satisfy the relations
$$\displaystyle{g^{+},g^{-}\geq 0,\quad \vert g\vert = g^{+} + g^{-},\quad \mbox{ and}\quad g = g^{+} - g^{-}.}$$
We recall that a real-valued random variable X is said to be integrable if \(\mathbb{E}\vert X\vert < \infty \), or, equivalently, if \(\mathbb{E}X^{+} < \infty \) and \(\mathbb{E}X^{-} < \infty \). In that case, \(\mathbb{E}X = \mathbb{E}X^{+} - \mathbb{E}X^{-}\). We use \(\|X\|_{\infty }\) to denote the essential supremum of X:
$$\displaystyle{\|X\|_{\infty } =\inf \big\{ t \geq 0: \mathbb{P}\left \{\vert X\vert > t\right \} = 0\big\}.}$$
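These identities are straightforward to verify numerically for a finite sample of values of g; a trivial sketch (illustrative only):

```python
import numpy as np

g = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])    # some values of a real-valued function g

g_plus = np.maximum(g, 0.0)                   # positive part g^+
g_minus = -np.minimum(g, 0.0)                 # negative part g^-

assert np.all(g_plus >= 0) and np.all(g_minus >= 0)
assert np.allclose(np.abs(g), g_plus + g_minus)   # |g| = g^+ + g^-
assert np.allclose(g, g_plus - g_minus)           # g  = g^+ - g^-
print(g_plus, g_minus)
```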
Gérard Biau, Luc Devroye
Backmatter
Metadata
Title: Lectures on the Nearest Neighbor Method
Authors: Gérard Biau, Luc Devroye
Copyright Year: 2015
Electronic ISBN: 978-3-319-25388-6
Print ISBN: 978-3-319-25386-2
DOI: https://doi.org/10.1007/978-3-319-25388-6