
2015 | Book

Lectures on the Nearest Neighbor Method


About this book

This text presents a wide-ranging and rigorous overview of nearest neighbor methods, one of the most important paradigms in machine learning. Now in one self-contained volume, this book systematically covers key statistical, probabilistic, combinatorial and geometric ideas for understanding, analyzing and developing nearest neighbor methods.

Gérard Biau is a professor at Université Pierre et Marie Curie (Paris). Luc Devroye is a professor at the School of Computer Science at McGill University (Montreal).

Table of Contents

Frontmatter

Density estimation

Frontmatter
Chapter 1. Order statistics and nearest neighbors
Abstract
We start with some basic properties of uniform order statistics. For a general introduction to probability, see Grimmett and Stirzaker (2001). Some of the properties of order statistics presented in this chapter are covered by Rényi (1970); Galambos (1978), and Devroye (1986).
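As an illustration of the basic properties collected here, the i-th order statistic \(U_{(i)}\) of n independent uniform [0, 1] random variables has mean i/(n + 1). A minimal simulation check of this fact (illustrative only, not taken from the book; the sample size and seed are arbitrary):

```python
import numpy as np

# Check E[U_(i)] = i/(n+1) for uniform [0,1] order statistics by simulation.
rng = np.random.default_rng(0)
n, reps = 10, 200_000

samples = rng.uniform(size=(reps, n))
order_stats = np.sort(samples, axis=1)        # row-wise order statistics U_(1) <= ... <= U_(n)

empirical = order_stats.mean(axis=0)          # Monte Carlo estimate of E[U_(i)]
theoretical = np.arange(1, n + 1) / (n + 1)
print(np.round(empirical, 3))
print(np.round(theoretical, 3))               # the two rows agree to a few decimals
```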
Gérard Biau, Luc Devroye
Chapter 2. The nearest neighbor distance
Abstract
If \(\mathbf{X}_{i}\) and \(\mathbf{X}_{j}\) are equidistant from x, i.e., if \(\|\mathbf{X}_{i} -\mathbf{x}\| =\| \mathbf{X}_{j} -\mathbf{x}\|\) for some \(i\neq j\), then we have a distance tie. By convention, ties are broken by comparing indices, that is, by declaring that \(\mathbf{X}_{i}\) is closer to x than \(\mathbf{X}_{j}\) whenever i < j.
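In code, this convention amounts to sorting indices lexicographically by (distance to x, index), so that equal distances are resolved in favor of the smaller index. A minimal sketch (not from the book; the toy data are made up):

```python
import numpy as np

def nn_order(X, x):
    """Indices of the rows of X sorted by distance to x, distance ties broken by smaller index."""
    d = np.linalg.norm(X - x, axis=1)
    # np.lexsort uses the last key as the primary sort key: distance first, index breaks ties
    return np.lexsort((np.arange(len(X)), d))

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 0.0]])   # rows 0 and 1 are equidistant from x
x = np.array([0.0, 0.0])
print(nn_order(X, x))   # [0 1 2]: the tie goes to the smaller index
```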
Gérard Biau, Luc Devroye
Chapter 3. The k-nearest neighbor density estimate
Abstract
A random vector X taking values in \(\mathbb{R}^{d}\) has a (probability) density f with respect to the Lebesgue measure if, for all Borel sets \(A \subseteq \mathbb{R}^{d}\), \(\mathbb{P}\{\mathbf{X} \in A\} =\int _{A}f(\mathbf{x})\mbox{ d}\mathbf{x}\). In other words, if A is a small ball about x, the probability that X falls in A is about f(x) times the volume of A. It thus serves as a tool for computing probabilities of sets and, as a function that reveals the local concentration of probability mass, it may be used to visualize distributions of random variables.
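Reading this relation backwards, with A the smallest ball centered at x that contains k of the n sample points, leads to the k-nearest neighbor density estimate studied in this chapter, in its usual form \(f_{n}(\mathbf{x}) = k/(n\,\lambda (B(\mathbf{x},R_{(k)}(\mathbf{x}))))\), where \(R_{(k)}(\mathbf{x})\) denotes the distance from x to its k-th nearest neighbor. A minimal sketch of this estimate (illustrative only; the uniform test data and the choice of k are arbitrary):

```python
import math
import numpy as np

def knn_density(x, X, k):
    """k-NN density estimate: k / (n * volume of the ball reaching the k-th nearest neighbor)."""
    n, d = X.shape
    r_k = np.sort(np.linalg.norm(X - x, axis=1))[k - 1]          # distance to the k-th nearest neighbor
    ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r_k ** d
    return k / (n * ball_volume)

rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 2))                    # true density is 1 on the unit square
print(knn_density(np.array([0.5, 0.5]), X, k=50))  # should be close to 1
```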
Gérard Biau, Luc Devroye
Chapter 4. Uniform consistency
Abstract
This chapter is devoted to the study of the uniform consistency properties of the k-nearest neighbor density estimate \(f_{n}\). Before embarking on the supremum norm convergence, it is useful to understand the behavior of \(f_{n}\) on bounded densities. We denote the essential supremum (with respect to the Lebesgue measure \(\lambda\)) of the density f by
Gérard Biau, Luc Devroye
Chapter 5. Weighted k-nearest neighbor density estimates
Abstract
There are different ways to weight or smooth the k-nearest neighbor density estimate. Some key ideas are surveyed in this chapter, and consistency theorems are stated for some of them.
Gérard Biau, Luc Devroye
Chapter 6. Local behavior
Abstract
No study of a density estimate is complete without a discussion of its local behavior. That is, given a certain amount of smoothness at x, how fast does \(f_{n}(\mathbf{x})\) tend to \(f(\mathbf{x})\)? It is clear that for any sequence of density estimates, and any sequence \(a_{n} \downarrow 0\), however slow, there exists a density f with x a Lebesgue point of f, such that
$$\displaystyle{\limsup _{n\rightarrow \infty }\frac{\mathbb{E}\left \vert f_{n}(\mathbf{x}) - f(\mathbf{x})\right \vert } {a_{n}} \geq 1.}$$
We will not show this, but just point to a similar theorem for the total variation error in density estimation (Devroye, 1987). However, under smoothness conditions, there is hope to get useful rates of convergence.
Gérard Biau, Luc Devroye
Chapter 7. Entropy estimation
Abstract
Differential entropy, or continuous entropy, is a concept in information theory related to the classical (Shannon) entropy (Shannon, 1948). For a random variable with density f on \(\mathbb{R}^{d}\), it is defined by
$$\displaystyle{ \mathcal{E}(f) = -\int _{\mathbb{R}^{d}}f(\mathbf{x})\log f(\mathbf{x})\mbox{ d}\mathbf{x}, }$$
(7.1)
when this integral exists (with the convention \(0\log 0 = 0\)). If d = 1 and f is the uniform density on [0, a], a > 0, then its differential entropy is
$$\displaystyle{\mathcal{E}(f) = -\int _{0}^{a}\frac{1} {a}\log \frac{1} {a}\mbox{ d}x =\log a.}$$
We see that for a < 1, \(\log a < 0\), so that \(\mathcal{E}(f)\) can be negative. The standard exponential has \(\mathcal{E}(f) = 1\), and the standard Gaussian has \(\mathcal{E}(f) =\log \sqrt{2\pi e}\), to give a few examples.
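Since \(\mathcal{E}(f) = -\mathbb{E}\log f(\mathbf{X})\), these closed-form values are easy to check by a plain Monte Carlo average of \(-\log f\) over a sample drawn from f. A quick numerical check (illustrative only; sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

def mc_entropy(sample, logpdf):
    # E(f) = -E[log f(X)], estimated by averaging -log f(X_i) over a sample from f
    return -np.mean(logpdf(sample))

a = 0.5  # uniform on [0, a]: entropy log(a), negative when a < 1
print(mc_entropy(rng.uniform(0, a, n), lambda x: np.full_like(x, -np.log(a))), np.log(a))
print(mc_entropy(rng.exponential(1.0, n), lambda x: -x), 1.0)                  # standard exponential
print(mc_entropy(rng.standard_normal(n), lambda x: -0.5 * np.log(2 * np.pi) - x**2 / 2),
      np.log(np.sqrt(2 * np.pi * np.e)))                                       # standard Gaussian
```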
Gérard Biau, Luc Devroye

Regression estimation

Frontmatter
Chapter 8. The nearest neighbor regression function estimate
Abstract
Let (X, Y ) be a pair of random variables taking values in \(\mathbb{R}^{d} \times \mathbb{R}\). The goal of regression analysis is to understand how the values of the response variable Y depend on the values of the observation vector X. The objective is to find a Borel measurable function g such that \(\vert Y - g(\mathbf{X})\vert\) is small, where “small” could be defined in terms of the \(L_{p}\) risk \(\mathbb{E}\vert Y - g(\mathbf{X})\vert ^{p}\) (p > 0), for example. Of particular interest is the \(L_{2}\) risk of g,
$$\displaystyle{ \mathbb{E}\left \vert Y - g(\mathbf{X})\right \vert ^{2}. }$$
(8.1)
Gérard Biau, Luc Devroye
Chapter 9. The 1-nearest neighbor regression function estimate
Abstract
Our objective in this short chapter is to analyze some elementary consistency properties of the 1-nearest neighbor regression function estimate. This will also offer the opportunity to familiarize the reader with concepts that will be encountered in the next few chapters. Recall that this very simple estimation procedure is defined by setting
$$\displaystyle{r_{n}(\mathbf{x}) = Y _{(1)}(\mathbf{x}),\quad \mathbf{x} \in \mathbb{R}^{d},}$$
where \((\mathbf{X}_{(1)}(\mathbf{x}),Y _{(1)}(\mathbf{x})),\mathop{\ldots },(\mathbf{X}_{(n)}(\mathbf{x}),Y _{(n)}(\mathbf{x}))\) is a reordering of the data according to increasing values of \(\|\mathbf{X}_{i} -\mathbf{x}\|\), and distance ties are broken by looking at indices.
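A minimal sketch of this estimate (not the book's code): predict at x with the response of the single nearest data point, distance ties going to the smaller index as above. The toy regression function and sample size are arbitrary.

```python
import numpy as np

def one_nn_regress(x, X, Y):
    """1-NN regression estimate r_n(x) = Y_(1)(x), distance ties broken by the smaller index."""
    d = np.linalg.norm(X - x, axis=1)
    nearest = np.lexsort((np.arange(len(X)), d))[0]
    return Y[nearest]

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(2000)    # noisy observations of r(x) = sin(3x)
print(one_nn_regress(np.array([0.2]), X, Y), np.sin(0.6))    # estimate vs. r(0.2)
```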
Gérard Biau, Luc Devroye
Chapter 10. \(L_{p}\)-consistency and Stone’s theorem
Abstract
We know that, whenever \(\mathbb{E}Y ^{2} < \infty \), the regression function \(r(\mathbf{x}) = \mathbb{E}[Y \vert \mathbf{X} = \mathbf{x}]\) achieves the minimal value \(L^{\star } = \mathbb{E}\vert Y - r(\mathbf{X})\vert ^{2}\) of the L 2 risk over all square-integrable functions of X. It is also easy to show, using the independence of (X, Y ) and the sample \(\mathcal{D}_{n}\), that the (conditional) L 2 risk \(\mathbb{E}[\vert Y - r_{n}(\mathbf{X})\vert ^{2}\,\vert \,\mathcal{D}_{n}]\) of an estimate r n of r satisfies
$$\displaystyle{\mathbb{E}\big[\left \vert Y - r_{n}(\mathbf{X})\right \vert ^{2}\,\vert \,\mathcal{D}_{ n}\big] = \mathbb{E}\left \vert Y - r(\mathbf{X})\right \vert ^{2} +\int _{ \mathbb{R}^{d}}\left \vert r_{n}(\mathbf{x}) - r(\mathbf{x})\right \vert ^{2}\mu (\mbox{ d}\mathbf{x}),}$$
where μ is the distribution of X. This identity reveals that the L 2 risk of the estimate r n is close to the optimal value if and only if the L 2 error \(\int _{\mathbb{R}^{d}}\vert r_{n}(\mathbf{x}) - r(\mathbf{x})\vert ^{2}\mu (\mbox{ d}\mathbf{x})\) is close to zero. Therefore, the L 2 error (integrated squared error) is a nice criterion to measure the quality of an estimate.
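For reference, the identity follows from the usual orthogonality argument. Conditioning on \(\mathcal{D}_{n}\) and expanding, a sketch of the computation is
$$\displaystyle{\mathbb{E}\big[\left \vert Y - r_{n}(\mathbf{X})\right \vert ^{2}\,\vert \,\mathcal{D}_{n}\big] = \mathbb{E}\left \vert Y - r(\mathbf{X})\right \vert ^{2} + \mathbb{E}\big[\left \vert r(\mathbf{X}) - r_{n}(\mathbf{X})\right \vert ^{2}\,\vert \,\mathcal{D}_{n}\big] + 2\,\mathbb{E}\big[\big(Y - r(\mathbf{X})\big)\big(r(\mathbf{X}) - r_{n}(\mathbf{X})\big)\,\vert \,\mathcal{D}_{n}\big],}$$
where the cross term vanishes after further conditioning on X (since \(\mathbb{E}[Y - r(\mathbf{X})\,\vert \,\mathbf{X}] = 0\) and (X, Y ) is independent of \(\mathcal{D}_{n}\)), and the middle term equals \(\int _{\mathbb{R}^{d}}\vert r_{n}(\mathbf{x}) - r(\mathbf{x})\vert ^{2}\mu (\mbox{ d}\mathbf{x})\).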
Gérard Biau, Luc Devroye
Chapter 11. Pointwise consistency
Abstract
Theorem 11.1 below is a slight extension of a theorem due to Devroye (1981a). It offers sufficient conditions on the probability weight vector guaranteeing that the (raw) nearest neighbor estimate (8.2) satisfies, for all p ≥ 1.
Gérard Biau, Luc Devroye
Chapter 12. Uniform consistency
Abstract
The supremum creates two problems. First, by moving x about \(\mathbb{R}^{d}\), the data ordering changes; we will count the number of possible data permutations in the second section. Second, we need a uniform condition on the “noise” \(Y - r(\mathbf{X})\) so that the averaging done by the weights \(v_{ni}\) is strong enough. This is addressed in the third section.
Gérard Biau, Luc Devroye
Chapter 13. Advanced properties of uniform order statistics
Abstract
Various properties of \(U_{(1)},\mathop{\ldots },U_{(n)}\), uniform [0, 1] order statistics, will be needed in the analysis that follows. These are collected in the present chapter. The first group of properties is directly related to \(U_{(i)}\) (1 ≤ i ≤ n), while the second group deals with random linear combinations of them.
Gérard Biau, Luc Devroye
Chapter 14. Rates of convergence
Abstract
In this chapter, we study the local rate of convergence of \(r_{n}(\mathbf{x})\) to \(r(\mathbf{x})\). We obtain full information on the first asymptotic term of \(r_{n}(\mathbf{x}) - r(\mathbf{x})\), and are rewarded with (i) a central limit theorem for \(r_{n}(\mathbf{x}) - r(\mathbf{x})\), and (ii) a way of helping the user decide how to choose the weights \(v_{ni}\) of the estimate.
Gérard Biau, Luc Devroye
Chapter 15. Regression: the noiseless case
Abstract
Classical function estimation deals with the estimation of a function r on \(\mathbb{R}^{d}\) from a finite number of points \(\mathbf{x}_{1},\mathop{\ldots },\mathbf{x}_{n}\). Some applications are concerned with \(L_{p}\) errors with respect to the Lebesgue measure on compacts. Others have Monte Carlo purposes in mind, wanting to estimate \(\int _{A}r(\mathbf{x})\mbox{ d}\mathbf{x}\) over a compact set A. The model we study here takes a sample \(\mathbf{X}_{1},\mathop{\ldots },\mathbf{X}_{n}\) of i.i.d. random vectors with a density f on A that is not known.
Gérard Biau, Luc Devroye
Chapter 16. The choice of a nearest neighbor estimate
Abstract
Selecting the estimate within a class of estimates that is optimal in a certain sense is perhaps the ultimate goal of nonparametric estimation. It assumes that the class of estimates is sufficiently rich within the universe of all possible estimates. That the class of nearest neighbor regression function estimates is rich follows not only from its universality, but also from the fact that it achieves rates of convergence for various criteria that are close to the best possible over certain classes of distributions of (X, Y ), a property that is studied in minimax theory (Stone, 1980, 1982).
Gérard Biau, Luc Devroye

Supervised classification

Frontmatter
Chapter 17. Basics of classification
Abstract
Supervised classification (also called pattern recognition, discrimination, or class prediction) is a specific regression problem, where the observation X takes values in \(\mathbb{R}^{d}\) and the random response Y takes values in {0, 1}. Given X, one has to guess the value of Y (also termed the label or class), and this guess is called a decision. Pattern recognition is important in different scientific disciplines, such as medicine, biology, finance, and meteorology. In medicine, for example, one needs to evaluate patients according to their disease risk, and the typical questions for classification are: “Is this person infected?,” “Will this patient respond to the treatment?,” or “Will this patient have serious side effects from using the drug?”—in all these cases, a yes/no or 0∕1 decision has to be made.
Gérard Biau, Luc Devroye
Chapter 18. The nearest neighbor rule: fixed k
Abstract
In this chapter, \((\mathbf{X},Y ) \in \mathbb{R}^{d} \times \{ 0,1\}\), and \((\mathbf{X}_{1},Y _{1}),\mathop{\ldots },(\mathbf{X}_{n},Y _{n})\) are reordered according to increasing values of \(\|\mathbf{X}_{i} -\mathbf{x}\|\). Ties are broken as for regression. The reordered sequence is denoted by \((\mathbf{X}_{(1)}(\mathbf{x}),Y _{(1)}(\mathbf{x})),\mathop{\ldots },(\mathbf{X}_{(n)}(\mathbf{x}),Y _{(n)}(\mathbf{x}))\). As usual, we let \(r(\mathbf{x}) = \mathbb{E}[Y \vert \mathbf{X} = \mathbf{x}]\) and recall, since Y ∈ { 0, 1}, that \(r(\mathbf{x}) = \mathbb{P}\{Y = 1\vert \mathbf{X} = \mathbf{x}\}\).
Gérard Biau, Luc Devroye
Chapter 19. The nearest neighbor rule: variable k
Abstract
Given weights \((v_{n1},\mathop{\ldots },v_{nn})\) satisfying \(\sum _{i=1}^{n}v_{ni} = 1\), the nearest neighbor classifier is defined for \(\mathbf{x} \in \mathbb{R}^{d}\) by
$$\displaystyle{ g_{n}(\mathbf{x}) = \left \{\begin{array}{ll} 1&\mbox{ if $\sum _{i=1}^{n}v_{ni}Y _{(i)}(\mathbf{x}) > 1/2$} \\ 0&\mbox{ otherwise.} \end{array} \right. }$$
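A minimal sketch of this rule (not the book's code): with uniform weights \(v_{ni} = 1/k\) for i ≤ k and 0 otherwise, it reduces to the classical majority vote among the k nearest neighbors of Chapter 18. The toy data and the value of k below are arbitrary.

```python
import numpy as np

def weighted_nn_classify(x, X, Y, v):
    """g_n(x) = 1 if sum_i v_ni * Y_(i)(x) > 1/2, else 0, with labels Y in {0, 1}."""
    d = np.linalg.norm(X - x, axis=1)
    order = np.lexsort((np.arange(len(X)), d))   # reorder by distance, ties by smaller index
    return int(np.dot(v, Y[order]) > 0.5)

rng = np.random.default_rng(0)
n, k = 1000, 25
X = rng.uniform(-1, 1, size=(n, 2))
Y = (X[:, 0] + X[:, 1] > 0).astype(int)          # a noiseless toy labeling

v = np.zeros(n)
v[:k] = 1.0 / k                                  # uniform weights: the fixed-k rule of Chapter 18
print(weighted_nn_classify(np.array([0.3, 0.3]), X, Y, v))   # expected output: 1
```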
Gérard Biau, Luc Devroye
Chapter 20. Appendix
Abstract
For any real-valued function g, define \(g^{+} =\max (g,0)\) and \(g^{-} = -\min (g,0)\). These are called the positive and negative parts of g, respectively, and satisfy the relations
$$\displaystyle{g^{+},g^{-}\geq 0,\quad \vert g\vert = g^{+} + g^{-},\quad \mbox{ and}\quad g = g^{+} - g^{-}.}$$
We recall that a real-valued random variable X is said to be integrable if \(\mathbb{E}\vert X\vert < \infty \), or, equivalently, if \(\mathbb{E}X^{+} < \infty \) and \(\mathbb{E}X^{-} < \infty \). In that case, \(\mathbb{E}X = \mathbb{E}X^{+} - \mathbb{E}X^{-}\). We use \(\|X\|_{\infty }\) to denote the essential supremum of X:
$$\displaystyle{\|X\|_{\infty } =\inf \big\{ t \geq 0: \mathbb{P}\left \{\vert X\vert > t\right \} = 0\big\}.}$$
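These identities are straightforward to verify numerically for a finite sample of values of g; a trivial sketch (illustrative only):

```python
import numpy as np

g = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])    # some values of a real-valued function g

g_plus = np.maximum(g, 0.0)                   # positive part g^+
g_minus = -np.minimum(g, 0.0)                 # negative part g^-

assert np.all(g_plus >= 0) and np.all(g_minus >= 0)
assert np.allclose(np.abs(g), g_plus + g_minus)   # |g| = g^+ + g^-
assert np.allclose(g, g_plus - g_minus)           # g  = g^+ - g^-
print(g_plus, g_minus)
```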
Gérard Biau, Luc Devroye
Backmatter
Metadata
Title: Lectures on the Nearest Neighbor Method
Authors: Gérard Biau, Luc Devroye
Copyright Year: 2015
Electronic ISBN: 978-3-319-25388-6
Print ISBN: 978-3-319-25386-2
DOI: https://doi.org/10.1007/978-3-319-25388-6