Minimax estimation of the integral of a power of a density

doi:10.1016/j.spl.2008.07.001

Statistics & Probability Letters

Volume 78, Issue 18, 15 December 2008, Pages 3307-3311

https://doi.org/10.1016/j.spl.2008.07.001

Abstract

We construct an estimator of $\int f^{p} (z) d z$ , based on a random sample of size $n$ from a density $f$ on the unit cube in $R^{d}$ . This estimator achieves the minimax rate for $f$ known to belong to a multiple of the unit ball in a Hölder space of order $α$ , where $α \leq d / 4$ . We are mostly interested in the case that the power $p$ is larger than 2 and/or the dimension $d$ is large.

Introduction

Suppose that we observe an i.i.d. sample $Z_{1}, \dots, Z_{n}$ from a density $f$ on the $d$ -dimensional cube ${[0, 1]}^{d}$ . We wish to estimate the functional $ψ (f; p) = \int f^{p} (z) d z$ , for a given, known integer $p \geq 2$ , when it is known that $f$ belongs to the Hölder space $C^{α} {[0, 1]}^{d}$ .

We concentrate on the case that the regularity $α$ is low relative to the dimension: $α \leq d / 4 . \qquad (1.1)$ In this case the square minimax rate of estimation over the unit ball of $C^{α} {[0, 1]}^{d}$ is known (see Birgé and Massart (1995)) to be no faster than $r_{n} ≔ n^{- 8 α / (d + 4 α)} .$ For values of $α$ below the cut-off (1.1) this rate is slower than the standard parametric rate $n^{- 1}$ , and for large $d$ it can be slow even for high regularity levels $α$ . In this paper we show that the lower bound is sharp by constructing an estimator with mean square error of order $r_{n}$ .
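As a small numerical illustration of the rate $r_{n} = n^{- 8 α / (d + 4 α)}$ in the regime $α \leq d / 4$ , the following sketch evaluates the exponent and the rate. The function names `rate_exponent` and `squared_minimax_rate` are hypothetical and chosen only for this example; the sketch deliberately refuses values $α > d / 4$ , which fall outside the regime considered here.

```python
def rate_exponent(alpha: float, d: int) -> float:
    """Exponent e such that r_n = n ** (-e), valid only for 0 < alpha <= d / 4."""
    if alpha <= 0 or d <= 0:
        raise ValueError("alpha and d must be positive")
    if alpha > d / 4:
        raise ValueError("this sketch only covers the regime alpha <= d / 4")
    return 8 * alpha / (d + 4 * alpha)


def squared_minimax_rate(n: int, alpha: float, d: int) -> float:
    """Evaluate r_n = n ** (-8 alpha / (d + 4 alpha)) for a sample of size n."""
    return n ** (-rate_exponent(alpha, d))


# At the boundary alpha = d / 4 the exponent equals 1 (the parametric rate n ** -1);
# below the boundary the exponent is strictly smaller than 1.
boundary = rate_exponent(alpha=1.0, d=4)
interior = rate_exponent(alpha=0.5, d=4)
```

At the boundary $α = d / 4$ the exponent is exactly 1, so $r_{n}$ coincides with $n^{- 1}$ ; for smaller $α$ the exponent drops below 1.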

This problem was previously considered by many authors, including Birgé and Massart (1995), Bickel and Ritov (1988), Laurent and Massart (2000) and Emery et al. (2000), as a canonical example of a nonlinear functional. A simple construction of a minimax estimator for the case $p = 2$ is given in Laurent (1996, 1997), and the case that $d = 1$ and $p = 3$ is covered by Kerkyacharian and Picard (1996). The authors of the latter paper also indicate that their construction extends to general smooth functionals (Kerkyacharian and Picard, 1996, Section 5).

Similarly to the constructions by these authors, our estimator is based on approximations of $f$ in a basis and an analysis of the bias of these approximations. The final estimator is a $p$ th-order $U$ -statistic with a kernel determined by three approximation levels, but otherwise given by a simple and direct formula. Unlike Kerkyacharian and Picard (1996), our construction allows for general approximation schemes, not limited to the Haar basis. In fact, the Haar basis cannot be used in the case $α > 1$ , as it gives suboptimal approximation in this case. The case $α > 1$ can arise under (1.1) if $d > 4$ . Not using the Haar basis leads to additional bias terms, which need to be estimated, whence our estimator is different from the one in Kerkyacharian and Picard (1996), both for small and large $d$ . Nevertheless, the use of a general basis has led us to simpler formulas.

The paper is organized as follows. In Section 2 we introduce the projection kernels used for the construction of our estimator. In Section 3 we very briefly mention the estimator in the quadratic case. In Section 4 we present our estimator for $p \geq 3$ and state the main result of the paper. Section 5 contains the proof of the main result.

Throughout the paper we use the following notation. Given a function $b = b_{m} : {({[0, 1]}^{d})}^{m} \to R$ of $m$ arguments, the $U$ -statistic with kernel $b$ is written $U_{n} b = \frac{(n - m)!}{n!} \sum_{1 \leq i_{1} \neq i_{2} \neq \dots \neq i_{m} \leq n} b (Z_{i_{1}}, \dots, Z_{i_{m}}) ,$ where the sum is over all $m$ -tuples of distinct indices. The “kernel” $b$ need not be permutation symmetric in this definition. As an alternative notation for this $U$ -statistic we use $U_{n} b_{m} (Z_{i_{1}}, \dots, Z_{i_{m}})$ , where the unspecified indices $i_{1}, \dots, i_{m}$ serve as a reminder of the arguments involved in the $U$ -statistic. The notation $a ≲ b$ means that $a \leq C b$ for a constant $C$ that is fixed within the context. Undelimited integrals are silently understood to be over the sample space ${[0, 1]}^{d}$ .
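The definition of $U_{n} b$ can be made concrete with a brute-force sketch that averages the kernel over all ordered $m$ -tuples of distinct indices. The name `u_statistic` is hypothetical, and the enumeration costs on the order of $n^{m}$ kernel evaluations, so it is meant only to illustrate the definition on small samples.

```python
from itertools import permutations
from math import perm


def u_statistic(kernel, sample, m):
    """Average `kernel` over all ordered m-tuples of distinct sample points.

    The kernel need not be symmetric: every ordering of every m-subset is
    visited, and the normalisation is (n - m)! / n! = 1 / perm(n, m).
    """
    n = len(sample)
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    total = 0.0
    for indices in permutations(range(n), m):
        total += kernel(*(sample[i] for i in indices))
    return total / perm(n, m)


# Example with the (hypothetical) product kernel b(x, y) = x * y on a toy sample.
toy = [1.0, 2.0, 3.0]
value = u_statistic(lambda x, y: x * y, toy, 2)  # (6**2 - 14) / 6 = 22 / 6
```

For $m = 1$ the statistic reduces to the sample mean of the kernel values.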

Projections

Our estimator can be viewed as an unbiased estimator of an approximation to the functional $ψ (f; p)$ , constructed from approximations ${\bar{f}}_{k}$ to $f$ . In the case $p > 2$ we need to combine three such approximations, with different values of $k$ , each taken as a projection onto a $k$ -dimensional space.

We use a fixed orthogonal projection $K_{k} : L_{2} {[0, 1]}^{d} \to L_{2} {[0, 1]}^{d}$ given by a kernel operator, with the kernel denoted by the same symbol as the operator: $K_{k} f (u) = \int K_{k} (u, v) f (v) d v$ . The projection property $K_{k}^{2} = K_{k}$ of the

Quadratic functional

To estimate the quadratic functional $ψ (f; 2)$ we estimate the approximation $ψ (K_{k} f; 2)$ unbiasedly by the second-order $U$ -statistic $U_{n} K_{k}$ . For appropriate projections this is precisely the estimator considered by Laurent (1996, 1997) and Kerkyacharian and Picard (1996), and used in Robins and van der Vaart (2006) to construct adaptive confidence sets. Thanks to (2.3) and the orthogonality of the projection, the bias is of the order $| ψ (K_{k} f; 2) - ψ (f; 2) | = \int {(K_{k} f - f)}^{2} (z) d z ≲ {(\frac{1}{k})}^{2 α / d} .$ By standard computations on $U$ -statistics (e.g. van der Vaart
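The second-order $U$ -statistic $U_{n} K_{k}$ can be sketched for $d = 1$ under an assumption not made in the text: that $K_{k}$ is the Haar (histogram) projection kernel $K_{k} (u, v) = k \, 1 \{ u, v \text{ in the same bin of width } 1 / k \}$ . With this choice the double sum collapses to bin counts. The names `haar_bin` and `quadratic_u_estimate` are hypothetical.

```python
from collections import Counter


def haar_bin(z: float, k: int) -> int:
    """Index of the width-1/k bin of [0, 1] containing z (z == 1.0 goes to the last bin)."""
    return min(int(z * k), k - 1)


def quadratic_u_estimate(sample, k: int) -> float:
    """U_n K_k for the assumed histogram kernel K_k(u, v) = k * 1{same bin}, d = 1.

    sum_{i != j} K_k(Z_i, Z_j) = k * sum_b N_b * (N_b - 1), where N_b is the
    number of sample points in bin b, so no explicit double loop is needed.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    counts = Counter(haar_bin(z, k) for z in sample)
    same_bin_ordered_pairs = sum(c * (c - 1) for c in counts.values())
    return k * same_bin_ordered_pairs / (n * (n - 1))


# Toy sample: two points in [0, 0.5) and two in [0.5, 1], with k = 2 bins.
estimate = quadratic_u_estimate([0.1, 0.2, 0.6, 0.9], k=2)  # 2 * 4 / 12 = 2 / 3
```

Other projection kernels would need the generic double sum instead of the bin-count shortcut.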

Main result

The estimation of $ψ (f; p)$ for $p \geq 3$ necessitates a more elaborate approximation scheme. We use three levels $k_{1} \leq k_{2} \leq k_{3}$ of truncation and define a $p$ th-order kernel by $b_{k_{1}, k_{2}, k_{3}} (Z_{1}, \dots, Z_{p}) = \sum_{0 \leq t \leq 2} \binom{p}{t} \int \prod_{s = 1}^{p - t} K_{k_{1}} (z, Z_{s}) \prod_{r = p - t + 1}^{p} (K_{k_{3}} (z, Z_{r}) - K_{k_{1}} (z, Z_{r})) d z + \binom{p}{3} \int \prod_{s = 1}^{p - 3} K_{k_{1}} (z, Z_{s}) \prod_{r = p - 2}^{p} (K_{k_{2}} (z, Z_{r}) - K_{k_{1}} (z, Z_{r})) d z + 3 \binom{p}{3} \int \prod_{s = 1}^{p - 3} K_{k_{1}} (z, Z_{s}) \prod_{r = p - 2}^{p - 1} (K_{k_{2}} (z, Z_{r}) - K_{k_{1}} (z, Z_{r})) \times (K_{k_{3}} (z, Z_{p}) - K_{k_{2}} (z, Z_{p})) d z .$
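To see what this kernel estimates, one can take its expectation term by term. The following is a sketch, assuming that $Z_{1}, \dots, Z_{p}$ are independent draws from $f$ and that expectation and integration over $z$ may be interchanged; it uses only the identity $E K_{k} (z, Z_{1}) = \int K_{k} (z, v) f (v) d v = K_{k} f (z)$ .

```latex
\begin{aligned}
E\, b_{k_1,k_2,k_3}(Z_1,\dots,Z_p)
  &= \sum_{0 \le t \le 2} \binom{p}{t}
     \int (K_{k_1} f)^{p-t}(z)\, (K_{k_3} f - K_{k_1} f)^{t}(z)\, dz \\
  &\quad + \binom{p}{3}
     \int (K_{k_1} f)^{p-3}(z)\, (K_{k_2} f - K_{k_1} f)^{3}(z)\, dz \\
  &\quad + 3 \binom{p}{3}
     \int (K_{k_1} f)^{p-3}(z)\, (K_{k_2} f - K_{k_1} f)^{2}(z)\,
          (K_{k_3} f - K_{k_2} f)(z)\, dz .
\end{aligned}
```

Read this way, the first sum resembles the binomial expansion of $(K_{k_{3}} f)^{p}$ around $K_{k_{1}} f$ up to second order, while the last two terms supply part of the third-order term at the intermediate level $k_{2}$ ; this interpretation is offered as a reading aid rather than as the authors' own derivation.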

Theorem 4.1

Let $K_{k}$ be a projection kernel satisfying (2.2), (2.3). Fix $p \geq 3$ . For sequences $k_{1} (n) \sim n$ , $k_{3} (n) \sim n^{2 / (1 + 4 α / d)}$ and $k_{2} (n)$ such that $n^{(3 / 2 - 2 α / d}$

Proof

In this section we present two lemmas followed by the proof of the main theorem. Throughout the section we assume the conditions of Theorem 4.1. The upper bounds are uniform in $f$ ranging over a fixed multiple of the unit ball in $C^{α} {[0, 1]}^{d}$ .

Lemma 5.1

For any positive integers $m, k_{1}, \dots, k_{m}$ and a bounded function $g$ , $E {[\int g (z) \prod_{s = 1}^{m} K_{k_{s}} (z, Z_{s}) d z]}^{2} ≲ {‖ f ‖}_{\infty}^{m} {‖ g ‖}_{\infty}^{2} \prod_{s = 1}^{m - 1} k_{s} .$

Proof

The expected value is an integral relative to the density of $(Z_{1}, \dots, Z_{m})$ . By bounding this density by ${‖ f ‖}_{\infty}^{m}$ , the integral is turned into an integral

References (11)

A. Cohen et al., Wavelets on the interval and fast wavelet transforms, Appl. Comput. Harmon. Anal. (1993)
P.J. Bickel et al., Estimating integrated squared density derivatives: sharp best order of convergence estimates, Sankhyā Ser. A (1988)
L. Birgé et al., Estimation of integral functionals of a density, Ann. Statist. (1995)
M. Emery et al.
W. Härdle et al.

There are more references available in the full text version of this article.
