
Open Access 12.08.2020

Vector quantile regression and optimal transport, from theory to numerics

Authors: Guillaume Carlier, Victor Chernozhukov, Gwendoline De Bie, Alfred Galichon

Published in: Empirical Economics | Issue 1/2022


Abstract

In this paper, we first revisit the Koenker and Bassett variational approach to (univariate) quantile regression, emphasizing its link with latent factor representations and correlation maximization problems. We then review the multivariate extension due to Carlier et al. (Ann Statist 44(3):1165–92, 2016; J Multivariate Anal 161:96–102, 2017) which relates vector quantile regression to an optimal transport problem with mean independence constraints. We introduce an entropic regularization of this problem, implement a gradient descent numerical method and illustrate its feasibility on univariate and bivariate examples.
Notes
A correction to this article is available online at https://doi.org/10.1007/s00181-020-01933-0.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Quantile regression, introduced by Koenker and Bassett Jr (1978), has become a very popular tool for analyzing the response of the whole distribution of a dependent variable to a set of predictors. It is a far-reaching generalization of median regression, allowing for a prediction of any quantile of the distribution. We briefly recall classical quantile regression. For \(t\in \left[ 0,1\right] \), it is well known that the t-quantile \(q_{t}\left( x\right) \) of Y given \(X=x\) minimizes over q the loss function \(\mathbb {E}\left[ t\varepsilon ^{+}+\left( 1-t\right) \varepsilon ^{-}|X=x\right] \), or equivalently \(\mathbb {E}\left[ \varepsilon ^{+}+\left( t-1\right) \varepsilon |X=x\right] \), where \(\varepsilon =Y-q\). As a result, if \(q_{t}\left( x\right) \) is specified under the parametric form \(q_{t}\left( x\right) =\beta _{t}^{\top } x+\alpha _{t}\), it is natural to estimate \(\alpha _{t}\) and \(\beta _{t}\) by minimizing the loss
$$\begin{aligned} \min _{\alpha ,\beta }\mathbb {E}\left[ \left( Y-\beta ^{\top } X -\alpha \right) ^{+}+\left( 1-t\right) \left( \beta ^{\top }X +\alpha \right) \right] . \end{aligned}$$
The previous optimization problem estimates \(\alpha _{t}\) and \(\beta _{t}\) for a fixed value of t. If one would like to estimate the whole curve \(t\mapsto \left( \alpha _{t},\beta _{t}\right) \), one simply constructs the loss function by integrating the previous loss over \( t\in \left[ 0,1\right] \), so that the curve \(t\mapsto \left( \alpha _{t},\beta _{t}\right) \) minimizes
$$\begin{aligned} \min _{\left( \alpha _{t},\beta _{t}\right) _{t\in \left[ 0,1\right] }}\int _{0}^{1}\mathbb {E}\left[ \left( Y-\beta _t^{\top } X-\alpha _{t}\right) ^{+}+\left( 1-t\right) \left( \beta _t^{\top } X+\alpha _{t}\right) \right] dt. \end{aligned}$$
As has been known since the original work of Koenker and Bassett, this problem has an (infinite-dimensional) linear programming formulation. Defining \( P_{t}=\left( Y-\beta _t^{\top } X-\alpha _{t}\right) ^{+}\) as the positive deviation of Y with respect to its predicted quantile \(\beta _t^{\top } X+\alpha _{t}\), we have \(P_{t}\ge 0\) and \(\left( Y-\beta _t^{\top } X-\alpha _{t}\right) ^{-}=P_{t}-Y+\beta _t^{\top } X+\alpha _{t}\ge 0\), so the problem reformulates as1
$$\begin{aligned}&\min _{P_{t}\ge 0,\beta _{t},\alpha _{t}} \int _{0}^{1}\mathbb {E}\left[ P_{t}+\left( 1-t\right) \left( \beta _t^{\top } X+\alpha _{t}\right) \right] dt \nonumber \\&\quad s.t.~ P_{t}-Y+\beta _t^{\top } X+\alpha _{t}\ge 0~\left[ V_{t}\right] \end{aligned}$$
(1.1)
which we will call the “dual formulation” of the classical quantile regression problem2. To this dual formulation corresponds a primal one (the dual of the dual), which is formally obtained from the minimax formulation
$$\begin{aligned} \min _{P_{t}\ge 0,\beta _{t},\alpha _{t}}\max _{V_{t}\ge 0}\int _{0}^{1} \mathbb {E}\left[ P_{t}+\left( 1-t\right) \left( \beta _t^{\top } X+\alpha _{t}\right) +V_{t}Y-V_{t}P_{t}-V_{t}\beta _t^{\top } X-V_{t}\alpha _{t}\right] dt \end{aligned}$$
thus
$$\begin{aligned} \max _{V_{t}\ge 0}\int _{0}^{1}\mathbb {E}\left[ V_{t}Y\right] dt+\min _{P_{t}\ge 0,\beta _{t},\alpha _{t}}\int _{0}^{1}\mathbb {E}\left[ \left( 1-V_{t}\right) P_{t}+\beta _{t}^{\top }\left( \left( 1-t-V_{t}\right) X\right) +\left( 1-t-V_{t}\right) \alpha _{t}\right] dt \end{aligned}$$
hence we arrive at the primal formulation
$$\begin{aligned}&\max _{V_{t}\ge 0}\int _{0}^{1}\mathbb {E}\left[ YV_{t}\right] dt \nonumber \\&\quad s.t.~ V_{t}\le 1~\left[ P_{t}\ge 0\right] \nonumber \\&\quad \mathbb {E}\left[ V_{t}X\right] =\left( 1-t\right) \mathbb {E}\left[ X\right] ~\left[ \beta _{t}\right] \nonumber \\&\quad \mathbb {E}\left[ V_{t}\right] =\left( 1-t\right) ~\left[ \alpha _{t}\right] \end{aligned}$$
(1.2)
If \(V_{t}\) and \(\left( \alpha _{t},\beta _{t}\right) \) are solutions to the above primal and dual programs, complementary slackness yields \(\mathbf {1} _{\left\{ Y>\beta _t^{\top } X+\alpha _{t}\right\} } \le V_{t}\le \mathbf {1} _{\left\{ Y\ge \beta _t^{\top } X+\alpha _{t}\right\} } \), hence if \(\left( X,Y\right) \) has a continuous distribution, then for any \(\left( \alpha ,\beta \right) \), \({\mathbb {P}}\left( Y-\beta ^{\top } X -\alpha =0\right) =0\) , and therefore one has almost surely
$$\begin{aligned} V_{t}=\mathbf {1}_{\left\{ Y\ge \beta _t^{\top } X+\alpha _{t}\right\} }. \end{aligned}$$
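To make this concrete, here is a minimal numerical sketch (not the authors' code) that estimates \((\alpha _t,\beta _t)\) for a single fixed t by solving the sample analog of the linear program (1.1) with scipy.optimize.linprog. The data-generating process and all variable names are purely illustrative choices.

```python
# Minimal sketch: Koenker-Bassett quantile regression at a fixed level t,
# written as the sample analog of the linear program (1.1) and solved with
# scipy.optimize.linprog.  Synthetic data, for illustration only.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, K, t = 500, 1, 0.7
X = rng.normal(size=(N, K))
Y = 1.0 + 2.0 * X[:, 0] + (0.5 + 0.3 * X[:, 0]) * rng.normal(size=N)

xbar = X.mean(axis=0)
# variables: [alpha, beta_1..beta_K, P_1..P_N], objective E[P + (1-t)(beta'X + alpha)]
c = np.concatenate(([1 - t], (1 - t) * xbar, np.full(N, 1.0 / N)))
# constraint  P_i + beta'X_i + alpha >= Y_i   <=>   -alpha - beta'X_i - P_i <= -Y_i
A_ub = np.hstack([-np.ones((N, 1)), -X, -np.eye(N)])
b_ub = -Y
bounds = [(None, None)] * (1 + K) + [(0, None)] * N   # only P is sign-constrained
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
alpha_t, beta_t = res.x[0], res.x[1:1 + K]
print("estimated t-quantile line:", alpha_t, beta_t)
```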
Koenker and Ng (2005) impose a monotonicity constraint on the estimated quantile curves. Indeed, if \(\beta _t^{\top } x+\alpha _{t}\) is the t-quantile of the conditional distribution of Y given \(X=x\), the curve \( t\mapsto \beta _t^{\top } x+\alpha _{t}\) should be nondecreasing. Hence, these authors impose a natural constraint on the dual, that is \(\beta _t^{\top } X + \alpha _{t}\ge \beta _{t^{\prime }}^{\top } X+\alpha _{t^{\prime }}\) for \(t\ge t^{\prime } \), and they incorporate this constraint into (1.1), yielding
$$\begin{aligned}&\min _{P_{t}\ge 0,N_{t}\ge 0,\beta _{t},\alpha _{t}} \int _{0}^{1}\mathbb {E} \left[ P_{t}+\left( 1-t\right) \left( \beta _t^{\top } X+\alpha _{t}\right) \right] dt \\&\quad s.t.~ P_{t}-N_{t}=Y-\beta _t^{\top } X-\alpha _{t}~\left[ V_{t}\right] \\&\quad t\ge t^{\prime } \Rightarrow \beta _t^{\top } X+\alpha _{t}\ge \beta _{t^{\prime }}^{\top } X +\alpha _{t^{\prime }}. \end{aligned}$$
Note that if \(t\mapsto \beta _t^{\top } x+\alpha _{t}\) is nondecreasing, then \( \mathbf {1}_{\left\{ y\ge \beta _t^{\top } x+\alpha _{t}\right\} }\) should be nonincreasing. Therefore, in that case, \(V_{t}\) should be nonincreasing in t, which allows us to impose a monotonicity constraint on the primal variable \(V_{t}\) instead of a monotonicity constraint on the dual variables \( \beta _{t}\) and \(\alpha _t\). This is precisely the problem we look at. Consider
$$\begin{aligned}&\max _{V_{t}}\int _{0}^{1}\mathbb {E}\left[ YV_{t}\right] dt \nonumber \\&\quad s.t.~ V_{t}\ge 0~\left[ N_{t}\ge 0\right] \nonumber \\&\quad V_{t}\le 1~\left[ P_{t}\ge 0\right] \nonumber \\&\quad \mathbb {E}\left[ V_{t}X\right] =\left( 1-t\right) \mathbb {E}\left[ X\right] ~\left[ \beta _{t}\right] \nonumber \\&\quad \mathbb {E}\left[ V_{t}\right] =\left( 1-t\right) ~\left[ \alpha _{t}\right] \nonumber \\&\quad t\ge t^{\prime } \Rightarrow V_t \le V_{t^{\prime }}. \end{aligned}$$
(1.3)
Let us now take a look at a sample version of this problem. Here, we observe a sample \(\left( X_{i},Y_{i}\right) \), \(i\in \left\{ 1,...,N\right\} \). We discretize the set of probability levels \(\left[ 0,1\right] \) into T points, \(t_{1}=0<t_{2}<...<t_{T}\le 1\). Let \(\overline{x}\) be the \(1\times K\) row vector whose k-th entry is \(\sum _{1\le i\le N}X_{ik}/N\). The sample analog of (1.3) is
$$\begin{aligned}&\max _{V_{\tau i}\ge 0}\sum _{\begin{array}{c} 1\le i\le N \\ 1\le \tau \le T \end{array}}V_{\tau i}Y_{i} \\&\quad V_{\tau i}\le 1 \\&\quad \frac{1}{N}\sum _{1\le i\le N}V_{\tau i}X_{ik}=\left( 1-t_{\tau }\right) \overline{x}_{k} \\&\quad \frac{1}{N}\sum _{1\le i\le N}V_{\tau i}=\left( 1-t_{\tau }\right) \\&\quad V_{1 i}\ge V_{2 i}\ge \cdots \ge V_{\left( T-1\right) i}\ge V_{T i}\ge 0. \end{aligned}$$
Denoting by \(\mathbf {t}\) the \(T\times 1\) column vector with entries \(t_{\tau }\), and by D the \(T\times T\) matrix with ones on the main diagonal, \(-1\) on the diagonal just below the main diagonal, and 0 elsewhere, the condition \( V_{1i}\ge V_{2i}\ge \cdots \ge V_{Ti}\ge 0\) reexpresses as \(V^{\top }D\ge 0\), and the program rewrites
$$\begin{aligned}&\max _{V}1_{T}^{\top }VY \\&\quad \frac{1}{N}VX=\left( 1_{T}-\mathbf {t}\right) \overline{x} \\&\quad \frac{1}{N}V1_{N}=\left( 1_{T}-\mathbf {t}\right) \\&\quad V^{\top }D1_{T}=1_{N} \\&\quad V^{\top }D\ge 0. \end{aligned}$$
Setting \(\pi =D^{\top }V/N\), \(U=D^{-1}1_{T}/T=\left( 1/T,2/T,\ldots ,1\right) ^{\top }\), \(\mu =D^{\top }\left( 1_{T}-\mathbf {t}\right) =\left( 1/T,\ldots ,1/T\right) \) (for the uniform grid \(t_{\tau }=(\tau -1)/T\)), and \(\nu =1_{N}/N\), one can reformulate the problem as
$$\begin{aligned}&\max _{\pi \ge 0}\sum _{\begin{array}{c} 1\le \tau \le T \\ 1\le i\le N \end{array}}\pi _{\tau i}U_{\tau }Y_{i} \\&\quad \sum _{1\le \tau \le T}\pi _{\tau i}=\nu _{i} \\&\quad \sum _{i=1}^{N}\pi _{\tau i}=\mu _{\tau }\text { } \\&\quad \sum _{1\le i\le N}\pi _{\tau i}X_{ik}=\mu _{\tau }\overline{x}_{k} \end{aligned}$$
which rewrites in the population as
$$\begin{aligned}&\max _{\left( U,X,Y\right) \sim \pi } \mathbb {E}_{\pi }\left[ UY\right] \nonumber \\&\quad s.t. \; U\sim \mathcal {U}\left( \left[ 0,1\right] \right) \nonumber \\&\quad \left( X,Y\right) \sim \nu \nonumber \\&\quad \mathbb {E}\left[ X|U\right] =\mathbb {E}\left[ X\right] . \end{aligned}$$
(1.4)
Note that this is a direct extension of the Monge-Kantorovich problem of optimal transport: in fact, it boils down to it when the last constraint is absent. This should not be surprising, given the connection between optimal transport and quantiles recalled below. In the present paper, we introduce the Regularized Vector Quantile Regression (RVQR) problem, which consists in adding an entropic regularization term to the expression (1.4); this yields, for a given data distribution \(\nu \),
$$\begin{aligned}&\max _{\left( U,X,Y\right) \sim \pi } \mathbb {E}_{\pi }\left[ UY\right] -\varepsilon \mathbb {E}_{\pi }\left[ \ln \pi \left( U,X,Y\right) \right] \nonumber \\&\quad s.t. \; U\sim \mathcal {U}\left( \left[ 0,1\right] \right) \nonumber \\&\quad \left( X,Y\right) \sim \nu \nonumber \\&\quad \mathbb {E}\left[ X|U\right] =\mathbb {E}\left[ X\right] . \end{aligned}$$
(1.5)
Due to smoothness and regularity, the regularized problem (1.5) enjoys computational and analytical properties that are missing from the original problem (1.4). In particular, the dual to (1.5) is a smooth, unconstrained problem that can be solved by first-order methods. While here, unlike in the context of standard optimal transport, the Kullback-Leibler divergence projection onto the mean-independence constraint is not in closed form, we can use Nesterov’s gradient descent acceleration, which gives optimal convergence rates for first-order methods.
The present paper in part provides a survey of previous results, and in part conveys new results. In the vein of the previous papers on the topic (Carlier et al. 2016, 2017), this paper seeks to apply the optimal transport toolbox to quantile regression. In contrast with these papers, a particular focus of the present paper is to propose a regularized version of the problem as well as new computational methods. The two main new contributions of the paper are (1) a connection with shape-constrained classical quantile regression (Sect. 4), and (2) the introduction of the regularized vector quantile regression problem (RVQR) along with a duality theorem for that problem (Sect. 6).
The paper is organized as follows. Section 2 will offer reminders on the notion of quantile; Sect. 3 will review the previous results of Carlier et al. (2016, 2017) on the “specified” case; Sect. 4 offers a new result (Theorem 4.2) on the comparison with shape-constrained classical quantile regression; and Sect. 5 will review results on the multivariate case. Section 6 introduces the RVQR problem and presents results relevant for it, in particular a duality result (Theorem 6.1).

2 Several characterizations of quantiles

Throughout the paper, \((\Omega , {\mathcal {F}}, \mathbb {P})\) will be some fixed nonatomic probability space3. Given a random vector Z with values in \({\mathbb {R}}^k\) defined on this space, we will denote by \({\mathscr {L}}(Z)\) the law of Z; given a probability measure \( \theta \) on \({\mathbb {R}}^k\), we shall often write \(Z\sim \theta \) to express that \({\mathscr {L}}(Z)=\theta \). Independence of two random variables \(Z_1\) and \(Z_2\) will be denoted as \(Z_1 \perp \! \! \! \perp Z_2\).

2.1 Quantiles

Let Y be some univariate random variable defined on \((\Omega , {\mathcal {F}} , \mathbb {P})\). Denoting by \(F_Y\) the distribution function of Y:
$$\begin{aligned} F_Y(\alpha ):=\mathbb {P}(Y\le \alpha ), \; \forall \alpha \in {\mathbb {R}} \end{aligned}$$
the quantile function of Y, \(Q_Y=F_Y^{-1}\) is the generalized inverse of \(F_Y\) given by the formula:
$$\begin{aligned} Q_Y(t):=\inf \{\alpha \in {\mathbb {R}} \; : \; F_Y(\alpha )>t\} \text { for all } t\in (0,1). \end{aligned}$$
(2.1)
Let us now recall two well-known facts about quantiles:
  • \(\alpha =Q_Y(t)\) is a solution of the convex minimization problem
    $$\begin{aligned} \min _{\alpha } \{{\mathbb {E}}((Y-\alpha )^+)+ \alpha (1-t)\} \end{aligned}$$
    (2.2)
  • there exists a uniformly distributed random variable U such that \( Y=Q_Y(U)\). Moreover, among uniformly distributed random variables, U is maximally correlated4 to Y in the sense that it solves
    $$\begin{aligned} \max \{ {\mathbb {E}}(VY), \; V\sim \mu \} \end{aligned}$$
    (2.3)
    where \(\mu :={\mathcal {U}}([0,1])\) is the uniform measure on [0, 1]. Of course, when \({\mathscr {L}}(Y)\) has no atom, i.e., when \(F_{Y}\) is continuous, U is unique and given by \(U=F_{Y}(Y)\). Problem (2.3) is the simplest example of an optimal transport problem one can think of. The decomposition of a random variable Y as the composition of a monotone nondecreasing function and a uniformly distributed random variable is called a polar factorization of Y. The existence of such decompositions goes back to Ryff (1970) and the extension to the multivariate case (by optimal transport) is due to Brenier (1991).
We therefore see that there are basically two different approaches to study or estimate quantiles:
  • the local or “t by t” approach, which consists, for a fixed probability level t, in using directly formula (2.1) or the minimization problem (2.2) (or some approximation of it); this can be done very efficiently in practice but has the disadvantage of forgetting the fundamental global property of the quantile function, namely that it should be monotone in t,
  • the global approach (or polar factorization approach), where quantiles of Y are defined as all nondecreasing functions Q for which one can write \(Y=Q(U)\) with U uniformly distributed. In this approach, one rather tries to recover directly the whole monotone function Q (or the uniform variable U that is maximally correlated to Y); this is a global approach for which one should rather use the optimal transport problem (2.3). A small numerical illustration of both approaches is given right after this list.
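The following minimal sketch (synthetic data, illustrative only) contrasts the two viewpoints: the local computation solves (2.2) for one fixed level t, while the global computation recovers the whole quantile function at once by sorting, which is exactly the one-dimensional optimal transport problem (2.3).

```python
# Minimal sketch: "t by t" versus "global" computation of quantiles
# on a synthetic sample (illustrative only).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
Y = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)

# local approach: solve (2.2) for one fixed probability level t
t = 0.8
def loss(a):
    return np.mean(np.maximum(Y - a, 0.0)) + a * (1 - t)
q_local = minimize_scalar(loss, bounds=(Y.min(), Y.max()), method="bounded").x

# global approach: sorting Y gives the whole quantile function at once;
# U = (rank of Y)/n is uniform and maximally correlated with Y, cf. (2.3)
Ys = np.sort(Y)
q_global = Ys[int(t * len(Y)) - 1]              # empirical quantile Q_Y(t)
U = (np.argsort(np.argsort(Y)) + 1) / len(Y)    # empirical polar factor
print(q_local, q_global, np.corrcoef(U, Y)[0, 1])
```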

2.2 Conditional quantiles

Let us assume now that, in addition to the random variable Y, we are also given a random vector \(X\in {\mathbb {R}}^N\) which we may think of as being a list of explanatory variables for Y. We are primarily interested in the dependence between Y and X and in particular the conditional quantiles of Y given \(X=x\). Let us denote by \(\nu \) the joint law of (X, Y), by m the law of X, and by \(\nu (. \vert x)\) the conditional law of Y given \( X=x \):
$$\begin{aligned} \nu :={\mathscr {L}}(X,Y), \; m:={\mathscr {L}}(X), \; \nu (.\vert x):={ \mathscr {L}}(Y \vert X=x) \end{aligned}$$
(2.4)
which in particular yields
$$\begin{aligned} \text {d} \nu (x,y) = \text {d}\nu (y\vert x) \text {d} m(x). \end{aligned}$$
We then denote by \(F(x,y)=F_{Y\vert X=x}(y)\) the conditional cdf:
$$\begin{aligned} F(x,y):=\mathbb {P}(Y\le y \vert X=x) \end{aligned}$$
and Q(xt) the conditional quantile
$$\begin{aligned} Q(x,t):=\inf \{\alpha \in {\mathbb {R}} \; : \; F(x,\alpha )>t\}, \; \forall t\in (0,1). \end{aligned}$$
For the sake of simplicity, we shall assume that for \(m={\mathscr {L}}(X)\) -almost every \(x\in {\mathbb {R}}^N\) (m-a.e. x for short), one has
$$\begin{aligned} t\mapsto Q(x,t) \text { is continuous and increasing} \end{aligned}$$
(2.5)
so that for m-a.e. x, \(F(x, Q(x,t))=t\) for every \(t\in (0,1)\) and \(Q(x, F(x,y))=y\) for every y in the support of \(\nu (.\vert x)\).
Let us now define the random variable
$$\begin{aligned} U:=F(X,Y), \end{aligned}$$
(2.6)
then by construction:
$$\begin{aligned} \begin{aligned} \mathbb {P}(U< t\vert X=x)&=\mathbb {P}(F(x,Y)<t \vert X=x)=\mathbb {P} (Y<Q(x,t) \vert X=x) \\&=F(x,Q(x,t))=t. \end{aligned} \end{aligned}$$
We deduce that U is uniformly distributed and independent from X (since its conditional cdf does not depend on x). Moreover since \(U=F(X,Y)=F(X, Q(X,U))\) it follows from (2.5) that one has the representation
$$\begin{aligned} Y=Q(X,U) \end{aligned}$$
in which U can naturally be interpreted as a latent factor.
This easy remark leads to a conditional polar factorization of Y through the pointwise relation \(Y=Q(X,U)\) with Q(X, .) nondecreasing and \(U\sim \mu \) , \(U\perp \! \! \! \perp X\). We would like to emphasize now that there is a variational principle behind this conditional decomposition. Let us indeed consider the variant of the optimal transport problem (2.3) where one further requires U to be independent from the vector of regressors X :
$$\begin{aligned} \max \{ {\mathbb {E}}(VY), \; {\mathscr {L}}(V)=\mu , \; V \perp \! \! \! \perp X \}. \end{aligned}$$
(2.7)
then we have
Proposition 2.1
If \({\mathbb {E}}(\vert Y\vert )<+\infty \) and (2.5) holds, the random variable U defined in (2.6) solves (2.7).
Proof
Let V be admissible for (2.7). Let us define for \(x\in \mathop {\mathrm {spt}}\nolimits (m)\) and \(t\in [0,1]\),
$$\begin{aligned} \varphi (x,t):=\int _0^t Q(x,s) \text {d}s. \end{aligned}$$
We first claim that \(\varphi (X,U)\) is integrable, indeed we obviously have
$$\begin{aligned} \vert \varphi (X,U)\vert \le \int _0^1 \vert Q(X,s)\vert \text {d}s \end{aligned}$$
hence
$$\begin{aligned} \begin{aligned} {\mathbb {E}}(\vert \varphi (X,U)\vert )&\le \int _{{\mathbb {R}}^N} \int _0^1 \vert Q(x,s)\vert \text {d} \mu (s)\; \text {d} m(x) \\&=\int _{{\mathbb {R}}^N} \int _{\mathbb {R}} \vert y \vert \text {d} \nu (y\vert x) \text {d} m(x) ={\mathbb {E}}(\vert Y \vert )<+\infty \end{aligned} \end{aligned}$$
where we have used in the second line the fact that the image of \(\mu \) by Q(x, .) is \(\nu (.\vert x)\). Since \(\varphi (x,.)\) is convex and \(Y=\frac{ \partial \; \varphi }{\partial t} (X,U)\), the pointwise inequality
$$\begin{aligned} \varphi (X,V)-\varphi (X,U)\ge Y (V-U) \end{aligned}$$
holds almost surely. But since \({\mathscr {L}}(X,V)={\mathscr {L}}(X,U)\) integrating the previous inequality yields
$$\begin{aligned} {\mathbb {E}}(\varphi (X,V)-\varphi (X,U))=0 \ge {\mathbb {E}}(Y (V-U)). \end{aligned}$$

3 Specified and quasi-specified quantile regression

3.1 Specified quantile regression

Since the seminal work of Koenker and Bassett Jr (1978), it has been widely accepted that a convenient way to estimate conditional quantiles is to stipulate an affine form with respect to x for the conditional quantile. Since a quantile function should be monotone in its second argument, this leads to the following definition
Definition 3.1
Quantile regression is specified if there exist \((\alpha , \beta )\in C([0,1], {\mathbb {R}})\times C([0,1], {\mathbb {R}}^N)\) such that for m-a.e. x
$$\begin{aligned} t\mapsto \alpha (t)+\beta (t)^{\top } x \text { is increasing on [0,1]} \end{aligned}$$
(3.1)
and
$$\begin{aligned} Q(x,t)=\alpha (t)+ \beta (t)^{\top } x, \end{aligned}$$
(3.2)
for m-a.e. x and every \(t\in [0,1]\). If (3.1), (3.2) hold, quantile regression is specified with regression coefficients \( (\alpha , \beta )\).
Specification of quantile regression can be characterized by the validity of an affine in X representation of Y with a latent factor:
Proposition 3.2
Let \((\alpha , \beta )\) be continuous and satisfy (3.1). Quantile regression is specified with regression coefficients \((\alpha , \beta )\) if and only if there exists U such that
$$\begin{aligned} Y=\alpha (U)+ \beta (U)^{\top } X \text { almost surely} , \; {\mathscr {L}} (U)=\mu , \; U \perp \! \! \! \perp X. \end{aligned}$$
(3.3)
Proof
The fact that specification of quantile regression implies decomposition ( 3.3) has already been explained in paragraph 2.2. Let us assume (3.3), and compute
$$\begin{aligned} \begin{aligned} F(x, \alpha (t)+\beta (t)^{\top } x)&=\mathbb {P}(Y\le \alpha (t)+\beta (t)^{\top } x\vert X=x) \\&= \mathbb {P}(\alpha (U)+ \beta (U)^{\top } x \le \alpha (t)+\beta (t)^{\top } x\vert X=x) \\&=\mathbb {P}(U\le t \vert X=x)=\mathbb {P}(U\le t)=t \end{aligned} \end{aligned}$$
so that \(Q(x,t)=\alpha (t)+\beta (t)^{\top } x\).

3.2 Quasi-specified quantile regression

Let us now assume that both X and Y are integrable
$$\begin{aligned} {\mathbb {E}}(\Vert X\Vert + \vert Y\vert )<+\infty \end{aligned}$$
(3.4)
and normalize, without loss of generality, X in such a way that
$$\begin{aligned} {\mathbb {E}}(X)=0. \end{aligned}$$
(3.5)
Koenker and Bassett showed that, for a fixed probability level t, the regression coefficients \((\alpha ,\beta )\) can be estimated by quantile regression, i.e., the minimization problem
$$\begin{aligned} \inf _{(\alpha ,\beta )\in {\mathbb {R}}^{1+N}}{\mathbb {E}}(\rho _{t}(Y-\alpha -\beta ^{\top }X)) \end{aligned}$$
(3.6)
where the penalty \(\rho _{t}\) is given by \(\rho _{t}(z):=tz^++(1-t)z^-\) with \(z^-\) and \(z^+\) denoting the negative and positive parts of z. For further use, note that (3.6) can conveniently be rewritten as
$$\begin{aligned} \inf _{(\alpha ,\beta )\in {\mathbb {R}}^{1+N}}\{{\mathbb {E}}((Y-\alpha -\beta ^{\top }X)^+)+(1-t)\alpha \}. \end{aligned}$$
(3.7)
As noticed by Koenker and Bassett, this convex program admits as dual formulation
$$\begin{aligned} \sup \{{\mathbb {E}}(V_{t}Y)\;:\;V_{t}\in [0,1],\;{\mathbb {E}} (V_{t})=(1-t),\;{\mathbb {E}}(V_{t}X)=0\}. \end{aligned}$$
(3.8)
An optimal \((\alpha ,\beta )\) for (3.7) and an optimal \(V_{t}\) in (3.8) are related by the complementary slackness condition:
$$\begin{aligned} Y>\alpha +\beta ^{\top }X\Rightarrow V_{t}=1,\text { and }\;Y<\alpha +\beta ^{\top }X\Rightarrow V_{t}=0. \end{aligned}$$
(3.9)
Note that \(\alpha \) appears naturally as a Lagrange multiplier associated to the constraint \({\mathbb {E}}(V_{t})=(1-t)\) and \(\beta \) as a Lagrange multiplier associated to \({\mathbb {E}}(V_{t}X)=0\).
To avoid mixing, i.e., the possibility that \(V_t\) takes values in (0, 1), it will be convenient to assume that \(\nu ={\mathscr {L}}(X,Y)\) gives zero mass to nonvertical hyperplanes, i.e.,
$$\begin{aligned} \mathbb {P}(Y=\alpha +\beta ^{\top } X)=0, \; \forall (\alpha , \beta )\in { \mathbb {R}}^{1+N}. \end{aligned}$$
(3.10)
We shall also consider a nondegeneracy condition on the (centered) random vector X which says that its law is not supported by any hyperplane5:
$$\begin{aligned} \; \mathbb {P}(\beta ^{\top } X=0)<1, \; \forall \beta \in {\mathbb {R}} ^N\setminus \{0\}. \end{aligned}$$
(3.11)
Thanks to (3.10), we may simply write
$$\begin{aligned} V_t=\mathbf {1}_{\{Y>\alpha +\beta ^{\top } X\}} \end{aligned}$$
(3.12)
and thus the constraints \({\mathbb {E}}(V_t)=(1-t)\), \({\mathbb {E}}(XV_t)=0\) read
$$\begin{aligned} {\mathbb {E}}( \mathbf {1}_{\{Y> \alpha + \beta ^{\top } X \}})=\mathbb {P}(Y> \alpha + \beta ^{\top } X) = (1-t),\; {\mathbb {E}}(X \mathbf {1}_{\{Y> \alpha + \beta ^{\top } X \}} ) =0 \end{aligned}$$
(3.13)
which simply are the first-order conditions for (3.7).
Any pair \((\alpha , \beta )\) which solves6 the optimality conditions (3.13) for the Koenker and Bassett approach will be denoted
$$\begin{aligned} \alpha =\alpha ^{QR}(t), \beta =\beta ^{QR}(t) \end{aligned}$$
and the variable \(V_t\) solving (3.8) given by (3.12) will similarly be denoted \(V_t^{QR}\)
$$\begin{aligned} V_t^{QR}:=\mathbf {1}_{\{Y>\alpha ^{QR}(t) +\beta ^{QR}(t)^{\top } X\}}. \end{aligned}$$
(3.14)
Note that in the previous considerations the probability level t is fixed; this is what we called the “t by t” approach. For this approach to be consistent with conditional quantile estimation, if we allow t to vary, we should add an additional monotonicity requirement:
Definition 3.3
Quantile regression is quasi-specified7 if there exists for each t, a solution \((\alpha ^{QR}(t), \beta ^{QR}(t))\) of (3.13) (equivalently the minimization problem (3.6)) such that \(t\in [0,1]\mapsto (\alpha ^{QR}(t), \beta ^{QR}(t))\) is continuous and, for m-a.e. x
$$\begin{aligned} t\mapsto \alpha ^{QR}(t)+\beta ^{QR}(t)^{\top } x \text { is increasing on [0,1]}. \end{aligned}$$
(3.15)
A first consequence of quasi-specification is given by
Proposition 3.4
Assume (2.5)–(3.4), (3.5) and (3.10). If quantile regression is quasi-specified and if we define \(U^{QR}:=\int _0^1 V_t^{QR} dt\) (recall that \(V_t^{QR}\) is given by ( 3.14)) then:
  • \(U^{QR}\) is uniformly distributed,
  • X is mean-independent from \(U^{QR}\), i.e., \({\mathbb {E}}(X\vert U^{QR})={\mathbb {E}}(X)=0\),
  • \(Y=\alpha ^{QR}(U^{QR})+ {\beta ^{QR}}(U^{QR})^{\top } X\) almost surely.
Moreover, \(U^{QR}\) solves the correlation maximization problem with a mean-independence constraint:
$$\begin{aligned} \max \{ {\mathbb {E}}(VY), \; {\mathscr {L}}(V)=\mu , \; {\mathbb {E}}(X\vert V)=0\}. \end{aligned}$$
(3.16)
Proof
Obviously
$$\begin{aligned} V_t^{QR}=1\Rightarrow U^{QR} \ge t, \text { and } \; U^{QR}>t \Rightarrow V_t^{QR}=1 \end{aligned}$$
hence \(\mathbb {P}(U^{QR}\ge t)\ge \mathbb {P}(V_t^{QR}=1)=\mathbb {P}(Y> \alpha ^{QR}(t)+\beta ^{QR}(t)^{\top } X)=(1-t)\) and \(\mathbb {P}(U^{QR}> t)\le \mathbb {P}(V_t^{QR}=1)=(1-t)\), which proves that \(U^{QR}\) is uniformly distributed and that \(\{U^{QR}>t\}\) coincides with \(\{V^{QR}_t=1\}\) up to a set of null probability. We thus have \({\mathbb {E}}(X \mathbf {1}_{U^{QR}>t})={ \mathbb {E}}(X V_t^{QR})=0\); by a standard approximation argument, we deduce that \({\mathbb {E}}(Xf(U^{QR}))=0\) for every \(f\in C([0,1], {\mathbb {R}})\), which means that X is mean-independent from \(U^{QR}\).
As already observed, \(U^{QR}>t\) implies that \(Y>\alpha ^{QR}(t)+ \beta ^{QR}(t)^{\top } X\); in particular, \(Y\ge \alpha ^{QR}(U^{QR}-\delta )+\beta ^{QR}(U^{QR}- \delta )^{\top } X\) for \(\delta >0 \). Letting \(\delta \rightarrow 0^+\) and using the continuity of \((\alpha ^{QR}, \beta ^{QR})\), we get \(Y\ge \alpha ^{QR}(U^{QR})+\beta ^{QR}(U^{QR})^{\top } X\). The converse inequality is obtained similarly by remarking that \(U^{QR}<t\) implies that \(Y\le \alpha ^{QR}(t)+\beta ^{QR}(t)^{\top } X\).
Let us now prove that \(U^{QR}\) solves (3.16). Take V uniformly distributed and such that X is mean-independent from V, and set \(V_t:= \mathbf {1}_{\{V>t \}}\); we then have \({\mathbb {E}}(X V_t)=0\) and \({\mathbb {E}} (V_t)=(1-t)\), but since \(V_t^{QR}\) solves (3.8) we have \({\mathbb {E}} (V_t Y)\le {\mathbb {E}}(V_t^{QR}Y)\). Observing that \(V=\int _0^1 V_t dt\) and integrating the previous inequality with respect to t gives \({\mathbb {E}} (VY)\le {\mathbb {E}}(U^{QR}Y)\), so that \(U^{QR}\) solves (3.16).
Let us continue with a uniqueness argument for the mean-independent decomposition given in proposition 3.4:
Proposition 3.5
Assume (2.5)–(3.4), (3.5)–(3.10) and (3.11). Let us assume that
$$\begin{aligned} Y=\alpha (U)+\beta (U)^{\top } X=\overline{\alpha } (\overline{U})+ \overline{ \beta }(\overline{U})^{\top } X \end{aligned}$$
with:
  • both U and \(\overline{U}\) uniformly distributed,
  • X is mean-independent from U and \(\overline{U}\): \({\mathbb {E}} (X\vert U)={\mathbb {E}}(X\vert \overline{U})=0\),
  • \(\alpha , \beta , \overline{\alpha }, \overline{\beta }\) are continuous on [0, 1],
  • \((\alpha , \beta )\) and \((\overline{\alpha }, \overline{\beta })\) satisfy the monotonicity condition (3.1),
then
$$\begin{aligned} \alpha =\overline{\alpha }, \; \beta =\overline{\beta }, \; U=\overline{U}. \end{aligned}$$
Proof
Let us define for every \(t\in [0,1]\)
$$\begin{aligned} \varphi (t):=\int _0^t \alpha (s)ds, \; b(t):=\int _0^t \beta (s)ds. \end{aligned}$$
Let us also define for (xy) in \({\mathbb {R}}^{N+1}\):
$$\begin{aligned} \psi (x,y):=\max _{t\in [0,1]} \{ty-\varphi (t)-b(t)^{\top } x\} \end{aligned}$$
thanks to the monotonicity condition (3.1), the maximization program above is strictly concave in t for every y and m-a.e. x. We then remark that \(Y=\alpha (U)+\beta (U)^{\top } X=\varphi ^{\prime }(U)+b^{\prime }(U)^{\top }X\) is exactly the first-order condition for the above maximization problem when \((x,y)=(X,Y)\). In other words, we have
$$\begin{aligned} \psi (x,y)+b(t)^{\top }x + \varphi (t)\ge ty, \; \forall (t,x,y)\in [0,1]\times {\mathbb {R}}^N\times {\mathbb {R}} \end{aligned}$$
(3.17)
with an equality for \((x,y,t)=(X,Y,U)\), i.e.,
$$\begin{aligned} \psi (X,Y)+b(U)^{\top } X + \varphi (U)=UY, \; \text { almost surely. } \end{aligned}$$
(3.18)
Using the fact that \({\mathscr {L}}(U)={\mathscr {L}}(\overline{U})\) and the fact that mean-independence gives \({\mathbb {E}}(b(U)^{\top } X)={\mathbb {E}}(b(\overline{U})^{\top } X)=0\), we have
$$\begin{aligned} {\mathbb {E}}(UY)={\mathbb {E}}( \psi (X,Y)+b(U)^{\top } X + \varphi (U))= { \mathbb {E}}( \psi (X,Y)+b(\overline{U})^{\top } X + \varphi (\overline{U})) \ge {\mathbb {E}}(\overline{U} Y) \end{aligned}$$
but reversing the role of U and \(\overline{U}\), we also have \({\mathbb {E}} (UY)\le {\mathbb {E}}(\overline{U} Y)\) and then
$$\begin{aligned} {\mathbb {E}}(\overline{U} Y)= {\mathbb {E}}( \psi (X,Y)+b(\overline{U})^{\top } X + \varphi (\overline{U})) \end{aligned}$$
so that, thanks to inequality (3.17)
$$\begin{aligned} \psi (X,Y)+b(\overline{U})^{\top } X + \varphi (\overline{U})=\overline{U} Y, \; \text { almost surely } \end{aligned}$$
which means that \(\overline{U}\) solves \(\max _{t\in [0,1]} \{tY-\varphi (t)-b(t)^{\top } X\}\) which, by strict concavity admits U as unique solution. This proves that \(U=\overline{U}\) and thus
$$\begin{aligned} \alpha (U)-\overline{\alpha }(U)=(\overline{\beta }(U)-\beta (U))^{\top } X \end{aligned}$$
taking the conditional expectation with respect to U on both sides we then obtain \(\alpha =\overline{\alpha }\) and thus \(\beta (U)^{\top } X=\overline{\beta }(U)^{\top } X\) almost surely. We then compute
$$\begin{aligned} \begin{aligned} F(x, \alpha (t)+\beta (t)^{\top } x)&= \mathbb {P}(\alpha (U)+\beta (U)^{\top } X \le \alpha (t)+\beta (t)^{\top } x \vert X=x) \\&=\mathbb {P}( \alpha (U)+ \beta (U)^{\top } x \le \alpha (t)+\beta (t)^{\top } x \vert X=x) \\&=\mathbb {P}(U\le t \vert X=x) \end{aligned} \end{aligned}$$
and similarly \(F(x, \alpha (t)+\overline{\beta }(t)^{\top } x)=\mathbb {P}(U\le t \vert X=x)=F(x, \alpha (t)+\beta (t)^{\top } x)\). Thanks to (2.5), we deduce that \(\beta (t)^{\top } x=\overline{\beta }(t)^{\top } x\) for m-a.e. x and every \(t\in [0,1]\). Finally, the previous considerations and the nondegeneracy condition (3.11) enable us to conclude that \(\beta = \overline{\beta }\).
Corollary 3.6
Assume (2.5)–(3.4), (3.5)–(3.10) and (3.11). If quantile regression is quasi-specified, the regression coefficients \((\alpha ^{QR}, \beta ^{QR})\) are uniquely defined and if Y can be written as
$$\begin{aligned} Y=\alpha (U)+\beta (U)^{\top } X \end{aligned}$$
for U uniformly distributed, X being mean independent from U, \( (\alpha , \beta )\) continuous such that the monotonicity condition (3.1 ) holds then necessarily
$$\begin{aligned} \alpha =\alpha ^{QR}, \; \beta =\beta ^{QR}. \end{aligned}$$
To sum up, we have shown that quasi-specification is equivalent to the validity of the factor linear model:
$$\begin{aligned} Y=\alpha (U)+\beta (U)^{\top } X \end{aligned}$$
for \((\alpha , \beta )\) continuous and satisfying the monotonicity condition ( 3.1) and U, uniformly distributed and such that X is mean-independent from U. This has to be compared with the decomposition of paragraph 2.2 where U is required to be independent from X but the dependence of Y with respect to U, given X, is given by a nondecreasing function of U which is not necessarily affine in X.

4 Quantile regression without specification

Now we wish to address quantile regression in the case where neither specification nor quasi-specification can be taken for granted. In such a general situation, keeping in mind the remarks from the previous paragraphs, we can think of two natural approaches.
The first one consists in studying directly the correlation maximization with a mean-independence constraint (3.16). The second one consists in getting back to the Koenker and Bassett t by t problem (3.8) but adding as an additional global consistency constraint that \(V_t\) should be nonincreasing (which we abbreviate as \(V_t \downarrow \)) with respect to t:
$$\begin{aligned} \sup \{{\mathbb {E}}(\int _0^1 V_t Ydt ) \; : \, V_t \downarrow , \; V_t\in [0,1],\; {\mathbb {E}}(V_t)=(1-t), \; {\mathbb {E}}(V_t X)=0\} \end{aligned}$$
(4.1)
Our aim is to compare these two approaches (and in particular to show that the maximization problems (3.16) and (4.1) have the same value) as well as their dual formulations. Before going further, let us remark that (3.16) can directly be considered in the multivariate case, whereas the monotonicity constrained problem (4.1) makes sense only in the univariate case.
As proven in Carlier et al. (2016), (3.16) is dual to
$$\begin{aligned} \inf _{(\psi , \varphi , b)} \{{\mathbb {E}}(\psi (X,Y))+{\mathbb {E}}(\varphi (U)) \; : \; \psi (x,y)+ \varphi (u)\ge uy -b(u)^{\top } x\} \end{aligned}$$
(4.2)
which can be reformulated as:
$$\begin{aligned} \inf _{(\varphi , b)} \int \max _{t\in [0,1]} ( ty- \varphi (t) -b(t)^{\top } x) \nu (dx, dy) +\int _0^1 \varphi (t) dt \end{aligned}$$
(4.3)
in the sense that8
$$\begin{aligned} \sup (3.16)=\inf (4.2)=\inf (4.3). \end{aligned}$$
(4.4)
The existence of a solution to (4.2) is not straightforward and is established under appropriate assumptions in Carlier et al. (2017) in the multivariate case. The following result shows that there is a t-dependent reformulation of (3.16):
Lemma 4.1
The value of (3.16) coincides with
$$\begin{aligned} \sup \{{\mathbb {E}}(\int _0^1 V_t Ydt ) \; : \, V_t \downarrow , \; V_t\in \{0,1\},\; {\mathbb {E}}(V_t)=(1-t), \; {\mathbb {E}}(V_t X)=0\}. \end{aligned}$$
(4.5)
Proof
Let U be admissible for (3.16) and define \(V_t:=\mathbf {1} _{\{U>t\}}\); then \(U=\int _0^1 V_t dt\) and obviously \((V_t)_t\) is admissible for (4.5), so that \(\sup \) (3.16) \(\le \sup \) (4.5). Take now \((V_t)_t\) admissible for (4.5) and set \(V:=\int _0^1 V_t dt\); we then have
$$\begin{aligned} V>t \Rightarrow V_t=1\Rightarrow V\ge t \end{aligned}$$
since \({\mathbb {E}}(V_t)=(1-t)\) this implies that V is uniformly distributed and \(V_t=\mathbf {1}_{\{V>t\}}\) almost surely so that \({\mathbb {E} }(X \mathbf {1}_{\{V>t\}})=0\) which implies that X is mean-independent from V and thus \({\mathbb {E}}(\int _0^1 V_t Y dt)\le \sup \) (3.16). We conclude that \(\sup \) (3.16) = \(\sup \)(4.5).
Let us now define
$$\begin{aligned} {\mathcal {C}}:=\{v \; : \; [0,1]\mapsto [0,1], \; \downarrow \} \end{aligned}$$
Let \((V_t)_t\) be admissible for (4.1) and set
$$\begin{aligned} v_t(x,y):={\mathbb {E}}(V_t \vert X=x, Y=y), \; \widetilde{V}_t:= v_t(X,Y); \end{aligned}$$
it is obvious that \((\widetilde{V}_t)_t\) is admissible for (4.1) and that by construction \({\mathbb {E}}(\widetilde{V}_t Y)={\mathbb {E}}(V_t Y)\). Moreover, the deterministic function \((t,x,y)\mapsto v_t(x,y)\) satisfies the following conditions:
$$\begin{aligned} \text {for fixed} (x,y), t\mapsto v_t(x,y) \text { belongs to }{\mathcal C}, \end{aligned}$$
(4.6)
and for a.e. \(t\in [0,1]\),
$$\begin{aligned} \int v_t(x,y) \nu (dx, dy)=(1-t), \; \int v_t(x,y) x\nu (dx, dy)=0. \end{aligned}$$
(4.7)
Conversely, if \((t,x,y)\mapsto v_t(x,y)\) satisfies (4.6), (4.7 ), \(V_t:=v_t(X,Y)\) is admissible for (4.1) and \({\mathbb {E}}(V_t Y)=\int v_t(x,y) y \nu (dx, dy)\). All this proves that \(\sup \)(4.1) coincides with
$$\begin{aligned} \sup _{(t,x,y)\mapsto v_t(x,y)} \int v_t(x,y) y \nu (dx, dy)dt \text { subject to: } (4.6)-(4.7) \end{aligned}$$
(4.8)
Theorem 4.2
The shape constrained quantile regression problem (4.1) is related to the correlation maximization with a mean independence constraint (3.16) by:
$$\begin{aligned} \sup (3.16)=\sup (4.1). \end{aligned}$$
Proof
We know from lemma 4.1 and the remarks above that
$$\begin{aligned} \sup (3.16)=\sup (4.5) \le \sup (4.1)=\sup (4.8). \end{aligned}$$
We now get rid of constraints (4.7) by rewriting (4.8) in sup-inf form as
$$\begin{aligned}&\sup _{\quad v_t \text { satisfies } (4.6)\quad } \inf _{(\alpha , \beta )} \int v_t(x,y)(y-\alpha (t)-\beta (t)^{\top } x) \nu (dx,dy)dt\\&\qquad +\int _0^1 (1-t)\alpha (t) dt \end{aligned}$$
Recall that one always has \(\sup \inf \le \inf \sup \) so that \(\sup \)(4.8) is less than
$$\begin{aligned} \begin{aligned} \inf _{(\alpha , \beta )} \sup _{\quad v_t \text { satisf. } (4.6)\quad } \int v_t(x,y)(y-\alpha (t)-\beta (t)^{\top } x) \nu (dx,dy)dt +\int _0^1 (1-t)\alpha (t) dt \\ \le \inf _{(\alpha , \beta )} \int \Big (\sup _{v\in {\mathcal {C}}} \int _0^1 v(t)(y-\alpha (t)-\beta (t)^{\top }x)dt \Big ) \nu (dx,dy)+ \int _0^1 (1-t)\alpha (t) dt \end{aligned} \end{aligned}$$
It follows from Lemma 4.3 below that, for \(q\in L^1(0,1)\) defining \( Q(t):=\int _0^t q(s) ds\), one has
$$\begin{aligned} \sup _{v\in {\mathcal {C}}} \int _0^1 v(t) q(t)dt=\max _{t\in [0,1]} Q(t). \end{aligned}$$
So setting \(\varphi (t):=\int _0^t \alpha (s) ds\), \(b(t):=\int _0^t \beta (s)ds\) and remarking that integrating by parts immediately gives
$$\begin{aligned} \int _0^1 (1-t)\alpha (t) dt=\int _0^1 \varphi (t) dt, \end{aligned}$$
we have
$$\begin{aligned} \begin{aligned}&\sup _{v\in {\mathcal {C}}} \int _0^1 v(t)(y-\alpha (t)-\beta (t)^{\top }x)dt + \int _0^1 (1-t)\alpha (t) dt \\&\quad = \max _{t\in [0,1]} \{t y-\varphi (t)-b(t)^{\top } x\} +\int _0^1 \varphi (t) dt. \end{aligned} \end{aligned}$$
This yields
$$\begin{aligned} \sup (4.8) \le \inf _{(\varphi , b)} \int \max _{t\in [0,1]} ( ty- \varphi (t) -b(t)^{\top } x) \nu (dx, dy) +\int _0^1 \varphi (t) dt =\inf (4.3) \end{aligned}$$
but we know from (4.4) that \(\inf \) (4.3) =\(\sup \) (3.16) which ends the proof.
In the previous proof, we have used the elementary result (proven in the “Appendix”)
Lemma 4.3
Let \(q\in L^1(0,1)\) and define \(Q(t):=\int _0^t q(s) ds\) for every \(t\in [0,1]\), one has
$$\begin{aligned} \sup _{v\in {\mathcal {C}}} \int _0^1 v(t) q(t)dt=\max _{t\in [0,1]} Q(t). \end{aligned}$$

5 Vector quantiles, vector quantile regression and optimal transport

We now consider the case where Y is a random vector with values in \({ \mathbb {R}}^d\) with \(d\ge 2\). The notion of quantile does not have an obvious generalization in the multivariate setting; however, the various correlation maximization problems we have encountered in the previous sections still make sense (provided Y is integrable, say) in dimension d and are related to optimal transport theory. The aim of this section is to briefly summarize the optimal transport approach to quantile regression introduced in Carlier et al. (2016) and further analyzed in their follow-up 2017 paper.

5.1 Brenier’s map as a vector quantile

From now on we fix as a reference measure the uniform measure on the unit cube \([0,1]^d\), i.e.,
$$\begin{aligned} \mu _d:={\mathcal {U}}([0,1]^d) \end{aligned}$$
(5.1)
Given Y, an integrable \({\mathbb {R}}^d\)-valued random variable on \( (\Omega , {\mathcal {F}}, \mathbb {P})\), a remarkable theorem due to Brenier (1991) and extended by McCann (1995) implies that there exist a unique \( U\sim \mu _d\) and a unique (up to the addition of a constant) convex function \(\varphi \) defined on \([0,1]^d\) such that
$$\begin{aligned} Y=\nabla \varphi (U). \end{aligned}$$
(5.2)
The map \(\nabla \varphi \) is called the Brenier map between \(\mu _d\) and \({\mathscr {L}}(Y)\).
The convex function \(\varphi \) is not necessarily differentiable, but being convex it is differentiable at Lebesgue-a.e. point of \([0,1]^d\), so that \( \nabla \varphi (U)\) is well defined almost surely. It is worth at this point recalling that the Legendre transform of \(\varphi \) is the convex function:
$$\begin{aligned} \varphi ^*(y):=\sup _{u\in [0,1]^d} \{ u^{\top } y -\varphi (u)\} \end{aligned}$$
(5.3)
and that the subdifferentials of \(\varphi \) and \(\varphi ^*\) are defined, respectively, by
$$\begin{aligned} \partial \varphi (u):=\{y \in {\mathbb {R}}^d \; : \; \varphi (u)+\varphi ^*(y)=u^{\top } y\} \end{aligned}$$
and
$$\begin{aligned} \partial \varphi ^*(y):=\{u \in [0,1]^d \; : \; \varphi (u)+\varphi ^*(y)=u^{\top } y\} \end{aligned}$$
so that \(\partial \varphi \) and \(\partial \varphi ^*\) are inverse to each other in the sense that
$$\begin{aligned} y\in \partial \varphi (u) \Leftrightarrow u\in \partial \varphi ^*(y) \end{aligned}$$
which is often referred to in convex analysis as the Fenchel reciprocity formula9. Note then that (5.2) implies that
$$\begin{aligned} U\in \partial \varphi ^*(Y) \text { almost surely}. \end{aligned}$$
If both \(\varphi \) and \(\varphi ^*\) are differentiable, their subdifferentials reduce to the singletons formed by their gradients, and the Fenchel reciprocity formula simply gives \((\nabla \varphi )^{-1}=\nabla \varphi ^*\). Recalling that the subdifferential of the convex function \(\varphi \) is monotone in the sense that whenever \(y_1\in \partial \varphi (u_1)\) and \(y_2\in \partial \varphi (u_2)\) one has
$$\begin{aligned} (y_1-y_2)^{\top } (u_1-u_2)\ge 0, \end{aligned}$$
we see that gradients of convex functions are a generalization to the multivariate case of monotone univariate maps. It is therefore natural in view of (5.2) to define the vector quantile of Y as:
Definition 5.1
The vector quantile of Y is the Brenier map between \(\mu _d\) and \({ \mathscr {L}}(Y)\).
Now, it is worth noting that the Brenier map (and the uniformly distributed random vector U in (5.2)) are not abstract objects: they have a variational characterization related to optimal transport10. Consider indeed
$$\begin{aligned} \sup \{ {\mathbb {E}}(V^{\top } Y) \; : \; V\sim \mu _d\} \end{aligned}$$
(5.4)
and its dual
$$\begin{aligned} \inf _{f, g} \{\int _{[0,1]^d} f \text {d} \mu _d + {\mathbb {E}}(g(Y)) \; : \; f(u)+g(y) \ge u^{\top } y, \; \forall (u,y)\in [0,1]^d\times {\mathbb {R}}^d\} \end{aligned}$$
(5.5)
then U in (5.2) is the unique solution of (5.4) and any solution (f, g) of the dual (5.5) satisfies \(\nabla f=\nabla \varphi \), \(\mu _d\)-a.e.
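When both \(\mu _d\) and \({\mathscr {L}}(Y)\) are approximated by uniform discrete measures on the same number of points, (5.4) reduces to an optimal assignment problem, so a sample version of the vector quantile can be computed with a standard assignment solver. The following sketch (synthetic bivariate data, not the authors' implementation) matches a regular grid on \([0,1]^2\) with a sample of Y; the grid size and data-generating process are illustration choices.

```python
# Minimal sketch: empirical vector quantile (Brenier map) in dimension d=2,
# computed by solving the assignment version of (5.4) with the Hungarian
# algorithm (synthetic data, illustrative only).
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
n_side = 20
n = n_side ** 2
# regular grid approximating the uniform measure on [0,1]^2
g = (np.arange(n_side) + 0.5) / n_side
U = np.array([[a, b] for a in g for b in g])
# a bivariate sample of Y (correlated log-normal, for illustration)
Z = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n)
Y = np.exp(0.5 * Z)

# maximize sum_i u_i . y_sigma(i)  <=>  minimize the negated correlation
cost = -U @ Y.T                      # cost[i, j] = -u_i . y_j
row, col = linear_sum_assignment(cost)
# the map u_i -> Y[col[i]] is an empirical counterpart of the vector quantile
vector_quantile = Y[col]
print(vector_quantile[:3])
```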

5.2 Conditional vector quantiles

Assume now, as in paragraph 2.2, that we are also given a random vector \(X\in {\mathbb {R}}^N\). As in (2.4), we denote by \(\nu \) the law of (X, Y), by m the law of X and by \(\nu (.\vert x)\) the conditional law of Y given \(X=x\) (the only difference with (2.4) is that Y is \({\mathbb {R}}^d\)-valued). Conditional vector quantiles are then defined as follows:
Definition 5.2
For \(m={\mathscr {L}}(X)\)-a.e. \(x\in {\mathbb {R}}^N\), the vector conditional quantile of Y given \(X=x\) is the Brenier map between \(\mu _d:= {\mathcal {U}}([0,1]^d)\) and \(\nu (.\vert x):={\mathscr {L}}(Y\vert X=x)\). We denote this well-defined map as \(\nabla \varphi _x\) where \(\varphi _x\) is a convex function on \([0,1]^d\).
If both \(\varphi _x\) and its Legendre transform
$$\begin{aligned} \varphi _x^*(y):=\sup _{u\in [0,1]^d} \{u^{\top } y-\varphi _x(u)\} \end{aligned}$$
are differentiable11, one can define the random vector:
$$\begin{aligned} U:=\nabla \varphi _X^*(Y) \end{aligned}$$
which is equivalent to
$$\begin{aligned} Y=\nabla \varphi _X(U). \end{aligned}$$
(5.6)
One can check exactly as in the proof of Proposition 2.1 for the univariate case that if Y is integrable then
$$\begin{aligned} U\sim \mu _d, \; U\perp \! \! \! \perp X \end{aligned}$$
and U solves
$$\begin{aligned} \max \{{\mathbb {E}}(V^{\top } Y), \; V\sim \mu _d, \; V\perp \! \! \! \perp X\}. \end{aligned}$$
(5.7)

5.3 Vector quantile regression

When one assumes that the convex function \(\varphi _x\) is affine with respect to the explanatory variables x (specification):
$$\begin{aligned} \varphi _x(u)=\varphi (u)+ b(u)^{\top } x \end{aligned}$$
with \(\varphi \) : \([0,1]^d \rightarrow {\mathbb {R}}\) and b : \([0,1]^d \rightarrow {\mathbb {R }}^N\) smooth, the conditional quantile is itself affine and the relation (5.6) takes the form
$$\begin{aligned} Y=\nabla \varphi _X(U)=\alpha (U)+ \beta (U)X, \text { for } \alpha =\nabla \varphi , \; \beta :=Db^{\top }. \end{aligned}$$
(5.8)
This affine form moreover implies that U maximizes the correlation with Y not only among uniformly distributed random vectors independent from X, but also in the larger class of uniformly distributed random vectors for which12
$$\begin{aligned} {\mathbb {E}}(X\vert U)={\mathbb {E}}(X)=0. \end{aligned}$$
This is the reason why the study of
$$\begin{aligned} \max \{{\mathbb {E}}(V^{\top } Y), \; V\sim \mu _d, \; {\mathbb {E}}(X\vert V)=0\} \end{aligned}$$
(5.9)
is the main tool in the approach of (Carlier et al. 2016, 2017) to vector quantile regression. Let us now briefly summarize the main findings in these two papers. First observe that (5.9) can be recast as a linear program by setting \(\pi :={\mathscr {L}}(U, X,Y)\) and observing that U solves (5.9) if and only if \(\pi \) solves
$$\begin{aligned} \max _{\pi \in \mathop {\mathrm {MI}}\nolimits ( \mu _d, \nu )} \int _{ [0,1]^d\times {\mathbb {R}}^N\times {\mathbb {R}}^d} u^{\top } y \text {d} \pi (u,x,y) \end{aligned}$$
(5.10)
where \(\mathop {\mathrm {MI}}\nolimits (\mu _d, \nu )\) is the set of probability measures which satisfy the linear constraints:
  • the first marginal of \(\pi \) is \(\mu _d\), i.e., for every \(\varphi \in C([0,1]^d, {\mathbb {R}})\):
    $$\begin{aligned} \int _{ [0,1]^d \times {\mathbb {R}}^N\times {\mathbb {R}}^d} \varphi (u)\text {d} \pi (u,x,y)=\int _{[0,1]^d} \varphi (u) \text {d} \mu _d(u), \end{aligned}$$
  • the second marginal of \(\pi \) is \(\nu \), i.e., for every \(\psi \in C_b({ \mathbb {R}}^N\times {\mathbb {R}}^d, {\mathbb {R}})\):
    $$\begin{aligned} \begin{aligned} \int _{[0,1]^d \times {\mathbb {R}}^N\times {\mathbb {R}}^d} \psi (x,y)\text {d} \pi (u,x,y)&=\int _{{\mathbb {R}}^N\times {\mathbb {R}}^d} \psi (x,y) \text {d} \nu (x,y) \\&={\mathbb {E}}(\psi (X,Y)), \end{aligned} \end{aligned}$$
  • the conditional expectation of X given U is 0, i.e., for every \( b\in C([0,1]^d, {\mathbb {R}}^N)\):
    $$\begin{aligned} \int _{[0,1]^d \times {\mathbb {R}}^N\times {\mathbb {R}}^d} b(u)^{\top } x \text {d} \pi (u,x,y)=0. \end{aligned}$$
The dual of the linear program (5.10) then reads
$$\begin{aligned} \inf _{(\varphi ,\psi ,b)} \int _{[0,1]^d} \varphi \text {d} \mu _d+ \int _{{ \mathbb {R}}^N\times {\mathbb {R}}^d} \psi (x,y) \text {d} \nu (x,y) \end{aligned}$$
(5.11)
subject to the pointwise constraint
$$\begin{aligned} \varphi (u)+b(u)^{\top } x+\psi (x,y) \ge u^{\top } y \end{aligned}$$
given b and \(\varphi \), the smallest \(\psi \) satisfying this constraint is the (convex in y) function
$$\begin{aligned} \psi (x,y):=\sup _{u\in [0,1]^d} \{ u^{\top } y-\varphi (u)-b(u)^{\top } x\}. \end{aligned}$$
The existence of a solution \((\psi , \varphi , b)\) to (5.11) is established in Carlier et al. (2016) (under some assumptions on \(\nu \)) and optimality for U in (5.9) is characterized by the pointwise complementary slackness condition
$$\begin{aligned} \varphi (U)+b(U)^{\top } X+\psi (X,Y) = U^{\top } Y \text { almost surely}. \end{aligned}$$
If \(\varphi \) and b were smooth we could deduce from the latter that
$$\begin{aligned} Y=\nabla \varphi (U)+Db(U)^{\top } X=\nabla \varphi _X(U), \; \text { for } \varphi _x(u):=\varphi (u)+b(u)^{\top } x \end{aligned}$$
which is exactly (5.8). So specification of vector quantile regression is essentially the same as assuming this smoothness and the convexity of \(u\mapsto \varphi _x(u):=\varphi (u)+b(u)^{\top } x\). In general, these properties cannot be taken for granted and what can be deduced from complementary slackness is given by the weaker relations
$$\begin{aligned} \varphi _X(U)=\varphi _X^{**}(U), \; Y\in \partial \varphi _X^{**}(U) \text { almost surely,} \end{aligned}$$
where \(\varphi _x^{**}\) is the convex envelope of \(\varphi _x\) (i.e., the largest convex function below \(\varphi _x\)); we refer the reader to Carlier et al. (2017) for details.

6 Discretization, regularization, numerical minimization

6.1 Discrete optimal transport with a mean independence constraint

We now turn to a discrete setting for implementation purposes, and consider data \((x_j,y_j)_{j=1,\ldots ,J}\) distributed according to the empirical measure \( \nu =\sum _{j=1}^J \nu _j \delta _{(x_j,y_j)}\), and a \([0,1]^d\)-uniform sample \( (u_i)_{i=1, \ldots , I}\) with empirical measure \(\mu =\sum _{i=1}^I \mu _i \delta _{u_i}\). In this setting, the vector quantile regression primal (5.10) reads
$$\begin{aligned} \max _{\pi \in \mathbb {R}_{+}^{I\times J}} \sum _{i=1}^I \sum _{j=1}^J u_i^{\top } y_j \pi _{ij} \end{aligned}$$
subject to marginal constraints \(\forall j, \sum _{i} \pi _{ij} = \nu _j\) and \( \forall i, \sum _j \pi _{ij} = \mu _i\) and the mean-independence constraint between X and U: \(\forall i, \sum _j x_j \pi _{ij}=0\). Its dual formulation (5.11) reads
$$\begin{aligned} \inf _{(\varphi _i)_i,(\psi _j)_j,(b_i)_i} \sum _{j=1}^J \psi _j \nu _j + \sum _{i=1}^I \varphi _i \mu _i \end{aligned}$$
subject to the constraint
$$\begin{aligned} \forall i, j, \varphi _i + b_i^{\top } x_j + \psi _j \ge u_i^{\top } y_j. \end{aligned}$$
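As an illustration, the primal program of this subsection can be handed directly to a generic LP solver. The following sketch (small synthetic data with \(d=1\), centered regressors, not the authors' code) builds the marginal and mean-independence constraints and calls scipy.optimize.linprog; in general the mean-independence constraint reads \(\sum _j \pi _{ij}x_j=\mu _i\sum _j\nu _jx_j\), which reduces to 0 here because the \(x_j\) are centered.

```python
# Minimal sketch (illustrative): the discrete vector quantile regression
# primal of Sect. 6.1 solved as a generic LP with scipy.optimize.linprog.
# Small synthetic data; d = 1 so that u_i^T y_j is a product of scalars.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
I, J = 20, 60
U = (np.arange(I) + 0.5).reshape(I, 1) / I           # grid of quantile levels in [0,1]
mu = np.full(I, 1.0 / I)
X = rng.normal(size=(J, 1))
X = X - X.mean(axis=0)                               # center so that the sample mean of X is 0
Yv = (1.0 + X[:, 0] + 0.5 * rng.normal(size=J)).reshape(J, 1)
nu = np.full(J, 1.0 / J)

# variable pi is flattened so that pi[i*J + j] = pi_ij; maximize sum_ij pi_ij u_i.y_j
c = -(U @ Yv.T).ravel()
A_eq, b_eq = [], []
for j in range(J):                                   # sum_i pi_ij = nu_j
    row = np.zeros(I * J); row[j::J] = 1.0
    A_eq.append(row); b_eq.append(nu[j])
for i in range(I):                                   # sum_j pi_ij = mu_i
    row = np.zeros(I * J); row[i * J:(i + 1) * J] = 1.0
    A_eq.append(row); b_eq.append(mu[i])
for i in range(I):                                   # mean independence: sum_j pi_ij x_j = 0
    row = np.zeros(I * J); row[i * J:(i + 1) * J] = X[:, 0]
    A_eq.append(row); b_eq.append(0.0)
res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * (I * J), method="highs")
pi = res.x.reshape(I, J)
print("objective:", -res.fun)
```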

6.2 The regularized vector quantile regression (RVQR) problem

Using the optimality condition \(\varphi _{i}=\max _{j}u_{i}^{\top }y_{j}-b_{i}^{\top }x_{j}-\psi _{j}\), we obtain the unconstrained formulation
$$\begin{aligned} \inf _{(\psi _{j})_{j},(b_{i})_{i}}\sum _{j}\psi _{j}\nu _{j}+\sum _{i}\mu _{i}\left( \max _{j}u_{i}^{\top }y_{j}-b_{i}^{\top }x_{j}-\psi _{j}\right) . \end{aligned}$$
Replacing the maximum with its smoothed version13, given a small regularization parameter \(\varepsilon \), yields the smooth convex minimization problem (see Cuturi and Peyré (2016) for more details on the connection with entropic regularization of optimal transport), which we call the Regularized Vector Quantile Regression (RVQR) problem
$$\begin{aligned} \inf _{(\psi _{j})_{j},(b_{i})_{i}}J(\psi ,b):=\sum _{j}\psi _{j}\nu _{j}+\varepsilon \sum _{i}\mu _{i}\log \left( \sum _{j}\exp \left( \frac{1}{ \varepsilon }[u_{i}^{\top }y_{j}-b_{i}^{\top }x_{j}-\psi _{j}]\right) \right) . \end{aligned}$$
(6.1)
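The smoothed maximum used here is the standard log-sum-exp approximation: for any vector z, \(\varepsilon \log \sum _j \exp (z_j/\varepsilon )\) converges to \(\max _j z_j\) as \(\varepsilon \rightarrow 0\). A short numerical check of this fact (purely illustrative):

```python
# Minimal check of the smoothed maximum used in (6.1): as eps -> 0,
# eps * log(sum_j exp(z_j / eps)) approaches max_j z_j.
import numpy as np
from scipy.special import logsumexp

z = np.array([0.3, 1.7, -0.4, 1.2])
for eps in (1.0, 0.1, 0.01):
    print(eps, eps * logsumexp(z / eps), z.max())
```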
We then have the following duality result14:
Theorem 6.1
The RVQR problem
$$\begin{aligned} \max _{\pi _{ij}\ge 0}&\sum _{ij}\pi _{ij}\left( u_{i}^{\top }y_{j}\right) -\varepsilon \sum _{ij}\pi _{ij}\log \pi _{ij} \\&\sum _{j}\pi _{ij}=\mu _{i} \\&\sum _{i}\pi _{ij}=\nu _{j} \\&\sum _{j}\pi _{ij}x_{j}=\mu _{i}\sum _{j}\nu _{j}x_{j} \end{aligned}$$
has dual (6.1), or equivalently
$$\begin{aligned} \min _{\varphi _{i},\psi _{j},b_{i}}\sum _{i}\mu _{i}\varphi _{i}+\sum _{j}\psi _{j}\nu _{j}+\varepsilon \sum _{ij}\exp \left( \frac{1}{\varepsilon } [u_{i}^{\top }y_{j}-\varphi _{i}-b_{i}^{\top }x_{j}-\psi _{j}]\right) . \end{aligned}$$
Note that the objective J in (6.1) remains invariant under the two transformations
  • \((b,\psi ) \leftarrow (b+c, \psi -c^{\top } x)\), where \(c\in {\mathbb {R}}^N \) is a constant translation vector,
  • \(\psi \leftarrow \psi +\lambda \) where \(\lambda \in {\mathbb {R}}\) is a constant.
These two invariances enable us to fix the value \(b_1=0\) and (for instance) to choose \(\lambda \) in such a way that \(\sum _{i,j} \exp \left( \frac{1}{\varepsilon } [u_i^{\top } y_j - b_i^{\top } x_j - \psi _j] \right) =1\).
Remark 6.2
This formulation is amenable to stochastic optimization techniques when the number of (X, Y) observations is very large. Stochastic optimization w.r.t. \(\psi \) can be performed using the stochastic averaged gradient algorithm (see Genevay et al. 2016), considering the equivalent objective
$$\begin{aligned} \inf _{\psi , \varphi , b} \sum _j h_\varepsilon (x_j,y_j,\psi ,\varphi ,b)\nu _j \end{aligned}$$
with \(h_\varepsilon (x_j,y_j,\psi ,\varphi ,b)=\psi _j+\sum _i \mu _i \varphi _i + \varepsilon \sum _i \exp \left( \frac{1}{\varepsilon } [u_i^{\top } y_j - b_i^{\top } x_j - \psi _j - \varphi _i] \right) \). Such techniques are not needed to compute b since the number of U samples (i.e., the size of b) is set by the user.

6.3 Gradient descent

As already noted the objective J in (6.1) is convex15 and smooth. Its gradient has the explicit form
$$\begin{aligned} \frac{\partial J}{\partial \psi _j}:=\nu _j-\sum _{i=1}^I \mu _i \frac{ e^{\theta _{ij}} }{\sum _{k=1}^J e^{\theta _{ik} } } \text { where } \theta _{ij}=\frac{1}{\varepsilon } [u_i^{\top } y_j - b_i^{\top } x_j - \psi _j] \end{aligned}$$
(6.2)
and
$$\begin{aligned} \frac{\partial J}{\partial b_i}:=- \mu _i \frac{\sum _{k=1}^J x_k e^{\theta _{ik}} }{\sum _{k=1}^J e^{\theta _{ik}}}. \end{aligned}$$
(6.3)
To solve (6.1) numerically, we can therefore use a gradient descent method. An efficient way to do so is Nesterov's accelerated gradient algorithm; see Nesterov (1983) and Beck and Teboulle (2009). Note that if \((\psi , b) \) solves (6.1), the fact that the partial derivatives in (6.2)–(6.3) vanish implies that the coupling
$$\begin{aligned} \alpha ^\varepsilon _{ij}:=\mu _i \frac{ e^{\theta _{ij}} }{\sum _{k=1}^J e^{\theta _{ik} } } \end{aligned}$$
satisfies the fixed-marginal and mean-independence constraints of the primal problem.
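Here is a minimal numerical sketch of this procedure (synthetic data, not the authors' implementation): it evaluates J and the gradients (6.2)–(6.3) and runs a Nesterov-type accelerated gradient loop. The constant step size of order \(\varepsilon \), the helper names and the synthetic data-generating process are illustration choices only; a proper implementation would use a line search or an explicit Lipschitz estimate.

```python
# Minimal sketch: solving (6.1) by Nesterov-accelerated gradient descent,
# using the gradient formulas (6.2)-(6.3).  Synthetic data, d = 1, X centered.
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(4)
I, J, eps = 20, 200, 0.1
U = ((np.arange(I) + 0.5) / I).reshape(I, 1)           # grid of quantile levels
mu = np.full(I, 1.0 / I)
X = rng.normal(size=(J, 1)); X -= X.mean(axis=0)
Y = (1.0 + X[:, 0] + 0.5 * rng.normal(size=J)).reshape(J, 1)
nu = np.full(J, 1.0 / J)

def J_obj_grad(psi, b):
    theta = (U @ Y.T - b @ X.T - psi[None, :]) / eps   # theta_ij as in (6.2)
    val = psi @ nu + eps * mu @ logsumexp(theta, axis=1)
    P = softmax(theta, axis=1)                         # rows sum to one
    g_psi = nu - mu @ P                                # gradient (6.2)
    g_b = -(mu[:, None] * (P @ X))                     # gradient (6.3)
    return val, g_psi, g_b

# Nesterov (FISTA-type) accelerated gradient descent on (psi, b)
psi, b = np.zeros(J), np.zeros((I, 1))
psi_m, b_m = psi.copy(), b.copy()                      # momentum variables
lr, t_acc = 0.5 * eps, 1.0                             # small constant step
for _ in range(2000):
    val, g_psi, g_b = J_obj_grad(psi_m, b_m)
    psi_new, b_new = psi_m - lr * g_psi, b_m - lr * g_b
    t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_acc ** 2))
    psi_m = psi_new + (t_acc - 1.0) / t_new * (psi_new - psi)
    b_m = b_new + (t_acc - 1.0) / t_new * (b_new - b)
    psi, b, t_acc = psi_new, b_new, t_new
print("final objective:", J_obj_grad(psi, b)[0])
```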
Since the index j corresponds to observations, it is convenient to introduce, for every \(x\in {\mathcal {X}}:=\{x_1, \ldots , x_J\}\) and \(y\in {\mathcal {Y}}:=\{y_1, \ldots , y_J\}\), the probability
$$\begin{aligned} \pi ^\varepsilon (x,y, u_i):=\sum _{j \; : \; x_j=x, \; y_j=y} \alpha ^\varepsilon _{ij}. \end{aligned}$$
Table 1
Relative error between one-dimensional VQR with a “soft” computation of \(\varphi \) and its “hard” counterpart, with \(Y_1=\) Weight and \(X=\) (1, Height), for different height quantiles (10%, 30%, 60%, 90%), depending on regularization strengths \(\varepsilon \). Chosen grid size is \(n=20\)

\(\varepsilon \) | 0.05 | 0.1 | 0.5 | 1
\(||Q_{soft}-Q_{hard} ||_2/||Q_{soft} ||_2\), \(X=10\%\) | 3.8\(\cdot 10^{-3}\) | 1.5\(\cdot 10^{-2}\) | 6.7\(\cdot 10^{-2}\) | 9.2\(\cdot 10^{-2}\)
\(||Q_{soft}-Q_{hard} ||_2/||Q_{soft} ||_2\), \(X=30\%\) | 6.8\(\cdot 10^{-3}\) | 1.9\(\cdot 10^{-2}\) | 7.0\(\cdot 10^{-2}\) | 9.3\(\cdot 10^{-2}\)
\(||Q_{soft}-Q_{hard} ||_2/||Q_{soft} ||_2\), \(X=60\%\) | 1.2\(\cdot 10^{-2}\) | 2.0\(\cdot 10^{-2}\) | 6.9\(\cdot 10^{-2}\) | 9.5\(\cdot 10^{-2}\)
\(||Q_{soft}-Q_{hard} ||_2/||Q_{soft} ||_2\), \(X=90\%\) | 1.6\(\cdot 10^{-2}\) | 2.3\(\cdot 10^{-2}\) | 6.8\(\cdot 10^{-2}\) | 9.5\(\cdot 10^{-2}\)

Table 2
Relative error between one-dimensional VQR and the classical QR approach, with \(Y_1=\) Weight and \(X=\) (1, Height), for different height quantiles (10%, 30%, 60%, 90%), depending on regularization strengths \(\varepsilon \). Chosen grid size is \(n=20\)

\(\varepsilon \) | 0.05 | 0.1 | 0.5 | 1
\(||Q_{QR}-Q_{VQR} ||_2/||Q_{QR} ||_2\), \(X=10\%\) | 9.8\(\cdot 10^{-3}\) | 9.8\(\cdot 10^{-3}\) | 2.8\(\cdot 10^{-2}\) | 3.8\(\cdot 10^{-2}\)
\(||Q_{QR}-Q_{VQR} ||_2/||Q_{QR} ||_2\), \(X=30\%\) | 8.5\(\cdot 10^{-3}\) | 1.1\(\cdot 10^{-2}\) | 3.3\(\cdot 10^{-2}\) | 4.3\(\cdot 10^{-2}\)
\(||Q_{QR}-Q_{VQR} ||_2/||Q_{QR} ||_2\), \(X=60\%\) | 7.7\(\cdot 10^{-3}\) | 9.3\(\cdot 10^{-3}\) | 3.1\(\cdot 10^{-2}\) | 4.4\(\cdot 10^{-2}\)
\(||Q_{QR}-Q_{VQR} ||_2/||Q_{QR} ||_2\), \(X=90\%\) | 8.2\(\cdot 10^{-3}\) | 1.0\(\cdot 10^{-2}\) | 3.5\(\cdot 10^{-2}\) | 4.9\(\cdot 10^{-2}\)

7 Results

Quantiles computation The discrete probability \(\pi ^\varepsilon \) is an approximation (because of the regularization \(\varepsilon \)) of \({\mathscr {L}} (U,X,Y)\) where U solves (5.9). The corresponding approximate quantile \(Q^\varepsilon _X(U)\) is given by \({\mathbb {E}}_{\pi ^\varepsilon } [Y|X,U]\). In the above discrete setting, this yields
$$\begin{aligned} Q^\varepsilon _x(u_i):={\mathbb {E}}_{\pi ^\varepsilon } [Y|X=x,U=u_i] =\sum _{y\in {\mathcal {Y} }} y \frac{\pi ^\varepsilon (x,y,u_i) }{ \sum _{y^{\prime }\in {\mathcal {Y}}} \pi ^\varepsilon (x,y^{\prime },u_i) }. \end{aligned}$$
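To make this concrete, the following sketch (reusing numpy and the shape conventions of the sketch in Sect. 6.3) forms the coupling \(\alpha ^\varepsilon \) from the fitted \((\psi , b)\) and evaluates \(Q^\varepsilon _x(u_i)\) by the formula above, summing \(\alpha ^\varepsilon \) over the observations j with \(x_j=x\), which is exactly how \(\pi ^\varepsilon \) aggregates them; the helper names, and the assumption that the value x occurs among the observations, are ours.
```python
def coupling(psi, b, X, Y, U, mu, eps):
    """alpha^eps_{ij} = mu_i * e^{theta_ij} / sum_k e^{theta_ik}, cf. Sect. 6.3."""
    theta = (U @ Y.T - b @ X.T - psi[None, :]) / eps
    theta -= theta.max(axis=1, keepdims=True)
    w = np.exp(theta)
    w /= w.sum(axis=1, keepdims=True)
    return mu[:, None] * w                                # shape (I, J)

def approx_quantile(alpha, X, Y, x):
    """Q^eps_x(u_i) = E_{pi^eps}[Y | X = x, U = u_i]; requires at least one
    observation j with x_j = x."""
    mask = np.all(np.isclose(X, np.asarray(x)), axis=1)   # observations sharing the value x
    num = alpha[:, mask] @ Y[mask]                        # shape (I, d)
    den = alpha[:, mask].sum(axis=1, keepdims=True)       # shape (I, 1)
    return num / den                                      # one d-dimensional quantile per u_i
```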
Remark 7.1
To estimate the conditional distribution of Y given \(U=u\) and \(X=x\), we can use kernel methods. In the experiments, we compute approximate quantiles as means over neighborhoods of X values, to make up for the lack of replicates. This amounts to considering \({\mathbb {E}} _{\pi ^{\varepsilon }}[Y|X\in B_{\eta }(x),U=u_{i}]\), where \(B_{\eta }(x)\) is a Euclidean ball of radius \(\eta \) centered at x.
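A possible implementation of this neighborhood variant simply replaces the exact match of the previous sketch by a Euclidean ball of radius \(\eta \); the radius and the helper name are ours.
```python
def approx_quantile_ball(alpha, X, Y, x, eta):
    """E_{pi^eps}[Y | X in B_eta(x), U = u_i], as in Remark 7.1."""
    mask = np.linalg.norm(X - np.asarray(x), axis=1) <= eta   # x_j within distance eta of x
    num = alpha[:, mask] @ Y[mask]
    den = alpha[:, mask].sum(axis=1, keepdims=True)
    return num / den
```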
Empirical illustrations We demonstrate the use of this approach on a series of health-related experiments. We use the “ANSUR II” dataset (Anthropometric Survey of US Army Personnel), which can be found online16. This dataset is one of the most comprehensive publicly available datasets on body size and shape, containing 93 measurements for 4,082 adult male US military personnel. It allows us to easily build multivariate dependent variables.
One-dimensional VQR We start with one-dimensional dependent variables (\(d=1\)), namely Weight (\(Y_1\)) and Thigh circumference (\(Y_2\)), each explained by \(X=\) (1, Height), to allow for a comparison with the classical quantile regression of Koenker and Bassett Jr (1978). Figure 1 compares the results of our method with the classical approach for different height quantiles (10%, 30%, 60%, 90%). Figure 1 is computed with a “soft” potential \(\varphi \), while Table 1 reports the difference with its “hard” counterpart (see the beginning of Sect. 6.2). Figure 2 and Table 2 detail the impact of the regularization strength on these quantiles.
Multi-dimensional VQR In contrast, multivariate quantile regression explains the joint vector \(Y=(Y_1,Y_2)\) by \(X=\) (1, Height). Figures 4 and 5 (each corresponding to one explained component, either \(Y_1\) or \(Y_2\)) depict how the smoothing operates in higher dimension for different height quantiles (10%, 50% and 90%), compared with the unregularized approach of Carlier et al. (2016). Figure 3 reports computational times in the two-dimensional case, obtained on an Intel(R) Core(TM) i7-7500U CPU at 2.70 GHz.


Appendix

Proof of Lemma 4.3

Since \(\mathbf {1}_{[0,t]}\in {\mathcal {C}}\), one obviously first has
$$\begin{aligned} \sup _{v\in {\mathcal {C}}}\int _{0}^{1}v(s)q(s)ds\ge \max _{t\in [0,1]}\int _{0}^{t}q(s)ds=\max _{t\in [0,1]}Q(t). \end{aligned}$$
Let us now prove the converse inequality, taking an arbitrary \(v\in {\mathcal {C}}\). We first observe that Q is absolutely continuous and that v is of bounded variation (its derivative in the sense of distributions being a bounded nonpositive measure, which we denote by \(\eta \)); integrating by parts and using the definition of \({\mathcal {C}}\) then gives:
$$\begin{aligned} \int _{0}^{1}v(s)q(s)ds&=-\int _{0}^{1}Q\eta +v(1^{-})Q(1) \\&\le (\max _{[0,1]}Q)\times (-\eta ([0,1]))+v(1^{-})Q(1) \\&=(\max _{[0,1]}Q)(v(0^{+})-v(1^{-}))+v(1^{-})Q(1) \\&=(\max _{[0,1]}Q)v(0^{+})+(Q(1)-\max _{[0,1]}Q)v(1^{-}) \\&\le \max _{[0,1]}Q. \end{aligned}$$
Footnotes
1
Whenever we write a variable in brackets after a constraint, as \([V_t]\) in (1.1), we mean that this variable plays the role of a multiplier.
 
2
It may seem awkward to start with the “dual” formulation before giving the “primal” one; since the “primal” is the dual of the “dual,” this choice of labeling is somewhat arbitrary. However, our choice is motivated by consistency with optimal transport theory, introduced below.
 
3
One way to define the nonatomicity of \((\Omega , {\mathcal {F}}, \mathbb {P})\) is by the existence of a uniformly distributed random variable on this space; this ensures that the space is rich enough for random variables with a prescribed law to exist. If, on the contrary, the space is finite, for instance, only finitely supported probability measures can be realized as the law of such random variables.
 
4
In fact, for (2.3) to make sense one needs some integrability of Y, i.e., \({\mathbb {E}}(\vert Y\vert ) <+\infty \).
 
5
If \({\mathbb {E}}(\Vert X\Vert ^2)<+\infty \), then (3.11) amounts to the standard requirement that \({\mathbb {E}}(X X^{\top })\) is nonsingular.
 
6
Uniqueness will be discussed later on.
 
7
If quantile regression is specified and the pair of functions \((\alpha , \beta )\) is as in Definition 3.1, then, for every t, \((\alpha (t), \beta (t))\) satisfies conditions (3.13). This shows that specification implies quasi-specification.
 
8
With a slight abuse of notation, when a reference number (A) refers to a maximization (minimization) problem, we simply write \(\sup (A)\) (respectively \(\inf (A)\)) to denote the value of this optimization problem.
 
9
Note the analogy with the fact that, in the univariate case, the cdf and the quantile function of Y are generalized inverses of each other.
 
10
In the case where \({\mathbb {E}}(\Vert Y\Vert ^2)<+\infty \), (5.4) is equivalent to minimizing \({\mathbb {E}}(\Vert V- Y\Vert ^2)\) among uniformly distributed V’s.
 
11
A deep regularity theory initiated by Caffarelli (1992) in the 1990s gives conditions on \(\nu (.\vert x)\) under which the optimal transport map is indeed smooth and/or invertible; we refer the interested reader to the textbook of Figalli (2017) for a detailed and recent account of this regularity theory.
 
12
Here we assume that both X and Y are integrable.
 
13
Recall that the softmax with regularization parameter \(\varepsilon >0\) of \((\alpha _{1},\ldots ,\alpha _{J})\) is given by \({\mathrm {Softmax}}_{\varepsilon }(\alpha _{1},\ldots \alpha _{J}):=\varepsilon \log (\sum _{j=1}^{J}e^{\frac{\alpha _{j}}{\varepsilon } })\).
 
14
This can be proved either by using the Fenchel–Rockafellar duality theorem or by hand. Indeed, in the primal there are only finitely many linear constraints, and the nonnegativity constraints are not binding because of the entropy; the existence of Lagrange multipliers for the equality constraints is then straightforward.
 
15
It is even strictly convex once we have chosen normalizations that take into account the two invariances of J explained above.
 
References
Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009;2(1):183–202.
Brenier Y. Polar factorization and monotone rearrangement of vector-valued functions. Commun. Pure Appl. Math. 1991;44(4):375–417.
Caffarelli L. The regularity of mappings with a convex potential. J. Am. Math. Soc. 1992;5(1):99–104.
Carlier G, Chernozhukov V, Galichon A. Vector quantile regression: an optimal transport approach. Ann. Statist. 2016;44(3):1165–92.
Carlier G, Chernozhukov V, Galichon A. Vector quantile regression beyond the specified case. J. Multivariate Anal. 2017;161:96–102.
Cuturi M, Peyré G. A smoothed dual approach for variational Wasserstein problems. SIAM J. Imaging Sci. 2016;9(1):320–43.
Figalli A. The Monge–Ampère equation and its applications. Zurich Lectures in Advanced Mathematics. Zurich: European Mathematical Society (EMS); 2017.
Genevay A, Cuturi M, Peyré G, Bach F. Stochastic optimization for large-scale optimal transport. Advances in Neural Information Processing Systems. 2016;3440–8.
Koenker R, Bassett G Jr. Regression quantiles. Econometrica. 1978;46(1):33–50.
McCann R. Existence and uniqueness of monotone measure preserving maps. Duke Math. J. 1995;80(2):309–23.
Nesterov Y. A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Dokl. Akad. Nauk SSSR. 1983;269(3):543–7.
Ryff J. Measure preserving transformations and rearrangements. J. Math. Anal. Appl. 1970;31:449–58.