3.1 Kernel density estimation of matrix-variate data
Let \(X_{1}, \dots , X_{N}\) be a sample of i.i.d. realizations of a \(P\times T\) random matrix \(X =\{x_{p,t}\}_{p=1, \ldots , P, t=1, \ldots ,T}\), which we shall assume to be defined on the vector space \(M_{P\times T} \subseteq \mathbb {R}^{P,T}\). The (unknown) distribution of X is naturally described by some probability density function \(f: M_{P\times T} \mapsto \mathbb {R}_+, \) with \(\int _{ M_{P\times T} } f(X) \mathrm {d}X = 1\). Since we are dealing with a vector space, the integral is understood as the component-wise integral of f on its support.
Consider an integrable kernel \(K : \mathbb {R}^{P,T} \mapsto \mathbb {R}_{+}\), with unit integral and spherically symmetric, so that, in particular, \(\int _{M_{P\times T}}X K(X)\mathrm {d}X = 0\). We define the kernel density estimator for matrix-variate data as
$$\begin{aligned} \hat{f}(X; h) = \frac{1}{Nh^{P\cdot T}}\sum _{n = 1}^{N}K(h^{-1}(X - X_{n})), \qquad h > 0. \end{aligned}$$
(2)
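For concreteness, the following minimal sketch evaluates (2) at a single point with the spherically symmetric Gaussian kernel \(K(Z) = (2\pi )^{-P\cdot T/2}\exp (-\Vert Z\Vert _F^2/2)\); the function name and array layout are illustrative assumptions, not code from an accompanying implementation.

```python
import numpy as np

def matrix_kde(Y, X, h):
    """Evaluate the matrix-variate KDE (2) at a P x T point Y.

    X is the sample, an array of shape (N, P, T); h is the scalar
    bandwidth. The Gaussian kernel depends on the rescaled
    differences only through their Frobenius norm.
    """
    N, P, T = X.shape
    d = P * T
    sq = np.sum(((Y - X) / h) ** 2, axis=(1, 2))   # squared Frobenius norms
    kernel_vals = (2 * np.pi) ** (-d / 2) * np.exp(-sq / 2)
    return kernel_vals.sum() / (N * h ** d)
```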
With the above convention of defining matrix-variate integrals component-wise, and likewise for derivatives, most standard results on kernel density estimators extend naturally to the matrix-variate setting. The Mean Integrated Squared Error (MISE) readily admits the usual representation (e.g. Chacón and Duong 2018, p. 28)
$$\begin{aligned} \text {MISE}&(\hat{f}(X; h)) = \mathbb {E}\int _{\mathbb {R}^{P,T}}(\hat{f}(X;h) - f(X))^{2}dX \nonumber \\&= \int _{\mathbb {R}^{P,T}}\text {Var}(\hat{f}(X;h))dX + \int _{\mathbb {R}^{P,T}}\text {Bias}^{2}(\hat{f}(X; h))dX \nonumber \\&= \text {IV}(\hat{f}(X;h)) + \text {ISB}(\hat{f}(X;h)) \end{aligned}$$
(3)
where \(\text {IV}(\cdot )\) is the integrated variance and \(\text {ISB}(\cdot )\) the integrated squared bias of the estimator. The dependence of the MISE on the bandwidth is nonetheless not easily made explicit, as the latter enters implicitly via the integrals involving the kernel. To highlight the effect of the bandwidth, it is useful to derive an asymptotic approximation of the MISE. To this aim, we further assume the following:
(i) f is square integrable and twice differentiable, with all its second-order partial derivatives bounded, continuous and square integrable;
(ii) The kernel K is, in turn, square integrable, with finite second-order moments \(\int X\otimes X K(X)\mathrm {d}X = m_{2}(K)\mathrm {vec}\,\mathbb {I}_{P\times T}\), where \(m_{2}(K) = \int x_{p,t}^{2}K(X)\mathrm {d}X\), \(p = 1, \dots , P\), \(t = 1, \ldots , T\). The symbol \(\otimes \) here denotes the Kronecker product, while \(\mathbb {I}_{d}\) denotes the d-dimensional identity matrix;
(iii) The bandwidths \(h = h_{N}\) form a positive sequence such that \(h \rightarrow 0\) and \(N^{-1}h^{-P\cdot T} \rightarrow 0\) as \(N \rightarrow \infty \).
Then, the following holds.
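While the formal statement is given as Proposition 1, the expansion one expects by direct analogy with the vector-valued case (Chacón and Duong 2018) is, as a sketch under (i)–(iii) rather than a verbatim reproduction,
$$\begin{aligned} \text {AMISE}(\hat{f}(\cdot \,; h)) = \frac{R(K)}{Nh^{P\cdot T}} + \frac{h^{4}m_{2}(K)^{2}}{4}R(\Delta f), \qquad R(\varphi ) = \int \varphi (X)^{2}\mathrm {d}X, \end{aligned}$$
with \(\Delta \) the Laplacian; minimising over h yields a bandwidth of order \(N^{-\frac{1}{(P\cdot T)+4}}\), in line with the rate recalled below.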
Hence, as for vector-valued data, the approximately optimal bandwidth converges to zero, as N increases, at the rate \(N^{-\frac{1}{(P\cdot T)+4}}\). Also, the optimal solution (4) relies on the knowledge of the true f, and hence cannot be directly used to define the optimal amount of smoothing. Consistently with the vector case, automatic bandwidth selection can be built by first estimating either the MISE or its asymptotic version (AMISE), and then minimising such an estimate to yield a bandwidth computed solely from the data. Standard approaches based on cross-validation, bootstrap, or on replacing the target density with a given parametric model in the MISE/AMISE expressions can be easily extended to the matrix-variate framework.
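As an illustration of the last strategy, a normal-reference (rule-of-thumb) selector plugs a Gaussian reference density into the AMISE; the sketch below applies the standard multivariate normal-scale rule with dimension \(d = P\cdot T\) and an averaged entry-wise scale, a heuristic rather than the selector adopted in the paper.

```python
import numpy as np

def normal_reference_bandwidth(X):
    """Rule-of-thumb scalar bandwidth for matrix-variate KDE.

    X has shape (N, P, T). Applies the multivariate normal-scale
    rule h = sigma * (4/(d+2))^(1/(d+4)) * N^(-1/(d+4)) with
    d = P*T and sigma the average entry-wise standard deviation.
    """
    N, P, T = X.shape
    d = P * T
    sigma = X.reshape(N, d).std(axis=0, ddof=1).mean()
    return sigma * (4.0 / (d + 2)) ** (1.0 / (d + 4)) * N ** (-1.0 / (d + 4))
```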
In fact, as in the standard multivariate setting, the use of a scalar bandwidth h may result in poor flexibility, and richer classes of parameterizations may alternatively be considered. The maximal extent of flexibility would require the awkward use of a four-way structure whose entries would reflect all the possible covariances between pairs of the components of X. Alternatively, the vectorization operator may serve this aim more easily, by mapping \(\mathbb {R}^{P,T}\) to \(\mathbb {R}^{P\cdot T}\) and stacking the column vectors of X underneath each other in order from left to right. With this representation, a full, unconstrained bandwidth H takes the form of a symmetric, positive semidefinite \(P\cdot T\times P \cdot T\) matrix, yet with some limitations from the algebraic and computational points of view.
A remarkable simplification may be induced by certain kernels, for which a separable structure of H is available, so that an equivalent specification represents the matrix-variate \(P\times T \) kernel as a special case of a \(PT\)-variate kernel with bandwidth \({H} = U \otimes V,\) with U and V symmetric positive definite matrices of dimension \(P\times P\) and \(T \times T\), respectively. Elliptical models belong to this family; their scaled version takes the form
$$\begin{aligned} |U|^{-\frac{T}{2}}|V|^{-\frac{P}{2}}K\left( U^{-1/2}(X-X_n)V^{-1/2}\right) = |U|^{-\frac{T}{2}}|V|^{-\frac{P}{2}}g\left( -\frac{1}{2}\mathrm {tr}(V^{-1}(X-X_n)^{\top }U^{-1}(X-X_n))\right) , \end{aligned}$$
where \(g:\mathbb {R}\mapsto \mathbb {R}_{+}\) is such that \(\int _{\mathbb {R}_+} z^{P\cdot T-1}g(z^2)dz < \infty \) (Caro-Lopera et al. 2016). The matrices U and V act independently on the rows and columns, and are easier to handle than a full specification of the matrix H, which may prove challenging even when P and T are small. By following the same steps as in Proposition 1, we may obtain the expression of the AMISE when \(H = U \otimes V\). In this case, the first term depends on the determinant \(|H|^{-1/2}\) instead of \(h^{-(P\cdot T)}\), while the second term involves the full Hessian instead of the simple Laplacian. The general result, however, does not lead to an explicit formula for the optimal bandwidth matrix.
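In vectorised coordinates, the corresponding expression, sketched here by analogy with the multivariate formula of Chacón and Duong (2018) rather than reproduced from the proposition, reads
$$\begin{aligned} \text {AMISE}(\hat{f}(\cdot \,; H)) = \frac{R(K)}{N|H|^{1/2}} + \frac{m_{2}(K)^{2}}{4}\int _{\mathbb {R}^{P,T}}\mathrm {tr}^{2}\left( H\,\mathsf {H}f(X)\right) \mathrm {d}X, \end{aligned}$$
where \(\mathsf {H}f\) denotes the Hessian of f with respect to \(\mathrm {vec}(X)\).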
Note that the simplest case, where \(U = h_{U}^{2}\mathbb {I}_{P}\) and \(V = h_{V}^{2}\mathbb {I}_{T}\), reduces to the form
$$\begin{aligned} (h_{U}h_{V})^{-P\cdot T}K\left( (h_{U}h_{V})^{-1}(X - X_n)\right) = (h_{U}h_{V})^{-P\cdot T}g\left( -\frac{1}{2(h_{U}h_{V})^{2}}\mathrm {tr}((X - X_{n})^{\top }(X - X_{n}))\right) , \end{aligned}$$
hence the choice of two distinct smoothing parameters for rows and columns brings back to the scalar case of (2), with \(h = h_{U}h_{V}\), as an effect of the separability of the scale matrix H.
Within the class of elliptical kernels, we may set \(g(\cdot ) = (2\pi )^{-\frac{P\cdot T}{2}} \exp (\cdot )\) and obtain the matrix Normal density, a natural candidate for the kernel function, which plays a pivotal role in the matrix-variate framework, as the Gaussian does in the univariate and multivariate settings (see, e.g., Gupta and Nagar 2018).
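A sketch of the resulting estimator with a separable bandwidth is given below; the Kronecker structure \(H = U\otimes V\) is exploited through the trace identity, so that the \(P\cdot T\times P\cdot T\) matrix H is never formed. Function and variable names are illustrative.

```python
import numpy as np

def matrix_normal_kde(Y, X, U, V):
    """KDE at Y with a matrix normal kernel and bandwidth H = U kron V.

    U (P x P) and V (T x T) are symmetric positive definite and act
    on the rows and columns respectively. Each summand is the matrix
    normal density with mean X_n and row/column covariances U, V.
    """
    N, P, T = X.shape
    Uinv, Vinv = np.linalg.inv(U), np.linalg.inv(V)
    # normalising constant (2 pi)^(-PT/2) |U|^(-T/2) |V|^(-P/2)
    c = ((2 * np.pi) ** (-P * T / 2)
         * np.linalg.det(U) ** (-T / 2)
         * np.linalg.det(V) ** (-P / 2))
    Z = X - Y                                     # differences, shape (N, P, T)
    # quadratic forms tr(V^{-1} Z_n^T U^{-1} Z_n), one per observation
    q = np.einsum('nps,pq,nqt,st->n', Z, Uinv, Z, Vinv)
    return c * np.mean(np.exp(-0.5 * q))
```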
3.1.1 Adaptive kernel
As an overall problem shared by nonparametric tools, kernel estimators are known to suffer severely from the curse of dimensionality. On one side, the sample size required to achieve an acceptable accuracy becomes disproportionately large as the dimension increases, leading to problems that are intractable even computationally. On the other side, in high dimensions the sparsity of the data causes much of the probability mass to flow to the tails of the density, possibly averaging away features in the highest-density regions and giving rise to spurious modes.
These arguments could discourage the application of modal clustering to matrix-variate data, which are intrinsically high-dimensional except for very small values of P and T. In fact, nonparametric estimates can still be useful to coarsely describe the data structure, and allowing different amounts of smoothing is often advisable to capture local structures of the data. In this direction, adaptive estimators build on the idea that in data-sparse regions a large bandwidth is needed to compensate for the few nearby data points, whereas in data-dense regions a small bandwidth suffices, owing to the presence of many nearby data points. As a general principle, we may distinguish between balloon and sample point estimators, which replace h in equation (2) by h(X) and \(h(X_n)\) respectively. See Chacón and Duong (2018, §4.1) for an overview in the multivariate setting. Within these classes, we consider, for the matrix-variate setting, a k-nearest neighbour (k-NN) extension of the two estimators, defined as
$$\begin{aligned} \hat{f}_B(X; k) = \frac{1}{N\delta _{k}(X)^{P \cdot T}}\sum _{n = 1}^{N}K\left( {\delta _{k}(X)}^{-1}(X - X_{n})\right) , \end{aligned}$$
(5)
$$\begin{aligned} \hat{f}_{SP}(X; k) = \frac{1}{N}\sum _{n = 1}^{N}\frac{1}{\delta _{k}(X_n)^{P \cdot T}} K\left( \delta _{k}(X_n)^{-1}(X - X_{n})\right) , \end{aligned}$$
(6)
where \(\delta _{k}(X) = || X - X^{(k)}||_F\) is the Frobenius distance of X from its k-th nearest neighbour \(X^{(k)}\).
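A minimal sketch of the sample point estimator (6) with a Gaussian kernel follows; the quadratic-cost distance computation and the names are illustrative, and the k-th neighbour is taken within the sample itself (the zero self-distance occupying position 0 of each sorted row).

```python
import numpy as np

def sample_point_kde(Y, X, k):
    """Sample-point k-NN estimator (6) at a P x T point Y.

    delta[n] is the Frobenius distance from X_n to its k-th nearest
    neighbour in the sample; a Gaussian kernel is assumed.
    """
    N, P, T = X.shape
    d = P * T
    # pairwise squared Frobenius distances between sample matrices
    D = np.sum((X[:, None] - X[None]) ** 2, axis=(2, 3))
    delta = np.sqrt(np.sort(D, axis=1)[:, k])    # k-th NN, self excluded
    sq = np.sum((Y - X) ** 2, axis=(1, 2)) / delta ** 2
    kernel_vals = (2 * np.pi) ** (-d / 2) * np.exp(-sq / 2)
    return np.mean(kernel_vals / delta ** d)
```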
3.2 Mean shift for matrix-variate data
Once the density has been estimated via the matrix-variate extension discussed so far, clusters can be associated with the domains of attraction of the modes of such a density, understood as high-density subsets of the sample space surrounding the (matrix-variate) local maxima of the density.
In this regard, the following proposition states that the hill-climbing property of the mean-shift algorithm still holds in the matrix-variate setting.
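For orientation, with a scalar bandwidth h and an elliptical kernel with profile \(\kappa \), writing \(K(Z) \propto \kappa (\mathrm {tr}(Z^{\top }Z))\) (an assumed notation), the mean shift denoted by (8) takes, as a sketch by analogy with the vector-valued case, the familiar form
$$\begin{aligned} M(Y) = \frac{\sum _{n=1}^{N}\kappa '\left( \mathrm {tr}((X_{n} - Y)^{\top }(X_{n} - Y))/h^{2}\right) (X_{n} - Y)}{\sum _{n=1}^{N}\kappa '\left( \mathrm {tr}((X_{n} - Y)^{\top }(X_{n} - Y))/h^{2}\right) }, \end{aligned}$$
so that \(Y + M(Y)\) is a weighted average of the observations.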
While losing its interpretation as an iterative weighted average of the observations, the gradient ascent nature of the mean shift may also be established for more complex bandwidth structures. When a kernel function with separable structure \(H= U\otimes V\) is used, for instance, (8) becomes
$$\begin{aligned} M(Y) = \frac{\sum _{n=1}^{N}\kappa '\left( \text {tr}(V^{-1}(X_{n} - Y)^\top U^{-1}(X_{n} - Y))\right) U^{-1}(X_{n} - Y)V^{-1}}{\sum _{n=1}^{N}\kappa '\left( \text {tr}(V^{-1}(X_{n} - Y)^\top U^{-1}(X_{n} - Y))\right) } \end{aligned}$$
after some simple mathematical manipulation.
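The fixed-point iteration induced by the matrix normal kernel, for which the profile derivative is proportional to \(\exp (-\cdot /2)\) and all constants cancel in the ratio, may be sketched as follows; the tolerance and names are illustrative.

```python
import numpy as np

def mean_shift_mode(Y, X, h, tol=1e-6, max_iter=500):
    """Ascend to a local mode of the matrix-variate KDE from start Y.

    X has shape (N, P, T); h is a scalar bandwidth. With the matrix
    normal kernel, each step moves Y to a weighted average of the
    observations, the weights decaying with Frobenius distance.
    """
    for _ in range(max_iter):
        sq = np.sum(((X - Y) / h) ** 2, axis=(1, 2))
        w = np.exp(-0.5 * sq)
        Y_new = (w[:, None, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(Y_new - Y) < tol:      # Frobenius norm of the shift
            return Y_new
        Y = Y_new
    return Y
```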
With respect to the adaptive estimator (6), the same proposition holds, with the only caution of replacing h with \(\delta _k(X_n)\). Conversely, the same arguments do not generally apply to the balloon estimator (5), since the kernel depends on X also through \(\delta _k(X)\), which prevents the derivation of a general expression for its gradient. An exception in the multivariate case occurs when the kernel is chosen within the beta family, where the problem simplifies remarkably (Duong et al. 2016). The same naturally extends to the matrix-variate case. Specifically, when a uniform kernel on the unit PT-ball is selected, the gradient ascent property of the mean shift is shown to hold with extreme computational efficiency, as stated by the following result.
The proof follows the same steps as that of Proposition 2. See also Duong et al. (2016) for the multivariate case.
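As a sketch of the computational recipe the result suggests, with the uniform kernel and the k-NN bandwidth each update simply moves the current point to the average of its k nearest sample matrices in Frobenius distance, mirroring the multivariate nearest-neighbour mean shift of Duong et al. (2016); the function below is an illustrative single step, not the paper's implementation.

```python
import numpy as np

def knn_mean_shift_step(Y, X, k):
    """One nearest-neighbour mean-shift step for matrix-variate data.

    Returns the average of the k sample matrices closest to Y in
    Frobenius distance; iterating until the neighbour set stabilises
    drives Y towards a local mode of the balloon estimator (5).
    """
    d2 = np.sum((X - Y) ** 2, axis=(1, 2))   # squared Frobenius distances
    nearest = np.argsort(d2)[:k]             # indices of the k-NN of Y
    return X[nearest].mean(axis=0)
```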