1 Introduction

Local polynomial regression (LPR) is a flexible and efficient nonparametric regression method in statistics. LPR has been widely applied in many research areas such as data smoothing, density estimation, and nonlinear modelling [1, 2]. Given a set of noisy and possibly non-uniformly-spaced samples of a multidimensional signal, the data samples are fitted locally by a multivariate polynomial using the least-squares (LS) criterion with a kernel function \( {K_{\mathbf{H}}}\left( {\mathbf{u}} \right) = \frac{1}{{|{\mathbf{H}}|}}K\left( {{{\mathbf{H}}^{ - 1}}{\mathbf{u}}} \right) \), where \( K\left( {\mathbf{u}} \right):{\mathbb{R}^d} \to \mathbb{R} \) is a symmetric spherical multi-dimensional basis-kernel and H is a bandwidth matrix that determines the kernel size and other properties.

LPR has been shown to possess excellent spatial adaptation and a reasonably simple implementation. Therefore, it has received considerable attention in the statistics and signal processing communities. Since signals may vary considerably, it is crucial to choose a proper local kernel to achieve the best bias-variance tradeoff in estimating the local polynomial coefficients for modelling non-stationary signals. Generally, the optimal local kernel should be selected to minimize the mean squared error (MSE) or its integrated and smoothed versions, which are functions of the local kernel. For example, at slowly varying parts of a one-dimensional signal, we would like the kernel to have a large scale so that more accurate estimates can be obtained by averaging out the additive noise as much as possible. At rapidly varying parts of the signal, however, we would like to have a small kernel scale so that excessive bias errors due to the limited order of the fitting polynomial will not occur.

For multivariate LPR, the bias-variance tradeoff problem is more complicated because the bandwidth H is a matrix in multidimensional signal processing. Not only the kernel scale but also the kernel shape has a significant effect on the estimation results. This explains why there have been relatively few publications on the bandwidth matrix selection of multivariate LPR, although adaptive bandwidth selection for univariate LPR has been studied systematically and intensively [2–10].

Conventionally, the bandwidth matrix selection for multivariate LPR is simplified by using kernels with a specific shape, which is usually symmetric and spherical. More precisely, using a symmetric and spherical basis-kernel K(u) and a bandwidth matrix H = h I, with h the scale parameter and I the identity matrix, the kernel K H (u) simplifies to \( {K_h}\left( {\mathbf{u}} \right) = \frac{1}{{{h^d}}}K\left( {{\mathbf{u}}/h} \right) \), where d is the dimension of the multivariate LPR. As a result, the selection of the scale h is much easier than the selection of a d-dimensional bandwidth matrix H. With this assumption, Ruppert et al. [10–12] studied the asymptotic bias, variance, and MSE of multivariate LPR and developed the empirical-bias bandwidths selection (EBBS) algorithm for multivariate LPR [13]. Based on Ruppert’s work, Fan and Gijbels [2] and Yang and Tschernig [14] also developed similar scale selection methods for multivariate local linear regression. These methods are innovative and useful, but they were developed mainly for linear or quadratic regression and did not consider the partial derivatives of the estimates. Also, they were all based on the symmetric kernel assumption.

Unlike the above methods, Masry [15–17] considered a more general kernel, which may not be symmetric and spherical. More precisely, the basis-kernel K(u) in the kernel function \( {K_h}\left( {\mathbf{u}} \right) = \frac{1}{{{h^d}}}K\left( {{\mathbf{u}}/h} \right) \) may be asymmetric and of various shapes. A comprehensive study of the bias and variance of multivariate LPR with higher-order polynomials and higher-order derivatives was carried out, and an analytical expression for the optimal scale h was given. However, the optimal scale selection method is generally not directly usable in practice because its analytical form involves quantities that are difficult to estimate.

Instead of computing an optimal scale in an analytical form, empirical methods choose the scale from a finite set of scale parameters in a geometric grid. Recently, Katkovnik et al. [18–20] applied the intersection of confidence intervals (ICI) method to the multivariate LPR problem for selecting the optimal local scale and studied its applications in image processing. Given a set of scale parameters in ascending order, the ICI rule determines the optimal scale by comparing the confidence intervals of the estimates obtained with the different scale parameters in the scale set. However, the ICI method (and other empirical scale selectors) can only select the local scale from the finite scale set, and therefore the exact scale value cannot be obtained if it is not included in the scale set.

In this paper, we first propose a new refined ICI adaptive scale selection method (RICI) for multidimensional signal processing using the bias and variance expressions derived from multivariate LPR analysis of Masry [15–17]. The RICI method can determine a more accurate scale parameter beyond the limit of the finite scale set. The main idea of the RICI method is inspired by the work of Stanković [21], which discussed in-depth the bias-variance tradeoff problem and the ICI adaptive scale selection algorithm.

Next, we study the application of the proposed LPR with RICI adaptive scale selection method (LPR-RICI) to smoothing noisy images. Motivated by the iterative steering kernel regression method of Takeda et al. [22], the local kernel shape is iteratively determined from the local orientation of the image data. The locally adaptive steering kernel has better adaptability to the local characteristics of images than the conventional symmetric circular kernel. Coupled with the proposed RICI scale selection, a fully automatic iterative steering-kernel-based LPR with refined ICI method (SK-LPR-RICI) for smoothing multivariate data is proposed. It yields a better MSE and perceptual quality than the conventional symmetric circular kernel. Since LPR can be naturally adapted to non-uniformly spaced data points, we also propose to employ the SK-LPR-RICI method for interpolating non-uniformly sampled images, which finds applications in image reconstruction and super-resolution.

The rest of the paper is organized as follows. In Section 2, the basic principle of multivariate LPR and its bias-variance tradeoff problem are revisited. Section 3 introduces the ICI scale selection algorithm and proposes the refined ICI adaptive scale selector. Section 4 is devoted to the applications of LPR to image processing problems such as smoothing and reconstruction, and a fully data-driven iterative steering-kernel-based LPR with refined ICI method is proposed. This algorithm is also applicable to other two-dimensional data. Experimental results and comparisons with other conventional methods using both test and real images are presented in Section 5. Finally, conclusions are drawn in Section 6.

2 Multivariate Local Polynomial Regression

In multivariate LPR, we have a set of multidimensional observations: (Y i , X i ), i = 1,2,⋯,n, where for each i, Y i is a scalar response variable and \( {{\mathbf{X}}_i} = {\left( {{X_{i,1}},{X_{i,2}}, \cdots, {X_{i,d}}} \right)^T} \) is a d-dimensional explanatory variable. We assume the homoscedastic model:

$$ {Y_i} = m\left( {{{\mathbf{X}}_i}} \right) + \sigma \left( {{{\mathbf{X}}_i}} \right){\varepsilon_i} $$
(1)

where m(X i ) is a smooth function specifying the conditional mean of Y i given X i , the ε i are independent identically distributed (i.i.d.) additive white Gaussian noise (AWGN) samples with zero mean and unit variance, and σ2(X i ) is the conditional variance of Y i given X i . The problem is to estimate m(X i ) and its partial derivatives from the noisy samples Y i .

The smooth function m(X i ) can be approximated locally, say at a given point \( {\mathbf{x}} = {\left( {{x_1},{x_2}, \cdots, {x_d}} \right)^T} \), by a multivariate polynomial of a certain degree:

$$ \begin{array}{*{20}{c}} {m({\mathbf{X}}:{\mathbf{x}}) = m\left( {\mathbf{x}} \right) + {{\left\{ {{\nabla_m}\left( {\mathbf{x}} \right)} \right\}}^T}\left( {{\mathbf{X}} - {\mathbf{x}}} \right) + \frac{1}{2}{{\left( {{\mathbf{X}} - {\mathbf{x}}} \right)}^T}\left\{ {{\mathcal{H}_m}\left( {\mathbf{x}} \right)} \right\}\left( {{\mathbf{X}} - {\mathbf{x}}} \right) + \cdots } \\{ = m\left( {\mathbf{x}} \right) + {{\left\{ {{\nabla_m}\left( {\mathbf{x}} \right)} \right\}}^T}\left( {{\mathbf{X}} - {\mathbf{x}}} \right) + \frac{1}{2}{\hbox{ve}}{{\hbox{c}}^T}\left\{ {{\mathcal{H}_m}\left( {\mathbf{x}} \right)} \right\} \cdot {\hbox{vec}}\left\{ {\left( {{\mathbf{X}} - {\mathbf{x}}} \right){{\left( {{\mathbf{X}} - {\mathbf{x}}} \right)}^T}} \right\} + \cdots } \\{ = {\beta_0} + {\mathbf{\beta }}_\nabla^T \cdot \left( {{\mathbf{X}} - {\mathbf{x}}} \right) + {\mathbf{\beta }}_\mathcal{H}^T \cdot {\hbox{vech}}\left\{ {\left( {{\mathbf{X}} - {\mathbf{x}}} \right){{\left( {{\mathbf{X}} - {\mathbf{x}}} \right)}^T}} \right\} + \cdots } \\\end{array} $$
(2)

where \( {\beta_0} = m\left( {\mathbf{x}} \right),\;{{\mathbf{\beta }}_\nabla } = {\nabla_m}\left( {\mathbf{x}} \right) = {\left[ {\frac{{\partial m\left( {\mathbf{x}} \right)}}{{\partial {x_1}}}, \cdots, \frac{{\partial m\left( {\mathbf{x}} \right)}}{{\partial {x_d}}}} \right]^T} \), \( {{\mathbf{\beta }}_\mathcal{H}} = {\hbox{vech}}\left\{ {{\mathcal{H}_m}\left( {\mathbf{x}} \right)} \right\} = \frac{1}{2}{\left[ {\frac{{{\partial^2}m\left( {\mathbf{x}} \right)}}{{\partial x_1^2}},2\frac{{{\partial^2}m\left( {\mathbf{x}} \right)}}{{\partial {x_1}\partial {x_2}}}, \cdots, \frac{{{\partial^2}m\left( {\mathbf{x}} \right)}}{{\partial x_d^2}}} \right]^T} \), \( {\mathbf{X}} = {\left( {{X_1},{X_2}, \cdots, {X_d}} \right)^T} \) is in a neighborhood of x, ∇ m is the d × 1 gradient operator, \({\mathcal{H}_{m}}\) is the d × d Hessian matrix of m(x), vec(∙) is the vectorization operator which converts a matrix into a column vector lexicographically, and vech(∙) is the half-vectorization operator which converts, for example, \( {\mathbf{A}} = \left[ {\begin{array}{*{20}{c}} {{a_1}} & {{a_2}} \\{{a_3}} & {{a_4}} \\\end{array} } \right] \) into \( {\hbox{vech}}(A) = {\left[ {{a_1},{a_3},{a_4}} \right]^T} \).

Alternatively, a degree p multivariate polynomial in (2) can be written as:

$$ m\left( {{\mathbf{X}}:{\mathbf{x}}} \right) = \sum\limits_{\mathcal{K} = 0}^p {\sum\limits_{{ }{k_1} + \ldots + {k_d} = \mathcal{K}} {{\beta_{{k_1}, \cdots, {k_d}}}\prod\limits_{j = 1}^d {{{\left( {{X_j} - {x_j}} \right)}^{{k_j}}}} } }, $$
(3)

where \( {\mathbf{\beta }} = \left\{ {{\beta_{{k_1}, \cdots, {k_d}}}:{k_1} + \cdots + {k_d} = \mathcal{K}\ {\hbox{and}}\ \mathcal{K} = 0, \cdots, p} \right\} \) is the vector of coefficients with length \( q = \prod\limits_{j = 1}^p {\left[ {\left( {d + j} \right)/j} \right]} \).

Since ε i is i.i.d. and Gaussian distributed, the maximum likelihood (ML) estimation of the coefficient vector β at location x can be obtained by solving the following weighted least-squares (WLS) regression problem:

$$ {\mathbf{\hat{\beta }}}\left( {{\mathbf{x}};{\mathbf{H}}} \right) = \arg \mathop {{\min }}\limits_{\mathbf{\beta }} \left\{ {J\left( {{\mathbf{x}};{\mathbf{H}}} \right)} \right\}, $$
(4)

where

$$ J\left( {{\mathbf{x}};{\mathbf{H}}} \right) = \sum\limits_{i = 1}^n {{K_{\mathbf{H}}}\left( {{{\mathbf{X}}_i} - {\mathbf{x}}} \right){{\left[ {{Y_i} - \sum\limits_{\mathcal{K} = 0}^p {\sum\limits_{{ }{k_1} + \ldots + {k_d} = \mathcal{K}} {{\beta_{{k_1}, \cdots, {k_d}}}\prod\limits_{j = 1}^d {{{\left( {{X_{i,j}} - {x_j}} \right)}^{{k_j}}}} } } } \right]}^2}}, $$
(5)

is a LS cost function and \( {K_{\mathbf{H}}}\left( {{{\mathbf{X}}_i} - {\mathbf{x}}} \right) = \frac{1}{{|{\mathbf{H}}|}}K\left( {{{\mathbf{H}}^{ - 1}}\left( {{{\mathbf{X}}_i} - {\mathbf{x}}} \right)} \right) \) is a d-variate non-negative kernel function which emphasizes neighboring observations around x; the bandwidth matrix H determines the weights given to these neighboring samples in estimating β(x). For separable windows, we have

$$ {K_{\mathbf{H}}}\left( {{{\mathbf{X}}_i} - {\mathbf{x}}} \right) = \prod\limits_{j = 1}^d {{K_{{h_j}}}\left( {{X_{i,j}} - {x_j}} \right)} = \frac{1}{{{h_1} \cdots {h_d}}}\prod\limits_{j = 1}^d {K\left( {\frac{{{X_{i,j}} - {x_j}}}{{{h_j}}}} \right)}. $$
(6)

Furthermore, the bandwidths can be made identical, h 1 = ⋯ = h d  = h. This is the simplest and most popular way to reduce the bandwidth parameter from a matrix H to a scalar h.

In [15–17], the kernel considered is of the form \( {K_h}\left( {\mathbf{u}} \right) = \frac{1}{{{h^d}}}K\left( {{\mathbf{u}}/h} \right) \), where the non-negative basis-kernel function K(u) may not be circular or symmetric, and h is the scale parameter. To facilitate the development in the following sections, we assume that the basis-kernel K(u) is known a priori. Its extension to a locally orientation adaptive basis-kernel will be discussed in Section 4.

The WLS solution to (4) is:

$$ {\mathbf{\hat{\beta }}}\left( {{\mathbf{x}};h} \right) = {\left( {{\mathbb{X}^T}\mathbb{W}\mathbb{X}} \right)^{ - 1}}{\mathbb{X}^T}\mathbb{W}\mathbb{Y}{,} $$
(7)

where \( \mathbb{X} = \left[ {\begin{array}{*{20}{c}} 1 & {{{\left( {{{\mathbf{X}}_1} - {\mathbf{x}}} \right)}^T}} & {{\hbox{vech}}\left\{ {\left( {{{\mathbf{X}}_1} - {\mathbf{x}}} \right){{\left( {{{\mathbf{X}}_1} - {\mathbf{x}}} \right)}^T}} \right\}} & \cdots \\1 & {{{\left( {{{\mathbf{X}}_2} - {\mathbf{x}}} \right)}^T}} & {{\hbox{vech}}\left\{ {\left( {{{\mathbf{X}}_2} - {\mathbf{x}}} \right){{\left( {{{\mathbf{X}}_2} - {\mathbf{x}}} \right)}^T}} \right\}} & \cdots \\\vdots & \vdots & \vdots & \vdots \\1 & {{{\left( {{{\mathbf{X}}_n} - {\mathbf{x}}} \right)}^T}} & {{\hbox{vech}}\left\{ {\left( {{{\mathbf{X}}_n} - {\mathbf{x}}} \right){{\left( {{{\mathbf{X}}_n} - {\mathbf{x}}} \right)}^T}} \right\}} & \cdots \\\end{array} } \right],\;\mathbb{Y} = {\left[ {{Y_1},{Y_2}, \cdots, {Y_n}} \right]^T} \), and \( \mathbb{W} = diag\left\{ {{K_h}\left( {{{\mathbf{X}}_1} - {\mathbf{x}}} \right), \cdots, {K_h}\left( {{{\mathbf{X}}_n} - {\mathbf{x}}} \right)} \right\} \) is the weighting matrix.

From the estimated coefficient vector \( {\mathbf{\hat{\beta }}}\left( {{\mathbf{x}};h} \right) \), we can obtain the estimated k-th order derivative of m(x) as follows

$$ {\hat{m}^{\left( {\mathbf{k}} \right)}}\left( {\mathbf{x}} \right) = \frac{{{\partial^\mathcal{K}}m\left( {\mathbf{x}} \right)}}{{\partial x_1^{{k_1}} \cdots \partial x_d^{{k_d}}}} = \prod\limits_{j = 1}^d {\left( {{k_j}!} \right)} \cdot {\hat{\beta }_{{k_1}, \cdots, {k_d}}}\left( {{\mathbf{x}};h} \right), $$
(8)

where k = (k 1,⋯, k d ) is a vector of non-negative integers and \( \mathcal{K} = \sum\limits_{j = 1}^d {{k_j}} \). In particular, m(x) is estimated from \( {\hat{\beta }_{0, \cdots, 0}}\left( {\mathbf{x}} \right) \), that is \( \hat{m}\left( {\mathbf{x}} \right) = {\hat{m}^{\left( {0, \cdots, 0} \right)}}\left( {\mathbf{x}} \right) = {\hat{\beta }_{0, \cdots, 0}}\left( {\mathbf{x}} \right) \).
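
To make the estimator concrete, the following is a minimal numpy sketch of the WLS solution (7) for a local linear fit (p = 1) at a single point; the function name, the spherical Gaussian basis-kernel, and the scalar bandwidth are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def lpr_fit_point(X, Y, x0, h):
    """Local linear (p = 1) LPR at a single point x0 via the WLS solution (7).

    X : (n, d) sample locations, Y : (n,) noisy responses, x0 : (d,) query
    point, h : scalar bandwidth.  Returns the estimates of beta_0 (i.e. m(x0))
    and of the gradient, cf. (8).  A spherical Gaussian basis-kernel is used
    here purely for illustration.
    """
    n, d = X.shape
    U = X - x0                                               # displacements X_i - x
    w = np.exp(-0.5 * np.sum((U / h) ** 2, axis=1)) / h**d   # K_h(X_i - x)
    Xd = np.hstack([np.ones((n, 1)), U])                     # design rows [1, (X_i - x)^T]
    W = np.diag(w)
    beta = np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ Y)      # (X^T W X)^{-1} X^T W Y
    return beta[0], beta[1:]                                 # m(x0) estimate, gradient estimate
```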

As mentioned earlier, a key problem in LPR is to select an appropriate scale parameter h so as to achieve the best bias-variance tradeoff in estimating m (k)(x). To this end, we first need to examine the analytical expressions for the conditional bias, variance and MSE of the polynomial coefficient estimate \( {\mathbf{\hat{\beta }}} \). From (7), it can be shown that:

$$ Bias\left( {{\mathbf{\hat{\beta }}}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right) = E\left( {{\mathbf{\hat{\beta }}}\left( {{\mathbf{x}};h} \right)} \right) - {\mathbf{\beta }}\left( {{\mathbf{x}};h} \right) = {\left( {{\mathbb{X}^T}\mathbb{W}\mathbb{X}} \right)^{ - 1}}{\mathbb{X}^T}\mathbb{W}\left( {{\mathbf{m}} - \mathbb{X}{\mathbf{\beta }}\left( {{\mathbf{x}};h} \right)} \right), $$
(9)
$$ Var\left( {{\mathbf{\hat{\beta }}}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right) = {\left( {{\mathbb{X}^T}\mathbb{W}\mathbb{X}} \right)^{ - 1}}{\mathbb{X}^T}{\mathbf{\Sigma }}\mathbb{X}{\left( {{\mathbb{X}^T}\mathbb{W}\mathbb{X}} \right)^{ - 1}}, $$
(10)
$$ MSE\left( {{\mathbf{\hat{\beta }}}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right) = Bia{s^2}\left( {{\mathbf{\hat{\beta }}}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right) + Var\left( {{\mathbf{\hat{\beta }}}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right), $$
(11)

where \( {\mathbf{\tilde{X}}} = \left\{ {{{\mathbf{X}}_1},{{\mathbf{X}}_2}, \cdots, {{\mathbf{X}}_n}} \right\} \), \( {\mathbf{m}} = {\left[ {m\left( {{{\mathbf{X}}_1}} \right),m\left( {{{\mathbf{X}}_2}} \right), \cdots, m\left( {{{\mathbf{X}}_n}} \right)} \right]^T} \), and \( {\mathbf{\Sigma }} = diag\left\{ {K_h^2\left( {{{\mathbf{X}}_i} - {\mathbf{x}}} \right){\sigma^2}\left( {{{\mathbf{X}}_i}} \right)} \right\} \).

While the variance is not difficult to estimate from the observations and the kernel coefficients, the estimation of the bias poses a significant problem because it depends on the unknown function m(x) being estimated. Denote \( {\mathbf{k}}! = \prod\limits_{j = 1}^d {{k_j}!} \), and let f(x) be the probability density function (p.d.f.) of x. The asymptotic bias and variance of the k-th partial derivative \( {\hat{m}^{\left( {\mathbf{k}} \right)}}\left( {{\mathbf{x}};h} \right) \) are [15–17]:

$$ Bias\left\{ {{{\hat{m}}^{\left( {\mathbf{k}} \right)}}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right\} = \sqrt {\mathbb{B}} {h^{p + 1 - \mathcal{K}}}, $$
(12)
$$ Var\left\{ {{{\hat{m}}^{\left( {\mathbf{k}} \right)}}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right\} = \mathbb{V}{h^{ - \left( {d + 2\mathcal{K}} \right)}}, $$
(13)

where \( \mathbb{B} = {\left( {{\mathbf{k}}!{{\left( {{{\mathbf{M}}^{ - {\mathbf{1}}}}{\mathbf{B}}{{\mathbf{m}}_{p + 1}}\left( {\mathbf{x}} \right)} \right)}_\mathcal{K}}} \right)^2} \), \( \mathbb{V} = {\left( {{\mathbf{k}}!} \right)^2}{\sigma^2}\left( {\mathbf{x}} \right){\left( {{{\mathbf{M}}^{ - {\mathbf{1}}}}{\mathbf{\Gamma }}{{\mathbf{M}}^{ - {\mathbf{1}}}}} \right)_{\mathcal{K},\mathcal{K}}}/nf\left( {\mathbf{x}} \right) \), \( {\mathbf{B}} = {\left[ {{M_{0,p + 1}},{M_{1,p + 1}}, \cdots, {M_{p,p + 1}}} \right]^T} \), \( {{\mathbf{m}}_{p + 1}}\left( {\mathbf{x}} \right) = {\hbox{vech}}\left( {{{\hat{m}}^{\left( {\mathbf{k}} \right)}}\left( {\mathbf{x}} \right)} \right) \) is the regression result obtained with a higher order p + 1, \( {\left( {{{\mathbf{M}}^{ - {\mathbf{1}}}}{\mathbf{\Gamma }}{{\mathbf{M}}^{ - {\mathbf{1}}}}} \right)_{\mathcal{K},\mathcal{K}}} \) is the (\( \mathcal{K},\mathcal{K} \)) diagonal element of the constant matrix \( {{\mathbf{M}}^{ - {\mathbf{1}}}}{\mathbf{\Gamma }}{{\mathbf{M}}^{ - {\mathbf{1}}}} \), \( {\left( {{{\mathbf{M}}^{ - {\mathbf{1}}}}{\mathbf{B}}{{\mathbf{m}}_{p + 1}}\left( {\mathbf{x}} \right)} \right)_\mathcal{K}} \) is the \( \mathcal{K} \)-th element of the vector \( {{\mathbf{M}}^{ - {\mathbf{1}}}}{\mathbf{B}}{{\mathbf{m}}_{p + 1}}\left( {\mathbf{x}} \right) \), and the (i, j)-th elements of the matrices M and Γ are \( {M_{i,j}} = \int {{{\mathbf{u}}^{\left( {i + j} \right)}}K\left( {\mathbf{u}} \right)d{\mathbf{u}}} \) and \( {\Gamma_{i,j}} = \int {{{\mathbf{u}}^{\left( {i + j} \right)}}{K^2}\left( {\mathbf{u}} \right)d{\mathbf{u}}} \) respectively. The details can be found in [15–17]. Similar asymptotic analytical results on the bias and variance of multivariate LPR were also reported by Ruppert and Wand [10], and Fan and Gijbels [2].

It can be seen from (12) and (13) that the bias term increases with the scale parameter h, since the underlying data around x cannot be modeled well by a local polynomial of the given order over a large window. On the other hand, a larger scale parameter reduces the variance of the estimator. Hence, there exists a locally optimal scale h opt(x) for estimating the k-th partial derivative m (k)(x), and h opt(x) should minimize an appropriate measure such as the MSE:

$$ MSE\left\{ {{{\hat{m}}^{\left( {\mathbf{k}} \right)}}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right\} = Bia{s^2}\left\{ {{{\hat{m}}^{\left( {\mathbf{k}} \right)}}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right\} + Var\left\{ {{{\hat{m}}^{\left( {\mathbf{k}} \right)}}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right\} = \mathbb{B}{h^{2(p + 1 - \mathcal{K})}} + \mathbb{V}{h^{ - \left( {d + 2\mathcal{K}} \right)}}. $$
(14)

By minimizing (14), an analytical formula for h opt(x) can be determined as follows:

$$ {h^{opt}}\left( {\mathbf{x}} \right) = {\left( {\frac{{\left( {2\mathcal{K} + d} \right)\mathbb{V}}}{{2\left( {p - \mathcal{K} + 1} \right)\mathbb{B}}}} \right)^{\frac{1}{{d + 2p + 2}}}}. $$
(15)

However, some of the quantities in (15) are difficult to calculate, which makes the optimal scale difficult to estimate accurately. Instead of computing an optimal scale in an analytical form, empirical methods choose the local scale from a finite set of scale parameters in a geometric grid. In the next section, an empirical scale selection method, the intersection of confidence intervals rule, is introduced.
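
When estimates of \( \mathbb{B} \) and \( \mathbb{V} \) are available, (15) is a one-line computation; the following sketch (hypothetical function name, illustrative inputs) simply evaluates the formula.

```python
def optimal_scale(B, V, p, K, d):
    """Analytical optimal scale (15), assuming the squared-bias factor B and
    the variance factor V of (12)-(13) have been estimated."""
    return ((2 * K + d) * V / (2 * (p - K + 1) * B)) ** (1.0 / (d + 2 * p + 2))
```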

3 Refined ICI Scale Selection Algorithm

The intersection of confidence intervals (ICI) rule is an empirical method for selecting the adaptive scale from a set of scales in ascending order. The ICI rule was proposed by Goldenshluger and Nemirovski in [23], and Katkovnik et al. [18–20] applied it to univariate and bivariate LPR for selecting the locally adaptive bandwidth or scale. The ICI method can be regarded as a special implementation of Lepski's approach [24], which compares a series of estimates obtained with different scales from the scale set; the optimal scale is the largest one whose estimate does not differ significantly from the estimates obtained with smaller scales.

Starting with a finite set of scale parameters in a geometric grid,

$$ {\mathbf{\tilde{H}}} = \left\{ {{h_j}{\mathbf{I}}|{h_j} = {h_0}{a^j},{ }j = 1, \cdots, J} \right\}, $$
(16)

where a > 1 is a step factor, h 0 > 0 is the base scale, and J is the size of the scale set. The minimum scale in the scale set should be small enough so that sharp changes can be preserved, while the maximum scale should be large enough to remove the additive noise adequately. The maximum scale is usually chosen as n/log10(n) to avoid high complexity and possible over-smoothing [2].

3.1 ICI Rule

In the ICI rule, the optimal scale is determined by comparing the confidence intervals of the estimates obtained with different scales from \( {\mathbf{\tilde{H}}} \). The ICI rule requires only the variance, which can be estimated easily from (10); it does not involve computing the bias or the MSE.

To see how the algorithm works, we first note that, for each x, the estimate \( \hat{m}\left( {{\mathbf{x}};h} \right) \) is a random variable around the true value m(x) with bias and variance as given in (12) and (13). For simplicity, in this section we shall drop the indices x and k to simplify notation. It has been shown in [2, 20, 21, 23] that

$$ \left\{ {\hat{m}(h) - \left[ {m + Bias\left( {\hat{m}(h)} \right)} \right]} \right\}/\sqrt {{Var\left( {\hat{m}(h)} \right)}} \sim \mathcal{N}(0,1) $$
(17)

is Gaussian distributed with zero mean and unit variance. In other words, \( \hat{m}(h) \) is Gaussian distributed with mean \( Mean\left( {\hat{m}(h)} \right) = m + Bias\left( {\hat{m}(h)} \right) \) and standard deviation \( Std\left( {\hat{m}(h)} \right) = \sqrt {{Var\left( {\hat{m}(h)} \right)}} \).

The confidence intervals of the estimate \( \hat{m}(h) \) can also be constructed from the bias and variance as

$$ \left[ {Mean\left( {\hat{m}(h)} \right) - \kappa \cdot Std\left( {\hat{m}(h)} \right),Mean\left( {\hat{m}(h)} \right) + \kappa \cdot Std\left( {\hat{m}(h)} \right)} \right]. $$
(18)

As a result, the probability that \( \hat{m}(h) \) lies in the confidence interval \( |\hat{m}(h) - Mean\left( {\hat{m}(h)} \right)| \leqslant \kappa \cdot Std\left( {\hat{m}(h)} \right) \), Pr(κ), can be computed readily. Usually, the parameter κ is selected so that Pr(κ) will be close to one and a reasonable estimate of the possible range of values of \( \hat{m}(h) \) can be achieved. For example, typical values are Pr(1.65) = 0.90, Pr(1.96) = 0.95, and Pr(2.58) = 0.99. A larger κ leads to wider confidence intervals and thus a larger ICI-selected bandwidth, while a smaller κ yields a smaller ICI-selected bandwidth. Instead of choosing a fixed and appropriate value for κ, it is also possible to optimize this value by assigning a local threshold κ to each location. In [18], Katkovnik used the generalized cross-validation (GCV) criterion to determine the value of κ for each x. It is also an empirical method, where the desired value of κ is chosen from a finite set of threshold parameters \( {\mathbf{\Gamma }} = \left\{ {{\kappa_1},{\kappa_2}, \cdots, {\kappa_K}} \right\} \). Each threshold parameter \( \kappa \in {\mathbf{\Gamma }} \) is used in the ICI rule to calculate a series of adaptive estimates. The optimal κ is chosen as the one which minimizes the GCV criterion:

$$ {\kappa^{opt}}\left( {\mathbf{x}} \right) = \arg \mathop {{\min }}\limits_\kappa \sum\limits_i {{{\left( {\frac{{{Y_i} - \hat{m}\left( {{{\mathbf{X}}_i};{h_{{j^{+} }|\kappa }}\left( {\mathbf{x}} \right)} \right)}}{{1 - {K_{{h_{{j^{+} }|\kappa }}\left( {\mathbf{x}} \right)}}\left( {{{\mathbf{X}}_i} - {\mathbf{x}}} \right)}}} \right)}^2}}, $$
(19)

where \( {h_{{j^{+} }|\kappa }}\left( {\mathbf{x}} \right) \) is the ICI-selected scale with the threshold parameter κ. This method may lead to a better result at the expense of significantly higher computational complexity, because the ICI selection and the resulting adaptive estimates have to be computed for each \( \kappa \in {\mathbf{\Gamma }} \) and each \( {h_j} \in {\mathbf{\tilde{H}}} \), not to mention the GCV computation and comparison.

The mean value of \( \hat{m}(h) \) is difficult to estimate since it involves the true value m. The standard deviation of \( \hat{m}(h) \), \( Std\left( {\hat{m}(h)} \right) \), can be computed with the help of (10). The quantity σ 2(x) in (10) is estimated under the homoscedasticity assumption of the additive noise. The robust noise variance estimator given in [18–20] can be used:

$$ \hat{\sigma } = \left\{ {median\left( {\left| {{Y_i} - {Y_{i - 1}}} \right|} \right)} \right\}/\left( {\sqrt {2} \cdot 0.6745} \right). $$
(20)

where median(∙) denotes the sample median operation.
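
A direct sketch of the median-based estimator (20) is given below; passing the image row by row (or any 1-D ordering of the samples) is an illustrative choice.

```python
import numpy as np

def robust_noise_std(Y):
    """Robust noise standard deviation estimate of (20): the median of the
    absolute first differences, rescaled by sqrt(2) * 0.6745 so that it is
    consistent for Gaussian noise."""
    Y = np.asarray(Y, dtype=float).ravel()
    return np.median(np.abs(np.diff(Y))) / (np.sqrt(2.0) * 0.6745)
```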

Next, consider a series of confidence intervals \( {D_j} = \left[ {{L_j},{U_j}} \right] \) from the estimated \( \hat{m}\left( {{h_j}} \right) \) with different values of scale h j from the scale set \( {\mathbf{\tilde{H}}} \):

$$ {U_j} = \hat{m}\left( {{h_j}} \right) + \left( {\kappa + \Delta \kappa } \right) \cdot Std\left( {\hat{m}\left( {{h_j}} \right)} \right), $$
(21)
$$ {L_j} = \hat{m}\left( {{h_j}} \right) - \left( {\kappa + \Delta \kappa } \right) \cdot Std\left( {\hat{m}\left( {{h_j}} \right)} \right), $$
(22)

where ∆κ is a threshold parameter used to adjust the width of the confidence interval. We note that the confidence intervals (21) and (22) are biased because the estimated value \( \hat{m}\left( {{h_j}} \right) \), rather than the true mean of the estimator, is used as the interval center. The constant ∆κ is introduced to reduce the influence of this estimation bias, and its effect and selection will be discussed later in the refined ICI method.

For a small value of h, the bias is expected to be small, so as h increases the confidence interval gradually shrinks while the center of the interval remains more or less fixed. When h is increased to a point where the observations can no longer be modeled satisfactorily, a large bias results and the center of the interval changes significantly with respect to the others, while its length remains small. As a result, the confidence interval will no longer intersect those with smaller values of h.

Motivated by this observation, the ICI scale selection method computes and examines the following quantities from the confidence intervals in order to detect this sudden change:

$$ \begin{array}{*{20}{c}} {{{\overline L }_j} = \max \left[ {{{\overline L }_{j - 1}},{L_j}} \right],{\hbox{for}}\;j = 1,2, \cdots, J} \hfill \\{{{\underline U }_j} = \min \left[ {{{\underline U }_{j - 1}},{U_j}} \right],\;{\hbox{for}}\;j = 1,2, \cdots, J} \hfill \\{{{\overline L }_0} = {{\underline U }_0} = 0.} \hfill \\\end{array} $$
(23)

It can be seen that \( {\overline L_j} \) is the largest lower bound of the confidence intervals evaluated up to scale h j , while \( {\underline U_j} \) is the corresponding smallest upper bound. The largest value of j for which \( {\underline U_j} \geqslant {\overline L_j} \), denoted by j +, gives the desired scale h j+, because above this scale the confidence intervals no longer intersect with each other. In the ICI rule, it has been shown in [21] that if the two confidence intervals D i and D j intersect, then:

$$ |\hat{m}\left( {{h_i}} \right) - \hat{m}\left( {{h_j}} \right)| \leqslant \left( {\kappa + \Delta \kappa } \right) \cdot \left[ {Std\left( {\hat{m}\left( {{h_i}} \right)} \right) + Std\left( {\hat{m}\left( {{h_j}} \right)} \right)} \right]. $$
(24)
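
The running bounds in (23) and the search for j + can be implemented in a few lines; the sketch below (illustrative function name, zero initialization of (23) omitted) assumes the estimates and their standard deviations have already been computed for every scale in the set.

```python
import numpy as np

def ici_select(m_hat, std_hat, kappa, delta_kappa=0.0):
    """ICI rule of (21)-(24): m_hat[j] and std_hat[j] are the estimate and its
    standard deviation for the j-th (ascending) scale; returns the index j+
    of the largest scale whose confidence interval still intersects all the
    previous ones."""
    thr = kappa + delta_kappa
    U = np.asarray(m_hat) + thr * np.asarray(std_hat)     # upper bounds (21)
    L = np.asarray(m_hat) - thr * np.asarray(std_hat)     # lower bounds (22)
    L_bar = np.maximum.accumulate(L)                      # largest lower bound so far
    U_bar = np.minimum.accumulate(U)                      # smallest upper bound so far
    return int(np.nonzero(U_bar >= L_bar)[0].max())       # last index where they still intersect
```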

3.2 Refined ICI Method

We now consider the selection of Δκ and give a refined version of the conventional ICI rule. From (21), (22), and (24), it can be seen that the threshold Δκ can significantly affect the performance of the ICI algorithm. When Δκ is large, the confidence interval D j becomes wide, and it may lead to a large value of h j+. This will result in over-smoothing of the observations. On the contrary, when Δκ is small, the interval D j will become narrow, and it may result in a very small value of h j+ so that the additive noise cannot be removed effectively.

The selection and adjustment of Δκ have been discussed in [20] and [23]. By substituting the optimal scale (15) into the asymptotic bias and variance expressions in (12) and (13), we have:

$$ \frac{{Bia{s^2}\left\{ {{{\hat{m}}^{\left( {\mathbf{k}} \right)}}\left( {{\mathbf{x}};{h^{opt}}} \right)|{\mathbf{\tilde{X}}}} \right\}}}{{Var\left\{ {{{\hat{m}}^{\left( {\mathbf{k}} \right)}}\left( {{\mathbf{x}};{h^{opt}}} \right)|{\mathbf{\tilde{X}}}} \right\}}} = \frac{{\left( {2\mathcal{K} + d} \right)}}{{2\left( {p - \mathcal{K} + 1} \right)}}. $$
(25)

To simplify notations, let

$$ {b_m}(h) = Bias\left\{ {\hat{m}(h)|{\mathbf{\tilde{X}}}} \right\}, $$
(26)
$$ \sigma_m^2(h) = Var\left\{ {\hat{m}(h)|{\mathbf{\tilde{X}}}} \right\}, $$
(27)
$$ \beta = 2\left( {p + 1 - \mathcal{K}} \right),\,{\hbox{and}}\;\nu = 2\mathcal{K} + d. $$
(28)

Then, one gets \( b_m^2(h) = \mathbb{B}{h^\beta } \), \( \sigma_m^2(h) = \mathbb{V}{h^{ - \nu }} \), and \( MSE(h) = \mathbb{B}{h^\beta } + \mathbb{V}{h^{ - \nu }} \). The optimal scale can then be written as

$$ {h^{opt}} = {\left( {\frac{\nu }{\beta } \cdot \frac{\mathbb{V}}{\mathbb{B}}} \right)^{\frac{1}{{\beta + \nu }}}}. $$
(29)

In [23], ∆κ is chosen as κ because the bias and standard deviation are assumed to be equal. In [20], Katkovnik showed that \( \Delta \kappa = \left( {\beta /\nu } \right)\kappa \) was an appropriate choice because it can minimize the upper bound of the estimation error. That is, ∆κ should be selected as the ratio of the squared bias and the variance at the optimal scale h opt.

It can be seen that the above method can only select the optimal scale h opt = h j+ from the finite scale set \( {\mathbf{\tilde{H}}} \). Generally, the exact optimal scale is not included in the scale set \( {\mathbf{\tilde{H}}} \). In [21], the computation of a more exact optimal scale for the general ICI scale selection problem with given bias and variance formulas was considered. The importance of the result is that explicit knowledge of \( \mathbb{B} \) and \( \mathbb{V} \) in (12) and (13) is not required. Here, we shall employ this method to further refine the ICI adaptive scale for the LPR problem using the bias and variance formulas in (12) and (13).

By assuming that h opt I is close to the conventional ICI-based optimal bandwidth \( {h_{{j^{+} }}}{\mathbf{I}} \in {\mathbf{\tilde{H}}} \), we can write \( {h_{{j^{+} }}} = {a^\eta }{h^{opt}} \), where a is the step factor and η is a variable to be determined for refining the optimal scale. The entries in the scale set \( {\mathbf{\tilde{H}}} \) of (16) can thus be rewritten as

$$ {\mathbf{\tilde{H}}} = \left\{ {{h_j}{\mathbf{I}}|{h_j} = {h_{{j^{+} }}}{a^{\left( {j - {j^{+} }} \right)}} = {h^{opt}}{a^{\left( {j - {j^{+} } + \eta } \right)}},{ }j = 1, \cdots, J} \right\}. $$
(30)

Substituting (30) into (12) and (13), the bias and standard deviation can be written as:

$$ {b_m}\left( {{h_j}} \right) = \sqrt {{\mathbb{B}{h_j}^\beta }} = \sqrt {{\nu /\beta }} {\sigma_m}\left( {{h^{opt}}} \right){a^{\frac{{\beta \left( {j - {j^{+} } + \eta } \right)}}{2}}}, $$
(31)
$$ {\sigma_m}\left( {{h_j}} \right) = \sqrt {{\mathbb{V}/{h_j}^\nu }} = {\sigma_m}\left( {{h^{opt}}} \right){a^{\frac{{ - \nu \left( {j - {j^{+} } + \eta } \right)}}{2}}}. $$
(32)

In the ICI rule, the optimal h j+ is the largest scale at which the confidence intervals \( {D_{{j^{+} } - 1}} \) and \( {D_{{j^{+} }}} \) intersect, whereas \( {D_{{j^{+} }}} \) and \( {D_{{j^{+} } + 1}} \) do not. Given an appropriate value of κ such that Pr(κ) ≈ 1, it follows from the distribution in (17) that, with high confidence, \( \hat{m}\left( {{h_j}} \right) \) lies in the interval:

$$ m + {b_m}\left( {{h_j}} \right) - \kappa {\sigma_m}\left( {{h_j}} \right) \leqslant \hat{m}\left( {{h_j}} \right) \leqslant m + {b_m}\left( {{h_j}} \right) + \kappa {\sigma_m}\left( {{h_j}} \right). $$
(33)

Making use of (33) and substituting (16) into (24), one gets, after slight manipulation, the condition for \( {D_{{j^{+} } - 1}} \) and \( {D_{{j^{+} }}} \) to intersect as:

$$ {b_m}\left( {{h_{{j^{+} } - 1}}} \right) + \Delta \kappa {\sigma_m}\left( {{h_{{j^{+} } - 1}}} \right) \geqslant {b_m}\left( {{h_{{j^{+} }}}} \right) + \Delta \kappa {\sigma_m}\left( {{h_{{j^{+} }}}} \right), $$
(34)

and the condition for \( {D_{{j^{+} }}} \) and \( {D_{{j^{+} } + 1}} \) not to intersect as:

$$ {b_m}\left( {{h_{{j^{+} }}}} \right) + \left( {2\kappa + \Delta \kappa } \right){\sigma_m}\left( {{h_{{j^{+} }}}} \right) < {b_m}\left( {{h_{{j^{+} } + 1}}} \right) - \left( {2\kappa + \Delta \kappa } \right){\sigma_m}\left( {{h_{{j^{+} } + 1}}} \right). $$
(35)

Substituting the estimated bias (31) and variance (32) into the inequalities (34) and (35), the threshold parameter ∆κ and the refining parameter η in the expressions can be determined as follows:

$$ \Delta \kappa = \frac{{2\kappa }}{{{a^{\left( {\beta + \nu } \right)/2}} - 1}}, $$
(36)
$$ \eta = \frac{2}{{\beta + \nu }}{\log_a}\left( {\Delta \kappa \sqrt {{\frac{\beta }{\nu }}} \frac{{1 + {a^{\nu /2}}}}{{1 - {a^{ - \beta /2}}}}} \right). $$
(37)

Thus, given an appropriate value of the threshold parameter κ, say 1.96, in the LPR problem, ∆κ and η can be obtained from (36) and (37). Hence, a refined adaptive scale is \( {h^{opt}} = {h_{{j^{+} }}}{a^{ - \eta }} \).

Since the scale set is constructed in a geometric order, we still need to add a logarithmic shift ∆η to η to compensate for the difference between the arithmetic mean and the geometric mean. This shift can be approximated as [21]:

$$ \Delta \eta = \frac{{2\left[ {{{\log }_a}\left( {\left( {1 + {a^{\left( {\beta + \nu } \right)/2}}} \right)/2} \right)} \right]}}{{\beta + \nu }} - \frac{1}{2}. $$
(38)

Finally, the desired optimal scale is determined as:

$$ {h^{opt}} = {h_{{j^{+} }}}{a^{ - \left( {\eta + \Delta \eta } \right)}}. $$
(39)
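
A compact sketch of the refinement step (36)–(39) is given below; the function name and argument list are illustrative, and the final comment simply reproduces the parameter values used later in Section 5.

```python
import numpy as np

def rici_refine(h_j_plus, a, kappa, p, d, K=0):
    """Refine the ICI-selected scale h_{j+} using (36)-(39): a is the step
    factor of the geometric scale grid, kappa the ICI threshold, p the
    polynomial order, d the dimension and K the total derivative order."""
    beta = 2.0 * (p + 1 - K)                                  # bias exponent, (28)
    nu = 2.0 * K + d                                          # variance exponent, (28)
    log_a = lambda x: np.log(x) / np.log(a)                   # logarithm to base a
    d_kappa = 2.0 * kappa / (a ** ((beta + nu) / 2.0) - 1.0)  # (36)
    eta = (2.0 / (beta + nu)) * log_a(
        d_kappa * np.sqrt(beta / nu) * (1.0 + a ** (nu / 2.0)) / (1.0 - a ** (-beta / 2.0)))  # (37)
    d_eta = 2.0 * log_a((1.0 + a ** ((beta + nu) / 2.0)) / 2.0) / (beta + nu) - 0.5           # (38)
    return h_j_plus * a ** (-(eta + d_eta))                   # refined scale, (39)

# With a = 2, kappa = 1.96, p = 1, d = 2 (the setting of Section 5) this yields
# d_kappa ~ 0.56, eta ~ 0.55, d_eta ~ 0.22, i.e. h_opt ~ 0.59 * h_{j+}.
```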

We now summarize the refined ICI (RICI) adaptive scale selection method as follows:

  1) Set up a finite scale set \( {\mathbf{\tilde{H}}} \) as in (16);

  2) Compute \( \hat{m}\left( {{h_j}} \right) \) for each \( {h_j}{\mathbf{I}} \in {\mathbf{\tilde{H}}} \);

  3) Select the threshold κ and compute ∆κ from (36);

  4) Find \( {h_{{j^{+} }}} \) from the set \( {\mathbf{\tilde{H}}} \) with the ICI rule;

  5) Compute the refinement parameters η and ∆η from (37) and (38);

  6) Calculate the refined optimal scale \( {h^{opt}} = {h_{{j^{+} }}}{a^{ - \left( {\eta + \Delta \eta } \right)}} \).

Note that in step 3), the threshold parameter κ can be chosen as some reasonable value such as 1.96. If κ is selected by (19), the result may be better but the complexity is higher.

4 Data-Driven LPR for Image Processing

Recently, bivariate LPR has received considerable attention in the image processing field. Katkovnik et al. [18–20] successfully applied LPR with ICI bandwidth selection to de-noising, de-blurring, and edge detection of images. Takeda et al. developed a steering kernel regression method for image de-noising, interpolation, and fusion [22]. In this section, we shall extend the adaptive scale selection method developed in Section 3 to kernels that can adapt to the local orientation of an image. This steering-kernel-based LPR-RICI (SK-LPR-RICI) will then be applied to smoothing and interpolation of noisy images.

An N 1 × N 2 gray-scale image can be viewed as the observations of a 2-dimensional (2D) function where the samples are taken along the 2D explanatory variable \( {\mathbf{x}} = \left( {{x_1},{x_2}} \right) \) at \( {x_1} = 1,2, \cdots, {N_1} \) and \( {x_2} = 1,2, \cdots, {N_2} \) with \( Y\left( {{x_1},{x_2}} \right) \) the intensity of the image pixel. In principle, the locations of the samples can be non-uniformly spaced. This makes LPR very useful in general interpolation and smoothing problems. The smoothed result is obtained from the estimated \( \hat{m}\left( {{x_1},{x_2}} \right) = {\beta_{0,0}} \) by applying LPR with a suitably chosen 2D kernel \( {K_{\mathbf{H}}}\left( \cdot \right) \) to the original image. The quantities \( {\beta_{1,0}} = \partial m\left( {{x_1},{x_2}} \right)/\partial {x_1} \) and \( {\beta_{0,1}} = \partial m\left( {{x_1},{x_2}} \right)/\partial {x_2} \) represent respectively the first-order partial derivatives of an image along the x 1 and x 2 coordinates. When x is evaluated at a series of 2D grid points, we obtain a smoothed surface representation of the data from the noisy image Y. Therefore, the smoothing and interpolation of digital images can be performed through LPR.

4.1 Kernel Selection

The Gaussian-type kernel is employed in our analysis and image processing experiments because it can be transformed to produce a steering kernel. Given a bandwidth matrix H, the Gaussian kernel is calculated as:

$$ {{\mathbf{K}}_{\mathbf{H}}}\left( {\mathbf{u}} \right) = \frac{1}{{2\pi |{\mathbf{H}}|}}\exp \left\{ { - {{\mathbf{u}}^T}{{\left( {{{\mathbf{H}}^{ - 1}}} \right)}^T}{{\mathbf{H}}^{ - 1}}{\mathbf{u}}/2} \right\} $$
(40)

Since the Gaussian kernel is not of compact support, it should be truncated to a sufficient kernel size N K  × N K to reduce the arithmetic complexity, as will be discussed later in this section.
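
As an illustration, the following sketch evaluates (40) on a truncated N K  × N K grid around the origin; the function name and the odd grid size are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel_grid(H, N_K):
    """Sample the Gaussian kernel (40) with bandwidth matrix H on an
    N_K x N_K grid of integer displacements centred at the origin."""
    r = (N_K - 1) // 2
    x1, x2 = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    U = np.stack([x1.ravel(), x2.ravel()], axis=0)        # 2 x N_K^2 displacements u
    Hinv = np.linalg.inv(H)
    quad = np.sum((Hinv @ U) ** 2, axis=0)                # u^T (H^{-1})^T H^{-1} u
    K = np.exp(-0.5 * quad) / (2.0 * np.pi * abs(np.linalg.det(H)))
    return K.reshape(N_K, N_K)
```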

When the bandwidth matrix is chosen as H = h I, a symmetric Gaussian kernel results, which is used in most LPR methods for image processing because of its simplicity. For the symmetric basis-kernel, the locally adaptive scale parameter h at a point x controls the amount of smoothing performed at that point. When h is small, image details such as edges and textures are preserved, but possible additive noise may not be removed effectively. On the contrary, a large-scale kernel has better de-noising properties at the expense of possibly blurring the image details. Therefore, a locally adaptive scale selection method is very crucial to LPR in image processing. The ICI rule with a symmetric circular kernel has been shown to be an effective tool for adaptive scale selection [18–20].

Recently, Takeda et al. [22] proposed a steering kernel, which is obtained by elongating and rotating a symmetric Gaussian kernel so that it can adapt to the dominant local orientation. In [22], the orientation and scale of the steering kernel were determined using principal component analysis (PCA). The resultant iterative steering kernel regression (ISKR) method was shown to have a better performance than the conventional symmetric kernel, especially along image edges. In the ISKR method, the local scaling parameter was obtained as \( {h_i} = {h_0}{\gamma_i} \), where h 0 and γ i are respectively the global smoothing parameter and the local scaling parameter. The local scaling factor γ i was estimated using the local energy distribution of the image, while h 0 was chosen to minimize the root mean squared error (RMSE) value for known test images or to yield the most visually appealing results for real images. Consequently, the scale selection process is not fully automatic, and the criterion for selecting the global smoothing parameter for real images is somewhat subjective.

To address this problem, we propose a fully data-driven adaptive scale selection method using the RICI rule proposed in Section 3. Specifically, we extend the RICI rule from the symmetric kernel to the steering kernel and obtain a new steering-kernel-based LPR with refined ICI bandwidth selection (SK-LPR-RICI). To this end, we propose to construct the local steering kernel \( {K_h}\left( {\mathbf{u}} \right) = \frac{1}{{{h^2}}}K\left( {{\mathbf{u}}/h} \right) \) at each sample x of an image from a steering basis-kernel K(u) and a scale parameter h. The shape of the local basis-kernel K(u) is determined first from the local dominant orientation around x, which can be estimated by the PCA method in [22]. The details of the PCA method are omitted to save space; the reader is referred to [22]. Then, the RICI rule is employed to choose the optimal scale parameter h for the steering basis-kernel K(u).
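
The exact PCA construction of [22] is not reproduced here; the sketch below is only a simplified stand-in that captures the idea of shaping the kernel from the local gradient covariance, with an illustrative regularization and a unit-determinant normalization that are our own assumptions.

```python
import numpy as np

def steering_matrix(gx, gy, reg=1.0):
    """Build a 2 x 2 steering bandwidth matrix from the gradient samples
    (gx, gy) collected in a local window: the eigenvectors of the gradient
    covariance give the dominant orientation, and the kernel is elongated
    along the edge (the weak-gradient direction)."""
    G = np.stack([np.ravel(gx), np.ravel(gy)], axis=1)
    C = G.T @ G / G.shape[0] + reg * np.eye(2)      # regularized gradient covariance
    w, V = np.linalg.eigh(C)                        # principal axes of the local structure
    H_S = V @ np.diag(w ** -0.5) @ V.T              # shrink across edges, stretch along them
    return H_S / np.sqrt(np.linalg.det(H_S))        # normalize to unit determinant
```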

4.2 Steering-Kernel-Based LPR with RICI Method

Once the locally adaptive steering basis-kernel is determined, the RICI adaptive scale selection method can be used with slight modification to determine the locally adaptive scale parameter h. More precisely, for each scale parameter h j from the scale set \( {\mathbf{\tilde{H}}} \), we use the steering kernel \( {K_{{h_j}}}\left( {{\mathbf{X}} - {\mathbf{x}};{{\mathbf{H}}_S}} \right) = \frac{1}{{h_j^2}}K\left( {\left( {{\mathbf{X}} - {\mathbf{x}}} \right)/{h_j};{{\mathbf{H}}_S}} \right) \) with bandwidth matrix h j H S to fit a p-th order polynomial at x and obtain a series of smoothed estimates \( \hat{m}\left( {{\mathbf{x}};{h_j}} \right) \). The traditional ICI rule is then invoked to select a preliminary scale parameter \( {h_{{j^{+} }}} \). For d = 2 and k = (0,0), the bias and variance of the estimator are

$$ Bias\left\{ {\hat{m}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right\} = \sqrt {\mathbb{B}} {h^{p + 1}}, $$
(41)
$$ Var\left\{ {\hat{m}\left( {{\mathbf{x}};h} \right)|{\mathbf{\tilde{X}}}} \right\} = \mathbb{V}{h^{ - 2}}. $$
(42)

The refined scale h opt can then be determined by the RICI method. A LPR using this refined scale parameter is then performed on the image to yield the final estimate \( \hat{m}\left( {{\mathbf{x}};{h^{opt}}} \right) = {\hat{\beta }_{0,0}}\left( {{\mathbf{x}};{h^{opt}}} \right) \).

Inspired by the ISKR method, the local steering matrix estimation and LPR with RICI rule can be executed iteratively to achieve better de-noising results, if necessary. The proposed iterative SK-LPR-RICI method can be summarized as follows:

  Step 1 (Initial gradient estimation): With an initial bandwidth matrix \( {{\mathbf{H}}^{(0)}} = {h^{(0)}}{\mathbf{I}} \), calculate the initial estimate \( {\hat{m}^{(0)}}\left( {{\mathbf{x}};{{\mathbf{H}}^{(0)}}} \right) \) and the gradient estimates \( \hat{\beta }_{0,1}^{(0)}\left( {{\mathbf{x}};{{\mathbf{H}}^{(0)}}} \right) \) and \( \hat{\beta }_{1,0}^{(0)}\left( {{\mathbf{x}};{{\mathbf{H}}^{(0)}}} \right) \).

  Step 2 (Steering kernel estimation, l-th iteration): The first-order gradient estimates \( \hat{\beta }_{0,1}^{\left( {l - 1} \right)}\left( {{\mathbf{x}};{{\mathbf{H}}^{\left( {l - 1} \right)}}} \right) \) and \( \hat{\beta }_{1,0}^{\left( {l - 1} \right)}\left( {{\mathbf{x}};{{\mathbf{H}}^{\left( {l - 1} \right)}}} \right) \) are used to calculate the steering bandwidth matrix \( {\mathbf{H}}_{S,{\mathbf{x}}}^{(l)} \) and the steering basis-kernel \( K_{\mathbf{x}}^{(l)}\left( {{\mathbf{u}};{{\mathbf{H}}_{S,{\mathbf{x}}}}} \right) \).

  Step 3 (Local scale parameter estimation, l-th iteration): Obtain a series of LPR results using kernels with the bandwidth matrix \( h \cdot {\mathbf{H}}_{S,{\mathbf{x}}}^{(l)} \), with h belonging to \( {\mathbf{\tilde{H}}} \). Determine the scale parameter \( {h^{opt,(l)}} \) using the RICI rule.

  Step 4 (LPR with estimated steering kernel, l-th iteration): A LPR is performed with the estimated bandwidth matrix \( {{\mathbf{H}}^{(l)}} = {h^{opt,(l)}}{\mathbf{H}}_{S,{\mathbf{x}}}^{(l)} \) to obtain \( {\hat{m}^{(l)}}\left( {{\mathbf{x}};{{\mathbf{H}}^{(l)}}} \right) \), along with its partial derivatives \( \hat{\beta }_{1,0}^{(l)}\left( {{\mathbf{x}};{h^{opt}}} \right) \) and \( \hat{\beta }_{0,1}^{(l)}\left( {{\mathbf{x}};{h^{opt}}} \right) \).

Generally, more iterations reduce the estimation variance, but the bias increases. As a result, when the additive noise is small, the number of iterations should also be small to avoid over-smoothing the image. When the image is contaminated by a large amount of noise, more iterations can improve the de-noising results. Similarly, given prior knowledge of the additive noise, the initial scale parameter h (0) should be selected as a large value when the noise component is significant, while a small value of h (0) is more appropriate when the additive noise is small.
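
The following skeleton outlines the iteration; the three callables and their signatures are placeholders for the building blocks described above (LPR fitting, steering-kernel estimation, RICI scale selection) and are not the authors' implementation.

```python
import numpy as np

def sk_lpr_rici(image, fit_lpr, estimate_steering, select_scale_rici, h0=1.0, n_iter=3):
    """Iterative SK-LPR-RICI skeleton (Steps 1-4): fit_lpr(image, H) returns
    (m_hat, grad_x1, grad_x2) for a per-pixel bandwidth field H,
    estimate_steering(grad_x1, grad_x2) returns the steering matrices H_S,
    and select_scale_rici(image, H_S) returns the refined per-pixel scales."""
    H = np.broadcast_to(h0 * np.eye(2), image.shape + (2, 2))  # Step 1: H^(0) = h^(0) I
    m_hat, gx, gy = fit_lpr(image, H)                          # initial fit and gradients
    for _ in range(n_iter):
        H_S = estimate_steering(gx, gy)                        # Step 2: steering kernels
        h_opt = select_scale_rici(image, H_S)                  # Step 3: RICI scales
        H = h_opt[..., None, None] * H_S                       # Step 4: H^(l) = h_opt * H_S
        m_hat, gx, gy = fit_lpr(image, H)                      # re-fit with steering kernels
    return m_hat
```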

The computational complexity of the SK-LPR-RICI method is mainly determined by the polynomial order p and the kernel size N K . For each data sample, the complexities of computing \( {\mathbb{X}^T}\mathbb{W}\mathbb{X} \) and \( {\mathbb{X}^T}\mathbb{W}\mathbb{Y} \) in (7) are \( O\left( {{q^2}N_K^2} \right) \) and \( O\left( {qN_K^2} \right) \) respectively, and the matrix inversion requires \( O\left( {{q^3}} \right) \) operations, where \( q = \prod\limits_{j = 1}^p {\left[ {\left( {d + j} \right)/j} \right]} \) is the length of the coefficient vector β. If q ≪ N K , the complexity of the WLS solution is \( O\left( {N_K^2} \right) \). As a result, a large p, which means a large q, will lead to a very heavy computational load. Hence, a linear (p = 1) regression is generally used, and the kernel size N K is given a small value, say 9 or 11. Since the SK-LPR-RICI method involves the SVD, which has a complexity of \( O\left( {N_K^3} \right) \), it has a higher arithmetic complexity than the symmetric-kernel-based LPR. As for the ISKR method, if there is no a priori knowledge about the global smoothing parameter h 0, each h 0 from \( {\mathbf{\tilde{H}}} \) has to be tested and, more importantly, the quality has to be judged manually. Hence, the complexities of ISKR and the proposed iterative SK-LPR-RICI method are comparable.

4.3 LPR for Image Smoothing and Reconstruction

Since LPR is applicable to uniformly as well as non-uniformly sampled observations, it is very useful in smoothing and interpolating generally sampled data sets. One useful application is the reconstruction of a high-resolution (HR) image from multiple shifted low-resolution (LR) images in super-resolution imaging [25–28]. Since the LR images may be shifted arbitrarily from each other, the image samples are non-uniformly spaced. Using LPR, we can construct a HR image from the non-uniformly spaced image samples. More information about super-resolution and image registration can be found in [25–30]. Here, we only focus on the application of the SK-LPR-RICI method to HR imaging from LR images without blurring. The image samples may be corrupted by noise due to imprecise registration.

As shown before, LPR is a practical de-noising and interpolation technique for image processing. Therefore, it can be used to interpolate noisy uniformly spaced images and remove possible artifacts due to data compression. This will be illustrated further by examples in Section 5. However, it is worth noting that the proposed LPR-based methods are only capable of handling artifacts that can be modeled as additive white Gaussian noise. It would be interesting to extend the LPR-based methods to deal with other unwanted effects such as illumination changes or blurring, but this is outside the scope of the paper and we will pursue this direction in future work.

5 Experimental Results

In this section, we evaluate the proposed RICI method for image smoothing, coding artifacts removal, and super-resolution imaging. The peak signal-to-noise ratio (PSNR), \( PSNR = 10 \cdot {\log_{10}}\left( {{{255}^2}/MSE} \right) \), is used as the performance criterion for evaluation. The PSNR values shown in the tables are averages over 100 independent Monte-Carlo realizations, and one realization is illustrated in the figures.
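
For reference, a direct sketch of this criterion for 8-bit images is given below (the function name is illustrative).

```python
import numpy as np

def psnr(reference, estimate):
    """PSNR in dB for 8-bit images: 10 * log10(255^2 / MSE)."""
    err = np.asarray(reference, float) - np.asarray(estimate, float)
    return 10.0 * np.log10(255.0 ** 2 / np.mean(err ** 2))
```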

5.1 Image De-Noising

We now apply the proposed SK-LPR-RICI method to suppress additive white Gaussian noise in gray-scale images. Two 256 × 256 8-bit images, Test and Lena, are used in the simulation, and the noise-free original images are shown in Figs. 1a and 2a. The image Test contains distinct edges and corners, so the performance around image edges is easy to visualize and compare. Lena contains a good mixture of details, flat areas, and texture, which makes it a commonly used test image. Gaussian kernels are used here, and the scale set of (16) is chosen as \( {\mathbf{\tilde{H}}} = \left\{ {{h_j}|{h_j} = 1/4,1/2,1,2,4} \right\} \). The kernel size parameter N K is set to 11.

Figure 1

Comparison of different adaptive scale selection methods for smoothing the image Test: a original noise-free image, b noisy image, c ISKR, d LPR-ICI, e LPR-RICI, f SK-LPR-RICI, g adaptive local scale with ICI rule, h adaptive local scale with RICI rule, i adaptive local scale with RICI rule based on steering kernel, j enlarged image of ISKR, k enlarged image of LPR-RICI, l enlarged image of SK-LPR-RICI.

Figure 2

Comparison of different adaptive scale selection methods for smoothing the image Lena: a original noise-free image, b noisy image, c ISKR, d LPR-ICI, e LPR-RICI, f SK-LPR-RICI, g adaptive local scale with ICI rule, h adaptive local scale with RICI rule, i adaptive local scale with RICI rule based on steering kernel, j enlarged image of ISKR, k enlarged image of LPR-RICI, l enlarged image of SK-LPR-RICI.

Four different adaptive scale selection methods are evaluated:

  1) LPR-ICI: the traditional ICI adaptive scale selector introduced by Katkovnik in [18], with a symmetric circular kernel.

  2) LPR-RICI: the refined ICI adaptive scale selection method proposed in Section 3, with a symmetric circular kernel.

  3) Iterative SK-LPR-RICI: the iterative steering-kernel-based LPR with the refined ICI adaptive scale selector, as proposed in Section 4.

  4) ISKR: the iterative steering kernel regression method introduced by Takeda et al. [22].

For the ISKR method, the global smoothing parameter h 0 is chosen from the scale parameters in the set \( {\mathbf{\tilde{H}}} \). The number of iterations in the SK-LPR-RICI and ISKR methods is set to 3. We shall estimate the smooth function \( \hat{m}\left( {{{\mathbf{X}}_i}} \right) \) using linear regression with p = 1. As a result, we have \( \beta = 2\left( {p + 1 - \mathcal{K}} \right) = 4 \) and \( \nu = 2\mathcal{K} + d = 2 \) in the bias and variance expressions (31) and (32). More precisely, the estimation bias is proportional to the squared scale while the variance is inversely proportional to the squared scale. The threshold parameter κ used for the ICI-based methods is selected as 1.96 so that the probability that \( \hat{m}\left( {{{\mathbf{X}}_i}} \right) \) lies in the confidence interval is 95%. With these parameters and Eqs. (36)–(39), we can calculate the parameters for the RICI method: ∆κ = 0.56, η = 0.55, ∆η = 0.22, and \( {h^{opt}} = 0.59{h_{{j^{+} }}} \).
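
As a check, these values follow directly from (36)–(38) with a = 2, β = 4, ν = 2, and κ = 1.96:

$$ \Delta \kappa = \frac{{2\left( {1.96} \right)}}{{{2^3} - 1}} \approx 0.56,\quad \eta = \frac{1}{3}{\log_2}\left( {0.56\sqrt {2} \cdot \frac{{1 + 2}}{{1 - {2^{ - 2}}}}} \right) \approx 0.55,\quad \Delta \eta = \frac{1}{3}{\log_2}\left( {\frac{9}{2}} \right) - \frac{1}{2} \approx 0.22, $$

so that \( {a^{ - \left( {\eta + \Delta \eta } \right)}} = {2^{ - 0.78}} \approx 0.59 \).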

The corresponding smoothing results are shown in Figs. 1 and 2, where the additive noise is Gaussian distributed with zero mean and variance σ2 = 25. To evaluate the best possible performance of the ISKR, we assume that the “true image” is available so that the best global smoothing scale h 0 can be found by a one-dimensional search. We found that the MSE curve as a function of h 0 is convex near the global minimum, and, therefore, a bisection search is employed to seek the best value of h 0 from the bandwidth set \( {\mathbf{\tilde{H}}} \).

The smoothing results of the ISKR method in Figs. 1c and 2c are obtained with the optimal global smoothing parameter h 0 from \( {\mathbf{\tilde{H}}} \) and the optimal number of iterations l. For both the Test and Lena images, the optimal number of iterations in the ISKR method is l = 1, with optimal global smoothing parameters h 0 = 1.27 and h 0 = 0.65, respectively. The results of the iterative SK-LPR-RICI method, illustrated in Figs. 1f and 2f, are obtained with the same number of iterations, which also gives its best PSNR value.

We can see from Figs. 1c–f and 2c–f that all four adaptive scale selection methods give satisfactory results: the noise in flat areas is suppressed and the edges are preserved quite well. By comparing the PSNR values in Table 1 (a), we find that the SK-LPR-RICI method has the highest PSNR. Some specific areas are enlarged in Figs. 1j–l and 2j–l for better visualization. They show that the SK-LPR-RICI method also has better visual performance.

Table 1 PSNR comparisons of variable basis-kernel and scale selection methods for image processing (σ 2 = 25).

Next, we evaluate the performances of various scale selection methods under different noise variances. The PSNR values are summarized in Table 2. It can be seen from Table 2 that:

  1. Generally, the PSNR values of the four adaptive scale selection methods in ascending order are: LPR-ICI, LPR-RICI, ISKR, SK-LPR-RICI. Note that the best performance of the ISKR method cannot in general be realized because h 0 is chosen to minimize the MSE, which is unavailable in practice.

  2. The best global smoothing parameter h 0 for the ISKR method varies with the noise variance. For small noise levels, h 0 is relatively small, while for large noise levels, h 0 increases accordingly.

  3. More iterations generally cause more smoothing of the images. Therefore, the number of iterations should increase with the noise level of the images, if it is known a priori. More iterations should only be applied to rather noisy images in order to avoid over-smoothing.

Table 2 PSNR comparisons of variable basis-kernel and scale selection methods for image de-noising (against different noise variance).

Finally, the computational complexities of the methods under test are compared. The detailed analysis of one LPR operation at a pixel can be found in Section 4.2. The LPR-ICI method has the lowest complexity because it does not need to refine the ICI-based bandwidths or to calculate the steering kernels. The SK-LPR-RICI and ISKR methods have higher complexities than the LPR-ICI and LPR-RICI methods because the parameters needed to construct the steering kernels have to be estimated. The SK-LPR-RICI method has a higher complexity than ISKR because it includes the additional ICI bandwidth selection and refinement. In summary, the computational complexities of the four methods in ascending order are: LPR-ICI, LPR-RICI, ISKR, SK-LPR-RICI.

5.2 Coding Artifacts Removal

LPR can also be used to suppress the quantization noise in image and video compression. This is because the artifacts or noise produced during the compression and quantization can be roughly modeled as Gaussian distributed. This is supported by observations in [31], which reported that the JPEG compression noise is Gaussian distributed with zero mean and a covariance matrix determined by the quantization table and DCT transform matrix.

To evaluate the effectiveness of the LPR algorithms, we compressed the image Lena by JPEG with a compression ratio of 50%, as shown in Fig. 3b. Only one iteration is employed in the ISKR and iterative SK-LPR-RICI methods because the quantization noise is not very severe. Other experimental settings are the same as in the previous de-noising experiment. The compression artifacts mainly occur near sharp edges and have block-like shapes, as illustrated in Fig. 3. It can be seen that the blocking artifacts are suppressed satisfactorily by the LPR-based methods. The proposed SK-LPR-RICI method achieves the best results among these LPR-based methods, and the PSNR values are summarized in Table 1 (b).

Figure 3

Comparison of different adaptive scale selection methods for smoothing the JPEG coded image Lena: a original noise-free image, b JPEG compressed image, c ISKR, d LPR-ICI, e LPR-RICI, f SK-LPR-RICI, g-i enlarged images of the original image, j-l enlarged images of SK-LPR-RICI.

5.3 Super-Resolution Imaging

In this section, the applications of the proposed SK-LPR-RICI method to super-resolution of test gray-scale images and real color photos are presented.

  1) Gray-scale images: A set of four LR (128 × 128) images is generated from the original HR (256 × 256) Lena by decimating it by a factor of 2 both horizontally and vertically, with shifts of zero pixels, one pixel to the right, one pixel to the bottom, and one pixel to the right and bottom, respectively. Gaussian noise of zero mean and variance σ2 = 25 is added to the LR images. The scale set \( {\mathbf{\tilde{H}}} \), the threshold parameter κ, and the kernel size N K used in this experiment follow their values in the previous de-noising experiment.

Experiments on the test image Lena are carried out to compare the performance of the SK-LPR-RICI method with other traditional reconstruction methods based on bicubic spline interpolation, the iterative back-projection (IBP) method [28, 29], and ISKR. The IBP method minimizes the LS error between the HR image to be estimated and the set of known LR images using LS estimation with regularization. In LPR, we model each pixel in the true image as a local polynomial and solve a series of local LS estimation problems using the samples from the known LR images, instead of the whole image. As a result, compared with the IBP method, the LPR-based reconstruction methods can better adapt to local orientations in the image and achieve a better performance at the expense of a much higher complexity.

As shown in the results of Fig. 4, the IBP method in (c) gives a better result than the bicubic interpolation method in (b), but the edges are blurred compared with the LPR-based methods in (d)–(f). The main difficulty of the LS regularization method is to choose a global regularization parameter that suppresses the additive noise while preserving image edges. The PSNR values of Fig. 4b–f are listed in Table 1 (c). The proposed SK-LPR-RICI method has a higher PSNR value than the other methods tested.

Figure 4

Comparison of different super-resolution methods for the image Lena: a one low-resolution noisy image, b bicubic spline interpolation, c IBP, d ISKR, e LPR-RICI, f SK-LPR-RICI, g enlarged images of the original HR image, h enlarged images of the bicubic spline interpolation, i enlarged images of IBP, j enlarged images of SK-LPR-RICI.

  2) Color images: We now consider the reconstruction of real color photos. In color images, each pixel consists of the red (R), green (G), and blue (B) components in the RGB color space. Therefore, it is possible to extend the gray-scale super-resolution methods to color images by treating each channel in the color space as an individual image and employing the gray-scale super-resolution methods to each channel independently. However, as pointed out in [27], this method will produce undesirable color artifacts because it ignores the correlation between different channels. In this paper, we first convert the image to the HSI (Hue, Saturation, and Intensity) color space. The adaptive scale selection methods will be applied to the intensity channel to obtain the local basis-kernel and scale parameters. A LPR using the resulting basis-kernel and scale parameters is then applied to the R, G, and B channels of the image, respectively. An iterative maximum a posteriori (MAP) color super-resolution method proposed in [26, 27] is also included in the experiment for comparison. The MAP method also solves a global LS problem with regularization to minimize the estimation error between the HR and LR images.

Six LR (456 × 289) color photos were captured by a commercial digital camera (Canon IXUS 600). The SK-LPR-RICI super-resolution method is applied to increase the spatial resolution by a factor of 3 to yield a 1368 × 867 HR image. A frequency-domain based motion estimation and image registration technique developed in [30] is employed to determine the relative shifts of these LR images.

For comparison, we display the LR photo at the same size as the HR image by resizing each pixel to a 3 × 3 block. Some areas of the LR and resultant HR images are enlarged so that the details can be seen clearly. From visual inspection of the results in Fig. 5, it can be observed that the SK-LPR-RICI method is able to preserve image edges while removing the image noise effectively.

Figure 5

Comparison of different super-resolution methods for real color photos: a one LR image and its enlarged images, b iterative MAP method and its enlarged images, c SK-LPR-RICI and its enlarged images.

Overall, based on the experimental results in this section, it can be concluded that the proposed SK-LPR-RICI method is a useful tool in smoothing and reconstruction of noisy images. It can preserve local characteristics well and has a systematic algorithm to automatically calculate the parameters involved in the method.

6 Conclusion

A study on multivariate LPR for the analysis of multidimensional signals was presented. A new refined ICI adaptive bandwidth selection method was first developed for multivariate LPR, and it was shown to have a better bias-variance tradeoff than the traditional ICI rule. This method was further extended to the case of a steering kernel with local orientation for better adaptation to the local characteristics of multidimensional signals. The data-adaptive scale selection method was then applied to smoothing and reconstruction of noisy images. Simulation results on test and real images showed that the proposed SK-LPR-RICI method has a better PSNR and visual performance than conventional LPR-based methods in image processing.