Nonlocal robust tensor recovery with nonconvex regularization *

Duo Qiu; Minru Bai; Michael K Ng; Xiongjun Zhang

doi:10.1088/1361-6420/abd85b

1. Introduction

With the rapid development of big data processing, tensors have received much attention in the past decades and become one of the most powerful tools to represent multi-dimensional data, which arises in various areas such as hyperspectral image processing [56, 68], computer vision [35], machine learning [48], data mining [11], and neuroscience [31]. Many underlying tensor data of interest are low-rank in a variety of applications, and thus have much lower-dimensional structure. However, these tensor data often suffer from missing information and are corrupted by mixed noise during the data acquisition and transmission process. In this paper, we study the robust low-rank tensor recovery problem for third-order tensors, in which the few available entries are corrupted by Gaussian noise and sparse noise simultaneously.

Compared with robust matrix recovery [30], it is more complex for the robust tensor recovery problem due to diverse definitions of the rank of a tensor, while the rank of a matrix is unique. There are mainly four classes of ranks of a tensor, including CANDECOMP/PARAFAC (CP) rank [10, 22], Tucker rank [51], tensor train (TT) rank [44], and tubal rank and multi-rank [29]. Based on CP rank, Ashraphijuo and Wang [2] studied the tensor completion problem and derived fundamental conditions on the sampling pattern for finite and unique completability of a sampled tensor, while its CP rank should be given in advance, which is generally NP hard to acquire [25]. Besides, some Tucker-based methods were proposed and studied to complete a tensor with/without noise, which utilized the unfolding matrices of the tensor, see [5, 17, 19, 34, 35, 39, 55] and references therein. For example, Liu et al [35] proposed the sum of nuclear norms of unfolding matrices of a tensor to approximate the Tucker rank minimization for tensor completion. Gandy et al [17] proposed a penalty method for tensor completion with/without noisy observations. However, Romera-Paredes and Pontil [46] showed that the sum of nuclear norms of unfolding matrices of a tensor is not the tightest convex relaxation of the sum of its entries of the Tucker rank. Moreover, the matricization of a tensor may break down the intrinsic structure of the tensor data, which leads to the loss of correlation. Based on TT rank minimization⁴ , Bengua et al [6] proposed a novel tensor completion method without sparse noise, which is capable of gaining hidden information from a tensor due to its definition from a well-balanced matricization scheme. The TT method in [6] needs to reshape the original tensor data into a higher-order one by the ket augmentation technique [33], while the ket augmentation scheme could only be implemented for special sizes of tensors and may also destroy the intrinsic structure of the original tensor data.

Another type of tensor rank is the tubal rank and multi-rank [29], which is based on a novel tensor–tensor product. The algebraic operations of this kind of tensor–tensor product and tensor singular value decomposition (SVD) were proposed and studied for third-order tensors in [9, 21, 28], which were further generalized to higher-order tensors in [38]. For tubal rank and multi-rank minimization, Semerci et al [47] proposed to utilize the tensor nuclear norm (TNN) to approximate the sum of its entries of the multi-rank, where the TNN is the convex envelope of ℓ₁ norm of the multi-rank of a tensor under the unit ball of tensor spectral norm [67]. Due to considering the internal structure of the tensor data, the tubal rank of a tensor and TNN have been widely used and studied for robust tensor completion in recent years (see, e.g., [26, 36, 52, 60–62, 65–68]). However, the TNN is just the sum of nuclear norms of all frontal slices of a tensor in the Fourier domain [47]. There is a limitation in the use of discrete Fourier transform, where periodicity is assumed. Moreover, the discrete Fourier transform may not result in a low rank tensor [49], while a low rank requirement is preferred for robust tensor recovery. Recently, Song et al [49] proposed a transformed tensor nuclear norm (TTNN) method for robust tensor completion via a novel tensor–tensor product and transformed tensor SVD, see also [41]. The main motivation of the transformed tensor SVD and TTNN is that it may result in lower rank transformed matrix slices with suitable unitary transformation. Therefore, a lower transformed multi-rank tensor may be obtained via using other unitary transform matrices than that via using discrete Fourier transform matrix [49]. Actually, the aforementioned method is only designed to complete a tensor with only sparse noise, while many tensors of interest are usually corrupted by a combination of several different types of noise during the acquisition process in real-world scenarios, such as Gaussian noise and impulse noise.

On the other hand, the TTNN is the sum of nuclear norms of all frontal slices of the tensor in the transformed domain, which is the ℓ₁ norm of all singular vectors. In the least absolute shrinkage and selection operator (lasso) problem for simultaneous estimation and variable selection, one needs to minimize the square loss function plus the ℓ₁ norm. However, Zou [70] showed that the optimal solution of lasso problem is inconsistent under certain scenarios and does not have oracle properties. The lasso shrinkage produces biased estimates for the large coefficients, and thus it could be suboptimal [15, 50], which cannot achieve the best estimation performance [70]. We remark that a similar case also arises for the tensor ℓ₁ norm. Therefore, some nonconvex penalties have been proposed to enhance sparsity, which have been insightfully investigated in statistical learning and image restoration literature (see, e.g., [15, 16, 42, 59, 64, 70]). For example, in high-dimensional statistics, Fan and Li [15] proposed a nonconvex penalty for sparse estimation and variable selections instead of using the convex ℓ₁ penalty. It has been noted that using nonconvex penalties can alleviate some of the shortcomings of the ℓ₁ norm in theory and experiments, such as minimax concave penalty (MCP) [59], see also [16, 70]. The nonconvex penalties may shrink estimator coefficients faster for small values than the ℓ₁ penalty, and often perform very well when the underlying model is very sparse. These situations are similar to the nonconvex functions onto the singular values of a matrix [37, 54].

In this paper, by jointly exploiting the low-rankness and nonlocal self-similarity of the underlying tensor data in both spatial and tubal dimensions and the sparsity of the sparse corruptions, we propose a nonlocal robust low-rank tensor recovery model with nonconvex regularization (NRTRM), which aims to recover a tensor corrupted by Gaussian noise and sparse noise with partial observations. The NRTRM proceeds as follows: first, we extract the locations of similar patched-tubes of an initial estimator, which could be generated by a tensor SVD method [26]. The nonlocal patch techniques have been proposed and widely used in image restoration, see [13, 14] and references therein. Besides, the similar patches of the observed tensor with the same locations of previous patches are unfolded into matrices along the second-dimension, and then we group them into a third-order sub-tensor, where a lower rank sub-tensor may be obtained. Second, we propose to employ a class of nonconvex functions on the singular values of the TTNN of the low-rank component and each entry of the sparse component for the incomplete and corrupted sub-tensor, respectively, and use the tensor Frobenius norm to remove Gaussian noise further. The nonconvex functions are performed on the singular values of all frontal slices of the sub-tensor in the transformed domain for the low-rank sub-tensor and all entries for the sparse corruptions sub-tensor, respectively, which are capable of promoting the low-rankness of the underlying sub-tensor and enhancing the sparsity of the sparse corruptions better compared with TTNN and tensor ℓ₁ norm. In addition, a proximal alternating linearized minimization algorithm (PALM) [8] is designed to solve the resulting model, where the data-fitting term is linearized with a proximal term. The convergence of PALM is established under very mild conditions. Numerical experiments on both multispectral images and video datasets show the effectiveness of the NRTRM model compared with several existing methods in terms of quantitative measure and visual quality for robust tensor recovery.

The remainder of this paper is organized as follows. In the next section, we introduce some notation and notions, and briefly review the transformed tensor SVD of a tensor. The NRTRM model for robust low-rank tensor recovery is proposed in section 3. A PALM algorithm is presented to solve the resulting model and its convergence is established in section 4. Extensive numerical experiments are reported to verify the superior performance of the NRTRM model in section 5. Finally, the concluding remarks are given in section 6.

2. Preliminaries

In this section, some notation and notions are provided and summarized. Then we give a brief overview of the transformed tensor SVD of a tensor.

2.1. Notation and notions

Throughout this paper, $\mathbb{R}$ and $\mathbb{C}$ denote the spaces with real numbers and complex numbers, respectively. Scalars and vectors are denoted by lowercase letters and boldface lowercase letters, respectively, e.g., a and a. Matrices are denoted by boldface capital letters, e.g., A. Tensors are denoted by Euler script letters, e.g., $\mathcal{A}$ . The order of a tensor refers to the number of its dimensions, also known as ways or modes [31]. For any third-order tensor $\mathcal{A}$ , its (i, j, k)th entry is denoted as ${\mathcal{A}}_{ijk}$ . A tube of a third-order tensor is defined by fixing the first two indices and varying the third one. The ith frontal, lateral, and horizontal slices of $\mathcal{A}$ are denoted by the MATLAB notation $\mathcal{A}\left(:,:,i\right),\mathcal{A}\left(:,i,:\right),\mathcal{A}\left(i,:,;\right)$ , respectively. For simplicity, the ith frontal slice of $\mathcal{A}$ is also denoted by A⁽ⁱ⁾.

For any vector $\mathbf{a}\in {\mathbb{R}}^{n}$ , the Euclidean norm, denoted by ||a||, is defined as ${\Vert}\mathbf{a}{\Vert}=\sqrt{{\sum }_{i=1}^{n}{a}_{i}^{2}}$ , where a_i is the ith entry of a. Diag(a) denotes a diagonal matrix whose diagonal elements are the components of a. For any matrix A, A^H denotes the conjugate transpose of A. ||A||_* denotes the nuclear norm of the matrix A, which is the sum of all singular values of A, ||A||_F denotes the Frobenius norm of A, which is defined as ${\Vert}\mathbf{A}{{\Vert}}_{\mathrm{F}}=\sqrt{\langle \mathbf{A},\mathbf{A}\rangle }$ . Here the inner product is defined as ⟨A, A⟩ = Tr(A^H A), where Tr(⋅) stands for the trace of a matrix. σ_i(A) denotes the ith largest singular value of A. The inner product between two tensors $\mathcal{X},\mathcal{Y}\in {\mathbb{C}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ is defined as $\langle \mathcal{X},\mathcal{Y}\rangle ={\sum }_{i=1}^{{n}_{3}}{\sum }_{j=1}^{{n}_{3}}{\sum }_{k=1}^{{n}_{3}}{\mathcal{X}}_{ijk}^{{\ast}}{\mathcal{Y}}_{ijk}$ , where ⋅* denotes its conjugate. The tensor Frobenius norm of $\mathcal{A}$ , denoted as ${\Vert}\mathcal{A}{{\Vert}}_{\mathrm{F}}$ , is defined as ${\Vert}\mathcal{A}{{\Vert}}_{\mathrm{F}}=\sqrt{\langle \mathcal{A},\mathcal{A}\rangle }$ . For any third-order tensor $\mathcal{A}$ , the tensor ℓ₁ norm is defined as ${\Vert}\mathcal{A}{{\Vert}}_{1}={\sum }_{i}{\sum }_{j}{\sum }_{k}\vert {\mathcal{A}}_{ijk}\vert$ .

Let $\mathfrak{L}$ be a finite-dimensional Euclidean space, a function $f:\mathfrak{L}\to \left[-\infty ,+\infty \right]$ is called proper if f(x) < +∞ for at least one $x\in \mathfrak{L}$ , and f(x) > −∞ for all $x\in \mathfrak{L}$ [45]. The effective domain of f is defined as dom(f) := {x : f(x) < +∞}. For a given proper and lower semicontinuous function $f:\mathfrak{L}\to \left(-\infty ,+\infty \right]$ , the proximal mapping associated with f at y is defined by

$\begin{equation*}{\text{Prox}}_{f}\left(y\right)=\mathrm{arg}{\mathrm{min}}_{x\in \mathfrak{L}}\left\{f\left(x\right)+\frac{1}{2}{\Vert}x-y{{\Vert}}^{2}\right\},\quad \forall y\in \mathfrak{L}.\end{equation*}$

Now we recall the subdifferential of a nonconvex function [45, definition 8.3]. The subdifferential of $f:{\mathbb{R}}^{n}\to \left(-\infty ,+\infty \right]$ at x, denoted as ∂f(x), is defined by

$\begin{align*}\partial f\left(\mathbf{x}\right)& =\left\{\mathbf{y}\in {\mathbb{R}}^{n}:\enspace \exists \enspace {\mathbf{x}}^{k}\to \mathbf{x},f\left({\mathbf{x}}^{k}\right)\to f\left(\mathbf{x}\right),\right.\\ & \left.\quad {\times}{\mathbf{y}}^{k}\to \mathbf{y}\enspace \enspace \text{with}\enspace \enspace {\mathbf{y}}^{k}\in \hat{\partial }f\left({\mathbf{x}}^{k}\right)\enspace \enspace \quad \text{as}\enspace \enspace k\to +\infty \right\},\end{align*}$

where $\hat{\partial }f\left(\mathbf{x}\right)$ denotes the Fréchet subdifferential of f at x ∈ dom(f), which is the set of all y satisfying

$\begin{equation*}{\mathrm{lim}\enspace \mathrm{i}\mathrm{n}\mathrm{f}}_{\mathbf{z}\to \mathbf{x},\mathbf{z}\ne \mathbf{x}}\enspace \frac{f\left(\mathbf{z}\right)-f\left(\mathbf{x}\right)-\langle \mathbf{y},\mathbf{z}-\mathbf{x}\rangle }{{\Vert}\mathbf{z}-\mathbf{x}{\Vert}}{\geqslant}0.\end{equation*}$

For any $\mathcal{X}\in \mathfrak{L}$ , the distance from $\mathcal{X}$ to $\mathfrak{U}$ is defined and denoted by $\mathrm{dist}\left(\mathcal{X},\mathfrak{U}\right){:=}\mathrm{inf}\left\{{\Vert}\mathcal{X}-\mathcal{Y}{{\Vert}}_{\mathrm{F}}:\mathcal{Y}\in \mathfrak{U}\right\}$ , where $\mathfrak{U}$ is a subset of $\mathfrak{L}$ . Next, we recall the Kurdyka–Łojasiewicz (KL) property [8, definition 3], which plays a pivotal role in the analysis of the convergence of PALM for the nonconvex problems.

Definition 1. [8]. Let $f:{\mathbb{R}}^{n}\to \left(-\infty ,+\infty \right]$ be a proper and lower semicontinuous function.

(a)
The function f is said to have the KL property at x ∈ dom(∂f) if there exist η ∈ (0, +∞], a neighborhood $\mathfrak{U}$ of x and a continuous concave function φ : [0, η) → [0, +∞) such that: (a) φ(0) = 0; (b) φ is continuously differentiable on (0, η), and continuous at 0; (c) φ'(s) > 0 for all s ∈ (0, η); (d) for all $\mathbf{y}\in \mathfrak{U}\cap \left[\mathbf{y}\in {\mathbb{R}}^{n}:f\left(\mathbf{x}\right){< }f\left(\mathbf{y}\right){< }f\left(\mathbf{x}\right)+\eta \right]$ , the following KL inequality holds:
$\begin{equation*}{\varphi }^{\prime }\left(f\left(\mathbf{y}\right)-f\left(\mathbf{x}\right)\right)\mathrm{dist}\left(0,\partial f\left(\mathbf{y}\right)\right){\geqslant}1.\end{equation*}$
(b)
If f satisfies the KL property at each point of dom(∂f), then f is called a KL function.

2.2. Transformed tensor SVD

In this subsection, we briefly review the transformed tensor SVD, more details can be found in [49]. Let $\mathbf{U}\in {\mathbb{C}}^{{n}_{3}{\times}{n}_{3}}$ be a unitary matrix with $\mathbf{U}{\mathbf{U}}^{\mathrm{H}}={\mathbf{U}}^{\mathrm{H}}\mathbf{U}={\mathbf{I}}_{{n}_{3}}$ , where ${\mathbf{I}}_{{n}_{3}}$ is the n₃ × n₃ identity matrix. Denote $\hat{\mathcal{A}}$ to be a third-order tensor obtained via multiplying by U on all tubes along the third-dimension of $\mathcal{A}$ , i.e.,

$\begin{equation*}\hat{\mathcal{A}}\left(i,j,:\right)=\mathbf{U}\left(\mathcal{A}\left(i,j,:\right)\right).\end{equation*}$

For the sake of brevity, we denote $\hat{\mathcal{A}}=\mathbf{U}\left[\mathcal{A}\right]$ . Conversely, one can get $\mathcal{A}$ from $\hat{\mathcal{A}}$ by using U^H operation along the third-dimension of $\hat{\mathcal{A}}$ , i.e., $\mathcal{A}={\mathbf{U}}^{\mathrm{H}}\left[\hat{\mathcal{A}}\right]$ .

Next we recall the definition of tensor-tensor product with respect to any unitary matrix U (called U-product) [49, definition 1], which was first proposed by Kilmer and Martin [29] under Fourier transform and then generated to invertible linear transforms [27, definition 4.2].

Definition 2. [27, definition 4.2]. The U-product of $\mathcal{A}\in {\mathbb{C}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ and $\mathcal{B}\in {\mathbb{C}}^{{n}_{2}{\times}{n}_{4}{\times}{n}_{3}}$ , denoted as $\mathcal{A}{\diamond }_{\mathbf{U}}\mathcal{B}$ , is a tensor $\mathcal{C}\in {\mathbb{C}}^{{n}_{1}{\times}{n}_{4}{\times}{n}_{3}}$ given by

$\begin{equation*}\mathcal{C}=\mathcal{A}{\diamond }_{\mathbf{U}}\mathcal{B}={\mathbf{U}}^{\mathrm{H}}\left[\mathrm{f}\mathrm{o}\mathrm{l}\mathrm{d}\left(\text{blockdiag}\left(\hat{\mathcal{A}}\right)\cdot \text{blockdiag}\left(\hat{\mathcal{B}}\right)\right)\right],\end{equation*}$

where $\text{fold}\left(\text{blockdiag}\left(\hat{\mathcal{A}}\right)\right)=\hat{\mathcal{A}}$ and

$\begin{equation*}\text{blockdiag}\left(\hat{\mathcal{A}}\right){:=}\left(\begin{matrix}\hfill {\hat{\mathbf{A}}}^{\left(1\right)}\hfill & \hfill \hfill & \hfill \hfill & \hfill \hfill \\ \hfill \hfill & \hfill {\hat{\mathbf{A}}}^{\left(2\right)}\hfill & \hfill \hfill & \hfill \hfill \\ \hfill \hfill & \hfill \hfill & \hfill \ddots \hfill \\ \hfill \hfill & \hfill \hfill & \hfill \hfill & \hfill {\hat{\mathbf{A}}}^{\left({n}_{3}\right)}\hfill \end{matrix}\right).\end{equation*}$

Based on the definition of U-product, we state the following definition about the transformed tensor SVD.

Theorem 1. [27, theorem 5.1]. Let $\mathcal{A}\in {\mathbb{C}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ . The transformed tensor SVD of $\mathcal{A}$ is given as follows:

$\begin{equation*}\mathcal{A}=\mathcal{U}{\diamond }_{\mathbf{U}}\mathcal{S}{\diamond }_{\mathbf{U}}{\mathcal{V}}^{\mathrm{H}},\end{equation*}$

where $\mathcal{U}\in {\mathbb{C}}^{{n}_{1}{\times}{n}_{1}{\times}{n}_{3}}$ , $\mathcal{V}\in {\mathbb{C}}^{{n}_{2}{\times}{n}_{2}{\times}{n}_{3}}$ are unitary tensors with respect to U-product, and $\mathcal{S}\in {\mathbb{C}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ is a diagonal tensor.

Remark 1. In theorem 1, the conjugate transpose of a tensor $\mathcal{A}\in {\mathbb{C}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ with respect to U, denoted by ${\mathcal{A}}^{\mathrm{H}}\in {\mathbb{C}}^{{n}_{2}{\times}{n}_{1}{\times}{n}_{3}}$ , is defined as ${\mathcal{A}}^{\mathrm{H}}={\mathbf{U}}^{\mathrm{H}}\left[\text{fold}\left(\text{blockdiag}{\left(\hat{\mathcal{A}}\right)}^{\mathrm{H}}\right)\right]$ [49, definition 2]. Let $\mathcal{T}\in {\mathbb{R}}^{{n}_{1}{\times}{n}_{1}{\times}{n}_{3}}$ and each frontal slice of $\mathcal{T}$ be the n₁ × n₁ identity matrix. The identity tensor with respect to U is defined as $\mathcal{I}={\mathbf{U}}^{\mathrm{H}}\left[\mathcal{T}\right]$ [27, proposition 4.1]. A tensor $\mathcal{U}\in {\mathbb{C}}^{{n}_{1}{\times}{n}_{1}{\times}{n}_{3}}$ is unitary with respect to U-product if ${\mathcal{U}}^{\mathrm{H}}{\diamond }_{\mathbf{U}}\mathcal{U}=\mathcal{U}{\diamond }_{\mathbf{U}}{\mathcal{U}}^{\mathrm{H}}=\mathcal{I}$ [27, definition 5.1], where $\mathcal{I}$ is the identity tensor. In addition, a tensor $\mathcal{A}$ is called to be diagonal if each frontal slice A⁽ⁱ⁾ is a diagonal matrix [29].

Based on the transformed tensor SVD given in theorem 1, the transformed multi-rank of a tensor can be defined as follows.

Definition 3. [49, definition 6]. The transformed multi-rank of a tensor $\mathcal{A}\in {\mathbb{C}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ with respect to U is a vector $\mathbf{r}\in {\mathbb{R}}^{{n}_{3}}$ with its ith element being the rank of the ith frontal slice ${\hat{\mathbf{A}}}^{\left(i\right)}$ of $\hat{\mathcal{A}}$ , i.e., ${r}_{i}=\text{rank}\left({\hat{\mathbf{A}}}^{\left(i\right)}\right)$ .

Next the TTNN is used to approximate the sum of all elements of the transformed multi-rank of a tensor, whose convex envelope is TTNN under the unit ball of tensor spectral norm [49, lemma 1]. Here the tensor spectral norm of a tensor $\mathcal{A}$ is defined as the spectral norm of the matrix $\text{blockdiag}\left(\hat{\mathcal{A}}\right)$ .

Definition 4. [49, definition 7]. The TTNN of a tensor $\mathcal{A}\in {\mathbb{C}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ , denoted as ${\Vert}\mathcal{A}{{\Vert}}_{\text{TTNN}}$ , is the sum of the nuclear norms of all frontal slices of $\hat{\mathcal{A}}$ , i.e., ${\Vert}\mathcal{A}{{\Vert}}_{\text{TTNN}}={\sum }_{i=1}^{{n}_{3}}{\Vert}{\hat{\mathbf{A}}}^{\left(i\right)}{{\Vert}}_{{\ast}}$ .

3. Nonlocal and nonconvex model for robust tensor recovery

In this section, we first present the observation formulation with different degradations, and then propose an NRTRM model for robust tensor recovery.

3.1. Problem formulation

In many real scenarios, the underlying tensor data are usually corrupted by different degradations during the acquisition and transmission process, including Gaussian noise, sparse noise, and missing entries [40, 58]. For a noisy tensor $\mathcal{Y}$ and an index set Ω, only partial entries of $\mathcal{Y}$ are available, which implies that we know ${\mathcal{Y}}_{ijk}$ if (i, j, k) ∈ Ω and the index set Ω ⊆ {1, ..., n₁} × {1, ..., n₂} × {1, ..., n₃}. Specifically, the degradation model can be expressed as

$\begin{equation*}{\mathcal{P}}_{{\Omega}}\left(\mathcal{Y}\right)={\mathcal{P}}_{{\Omega}}\left(\mathcal{X}+{\mathcal{E}}^{\prime }+\mathcal{N}\right),\end{equation*}$

where $\mathcal{X}\in {\mathbb{R}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ is the underlying low-rank tensor we aim to estimate, ${\mathcal{E}}^{\prime }\in {\mathbb{R}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ is the sparse corruptions, $\mathcal{N}\in {\mathbb{R}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ is the Gaussian noise tensor, and ${\mathcal{P}}_{{\Omega}}$ is the projection operator onto Ω such that

$\begin{equation*}{\left({\mathcal{P}}_{{\Omega}}\left(\mathcal{X}\right)\right)}_{ijk}=\begin{cases}{\mathcal{X}}_{ijk},\hfill & \text{if}\;\;\left(i,j,k\right)\in {\Omega},\hfill \\ 0,\hfill & \text{otherwise}.\hfill \end{cases}\end{equation*}$

In this situation, our goal is to recover $\mathcal{X}$ from the missing and noisy observations ${\mathcal{P}}_{{\Omega}}\left(\mathcal{Y}\right)$ , which has attracted much attention in various applications [24, 40, 68].

3.2. Nonlocal model with nonconvex regularization

In this subsection, we propose an NRTRM model by jointly exploiting the low-rankness and nonlocal self-similarity of the underlying tensor and promoting the sparsity of the sparse corruptions for robust low-rank tensor recovery with different degradations.

The nonlocal self-similarity is a patch-based prior which means that one patch in a tensor has many similar patches. If one has many similar patches, then the resulting sub-tensor by stacking these similar patches together will be low-rank, which would be more effective for robust tensor recovery as a low rank requirement is preferred. By leveraging on the nonlocal self-similarity of the underlying tensor, a nonlocal approach is proposed for robust tensor recovery, where a lower rank sub-tensor can be formed in each group. The nonlocal techniques have been proposed and applied in image restoration successfully [13, 14], which are used to explore their similar patches and form a lower dimensional subspace.

Assume that an initial estimator $\tilde {\mathcal{X}}$ of $\mathcal{X}\in {\mathbb{R}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ is given. $\tilde {\mathcal{X}}$ is spatially partitioned into a group of overlapping cubes ${\tilde {\mathcal{X}}}_{ij}$ with size w × w × n₃, i = 1, 2, ..., n₁ − w + 1, j = 1, 2, ..., n₂ − w + 1, where the step size of sliding the cubes along the spatial dimension is one. Here ${\tilde {\mathcal{X}}}_{ij}$ denotes a cube of fixed size w × w × n₃ extracted from $\tilde {\mathcal{X}}$ , where (i, j) is the coordinate of the top-left corner of the first frontal slice of the cube. Now we choose ${\tilde {\mathcal{X}}}_{st}$ as the key patches, where

$\begin{align*}s& ={i}^{\text{rhook}}\left(w-1\right)+1,\left(lfloor \frac{{n}_{1}}{w-1}rfloor -1\right)\left(w-1\right)+2,\\ & \quad \enspace {\times}\left(lfloor \frac{{n}_{1}}{w-1}rfloor -1\right)\left(w-1\right)+3,\dots ,{n}_{1}-w+1,\\ t& ={j}^{\text{{rhook}}}\left(w-1\right)+1,\left(lfloor \frac{{n}_{2}}{w-1}rfloor -1\right)\left(w-1\right)+2,\\ & \quad \enspace {\times}\left(lfloor \frac{{n}_{2}}{w-1}rfloor -1\right)\left(w-1\right)+3,\dots ,{n}_{2}-w+1,\end{align*}$

and ${i}^{\prime }=0,1,2,\dots ,\lfloor \frac{{n}_{1}}{w-1}\rfloor -1,\enspace {j}^{\prime }=0,1,2,\dots ,\lfloor \frac{{n}_{2}}{w-1}\rfloor -1$ . Here ⌊x⌋ denotes the integer part of x. For each key patch ${\tilde {\mathcal{X}}}_{st}\in {\mathbb{R}}^{w{\times}w{\times}{n}_{3}}$ , we search its m nearest patches, which have the least Euclidean distances. Each patch is unfolded into a matrix (with size w² × n₃) along the third-dimension and these matrices are grouped together, which form a third-order sub-tensor with size w² × m × n₃. For the index set Ω and the observed tensor ${\mathcal{P}}_{{\Omega}}\left(\mathcal{Y}\right)$ , the spatial locations of each patch are the same as those of the initial estimator $\tilde {\mathcal{X}}$ , which form the subset Ω_l and the sub-tensor ${\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}\right)$ .

Note that the underlying sparse sub-tensor ${\mathcal{E}}_{l}^{\prime }$ cannot be recovered outside Ω_l for robust tensor recovery [49], see also [30] for robust matrix completion. Let ${\mathcal{E}}_{l}={\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{E}}_{l}^{\prime }\right)$ . Now we take into account the intrinsic structure of each sub-tensor. In order to utilize the low-rankness of the low-rank component and sparsity of the sparse component better in each group, the nonconvex functions are performed on the singular values of all frontal slices in the transformed domain for the low-rank component and each entry for the sparse component in each sub-tensor, respectively. Consequently, for the lth sub-tensor, we propose the NRTRM model as follows:

$\begin{equation}{\mathrm{min}}_{{\mathcal{X}}_{l},\enspace {\mathcal{E}}_{l}}\enspace {G}_{1}\left({\mathcal{X}}_{l}\right)+\lambda {G}_{2}\left({\mathcal{E}}_{l}\right)+\frac{\rho }{2}{\Vert}{\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}\right)-{\mathcal{E}}_{l}{{\Vert}}_{\mathrm{F}}^{2},\end{equation} \tag{ 1 }$

where λ, ρ > 0 are regularization parameters,

$\begin{equation}{G}_{1}\left({\mathcal{X}}_{l}\right){:=}\sum\limits _{i=1}^{{n}_{3}}\sum\limits _{j=1}^{\mathrm{min}\left\{{w}^{2},m\right\}}{G}_{1}\left({\sigma }_{j}\left({\hat{\mathbf{X}}}_{l}^{\left(i\right)}\right)\right),\qquad {G}_{2}\left({\mathcal{E}}_{l}\right){:=}\sum\limits _{i=1}^{{w}^{2}}\sum\limits _{j=1}^{m}\sum\limits _{k=1}^{{n}_{3}}{G}_{2}\left({\left({\mathcal{E}}_{l}\right)}_{ijk}\right),\end{equation} \tag{ 2 }$

and G₁, G₂ are nonconvex functions satisfying the following assumptions:

Assumption 1.

(a)
${G}_{i}:\mathbb{R}\to \mathbb{R}$ is proper, lower semicontinuous and symmetric with respect to y-axis;
(b)
G_i is concave and monotonically nondecreasing on [0, +∞) with G_i(0) = 0, i = 1, 2.

Now some nonconvex functions satisfying assumption 1 are listed as follows, where we only discuss the parameters of the nonconvex functions on [0, +∞) since they are symmetric with respect to y-axis.

Firm thresholding [18]: the firm penalty is given by
$\begin{equation}f\left(x\right)=\begin{cases}a\left(\vert x\vert -{x}^{2}/\left(2\gamma \right)\right),\hfill & \text{if}\enspace \vert x\vert {< }\gamma ,\hfill \\ a\gamma /2,\hfill & \text{if}\enspace \vert x\vert {\geqslant}\gamma ,\hfill \end{cases}\end{equation} \tag{ 3 }$
where a, γ > 0. The parameter γ mainly influences the domain of f to be a quadratic function or a constant function, where f is a quadratic function on [0, γ) and f is a constant function on [γ, +∞). a and γ control the curvature of the quadratic function in the range [0, γ) and the peak value of f. For fixed γ, if a is larger, f(x) is steeper on [0, γ) and its peak value is also larger. A similar situation occurs when a is fixed and γ varies. We visualize the curves of firm function with different a and γ in figure 1(a).
ℓ_q penalty [37]: the penalty function is given by f(x) = |x|^q, where 0 < q < 1. The parameter q mainly controls the level of concavity of x^q on [0, +∞). From figure 1(b), we can observe that if q approximates 1, f(x) approximates the absolute function |x| better. And if q approximates 0, f approximates 0 faster in the neighborhood of 0.
Smoothly clipped absolute deviation (SCAD) penalty [15]: the SCAD penalty is given by
$\begin{equation}f\left(x\right)=\begin{cases}\xi \vert x\vert ,\hfill & \text{if}\enspace \vert x\vert {< }\xi ,\hfill \\ \frac{2b\xi \vert x\vert -{x}^{2}-{\xi }^{2}}{2\left(b-1\right)},\hfill & \text{if}\enspace \xi {\leqslant}\vert x\vert {< }b\xi ,\hfill \\ \left(b+1\right){\xi }^{2}/2,\hfill & \text{if}\enspace \vert x\vert {\geqslant}b\xi ,\hfill \end{cases}\end{equation} \tag{ 4 }$
where b > 1, ξ > 0. f(x) is a linear function with ratio ξ on [0, ξ), and a quadratic function on [ξ, bξ). For x ∈ [bξ, +∞), f(x) is a constant function. The parameter ξ controls the ratio of the linear function and influences the domain of f to be a linear function. The parameter b controls the domain of f to be a quadratic function, and also influences the slope of the quadratic function on [ξ, bξ) and the peak value of f. From figure 1(c), we can see that for fixed ξ, if b is larger, so are the larger domain of the quadratic function and the peak value. For fixed b, if ξ is larger, so are the ratio of the linear function and the peak value.
Minimax concave penalty (MCP) [59]: the MCP function is defined as
$\begin{equation}f\left(x\right)=c{\int }_{0}^{\vert x\vert }\mathrm{max}\left\{1-t/\left(c\eta \right),0\right\}\mathrm{d}t=\begin{cases}c\vert x\vert -\frac{1}{2\eta }{x}^{2},\hfill & \text{if}\enspace \vert x\vert {\leqslant}c\eta ,\hfill \\ \frac{{c}^{2}\eta }{2},\hfill & \text{if}\enspace \vert x\vert { >}c\eta ,\hfill \end{cases}\end{equation} \tag{ 5 }$
where η, c > 0. The MCP function f is a quadratic function on [0, cη] and a constant function on (cη, +∞). The parameters c, η control the steepness of the quadratic function and level of concavity. And c, η also influence the domain of f to be a quadratic function or a constant function, see figure 1(d). For fixed c, if η is larger, so are the peak value of f and the domain of the quadratic function. This phenomenon also occurs when η is fixed and c varies.
Logarithmic function (log) [20]: the logarithmic function is given by
$\begin{equation}f\left(x\right)=\text{log}\left(1+\frac{\vert x\vert }{\theta }\right),\end{equation} \tag{ 6 }$
where θ > 0. f(x) is monotonically increasing and concave on [0, +∞). The parameter θ controls the level of concavity of f on [0, +∞). If θ is smaller, then f is steeper and approximates 0 faster, see figure 1(e).

**Figure 1.** Nonconvex functions (x ⩾ 0) with different parameters. (a) Firm. (b) ℓ_q. (c) SCAD. (d) MCP. (e) log.
Download figure:
Standard image High-resolution image

Recently, Song et al [49] proposed to employ the TTNN for the low-rank component, which can obtain lower multi-rank in the transformed domain than TNN in [67]. Notice that the TTNN is the ℓ₁ norm of the singular vectors of all frontal slices in the transformed domain [49]. And the tensor ℓ₁ norm is usually used to remove the sparse corruptions [19, 49]. However, the ℓ₁ norm penalty has long been known by statisticians to yield biased estimators and cannot achieve the best recovery performance [50, 70]. For example, Zou [70] showed that there exist certain scenarios where the ℓ₁ penalty is inconsistent for variable selection in the lasso problem. Thus some nonconvex penalties, which can produce less bias than the ℓ₁ norm penalty, were proposed and studied in various sparse or low-rank models, see [15, 16, 42, 63, 64] and references therein.

In order to overcome the disadvantages of TTNN and tensor ℓ₁ norm, we propose to employ a class of nonconvex functions onto the singular values of all frontal slices of the low-rank component in the transformed domain and all entries of the sparse component in each sub-tensor. On the one hand, the nonconvex function G₁ in (1) is used to get a lower transformed multi-rank sub-tensor. Compared with TTNN in [49], the nonconvex function G₁, which performs on the singular values of all frontal slices of the low-rank component in the transformed domain, can approximate the transformed multi-rank of the underlying sub-tensor better. On the other hand, the nonconvex function G₂, which performs on each entry of the underlying sparse corruptions sub-tensor, can be employed to generate a sparser solution compared with the tensor ℓ₁ norm.

After reconstructing each sub-tensor by (1), we fold all lateral slices of the recovered sub-tensors into third-order tensors (with size w × w × n₃) along the second-dimension, which form the recovered patches. Then the final tensor is obtained by aggregating all patches, where the weighted averages are used in the overlapping locations.

The proposed NRTRM method consists of the following three steps: first, the similar patches are extracted from an initial estimator to obtain the locations of these patches in each group. Second, we propose a nonconvex approach to recover each sub-tensor. Third, the final tensor is obtained by aggregating all patches of the recovered sub-tensors. The flowchart of the proposed NRTRM approach is presented in figure 2.

**Figure 2.** The flowchart of NRTRM for robust tensor recovery.
Download figure:
Standard image High-resolution image

Remark 2. From figure 1, it can be observed that the nonconvex functions with suitable parameters may approximate the two points 0–1 function (i.e., 0 if x = 0 and 1 otherwise) better than the absolute value function |x|, where the 0–1 function is just the ℓ₀ norm of a variable with one-dimension. Notice that TTNN and tensor ℓ₁ norm are the ℓ₁ norms of the singular values of all frontal slices of the tensor in the transformed domain and of each entry. Consequently, compared with TTNN and tensor ℓ₁ norm in each sub-tensor, the nonconvex functions onto the singular values of all frontal slices of the low-rank component in the transformed domain and each entry of the sparse component, may approximate the transformed multi-rank of the underlying low-rank component and the tensor ℓ₀ norm of the sparse corruptions component better, respectively.

Remark 3. Recently, Ng et al [41] showed the perturbation analysis of transformed tensor SVD for similar patches. In each group, the resulting sub-tensor is a very low-rank tensor plus a perturbation tensor. Therefore, the generated sub-tensor can approximate a low-rank tensor with very small errors, which can be a tubal rank-k tensor with small k.

Remark 4. Note that we need to solve model (1) in each group and all groups are independent. In our experiments, all groups are solved in parallel.

4. Proximal alternating linearized minimization algorithm for NRTRM

In this section, we develop a PALM algorithm [8] to solve the NRTRM model (1). Then the convergence of PALM is established under very mild conditions. Finally, a stopping criterion for the PALM algorithm is given.

4.1. PALM algorithm

The PALM is a powerful algorithm for solving a broad class of nonconvex and nonsmooth minimization problems, which is viewed as alternating minimization with a proximal linearization step for the differentiable function [8]. Let

$\begin{equation}H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right){:=}\frac{\rho }{2}{\Vert}{\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}\right)-{\mathcal{E}}_{l}{{\Vert}}_{\mathrm{F}}^{2},\end{equation} \tag{ 7 }$

and

$\begin{equation}{\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right){:=}{G}_{1}\left({\mathcal{X}}_{l}\right)+\lambda {G}_{2}\left({\mathcal{E}}_{l}\right)+H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right).\end{equation} \tag{ 8 }$

We consider to replace $H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ by its approximations. Note that

$\begin{equation*}{\nabla }_{{\mathcal{X}}_{l}}H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)=-\rho {\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}-{\mathcal{E}}_{l}\right)\end{equation*}$

and

$\begin{equation}{\nabla }_{{\mathcal{E}}_{l}}H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)=-\rho \left({\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}\right)-{\mathcal{E}}_{l}\right).\end{equation} \tag{ 9 }$

Then, by the proximal linearization of each subproblem, the PALM algorithm on the two blocks $\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ for solving (1) yields the iteration scheme alternatingly as follows:

$\begin{align}\hfill {\mathcal{X}}_{l}^{k+1}& \in \mathrm{arg}{\mathrm{min}}_{{\mathcal{X}}_{l}}\left\{{G}_{1}\left({\mathcal{X}}_{l}\right)-\rho \left\langle {\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k}-{\mathcal{E}}_{l}^{k}\right),{\mathcal{X}}_{l}-{\mathcal{X}}_{l}^{k}\right\rangle +\frac{{\alpha }^{k}}{2}{\Vert}{\mathcal{X}}_{l}-{\mathcal{X}}_{l}^{k}{{\Vert}}_{\mathrm{F}}^{2}\right\},\hfill \end{align} \tag{ 10 }$

$\begin{align}\hfill {\mathcal{E}}_{l}^{k+1}& \in \mathrm{arg}{\mathrm{min}}_{{\mathcal{E}}_{l}}\left\{\lambda {G}_{2}\left({\mathcal{E}}_{l}\right)-\rho \left\langle {\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k+1}\right)-{\mathcal{E}}_{l}^{k},{\mathcal{E}}_{l}-{\mathcal{E}}_{l}^{k}\right\rangle +\frac{{\beta }^{k}}{2}{\Vert}{\mathcal{E}}_{l}-{\mathcal{E}}_{l}^{k}{{\Vert}}_{\mathrm{F}}^{2}\right\},\hfill \end{align} \tag{ 11 }$

where α_k, β_k > ρ. The problems (10) and (11) can be reformulated equivalently as

$\begin{equation}{\mathcal{X}}_{l}^{k+1}\in \mathrm{arg}{\mathrm{min}}_{{\mathcal{X}}_{l}}\left\{{G}_{1}\left({\mathcal{X}}_{l}\right)+\frac{{\alpha }^{k}}{2}{{\Vert}{\mathcal{X}}_{l}-\left({\mathcal{X}}_{l}^{k}+\frac{\rho {\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k}-{\mathcal{E}}_{l}^{k}\right)}{{\alpha }^{k}}\right){\Vert}}_{\mathrm{F}}^{2}\right\},\end{equation} \tag{ 12 }$

$\begin{align}\hfill {\mathcal{E}}_{l}^{k+1}& \in \mathrm{arg}{\mathrm{min}}_{{\mathcal{E}}_{l}}\{\lambda {G}_{2}\left({\mathcal{E}}_{l}\right)+\frac{{\beta }^{k}}{2}{\Vert}{\mathcal{E}}_{l}\hfill \\ \hfill & {\quad -\left({\mathcal{E}}_{l}^{k}+\frac{\rho \left({\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k+1}\right)-{\mathcal{E}}_{l}^{k}\right)}{{\beta }^{k}}\right){\Vert}}_{\mathrm{F}}^{2}\}.\hfill \end{align} \tag{ 13 }$

Remark 5.

(a)
In order to guarantee the convergence of PALM, we need to show the sufficient descent property of the objective function ${\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ [8, lemma 3], that is,
$\begin{align*}\hfill & {\Psi}\left({\mathcal{X}}_{l}^{k+1},{\mathcal{E}}_{l}^{k+1}\right)+\frac{{\alpha }^{k}-\rho }{2}{\Vert}{\mathcal{X}}_{l}^{k+1}-{\mathcal{X}}_{l}^{k}{{\Vert}}_{\mathrm{F}}^{2}+\frac{{\beta }^{k}-\rho }{2}{\Vert}{\mathcal{E}}_{l}^{k+1}-{\mathcal{E}}_{l}^{k}{{\Vert}}_{\mathrm{F}}^{2}{\leqslant}{\Psi}\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right),\hfill \end{align*}$
where ${\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ is defined as (8), α^k, β^k > ρ. Therefore, in our experiments, we choose α^k = γ₁ ρ and β^k = γ₂ ρ with γ₁, γ₂ > 1. For γ₁, γ₂ > 1, we just choose γ₁ = 1.8 and γ₂ = 1.5 for simplicity since the recovery performance is stable as long as γ₁, γ₂ are not too large.
(b)
For the parameters λ, ρ in the model, we choose $\lambda =\theta /\sqrt{\text{SR}{n}_{3}\enspace \mathrm{max}\left\{{w}^{2},m\right\}}$ , where θ > 0 and SR is the sampling ratio (will be defined in section 5). Here the choice of λ is similar to [49], which could guarantee the exact recovery when θ = 1 for robust tensor completion with only sparse noise. Then, we tune θ slightly for different SRs and sparse noise levels to get good recovery performance. The recovery performance about different θ has been analyzed in figures 5(a) and (b) in section 5.2. It can be seen from the two figures that the recovery performance may have great differences for the same θ in different noise levels and SRs. Therefore, we will give a few concrete choices about θ for different cases in section 5.2. For the parameter ρ > 0, the recovery performance of its different choices has also been analyzed in figure 5(c) in section 5.2. We find that the recovery performance is stable when ρ is around (2.2, 3.2). For the sake of brevity, we choose ρ = 2.8 in all experiments.

Let $\tilde {n}=\mathrm{min}\left\{{w}^{2},m\right\}$ . Before giving the solution of problem (12), we need to present the proximal mapping of the following problem, which can be derived from [54, theorem 1].

Theorem 2. For any $\mathcal{Y}\in {\mathbb{R}}^{{w}^{2}{\times}m{\times}{n}_{3}}$ and μ > 0, a solution of the following minimization problem

$\begin{equation}{\mathrm{min}}_{\mathcal{X}}\enspace \mu {G}_{1}\left(\mathcal{X}\right)+\frac{1}{2}{\Vert}\mathcal{X}-\mathcal{Y}{{\Vert}}_{\mathrm{F}}^{2},\end{equation} \tag{ 14 }$

is given by

$\begin{equation}{\mathrm{Prox}}_{\mu {G}_{1}}\left(\mathcal{Y}\right){:=}\;\mathcal{U}{\diamond }_{\mathbf{U}}{{\Sigma}}_{\mu }{\diamond }_{\mathbf{U}}{\mathcal{V}}^{\mathrm{H}},\end{equation} \tag{ 15 }$

where $\mathcal{Y}=\mathcal{U}{\diamond }_{\mathbf{U}}{\Sigma}{\diamond }_{\mathbf{U}}{\mathcal{V}}^{\mathrm{H}}$ , ${{\Sigma}}_{\mu }={\mathbf{U}}^{\mathrm{H}}\left[{\hat{{\Sigma}}}_{\mu }\right]$ , and

$\begin{equation*}{\hat{{\Sigma}}}_{\mu }^{\left(i\right)}=\mathrm{D}\mathrm{i}\mathrm{a}\mathrm{g}\left({\mathrm{P}\mathrm{r}\mathrm{o}\mathrm{x}}_{\mu {G}_{1}}\left({\left({\hat{{\Sigma}}}^{\left(i\right)}\right)}_{1,1}\right),\dots ,{\mathrm{P}\mathrm{r}\mathrm{o}\mathrm{x}}_{\mu {G}_{1}}\left({\left({\hat{{\Sigma}}}^{\left(i\right)}\right)}_{\tilde {n},\tilde {n}}\right)\right)\end{equation*}$

with ${\left({\hat{{\Sigma}}}^{\left(i\right)}\right)}_{k,k}$ being the (k, k)th entry of ${\hat{{\Sigma}}}^{\left(i\right)}$ , i = 1, ..., n₃.

Proof. By the definition of ${G}_{1}\left(\mathcal{X}\right)$ in (2), problem (14) can be reformulated equivalently as

$\begin{equation}{\mathrm{min}}_{\mathcal{X}}\sum\limits _{i=1}^{{n}_{3}}\sum\limits _{j=1}\mu {G}_{1}\left({\sigma }_{j}\left({\hat{\mathbf{X}}}^{\left(i\right)}\right)\right)+\frac{1}{2}{\Vert}{\hat{\mathbf{X}}}^{\left(i\right)}-{\hat{\mathbf{Y}}}^{\left(i\right)}{{\Vert}}_{\text{F}}^{2}.\end{equation} \tag{ 16 }$

Since each frontal slice ${\hat{\mathbf{X}}}^{\left(i\right)}$ is separable in (16), by [54, theorem 1], the minimizer ${\hat{\mathbf{X}}}^{\left(i\right)}$ of (16) is given by

$\begin{equation*}{\hat{\mathbf{X}}}^{\left(i\right)}={\hat{\mathbf{U}}}^{\left(i\right)}\text{Diag}\left({\text{Prox}}_{\mu {G}_{1}}\left({\left({\hat{{\Sigma}}}^{\left(i\right)}\right)}_{1,1}\right),\dots ,{\text{Prox}}_{\mu {G}_{1}}\left({\left({\hat{{\Sigma}}}^{\left(i\right)}\right)}_{\tilde {n},\tilde {n}}\right)\right){\left({\hat{\mathbf{V}}}^{\left(i\right)}\right)}^{\mathrm{H}},\end{equation*}$

where ${\hat{\mathbf{Y}}}^{\left(i\right)}={\hat{\mathbf{U}}}^{\left(i\right)}{\hat{{\Sigma}}}^{\left(i\right)}{\left({\hat{\mathbf{V}}}^{\left(i\right)}\right)}^{\mathrm{H}},\enspace i=1,\dots ,{n}_{3}$ , and ${\left({\hat{{\Sigma}}}^{\left(i\right)}\right)}_{k,k}$ denotes the (k, k)th entry of ${\hat{{\Sigma}}}^{\left(i\right)}$ . By using the inverse unitary transform (i.e., U^H) along the third-dimension of $\hat{\mathcal{X}}$ , we can obtain that a solution of (14) is given by (15). □

Next we give the explicit formulations in (12) and (13). From theorem 2, the solution of (12) is given by

$\begin{equation}{\mathcal{X}}_{l}^{k+1}=\mathcal{U}{\diamond }_{\mathbf{U}}{{\Sigma}}_{1/{\alpha }^{k}}{\diamond }_{\mathbf{U}}{\mathcal{V}}^{\mathrm{H}},\end{equation} \tag{ 17 }$

where ${\mathcal{X}}_{l}^{k}+\frac{\rho {\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k}-{\mathcal{E}}_{l}^{k}\right)}{{\alpha }^{k}}=\mathcal{U}{\diamond }_{\mathbf{U}}{\Sigma}{\diamond }_{\mathbf{U}}{\mathcal{V}}^{\mathrm{H}}$ and ${{\Sigma}}_{1/{\alpha }^{k}}={\mathbf{U}}^{\mathrm{H}}\left[{\hat{{\Sigma}}}_{1/{\alpha }^{k}}\right]$ with

$\begin{equation}{\hat{{\Sigma}}}_{1/{\alpha }^{k}}^{\left(i\right)}=\text{Diag}\left({\text{Prox}}_{\frac{1}{{\alpha }^{k}}{G}_{1}}\left({\left({\hat{{\Sigma}}}^{\left(i\right)}\right)}_{1,1}\right),\dots ,{\text{Prox}}_{\frac{1}{{\alpha }^{k}}{G}_{1}}\left({\left({\hat{{\Sigma}}}^{\left(i\right)}\right)}_{\tilde {n},\tilde {n}}\right)\right).\end{equation} \tag{ 18 }$

The solution of problem (13) is the proximal mapping of $\frac{\lambda }{{\beta }^{k}}{G}_{2}$ at each entry, which can be written as follows:

$\begin{equation}{\left({\mathcal{E}}_{l}^{k+1}\right)}_{ijk}={\text{Prox}}_{\frac{\lambda }{{\beta }^{k}}{G}_{2}}\left({\left({\mathcal{E}}_{l}^{k}+\frac{\rho \left({\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k+1}\right)-{\mathcal{E}}_{l}^{k}\right)}{{\beta }^{k}}\right)}_{ijk}\right),\end{equation} \tag{ 19 }$

where i = 1, ..., w², j = 1, ..., m, k = 1, ..., n₃.

Now the PALM algorithm for solving (1) can be stated in algorithm 1.

Algorithm 1. PALM algorithm for solving (1).

Input. α₀ = γ₁ ρ, β₀ = γ₂ ρ with γ₁, γ₂ > 1, ${\mathcal{X}}_{l}^{0},{\mathcal{E}}_{l}^{0}$ ${\mathcal{X}}_{l}^{0},{\mathcal{E}}_{l}^{0}$ . For k = 0, 1, 2, ..., perform the following steps:

Step 1. Compute ${\mathcal{X}}_{l}^{k+1}$ ${\mathcal{X}}_{l}^{k+1}$ via (17).

Step 2. Compute ${\mathcal{E}}_{l}^{k+1}$ ${\mathcal{E}}_{l}^{k+1}$ by (19).

Step 3. If a stopping criterion is not met, set k := k + 1 and go to step 1.

For the explicit implementation in algorithm 1, the proximal mappings of G_i should be solved inexpensively, i = 1, 2. We summarize the proximal mappings of the nonconvex functions in section 3.2 as follows, see also [15, 18, 20, 37, 57].

Firm thresholding: the proximal mapping of μf (μ > 0) in (3) is given by [18]
$\begin{equation*}{\text{Prox}}_{\mu f}\left(y\right)=\begin{cases}0,\hfill & \text{if}\enspace \vert y\vert {\leqslant}a\mu ,\hfill \\ \frac{\text{sign}\left(y\right)\left(\vert y\vert -a\mu \right)\gamma }{\gamma -a\mu },\hfill & \text{if}\enspace a\mu {< }\vert y\vert {\leqslant}\gamma ,\hfill \\ y,\hfill & \text{if}\enspace \vert y\vert { >}\gamma ,\hfill \end{cases}\end{equation*}$
where γ > aμ and sign(x) denotes the signum function of x, i.e., sign(x) = 1 if x > 0, sign(x) = 0 if x = 0, and sign(x) = −1 if x < 0.
ℓ_q penalty: the proximal mapping of μf is given by [37]
$\begin{equation*}{\text{Prox}}_{\mu f}\left(y\right)=\begin{cases}0,\hfill & \text{if}\enspace \vert y\vert {< }{\varphi }_{2},\hfill \\ \left\{0,\text{sign}\left(y\right){\varphi }_{1}\right\},\hfill & \text{if}\enspace \vert y\vert ={\varphi }_{2},\hfill \\ \text{sign}\left(y\right){z}^{{\ast}},\hfill & \text{if}\enspace \vert y\vert { >}{\varphi }_{2},\hfill \end{cases}\end{equation*}$
where φ₁ = [2μ(1 − q)]^1/(2−q), ${\varphi }_{2}={\varphi }_{1}+\mu q{\varphi }_{1}^{q-1}$ , z* ∈ (φ₁, |y|) is the solution of the equation h(z) = μqz^q−1 + z −|y| = 0 with z > 0.
SCAD: for any 0 < μ < b − 1, the proximal mapping of μf in (4) is given by [15]
$\begin{equation*}{\text{Prox}}_{\mu f}\left(y\right)=\begin{cases}\text{sign}\left(y\right)\text{max}\left\{\vert y\vert -\xi \mu ,0\right\},\hfill & \text{if}\enspace \vert y\vert {\leqslant}\enspace \left(1+\mu \right)\xi ,\hfill \\ \frac{\left(b-1\right)y-\text{sign}\left(y\right)b\mu \xi }{b-1-\mu },\hfill & \text{if}\enspace \left(1+\mu \right)\xi {\leqslant}\vert y\vert {< }b\xi ,\hfill \\ y,\hfill & \text{if}\enspace \vert y\vert {\geqslant}b\xi .\hfill \end{cases}\end{equation*}$
MCP: for any 0 < μ < η, the proximal mapping of μf in (5) is given by [57, 59]
$\begin{equation*}{\text{Prox}}_{\mu f}\left(y\right)=\begin{cases}0,\hfill & \text{if}\enspace \vert y\vert {\leqslant}\enspace c\mu ,\hfill \\ \frac{\text{sign}\left(y\right)\left(\vert y\vert -c\mu \right)}{1-\mu /\eta },\hfill & \text{if}\enspace c\mu {< }\vert y\vert {\leqslant}c\eta ,\hfill \\ y,\hfill & \text{if}\enspace \vert y\vert { >}c\eta .\hfill \end{cases}\end{equation*}$
Logarithmic function: the proximal mapping of μf (μ > 0) in (6) is given by [20]
$\begin{equation*}{\text{Prox}}_{\mu f}\left(y\right)=\text{sign}\left(y\right)z,\end{equation*}$
where z is an optimal solution of the problem $z=\mathrm{arg}{\mathrm{min}}_{x\in \mathfrak{C}}\left\{\mu f\left(x\right)+\frac{1}{2}{\left(x-y\right)}^{2}\right\},$ and $\mathfrak{C}$ is a set composed of 3 elements or 1 element. If a := (|y|−θ)² − 4(μ − θ|y|) ⩾ 0,
$\begin{equation*}\mathfrak{C}=\left\{0,\mathrm{max}\left\{\frac{\vert y\vert -\theta +\sqrt{a}}{2},0\right\},\mathrm{max}\left\{\frac{\vert y\vert -\theta -\sqrt{a}}{2},0\right\}\right\},\end{equation*}$
otherwise, $\mathfrak{C}=\left\{0\right\}$ .

Remark 6. For the firm thresholding, the parameter γ > aμ is to guarantee the meaningful solution in the proximal mapping, where a > 0 is used to control the steepness of f on [0, γ). For the ℓ_q penalty, we just need 0 < q < 1 to guarantee the concavity of x^q for x ⩾ 0. For the SCAD penalty, we need ξ > 0 and b > μ + 1. ξ > 0 is used to guarantee that f is increasing and concave on [0, bξ). And b > μ + 1 is to guarantee the meaningful solution in the proximal mapping about the SCAD function. For MCP, we need c > 0 and η > μ. c > 0 is to guarantee the concavity of the quadratic function on [0, cη] and η > μ is to guarantee the meaningful solution in the proximal mapping about the MCP function. For the logarithmic function, we just need θ > 0 to guarantee the concavity of this function on [0, +∞). We will give the concrete values of these parameters in our testing cases in section 5.1. In particular, we mainly use the MCP function in our experiments, and will discuss the two parameters c, η of MCP in detail in section 5.2.

Now we consider the computational cost of transformed tensor SVD, which is the main cost of PALM at each iteration. Notice that the application of a unitary transformation to an n₃-vector is of $O\left({n}_{3}^{2}\right)$ operations. There are n₃ w²-by-m SVDs to be calculated in the transformed tensor SVD. Therefore, the total cost of transformed tensor SVD is of $O\left({n}_{\left(1\right)}{n}_{\left(2\right)}^{2}{n}_{3}+{w}^{2}m{n}_{3}^{2}\right)$ operations, where n₍₁₎ = max{w², m} and n₍₂₎ = min{w², m}. In particular, if G₁, G₂ are firm thresholding, or SCAD, or MCP, the complexities of computing ${\mathcal{E}}_{l}^{k+1}$ and ${\mathcal{L}}_{l}^{k+1}$ are O(w² mn₃) and $O\left({n}_{\left(1\right)}{n}_{\left(2\right)}^{2}{n}_{3}+{w}^{2}m{n}_{3}^{2}\right)$ , respectively. Then the total cost of PALM at each iteration is $O\left({n}_{\left(1\right)}{n}_{\left(2\right)}^{2}{n}_{3}+{w}^{2}m{n}_{3}^{2}\right)$ .

4.2. Convergence analysis

In this subsection, the convergence of PALM is established under some mild conditions, which is mainly based on the framework in [8, theorem 1].

Theorem 3. Assume that assumption 1 holds. Suppose that α^k = γ₁ ρ and β^k = γ₂ ρ with γ₁, γ₂ > 1. Let the sequence $\left\{\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\right\}$ be generated by algorithm 1. Then,

(a)
any accumulation point of the sequence $\left\{\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\right\}$ is a critical point of (1).
(b)
if G₁, G₂ are KL functions and G₁ is coercive, the sequence $\left\{\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\right\}$ converges to a critical point of (1).

Proof. By the definition of $H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ in (7), and its gradients ${\nabla }_{{\mathcal{X}}_{l}}H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right),{\nabla }_{{\mathcal{E}}_{l}}H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ in (9), for any fixed ${\mathcal{E}}_{l}$ , we obtain that

$\begin{align}\hfill & \enspace {\Vert}{\nabla }_{{\mathcal{X}}_{l}}H\left({\left({\mathcal{X}}_{l}\right)}_{1},{\mathcal{E}}_{l}\right)-{\nabla }_{{\mathcal{X}}_{l}}H\left({\left({\mathcal{X}}_{l}\right)}_{2},{\mathcal{E}}_{l}\right){{\Vert}}_{\mathrm{F}}\hfill \\ \hfill & \qquad =\rho {\Vert}{\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\left({\mathcal{X}}_{l}\right)}_{1}-{\mathcal{E}}_{l}\right)-{\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\left({\mathcal{X}}_{l}\right)}_{2}-{\mathcal{E}}_{l}\right){{\Vert}}_{\mathrm{F}}\hfill \\ \hfill & \qquad =\rho {\Vert}{\mathcal{P}}_{{{\Omega}}_{l}}\left({\left({\mathcal{X}}_{l}\right)}_{1}-{\left({\mathcal{X}}_{l}\right)}_{2}\right){{\Vert}}_{\mathrm{F}}\hfill \\ \hfill & \qquad {\leqslant}\rho {\Vert}{\left({\mathcal{X}}_{l}\right)}_{1}-{\left({\mathcal{X}}_{l}\right)}_{2}{{\Vert}}_{\mathrm{F}},\hfill \end{align} \tag{ 20 }$

and for any fixed ${\mathcal{X}}_{l}$ ,

$\begin{align}\hfill & \enspace {\Vert}{\nabla }_{{\mathcal{E}}_{l}}H\left({\mathcal{X}}_{l},{\left({\mathcal{E}}_{l}\right)}_{1}\right)-{\nabla }_{{\mathcal{E}}_{l}}H\left({\mathcal{X}}_{l},{\left({\mathcal{E}}_{l}\right)}_{2}\right){{\Vert}}_{\mathrm{F}}\hfill \\ \hfill & \qquad =\rho {\Vert}{\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}\right)-{\left({\mathcal{E}}_{l}\right)}_{1}-{\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}\right)+{\left({\mathcal{E}}_{l}\right)}_{2}{{\Vert}}_{\mathrm{F}}\hfill \\ \hfill & \qquad =\rho {\Vert}{\left({\mathcal{E}}_{l}\right)}_{1}-{\left({\mathcal{E}}_{l}\right)}_{2}{{\Vert}}_{\mathrm{F}}.\hfill \end{align} \tag{ 21 }$

(20) and (21) imply that the gradient of $H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ is Lipschitz continuous block-wise. Note that $H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ is twice continuously differentiable, which yields that $\nabla H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ is Lipschitz continuous on bounded subsets of ${\mathbb{R}}^{{w}^{2}{\times}m{\times}{n}_{3}}{\times}{\mathbb{R}}^{{w}^{2}{\times}m{\times}{n}_{3}}$ [8, remark 3]. It follows from [8, lemma 3] that

$\begin{equation}{\Psi}\left({\mathcal{X}}_{l}^{k+1},{\mathcal{E}}_{l}^{k+1}\right)+\frac{\zeta }{2}\left({\Vert}{\mathcal{X}}_{l}^{k+1}-{\mathcal{X}}_{l}^{k}{{\Vert}}_{\mathrm{F}}^{2}+{\Vert}{\mathcal{E}}_{l}^{k+1}-{\mathcal{E}}_{l}^{k}{{\Vert}}_{\mathrm{F}}^{2}\right){\leqslant}{\Psi}\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right),\end{equation} \tag{ 22 }$

where ${\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ is defined as (8), ζ:= min{(γ₁ − 1)ρ, (γ₂ − 1)ρ}, and

$\begin{equation}{\mathrm{lim}}_{k\to +\infty }{\Vert}{\mathcal{X}}_{l}^{k+1}-{\mathcal{X}}_{l}^{k}{{\Vert}}_{\mathrm{F}}={\mathrm{lim}}_{k\to +\infty }{\Vert}{\mathcal{E}}_{l}^{k+1}-{\mathcal{E}}_{l}^{k}{{\Vert}}_{\mathrm{F}}=0.\end{equation} \tag{ 23 }$

(a)
Assume that there exists a subsequence $\left\{\left({\mathcal{X}}_{l}^{{k}_{j}},{\mathcal{E}}_{l}^{{k}_{j}}\right)\right\}$ of $\left\{\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\right\}$ such that $\left\{\left({\mathcal{X}}_{l}^{{k}_{j}},{\mathcal{E}}_{l}^{{k}_{j}}\right)\right\}$ converges to $\left({\mathcal{X}}_{l}^{{\ast}},{\mathcal{E}}_{l}^{{\ast}}\right)$ as j → +∞. By (23), we have that $\left\{\left({\mathcal{X}}_{l}^{{k}_{j}+1},{\mathcal{E}}_{l}^{{k}_{j}+1}\right)\right\}$ also converges to $\left({\mathcal{X}}_{l}^{{\ast}},{\mathcal{E}}_{l}^{{\ast}}\right)$ as j → +∞. Moreover, the optimality conditions of (10) and (11) yield that
$\begin{equation*}0\in \partial {G}_{1}\left({\mathcal{X}}_{l}^{{k}_{j}+1}\right)-\rho {\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{{k}_{j}}-{\mathcal{E}}_{l}^{{k}_{j}}\right)+{\alpha }^{{k}_{j}}\left({\mathcal{X}}_{l}^{{k}_{j}+1}-{\mathcal{X}}_{l}^{{k}_{j}}\right)\end{equation*}$
and
$\begin{equation*}0\in \partial \left(\lambda {G}_{2}\left({\mathcal{E}}_{l}^{{k}_{j}+1}\right)\right)-\rho \left({\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{{k}_{j}+1}\right)-{\mathcal{E}}_{l}^{{k}_{j}}\right)+{\beta }^{{k}_{j}}\left({\mathcal{E}}_{l}^{{k}_{j}+1}-{\mathcal{E}}_{l}^{{k}_{j}}\right).\end{equation*}$
When j → +∞, by [12, proposition 2.1.5], we get that
$\begin{equation*}0\in \partial {G}_{1}\left({\mathcal{X}}_{l}^{{\ast}}\right)-\rho {\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{{\ast}}-{\mathcal{E}}_{l}^{{\ast}}\right)=\partial {G}_{1}\left({\mathcal{X}}_{l}^{{\ast}}\right)+{\nabla }_{{\mathcal{X}}_{l}^{{\ast}}}H\left({\mathcal{X}}_{l}^{{\ast}},{\mathcal{E}}_{l}^{{\ast}}\right)\end{equation*}$
and
$\begin{equation*}0\in \partial \left(\lambda {G}_{2}\left({\mathcal{E}}_{l}^{{\ast}}\right)\right)-\rho \left({\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{{\ast}}\right)-{\mathcal{E}}_{l}^{{\ast}}\right)=\partial \left(\lambda {G}_{2}\left({\mathcal{E}}_{l}^{{\ast}}\right)\right)+{\nabla }_{{\mathcal{E}}_{l}^{{\ast}}}H\left({\mathcal{X}}_{l}^{{\ast}},{\mathcal{E}}_{l}^{{\ast}}\right).\end{equation*}$
Note that $\partial {\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)={\partial }_{{\mathcal{X}}_{l}}{\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right){\times}{\partial }_{{\mathcal{E}}_{l}}{\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)=\left\{\partial {G}_{1}\left({\mathcal{X}}_{l}\right)+{\nabla }_{{\mathcal{X}}_{l}}H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)\right\}{\times}\left\{\partial \left(\lambda {G}_{2}\left({\mathcal{E}}_{l}\right)\right)+{\nabla }_{{\mathcal{E}}_{l}}H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)\right\}$ [3, proposition 2.1]. Therefore, we have that
$\begin{equation*}\left(0,0\right)\in \partial {\Psi}\left({\mathcal{X}}_{l}^{{\ast}},{\mathcal{E}}_{l}^{{\ast}}\right),\end{equation*}$
which implies that $\left({\mathcal{X}}_{l}^{{\ast}},{\mathcal{E}}_{l}^{{\ast}}\right)$ is a critical point of ${\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ .
(b)
Since G₁ is coercive, we obtain that ${G}_{1}\left({\mathcal{X}}_{l}\right)\to +\infty$ as ${\Vert}{\mathcal{X}}_{l}{{\Vert}}_{\mathrm{F}}\to +\infty$ . By the definition of ${\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ , we have that ${\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)\to +\infty$ as ${\Vert}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right){{\Vert}}_{\mathrm{F}}\to +\infty$ . Suppose that $\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)$ is unbounded, i.e., ${\Vert}\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right){{\Vert}}_{\mathrm{F}}\to +\infty$ , we obtain that ${\Psi}\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\to +\infty$ since ${\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ is coercive. However, it follows from (22) that ${\Psi}\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)$ is upper bounded. Therefore, the sequence $\left\{\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\right\}$ is bounded. Note that $H\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ is a KL function. Since G₁ and G₂ are KL functions, we have that ${\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ is also a KL function [8]. Then by [8, theorem 1], we obtain that the sequence $\left\{\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\right\}$ converges to a critical point of (1). □

Remark 7. The assumptions in theorem 3(b) are easy to satisfy. The ℓ_q penalty (0 < q < 1) with q being rational is a KL function [4, 8]. The SCAD, MCP, and logarithmic function are KL functions [43, 63, 64]. Note that the firm thresholding is a polynomial function, which is a KL function [8]. More details about KL functions can also be referred to [3, 7, 8]. Moreover, the nonconvex functions ℓ_q penalty (0 < q < 1) and logarithmic function are coercive.

Suppose that the assumptions of theorem 3 hold. Let the sequence $\left\{\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\right\}$ be generated by algorithm 1 and $\left({\mathcal{X}}_{l}^{{\ast}},{\mathcal{E}}_{l}^{{\ast}}\right)$ be the limit point of $\left\{\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\right\}$ . Assume that ${\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ has the KL property at $\left({\mathcal{X}}_{l}^{{\ast}},{\mathcal{E}}_{l}^{{\ast}}\right)$ with $\varphi \left(s\right)=c{s}^{1-\tilde {\theta }}$ , $c{ >}0,\tilde {\theta }\in \left[0,1\right)$ , where ${\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ is defined as (8). It follows from [8, section 3.5] that, (1) if $\tilde {\theta }=0$ , then the sequence $\left\{\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\right\}$ converges in a finite number of steps; (2) If $\tilde {\theta }\in \left(0,\frac{1}{2}\right]$ , then the sequence $\left\{\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\right\}$ converges R-linearly, that is, there exist C₁ > 0 and $\tilde {\tau }\in \left[0,1\right)$ such that ${\Vert}\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)-\left({\mathcal{X}}_{l}^{{\ast}},{\mathcal{E}}_{l}^{{\ast}}\right){{\Vert}}_{\mathrm{F}}{\leqslant}{C}_{1}{\tilde {\tau }}^{k};$ (3) If $\tilde {\theta }\in \left(\frac{1}{2},1\right)$ , then the sequence $\left\{\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)\right\}$ converges R-sublinearly, that is, there exists C₂ > 0 such that ${\Vert}\left({\mathcal{X}}_{l}^{k},{\mathcal{E}}_{l}^{k}\right)-\left({\mathcal{X}}_{l}^{{\ast}},{\mathcal{E}}_{l}^{{\ast}}\right){{\Vert}}_{\mathrm{F}}{\leqslant}{C}_{2}{k}^{-\frac{1-\tilde {\theta }}{2\tilde {\theta }-1}}.$

Remark 8. We recall that all proper closed semialgebraic functions satisfy the KL property at all points in dom(∂Ψ) with $\varphi \left(s\right)=c{s}^{1-\tilde {\theta }}$ , for some c > 0 and $\tilde {\theta }\in \left[0,1\right)$ [3, section 4.3] and [7]. In particular, if G₁ and G₂ are SCAD, or MCP, or ℓ_q with 0 < q < 1 being rational, then Ψ is a proper closed semialgebraic function [8].

4.3. A stopping criterion for the PALM algorithm

For the unconstrained and nonconvex minimization problem (1), a suitable stopping criterion is that $\pi \left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right):=\;\mathrm{dist}\left(0,\partial {\Psi}\left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)\right)$ is sufficiently small. However, it is difficult to compute $\pi \left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ directly in practice. In the implementation, we compute an upper bound of $\pi \left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ . If the upper bound of $\pi \left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ is small, so is $\pi \left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ . One can actually get a good upper bound on $\pi \left({\mathcal{X}}_{l},{\mathcal{E}}_{l}\right)$ without incurring extra computational cost when running the PALM algorithm, which is given as follows.

At the (k + 1)th iteration, by the optimality conditions of (10) and (11), we obtain that

$\begin{equation}0\in \partial {G}_{1}\left({\mathcal{X}}_{l}^{k+1}\right)-\rho {\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k}-{\mathcal{E}}_{l}^{k}\right)+{\alpha }^{k}\left({\mathcal{X}}_{l}^{k+1}-{\mathcal{X}}_{l}^{k}\right)\end{equation} \tag{ 24 }$

and

$\begin{equation}0\in \partial \left(\lambda {G}_{2}\left({\mathcal{E}}_{l}^{k+1}\right)\right)-\rho \left({\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k+1}\right)-{\mathcal{E}}_{l}^{k}\right)+{\beta }^{k}\left({\mathcal{E}}_{l}^{k+1}-{\mathcal{E}}_{l}^{k}\right).\end{equation} \tag{ 25 }$

Let

$\begin{equation*}\begin{aligned}\hfill {\mathcal{A}}_{l}^{k+1}& {:=}\rho {\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k}-{\mathcal{E}}_{l}^{k}\right)-{\alpha }^{k}\left({\mathcal{X}}_{l}^{k+1}-{\mathcal{X}}_{l}^{k}\right)-\rho {\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k+1}-{\mathcal{E}}_{l}^{k+1}\right)\hfill \\ \hfill & =\rho {\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{X}}_{l}^{k+1}-{\mathcal{X}}_{l}^{k}+{\mathcal{E}}_{l}^{k+1}-{\mathcal{E}}_{l}^{k}\right)-{\alpha }^{k}\left({\mathcal{X}}_{l}^{k+1}-{\mathcal{X}}_{l}^{k}\right)\hfill \end{aligned}\end{equation*}$

and

$\begin{align*}\hfill {\mathcal{B}}_{l}^{k+1}& {:=}\rho \left({\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k+1}\right)-{\mathcal{E}}_{l}^{k}\right)-{\beta }^{k}\left({\mathcal{E}}_{l}^{k+1}-{\mathcal{E}}_{l}^{k}\right)\hfill \\ \hfill & \quad -\rho \left({\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}-{\mathcal{X}}_{l}^{k+1}\right)-{\mathcal{E}}_{l}^{k+1}\right)\hfill \\ \hfill & =\left(\rho -{\beta }^{k}\right)\left({\mathcal{E}}_{l}^{k+1}-{\mathcal{E}}_{l}^{k}\right).\hfill \end{align*}$

It follows from (24), (25), and [3, Proposition 2.1] that $\left({\mathcal{A}}_{l}^{k+1},{\mathcal{B}}_{l}^{k+1}\right)\in \partial {\Psi}\left({\mathcal{X}}_{l}^{k+1},{\mathcal{E}}_{l}^{k+1}\right)$ . Therefore, an upper bound of $\pi \left({\mathcal{X}}_{l}^{k+1},{\mathcal{E}}_{l}^{k+1}\right)$ is given as follows:

$\begin{equation}\pi \left({\mathcal{X}}_{l}^{k+1},{\mathcal{E}}_{l}^{k+1}\right){\leqslant}{\Vert}\left({\mathcal{A}}_{l}^{k+1},{\mathcal{B}}_{l}^{k+1}\right){{\Vert}}_{\mathrm{F}}.\end{equation} \tag{ 26 }$

From (26), the stopping condition for the PALM algorithm in our numerical experiments is derived as follows:

$\begin{equation}\varrho {:=}\frac{{\Vert}{\mathcal{A}}_{l}^{k+1}{{\Vert}}_{\mathrm{F}}+{\Vert}{\mathcal{B}}_{l}^{k+1}{{\Vert}}_{\mathrm{F}}}{1+{\Vert}{\mathcal{P}}_{{{\Omega}}_{l}}\left({\mathcal{Y}}_{l}\right){{\Vert}}_{\mathrm{F}}+{\Vert}{\mathcal{X}}_{l}^{k+1}{{\Vert}}_{\mathrm{F}}+{\Vert}{\mathcal{E}}_{l}^{k+1}{{\Vert}}_{\mathrm{F}}}{\leqslant}\text{tol},\end{equation} \tag{ 27 }$

where tol is a moderately small number.

5. Numerical results

In this section, numerical experiments on multispectral images and video datasets are presented to demonstrate the effectiveness of NRTRM. We compare the proposed NRTRM model with the singleton high-order robust principal component analysis (SHRPCA)⁵ [19], TNN minimization method [26], the spatio-spectral total variation (SSTV)⁶ [1], and spatial-spectral total variation regularized local low-rank matrix recovery (LLRTV)⁷ [23]. Since the performance of SSTV and LLRTV is not good for robust tensor recovery with missing entries, especially for low SRs, the TNN is employed to obtain an initial estimator and then the SSTV and LLRTV are used for the intermediate results, which are also called SSTV and LLRTV for simplicity. All methods are conducted by MATLAB in a PC (Intel Core i5, @ 2.40GHz, 8G RAM).

We defined the SR as $\text{SR}=\frac{\vert {\Omega}\vert }{{n}_{1}{n}_{2}{n}_{3}}$ for an n₁ × n₂ × n₃ tensor, where Ω is generated uniformly at random and |Ω| represents the cardinality of Ω. In our experiments, the observed tensor is constructed as follows: first, we add the salt-and-pepper impulse noise to the underlying tensor with ratio τ and then add zero-mean Gaussian noise with variance σ. Finally, the observed tensor ${\mathcal{P}}_{{\Omega}}\left(\mathcal{Y}\right)$ is generated by the given SR.

For the unitary transformation in transformed tensor SVD and TTNN of each group, we use the data-dependent transformation [49], which can generate a lower transformed multi-rank tensor than that of fast Fourier transform and discrete cosine transform, see also [41]. First, we generate an initial estimator $\tilde {\mathcal{X}}$ by the transformed tensor SVD method based on discrete cosine transform, which is also the initial estimator for searching the locations of similar patches. Then we unfold each sub-tensor ${\tilde {\mathcal{X}}}_{l}$ into a matrix ${\tilde {\mathbf{X}}}_{l}$ along the third-dimension. Let the SVD of ${\tilde {\mathbf{X}}}_{l}$ be ${\tilde {\mathbf{X}}}_{l}={\mathbf{U}}_{l}{\mathbf{\Sigma }}_{l}{\mathbf{V}}_{l}^{\mathrm{H}}$ . Then ${\mathbf{U}}_{l}^{\mathrm{H}}$ is the desired unitary matrix in transformed tensor SVD and TTNN for the lth group. If the unfolding matrix ${\tilde {\mathbf{X}}}_{l}$ of the underlying sub-tenor is low-rank, then ${\mathbf{U}}_{l}^{\mathrm{H}}{\tilde {\mathbf{X}}}_{l}$ is also a low rank matrix, which implies that many frontal slices of the underlying sub-tensor are zero matrices in the transformed domain. This means that its transformed multi-rank may be very low in each group. We also refer the reader to [49] for a detailed discussion about the choice of the unitary matrix in transformed tensor SVD and TTNN.

To evaluate the performance of different methods quantitatively, the peak signal-to-noise ratio (PSNR) is utilized to give a quantitative assessment, which is defined as

$\begin{equation*}\text{PSNR}=10\enspace {\mathrm{log}}_{10}\frac{{n}_{1}{n}_{2}{n}_{3}{\left({\mathrm{max}}_{i,j,k}\enspace {\mathcal{X}}_{ijk}-{\mathrm{min}}_{i,j,k}\enspace {\mathcal{X}}_{ijk}\right)}^{2}}{{\Vert}{\mathcal{X}}^{\prime }-\mathcal{X}{{\Vert}}_{\mathrm{F}}^{2}},\end{equation*}$

where ${\mathcal{X}}^{\prime }$ and $\mathcal{X}\in {\mathbb{R}}^{{n}_{1}{\times}{n}_{2}{\times}{n}_{3}}$ stand for the recovered tensor and the original tensor, respectively. The structural similarity (SSIM) index [53] is also adopted to measure the quality of the recovered results. For the lth group in NRTRM, the PALM algorithm is terminated when ϱ ⩽ tol = 5 × 10⁻³ in (27) or the maximum number of 300 iterations is reached.

For the proposed NRTRM model, not only the nonlocal self-similarity prior information is used, but also nonconvex techniques are utilized to enhance the low-rankness and sparsity. An example is given to show which (nonlocal self-similarity prior information or nonconvex techniques) will yield better results for robust tensor recovery. The testing image is the toy dataset (see in the subsequent subsection) with different SRs, where σ = 0.01, and τ = 0.3. In order to show the superiority of nonconvex techniques and nonlocal self-similarity, we also display the method composed of tensor Frobenius norm, TTNN, and tensor ℓ₁ norm (TTNN+ℓ₁ for short), where the TTNN and tensor ℓ₁ norm are employed for the low-rank component and sparse component, respectively. For the nonconvex method without nonlocal self-similarity prior information (nonconvex for short), the nonconvex functions are performed on the singular values of all frontal slices of the tensor in the transformed domain for the low-rank component and all entries for the sparse component, respectively. Here the MCP is used for both nonconvex and NRTRM. Moreover, for the nonlocal self-similarity prior information method (nonlocal for short), we use the nonlocal self-similarity technique for the TTNN+ℓ₁ method, which employs TTNN and tensor ℓ₁ norm instead of nonconvex functions in NRTRM.

Now we compare NRTRM with SHRPCA, TNN, SSTV, LLRTV, TTNN+ℓ₁, nonconvex, and nonlocal. From table 1, we can see that the nonconvex and nonlocal outperform TTNN+ℓ₁ in terms of PSNR, which implies that the nonlocal self-similarity prior information or nonconvex techniques can improve the results for robust tensor recovery. Meanwhile, the PSNR values obtained by the nonlocal method are higher than those obtained by the nonconvex method, which implies that the nonlocal self-similarity prior information yields better results than the nonconvex method. Among these methods, NRTRM performs best in terms of PSNR values. The single prior information method (nonconvex or nonlocal) outperforms SHRPCA, TNN, SSTV, LLRTV, and TTNN+ℓ₁. Although nonlocal and NRTRM require more CPU time (in seconds) than other methods, the PSNR values obtained by nonlocal and NRTRM are higher. In the following experiments, we mainly focus on the quality of solutions of different methods.

Table 1. PSNR and CPU time (in seconds) of different methods for the toy dataset with different SRs, where σ = 0.01, and τ = 0.3.

	SR	SHRPCA	TNN	SSTV	LLRTV	TTNN+ℓ₁	Nonconvex	Nonlocal	NRTRM
PSNR	0.5	24.98	28.79	29.42	30.53	30.95	34.38	35.39	36.65
	0.6	26.00	29.70	30.58	31.54	32.32	35.51	36.52	37.87
	0.7	26.78	30.20	31.32	32.89	33.53	36.51	38.10	38.68
CPU	0.5	12.09	5.10	17.95	9.40	15.65	15.87	147.02	109.54
	0.6	11.59	4.99	17.59	9.36	16.14	13.56	154.15	104.77
	0.7	11.40	4.71	17.38	9.25	14.83	15.94	154.58	103.72

5.1. Performance of different nonconvex functions

In this subsection, we test the performance of different nonconvex functions for G₁ and G₂ in our NRTRM model, where the test image is the toy dataset (see in the subsequent subsection), and SR = 0.7, τ = 0.1, σ = 0.01. In section 3.2, we list some nonconvex functions satisfying assumption 1. Note that G₁ and G₂ could be the same. According to theorem 3, if G₁ is coercive, the PALM algorithm is globally convergent. Therefore, we also test some nonconvex functions such that G₁ is coercive. The nonconvex functions in our experiments include firm, ℓ_q, SCAD, MCP, and log, where G₁ and G₂ are the same nonconvex functions in each case. For different G₁ and G₂, we test four cases: the first two cases are that G₁ is ℓ_q in each case, and G₂ are MCP and firm, respectively (called ℓ_q + MCP and ℓ_q + firm, respectively). The last two cases are that G₁ is the log function in each case, and G₂ are MCP and SCAD, respectively (called log + MCP and log + SCAD, respectively).

In this testing case, the parameters of the nonconvex functions should be set in advance. For the firm thresholding, since γ should be larger than the coefficients of G₁ multiplied by a (i.e., a/α^k) in (18) and G₂ multiplied by a (i.e., aλ/β^k) in (19), we set γ = 65a/α^k for G₁ and γ = 65aλ/β^k for G₂ for simplicity, where the recovery results are robust as long as γ is not too large. And a is set to 0.4 in order to get good recovery performance. For the ℓ_q penalty, we just need 0 < q < 1 and simply set q = 0.9 in the experiments in order to get good recovery performance. For SCAD, we need b > 1 + 1/α^k for G₁ and b > 1 + λ/β^k for G₂ from remark 6. We simply set b = 70(1 + 1/α^k) for G₁ and b = 70(1 + λ/β^k) for G₂ in our experiments, where the recovery performance is stable as long as b is not too large. Meanwhile, we just need ξ > 0 and simply set ξ = 0.4 to get good recovery performance in this experiment. For MCP, η should be larger than the coefficients of G₁ (i.e., 1/α^k) in (18) and G₂ (i.e., λ/β^k) in (19). For simplicity, we set η = 70/α^k for G₁ and η = 70λ/β^k for G₂ since the recovery performance is stable as long as η is not too large. Moreover, c is set to 0.4 for both G₁ and G₂ in the MCP penalty in order to get good recovery results. For log, θ > 0 is just required and θ is stable for the recovery performance. We simply set θ = 2 in this experiment to get good recovery performance. Other parameters are the same as those in the next subsection.

In figure 3, we show the PSNR and SSIM values of different nonconvex functions in each case. The first observation from this figure is that the recovery performance of ℓ_q, MCP, ℓ_q + MCP, and log + MCP is almost the same in terms of PSNR values, which is better than that of the other cases. And the improvement of ℓ_q, MCP, ℓ_q + MCP, and log + MCP is about 0.5 dB compared with that of firm, SCAD, ℓ_q + firm, and log + SCAD. The SSIM values obtained by different nonconvex functions are almost the same except for log and log + SCAD. Therefore, in our following experiments, the MCP function is used in both G₁ and G₂ for simplicity.

5.2. Parameters setting

We remark that the parameters of the NRTRM model and the PALM algorithm among all groups are the same in our experiments. For α^k and β^k in the PALM algorithm, we set α^k = γ₁ ρ and β^k = γ₂ ρ by remark 5, respectively, and choose γ₁ = 1.8 and γ₂ = 1.5 in all experiments for simplicity since the recovery performance is stable as long as γ₁, γ₂ are not too large.

Now we analyze the performance of different size of patches w and number of similar patches m in each group for the NRTRM model, where the testing image is the toy dataset with different SR, τ, and σ. In figure 4, the PSNR and SSIM values are shown for different size of patches w and number of similar patches m, where w varies from 3 to 21 with step size 2, and m varies from 10 to 100 with step size 10. From figure 4, we can see that the performance is stable in terms of PSNR and SSIM values when w varied from 7 to 19 for the testing two cases. And when m varies from 10 to 50, the changes of PSNR and SSIM values are very small. In fact, the PSNR values are stable for all testing m. Therefore, for simplicity, we choose w = 9 and m = 30 in all experiments, respectively.

**Figure 4.** Performance of different size of patches w and number of similar patches m for the NRTRM model. (a) PSNR versus w. (b) SSIM versus w. (c) PSNR versus m. (d) SSIM versus m.
Download figure:
Standard image High-resolution image

For the parameter λ in the NRTRM model, we set $\lambda =\theta /\sqrt{\text{SR}{n}_{3}\enspace \mathrm{max}\left\{{w}^{2},m\right\}}$ . Here the choice of λ is similar to [49], which could guarantee the exact recovery when θ = 1 for robust tensor completion with only sparse noise. And θ is adjusted slightly in order to achieve good performance. In figures 5(a) and (b), we show the PSNR values versus different θ, where SR = 0.7, σ = 0.01 in figure 5(a) and τ = 0.3, σ = 0.01 in figure 5(b). The experimental results in this figure suggest that the peak values reach around small ranges for different cases. Therefore, we will choose θ from {5, 6, 8, 12, 15, 22} for multispectral images datasets and from {11, 14, 15, 21, 25} for video datasets in all experiments, respectively. Meanwhile, in figure 5(c), we show the recovery performance of different ρ for τ = 0.1, 0.3, where SR = 0.7 and σ = 0.01. Here ρ varies from 0.5 to 3.5. It can be observed that the PSNR values are stable around the range [2.2, 3.2] for different τ. Therefore, for the sake of brevity, we choose ρ = 2.8 in the following experiments.

**Figure 5.** Performance of different parameters of NRTRM. (a) θ with different τ. (b) θ with different SR. (c) ρ with different τ.
Download figure:
Standard image High-resolution image

The parameters c and η of the MCP function in (5) also influence the recovery performance of NRTRM. From remark 6, we know that the parameter η needs to be larger than 1/α^k for G₁ and λ/β^k for G₂. Since the recovery performance is robust for not too large η, the parameter η of MCP is chosen as 70/α^k for G₁ and 70λ/β^k for G₂ for simplicity, respectively. However, for different c, the recovery performance has great differences in our experiments. Therefore, c is chosen in a few ranges for different cases. In particular, the parameter c of MCP for G₁ and G₂ is the same, and we choose it from {0.4, 0.5, 0.6, 0.8, 1.1} for multispectral images datasets and from {0.7, 0.9, 1.4} for video datasets to get the best recovery performance in the following experiments.

5.3. Multispectral images

In this subsection, we test three multispectral images datasets (length × width × spectral) including chart toy⁸ , the Indian Pines dataset (145 × 145 × 224)⁹ , which is a synthetic data and generated by [24] using the ground truth of the Indian Pines dataset, and Samson (95 × 95 × 156) ¹⁰ [69]. Since the toy dataset is too large, we resize the toy dataset to 128 × 128 in each image, and the size of the resulting tensor is 128 × 128 × 31. All testing images are normalized to [0, 1].

In figure 6, we show the 30th band of the recovered images and corresponding zoomed regions of different methods for the toy dataset, where SR = 0.5, σ = 0.01, and τ = 0.5. The results presented in this figure show that the images recovered by NRTRM are more clear than those recovered by SHRPCA, TNN, SSTV, and LLRTV, where the NRTRM retains more details and textures than the other methods. Figure 7 shows the tenth band of the recovered images and zoomed regions of the Indian dataset, where SR = 0.7, σ = 0.001, and τ = 0.5. It is obvious that the NRTRM performs better than SHRPCA, TNN, SSTV, and LLRTV in terms of visual quality. The black parts of the zoomed region obtained by NRTRM are more clear than those obtained by SHRPCA, TNN, and SSTV. Compared with LLRTV, the white part of the zoomed region obtained by NRTRM exhibits more details. In figure 8, we show the tenth band of the recovered images and zoomed regions obtained by SHRPCA, TNN, SSTV, LLRTV, and NRTRM for the Samson dataset, where SR = 0.4, σ = 0.01, and τ = 0.3. We can observe that the details recovered by NRTRM are more clear than those recovered by SHRPCA, TNN, SSTV, and LLRTV. And the NRTRM is quite effective than SHRPCA, TNN, and SSTV in reducing the noise. Moreover, from the zoomed regions of the same figure, we can observe that the NRTRM preserves more details than SHRPCA, TNN, SSTV, and LLRTV.

**Figure 7.** The tenth band of the recovered images and zoomed regions of different methods for the Indian dataset, where SR = 0.7, σ = 0.001, and τ = 0.5.
Download figure:
Standard image High-resolution image

**Figure 8.** The tenth band of the recovered images and zoomed regions of different methods for the Samson dataset, where SR = 0.4, σ = 0.01, and τ = 0.3.
Download figure:
Standard image High-resolution image

We show the PSNR values of different bands of SHRPCA, TNN, SSTV, LLRTV, and NRTRM for the multispectral images datasets in figure 9, where SR = 0.45, σ = 0.01, and τ = 0.3. It can be seen that the PSNR values obtained by NRTRM are higher than those obtained by SHRPCA, TNN, SSTV, and LLRTV in most bands for the three multispectral images datasets, except for few bands of the Samson dataset. In particular, the improvement of PSNR values obtained by NRTRM is very obvious than the other methods in each band for the Indian dataset.

In table 2, we report the PSNR and SSIM values of SHRPCA, TNN, SSTV, LLRTV, and NRTRM for the three multispectral images datasets, where SR = 0.5, 0.7, σ = 0.01, 0.001, and τ = 0.1, 0.3, 0.5. It can be observed that the PSNR and SSIM values obtained by our proposed NRTRM model are much higher than those obtained by SHRPCA, TNN, SSTV, and LLRTV, especially for low impulse noise ratios, where the improvements of NRTRM are impressive than those of the other methods in terms of PSNR and SSIM values. The NRTRM can improve about 4 dB compared with the other methods for most cases. Meanwhile, the performance of LLRTV is better than that of SHRPCA, TNN, and SSTV in terms of PSNR and SSIM values.

Table 2. PSNR and SSIM values of different methods for the multispectral images datasets with different SRs, σ, and τ. The boldface number is the best.

	Images	SR	σ	τ	SHRPCA	TNN	SSTV	LLRTV	NRTRM
PSNR	Toy	0.7	0.01	0.1	32.27	34.54	35.81	38.39	43.90
				0.3	26.78	30.29	31.32	32.89	38.53
				0.5	22.25	27.32	27.82	28.53	32.51
			0.001	0.1	33.61	37.17	37.99	39.89	44.26
				0.3	27.29	31.23	31.98	33.15	38.76
				0.5	22.29	27.58	28.18	28.64	32.59
	Indian	0.7	0.01	0.1	30.05	34.40	36.63	43.76	48.53
				0.3	24.82	31.05	31.95	36.62	40.61
				0.5	20.08	27.98	28.62	30.02	33.77
			0.001	0.1	31.30	36.47	36.90	44.55	48.58
				0.3	25.09	32.39	32.66	38.62	40.69
				0.5	20.18	28.79	29.03	30.19	33.95
	Samson	0.5	0.01	0.1	33.57	34.45	35.92	38.76	45.86
				0.3	27.47	31.22	31.95	33.22	40.75
				0.5	21.83	28.11	28.48	28.89	35.47
			0.001	0.1	35.61	37.80	38.58	39.93	46.25
				0.3	27.89	32.50	32.82	33.75	40.80
				0.5	21.86	28.54	29.05	29.40	35.78
SSIM	Toy	0.7	0.01	0.1	0.8918	0.9230	0.9731	0.9765	0.9925
				0.3	0.7519	0.8821	0.9317	0.9486	0.9805
				0.5	0.5830	0. 8525	0.8660	0.8822	0.9256
			0.001	0.1	0.9531	0.9773	0.9884	0.9885	0.9930
				0.3	0.8217	0.9392	0.9558	0.9581	0.9811
				0.5	0.6349	0.8682	0.9162	0.9178	0.9271
	Indian	0.7	0.01	0.1	0.8333	0.9227	0.9748	0.9956	0.9987
				0.3	0.7758	0.8941	0.9509	0.9858	0.9951
				0.5	0.6007	0.8825	0.8941	0.9221	0.9818
			0.001	0.1	0.8963	0.9878	0.9905	0.9971	0.9990
				0.3	0.8420	0.9590	0.9732	0.9928	0.9952
				0.5	0.6366	0.9055	0.9329	0.9411	0.9819
	Samson	0.5	0.01	0.1	0.8981	0.9099	0.9674	0.9798	0.9934
				0.3	0.7816	0.9008	0.9432	0.9623	0.9866
				0.5	0.6621	0.8466	0.8737	0.8976	0.9552
			0.001	0.1	0.9446	0.9682	0.9837	0.9877	0.9941
				0.3	0.8481	0.9454	0.9679	0.9734	0.9879
				0.5	0.7076	0.8527	0.9156	0.9187	0.9667

Figure 10 displays the PSNR values versus SR of different methods for the multispectral images datasets, where σ = 0.01 and τ = 0.3. From this figure, we can observe that the PSNR values obtained by NRTRM are higher than those obtained by SHRPCA, TNN, SSTV, and LLRTV for different SRs. And the LLRTV performs better than SHRPCA, TNN, and SSTV quantitatively, especially for high SRs. Moreover, the SSTV and TNN outperform SHRPCA in terms of PSNR values.

5.4. Video images

In this subsection, we test three video datasets including Foreman, Hall Monitor¹¹ , and Galleon¹² . Each image of the video datasets is resized to 144 × 176. Only one channel of each image are chosen in the video datasets. Then the sizes of the resulting tensors are 144 × 176 × 180 (Foreman), 144 × 176 × 200 (Hall), and 144 × 176 × 140 (Galleon). All images are mapped to [0, 1] in our experiments.

In figure 11, we show the 20th frame of the recovered images and their zoomed regions for the Foreman dataset, where SR = 0.5, σ = 0.01, and τ = 0.3. It can be observed that the NRTRM performs considerably better than the other methods in recovering the details of the images. From the zoomed regions, we can see that the NRTRM exhibits tangibly better recovery performance than SHRPCA, TNN, SSTV, and LLRTV in terms of visual quality. Figure 12 displays the visual comparisons of SHRPCA, TNN, SSTV, LLRTV, and NRTRM for the Hall dataset, where SR = 0.7, σ = 0.01, and τ = 0.5. We can see that the images recovered by NRTRM preserve more details than those recovered by the other methods. In figure 13, we show the recovered images and their zoomed regions of different methods for the Galleon dataset, where SR = 0.5, σ = 0.001, and τ = 0.5. Again we see that the visual quality of the images recovered by NRTRM is much better than that recovered by the other four methods.

**Figure 12.** The 50th frame of the recovered images and zoomed regions of different methods for the Hall dataset, where SR = 0.7, σ = 0.01, and τ = 0.5.
Download figure:
Standard image High-resolution image

**Figure 13.** The 15th frame of the recovered images and zoomed regions of different methods for the Galleon dataset, where SR = 0.5, σ = 0.001, and τ = 0.5.
Download figure:
Standard image High-resolution image

Figure 14 is the PSNR value display of each frame recovered by different methods for the video datasets, where SR = 0.55, σ = 0.01, and τ = 0.3. It is clear that the PSNR values obtained by NRTRM are higher than those obtained by SHRPCA, TNN, SSTV, and LLRTV for different frames. The improvement of NRTRM is very impressive in almost all frames compared with the other four methods for these video datasets.

Table 3 displays the PSNR and SSIM values of different methods for the three video datasets with different SRs, where σ = 0.01, 0.001 and τ = 0.1, 0.3, 0.5. The results in the table show that the PSNR and SSIM values acquired by NRTRM are higher than those acquired by SHRPCA, TNN, SSTV, and LLRTV. Compared with other methods, the improvements of NRTRM are around 2 dB in terms of PSNR values. Moreover, the performance of SSTV and LLRTV is better than that of SHRPCA and TNN in terms of PSNR and SSIM values, while the performance of SSTV and LLRTV is almost the same for the testing cases.

Table 3. PSNR and SSIM values of different methods for the video datasets with different SRs, σ, and τ. The boldface number is the best.

	Images	SR	σ	τ	SHRPCA	TNN	SSTV	LLRTV	NRTRM
PSNR	Foreman	0.7	0.01	0.1	27.10	29.10	30.31	30.43	32.78
				0.3	23.07	25.71	26.32	26.87	29.34
				0.5	19.55	22.93	23.53	23.55	25.97
			0.001	0.1	27.40	29.72	30.60	30.64	32.94
				0.3	23.22	26.00	26.49	26.99	29.50
				0.5	19.60	23.13	23.62	23.58	25.99
	Hall	0.7	0.01	0.1	28.66	32.94	34.11	34.59	36.68
				0.3	24.34	29.97	31.05	31.32	33.76
				0.5	21.12	28.08	28.68	28.70	30.41
			0.001	0.1	29.14	34.25	35.48	35.78	36.82
				0.3	24.45	30.65	31.25	31.35	33.78
				0.5	21.13	28.46	28.75	28.77	30.46
	Galleon	0.5	0.01	0.1	24.36	35.45	36.61	36.88	39.27
				0.3	20.03	32.82	33.92	34.08	36.18
				0.5	17.16	29.87	30.94	30.76	32.01
			0.001	0.1	24.53	37.62	38.65	38.11	39.36
				0.3	20.15	33.54	34.52	34.82	36.38
				0.5	17.24	30.49	31.09	30.94	32.39
SSIM	Foreman	0.7	0.01	0.1	0.7803	0.8149	0.8786	0.8921	0.9179
				0.3	0.5942	0.7063	0.7439	0.7982	0.8736
				0.5	0.4671	0.5486	0.6459	0.6644	0.7923
			0.001	0.1	0.8081	0.8561	0.8961	0.9021	0.9232
				0.3	0.6121	0.7446	0.7633	0.8082	0.8742
				0.5	0.4742	0.5756	0.6660	0.6696	0.7935
	Hall	0.7	0.01	0.1	0.8472	0.9189	0.9670	0.9717	0.9727
				0.3	0.7750	0.8823	0.9371	0.9569	0.9667
				0.5	0.6715	0.8473	0.9220	0.9422	0.9479
			0.001	0.1	0.8886	0.9485	0.9687	0.9752	0.9758
				0.3	0.8161	0.9377	0.9553	0.9651	0.9669
				0.5	0.7020	0.9093	0.9436	0.9478	0.9479
	Galleon	0.5	0.01	0.1	0.8372	0.9701	0.9877	0.9864	0.9894
				0.3	0.6907	0.9563	0.9754	0.9784	0.9832
				0.5	0.4808	0.9318	0.9603	0.9578	0.9611
			0.001	0.1	0.8526	0.9847	0.9899	0.9888	0.9897
				0.3	0.7053	0.9635	0.9770	0.9821	0.9838
				0.5	0.4892	0.9346	0.9626	0.9595	0.9625

In figure 15, we show the PSNR values versus SR of SHRPCA, TNN, SSTV, LLRTV, and NRTRM for the video datasets, where σ = 0.01 and τ = 0.3. We can observe from this figure that the NRTRM performs significantly better than SHRPCA, TNN, SSTV, and LLRTV in terms of PSNR values for the testing cases. Moreover, the LLRTV and SSTV yield better recovery performance than SHRPCA and TNN for these SRs.

6. Concluding remarks

In this paper, we have studied the problem of robust low-rank tensor recovery for third-order tensors, which aims to recover a tensor from partial observations degraded by a mixture of various types of noise including Gaussian noise and sparse noise. By leveraging on the global low-rankness and nonlocal self-similarity of the underlying tensor, we proposed an NRTRM approach, which proceeds as follows: first, by extracting the locations of the similar patched-tubes of an initial estimator, we group these similar cubes of the observed tensor together and unfold these cubes into matrices along the second-dimension, then stack these similar matrices into a third-order sub-tensor. Second, a class of nonconvex functions are employed on the singular values of the TTNN of the underlying sub-tensor and each entries of the sparse corruptions sub-tensor, respectively. Finally, the recovered tensor is obtained by aggregating all patches, where the weighted averages are utilized in the overlapping locations. Moreover, a PALM algorithm is developed to solve the NRTRM model and its convergence is established under very mild conditions. Preliminary numerical results on many real-world datasets demonstrate that the NRTRM method outperforms several state-of-the-art methods both quantitatively and qualitatively.

In the future, a possible direction is to establish the error bounds of the NRTRM method in terms of fraction observations for each sub-tensor (cf [30]). Since some real datasets are higher-order tensors, such as color videos or hyperspectral images with different time (fourth-order tensors), seismic data (fifth-order tensors [32]), it would also be of great interest to extend NRTRM to higher-order tensors, where the key issue is to extend the tensor–tensor product with unitary transform and transformed tensor SVD to higher-order tensors (cf [38]).

Acknowledgments

The authors sincerely thank the editor and anonymous referees for their constructive comments that substantially improved the quality of this paper.

Nonlocal robust tensor recovery with nonconvex regularization^{^*}

Article metrics

Submit

Permissions

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

2. Preliminaries

2.1. Notation and notions

2.2. Transformed tensor SVD

3. Nonlocal and nonconvex model for robust tensor recovery

3.1. Problem formulation

3.2. Nonlocal model with nonconvex regularization

4. Proximal alternating linearized minimization algorithm for NRTRM

4.1. PALM algorithm

4.2. Convergence analysis

4.3. A stopping criterion for the PALM algorithm

5. Numerical results

5.1. Performance of different nonconvex functions

5.2. Parameters setting

5.3. Multispectral images

5.4. Video images

6. Concluding remarks

Acknowledgments

Footnotes

Nonlocal robust tensor recovery with nonconvex regularization*

Article metrics

Submit

Permissions

Share this article

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

2. Preliminaries

2.1. Notation and notions

2.2. Transformed tensor SVD

3. Nonlocal and nonconvex model for robust tensor recovery

3.1. Problem formulation

3.2. Nonlocal model with nonconvex regularization

4. Proximal alternating linearized minimization algorithm for NRTRM

4.1. PALM algorithm

4.2. Convergence analysis

4.3. A stopping criterion for the PALM algorithm

5. Numerical results

5.1. Performance of different nonconvex functions

5.2. Parameters setting

5.3. Multispectral images

5.4. Video images

6. Concluding remarks

Acknowledgments

Footnotes

Nonlocal robust tensor recovery with nonconvex regularization^{^*}