Published in: Complex & Intelligent Systems 5/2022

Open Access 25.09.2021 | Original Article

Spatio-temporal joint aberrance suppressed correlation filter for visual tracking

Authors: Libin Xu, Pyoungwon Kim, Mengjie Wang, Jinfeng Pan, Xiaomin Yang, Mingliang Gao



Abstract

The discriminative correlation filter (DCF)-based tracking methods have achieved remarkable performance in visual tracking. However, the existing DCF paradigm still suffers from dilemmas such as boundary effect, filter degradation, and aberrance. To address these problems, we propose a spatio-temporal joint aberrance suppressed regularization (STAR) correlation filter tracker under a unified framework of response map. Specifically, a dynamic spatio-temporal regularizer is introduced into the DCF to alleviate the boundary effect and filter degradation, simultaneously. Meanwhile, an aberrance suppressed regularizer is exploited to reduce the interference of background clutter. The proposed STAR model is effectively optimized using the alternating direction method of multipliers (ADMM). Finally, comprehensive experiments on TC128, OTB2013, OTB2015 and UAV123 benchmarks demonstrate that the STAR tracker achieves compelling performance compared with the state-of-the-art (SOTA) trackers.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Visual tracking aims to estimate the state of a target in an image sequence, given its initial state. It plays a crucial role in computer vision-based applications, e.g., vehicle navigation, video surveillance and robotic perception [2, 16, 26, 31]. In recent years, DCF-based methods have attracted extensive attention due to their high efficiency. However, DCF-based tracking remains challenging due to many intricate issues, such as the boundary effect, filter degradation, and aberrance.
Boundary effect. The efficiency of DCF-based methods relies on the periodic assumption at the training and detection stages. However, this assumption forces the filters to be trained and evaluated on partially synthetic samples, which results in the undesired boundary effect. The boundary effect mainly impedes the performance of the DCF in two aspects [13]. (i) The inaccurate negative training samples reduce the discriminative power of the learned filters. (ii) The detection scores are reliable only around the center of the region, while the remaining scores are heavily influenced by the periodic repetitions of the detection samples. To address this issue, several competitive DCF-based trackers utilize a constant spatial regularizer to penalize the filter coefficients outside the bounding box [13, 18, 25]. However, such constant spatial constraints are usually fixed during tracking, so diverse information (e.g., the appearance variation of the target and the confidence of the tracking results) is not fully utilized. Therefore, in this paper, we propose a dynamic spatial regularizer based on the response variation rate, which enables the tracker to learn more reliable filter coefficients.
Filter degradation. Generally, DCF-based methods adopt a model update mechanism with a fixed rate, which ignores the variation between frames [45]. Once the appearance of the target varies dramatically, the filter learned from the previous frame cannot adapt to the appearance changes, resulting in filter degradation. To cope with filter degradation, several DCF-based trackers introduce a temporal regularizer into filter training [25, 28, 45]. Nevertheless, the temporal regularizer is based on the assumption that filters between consecutive frames should be coherent. Filter training may be disturbed by severe occlusion, background clutter, etc., resulting in a corrupted filter that breaks this assumption. To solve this issue, in this paper, we propose a dynamic temporal regularizer based on the average peak-to-correlation energy (APCE) [39] to suppress filter degradation.
Aberrance. Owing to spatial regularization, the correlation filter can be learned on larger image regions [13]. Nevertheless, with the expansion of the learning region, more background clutter is introduced, leading to aberrance at the detection stage, which manifests as abrupt variations in the response maps. To reduce the effect of aberrance, Wang et al. [39] proposed the Large Margin Object Tracking (LMCF) method, in which the quality of the response maps is verified during filter learning and used to carry out the model update only under high confidence. Choi et al. [7] proposed the Attentional Correlation Filter Network (ACFN) tracker, which integrates multiple correlation filters into a network; the verification scores generated from the response maps are utilized to select a suitable filter. However, these trackers only deal with aberrance at the detection stage, which inevitably limits the tracking performance. Unlike these trackers, in this paper, we integrate an aberrance suppressed regularizer into the DCF scheme to suppress aberrance at the filter training stage.
In this work, we address the above issues simultaneously under a unified framework of response map by learning a spatio-temporal joint aberrance suppressed regularization correlation filter. The main contributions are summarized as follows.
1.
A novel tracking method by learning spatio-temporal joint aberrance suppressed regularization correlation filter (STAR) is proposed under a unified framework of response map.
 
2.
A dynamic spatio-temporal regularizer is introduced to alleviate the boundary effect and filter degradation, simultaneously.
 
3.
An aberrance suppressed strategy is introduced into the filter learning to minimize the interference caused by background clutter.
 
4.
Extensive evaluations are conducted on four challenging tracking benchmarks, and the experimental results demonstrate the competitive performance of the proposed tracker compared with the state-of-the-art (SOTA) tracking methods.
 
The rest of this paper is organized as follows. In “Related work”, we present an overview of the prior work most relevant to the proposed method. In “Proposed method”, the proposed STAR model is introduced, and the ADMM algorithm is developed to solve the STAR efficiently. In “Experimental results”, quantitative and qualitative evaluations of the proposed tracker with the SOTA trackers are presented. Conclusions are presented in “Conclusion”.

Related work

The visual tracking methods can be broadly classified into generative and discriminative methods [31, 40]. Among the discriminative trackers, DCF-based methods have promoted visual tracking to a new level.

Generative tracking

The generative tracking methods attempt to build a model that represents the appearance of the target and search for the candidate region most similar to it, i.e., the one with minimal reconstruction error. Comaniciu et al. [8] proposed the mean-shift tracking method with iterative histogram matching for visual tracking. Adam et al. [1] proposed the fragments-based tracker, which utilizes multiple image fragments to represent the object. Subsequently, Ross et al. [35] proposed the subspace-based tracking method to learn and update a low-dimensional subspace representation of the target. Although generative tracking has achieved considerable success in constrained scenarios, it is vulnerable to complicated appearance variations of the target. Therefore, more attention has shifted to discriminative tracking, which is less susceptible to background clutter during the tracking process.

Discriminative tracking

The discriminative tracking methods train a classifier to discriminate the target from the background. Grabner et al. [19] proposed an online boosting tracker by fusing multiple weak classifiers. Kalal et al. [24] proposed the Tracking–Learning–Detection (TLD) tracker, which decomposes long-term tracking into three sub-tasks, namely tracking, learning, and detection. More recently, many deep neural network (DNN)-based trackers under the frameworks of “end-to-end learning” and “offline-learning and online-tracking” have been proposed. For example, Bertinetto et al. [4] proposed the Fully Convolutional Siamese Networks (SiamFC) tracker, which trains a fully convolutional siamese network by cross-correlating the two inputs of a bilinear layer. Valmadre et al. [37] put forward the CFNet tracker, which treats the correlation filter as a differentiable layer of a deep neural network. In general, discriminative tracking is more effective than generative tracking in resisting the negative effects of complex background clutter and target appearance variations [40].

DCF-based tracking

Recently, DCF has received considerable attention due to its efficiency and scalability. Bolme et al. [5] first proposed the correlation filter tracker, termed minimum output sum of squared error (MOSSE), to learn a filter between multiple training image patches and a template of user-specified ideal correlation response. Henriques et al. [21] proposed the circulant structure of Tracking-by-Detection with Kernels (CSK) tracker, which exploits the circulant structure of the local image patch to learn a kernel regularized least squares classifier.
To further improve the tracking performance, subsequent improvements mainly focus on two aspects, namely feature representation and scale estimation. In feature representation, Danelljan et al. [11] proposed the color attributes tracker by investigating the color names (CN) [38] feature in the tracking-by-detection framework. Henriques et al. [22] proposed the kernelized correlation filters (KCF) method by utilizing the histogram of oriented gradients (HOG) [9] feature. In addition, Bertinetto et al. [3] proposed the Sum of Template And Pixel-wise LEarners (STAPLE) tracker, which combines HOG and colour features to improve the tracking credibility. Moreover, convolutional neural network (CNN) features have been used to further improve the feature representation [12, 14, 25, 45]. In scale estimation, Danelljan et al. [10] proposed the Discriminative Scale Space Tracking (DSST) method, which learns a separate scale filter to address scale variation. Li et al. [27] proposed the Scale Adaptive with Multiple Features (SAMF) tracker, which employs bilinear interpolation to generate image representations at multiple scales.

Proposed method

Revisit the standard DCF

In the standard DCF [22], \({\mathbf {x}} \in {\mathbb {R}}^{M \times N \times C}\) denotes the training sample with \(M \times N\) feature size and C channels. \({\mathbf {y}} \in {\mathbb {R}}^{M \times N}\) is the corresponding Gaussian-shaped label (desired output). The filter \({\mathbf {f}} \in {\mathbb {R}}^{M \times N \times C} \) is trained by regressing the samples, which is defined as follows,
$$\begin{aligned} \underset{{\mathbf {f}}}{\arg \min }\frac{1}{2}\left\| \sum _{c=1}^{C} {\mathbf {x}}^c * {\mathbf {f}}^c - {\mathbf {y}}\right\| ^2_F + \alpha \sum _{c=1}^{C}\left\| {\mathbf {f}}^c\right\| ^2_F, \end{aligned}$$
(1)
where \(*\) stands for the circular convolution operator, and \(\alpha \) is the regularization parameter to prevent overfitting.
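For intuition, Eq. (1) admits a per-pixel closed-form solution in the Fourier domain, because circular convolution becomes element-wise multiplication there and each pixel couples the C channels through a rank-1 system (the same Sherman–Morrison argument used later for Eq. (16)). The NumPy sketch below illustrates this; the function name train_dcf and the toy data are illustrative and not from the paper.

```python
import numpy as np

def train_dcf(x, y, alpha=1e-2):
    """Closed-form DCF training of Eq. (1) in the Fourier domain.

    x : (M, N, C) training sample, y : (M, N) Gaussian-shaped label.
    Per pixel, the C-channel system is rank-1, so Sherman-Morrison
    collapses the solution to a simple element-wise ratio.
    """
    X = np.fft.fft2(x, axes=(0, 1))                  # per-channel DFT
    Y = np.fft.fft2(y)
    num = np.conj(X) * Y[..., None]                  # conj(x_hat) * y_hat
    den = np.sum(np.abs(X) ** 2, axis=2) + alpha     # sum_c |x_hat^c|^2 + alpha
    F = num / den[..., None]                         # filter in the Fourier domain
    return np.real(np.fft.ifft2(F, axes=(0, 1)))

# toy usage on a 50x50 patch with 3 feature channels
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 50, 3))
yy, xx = np.mgrid[:50, :50]
y = np.exp(-((yy - 25) ** 2 + (xx - 25) ** 2) / (2 * 4.0 ** 2))
f = train_dcf(x, y)
print(f.shape)  # (50, 50, 3)
```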
In the standard DCF model, several problems need to be further addressed. (i) It suffers from periodic repetitions at boundary positions caused by the circularly shifted training samples. (ii) It does not tackle filter degradation, since the model is updated at a fixed rate. (iii) There is no mechanism to cope with aberrance, so the target is easily lost when aberrance occurs.

The proposed model STAR

To address the problems mentioned above, we propose a novel spatio-temporal joint aberrance suppressed regularization (STAR) correlation filter for robust visual tracking. The tracking framework of the proposed STAR model is shown in Fig. 1. A spatial regularizer, a temporal regularizer and an aberrance suppressed regularizer are incorporated into the standard DCF to tackle the boundary effect, filter degradation and aberrance, simultaneously.
We assume that the learning of the correlation filter \({\mathbf {f}}\) is conducted for the t-th frame. The filter is learned by minimizing the following objective function,
$$\begin{aligned} \underset{{\mathbf {f}}}{\arg \min }\frac{1}{2}\left\| \sum _{c=1}^{C}{\mathbf {x}}^c * {\mathbf {f}}^c - {\mathbf {y}}\right\| ^2_F + \frac{\lambda }{2}{\mathcal {R}}_\mathrm{s} + \frac{\mu }{2}{\mathcal {R}}_\mathrm{t} + \frac{\eta }{2}{\mathcal {R}}_\mathrm{a}, \end{aligned}$$
(2)
where \(\left\| \sum _{c=1}^{C}{\mathbf {x}}^c * {\mathbf {f}}^c - {\mathbf {y}}\right\| ^2_F\) denotes the regression loss parameterized by \({\mathbf {f}}\). \({\mathcal {R}}_\mathrm{s}\), \({\mathcal {R}}_\mathrm{t}\) and \({\mathcal {R}}_\mathrm{a}\) refer to the spatial, temporal and aberrance suppressed regularizers, respectively, and \(\lambda \), \(\mu \) and \(\eta \) are their corresponding coefficients.

Dynamic spatial regularizer

The constant spatial regularizer in the SOTA trackers (e.g., SRDCF [13], BACF [18] and STRCF [25]) does not fully exploit the diverse information of the target. When the target suffers from interference, e.g., severe occlusion or background clutter, the filter coefficients become unreliable, leading to tracking failures. To solve this problem, we design a dynamic spatial regularizer based on the response variation rate.
The response variation rate is defined as \({\varvec{\Pi }} = \left\| {\Pi }^1, {\Pi }^2,\right. \left. \ldots ,{\Pi }^{MN}\right\| \), and the i-th element \({\Pi }^i\) is defined as,
$$\begin{aligned} {\Pi }^i=\frac{\mathrm {R}^i_t-\left( \mathrm {R}_{t-1}[\psi \bigtriangleup ]\right) ^i}{\left( \mathrm {R}_{t-1}[\psi \bigtriangleup ]\right) ^i}, \end{aligned}$$
(3)
where \([\psi \bigtriangleup ]\) is the shift operator. It enables the peaks of response \(\mathrm {R}_t\) and \(\mathrm {R}_{t-1}\) to coincide with each other to eliminate the motion influence [23]. Considering that the response variation rate \({\varvec{\Pi }}\) reveals the confidence level of each pixel in the search area, we introduce \({\varvec{\Pi }}\) into the spatial weight \({\mathbf {w}}\),
$$\begin{aligned} {\mathbf {w}}=\delta \log {\varvec{\Pi }} + \tilde{{\mathbf {w}}}, \end{aligned}$$
(4)
where \(\delta \) is a hyperparameter for adjusting the weight of \(\varvec{\Pi }\), and \(\tilde{{\mathbf {w}}}\) is a matrix for initializing spatial regularization weight \({\mathbf {w}}\). The dynamic spatial regularizer of STAR model is defined as,
$$\begin{aligned} {\mathcal {R}}_\mathrm{s} = \sum _{c=1}^{C}\left\| {\mathbf {w}}_t\odot {\mathbf {f}}^c_t\right\| ^2_F, \end{aligned}$$
(5)
where \(\odot \) is the Hadamard product. The dynamic variation of the spatial regularization is visualized in Fig. 2. The dynamic spatial regularizer imposes different penalties on spatial positions according to the response variation rate: a higher penalty where the variation rate is large and a lower penalty where it is small. Thus, it yields more reliable filter coefficients at the detection stage.
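A minimal NumPy sketch of Eqs. (3)–(4) follows. The shift operator \([\psi \bigtriangleup ]\) is realized as a circular shift that aligns the response peaks; the magnitude-plus-epsilon safeguard before the logarithm is our assumption, since the paper does not specify how negative or near-zero variation rates are handled, and all function names are illustrative.

```python
import numpy as np

def align_to_peak(R_prev, R_curr):
    """Circularly shift R_prev so its peak coincides with the peak of R_curr
    (the shift operator [psi, Delta] in Eq. (3))."""
    p_prev = np.unravel_index(np.argmax(R_prev), R_prev.shape)
    p_curr = np.unravel_index(np.argmax(R_curr), R_curr.shape)
    return np.roll(R_prev, (p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]),
                   axis=(0, 1))

def dynamic_spatial_weight(R_curr, R_prev, w_init, delta=0.1, eps=1e-6):
    """Response variation rate (Eq. (3)) and dynamic spatial weight (Eq. (4))."""
    R_shift = align_to_peak(R_prev, R_curr)
    Pi = (R_curr - R_shift) / (R_shift + eps)        # element-wise variation rate
    w = delta * np.log(np.abs(Pi) + eps) + w_init    # magnitude keeps the log defined
    return w
```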

Dynamic temporal regularizer

The existing temporal regularizer \(\sum _{c=1}^{C}\left\| {\mathbf {f}}^c_t - {\mathbf {f}}^c_{t-1}\right\| ^2_F\) is constructed using the previous filter \({\mathbf {f}}_{t-1}\) (e.g., STRCF [25], LADCF [45] and AutoTrack [28]). The filter learned at frame t is affected to a large extent by the filter \({\mathbf {f}}_{t-1}\). However, \({\mathbf {f}}_{t-1}\) may be corrupted by occlusion or background clutter; thus, it will break the assumption that the filters between consecutive frames should be coherent. To tackle this issue, we propose to learn a dynamic temporal regularizer based on APCE measure. The APCE measure is defined as,
$$\begin{aligned} \mathrm {APCE} = \frac{\left| \mathrm {R}_\mathrm{max} - \mathrm {R}_\mathrm{min}\right| ^2}{\mathrm{mean}\left[ \sum _{w,h}\left( \mathrm {R}_{w,h} - \mathrm {R}_\mathrm{min}\right) ^2\right] }, \end{aligned}$$
(6)
where \(\mathrm {R}_\mathrm{max}\), \(\mathrm {R}_\mathrm{min}\) and \(\mathrm {R}_{w,h}\) denote the maximum, the minimum and the element in the w-th row and h-th column of the response \(\mathrm {R}\), respectively. The APCE value and its corresponding threshold on a typical tracking sample are visualized in Fig. 3. At the training stage, the filter may be corrupted by occlusion, background clutter, etc.; the response map with interference is then generated by the convolution of the corrupted filter and the feature map. As a consequence, the APCE value obtained by Eq. (6) drops significantly. This property of APCE can be used to judge whether the filter is corrupted. Accordingly, an uncorrupted filter \({\mathbf {f}}_\mathrm{s}\) is selected for the temporal regularizer instead of \({\mathbf {f}}_{t-1}\), as follows,
$$\begin{aligned} \begin{aligned}&{\mathbf {f}}_\mathrm{s}= {\left\{ \begin{array}{ll} {\mathbf {f}}_{t-1} &{}\text {if } \mathrm {APCE}_{t}>\zeta \mathrm {APCE}_\mathrm{hm}\\ {\mathbf {f}}_{t-i} &{}\text {otherwise}\\ \end{array}\right. }\\&\mathrm{s.t.}, \; i= {\left\{ \begin{array}{ll} i \in {\mathbb {N}}\\ i>1\\ \underset{i}{\arg \min }\left( \mathrm {APCE}_{t-i+1}>\zeta \mathrm {APCE}_\mathrm{hm}\right) \\ \end{array}\right. }, \end{aligned} \end{aligned}$$
(7)
where \({\mathbf {f}}_{t-1}\) and \({\mathbf {f}}_{t-i}\) denote the filters at the \((t-1)\)-th and \((t-i)\)-th frames, respectively. \(\zeta \) is a hyperparameter, and \(\mathrm {APCE}_\mathrm{hm}\) stands for the historical mean value of APCE.
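The sketch below illustrates Eq. (6) and the selection rule of Eq. (7). Whether the historical mean APCE includes the current frame is not specified in the text, so including it here is an assumption; the function names are illustrative.

```python
import numpy as np

def apce(R):
    """Average peak-to-correlation energy of a response map R (Eq. (6))."""
    r_max, r_min = R.max(), R.min()
    return np.abs(r_max - r_min) ** 2 / np.mean((R - r_min) ** 2)

def select_reference_filter(filters, apce_hist, zeta=0.7):
    """Choose the uncorrupted reference filter f_s following Eq. (7).

    filters   : past filters [f_1, ..., f_{t-1}], most recent last
    apce_hist : APCE values  [APCE_1, ..., APCE_t], current frame last
    """
    thr = zeta * np.mean(apce_hist)          # zeta * APCE_hm
    if apce_hist[-1] > thr:                  # current response is reliable
        return filters[-1]                   # f_{t-1}
    for i in range(2, len(filters) + 1):     # smallest i > 1 first
        if apce_hist[-i] > thr:              # APCE_{t-i+1} > zeta * APCE_hm
            return filters[-i]               # f_{t-i}
    return filters[0]                        # fallback: oldest stored filter
```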
The uncorrupted filter \({\mathbf {f}}_\mathrm{s}\) is selected to construct the dynamic temporal regularizer for the STAR model as follows,
$$\begin{aligned} {\mathcal {R}}_\mathrm{t} = \sum _{c=1}^{C}\left\| {\mathbf {f}}^c_t - {\mathbf {f}}^c_\mathrm{s}\right\| ^2_F. \end{aligned}$$
(8)
Compared with the existing temporal regularization methods [25, 28, 45], the STAR model takes full advantage of the temporal continuity of the video by exploiting \(\left\| {\mathbf {f}}_t - {\mathbf {f}}_\mathrm{s}\right\| ^2_F\) to penalize the difference between the current filter \({\mathbf {f}}_t\) and the uncorrupted filter \({\mathbf {f}}_\mathrm{s}\). Thus, the proposed STAR obtains a more robust appearance model and alleviates filter degradation effectively.

Aberrance suppressed regularizer

The response map can reveal, to a large extent, the confidence of the tracking result [39]. Aberrance caused by background clutter occurs at the detection stage and results in an abrupt variation in the response maps. The aberrance can therefore be effectively suppressed by restricting the response variation. Accordingly, an aberrance suppressed regularizer is introduced to handle the aberrance at the training stage. The aberrance suppressed regularizer is formulated as,
$$\begin{aligned} {\mathcal {R}}_\mathrm{a} = \left\| \mathrm {R}_t - \mathrm {R}_{t-1}[\psi \bigtriangleup ]\right\| ^2_F, \end{aligned}$$
(9)
where all the variables have been explained in Eq. (3).
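For illustration, the aberrance term of Eq. (9) can be evaluated as below, with the shift \([\psi \bigtriangleup ]\) again realized as a circular shift that aligns the response peaks; the function name is illustrative.

```python
import numpy as np

def aberrance_penalty(R_curr, R_prev):
    """Evaluate the aberrance term of Eq. (9): the squared Frobenius distance
    between the current response and the peak-aligned previous response.
    A large value indicates an abrupt response change (aberrance)."""
    p_prev = np.unravel_index(np.argmax(R_prev), R_prev.shape)
    p_curr = np.unravel_index(np.argmax(R_curr), R_curr.shape)
    R_shift = np.roll(R_prev, (p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]),
                      axis=(0, 1))
    return np.sum((R_curr - R_shift) ** 2)
```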

Optimization of STAR

With all the regularizers defined, the key step is the optimization of Eq. (2). Benefiting from its convexity, Eq. (2) can be minimized using ADMM [6] to obtain the optimal solution. Specifically, we introduce the auxiliary variable \({\mathbf {g}}={\mathbf {f}}\) and the step size parameter \(\gamma \) to construct the following augmented Lagrange function,
$$\begin{aligned} {\mathcal {L}}&= \frac{1}{2}\left\| \sum _{c=1}^{C}{\mathbf {x}}^c_t * {\mathbf {f}}^c_t - {\mathbf {y}}\right\| ^2_F + \frac{\lambda }{2}\sum _{c=1}^{C}\left\| {\mathbf {w}}_t \odot {\mathbf {g}}^c_t\right\| ^2_F \nonumber \\&\quad + \frac{\mu }{2}\sum _{c=1}^{C}\left\| {\mathbf {f}}^c_t - {\mathbf {f}}^c_s\right\| ^2_F+ \frac{\eta }{2}\left\| \sum _{c=1}^{C}{\mathbf {x}}^c_t * {\mathbf {f}}^c_t - {\mathbf {r}}\right\| ^2_F \nonumber \\&\quad + \sum _{c=1}^{C}\left( {\mathbf {f}}^c_t - {\mathbf {g}}^c_t\right) ^\mathrm {T}{\mathbf {s}}^c_t+ \frac{\gamma }{2}\sum _{c=1}^{C}\left\| {\mathbf {f}}^c_t - {\mathbf {g}}^c_t\right\| ^2_F, \end{aligned}$$
(10)
where \({\mathbf {r}}={\mathbf {R}}_{t-1}[\psi \bigtriangleup ]\), and \({\mathbf {s}}\) refers to the Lagrange multiplier. By introducing \({\mathbf {h}}=\frac{1}{\gamma }{\mathbf {s}}\), Eq. (10) can be reformulated as,
$$\begin{aligned} \begin{aligned} {\mathcal {L}}&= \frac{1}{2}\left\| \sum _{c=1}^{C}{\mathbf {x}}^c_t * {\mathbf {f}}^c_t - {\mathbf {y}}\right\| ^2_F + \frac{\lambda }{2}\sum _{c=1}^{C}\left\| {\mathbf {w}}_t \odot {\mathbf {g}}^c_t\right\| ^2_F\\&\quad + \frac{\mu }{2}\sum _{c=1}^{C}\left\| {\mathbf {f}}^c_t - {\mathbf {f}}^c_s\right\| ^2_F+ \frac{\eta }{2}\left\| \sum _{c=1}^{C}{\mathbf {x}}^c_t * {\mathbf {f}}^c_t - {\mathbf {r}}\right\| ^2_F\\&\quad + \frac{\gamma }{2}\sum _{c=1}^{C}\left\| {\mathbf {f}}^c_t - {\mathbf {g}}^c_t + {\mathbf {h}}^c_t\right\| ^2_F. \end{aligned} \end{aligned}$$
(11)
Then, the following subproblems are alternately optimized via ADMM formulation.
$$\begin{aligned} \begin{aligned} {\left\{ \begin{array}{ll} {\mathbf {f}}^{i+1} = \underset{{\mathbf {f}}}{\arg \min }\frac{1}{2}\left\| \sum _{c=1}^{C}{\mathbf {x}}^c_t * {\mathbf {f}}^c_t - {\mathbf {y}}\right\| ^2_F + \frac{\mu }{2}\sum _{c=1}^{C}\left\| {\mathbf {f}}^c_t - {\mathbf {f}}^c_s\right\| ^2_F\\ \qquad +\frac{\eta }{2}\left\| \sum _{c=1}^{C}{\mathbf {x}}^c_t * {\mathbf {f}}^c_t - {\mathbf {r}}\right\| ^2_F + \frac{\gamma }{2}\sum _{c=1}^{C}\left\| {\mathbf {f}}^c_t - {\mathbf {g}}^c_t + {\mathbf {h}}^c_t\right\| ^2_F\\ \\ {\mathbf {g}}^{i+1} = \underset{{\mathbf {g}}}{\arg \min }\frac{\lambda }{2}\sum _{c=1}^{C}\left\| {\mathbf {w}}_t \odot {\mathbf {g}}^c_t\right\| ^2_F + \frac{\gamma }{2}\sum _{c=1}^{C}\left\| {\mathbf {f}}^c_t - {\mathbf {g}}^c_t + {\mathbf {h}}^c_t\right\| ^2_F\\ \\ {\mathbf {h}}^{i+1} = {\mathbf {h}}^i+{\mathbf {f}}^{i+1}-{\mathbf {g}}^{i+1} \end{array}\right. }. \end{aligned}\nonumber \\ \end{aligned}$$
(12)
Subproblem \({\mathbf {f}}\): The first subproblem of Eq. (12) can be transformed into the frequency domain using Parseval’s theorem as,
$$\begin{aligned} \begin{aligned} \widehat{{\mathbf {f}}}^*&= \underset{\widehat{{\mathbf {f}}}}{\arg \min }\frac{1}{2}\left\| \sum _{c=1}^{C}\widehat{{\mathbf {x}}}^c_t \odot \widehat{{\mathbf {f}}}^c_t - \widehat{{\mathbf {y}}}\right\| ^2_F + \frac{\mu }{2}\sum _{c=1}^{C}\left\| \widehat{{\mathbf {f}}}^c_t - \widehat{{\mathbf {f}}}^c_s\right\| ^2_F\\&\quad + \frac{\eta }{2}\left\| \sum _{c=1}^{C}\widehat{{\mathbf {x}}}^c_t \odot \widehat{{\mathbf {f}}}^c_t - \widehat{{\mathbf {r}}}\right\| ^2_F + \frac{\gamma }{2}\sum _{c=1}^{C}\left\| \widehat{{\mathbf {f}}}^c_t - \widehat{{\mathbf {g}}}^c_t + \widehat{{\mathbf {h}}}^c_t\right\| ^2_F, \end{aligned} \end{aligned}$$
(13)
where \(\widehat{\cdot }\) denotes the discrete Fourier transform (DFT). The j-th element of the label \(\widehat{{\mathbf {y}}}\) relies on the j-th elements of the sample \(\widehat{{\mathbf {x}}}_t\) and the filter \(\widehat{{\mathbf {f}}}_t\) across all C channels. \({\mathcal {V}}_j\left( {\mathbf {f}}\right) \in {\mathbb {R}}^C\) denotes the vector consisting of the j-th elements of \({\mathbf {f}}\) along the channels. Equation (13) can be further decomposed into \(M \times N\) subproblems, where each subproblem is defined as,
$$\begin{aligned} \begin{aligned} {{\mathcal {V}}_j(\widehat{{\mathbf {f}}}^*)}&= \underset{{\mathcal {V}}_j(\widehat{{\mathbf {f}}})}{\arg \min }\frac{1}{2}\left\| {\mathcal {V}}_j(\widehat{{\mathbf {x}}}_t)^{\mathrm {T}}{\mathcal {V}}_j(\widehat{{\mathbf {f}}}_t) - \widehat{{\mathbf {y}}}_j\right\| ^2_F\\&\quad + \frac{\mu }{2}\left\| {\mathcal {V}}_j(\widehat{{\mathbf {f}}}_t) - {\mathcal {V}}_j(\widehat{{\mathbf {f}}}_s)\right\| ^2_F\\&\quad + \frac{\eta }{2}\left\| {\mathcal {V}}_j(\widehat{{\mathbf {x}}}_t)^{\mathrm {T}}{\mathcal {V}}_j(\widehat{{\mathbf {f}}}_t) - \widehat{{\mathbf {r}}}_j\right\| ^2_F + \frac{\gamma }{2}\left\| {\mathcal {V}}_j(\widehat{{\mathbf {f}}}_t) \right. \\&\quad \left. - {\mathcal {V}}_j(\widehat{{\mathbf {g}}}_t) + {\mathcal {V}}_j(\widehat{{\mathbf {h}}}_t)\right\| ^2_F, \end{aligned} \end{aligned}$$
(14)
where the superscript \(^\mathrm {T}\) on a complex vector or matrix indicates the conjugate transpose operation. Setting the derivative of Eq. (14) to zero, the closed-form solution of \({\mathcal {V}}_j(\widehat{{\mathbf {f}}}^*)\) can be expressed as,
$$\begin{aligned} {\mathcal {V}}_j(\widehat{{\mathbf {f}}}^*) = \left[ \left( 1+\eta \right) {\mathcal {V}}_j(\widehat{{\mathbf {x}}}_t){\mathcal {V}}_j(\widehat{{\mathbf {x}}}_t)^\mathrm {T} + \left( \mu +\gamma \right) \right] ^{-1}{\mathbf {q}}, \end{aligned}$$
(15)
where the vector \({\mathbf {q}} = {\mathcal {V}}_j(\widehat{{\mathbf {x}}}_t)\widehat{{\mathbf {y}}}_j + \eta {\mathcal {V}}_j(\widehat{{\mathbf {x}}}_t)\widehat{{\mathbf {r}}}_j + \gamma {\mathcal {V}}_j(\widehat{{\mathbf {g}}}_t) - \gamma {\mathcal {V}}_j(\widehat{{\mathbf {h}}}_t) + \mu {\mathcal {V}}_j(\widehat{{\mathbf {f}}}_s)\). Since \({\mathcal {V}}_j(\widehat{{\mathbf {x}}}_t){\mathcal {V}}_j(\widehat{{\mathbf {x}}}_t)^\mathrm {T}\) is a rank-1 matrix, Eq. (15) can be further rewritten via the Sherman–Morrison formula [32] as,
$$\begin{aligned} {\mathcal {V}}_j(\widehat{{\mathbf {f}}})^*=\frac{1}{\mu +\gamma }\left[ {\mathbf {I}}-\frac{{\mathcal {V}}_{j}(\widehat{{\mathbf {x}}}){\mathcal {V}}_{j}(\widehat{{\mathbf {x}}})^\mathrm {T}}{\frac{\mu +\gamma }{1+\eta }+{\mathcal {V}}_{j}(\widehat{{\mathbf {x}}})^\mathrm {T} {\mathcal {V}}_{j}(\widehat{{\mathbf {x}}})}\right] {\mathbf {q}}. \end{aligned}$$
(16)
Note that Eq. (16) only contains vector multiply–add operations, so it can be computed efficiently. \({\mathbf {f}}\) can then be obtained by the IDFT of \(\widehat{{\mathbf {f}}}\).
Subproblem \({\mathbf {g}}\): For the second subproblem of Eq. (12), each element of \({\mathbf {g}}\) can be computed independently as,
$$\begin{aligned} \begin{aligned} {\mathbf {g}}^* = \frac{\gamma \left( {\mathbf {f}} + {\mathbf {h}}\right) }{\lambda \left( {\mathbf {w}} \odot {\mathbf {w}}\right) + \gamma {\mathbf {I}}}. \end{aligned} \end{aligned}$$
(17)
Lagrangian multiplier update: The Lagrange multiplier is updated as,
$$\begin{aligned} \begin{aligned} {\mathbf {h}}^{i+1} = {\mathbf {h}}^i + {\mathbf {f}}^{*(i+1)} - {\mathbf {g}}^{*(i+1)}, \end{aligned} \end{aligned}$$
(18)
where the superscript i represents the i-th iteration. \({\mathbf {f}}^*\) and \({\mathbf {g}}^*\) are the solutions of subproblems \({\mathbf {f}}\) and \({\mathbf {g}}\), respectively.
By solving the aforementioned subproblems iteratively, the optimal filter \({\mathbf {f}}^*\) of the t-th frame can be obtained and then used for tracking at \((t+1)\)-th frame.
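A schematic NumPy implementation of the ADMM iterations in Eq. (12) is sketched below, combining the per-pixel closed form of Eq. (16), the element-wise solution of Eq. (17) and the multiplier update of Eq. (18). The default hyper-parameters follow the experimental setup described later; the variable names and the overall structure are ours and not the authors' released code.

```python
import numpy as np

def admm_star(Xh, Yh, Rh, Fs_h, w, lam=1.0, mu=10.0, eta=0.1,
              gamma=1.0, gamma_max=1000.0, rho=10.0, n_iter=3):
    """Schematic ADMM solver for Eq. (12).

    Fourier-domain inputs: Xh (M, N, C) features, Yh (M, N) label,
    Rh (M, N) peak-aligned previous response, Fs_h (M, N, C) reference
    filter; w (M, N) is the spatial weight in the image domain.
    Returns the learned filter in the Fourier domain.
    """
    Gh = np.zeros_like(Xh)
    Hh = np.zeros_like(Xh)
    for _ in range(n_iter):
        # f-subproblem: per-pixel closed form of Eq. (16)
        q = (Xh * Yh[..., None] + eta * Xh * Rh[..., None]
             + gamma * Gh - gamma * Hh + mu * Fs_h)
        sx = np.sum(np.conj(Xh) * Xh, axis=2, keepdims=True)   # x^T x per pixel
        xq = np.sum(np.conj(Xh) * q, axis=2, keepdims=True)    # x^T q per pixel
        Fh = (q - Xh * xq / ((mu + gamma) / (1.0 + eta) + sx)) / (mu + gamma)
        # g-subproblem: element-wise closed form of Eq. (17)
        f = np.real(np.fft.ifft2(Fh, axes=(0, 1)))
        h = np.real(np.fft.ifft2(Hh, axes=(0, 1)))
        g = gamma * (f + h) / (lam * (w * w)[..., None] + gamma)
        Gh = np.fft.fft2(g, axes=(0, 1))
        # Lagrange multiplier update (Eq. (18)) and step-size schedule
        Hh = Hh + Fh - Gh
        gamma = min(gamma_max, rho * gamma)
    return Fh
```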

Target localization

The response map \({\mathbf {R}}_t\) at the t-th frame in Fourier domain can be calculated as,
$$\begin{aligned} \widehat{{\mathbf {R}}}_t = \sum _{c=1}^{C}\widehat{{\mathbf {x}}}^c_t \odot \widehat{{\mathbf {f}}}^{*c}_{t-1}. \end{aligned}$$
(19)
After computing the IDFT on \(\widehat{{\mathbf {R}}}\) to obtain the response map \({\mathbf {R}}_t\), the location can be predicted based on the maximum value of the response map. The overall tracking algorithm of the STAR model is summarized in Algorithm 1.
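The detection step can be sketched as follows, transcribing Eq. (19) and taking the argmax of the spatial-domain response; the function name and variable names are illustrative.

```python
import numpy as np

def localize(x, f_prev):
    """Compute the response map of Eq. (19) and return its peak position.

    x      : (M, N, C) features extracted from the current search region
    f_prev : (M, N, C) optimal filter learned at the previous frame
    """
    Xh = np.fft.fft2(x, axes=(0, 1))
    Fh = np.fft.fft2(f_prev, axes=(0, 1))
    R = np.real(np.fft.ifft2(np.sum(Xh * Fh, axis=2)))   # back to the spatial domain
    dy, dx = np.unravel_index(np.argmax(R), R.shape)     # maximum of the response
    return (dy, dx), R
```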

Experimental results

Evaluation metrics

Quantitative and qualitative experiments are conducted on four tracking benchmarks, i.e., TC128 [29], OTB2013 [43], OTB2015 [44] and UAV123 [33]. For these benchmarks, the success and precision plots under the one-pass evaluation (OPE) protocol [43, 44] are utilized. The area under the curve (AUC) of the success plot and the distance precision (DP) at a threshold of 20 pixels are adopted as the evaluation metrics to measure tracking accuracy. Meanwhile, the speed is measured in frames per second (FPS). For the sake of fair comparison, the compared trackers are based on publicly available code or results reported in the original papers.
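For reference, the two OPE metrics can be computed from per-frame bounding boxes as sketched below; the 21 evenly spaced overlap thresholds used for the success AUC are a common convention and an assumption of this sketch.

```python
import numpy as np

def _centers(boxes):
    """Centers of (x, y, w, h) boxes of shape (T, 4)."""
    return boxes[:, :2] + boxes[:, 2:] / 2.0

def _iou(pred, gt):
    """Per-frame intersection-over-union of (x, y, w, h) boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def distance_precision(pred, gt, thresh=20.0):
    """DP: fraction of frames whose center location error is within 20 px."""
    err = np.linalg.norm(_centers(pred) - _centers(gt), axis=1)
    return np.mean(err <= thresh)

def success_auc(pred, gt, thresholds=np.linspace(0.0, 1.0, 21)):
    """AUC of the success plot: mean success rate over overlap thresholds."""
    iou = _iou(pred, gt)
    return np.mean([np.mean(iou > t) for t in thresholds])
```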

Experimental setup

The experiments are conducted on a PC equipped with an i7-9700K CPU and an NVIDIA GTX 1080Ti GPU using MATLAB R2017a and the MatConvNet toolbox. We combine the output of the Conv-3 layer of the VGG-M network [36] with HOG+CN features for target representation. The coefficients of the spatial, temporal and aberrance suppressed regularizers are set to \(\lambda = 1\), \(\mu = 10\) and \(\eta = 0.1\), respectively. The step size parameter \(\gamma \) is initialized to 1 and updated by \(\gamma ^{i+1}=\min \left( \gamma _{\max },\;\rho \gamma ^{i}\right) \), where \(\rho = 10\) and \(\gamma _{\max } = 1000\). The other hyper-parameters are set to \(\delta = 0.1\) and \(\zeta = 0.7\), and the number of ADMM iterations is set to \(N = 3\). To make a fair comparison, the parameters of the STAR tracker are fixed throughout the experiments.

Quantitative evaluation

Evaluation on TC128

The TC128 benchmark [29] contains 128 challenging color sequences. We compare the proposed STAR tracker with SOTA DCF-based trackers, e.g., MCCT [41], LADCF-HC [45], MCCT-HC [41], STRCF [25], ECO-HC [14], CFWCR [20], MCPF [46], UDT+ [42], ARCF [23], UDT [42], AutoTrack [28], STAPLE_CA [34], ARCF-H [23], DR2Track [17], BACF [18], TB-BiCF [30], RSST [47] and fDSST [15]. The success and precision plots of the evaluated trackers are depicted in Fig. 4, and the comparative results in accuracy and speed are shown in Table 1. The STAR obtains scores of 0.582 and 0.780 in AUC and DP, respectively, which outperform all the compared trackers. Specifically, compared with STRCF [25], which only adopts spatio-temporal regularization, the STAR increases the AUC and DP by 3.4 and 3.6%, respectively. Compared with ARCF [23], which only applies the aberrance suppressed strategy, the STAR gains an increase of 6.3 and 7.7% in AUC and DP, respectively. The performance improvement can be attributed to the joint effect of the dynamic spatio-temporal regularizer and the aberrance suppressed regularizer. In addition, the STAR runs at a speed of 10.6 fps, which is competitive compared with other deep feature-based trackers, i.e., RSST (1.5 fps), UDT+ (19.8 fps), MCPF (0.5 fps), CFWCR (10.2 fps) and MCCT (2.7 fps).
Table 1
Comparative results of the evaluated trackers on TC128 in accuracy and speed

Trackers   fDSST   RSST   TB-BiCF   BACF     DR2Track   ARCF-H   STAPLE_CA   AutoTrack   UDT     ARCF
AUC        0.432   0.470  0.479     0.486    0.492      0.494    0.506       0.513       0.517   0.519
DP         0.571   0.643  0.651     0.642    0.667      0.668    0.679       0.700       0.687   0.703
FPS        130.0   1.5*   49.2      36.4     55.8       51.4     52.4        34.1        14.9    32.0

Trackers   UDT+    MCPF   CFWCR     ECO-HC   STRCF      MCCT-H   LADCF-HC    MCCT        Ours
AUC        0.541   0.542  0.542     0.547    0.548      0.551    0.556       0.572       0.582
DP         0.728   0.751  0.740     0.732    0.744      0.742    0.744       0.774       0.780
FPS        19.8*   0.50*  10.2*     60.5     20.6       43.2     21.6        2.7*        10.6*

Note that the numbers marked with * indicate the speed when running on the GPU

Evaluation on OTB2013 and OTB2015

The OTB2013 and OTB2015 are two popular tracking benchmarks, which consist of 50 and 100 video sequences, respectively. We compare the proposed STAR with several representative trackers, including ECO [14], DeepSTRCF [25], STRCF [25], LADCF-HC [45], CFWCR [20], MCCT-HC [41], BACF [18], ECO-HC [14], UDT [42], ARCF [23], ARCF-H [23], UDT+ [42], AutoTrack [28], STAPLE_CA [34], TB-BiCF [30], fDSST [15], RSST [47] and DR2Track [17]. The overall comparison results on OTB2013 [43] and OTB2015 [44] are presented in Fig. 5.
On the OTB2013 benchmark, the proposed STAR achieves the best AUC (0.688) and the second-best DP (0.892). Compared with the feature selection-based tracker, i.e., LADCF-HC, the STAR improves the AUC and DP by 1.6 and 2.8%, respectively. Compared with UDT, which is trained in an unsupervised manner, the STAR improves the AUC and DP by 6.1 and 6.6%, respectively.
On the OTB2015 benchmark, the proposed STAR achieves scores of 0.672 and 0.875 in AUC and DP, respectively. Compared with the BACF tracker, which uses a constant spatial regularizer, the STAR improves the AUC by 5.7% and the DP by 5.9%. This mainly benefits from the dynamic spatial regularizer, which imposes different penalties on spatial positions based on the response variation rate and produces more reliable filter coefficients at the tracking stage.

Evaluation on UAV123

Compared with generic object tracking, UAV-based tracking aims to locate a target from a low-altitude aerial perspective, which poses new challenges, e.g., rapid changes in scale and perspective, limited pixels in the target region, and multiple similar distractors [48]. The compared trackers include CFWCR [20], DeepSTRCF [25], UDT+ [42], ECO-HC [14], LADCF-HC [45], STRCF [25], UDT [42], AutoTrack [28], TB-BiCF [30], ARCF [23], RSST [47], DR2Track [17], BACF [18], MCCT-H [41], ARCF-H [23], STAPLE_CA [34] and fDSST [15]. The comparative results are presented in Fig. 6. The STAR ranks first and third in AUC (0.516) and DP (0.723), respectively. Compared with other DCF-based trackers, e.g., ECO-HC, AutoTrack and DR2Track, STAR increases the AUC by 2.3, 4.0 and 5.7%, and the DP by 1.5, 3.3 and 6.1%, respectively. Compared with DeepSTRCF, which adopts spatio-temporal regularization and multiple features (CNN+HOG+CN), STAR increases the AUC and DP by 0.8 and 1.8%, respectively. This can be attributed to the dynamic spatio-temporal regularizer, which effectively alleviates the boundary effect and filter degradation and provides a robust appearance model.

Attribute evaluation

To analyze the ability to handle different challenges, attribute-based evaluations are performed. There are 12 attributes in the UAV123 benchmark, i.e., partial occlusion (POC), full occlusion (FOC), fast motion (FM), illumination variation (IV), aspect ratio change (ARC), similar object (SOB), scale variation (SV), out-of-view (OV), background clutter (BC), viewpoint change (VC), camera motion (CM) and low resolution (LR). The success and precision plots of the evaluated trackers under these challenging attributes are presented in Figs. 7 and 8, respectively. The proposed STAR achieves the best AUC on several attributes, including POC (0.444), CM (0.511), ARC (0.454), VC (0.483), OV (0.436) and FM (0.419). Meanwhile, the proposed tracker achieves the best DP of 0.667, 0.626 and 0.654 in terms of POC, OV and FM, respectively.
Table 2
Ablation studies of the critical components in STAR on OTB2013

Trackers                     AUC     DP      FPS
Baseline                     0.642   0.841   8.27
Baseline + DSR               0.670   0.873   6.82
Baseline + DTR               0.656   0.857   7.68
Baseline + AR                0.663   0.865   7.14
Baseline + DSR + DTR + AR    0.688   0.892   5.55

The best results are shown in bold

Ablation studies

Ablation studies on OTB2013 [43] are conducted to demonstrate the effectiveness of the key components of the proposed STAR tracker. The key components include the dynamic spatial regularizer (DSR), dynamic temporal regularizer (DTR) and aberrance suppressed regularizer (AR). We evaluate five variants, i.e., “Baseline” (the standard DCF tracker described in “Revisit the standard DCF”, which adopts the same feature representation as STAR), “Baseline+DSR”, “Baseline+DTR”, “Baseline+AR” and “Baseline+DSR+DTR+AR” (i.e., the final STAR tracker). The ablation results are reported in Table 2. The baseline tracker achieves scores of 0.642 and 0.841 in AUC and DP, respectively. Introducing the “DSR”, “DTR” and “AR” components into the “Baseline” gradually improves the tracking performance. Finally, the proposed STAR, which integrates all the key components, surpasses the “Baseline” by 4.6 and 5.1% in AUC and DP, respectively.

Qualitative evaluations

To intuitively exhibit the superiority of the STAR tracker, six sets of screenshots of the tracking results from OTB2015, i.e., biker, bird2, box, football, human4 and soccer (from top to bottom), are shown in Fig. 9. The targets in these sequences undergo challenging attributes such as rotation, scale variation, occlusion, motion blur, and fast motion. The compared trackers include AutoTrack [28], ARCF [23], CFWCR [20], ECO [14], LADCF-HC [45], STRCF [25] and TB-BiCF [30]. The proposed STAR (in red) achieves much better tracking precision than the other SOTA trackers. Specifically, in the “biker” sequence, in which the target suffers from fast motion and motion blur, most of the compared trackers fail at frame 70. The attributes of the “soccer” sequence include occlusion and background clutter, causing most compared trackers to fail at frame 365. In contrast, the proposed STAR achieves satisfactory performance on these sequences.

Conclusion

In this paper, we propose a novel spatio-temporal joint aberrance suppressed regularization (STAR) correlation filter for robust visual tracking. The STAR tracker takes full advantage of spatio-temporal information and employs an aberrance suppressed strategy. The dynamic spatio-temporal regularizer effectively alleviates the boundary effect and filter degradation, while the aberrance suppressed strategy reduces the interference caused by background clutter. Besides, the STAR tracker is efficiently optimized based on the ADMM formulation. Comprehensive experiments on four tracking benchmarks demonstrate the superiority of the proposed method against the SOTA trackers.

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (No. 61801272).

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
3.
6. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2010) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3:1–122
7. Choi J, Chang HJ, Yun S, Fischer T, Demiris Y, Choi JY (2017) Attentional correlation filter network for adaptive visual tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4828–4837. https://doi.org/10.1109/CVPR.2017.513
19. Grabner H, Grabner M, Bischof H (2006) Real-time tracking via on-line boosting. In: Proceedings of the British machine vision conference
28.
36. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the international conference on learning representations
37. Valmadre J, Bertinetto L, Henriques J, Vedaldi A, Torr PHS (2017) End-to-end representation learning for correlation filter based tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5000–5008. https://doi.org/10.1109/CVPR.2017.531
Metadata
Title: Spatio-temporal joint aberrance suppressed correlation filter for visual tracking
Authors: Libin Xu, Pyoungwon Kim, Mengjie Wang, Jinfeng Pan, Xiaomin Yang, Mingliang Gao
Publication date: 25.09.2021
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems, Issue 5/2022
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-021-00544-1
