Published in: Complex & Intelligent Systems 1/2024

Open Access 15-09-2023 | Original Article

RCFT: re-parameterization convolution and feature filter for object tracking

Authors: Yuanyun Wang, Wenhui Yang, Peng Yin, Jun Wang

Abstract

Siamese-based trackers have been widely studied for their high accuracy and speed. Feature extraction and feature fusion are two important components in Siamese-based trackers. Siamese-based trackers obtain fine local features by traditional convolution; however, some important channel information and global information are lost while enhancing local features. In the feature fusion process, cross-correlation-based fusion between the template and search region features ignores the global spatial context and does not make the best of the spatial information. In this paper, to solve the above problems, we design a novel feature extraction sub-network based on batch-free normalization re-parameterization convolution, which scales the features in the channel dimension and increases the receptive field. Richer channel information is obtained and powerful target features are extracted for feature fusion. Furthermore, we learn a feature fusion network (FFN) based on a feature filter. The FFN fuses the template and search region features in a global spatial context to obtain high-quality fused features by enhancing important features and filtering redundant ones. By jointly learning the proposed feature extraction sub-network and FFN, the local and global information are fully exploited. We then propose a novel tracking algorithm based on the designed feature extraction sub-network and FFN with re-parameterization convolution and feature filter, referred to as RCFT. We evaluate the proposed RCFT tracker against some recent state-of-the-art (SOTA) trackers on OTB100, VOT2018, LaSOT, GOT-10k, UAV123 and the visual–thermal dataset VOT-RGBT2019. RCFT achieves superior tracking performance at a tracking speed of 45 FPS.

Introduction

Object tracking, as a classical research topic in computer vision [1], has various applications such as human–computer interaction, video editing and military reconnaissance. In real scenarios, achieving accurate and real-time tracking remains a challenging problem because of complex object appearance and scene variations.
In recent years, research on object tracking has focused on improving tracking accuracy and tracking speed. In terms of tracking speed, correlation filtering is one of the best frameworks. The earliest such tracker, visual object tracking using adaptive correlation filters (MOSSE) [2], uses simple hand-crafted features and runs at nearly 700 FPS (frames per second). Correlation filtering techniques help trackers improve tracking speed [3]. However, correlation filtering algorithms are not robust to complex appearance changes, and their tracking performance degrades significantly. In terms of accuracy, deep learning-based tracking algorithms cope better with complex appearance variations, but they are not very fast. To further improve tracking accuracy, researchers have proposed algorithms that combine deep features and correlation filtering, such as efficient convolution operators for tracking (ECO) [4]. However, the tracking speed of these algorithms is still slow.
Recently, with the use of deep learning technology, the performance of tracking algorithms has been greatly improved. Among them, Siamese network-based tracking algorithms have attracted extensive attention because of their high speed and accuracy as well as their excellent performance on many datasets. Siamese network-based trackers compute the similarity between the inputs of two branches very well. The algorithms first perform point-by-point template matching in the search region and then take the candidate with the maximum similarity as the tracking result in the current frame.
SiamFC [5] uses a fully convolutional neural network trained offline to match a template image with the search region and predict the location of the tracked target. Several Siamese network-based tracking algorithms use SiamFC as a baseline to improve tracking performance, such as high-performance visual tracking with Siamese region proposal network (SiamRPN) [6] and Siamese fully convolutional classification and regression for visual tracking (SiamCAR) [7]. SiamRPN [6] introduces a region proposal network (RPN) to better distinguish foreground from background and, at the same time, better handle scale variations. SiamCAR [7] utilizes a deep backbone network while optimizing the classification branches to achieve a simple and efficient model.
Although Siamese network-based trackers have achieved large performance improvements, they have some disadvantages. (1) They use traditional convolutional neural networks for feature extraction, which produce pixel-based, local features after convolution. However, much intrinsic information (such as location information and global information) is lost while enhancing local features. Making the best of this intrinsic information is key to improving the feature extraction capability. (2) Existing Siamese network-based trackers use cross-correlation operations for feature fusion and reach good tracking performance. However, this fusion approach ignores the global spatial context; the spatial information, which is crucial for target localization, is not fully utilized.
Inspired by the above works, we design a novel feature extraction sub-network and feature fusion network for object tracking, as shown in Fig. 1. We improve online convolutional re-parameterization [8] by replacing the batch normalization layer with batch-free normalization (BFN) [9], referred to as batch-free normalization re-parameterization convolution (BFN-REP). A feature extraction sub-network is designed by introducing the BFN-REP convolution block into ResNet50 [10] to replace the traditional convolution blocks. The feature extraction network scales the features in the channel dimension, increases the receptive field and obtains more channel information as well as global information. Powerful target features are then obtained for feature fusion. A novel feature fusion network (FFN) is proposed by adding a feature filter (FF). The FF fuses template and search region features in the global spatial context to produce high-quality fused features. By integrating the proposed feature extraction sub-network and the feature fusion network (FFN), we propose an effective tracking algorithm named RCFT. The main contributions are summarized as follows.
  • We propose a novel feature extraction sub-network based on the designed BFN-REP convolutional module and the ResNet50 backbone network. The sub-network is capable of scaling features in the channel dimension, increasing the receptive field and obtaining more channel information, as well as preventing degradation of the network model and obtaining a powerful representation capability.
  • We propose a feature fusion network (FFN) with a feature filter to fuse template features and search region features. High-quality fused features are obtained by the feature filter. Compared with the cross-correlation operation, the proposed feature fusion network focuses more on the important element locations and contents in the global spatial context and is able to gain powerful target features.
  • We jointly learn the proposed feature extraction sub-network and feature fusion network. Then, an effective object tracking algorithm is proposed, named RCFT. Experiments on six challenging benchmarks including OTB100 [11], VOT2018 [12], LaSOT [13], GOT-10k [14], UAV123 [15] and a visual–thermal dataset VOT-RGBT2019 [16] demonstrate that the proposed RCFT achieves superior tracking performance.
The rest of this paper is organized as follows. “Related work” summarizes the related works. “Method” describes the proposed tracking algorithm in detail. “Experiments and result” analyzes the experimental results and compares the proposed RCFT tracker with state-of-the-art (SOTA) trackers. Finally, “Conclusion” concludes this paper.

Related work

Recently, many advanced trackers have been proposed to address various challenges. In this section, we mainly review tracking methods and techniques related to the proposed tracker, including correlation filtering and deep learning-based trackers, Siamese network-based trackers, attention mechanism-based trackers, and feature extraction and fusion networks for visual tracking.

Correlation filtering and deep learning-based trackers

In recent years, the combination of deep learning and correlation filtering has been widely used in object tracking. To make full use of the advantages of both, accurate tracking by overlap maximization (ATOM) [17] combines deep learning and correlation filtering for object tracking. Furthermore, learning discriminative model prediction for tracking (DiMP) [18] uses background information to learn nonlinear filters through an optimization method, significantly improving the discriminative ability of the network.
Considering state confidence estimation, probabilistic regression for visual tracking (PrDiMP) [19] uses an energy-based model to predict the unnormalized probability density of the target box. Since the appearance of the target changes continuously during tracking, know your surroundings (KYS) [20] proposes a discriminative correlation filtering method based on scene information, propagates a dense local state vector over the time sequence, and combines it with the appearance model to locate the target.

Siamese network-based trackers

Recently, tracking algorithms built on Siamese networks have shown great potential. SiamFC [5] pioneers the use of Siamese networks in object tracking, replacing traditional correlation filtering with convolution. The fully convolutional structure matches the template image with a candidate region and outputs a response map. The point with the highest response value is found in the response map, and the candidate region corresponding to that point is taken as the predicted target location.
Valmadre et al. [21] regard correlation filtering as a network layer embedded in the Siamese network and combine the characteristics of correlation filtering with the Siamese network to improve tracking performance. Dynamic Siamese network (DSiam) [22] learns the variations of target appearance from historical frames in an online manner and achieves effective background suppression by separating and extracting foreground and background information. From the perspective of localization, Siamese network-based tracking algorithms mainly include anchor-based and anchor-free methods.
Among anchor-based methods, SiamRPN [6] combines Siamese networks with a region proposal network (RPN), places anchor boxes with different aspect ratios on the feature map, and performs classification and regression simultaneously, which handles scale variations well. SiamRPN [6] has achieved strong performance on multiple datasets and has attracted extensive attention and research.
To perceive target location states, localization-aware target confidence (SiamLA) [23] utilizes a location-aware feature aggregation module to generate a perceived target confidence score. Since RPN has a weak ability to discriminate between similar targets, distractor-aware Siamese network (DaSiamRPN) [24] introduces a distractor-aware module to enhance the intra-class discriminative power of the network model. To take advantage of deep networks, evolution of Siamese visual tracking with very deep networks (SiamRPN++) [25] uses multi-layer aggregation to fuse shallow and deep features with SiamRPN. Subsequently, many advanced algorithms utilize deep backbone networks for feature extraction, such as SiamDW [26], SiamR-CNN [27] and SiamRCR [28].
Most anchor-based tracking algorithms are implemented with a sliding window approach, which generates a large number of anchors, increases the computational complexity and reduces real-time performance. Therefore, to avoid these anchor-related problems, some researchers have proposed anchor-free networks and abandoned the anchor-box strategy that requires prior knowledge. SiamFC++ [29] decomposes object tracking into a classification task and a state estimation task and adopts an anchor-free design, which weakens the negative influence of the anchor-box mechanism on generalization ability in the SiamRPN [6] framework. Both SiamCAR [7] and SiamBAN [30] directly classify and regress the distances to the bounding box boundaries on a pixel-by-pixel basis, without anchor boxes. Ocean [31] further integrates an online classification branch and a target-aware branch to improve the robustness of the model.

Attention mechanism-based trackers

Attention mechanisms [32, 33] have been applied to vision tasks such as object detection and image classification. To focus more on features that are important in spatial and channel locations, Wang et al. [34] use a residual attention mechanism to enhance key features of an image. Hu et al. [35] introduce a compact module to mine the relationships between channels and use features from the average pooling layer to compute channel attention.
The attention mechanism is often used to generate spatial saliency maps. ToPG [36] uses a color histogram to generate saliency maps and generates target samples during training and tracking. Avytekin et al. [37] achieve better tracking results by fusing saliency-based features with convolutional features. Besides generating spatial saliency maps, attention mechanisms can also select importance among image frames. FlowTrack [38] dynamically learns the importance of different historical frames for feature representation by embedding an attention module in the neural network.
Some variants of attention and self-attention are used in object tracking. Yu et al. [69] propose a deformable attention mechanism to improve the representation of target features and to distinguish foreground targets from the background. SiamTPN [39] utilizes a pooled attention module to reduce the computational load imposed by the transformer. TransT [40] uses self-attention to efficiently fuse the features of the two branches of the Siamese network. TrDiMP [41] uses self-attention in the transformer encoder to enhance template features and obtain rich contextual information from consecutive frames. In the decoder, the information obtained in the encoder is then propagated by cross-attention.

Feature extraction and fusion network for visual tracking

In recent years, although Siamese trackers using AlexNet [42] as the backbone have achieved good performance, these trackers are prone to tracking drift when dealing with complex appearance variations. Therefore, some researchers have utilized deep networks such as VGG [43], Inception [44] and ResNet [10] instead of shallow networks, as in ATOM [17], DiMP [18] and TCTrack [45]. Different optimizations of the backbone network aim to extract target features more accurately and make the feature representation more comprehensive and informative. Traditional convolutional neural networks treat the image features within each channel equally when extracting features; thus, features that are more important for tracking are not enhanced and redundant features are not suppressed. Different from traditional convolution, our proposed BFN-REP convolution scales features across channel dimensions and enhances favorable features while preventing network model degradation.
Feature fusion is an important component in Siamese-based trackers, which computes the similarity between the two branches of the Siamese network [29, 46]. SiamFC [5] uses cross-correlation (XCorr) for similarity calculation and obtains the desired response map. The position corresponding to the maximum score in the response map is considered the target position. SiamRPN++ [25] improves XCorr with depth-wise cross-correlation (DW-XCorr), which performs channel-by-channel correlation on the feature maps to obtain efficient information connectivity. Additionally, since cross-correlation for global matching largely ignores the target structure and part-level hierarchical information, graph attention tracking (SiamGAT) [47] designs a part-to-part information embedding network using a graph attention module.
Although progress has been made with the above fusion methods, cross-correlation-based feature fusion does not take advantage of global information. To address this problem, some recent research has utilized transformers for feature fusion. Transformer meets tracker (TrSiam) [41] uses a transformer to augment deep convolutional features for object tracking. Visual tracking with transformer (TrTr) [48] uses a transformer to obtain rich contextual information for improving the similarity computation. Different from the above feature fusion methods, we use a feature fusion network with a feature filter to fuse the features of the two branches and obtain high-quality fused features. With the designed fusion method, important feature information is highlighted and used for similarity calculation with the template features.

Method

In this section, we describe the proposed tracker RCFT in detail. As shown in Fig. 1, the proposed RCFT consists of three major components: the feature extraction sub-network with BFN-REP convolution, the feature fusion network based on the feature filter, and a tracking prediction head. Next, we analyze the three components in turn.

Feature extraction with BFN-REP convolution

Siamese network-based trackers generally use convolutional neural networks to extract features from target images. These trackers use traditional convolution to extract image features in each channel. Since each channel is treated equally, the importance of different channels is not represented accurately.
To reduce the loss of other intrinsic information while enhancing local features, we design a feature extraction sub-network based on BFN-REP convolution. This sub-network scales the features in the channel dimension to obtain more channel information and increase the receptive field. Since BFN-REP convolution also integrates BFN, it is able to prevent degradation of the network model. As shown in Fig. 1, our tracker uses BFN-REP convolution in the first three convolutional layers of the modified ResNet50 to further enhance the feature extraction capability. Next, we describe the technical details of the BFN-REP convolution-based feature extraction sub-network.
Since the two branches of the Siamese network share weights in the feature extraction network, we only describe the feature extraction sub-network of the template branch in detail. As shown in Fig. 2, the BFN-REP convolution has D convolutional branches before compression. First, each branch is compressed. Here, we assume that each branch has N layers of convolution operations, and the number of channels in each layer is denoted by \(C_{n}\), i.e., \(C_{n} \in \left[ C_{0}, C_{1}, \ldots , C_{N}\right] \). \({\textbf{Z}} \in {\mathbb {R}}^{C_0 \times H \times W}\) denotes the input of the template branch. We represent the convolution process as follows:
$$\begin{aligned} {\textbf{Z}}_{1}={\textbf{W}} * {\textbf{Z}}, \end{aligned}$$
(1)
where \({\textbf{Z}}_{1} \in {\mathbb {R}}^{C_0 \times H \times W}\) is the output. \({\textbf{W}}\) is a mapping matrix. For the second layer of convolution, we consider \({\textbf{W}} * {\textbf{Z}}\) as the input, so that \({\textbf{W}}\) can be expressed as follows:
$$\begin{aligned} {\textbf{W}}={\textbf{W}}_{N} * \left( {\textbf{W}}_{N-1} * \cdots * \left( {\textbf{W}}_{2} * {\textbf{W}}_{1}\right) \right) , \end{aligned}$$
(2)
where \({\textbf{W}}_{N}\) is the weight of the \(N\)th layer.
In the convolutional re-parameterization, a scaling layer is utilized instead of the original normalization layer. After the convolution layer, the scaling layer is used to make it possible to scale the features in the channel dimension and increase the receptive field while obtaining richer channel information. We represent the computed output of the scaling layer as follows:
$$\begin{aligned} {\textbf{Z}}_{2}=\gamma {\textbf{W}}^{c} {\textbf{Z}}_{1}, \end{aligned}$$
(3)
where \(\gamma \) is the scaling factor. \({\textbf{W}}^{c}\) is the convolution kernel corresponding to the \(c\)th output channel.
Next, we merge the D branches into a single branch. Due to the linear nature of the convolution, we represent it as follows:
$$\begin{aligned} {\textbf{Z}}_{3}= & {} \left( {\textbf{W}}_{1} * {\textbf{Z}}_{2}\right) + \left( {\textbf{W}}_{2} * {\textbf{Z}}_{2}\right) + \cdots + \left( {\textbf{W}}_{D} * {\textbf{Z}}_{2}\right) \nonumber \\= & {} \sum _{d=1}^{D}\left( {\textbf{W}}_{d} * {\textbf{Z}}_{2}\right) , \end{aligned}$$
(4)
where \({\textbf{W}}_{d}\) is the weight of the \(d\)th branch.
All the convolutional branches are merged and fed into the BFN layer. We use the scale factor \(\lambda \) from the BFN layer to indicate importance along the batch dimension. The BFN layer is expressed as follows:
$$\begin{aligned} \hat{z'}_{i}= & {} \frac{1}{\sigma }\left( z'_{i}-\mu _{i}\right) \nonumber \\ {\textbf{Z}}_{4}= & {} BFN\left( z'_{i}\right) =\lambda \cdot \hat{z'}_{i}+\beta , \end{aligned}$$
(5)
where \(z'\) is the output feature of the convolutional layer and i denotes the index along the batch dimension. \(\mu _{i}\) and \(\sigma \) are the mean and standard deviation computed over the feature values of the same channel, respectively. \(\beta \) is a learnable shift parameter.
The output of the BFN layer is fed into the activation function ReLU to obtain the final output as follows:
$$\begin{aligned} {\textbf{Z}}'= {\textbf{ReLU}} \left( {\textbf{Z}}_{4}\right) . \end{aligned}$$
(6)
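
To make the structure above concrete, the following PyTorch sketch shows one way a multi-branch re-parameterizable convolution block with per-channel scaling and a normalization-plus-ReLU output could be organized. It is a minimal illustration under our own assumptions: the module and parameter names (e.g. `BFNREPConv`, `num_branches`) are ours, the BFN layer is approximated by a standard affine normalization stand-in, and the branches are merged by summation as in Eq. (4).

```python
import torch
import torch.nn as nn

class BFNREPConv(nn.Module):
    """Minimal sketch of a multi-branch re-parameterizable conv block (hypothetical names)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, num_branches=3):
        super().__init__()
        pad = kernel_size // 2
        # D parallel convolutional branches; Eq. (4) sums their outputs.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad, bias=False)
            for _ in range(num_branches)
        )
        # Scaling factors (one gamma per branch and output channel), cf. Eq. (3).
        self.gamma = nn.Parameter(torch.ones(num_branches, out_ch, 1, 1))
        # Stand-in for the BFN layer: learnable scale and shift, cf. Eq. (5).
        self.norm = nn.GroupNorm(num_groups=1, num_channels=out_ch, affine=True)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, z):
        # Sum the scaled branch outputs, then normalize and activate (Eqs. (4)-(6)).
        out = sum(self.gamma[d] * branch(z) for d, branch in enumerate(self.branches))
        return self.relu(self.norm(out))

# Usage: such a block could replace an early ResNet50 conv block (illustrative only).
x = torch.randn(1, 64, 56, 56)
block = BFNREPConv(64, 64)
print(block(x).shape)  # torch.Size([1, 64, 56, 56])
```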

FFN for object tracking

Some Siamese-based trackers use cross-correlation for feature fusion. However, this fusion approach ignores the global spatial context, and the spatial information, which is also important for target representation, is not fully exploited. Therefore, we propose a feature fusion network based on a feature filter, which fuses the features of the template and search region branches in the global spatial context to obtain rich spatial information and high-quality fused features while enhancing important information.
As shown in Fig. 3, we present the proposed feature filter module in detail. We denote the features obtained from the template and search branches through the feature extraction network as \({F_{Z}}\) and \({F_{X}}\), respectively. We input the features \({F_{Z}}\) and \({F_{X}}\) to the proposed feature filter module. The feature filter projects \({F_{Z}}\) and \({F_{X}}\) into query (Q),  key (K) and value (V) by convolution operations, formulated as follows:
$$\begin{aligned} Q= & {} \text {Conv}_{1}\left( \text {Conv}_{3}\left( {F}_{Z}\right) \right) , \end{aligned}$$
(7)
$$\begin{aligned} K= & {} \text {Conv}_{1}\left( \text {Conv}_{3}\left( {F}_{X}\right) \right) , \end{aligned}$$
(8)
$$\begin{aligned} V= & {} \text {Conv}_{3}\left( {F}_{X}\right) , \end{aligned}$$
(9)
where \(\text {Conv}_{1}\) is the convolution operator of \(1 \times 1\) kernel, \(\text {Conv}_{3}\) is the convolution operator of \(3 \times 3\) kernel, \(Q \in {\mathbb {R}}^{d_{m} \times d_{q}}\), \(K \in {\mathbb {R}}^{d_{m} \times d_{q}}\) and \(V \in {\mathbb {R}}^{d_{m} \times d_{v}}\). We set \(d_{m}\), \(d_{q}\) and \(d_{v}\) to 512, 64 and 64, respectively. \(d_{q}\) and \(d_{v}\) are the feature dimensions of Q and V,  respectively.
We obtain \(Q'\) and \(K'\) by projecting Q and K using a 1D convolution (flatten operation in Fig. 3). Subsequently, we perform a scaled dot product operation on \(Q'\) and \(K'\), which is formulated as follows:
$$\begin{aligned} {W}_{a}={\text {softmax}}\left( \frac{{Q'} {K'}^{\top }}{\sqrt{d_{q}}}\right) , \end{aligned}$$
(10)
where \({W}_{a}\) is the attention weight of the feature filter module. softmax is the normalization function.
We use the same 1D convolution for V as above to obtain \(V'\). We use the attention weights \({W}_{a}\) to enhance the favorable features in \(V'\) and the redundant features are filtered. Then, the final high-quality fused feature is expressed as follows:
$$\begin{aligned} {F}_{X'}={W}_{a} V', \end{aligned}$$
(11)
where \({F}_{X'}\) is the fused feature by the feature filter.
The output \({F}_{X'}\) is compared with the template features processed by the tracking model for similarity calculation to obtain the final result. The proposed feature filter fuses the template and search region features in a global spatial context to obtain the final high-quality features. It obtains high-quality fused features by filtering insignificant features while making full use of spatial information. Therefore, the proposed feature fusion network based on the feature filter can utilize more spatial information for accurate object tracking.
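
As a concrete illustration of Eqs. (7)–(11), the PyTorch sketch below implements a feature-filter-style fusion: Q is projected from the template feature, K and V from the search feature, and a scaled dot-product attention map weights V. This is a minimal sketch under our assumptions; the class name `FeatureFilter` and the exact layer arrangement are ours, while the dimensions \(d_m=512\), \(d_q=d_v=64\) follow the values stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFilter(nn.Module):
    """Sketch of feature-filter fusion between template (F_Z) and search (F_X) features."""

    def __init__(self, d_m=512, d_q=64, d_v=64):
        super().__init__()
        # Q from the template branch, K/V from the search branch (Eqs. (7)-(9)).
        self.q_proj = nn.Sequential(nn.Conv2d(d_m, d_m, 3, padding=1), nn.Conv2d(d_m, d_q, 1))
        self.k_proj = nn.Sequential(nn.Conv2d(d_m, d_m, 3, padding=1), nn.Conv2d(d_m, d_q, 1))
        self.v_proj = nn.Conv2d(d_m, d_v, 3, padding=1)
        self.d_q = d_q

    def forward(self, f_z, f_x):
        # Flatten spatial dimensions (the flatten step in Fig. 3).
        q = self.q_proj(f_z).flatten(2).transpose(1, 2)  # (B, Nz, d_q)
        k = self.k_proj(f_x).flatten(2).transpose(1, 2)  # (B, Nx, d_q)
        v = self.v_proj(f_x).flatten(2).transpose(1, 2)  # (B, Nx, d_v)
        # Scaled dot-product attention weights (Eq. (10)).
        w_a = F.softmax(q @ k.transpose(1, 2) / self.d_q ** 0.5, dim=-1)
        # Weighted values: important features enhanced, redundant ones down-weighted (Eq. (11)).
        return w_a @ v                                   # (B, Nz, d_v)

# Usage with template/search feature maps of channel dimension 512 (illustrative shapes).
f_z = torch.randn(1, 512, 8, 8)
f_x = torch.randn(1, 512, 16, 16)
fused = FeatureFilter()(f_z, f_x)
print(fused.shape)  # torch.Size([1, 64, 64])
```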

RCFT tracking framework

As shown in Fig. 1, we design a novel tracking framework by integrating the proposed BFN-REP convolution-based feature extraction sub-network and the feature fusion network based on the feature filter. The proposed tracker enhances the features during both feature extraction and feature fusion, in different ways, so that the template and search region features carry richer channel and spatial information. This helps RCFT obtain robust tracking performance.
The tracker uses BFN-REP convolution in the feature extraction sub-network to scale the features in the channel dimension, increase the receptive field and make the best of the channel information to extract more accurate target features. In the feature fusion process, redundant features are filtered and important features are enhanced in the global spatial context by the feature filter to obtain high-quality fused features. A more robust template feature is obtained by the tracking model, and it is convolved with the high-quality fused features to obtain the response maps.
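
The final localization step, in which the template feature is convolved with the fused search-region feature to produce a response map, can be sketched as a depth-wise cross-correlation implemented with grouped convolution. This is an assumption-level illustration of that step (the prediction head of the paper may differ); the helper name `response_map` is ours.

```python
import torch
import torch.nn.functional as F

def response_map(template_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """Depth-wise cross-correlation: slide the template over the search feature per channel.

    template_feat: (B, C, Hz, Wz), search_feat: (B, C, Hx, Wx) with Hx >= Hz, Wx >= Wz.
    Returns a response map of shape (B, C, Hx - Hz + 1, Wx - Wz + 1).
    """
    b, c, hx, wx = search_feat.shape
    # Treat each (batch, channel) pair as its own group so channels correlate independently.
    search = search_feat.reshape(1, b * c, hx, wx)
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(search, kernel, groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])

# Usage: the peak of the response map indicates the most likely target location.
resp = response_map(torch.randn(1, 256, 7, 7), torch.randn(1, 256, 31, 31))
print(resp.shape)  # torch.Size([1, 256, 25, 25])
```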

Experiments and result

Table 1
Ablation study on GOT-10k [14]

| Tracker  | REP       | BFN-REP   | FFN       | AO (%) | \(SR_{0.50}\) (%) | \(SR_{0.75}\) (%) |
|----------|-----------|-----------|-----------|--------|-------------------|-------------------|
| DiMP50   |           |           |           | 61.1   | 71.7              | 49.2              |
| DiMP-REP | \(\surd\) |           |           | 64.8   | 76.4              | 53.9              |
| DiMP-BFN |           | \(\surd\) |           | 65.9   | 77.4              | 55.9              |
| DiMP-FFN |           |           | \(\surd\) | 65.7   | 76.8              | 57.1              |
| REP-FFN  | \(\surd\) |           | \(\surd\) | 67.0   | 78.2              | 56.1              |
| RCFT     |           | \(\surd\) | \(\surd\) | 67.3   | 78.8              | 58.9              |

Implementation details

The proposed tracker RCFT is implemented in PyTorch on a single RTX 2060 GPU. We use the modified ResNet50 with the proposed BFN-REP convolution as the backbone network. We use GOT-10k [14], LaSOT [13], COCO [49] and TrackingNet [50] as training sets for the proposed RCFT. During training, the backbone network is initialized with pretrained weights, the batch size is set to 24, and 50 epochs are trained using stochastic gradient descent. We conduct extensive experiments to evaluate the proposed tracker on six challenging datasets, including OTB100 [11], VOT2018 [12], LaSOT [13], GOT-10k [14], UAV123 [15] and the visual–thermal dataset VOT-RGBT2019 [16]. In addition, the ablation study is performed on GOT-10k [14].
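
For readers who want to reproduce a comparable setup, the skeleton below reflects only the settings stated above (SGD, batch size 24, 50 epochs). The network, data, learning rate and momentum are placeholder assumptions, not details reported in the paper.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for RCFT; only the optimizer choice, batch size
# and epoch count below follow the stated settings.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4)
)

# SGD with batch size 24 for 50 epochs, as stated; lr and momentum are assumed values.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

for epoch in range(50):
    # Each iteration would draw a batch of 24 template/search pairs from
    # GOT-10k, LaSOT, COCO and TrackingNet; random tensors stand in here.
    images = torch.randn(24, 3, 64, 64)
    boxes = torch.randn(24, 4)
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(images), boxes)  # stand-in regression loss
    loss.backward()
    optimizer.step()
```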

Ablation study

To verify the effectiveness of the proposed components, we add BFN-REP convolution and/or FFN to the baseline DiMP50 and analyze the tracking performance on GOT-10k.
REP and BFN-REP. First, to verify the effectiveness of online convolutional re-parameterization (REP) as well as BFN, we conduct experiments with DiMP50, DiMP-REP and DiMP-BFN. The results of these three trackers are shown in Table 1. Compared with the baseline DiMP50, the average overlap (AO) and the success rates \(SR_{0.50}\) and \(SR_{0.75}\) of the DiMP-REP tracker are improved by 3.7%, 4.7% and 4.7%, respectively. DiMP-BFN further improves on all three metrics over DiMP-REP. These comparisons show that the introduced online convolutional re-parameterization benefits visual tracking and that the feature extraction sub-network based on BFN-REP convolution in the proposed tracking algorithm has a clear advantage.
FFN. Next, we verify the effectiveness of the feature fusion network (FFN) based on the feature filter. As shown in Table 1, compared with DiMP50, DiMP-FFN improves AO and \(SR_{0.50}\) by 4.6% and 5.1%, respectively. The \(SR_{0.75}\) improves by 7.9%, from 49.2% to 57.1%. The results show that the proposed FFN can significantly improve tracking performance.
Finally, we train REP with FFN and BFN-REP with FFN, respectively. Compared with REP-FFN, the proposed RCFT has a 2.8% improvement on \(SR_{0.75}\) and also improves on AO and \(SR_{0.50}\). This comparison shows that jointly learning the proposed feature extraction sub-network and FFN yields superior tracking performance. As shown in Fig. 4, we visualize the tracking results of DiMP50 and RCFT. From the second and third columns, it can be seen that RCFT highlights the target location better under appearance variations such as background clutter, illumination variation and scale variation.

Evaluation on OTB100

OTB100 is a benchmark dataset that has been widely used in recent years. It contains 100 tracking video sequences, each annotated with 11 challenging attributes. The proposed RCFT tracker is compared with nine SOTA trackers in terms of success and precision, including AiATrack [51], SiamPW-RBO [52], ToMP50 [53], SiamTPN [39], DiMP50 [18], Ocean-offline [31], RMAN [54], TransT [40] and SiamFC [5].
As shown in Fig. 5, RCFT ranks first with a success rate of 0.705 and a precision of 0.921. In terms of precision, AiATrack [51] and SiamPW-RBO [52] follow RCFT in second place with 0.917. In terms of success rate, RCFT (0.705) outperforms the second-ranked SiamTPN [39] by 0.3%. SiamTPN [39] uses a transformer to strengthen the feature pyramid and build enhanced high-level feature maps. Although SiamTPN optimizes the Siamese network in feature fusion and achieves good results, it does not address several other shortcomings of Siamese networks. Similar to SiamTPN, RCFT is enhanced by the proposed FFN in the feature fusion stage to obtain high-quality fused features. At the same time, the proposed RCFT also uses BFN-REP convolution to optimize the feature extraction network. As a result, the proposed RCFT achieves better performance.
In addition, we compare RCFT with the SOTA algorithms on four attributes of OTB100: background clutter (BC), scale variation (SV), out-of-view (OV) and illumination variation (IV). As shown in Fig. 6, the proposed RCFT ranks first on all of these attributes. These experiments show that RCFT achieves superior performance compared to the SOTA trackers.

Evaluation on VOT2018

VOT2018 [12] is an evaluation dataset consisting of 60 video sequences for object tracking. The metrics used in VOT2018 [12] include accuracy (A), robustness (R) and expected average overlap (EAO). EAO combines the accuracy and the number of failures of a tracker into a single measure, compensating for the limitations of evaluating accuracy or robustness alone.
As shown in Fig. 7, RCFT ranks first in EAO with 0.465 compared to nine previous methods. In addition, as shown in Table 2, we present the results on A, R and EAO compared with 11 SOTA trackers, including SiamTPN [39], SiamLA [23], ULAST [55], TrDiMP [41], SiamRCR [28], SiamBAN [30], SiamR-CNN [27], DiMP50 [18], SiamCAR [7], DaSiamRPN [24] and SiamRPN++ [25]. The results in Table 2 show that RCFT obtains the best performance in EAO and A with 0.465 and 0.612, respectively. Moreover, the robustness (R) of RCFT is only 0.004 behind the top-ranked SiamLA (0.136). These results show that RCFT is highly competitive on VOT2018. Although RCFT does not obtain the best results on every metric, the effectiveness of BFN-REP convolution and FFN is still verified.

Evaluation on LaSOT

LaSOT [13] contains 70 categories, each with 20 sequences, reflecting both category balance and diversity in natural scenes. Every frame in LaSOT is manually labeled and the results are corrected when needed, yielding about 3.52 million high-quality bounding box annotations. LaSOT follows the principle of long-term tracking: the shortest video contains 1000 frames and the longest contains 11,397 frames.
Table 2
Comparison on VOT2018 [12]
https://static-content.springer.com/image/art%3A10.1007%2Fs40747-023-01223-z/MediaObjects/40747_2023_1223_Tab2_HTML.png
The best three results are highlighted in red, blue, and green, respectively
Figure 8 shows the success plots of the proposed RCFT and nine recent tracking algorithms. RCFT achieves a success rate of 63.5%, which is 1.1% higher than the second-ranked TrSiam (62.4%). In Table 3, we show the precision and normalized precision of RCFT and 14 SOTA trackers, including TrSiam [41], SAOT [56], SiamTPN [39], SiamPW-RBO [52], Ocean [31], DiMP50 [18], SiamLA [23], CGACD [57], SiamGAT [47], GlobalTrack, ATOM [17], SiamRPN++ [25], D3S [59] and VITAL [60]. The comparison shows that RCFT obtains the best precision and normalized precision. In terms of precision, RCFT outperforms TrSiam and SAOT by 1.5% and 3.1%, respectively. In terms of normalized precision, RCFT outperforms TrSiam and SAOT by 1.3% and 1.7%, respectively. These results show that RCFT performs long-term tracking well and verify that the BFN-REP convolution can alleviate network model degradation and improve tracking performance.

Evaluation on GOT-10k

GOT-10k [14] contains 560 classes of moving objects and 87 motion patterns. It provides 10,000 video clips with 1.5 million manually labeled bounding boxes. Each video is also labeled with an object class and a motion pattern. It is the first video trajectory dataset to use the WordNet semantic hierarchy to guide class population and to ensure comprehensive and unbiased coverage of various moving objects.
Table 3
Comparison on LaSOT [13] in terms of precision (Prec.) and normalized precision (N.Prec.)
https://static-content.springer.com/image/art%3A10.1007%2Fs40747-023-01223-z/MediaObjects/40747_2023_1223_Tab3_HTML.png
The best three results are highlighted in red, blue, and green, respectively
As shown in Table 4, we compare the proposed tracker against 14 recent trackers on GOT-10k dataset. These trackers include UTT [61], TrDiMP [41], HCAT [62], DTT [63], SiamGAT [47], SiamLA [23], DiMP50 [18], SBT-light [64], SiamFC++ [29], SiamCAR [7], Ocean-offline [31], SPM [65], SiamRPN [6] and SiamFC [5]. The evaluation metrics include average overlap (AO), success rate \(SR_{0.50}\) and \(SR_{0.75}\).
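
As a reference for how these metrics relate to per-frame overlaps, the snippet below computes AO and the success rates from a list of intersection-over-union (IoU) values. This reflects our reading of the standard definitions (AO as the mean IoU, \(SR_{t}\) as the fraction of frames with IoU above threshold t); the official GOT-10k protocol additionally averages over sequences.

```python
from typing import Sequence

def ao_and_success_rates(ious: Sequence[float]) -> dict:
    """Average overlap and success rates from per-frame IoU values (simplified)."""
    n = len(ious)
    return {
        "AO": sum(ious) / n,                         # mean overlap over all frames
        "SR_0.50": sum(i > 0.50 for i in ious) / n,  # fraction of frames with IoU > 0.50
        "SR_0.75": sum(i > 0.75 for i in ious) / n,  # fraction of frames with IoU > 0.75
    }

# Example with a toy sequence of per-frame IoUs.
print(ao_and_success_rates([0.82, 0.60, 0.40, 0.91, 0.77]))
# approximately {'AO': 0.70, 'SR_0.50': 0.8, 'SR_0.75': 0.6}
```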
Table 4
Comparison on GOT-10k [14]
https://static-content.springer.com/image/art%3A10.1007%2Fs40747-023-01223-z/MediaObjects/40747_2023_1223_Tab4_HTML.png
The best three results are highlighted in red, blue, and green, respectively
As shown in Table 4, our tracker obtains the best AO and \(SR_{0.50}\) with 67.3 and 78.8, respectively, and ranks second on \(SR_{0.75}\) with 58.9. In addition, we compare our tracker with nine trackers on success rate in Fig. 9, where it ranks first among the ten trackers. These experiments show that the RCFT tracker has strong generalization ability and verify that the FFN can significantly enhance tracking robustness in complex environments.

Evaluation on UAV123

UAV123 [15] is a dataset of specialized scenes captured by unmanned aerial vehicles. It contains 123 video sequences with clean backgrounds and large viewpoint variations. We evaluate RCFT against SOTA trackers on UAV123, including ToMP50 [53], SiamPW-RBO [52], TransT [40], DiMP50 [18], SiamGAT [47], SiamBAN [30], SiamTPN [39], TCTrack [45] and HIFT [66].
The precision and success plots are shown in Fig. 10. RCFT achieves the best precision with 0.891, outperforming ToMP50 (0.876) and SiamPW-RBO (0.863) by 0.015 and 0.028, respectively. SiamPW-RBO optimizes the prediction head with a classification ranking loss and an IoU-guided ranking loss to improve tracking performance. Although SiamPW-RBO optimizes the Siamese network at the prediction head and achieves good results, its feature extraction and feature fusion remain unchanged. RCFT addresses the shortcomings of Siamese networks in feature extraction and feature fusion and achieves better tracking performance. In terms of success rate, RCFT follows ToMP50 (0.678) and ranks second with 0.676. The results show that RCFT performs well under varied viewpoints.

Evaluation on VOT-RGBT2019

VOT-RGBT2019 [16] is a short-term tracking dataset containing 60 video sequences. Different from traditional tracking datasets, it contains both RGB images and thermal infrared images. RCFT is compared with recent tracking methods, including FANet [67], TFNet [68], DiMP50 [18], ATOM [17], DaSiamRPN [24], SiamFC [5] and ECO [4]. From Table 5, we can see that RCFT obtains the best performance in accuracy (A), robustness (R) and expected average overlap (EAO). Compared with DiMP50, RCFT improves EAO by 0.024 and A by 0.055. Compared with DaSiamRPN, which ranks second in accuracy, RCFT improves A by 0.037. The experimental results demonstrate that the designed BFN-REP convolution improves the feature representation of the target and alleviates network model degradation in RCFT.
Table 5
Comparison with SOTA trackers on VOT-RGBT2019 [16]
https://static-content.springer.com/image/art%3A10.1007%2Fs40747-023-01223-z/MediaObjects/40747_2023_1223_Tab5_HTML.png
The best three results are highlighted in red, blue, and green, respectively

Conclusion

In this work, a robust tracking framework is designed by jointly learning a feature extraction sub-network with the proposed BFN-REP convolution and a feature fusion network based on a feature filter. The proposed BFN-REP convolution-based feature extraction sub-network scales the features in the channel dimension and increases the receptive field while obtaining more channel and global information. Since BFN helps improve online convolutional re-parameterization, the feature extraction sub-network is also able to prevent degradation of the network model. In addition, the designed feature fusion network based on the feature filter fuses template and search region features in a global spatial context to produce high-quality features, making the best of spatial information while augmenting important features. The effectiveness and generalization of the proposed RCFT tracker are validated on six challenging datasets, including OTB100, VOT2018, LaSOT, GOT-10k, UAV123 and the visual–thermal dataset VOT-RGBT2019, with real-time tracking speed.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (no. 61861032).

Declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Author agreement

We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship. We further confirm that the order of authors listed in the manuscript has been approved by all of us. We understand that the corresponding author is the sole contact for the editorial process. He is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Literature
1. Xu L, Kim P, Wang M, Pan J, Yang X, Gao M (2022) Spatio-temporal joint aberrance suppressed correlation filter for visual tracking. Complex Intell Syst 8:3765–3777
2. Bolme DS, Beveridge JR, Draper BA, Lui YM (2010) Visual object tracking using adaptive correlation filters. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2544–2550
3. Liu S, Liu D, Srivastava G, Połap D, Woźniak M (2021) Overview and methods of correlation filter algorithms in object tracking. Complex Intell Syst 7(4):1895–1917
4. Danelljan M, Bhat G, Shahbaz Khan F, Felsberg M (2017) ECO: efficient convolution operators for tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6638–6646
5. Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr PH (2016) Fully-convolutional Siamese networks for object tracking. In: European conference on computer vision. Springer, Berlin, pp 850–865
6. Li B, Yan J, Wu W, Zhu Z, Hu X (2018) High performance visual tracking with Siamese region proposal network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8971–8980
7. Guo D, Wang J, Cui Y, Wang Z, Chen S (2020) SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6269–6277
8. Hu M, Feng J, Hua J, Lai B, Huang J, Gong X, Hua X-S (2022) Online convolutional re-parameterization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 568–577
9. Huang L, Zhou Y, Wang T, Luo J, Liu X (2022) Delving into the estimation shift of batch normalization in a network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 763–772
10. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
11. Wu Y, Lim J, Yang M-H (2013) Online object tracking: a benchmark. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2411–2418
12. Kristan M, Leonardis A, Matas J, Felsberg M, Pflugfelder R, Čehovin Zajc L, Vojir T, Bhat G, Lukezic A, Eldesokey A et al (2018) The sixth visual object tracking VOT2018 challenge results. In: Proceedings of the European conference on computer vision (ECCV) workshops
13. Fan H, Lin L, Yang F, Chu P, Deng G, Yu S, Bai H, Xu Y, Liao C, Ling H (2019) LaSOT: a high-quality benchmark for large-scale single object tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5374–5383
14. Huang L, Zhao X, Huang K (2021) GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Trans Pattern Anal Mach Intell 43(5):1562–1577
15. Mueller M, Smith N, Ghanem B (2016) A benchmark and simulator for UAV tracking. In: European conference on computer vision. Springer, Berlin, pp 445–461
16. Kristan M, Matas J, Leonardis A, Felsberg M, Pflugfelder R, Kamarainen J-K, Cehovin Zajc L, Drbohlav O, Lukezic A, Berg A et al (2019) The seventh visual object tracking VOT2019 challenge results. In: Proceedings of the IEEE/CVF international conference on computer vision workshops
17. Danelljan M, Bhat G, Khan FS, Felsberg M (2019) ATOM: accurate tracking by overlap maximization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4660–4669
18. Bhat G, Danelljan M, Gool LV, Timofte R (2019) Learning discriminative model prediction for tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6182–6191
19. Danelljan M, Gool LV, Timofte R (2020) Probabilistic regression for visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7183–7192
20. Bhat G, Danelljan M, Van Gool L, Timofte R (2020) Know your surroundings: exploiting scene information for object tracking. In: European conference on computer vision. Springer, Berlin, pp 205–221
21. Valmadre J, Bertinetto L, Henriques J, Vedaldi A, Torr PH (2017) End-to-end representation learning for correlation filter based tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2805–2813
22. Guo Q, Feng W, Zhou C, Huang R, Wan L, Wang S (2017) Learning dynamic Siamese network for visual object tracking. In: Proceedings of the IEEE international conference on computer vision, pp 1763–1771
23. Nie J, Wu H, He Z, Yang Y, Gao M, Dong Z (2022) Learning localization-aware target confidence for Siamese visual tracking. arXiv preprint arXiv:2204.14093
24. Zhu Z, Wang Q, Li B, Wu W, Yan J, Hu W (2018) Distractor-aware Siamese networks for visual object tracking. In: Proceedings of the European conference on computer vision (ECCV), pp 101–117
25. Li B, Wu W, Wang Q, Zhang F, Xing J, Yan J (2019) SiamRPN++: evolution of Siamese visual tracking with very deep networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4282–4291
26. Zhang Z, Peng H (2019) Deeper and wider Siamese networks for real-time visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4591–4600
27. Voigtlaender P, Luiten J, Torr PH, Leibe B (2020) Siam R-CNN: visual tracking by re-detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6578–6588
28. Peng J, Jiang Z, Gu Y, Wu Y, Wang Y, Tai Y, Wang C, Lin W (2021) SiamRCR: reciprocal classification and regression for visual object tracking. arXiv preprint arXiv:2105.11237
29. Xu Y, Wang Z, Li Z, Yuan Y, Yu G (2020) SiamFC++: towards robust and accurate visual tracking with target estimation guidelines. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 12549–12556
30. Chen Z, Zhong B, Li G, Zhang S, Ji R (2020) Siamese box adaptive network for visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6668–6677
31. Zhang Z, Peng H, Fu J, Li B, Hu W (2020) Ocean: object-aware anchor-free tracking. In: Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part XXI. Springer, Berlin, pp 771–787
32. Wang Q, Teng Z, Xing J, Gao J, Hu W, Maybank S (2018) Learning attentions: residual attentional Siamese network for high performance online visual tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4854–4863
33. Deng S, Liang Z, Sun L, Jia K (2022) VISTA: boosting 3D object detection via dual cross-view spatial attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8448–8457
34. Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X (2017) Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164
35. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
36. Guo G, Wang H, Yan Y, Liao H-YM, Li B (2018) A new target-specific object proposal generation method for visual tracking. arXiv preprint arXiv:1803.10098
37. Avytekin C, Cricri F, Aksu E (2018) Saliency enhanced robust visual tracking. In: European workshop on visual information processing (EUVIP), pp 1–5. arXiv:1802.02783
38. Zhu Z, Wu W, Zou W, Yan J (2018) End-to-end flow correlation tracking with spatial-temporal attention. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 548–557
39. Xing D, Evangeliou N, Tsoukalas A, Tzes A (2022) Siamese transformer pyramid networks for real-time UAV tracking. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2139–2148
40. Chen X, Yan B, Zhu J, Wang D, Yang X, Lu H (2021) Transformer tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8126–8135
41. Wang N, Zhou W, Wang J, Li H (2021) Transformer meets tracker: exploiting temporal context for robust visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1571–1580
42. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105
43. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
44. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826
45.
go back to reference Cao Z, Huang Z, Pan L, Zhang S, Liu Z, Fu C (2022) Tctrack: temporal contexts for aerial tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14798–14808 Cao Z, Huang Z, Pan L, Zhang S, Liu Z, Fu C (2022) Tctrack: temporal contexts for aerial tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14798–14808
46.
go back to reference Dong X, Shen J (2018) Triplet loss in Siamese network for object tracking. In: Proceedings of the European conference on computer vision (ECCV), pp 459–474 Dong X, Shen J (2018) Triplet loss in Siamese network for object tracking. In: Proceedings of the European conference on computer vision (ECCV), pp 459–474
47.
go back to reference Guo D, Shao Y, Cui Y, Wang Z, Zhang L, Shen C (2021) Graph attention tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9543–9552 Guo D, Shao Y, Cui Y, Wang Z, Zhang L, Shen C (2021) Graph attention tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9543–9552
49.
go back to reference Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, Berlin, pp 740–755 Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, Berlin, pp 740–755
50.
go back to reference Muller M, Bibi A, Giancola S, Alsubaihi S, Ghanem B (2018) Trackingnet: a large-scale dataset and benchmark for object tracking in the wild. In: Proceedings of the European conference on computer vision (ECCV), pp 300–317 Muller M, Bibi A, Giancola S, Alsubaihi S, Ghanem B (2018) Trackingnet: a large-scale dataset and benchmark for object tracking in the wild. In: Proceedings of the European conference on computer vision (ECCV), pp 300–317
51.
go back to reference Gao S, Zhou C, Ma C, Wang X, Yuan J (2022) Aiatrack: attention in attention for transformer visual tracking. arXiv preprint. arXiv:2207.09603 Gao S, Zhou C, Ma C, Wang X, Yuan J (2022) Aiatrack: attention in attention for transformer visual tracking. arXiv preprint. arXiv:​2207.​09603
52.
go back to reference Tang F, Ling Q (2022) Ranking-based Siamese visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8741–8750 Tang F, Ling Q (2022) Ranking-based Siamese visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8741–8750
53.
go back to reference Mayer C, Danelljan M, Bhat G, Paul M, Paudel DP, Yu F, Van Gool L (2022) Transforming model prediction for tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8731–8740 Mayer C, Danelljan M, Bhat G, Paul M, Paudel DP, Yu F, Van Gool L (2022) Transforming model prediction for tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8731–8740
54.
go back to reference Pu S, Song Y, Ma C, Zhang H, Yang M-H (2020) Learning recurrent memory activation networks for visual tracking. IEEE Trans Image Process 30:725–738ADSCrossRefPubMed Pu S, Song Y, Ma C, Zhang H, Yang M-H (2020) Learning recurrent memory activation networks for visual tracking. IEEE Trans Image Process 30:725–738ADSCrossRefPubMed
55.
go back to reference Shen Q, Qiao L, Guo J, Li P, Li X, Li B, Feng W, Gan W, Wu W, Ouyang W (2022) Unsupervised learning of accurate Siamese tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8101–8110 Shen Q, Qiao L, Guo J, Li P, Li X, Li B, Feng W, Gan W, Wu W, Ouyang W (2022) Unsupervised learning of accurate Siamese tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8101–8110
56.
go back to reference Zhou Z, Pei W, Li X, Wang H, Zheng F, He Z (2021) Saliency-associated object tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9866–9875 Zhou Z, Pei W, Li X, Wang H, Zheng F, He Z (2021) Saliency-associated object tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9866–9875
57.
go back to reference Du F, Liu P, Zhao W, Tang X (2020) Correlation-guided attention for corner detection based visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6836–6845 Du F, Liu P, Zhao W, Tang X (2020) Correlation-guided attention for corner detection based visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6836–6845
58.
go back to reference Huang L, Zhao X, Huang K (2020) Globaltrack: a simple and strong baseline for long-term tracking. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 11037–11044 Huang L, Zhao X, Huang K (2020) Globaltrack: a simple and strong baseline for long-term tracking. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 11037–11044
59.
go back to reference Lukezic A, Matas J, Kristan M (2020) D3s-a discriminative single shot segmentation tracker. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7133–7142 Lukezic A, Matas J, Kristan M (2020) D3s-a discriminative single shot segmentation tracker. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7133–7142
60.
go back to reference Song Y, Ma C, Wu X, Gong L, Bao L, Zuo W, Shen C, Lau RW, Yang M-H (2018) Vital: visual tracking via adversarial learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8990–8999 Song Y, Ma C, Wu X, Gong L, Bao L, Zuo W, Shen C, Lau RW, Yang M-H (2018) Vital: visual tracking via adversarial learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8990–8999
61.
go back to reference Ma F, Shou MZ, Zhu L, Fan H, Xu Y, Yang Y, Yan Z (2022) Unified transformer tracker for object tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8781–8790 Ma F, Shou MZ, Zhu L, Fan H, Xu Y, Yang Y, Yan Z (2022) Unified transformer tracker for object tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8781–8790
62.
go back to reference Chen X, Wang D, Li D, Lu H (2022) Efficient visual tracking via hierarchical cross-attention transformer. arXiv preprint. arXiv:2203.13537 Chen X, Wang D, Li D, Lu H (2022) Efficient visual tracking via hierarchical cross-attention transformer. arXiv preprint. arXiv:​2203.​13537
63.
go back to reference Yu B, Tang M, Zheng L, Zhu G, Wang J, Feng H, Feng X, Lu H (2021) High-performance discriminative tracking with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9856–9865 Yu B, Tang M, Zheng L, Zhu G, Wang J, Feng H, Feng X, Lu H (2021) High-performance discriminative tracking with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9856–9865
64.
go back to reference Grabner H, Leistner C, Bischof H (2008) Semi-supervised on-line boosting for robust tracking. In: European conference on computer vision. Springer, Berlin, pp 234–247 Grabner H, Leistner C, Bischof H (2008) Semi-supervised on-line boosting for robust tracking. In: European conference on computer vision. Springer, Berlin, pp 234–247
65.
go back to reference Wang G, Luo C, Xiong Z, Zeng W (2019) Spm-tracker: series-parallel matching for real-time visual object tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3643–3652 Wang G, Luo C, Xiong Z, Zeng W (2019) Spm-tracker: series-parallel matching for real-time visual object tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3643–3652
66.
go back to reference Cao Z, Fu C, Ye J, Li B, Li Y (2021) Hift: hierarchical feature transformer for aerial tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 15457–15466 Cao Z, Fu C, Ye J, Li B, Li Y (2021) Hift: hierarchical feature transformer for aerial tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 15457–15466
67.
go back to reference Zhu Y, Li C, Tang J, Luo B (2020) Quality-aware feature aggregation network for robust RGBT tracking. IEEE Trans Intell Veh 6(1):121–130CrossRef Zhu Y, Li C, Tang J, Luo B (2020) Quality-aware feature aggregation network for robust RGBT tracking. IEEE Trans Intell Veh 6(1):121–130CrossRef
68.
go back to reference Zhu Y, Li C, Tang J, Luo B, Wang L (2021) RGBT tracking by trident fusion network. IEEE Trans Circuits Syst Video Technol 32(2):579–592CrossRef Zhu Y, Li C, Tang J, Luo B, Wang L (2021) RGBT tracking by trident fusion network. IEEE Trans Circuits Syst Video Technol 32(2):579–592CrossRef
69.
go back to reference Yu Y, Xiong Y, Huang W, Scott M. R (2020) Deformable siamese attention networks for visual object tracking, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6728–6737 Yu Y, Xiong Y, Huang W, Scott M. R (2020) Deformable siamese attention networks for visual object tracking, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6728–6737