Introduction
- A Transformer-based multi-scale feature fusion network with dual attention, namely box attention and instance attention, is designed. With this fusion network, multiple bounding boxes with high confidence scores are obtained quickly in the encoder and then refined into the predicted bounding boxes in the decoder.
- In the Transformer-based feature fusion network, box attention extracts structured spatial information, and instance attention explores temporal context information through the encoder-decoder architecture. In this way, the MDTT tracker captures global context information across successive frames while still focusing on local responses.
- A novel Transformer tracking framework with multi-scale dual attention is proposed to handle difficult conditions such as background clutter, full occlusion, and viewpoint change. The effectiveness of the fusion network is verified, and the proposed tracker MDTT is tested on six challenging tracking benchmark datasets. The experimental results on these datasets show that MDTT achieves robust tracking performance while running at real-time speed.
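
The coarse step described in the first contribution (keeping the high-confidence boxes from the encoder before refinement) can be illustrated with a minimal sketch. It is not the MDTT implementation: the function name `select_top_k_proposals`, the `(x, y, w, h)` box layout, the value of `k`, and the example numbers are all assumptions made for illustration, and the decoder refinement is not modelled.

```python
import numpy as np


def select_top_k_proposals(boxes, scores, k=3):
    """Return the k highest-scoring boxes and their scores, best first.

    Hypothetical helper: `boxes` is an (N, 4) array-like of (x, y, w, h)
    candidates and `scores` an (N,) array-like of confidence scores.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    k = min(k, len(scores))
    # Sort by descending confidence; "stable" keeps ties in input order.
    order = np.argsort(-scores, kind="stable")[:k]
    return boxes[order], scores[order]


# Made-up encoder output: four candidate boxes with confidence scores.
proposals = [[10, 10, 50, 40], [12, 11, 48, 42], [200, 80, 30, 30], [11, 9, 52, 41]]
confidences = [0.91, 0.87, 0.15, 0.78]

top_boxes, top_scores = select_top_k_proposals(proposals, confidences, k=3)
print(top_scores)
```

In a coarse-to-fine design of this kind, the retained boxes would then be passed to the decoder, which produces the final prediction.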
Related work
Siamese-based visual tracking
Attention mechanisms in computer vision
Transformer in visual tracking
Overall architecture
Box attention in transformer encoder
Instance attention in transformer decoder
Tracking with box attention and instance attention
Experiments
Implementation details
State-of-the-art comparison
Tracker | Year | AO (%) | \(SR_{0.50}\) (%) | \(SR_{0.75}\) (%) |
---|---|---|---|---|
Ours | – | 68.7 | 80.2 | 60.0 |
STARK [21] | 2021 | 68.0 | 77.7 | 62.3 |
UTT [39] | 2022 | 67.2 | 76.3 | 60.5 |
TrDiMP [3] | 2021 | 67.1 | 77.7 | 58.3 |
TransT-N2 [24] | 2021 | 67.1 | 76.8 | 60.9 |
TREG [40] | 2021 | 66.8 | 77.8 | 57.2 |
SBT [41] | 2022 | 66.8 | 77.3 | 58.7 |
SuperDiMP [42] | 2019 | 66.1 | 77.2 | 59.2 |
TrSiam [3] | 2021 | 66.0 | 76.6 | 57.1 |
AutoMatch [43] | 2021 | 65.2 | 76.6 | 54.3 |
SiamR-CNN [44] | 2020 | 64.9 | 72.8 | 59.7 |
SiamPW-RBO [37] | 2022 | 64.4 | 76.7 | 50.9 |
STMTrack [34] | 2021 | 64.2 | 73.7 | 57.5 |
SAOT [45] | 2021 | 64.0 | 74.7 | 53.0 |
KYS [46] | 2020 | 63.6 | 75.1 | 51.5 |
FCOT [47] | 2020 | 63.4 | 76.6 | 52.1 |
PrDiMP [48] | 2020 | 63.4 | 73.8 | 54.3 |
SiamGAT [35] | 2021 | 62.7 | 74.3 | 48.8 |
SiamLA [49] | 2022 | 61.9 | 72.4 | 51.0 |
OCEAN [50] | 2020 | 61.6 | 72.1 | 47.3 |
DiMP [30] | 2019 | 61.1 | 71.7 | 49.2 |
D3S [51] | 2020 | 59.7 | 67.6 | 46.2 |
SiamCAR [52] | 2020 | 57.9 | 67.7 | 43.7 |
ATOM [53] | 2019 | 55.6 | 63.4 | 40.2 |
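
The AO and SR columns in the table above are overlap-based metrics: average overlap is the mean intersection-over-union (IoU) between predicted and ground-truth boxes, and \(SR_{t}\) is the share of frames whose IoU exceeds the threshold \(t\). The sketch below shows one way to compute them, assuming boxes in `(x, y, w, h)` format. The function names and example boxes are illustrative, and the official benchmark toolkit may aggregate over sequences differently.

```python
import numpy as np


def iou_xywh(pred, gt):
    """Per-frame IoU between two (N, 4) arrays of (x, y, w, h) boxes."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Corners of the intersection rectangle.
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    # Negative extents mean the boxes do not overlap.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)


def average_overlap(ious):
    """AO in percent: mean IoU over all frames."""
    return 100.0 * float(np.mean(ious))


def success_rate(ious, threshold):
    """SR in percent: fraction of frames with IoU strictly above threshold."""
    return 100.0 * float(np.mean(np.asarray(ious) > threshold))


# Made-up predictions and ground truth for four frames.
pred_boxes = [[0, 0, 10, 10], [0, 0, 10, 10], [0, 0, 10, 10], [0, 0, 10, 10]]
gt_boxes = [[0, 0, 10, 10], [0, 0, 10, 5], [5, 0, 10, 10], [20, 20, 10, 10]]

ious = iou_xywh(pred_boxes, gt_boxes)
print(average_overlap(ious), success_rate(ious, 0.5), success_rate(ious, 0.75))
```
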
Tracker | Year | AUC (%) | P (%) | \(P_{Norm}\) (%) |
---|---|---|---|---|
Ours | – | 64.7 | 67.5 | 73.9 |
UTT [39] | 2022 | 64.6 | 67.2 | – |
TransT-N2 [24] | 2021 | 64.2 | 68.2 | 73.5 |
TrDiMP [3] | 2021 | 64.0 | 66.6 | 73.2 |
DualTFR [54] | 2021 | 63.5 | 66.5 | 72.0 |
SuperDiMP [42] | 2019 | 63.1 | 65.3 | 72.2 |
TrSiam [3] | 2021 | 62.9 | 65.0 | 71.8 |
SAOT [45] | 2021 | 61.6 | 62.9 | 70.8 |
SBT [41] | 2022 | 61.1 | 63.8 | – |
STMTrack [34] | 2021 | 60.6 | 63.3 | 69.3 |
PrDiMP [48] | 2020 | 59.8 | 60.8 | 68.8 |
AutoMatch [43] | 2021 | 58.2 | 59.9 | 67.4 |
SiamTPN [55] | 2022 | 58.1 | 57.8 | 68.3 |
CAJMU [56] | 2022 | 57.3 | 57.2 | 66.3 |
LTMU [57] | 2020 | 57.2 | 57.8 | 66.5 |
FCOT [47] | 2020 | 56.9 | 58.9 | 67.8 |
SRRTransT [58] | 2022 | 56.9 | 57.1 | 64.0 |
SiamLA [49] | 2022 | 56.1 | 56.0 | 65.2 |
CNNInMo [38] | 2022 | 53.9 | 53.9 | 61.6 |
SiamGAT [35] | 2021 | 53.9 | 53.0 | 63.3 |
CGACD [59] | 2020 | 51.8 | 62.6 | – |
OCEAN [50] | 2020 | 51.6 | 52.6 | 60.7 |
SiamCAR [52] | 2020 | 51.6 | 52.4 | 61.0 |
ULAST [60] | 2022 | 47.1 | 45.1 | – |
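
The AUC and P columns in the table above follow the usual success and precision protocol: the success curve records the fraction of frames whose IoU exceeds each overlap threshold, AUC is the area under that curve, and precision is the fraction of frames whose centre location error is within a pixel threshold. The sketch below assumes 21 evenly spaced overlap thresholds in [0, 1] and a 20-pixel precision threshold, which are common defaults. The function names and inputs are illustrative, and each benchmark's own toolkit defines the exact protocol. \(P_{Norm}\), which rescales the centre error by the ground-truth box size, is not covered.

```python
import numpy as np


def success_auc(ious, num_thresholds=21):
    """Area under the success curve, in percent.

    The curve value at threshold t is the fraction of frames with IoU > t;
    the AUC is approximated by the mean over evenly spaced thresholds.
    """
    ious = np.asarray(ious, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    success_curve = [float(np.mean(ious > t)) for t in thresholds]
    return 100.0 * float(np.mean(success_curve))


def precision(pred_centers, gt_centers, pixel_threshold=20.0):
    """Fraction of frames (percent) with centre error within the threshold."""
    pred_centers = np.asarray(pred_centers, dtype=float)
    gt_centers = np.asarray(gt_centers, dtype=float)
    errors = np.linalg.norm(pred_centers - gt_centers, axis=1)
    return 100.0 * float(np.mean(errors <= pixel_threshold))


# Made-up per-frame IoUs and box centres.
example_ious = [0.92, 0.52, 0.27, 0.0]
example_pred = [[10, 10], [30, 30], [100, 100]]
example_gt = [[10, 10], [40, 30], [100, 130]]

auc_score = success_auc(example_ious)
precision_score = precision(example_pred, example_gt)
print(auc_score, precision_score)
```
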
Tracker | Year | A\( (\uparrow )\) | R\( (\downarrow )\) | EAO\( (\uparrow )\) |
---|---|---|---|---|
Ours | – | 61.9 | 16.0 | 45.2 |
Retina-MAML [61] | 2020 | 60.4 | 15.9 | 45.2 |
CGACD [59] | 2020 | 61.5 | 17.2 | 44.9 |
PGNet [62] | 2020 | 61.8 | 19.2 | 44.7 |
STMTrack [34] | 2021 | 59.0 | 15.9 | 44.7 |
PrDiMP [48] | 2020 | 61.8 | 16.5 | 44.2 |
DiMP [30] | 2019 | 59.7 | 15.3 | 44.0 |
TrDiMP [3] | 2021 | 60.0 | 16.2 | 43.7 |
SiamFC++ [63] | 2020 | 58.7 | 18.3 | 42.6 |
SiamCAR [52] | 2020 | 57.8 | 19.7 | 42.3 |
SiamRPN++ [13] | 2019 | 60.0 | 23.4 | 41.4 |
SiamR-CNN [44] | 2020 | 60.9 | 22.0 | 40.8 |
ATOM [53] | 2019 | 59.0 | 20.4 | 40.1 |
LADCF [64] | 2019 | 50.3 | 15.9 | 38.9 |
MFT [5] | 2018 | 50.5 | 14.0 | 38.5 |
SiamRPN [12] | 2018 | 58.6 | 27.6 | 38.3 |
UPDT [65] | 2018 | 53.6 | 18.4 | 37.9 |
Tracker | Year | AUC (%) | P (%) | \(P_{Norm}\) (%) |
---|---|---|---|---|
Ours | – | 78.1 | 73.4 | 83.3 |
SiamLA [49] | 2022 | 76.7 | 71.8 | 82.1 |
AutoMatch [43] | 2021 | 76.0 | 72.6 | – |
SRRTransT [58] | 2022 | 76.0 | 71.9 | 81.3 |
PrDiMP [48] | 2020 | 75.8 | 70.4 | 81.6 |
FCOS-MAML [61] | 2020 | 75.7 | 72.5 | 82.2 |
SiamFC++ [63] | 2020 | 75.4 | 70.5 | 80.0 |
SiamGAT [35] | 2021 | 75.3 | 69.8 | 80.7 |
DCFST [66] | 2020 | 75.2 | 70.0 | 80.9 |
CAJMU [56] | 2022 | 74.2 | 68.9 | 80.1 |
KYS [46] | 2020 | 74.0 | 68.8 | 80.0 |
DiMP [30] | 2019 | 74.0 | 68.7 | 80.1 |
SiamCAR [52] | 2020 | 74.0 | 68.4 | 80.4 |
SiamLTR [67] | 2021 | 73.6 | 69.1 | 80.2 |
SiamRPN++ [13] | 2019 | 73.3 | 69.4 | 80.0 |
D3S [51] | 2020 | 72.8 | 66.4 | 76.8 |
OCEAN [50] | 2020 | 70.3 | 68.8 | – |
ATOM [53] | 2019 | 70.3 | 64.8 | 77.1 |
CRPN [68] | 2019 | 66.9 | 61.9 | 74.6 |
ULAST [60] | 2022 | 65.4 | 59.2 | 73.2 |
DaSiamRPN [31] | 2018 | 63.8 | 59.1 | 73.3 |
UPDT [65] | 2018 | 61.1 | 55.7 | 70.2 |
Tracker | Ours | TrDiMP [3] | ToMP101 [25] | TransT [24] | AutoMatch [43] | CAJMU [56] | SiamBAN-RBO [37] | STMTrack [34] | CNNInMo [38] | DCFST [66] | CRACT [69] | OCEAN [50] | SiamDW [70] | KYS [46] | SiamR-CNN [44] |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Year | – | 2021 | 2022 | 2021 | 2021 | 2022 | 2022 | 2021 | 2022 | 2020 | 2020 | 2020 | 2019 | 2020 | 2020 |
NfS | 66.2 | 64.8 | 66.7 | 65.7 | 60.6 | 62.7 | 61.3 | – | 56.0 | 64.1 | 62.5 | 55.3 | 52.1 | 63.5 | 63.9 |
UAV123 | 67.6 | 65.9 | 66.9 | 66.0 | 64.4 | – | 64.1 | 64.7 | 62.9 | – | 66.4 | 62.1 | 53.6 | – | 64.9 |
Method | Training sets | Devices | Box-Att | Ins-Att | UAV123 AUC (%) | UAV123 P (%) |
---|---|---|---|---|---|---|
TrDiMP | LaSOT, GOT-10k, TrackingNet, COCO | 4*GTX1080Ti | \(\times \) | \(\times \) | 67.0 | 87.6 |
Baseline | LaSOT, GOT-10k, TrackingNet, COCO | 1*RTX2060 | \(\times \) | \(\times \) | 65.9 | 87.2 |
Ours | LaSOT, GOT-10k, TrackingNet, COCO | 1*RTX2060 | \(\surd \) | \(\times \) | 66.7 | 87.9 |
Ours | LaSOT, GOT-10k, TrackingNet, COCO | 1*RTX2060 | \(\times \) | \(\surd \) | 66.5 | 87.7 |
Ours | LaSOT, GOT-10k, TrackingNet, COCO | 1*RTX2060 | \(\surd \) | \(\surd \) | 67.6 | 89.2 |
Method | Training sets | Devices | Box-Att | Ins-Att | GOT-10k AO (%) | GOT-10k \(SR_{0.5}\) (%) | GOT-10k \(SR_{0.75}\) (%) |
---|---|---|---|---|---|---|---|
TrDiMP | LaSOT, GOT-10k, TrackingNet, COCO | 4*GTX1080Ti | \(\times \) | \(\times \) | 67.1 | 77.7 | 58.3 |
Baseline | LaSOT, GOT-10k, TrackingNet, COCO | 1*RTX2060 | \(\times \) | \(\times \) | 66.0 | 76.6 | 57.1 |
Ours | LaSOT, GOT-10k, TrackingNet, COCO | 1*RTX2060 | \(\surd \) | \(\times \) | 68.0 | 79.1 | 59.6 |
Ours | LaSOT, GOT-10k, TrackingNet, COCO | 1*RTX2060 | \(\times \) | \(\surd \) | 67.6 | 78.8 | 58.5 |
Ours | LaSOT, GOT-10k, TrackingNet, COCO | 1*RTX2060 | \(\surd \) | \(\surd \) | 68.7 | 80.2 | 60.0 |
Ablation study and analysis
Tracker | Year | Backbone | Template size | Search region size | Speed (fps) | FLOPs (G) | Params (M) |
---|---|---|---|---|---|---|---|
Ours | – | ResNet-50 | 128\(\times \)128 | 256\(\times \)256 | 20 | 19.3 | 23.5 |
SiamRPN++ [13] | 2019 | ResNet-50 | 127\(\times \)127 | 255\(\times \)255 | 35.0 | 48.9 | 54.0 |
STARK-S50 [21] | 2021 | ResNet-50 | 128\(\times \)128 | 320\(\times \)320 | 42.2 | 10.5 | 23.3 |
Mixformer [72] | 2022 | MAM | 128\(\times \)128 | 320\(\times \)320 | 25 | 23.04 | – |
TransT [24] | 2021 | ResNet-50 | 128\(\times \)128 | 256\(\times \)256 | 50 | 19.1 | 23 |
DualTFR [54] | 2021 | LAB | 112\(\times \)112 | 224\(\times \)224 | 40 | 18.9 | 44.1 |