1 Introduction
- Existing approaches [1–3] use only generic CNNs to extract spatial and temporal features from videos, ignoring key action information (e.g., objects and motion). To address this issue, we propose a multi-head attention mechanism-based two-stream network that captures the key action information from the extracted features. MAT-EffNet can thus focus on the key action information at different frames to distinguish similar actions. EfficientNet is adopted as the feature extractor because of its high parameter efficiency and speed.
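To make the mechanism concrete, the following is a minimal NumPy sketch of multi-head self-attention applied to a sequence of frame-level feature vectors. The dimensions, head count, and projection weights are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product multi-head self-attention over a sequence of
    frame-level feature vectors X with shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # each (seq_len, d_model)

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, L, L)
    attn = softmax(scores, axis=-1)                        # rows sum to 1
    out = attn @ Vh                                        # (h, L, d_head)
    # Concatenate heads back to (seq_len, d_model), then project.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo
```

Each head attends over all frames independently, so different heads can weight different action cues before the outputs are concatenated and projected.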
- We conduct experiments on three widely used action recognition datasets (i.e., UCF101 [35], HMDB51 [36] and Kinetics [51]) to verify the performance of our approach. The experimental results show that MAT-EffNet achieves the best classification results compared with several state-of-the-art methods.

The rest of this paper is organized as follows. We review the existing two-stream network-based approaches and the attention mechanism-based approaches in Sect. 2. Section 3 presents the details of the proposed MAT-EffNet approach. Experimental results are presented in Sect. 4. Finally, Sect. 5 concludes the paper.
2 Related works
2.1 Two-stream network-based action recognition approaches
2.2 Attention mechanisms
3 Multi-head attention-based two-stream EfficientNet
3.1 EfficientNet
3.2 Multi-head self-attention mechanism
3.3 Our proposed MAT-EffNet
4 Experiments
4.1 Datasets
4.2 Implementation details
Stage | Operator | Size | Number of layers |
---|---|---|---|
0 | Input | 224 × 224 × 3 | – |
1 | Conv 3 × 3, stride 2 | 112 × 112 × 32 | 1 |
2 | MBConv1, k3 × 3, stride 1 | 112 × 112 × 16 | 1 |
3 | MBConv6, k3 × 3, stride 2 | 56 × 56 × 24 | 2 |
4 | MBConv6, k5 × 5, stride 2 | 28 × 28 × 40 | 2 |
5 | MBConv6, k3 × 3, stride 2 | 14 × 14 × 80 | 3 |
6 | MBConv6, k5 × 5, stride 1 | 14 × 14 × 112 | 3 |
7 | MBConv6, k5 × 5, stride 2 | 7 × 7 × 192 | 4 |
8 | MBConv6, k3 × 3, stride 1 | 7 × 7 × 320 | 1 |
9 | Conv 1 × 1, stride 1, average pooling | 1 × 1 × 1280 | 1 |
10 | Multi-head attention layer | 1 × 1 × 512 | 1 |
11 | FC & Softmax | 512 × {101 or 51} | – |
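The stride pattern in the table above can be sanity-checked with a short script that propagates the input resolution through the backbone stages (a sketch; the stage list is hand-copied from the table, using the standard EfficientNet-B0 resolutions, where a stride-2 stage halves the spatial size).

```python
# Per-stage (stride, out_channels) for backbone stages 1-8 in the table.
STAGES = [(2, 32), (1, 16), (2, 24), (2, 40), (2, 80), (1, 112), (2, 192), (1, 320)]

def stage_output_shapes(height=224, width=224):
    """Propagate the input resolution through each stage: a stride-s stage
    divides the spatial resolution by s and sets the channel count."""
    shapes = []
    for stride, channels in STAGES:
        height, width = height // stride, width // stride
        shapes.append((height, width, channels))
    return shapes
```

After stage 8, the 7 × 7 × 320 feature map passes through the 1 × 1 convolution and global average pooling of stage 9, yielding the 1 × 1 × 1280 vector that feeds the multi-head attention layer.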
4.3 Ablation experiments
Training setting | Spatial stream (%) | Temporal stream (%) | Two-stream (%) |
---|---|---|---|
ResNet-18 | 76.2 | 79.1 | 81.9 |
ResNet-18 + Multi-head attention | 78.1 | 81.2 | 83.9 |
ResNet-34 | 79.7 | 80.3 | 82.5 |
ResNet-34 + Multi-head attention | 81.1 | 81.6 | 84.7 |
ResNet-50 | 82.5 | 85.6 | 88.5 |
ResNet-50 + Multi-head attention | 85.1 | 88.7 | 91.7 |
EfficientNet-B0 | 87.6 | 89.1 | 91.8 |
EfficientNet-B0 + Multi-head attention (MAT-EffNet) | 90.2 | 92.4 | 94.5 |
Training setting | Spatial stream (%) | Temporal stream (%) | Two-stream (%) |
---|---|---|---|
ResNet-18 | 36.7 | 38.1 | 40.1 |
ResNet-18 + Multi-head attention | 38.9 | 40.2 | 41.9 |
ResNet-34 | 37.6 | 39.1 | 43.1 |
ResNet-34 + Multi-head attention | 40.9 | 44.2 | 49.9 |
ResNet-50 | 43.2 | 51.4 | 57.8 |
ResNet-50 + Multi-head attention | 46.1 | 54.9 | 63.4 |
EfficientNet-B0 | 53.3 | 59.1 | 65.2 |
EfficientNet-B0 + Multi-head attention (MAT-EffNet) | 59.3 | 65.3 | 70.9 |
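The "Two-stream (%)" columns above report accuracy after fusing the spatial and temporal predictions. This excerpt does not state the fusion rule; one common choice for two-stream networks, shown here purely as a hedged sketch, is a weighted average of the two streams' softmax scores.

```python
import numpy as np

def fuse_two_stream(spatial_logits, temporal_logits, w_spatial=0.5):
    """Late fusion of per-class scores by weighted averaging of the two
    streams' softmax probabilities. The weighting (and the rule itself)
    is an assumption, not taken from the paper."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    p = (w_spatial * softmax(spatial_logits)
         + (1.0 - w_spatial) * softmax(temporal_logits))
    return int(np.argmax(p)), p
```

With equal weights, a class must be supported by at least one confident stream (and not contradicted strongly by the other) to win the fused vote, which is why the fused columns exceed both single-stream columns.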
4.4 Exploration of MAT-EffNet on the Kinetics-400 dataset
Methods | Top-1 accuracy (%) | Top-5 accuracy (%) |
---|---|---|
EfficientNet-B0 (baseline) | 71.0 | 89.5 |
MAT-EffNet | 72.6 | 90.8 |
4.5 Exploration of MAT-EffNet on the UCF101 and HMDB51 datasets
Approaches | Input modalities | UCF101 (%) | HMDB51 (%) |
---|---|---|---|
LRCN [38] | RGB + optical flow | 82.9 | – |
C3D [26] | RGB only + 3D CNNs | 85.2 | – |
IDTs [32] | RGB only + 3D CNNs | 85.9 | 57.2 |
Two-stream [1] | RGB + optical flow | 88.0 | 59.4 |
FSTCN [39] | RGB + optical flow | 88.1 | 59.1 |
P3D-199 [65] | RGB + 3D CNNs | 89.2 | 62.9 |
TDD [34] | RGB + optical flow | 90.3 | 63.2 |
STS-network [17] | RGB + optical flow + others | 90.1 | 62.4 |
R-M3D [11] | RGB only + 3D CNNs | 93.2 | 65.4 |
STDAN + RGB difference [58] | RGB + optical flow + others | 91.0 | 60.4 |
TSN Corrnet [55] | RGB + optical flow | 94.4 | 70.6 |
MSM-ResNets [56] | RGB + optical flow + others | 93.5 | 66.7 |
R-STAN-50 [68] | RGB + optical flow | 91.5 | 62.8 |
3D ResNeXt-101 + Confidence Distillation [69] | RGB + 3D CNNs | 91.2 | – |
MAT-EffNet | RGB + optical flow | 94.8 | 71.1 |