Swin-Fusion: Swin-Transformer with Feature Fusion for Human Action Recognition

Authors: Tiansheng Chen, Lingfei Mo

Published in: Neural Processing Letters | Issue 8/2023 | 20-07-2023


Abstract

Human action recognition from still images is one of the most challenging computer vision tasks. Over the past decade, convolutional neural networks (CNNs) have developed rapidly and achieved strong performance on still image-based human action recognition. However, because CNNs lack long-range perception, it is difficult for them to form a global structural understanding of human behavior and of the overall relationship between the behavior and its environment. Recently, transformer-based models have made rapid progress in computer vision, reaching state-of-the-art results in several vision tasks. We explore the transformer's capability for human action recognition from still images and add a simple but effective feature fusion module to the Swin-Transformer model. More specifically, we propose a new transformer-based model for behavioral feature extraction that uses a pre-trained Swin-Transformer as the backbone network. Swin-Transformer's distinctive hierarchical structure, combined with the feature fusion module, is used to extract and fuse multi-scale behavioral information. Extensive experiments were conducted on five still image-based human action recognition datasets: Li's action dataset, the Stanford-40 dataset, the PPMI-24 dataset, the AUC-V1 dataset, and the AUC-V2 dataset. The results indicate that our proposed Swin-Fusion model recognizes behavior better than previous improved CNN-based models by sharing and reusing feature maps of different scales at multiple stages, without modifying the original backbone training method and while increasing training resources by only 1.6%. The code and models will be available at https://github.com/cts4444/Swin-Fusion.
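To make the fusion idea concrete, below is a minimal PyTorch sketch of one way to fuse the four stage outputs of a hierarchical backbone such as Swin-B (channel widths 128/256/512/1024 at spatial resolutions 56/28/14/7 for a 224x224 input). This is an illustrative assumption, not the authors' released implementation: the class name StageFusionHead, the fused width, and the pool-and-concatenate design are hypothetical choices standing in for the paper's feature fusion module.

# A hedged sketch of multi-scale feature fusion over a hierarchical
# backbone's stage outputs. NOT the authors' code; names and channel
# sizes are illustrative assumptions shaped like Swin-B on 224x224 input.
import torch
import torch.nn as nn

class StageFusionHead(nn.Module):
    """Projects each stage's feature map to a common width, pools, and
    concatenates, so coarse and fine behavioral cues are used jointly."""
    def __init__(self, stage_dims=(128, 256, 512, 1024), fused_dim=256,
                 num_classes=40):
        super().__init__()
        # 1x1 convs align channel counts across stages; the added
        # parameter count stays small, in the spirit of the ~1.6%
        # training overhead reported in the abstract.
        self.proj = nn.ModuleList(
            nn.Conv2d(d, fused_dim, kernel_size=1) for d in stage_dims)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(fused_dim * len(stage_dims), num_classes)

    def forward(self, stage_feats):
        # stage_feats: list of (B, C_i, H_i, W_i) maps from the backbone.
        pooled = [self.pool(p(f)).flatten(1)
                  for p, f in zip(self.proj, stage_feats)]
        return self.classifier(torch.cat(pooled, dim=1))

# Usage with dummy stage outputs shaped like Swin-B's four stages.
feats = [torch.randn(2, c, s, s)
         for c, s in zip((128, 256, 512, 1024), (56, 28, 14, 7))]
logits = StageFusionHead()(feats)
print(logits.shape)  # torch.Size([2, 40])

Pooling each projected stage before concatenation keeps the head cheap while still letting the classifier see every scale at once; a real implementation would attach such a head to the backbone's intermediate outputs rather than to random tensors.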


Metadata

Title: Swin-Fusion: Swin-Transformer with Feature Fusion for Human Action Recognition
Authors: Tiansheng Chen, Lingfei Mo
Publication date: 20-07-2023
Publisher: Springer US
Published in: Neural Processing Letters, Issue 8/2023
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI: https://doi.org/10.1007/s11063-023-11367-1
