Published in: Pattern Analysis and Applications 3/2023

12.05.2023 | Theoretical Advances

Multi-scale spatial–temporal convolutional neural network for skeleton-based action recognition

Authors: Qin Cheng, Jun Cheng, Ziliang Ren, Qieshi Zhang, Jianming Liu

Abstract

Skeleton data convey significant information for action recognition since they are robust against cluttered backgrounds and illumination variation. In recent years, methods based on convolutional neural networks (CNNs) or recurrent neural networks have lagged in recognition accuracy owing to their limited ability to extract spatial–temporal features from skeleton data. A series of methods based on graph convolutional networks (GCNs) have achieved remarkable performance and gradually become dominant. However, the computational cost of GCN-based methods is quite heavy; several works even exceed 100 GFLOPs. This is at odds with the highly condensed nature of skeleton data. In this paper, a novel multi-scale spatial–temporal convolutional (MSST) module is proposed to exploit the implicit complementary advantages across spatial–temporal representations of different scales. Instead of converting skeleton data into pseudo-images, as some previous CNN-based methods do, or using complex graph convolutions, we make full use of multi-scale convolutions on the temporal and spatial dimensions to capture comprehensive dependencies among skeleton joints. Building on the MSST module, a multi-scale spatial–temporal convolutional neural network (MSSTNet) is proposed to capture high-level spatial–temporal semantic features for action recognition. Unlike previous methods, which boost performance at the cost of computation, MSSTNet can be easily implemented with a light model size and fast inference. Moreover, MSSTNet is used in a four-stream framework to fuse data of different modalities, providing a notable improvement in recognition accuracy. On the NTU RGB+D 60, NTU RGB+D 120, UAV-Human and Northwestern-UCLA datasets, the proposed MSSTNet achieves competitive performance with much less computational cost than state-of-the-art methods.
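The core idea described above — parallel convolutions of different temporal scales, plus a spatial branch over the joint dimension, whose outputs are combined — can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the paper's implementation: the averaging kernels stand in for learned convolution weights, and the `adjacency` matrix and branch layout are assumptions for demonstration. Skeleton input follows the common (C, T, V) layout: C coordinate channels, T frames, V joints.

```python
import numpy as np

def temporal_conv(x, kernel_size):
    # x: (C, T, V). Depthwise temporal convolution with 'same' padding.
    # An averaging kernel stands in for learned weights (illustrative only).
    C, T, V = x.shape
    pad = kernel_size // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for t in range(T):
        out[:, t, :] = xp[:, t:t + kernel_size, :].mean(axis=1)
    return out

def spatial_conv(x, adjacency):
    # x: (C, T, V); adjacency: (V, V) normalized joint-neighbourhood matrix.
    # Aggregates each joint's features from its skeletal neighbours.
    return x @ adjacency

def msst_module(x, adjacency, temporal_scales=(3, 5, 7)):
    # Multi-scale spatial-temporal module (sketch): one spatial branch plus
    # one temporal branch per scale, concatenated along the channel axis,
    # so different receptive fields complement each other.
    branches = [spatial_conv(x, adjacency)]
    branches += [temporal_conv(x, k) for k in temporal_scales]
    return np.concatenate(branches, axis=0)  # ((len(scales)+1)*C, T, V)
```

With C = 3 coordinate channels and three temporal scales, the module emits 12 output channels per joint per frame; in a trained network each branch would carry its own learned filters rather than fixed averages.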
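The four-stream framework mentioned in the abstract fuses different skeleton modalities. A common recipe in this line of work (an assumption here, not confirmed by the abstract) is to derive bone and motion streams from the raw joints and average the per-stream class scores; the `parents` index array and equal fusion weights below are illustrative choices.

```python
import numpy as np

def bone_stream(joints, parents):
    # joints: (C, T, V); bone vector = joint position minus its parent's.
    # parents: length-V index array; the root joint is its own parent.
    return joints - joints[:, :, parents]

def motion_stream(x):
    # Frame-to-frame differences along the temporal axis,
    # zero-padded at the final frame to keep the length.
    m = np.zeros_like(x)
    m[:, :-1, :] = x[:, 1:, :] - x[:, :-1, :]
    return m

def fuse_scores(scores, weights=None):
    # scores: list of (num_classes,) softmax outputs, one per stream.
    # Weighted sum of stream scores (equal weights by default).
    s = np.stack(scores)
    w = np.full(len(scores), 1.0 / len(scores)) if weights is None else np.asarray(weights)
    return (w[:, None] * s).sum(axis=0)
```

Joint, bone, joint-motion and bone-motion streams would each feed a separate MSSTNet, with the final prediction taken as the argmax of the fused scores.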


Metadata
Title
Multi-scale spatial–temporal convolutional neural network for skeleton-based action recognition
Authors
Qin Cheng
Jun Cheng
Ziliang Ren
Qieshi Zhang
Jianming Liu
Publication date
12.05.2023
Publisher
Springer London
Published in
Pattern Analysis and Applications / Issue 3/2023
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-023-01156-w
