Published in: Neural Computing and Applications 15/2021

09.02.2021 | Original Article

DM-CTSA: a discriminative multi-focused and complementary temporal/spatial attention framework for action recognition

Authors: Ming Tong, Kaibo Yan, Lei Jin, Xing Yue, Mingyang Li


Abstract

Video-based human action recognition remains a challenging task, with three main limitations in prior work: (1) most methods are restricted to modeling a single temporal scale; (2) although a few methods consider multilevel motion features, they disregard the fact that different features usually contribute differently; and (3) most attention mechanisms attend only to important regions in frames without considering the spatial structure information around them. To address these issues, a discriminative multi-focused and complementary temporal/spatial attention framework is presented, consisting of a multi-focused temporal attention network with multi-granularity loss (M2TEAN) and a complementary spatial attention network with co-classification loss (C2SPAN). First, M2TEAN not only focuses on discriminative multilevel motion features but also highlights the more discriminative features among them. Specifically, a short-term discriminative attention sub-network and a middle-term consistent attention sub-network are constructed to focus on discriminative short-term and middle-term features, respectively, and a long-term evolutive attention sub-network is proposed to capture long-term action evolution over time. A multi-focused temporal attention module then further highlights the more discriminative features across scales. Second, C2SPAN captures discriminative regions in frames while mining the spatial structure information around them. Experiments show that our methods produce state-of-the-art results.
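The core idea of the multi-focused temporal attention module, weighting per-scale motion features (short-, middle-, and long-term) so that the more discriminative scales contribute more to the fused representation, can be sketched with softmax-normalized attention weights. This is a minimal illustrative sketch, not the authors' implementation: the function names and the toy scores are assumptions, and in the actual network the scores would be learned.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_focused_fusion(features, scores):
    """Fuse per-scale feature vectors (e.g. short-, middle-, long-term)
    into one representation, weighted by softmax-normalized scores so
    that more discriminative scales contribute more."""
    weights = softmax(scores)
    dim = len(features[0])
    fused = [0.0] * dim
    for w, feat in zip(weights, features):
        for i, v in enumerate(feat):
            fused[i] += w * v
    return fused

# Toy example: three temporal scales with 4-dim (one-hot) features,
# where the short-term scale is scored as most discriminative.
short_f  = [1.0, 0.0, 0.0, 0.0]
middle_f = [0.0, 1.0, 0.0, 0.0]
long_f   = [0.0, 0.0, 1.0, 0.0]
fused = multi_focused_fusion([short_f, middle_f, long_f], [2.0, 1.0, 0.5])
```

Because the toy features are one-hot, each component of `fused` equals the attention weight of the corresponding scale, so the highest-scored (short-term) scale dominates the fused vector.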


Metadata
Title
DM-CTSA: a discriminative multi-focused and complementary temporal/spatial attention framework for action recognition
Authors
Ming Tong
Kaibo Yan
Lei Jin
Xing Yue
Mingyang Li
Publication date
09.02.2021
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 15/2021
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-021-05698-0
