Published in: Neural Computing and Applications 14/2020

28.10.2019 | Original Article

Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition

Authors: Sunder Ali Khowaja, Seok-Lyong Lee



Abstract

Two-stream networks provide an alternative way of exploiting spatiotemporal information for the action recognition problem. Nevertheless, most two-stream variants fuse homogeneous modalities, which cannot efficiently capture the action-motion dynamics in videos. Moreover, existing studies cannot extend the number of streams beyond the number of modalities. To address these limitations, we propose hybrid and hierarchical fusion (HHF) networks. The hybrid fusion handles non-homogeneous modalities and introduces a cross-modal learning stream for effective modeling of motion dynamics, extending the networks from the existing two-stream variants to three and six streams. The hierarchical fusion, in turn, makes the modalities consistent by modeling long-term temporal information and combines multiple streams to improve recognition performance. The proposed network architecture comprises three fusion tiers: the hybrid fusion itself, a long-term fusion pooling layer that models long-term dynamics from the RGB and optical flow modalities, and an adaptive weighting scheme for combining the classification scores from the individual streams. We show that the hybrid fusion yields representations distinct from the base modalities for training the cross-modal learning stream. Extensive experiments show that the proposed six-stream HHF network outperforms existing two- and four-stream networks, achieving state-of-the-art recognition performance of 97.2% and 76.7% accuracy on the UCF101 and HMDB51 datasets, respectively, both of which are widely used in action recognition studies.
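The third fusion tier described above combines per-stream classification scores with adaptive weights. As a minimal illustrative sketch only (the paper's actual weighting scheme is not given in the abstract; deriving the weights from per-stream validation accuracy is an assumption here, as are all names below), a weighted late fusion of softmax scores could look like this:

```python
import numpy as np

def fuse_streams(stream_scores, stream_accuracies):
    """Weighted late fusion of classification scores from several streams.

    stream_scores: list of (num_videos, num_classes) softmax score arrays,
                   one per stream (e.g. RGB, optical flow, cross-modal).
    stream_accuracies: per-stream validation accuracies used as raw weights
                       (a hypothetical choice for this sketch).
    Returns the predicted class index for each video.
    """
    weights = np.asarray(stream_accuracies, dtype=float)
    weights = weights / weights.sum()  # normalize weights to sum to 1
    # Weighted sum of score matrices, then argmax over classes.
    fused = sum(w * s for w, s in zip(weights, stream_scores))
    return fused.argmax(axis=1)

# Toy example: 2 videos, 3 classes, 3 streams.
rgb  = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
flow = np.array([[0.5, 0.4, 0.1], [0.1, 0.7, 0.2]])
xmod = np.array([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]])
preds = fuse_streams([rgb, flow, xmod], [0.85, 0.88, 0.80])
print(preds)  # → [0 1]
```

The key design point is that fusion happens at the score level, so each stream can be trained independently and streams can be added or dropped without retraining the others.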


Metadata
Title
Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition
Authors
Sunder Ali Khowaja
Seok-Lyong Lee
Publication date
28.10.2019
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 14/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-019-04578-y
