Published in: Neural Computing and Applications 14/2020

28.10.2019 | Original Article

Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition

Authors: Sunder Ali Khowaja, Seok-Lyong Lee



Abstract

Two-stream networks provide an alternative way of exploiting spatiotemporal information for the action recognition problem. Nevertheless, most two-stream variants fuse homogeneous modalities, which cannot efficiently capture the action-motion dynamics in videos. Moreover, existing studies cannot extend the number of streams beyond the number of modalities. To address these limitations, we propose hybrid and hierarchical fusion (HHF) networks. The hybrid fusion handles non-homogeneous modalities and introduces a cross-modal learning stream for effective modeling of motion dynamics, extending the networks from the existing two-stream variants to three and six streams. The hierarchical fusion, in turn, makes the modalities consistent by modeling long-term temporal information and combines multiple streams to improve recognition performance. The proposed network architecture comprises three fusion tiers: the hybrid fusion itself, a long-term fusion pooling layer that models long-term dynamics from the RGB and optical flow modalities, and an adaptive weighting scheme for combining the classification scores from the individual streams. We show that the hybrid fusion yields representations distinct from the base modalities for training the cross-modal learning stream. Extensive experiments show that the proposed six-stream HHF network outperforms existing two- and four-stream networks, achieving state-of-the-art recognition performance of 97.2% and 76.7% accuracy on the UCF101 and HMDB51 datasets, respectively, both of which are widely used in action recognition studies.
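The third fusion tier described above combines per-stream classification scores with adaptive weights. As a minimal illustrative sketch only (the paper's actual weighting scheme is not given in the abstract; deriving the weights from per-stream validation accuracy is an assumption here, as are all names below), a weighted late fusion of softmax scores could look like this:

```python
import numpy as np

def fuse_streams(stream_scores, stream_accuracies):
    """Weighted late fusion of classification scores from several streams.

    stream_scores: list of (num_videos, num_classes) softmax score arrays,
                   one per stream (e.g. RGB, optical flow, cross-modal).
    stream_accuracies: per-stream validation accuracies used as raw weights
                       (a hypothetical choice for this sketch).
    Returns the predicted class index for each video.
    """
    weights = np.asarray(stream_accuracies, dtype=float)
    weights = weights / weights.sum()  # normalize weights to sum to 1
    # Weighted sum of score matrices, then argmax over classes.
    fused = sum(w * s for w, s in zip(weights, stream_scores))
    return fused.argmax(axis=1)

# Toy example: 2 videos, 3 classes, 3 streams.
rgb  = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
flow = np.array([[0.5, 0.4, 0.1], [0.1, 0.7, 0.2]])
xmod = np.array([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]])
preds = fuse_streams([rgb, flow, xmod], [0.85, 0.88, 0.80])
print(preds)  # → [0 1]
```

The key design point is that fusion happens at the score level, so each stream can be trained independently and streams can be added or dropped without retraining the others.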


Metadata
Title
Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition
Authors
Sunder Ali Khowaja
Seok-Lyong Lee
Publication date
28.10.2019
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 14/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-019-04578-y
