Published in: Neural Processing Letters 2/2019

03-01-2019

Spatiotemporal Fusion Networks for Video Action Recognition

Authors: Zheng Liu, Haifeng Hu, Junxuan Zhang

Abstract

Learning spatiotemporal information is a fundamental part of action recognition. In this work, we extract efficient spatiotemporal information for video representation through a novel architecture, termed SpatioTemporal Fusion Networks (STFN). STFN extracts spatiotemporal information by introducing connections between the spatial and temporal streams of two-stream networks via fusion blocks, called Compactly Fuse Spatial and Temporal information (CFST) blocks, whose goal is to integrate spatial and temporal information at little computational cost. CFST is built upon Compact Bilinear Pooling, which captures multiplicative interactions at corresponding locations. For better integration of the two streams, we explore the fusion configuration: where to insert the fusion blocks, and how to combine the CFST block with additive interaction. We evaluate the proposed architecture on UCF-101 and HMDB-51, and obtain comparable performance.
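To illustrate the kind of operation the abstract describes, the following is a minimal sketch of Compact Bilinear Pooling (the Tensor Sketch approximation) fusing a spatial and a temporal feature vector at one location. This is not the authors' implementation; the function names, the projected dimension `d=512`, and the 256-dim inputs are illustrative assumptions. The key idea: a Count Sketch of the outer product of two vectors equals the circular convolution of their individual sketches, which can be computed cheaply in the FFT domain, avoiding the quadratic-size outer product.

```python
import numpy as np

def count_sketch(x, h, s, d):
    # Project x (dim c) to d dims: each input coordinate i is hashed to
    # bucket h[i] with random sign s[i], and contributions are accumulated.
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def compact_bilinear_pool(x_s, x_t, d=512, seed=0):
    """Approximate the outer product of spatial (x_s) and temporal (x_t)
    features via Count Sketch + FFT (illustrative sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    c = x_s.shape[0]
    # Random (but fixed per network) hash buckets and signs for each stream.
    h1, h2 = rng.integers(0, d, c), rng.integers(0, d, c)
    s1, s2 = rng.choice([-1, 1], c), rng.choice([-1, 1], c)
    y1 = count_sketch(x_s, h1, s1, d)
    y2 = count_sketch(x_t, h2, s2, d)
    # Circular convolution of the two sketches, done as an elementwise
    # product in the FFT domain, sketches the full bilinear interaction.
    fused = np.fft.ifft(np.fft.fft(y1) * np.fft.fft(y2)).real
    return fused

# Example: fuse two 256-dim per-location features into one 512-dim descriptor.
x_s = np.random.rand(256)   # spatial-stream feature at one location
x_t = np.random.rand(256)   # temporal-stream feature at the same location
z = compact_bilinear_pool(x_s, x_t)
print(z.shape)  # (512,)
```

Note how the fused descriptor has size `d` rather than the 256 × 256 of the explicit outer product, which is what makes per-location multiplicative fusion affordable inside a network.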


Metadata
Title
Spatiotemporal Fusion Networks for Video Action Recognition
Authors
Zheng Liu
Haifeng Hu
Junxuan Zhang
Publication date
03-01-2019
Publisher
Springer US
Published in
Neural Processing Letters / Issue 2/2019
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-018-09972-6
