Published in: Neural Processing Letters 1/2019

03-10-2018

Action Recognition Using Multiple Pooling Strategies of CNN Features

Authors: Haifeng Hu, Zhongke Liao, Xiang Xiao



Abstract

Deep convolutional neural networks (CNNs) have shown great potential in human action recognition. To obtain a compact and discriminative feature representation, this paper proposes multiple pooling strategies over CNN features. We explore three pooling strategies: space-time feature pooling (STFP), time filter pooling (TFP) and spatio-temporal pyramid pooling (STPP). STFP shares the advantages of both hand-crafted features and deep ConvNet features. TFP captures how the elements of each CNN feature map change over time. STPP exploits the spatial and temporal pyramid structure of the feature maps. We aggregate these pooled features into a new discriminative video descriptor. Experimental results on the challenging YouTube, UCF50 and UCF101 datasets show that the three strategies have complementary advantages, and that our video representation is comparable to previous state-of-the-art algorithms.
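
The abstract does not give the details of STPP, so the following is only a generic sketch of what pooling over a spatio-temporal pyramid can look like, not the authors' implementation. It assumes the CNN features of a video are stored as a NumPy array of shape (T, H, W, C), that max pooling is used inside each cell, and that each pyramid level n splits time, height and width into n segments each. The function name `spatio_temporal_pyramid_pool` and the `levels` parameter are hypothetical.

```python
import numpy as np


def spatio_temporal_pyramid_pool(features, levels=(1, 2)):
    """Max-pool a (T, H, W, C) feature volume over a pyramid of space-time cells.

    Illustrative only: cell layout and pooling operator are assumptions,
    not taken from the paper.
    """
    t, h, w, _ = features.shape
    if min(t, h, w) < max(levels):
        raise ValueError("each axis must have at least max(levels) elements")

    pooled = []
    for n in levels:
        # Boundaries that split each axis into n roughly equal segments.
        t_edges = np.linspace(0, t, n + 1).astype(int)
        h_edges = np.linspace(0, h, n + 1).astype(int)
        w_edges = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    cell = features[
                        t_edges[i]:t_edges[i + 1],
                        h_edges[j]:h_edges[j + 1],
                        w_edges[k]:w_edges[k + 1],
                    ]
                    # One C-dimensional vector per cell: per-channel maximum.
                    pooled.append(cell.max(axis=(0, 1, 2)))
    # Fixed-length descriptor: (sum of n**3 over levels) * C values.
    return np.concatenate(pooled)


if __name__ == "__main__":
    video_features = np.random.rand(8, 6, 6, 4)  # toy (T, H, W, C) volume
    descriptor = spatio_temporal_pyramid_pool(video_features)
    print(descriptor.shape)
```

With the assumed levels (1, 2) the pyramid has 1 + 8 = 9 cells, so the descriptor length is 9 times the channel count regardless of the video's duration or the feature map's spatial size. That fixed length is what makes pyramid pooling convenient for feeding a classifier.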

Metadata
Title
Action Recognition Using Multiple Pooling Strategies of CNN Features
Authors
Haifeng Hu
Zhongke Liao
Xiang Xiao
Publication date
03-10-2018
Publisher
Springer US
Published in
Neural Processing Letters / Issue 1/2019
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-018-9932-3
