Published in: International Journal of Computer Vision 2/2015

01-01-2015

Pose Adaptive Motion Feature Pooling for Human Action Analysis

Authors: Bingbing Ni, Pierre Moulin, Shuicheng Yan

Abstract

Ineffective spatial–temporal motion feature pooling has been a fundamental bottleneck for human action recognition and detection for decades. Previous pooling schemes, such as global, spatial–temporal pyramid, or human- and object-centric pooling, fail to capture discriminative motion patterns because informative movements occur only in specific regions of the human body, and those regions depend on the type of action being performed. Global (holistic) motion feature pooling methods therefore often yield action representations with limited discriminative capability. To address this limitation, we propose an adaptive motion feature pooling scheme that uses human poses as side information. Such poses can be detected, for instance, in assisted living and indoor smart surveillance scenarios. Treating both the video sub-volumes used for pooling and the human pose types as hidden variables, we formulate motion feature pooling as a latent structural learning problem in which the relationship between the discriminative pooling sub-volumes and the pose types is learned. The resulting pose adaptive motion feature pooling scheme is tested extensively on assisted living and smart surveillance datasets and on general action recognition benchmarks, where it demonstrates improved action recognition and detection performance.
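
The idea of treating the pooling sub-volume and the pose type as hidden variables can be illustrated with a small inference sketch. The abstract does not state the scoring function, so the linear form below, the function name infer_latent, and all of its parameters are assumptions made for illustration only; this is not the authors' formulation.

```python
def infer_latent(pooled, pose_conf, w_pool, w_pose):
    """Pick the (sub-volume, pose type) pair with the highest score.

    All inputs are hypothetical:
      pooled    -- list of pooled motion feature vectors, one per candidate sub-volume
      pose_conf -- list of pose detector confidences, one per pose type
      w_pool    -- list of weight vectors, one per pose type
      w_pose    -- list of scalar weights on the pose confidences
    Returns (best_score, sub_volume_index, pose_type_index).
    """
    best = None
    for h, feat in enumerate(pooled):          # latent variable 1: sub-volume
        for p, weights in enumerate(w_pool):   # latent variable 2: pose type
            # Assumed linear score: pose-specific pooling term plus pose term.
            score = sum(w * f for w, f in zip(weights, feat))
            score += w_pose[p] * pose_conf[p]
            if best is None or score > best[0]:
                best = (score, h, p)
    return best


if __name__ == "__main__":
    pooled = [[1.0, 0.0], [0.0, 1.0]]
    w_pool = [[0.0, 2.0], [1.0, 0.0]]
    print(infer_latent(pooled, [0.0, 0.0], w_pool, [0.0, 0.0]))
```

In a latent structural learning setup, a maximization of this kind would typically be alternated with updates of the weights; the exhaustive double loop is only practical when the number of candidate sub-volumes and pose types is small.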


Footnotes
1
We also tested other spatial partition schemes offline, including: (1) a vertical-4region-overlap scheme; (2) a vertical-3region-nonoverlap scheme; and (3) a vertical-3region-overlap/horizontal-2region-nonoverlap scheme (obtained by simply adding a horizontal cut through the middle of the partition scheme used in this work). The results show that the overlapping partition scheme outperforms the non-overlapping version, and that the six-region partition scheme (i.e., the vertical-3region-overlap/horizontal-2region-nonoverlap scheme) only slightly outperforms the three-region partition scheme currently used, at a much higher computational cost. Therefore, in this work we use the vertical-3region-overlap partition scheme, which also corresponds naturally to the head-upper torso, torso, and lower torso-leg regions.
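
A vertical three-region overlapping partition of this kind can be sketched as follows. The footnote does not give the amount of overlap or the region boundaries, so the 50 % overlap, the function names, and the representation of motion features as (vertical position, codeword index) pairs are all assumptions made for illustration.

```python
def vertical_overlap_regions(top, bottom, n_regions=3, overlap=0.5):
    """Split the span [top, bottom] into equal-height, overlapping bands.

    `overlap` is the assumed fraction of a band shared with the next band.
    """
    height = bottom - top
    # n bands of height b with stride b * (1 - overlap) must cover the span:
    #   b + (n - 1) * b * (1 - overlap) = height
    band = height / (1 + (n_regions - 1) * (1 - overlap))
    stride = band * (1 - overlap)
    return [(top + i * stride, top + i * stride + band) for i in range(n_regions)]


def pool_by_region(points, regions, vocab_size):
    """Build one codeword histogram per region.

    `points` is a hypothetical list of (y, codeword_index) pairs; a point
    that falls inside two overlapping bands is counted in both.
    """
    hists = [[0] * vocab_size for _ in regions]
    for y, word in points:
        for r, (lo, hi) in enumerate(regions):
            if lo <= y <= hi:
                hists[r][word] += 1
    return hists


if __name__ == "__main__":
    regions = vertical_overlap_regions(0, 100)
    print(regions)
    print(pool_by_region([(10, 0), (30, 1), (60, 1)], regions, vocab_size=2))
```

With three bands and 50 % overlap, a person bounding box of height 100 is covered by bands spanning 0 to 50, 25 to 75, and 50 to 100, which loosely mirrors the head-upper torso, torso, and lower torso-leg grouping described in the footnote.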
 
2
We tested our poselet key-framing implementation offline on the UT-Interaction dataset. Our recognition result (accuracy on half videos) on that dataset is \(71.5\,\%\), which is comparable to the \(73.3\,\%\) reported in the original work (Raptis and Sigal 2013). Note that the manual annotations used in Raptis and Sigal (2013) are not available.
 
Literature
Andrews, S., & Tsochantaridis, I. (2003). Support vector machines for multiple instance learning. In: Advances in neural information processing systems (pp. 561–568). MIT Press.
Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent System and Technology, 2(27), 1–27.
Chen, Q., Song, Z., Hua, Y., Huang, Z., & Yan, S. (2011). Hierarchical matching with side information for image classification. In: International conference on computer vision and pattern recognition.
Choi, J., Jeon, W.J., & Lee, S.C. (2008). Spatio-temporal pyramid matching for sports videos. In: ACM multimedia information retrieval.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In: International conference on computer vision and pattern recognition (pp. 886–893).
Dollar, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In: VS-PETS.
Duchenne, O., Laptev, I., Sivic, J., Bach, F., & Ponce, J. (2009). Automatic annotation of human actions in video. In: International conference on computer vision (pp. 1491–1498).
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Yang, J., Yu, K., Gong, Y., & Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. In: International conference on computer vision and pattern recognition.
Jiang, Y., Yuan, J., & Yu, G. (2012). Randomized spatial partition for scene recognition. In: European conference on computer vision.
Kanan, C., & Cottrell, G. (2010). Robust classification of objects, faces, and flowers using natural image statistics. In: International conference on computer vision and pattern recognition.
Klaser, A., Marszalek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3d gradients. In: British machine vision conference.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In: International conference on computer vision.
Laptev, I., & Lindeberg, T. (2003). Space-time interest points. In: International conference on computer vision.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: International conference on computer vision and pattern recognition.
Lv, F., & Nevatia, R. (2007). Single view human action recognition using key pose matching and viterbi path searching. In: International conference computer vision and pattern recognition.
Marszalek, M., Laptev, I., & Schmid, C. (2009). Actions in context. In: International conference on computer vision and pattern recognition (pp. 2929–2936). Retrieved June, 2009.
Ni, B., Wang, G., & Moulin, P. (2011). RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In: ICCV workshops (pp. 1147–1153).
Niebles, J.C., Chen, C.W., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In: European conference on computer vision (pp. 392–405).
Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the fisher kernel for large-scale image classification. In: European conference on computer vision (pp. 143–156).
Raptis, M., & Sigal, L. (2013). Poselet key-framing: A model for human activity recognition. In: International conference on computer vision and pattern recognition (pp. 2650–2657).
Raptis, M., Kokkinos, I., & Soatto, S. (2012). Discovering discriminative action parts from mid-level video representations. In: International conference on computer vision and pattern recognition.
Russakovsky, O., Lin, Y., Yu, K., & Fei-Fei, L. (2012). Object-centric spatial pooling for image classification. In: European conference on computer vision.
Ryoo, M.S., & Aggarwal, J. (2009). Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In: International conference on computer vision (pp. 1593–1600).
Satkin, S., & Hebert, M. (2010). Modeling the temporal extent of actions. In: European conference on computer vision (pp. 536–548).
Schuldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local SVM approach. In: International conference on pattern recognition.
Shi, Q., Wang, L., Cheng, L., & Smola, A. (2011). Discriminative human action segmentation and recognition using semi-markov models. International Journal of Computer Vision, 93(1), 22–32.
Tang, K., Fei-Fei, L., & Koller, D. (2012). Learning latent temporal structure for complex event detection. In: International conference on computer vision and pattern recognition.
Vahdat, A., Gao, B., Ranjbar, M., & Mori, G. (2011). A discriminative key pose sequence model for recognizing human interactions. In: ICCV workshop (pp. 1729–1736).
Wang, G., & Forsyth, D. (2009). Joint learning of visual attributes, object classes and visual saliency. In: International conference on computer vision.
Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In: International conference on computer vision.
Wang, H., Kläser, A., Schmid, C., & Cheng-Lin, L. (2011). Action recognition by dense trajectories. In: International conference on computer vision and pattern recognition (pp. 3169–3176).
Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), 60–79.
Wang, J., Liu, Z., Wu, Y., & Yuan, J. (2012). Mining actionlet ensemble for action recognition with depth cameras. In: International conference on computer vision and pattern recognition (pp. 1290–1297).
Wang, Y., & Mori, G. (2011). Hidden part models for human action recognition: Probabilistic versus max margin. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7), 1310–1323.
Wolf, C., Mille, J., Lombardi, L., Celiktutan, O., Jiu, M., Baccouche, M., Dellandrea, E., Bichot, C., Garcia, C., & Sankur, B. (2012). The liris human activities dataset and the icpr 2012 human activities recognition and localization competition. Technical report RR-LIRIS-2012-004, LIRIS laboratory, URL http://liris.cnrs.fr/harl2012/evaluation.html
Yakhnenko, O., & Verbeek, J. (2011). Region-based image classification with a latent SVM model. Technical report, INRIA.
Yamato, J., Ohya, J., & Ishii, K. (1992). Recognizing human action in time-sequential images using hidden markov model. In: International conference on computer vision and pattern recognition (pp. 379–385).
Yuan, J., Liu, Z., & Wu, Y. (2011). Discriminative video pattern search for efficient action detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9), 1728–1743.
Metadata
Title
Pose Adaptive Motion Feature Pooling for Human Action Analysis
Authors
Bingbing Ni
Pierre Moulin
Shuicheng Yan
Publication date
01-01-2015
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 2/2015
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-014-0742-4
