research-article
DOI: 10.1145/2647868.2654912

3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks

Published: 03 November 2014

ABSTRACT

Human activity understanding with 3D/depth sensors has received increasing attention in multimedia processing and interaction. This work develops a novel deep model for automatic activity recognition from RGB-D videos. We represent each human activity as an ensemble of cubic video segments and learn to discover the temporal structure of each activity category, i.e., how an activity is decomposed into actions for classification. Our model can be regarded as a structured deep architecture: it extends convolutional neural networks (CNNs) by incorporating structural alternatives. Specifically, we build the network from 3D convolution and max-pooling operators applied over the video segments, and introduce latent variables into each convolutional layer that manipulate the activation of its neurons. Our model thus advances existing approaches in two respects: (i) it acts directly on the raw input (grayscale-depth data) instead of relying on hand-crafted features, and (ii) its structure can be dynamically adjusted to account for the temporal variations of human activities, i.e., the network configuration is allowed to be only partially activated during inference. For training, we propose an EM-type optimization method that iteratively (i) discovers the latent structure by determining the decomposed actions for each training example, and (ii) learns the network parameters with the back-propagation algorithm. Our approach is validated in challenging scenarios and outperforms state-of-the-art methods. In addition, we present a large human activity database of RGB-D videos.
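To make the first point concrete, here is a minimal NumPy sketch, not the authors' implementation, of 3D convolution and max-pooling applied directly to a raw grayscale-depth segment. The kernel sizes, segment dimensions, and pooled-mean feature are illustrative assumptions.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution of a (T, H, W) volume with a (t, h, w) kernel."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i + t, j:j + h, k:k + w] * kernel)
    return out

def max_pool3d(volume, size=2):
    """Non-overlapping 3D max-pooling over time, height, and width."""
    T, H, W = (d // size * size for d in volume.shape)
    v = volume[:T, :H, :W]
    return v.reshape(T // size, size, H // size, size, W // size, size).max(axis=(1, 3, 5))

def segment_features(segment, kernels):
    """One cubic segment (T, H, W, 2) -> pooled responses per channel and kernel.
    Channel 0 is grayscale, channel 1 is depth; ReLU, then pool, then average."""
    return np.array([max_pool3d(np.maximum(conv3d_valid(segment[..., c], k), 0)).mean()
                     for c in range(segment.shape[-1]) for k in kernels])

# Toy usage: one 4-frame, 16x16, grayscale+depth segment and two random kernels.
rng = np.random.default_rng(0)
segment = rng.standard_normal((4, 16, 16, 2))
kernels = [rng.standard_normal((3, 3, 3)) for _ in range(2)]
print(segment_features(segment, kernels))  # 4 values: 2 channels x 2 kernels
```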

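The EM-type training procedure can be sketched in the same spirit, reusing `segment_features`, `kernels`, and `rng` from the snippet above. The decomposition search space here is deliberately tiny, and a perceptron-style update on binary labels stands in for the paper's back-propagation step; both are simplifying assumptions for illustration.

```python
from itertools import combinations
import numpy as np

def decompositions(n_frames, seg_len, n_segments):
    """Candidate latent structures: ordered, non-overlapping segment start frames."""
    return list(combinations(range(0, n_frames - seg_len + 1, seg_len), n_segments))

def video_feature(video, decomp, kernels, seg_len):
    """Average the per-segment features (segment_features from the sketch above)."""
    return np.mean([segment_features(video[s:s + seg_len], kernels) for s in decomp], axis=0)

def em_train(videos, labels, kernels, weights, seg_len=4, n_segments=2, iters=3, lr=0.1):
    for _ in range(iters):
        # E-step: fix the parameters; pick each video's best-scoring decomposition.
        latent = [max(decompositions(v.shape[0], seg_len, n_segments),
                      key=lambda d: weights @ video_feature(v, d, kernels, seg_len))
                  for v in videos]
        # M-step: fix the decompositions; update the parameters. A perceptron-style
        # step on labels in {-1, +1} stands in for full back-propagation here.
        for v, y, d in zip(videos, labels, latent):
            feat = video_feature(v, d, kernels, seg_len)
            if y * (weights @ feat) <= 0:
                weights = weights + lr * y * feat
    return weights

# Toy usage: four 12-frame grayscale+depth videos with binary labels.
videos = [rng.standard_normal((12, 16, 16, 2)) for _ in range(4)]
weights = em_train(videos, [1, -1, 1, -1], kernels, rng.standard_normal(4))
print("learned weights:", weights)
```

The alternation mirrors the abstract's description: the E-step fixes the network parameters and searches over latent structures, while the M-step fixes the chosen structures and updates the parameters.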

Published in

MM '14: Proceedings of the 22nd ACM International Conference on Multimedia
November 2014, 1310 pages
ISBN: 9781450330633
DOI: 10.1145/2647868

      Copyright © 2014 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

Acceptance Rates

MM '14 paper acceptance rate: 55 of 286 submissions, 19%
Overall acceptance rate: 995 of 4,171 submissions, 24%
