ABSTRACT
Human activity understanding with 3D/depth sensors has received increasing attention in multimedia processing and interaction. This work develops a novel deep model for automatic activity recognition from RGB-D videos. We represent each human activity as an ensemble of cubic-like video segments, and learn to discover the temporal structures for a category of activities, i.e., how an activity is decomposed into actions for classification. Our model can be regarded as a structured deep architecture: it extends convolutional neural networks (CNNs) by incorporating structural alternatives. Specifically, we build a network consisting of 3D convolutions and max-pooling operators over the video segments, and introduce latent variables into each convolutional layer that manipulate the activation of neurons. Our model thus advances existing approaches in two respects: (i) it acts directly on the raw inputs (grayscale-depth data) instead of relying on hand-crafted features, and (ii) the model structure can be dynamically adjusted to account for the temporal variations of human activities, i.e., the network configuration may be only partially activated during inference. For model training, we propose an EM-type optimization method that iteratively (i) discovers the latent structure by determining the decomposed actions for each training example, and (ii) learns the network parameters with the back-propagation algorithm. Our approach is validated in challenging scenarios and outperforms state-of-the-art methods. In addition, we present a large RGB-D video database of human activities.
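The two building blocks described above (3D convolution/max-pooling over video segments, and latent-structure selection in the E-step of the EM-type training) can be illustrated with a toy NumPy sketch. This is not the authors' implementation; the function names `conv3d`, `max_pool3d`, and `best_decomposition` are hypothetical, and the "score" here is a simple sum of rectified filter responses standing in for the network's classification score.

```python
import numpy as np

def conv3d(video, kernel):
    """Valid 3D convolution of a (T, H, W) volume with a (t, h, w) kernel."""
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * kernel)
    return out

def max_pool3d(x, size=2):
    """Non-overlapping 3D max pooling (trailing remainder is cropped)."""
    T, H, W = (d // size for d in x.shape)
    x = x[:T * size, :H * size, :W * size]
    return x.reshape(T, size, H, size, W, size).max(axis=(1, 3, 5))

def best_decomposition(video, kernels, candidates):
    """E-step sketch: among candidate temporal split points, pick the
    decomposition whose segments respond most strongly to the 3D filters.
    This stands in for inferring the latent structure variables."""
    def score(split):
        segments = np.split(video, split, axis=0)
        return sum(max_pool3d(np.maximum(conv3d(seg, k), 0)).sum()
                   for seg in segments for k in kernels)
    return max(candidates, key=score)
```

In the full EM-type scheme, this selection step would alternate with a back-propagation pass that updates the filters given the chosen decomposition; only the selection step is sketched here.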
Index Terms
- 3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks