ABSTRACT
Human activity understanding with 3D/depth sensors has received increasing attention in multimedia processing and interaction. This work develops a novel deep model for automatic activity recognition from RGB-D videos. We represent each human activity as an ensemble of cubic-like video segments, and learn to discover the temporal structures for a category of activities, i.e., how an activity is decomposed into actions for classification. Our model can be regarded as a structured deep architecture: it extends convolutional neural networks (CNNs) by incorporating structural alternatives. Specifically, we build a network consisting of 3D convolutions and max-pooling operators over the video segments, and introduce latent variables into each convolutional layer that manipulate the activation of neurons. Our model thus advances existing approaches in two respects: (i) it acts directly on the raw inputs (grayscale-depth data) instead of relying on hand-crafted features, and (ii) the model structure can be dynamically adjusted to account for the temporal variations of human activities, i.e., the network configuration may be only partially activated during inference. For model training, we propose an EM-type optimization method that iteratively (i) discovers the latent structure by determining the decomposed actions for each training example, and (ii) learns the network parameters with the back-propagation algorithm. Our approach is validated in challenging scenarios and outperforms state-of-the-art methods. In addition, we present a large RGB-D video database of human activities.
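The two building blocks described above (3D convolution/max-pooling over video segments, and latent-structure selection in the E-step of the EM-type training) can be illustrated with a toy NumPy sketch. This is not the authors' implementation; the function names `conv3d`, `max_pool3d`, and `best_decomposition` are hypothetical, and the "score" here is a simple sum of rectified filter responses standing in for the network's classification score.

```python
import numpy as np

def conv3d(video, kernel):
    """Valid 3D convolution of a (T, H, W) volume with a (t, h, w) kernel."""
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * kernel)
    return out

def max_pool3d(x, size=2):
    """Non-overlapping 3D max pooling (trailing remainder is cropped)."""
    T, H, W = (d // size for d in x.shape)
    x = x[:T * size, :H * size, :W * size]
    return x.reshape(T, size, H, size, W, size).max(axis=(1, 3, 5))

def best_decomposition(video, kernels, candidates):
    """E-step sketch: among candidate temporal split points, pick the
    decomposition whose segments respond most strongly to the 3D filters.
    This stands in for inferring the latent structure variables."""
    def score(split):
        segments = np.split(video, split, axis=0)
        return sum(max_pool3d(np.maximum(conv3d(seg, k), 0)).sum()
                   for seg in segments for k in kernels)
    return max(candidates, key=score)
```

In the full EM-type scheme, this selection step would alternate with a back-propagation pass that updates the filters given the chosen decomposition; only the selection step is sketched here.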
Index Terms
- 3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks