Published in: International Journal of Computer Vision 3/2017

21.01.2017

Max-Margin Heterogeneous Information Machine for RGB-D Action Recognition

Authors: Yu Kong, Yun Fu


Abstract

We propose a novel approach, the max-margin heterogeneous information machine (MMHIM), for human action recognition from RGB-D videos. MMHIM fuses heterogeneous RGB visual features and depth features, and learns effective action classifiers from the fused features. The rich heterogeneous visual and depth data are compressed and projected to a learned shared space and to independent private spaces, in order to reduce noise and capture the information useful for recognition. Knowledge from each source can then be shared with the others in the learned spaces to produce cross-modal features, which guides the discovery of valuable information for recognition. To capture complex spatiotemporal structural relationships in the visual and depth features, we represent both RGB and depth data in matrix form. We formulate the recognition task as a low-rank bilinear model composed of row and column parameter matrices, and minimize the rank of the model parameter to obtain a low-rank classifier with improved generalization power. We also extend MMHIM to a structured prediction model capable of producing structured outputs. Extensive experiments on a new RGB-D action dataset and two other public RGB-D action datasets show that our approaches achieve state-of-the-art results. Promising results are also obtained when RGB or depth data are missing during training or testing.
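To make the bilinear formulation concrete, the low-rank classifier at the core of the model can be sketched as follows. This is a minimal single-modality illustration under stated assumptions (a matrix-form feature \(X \in \mathbb{R}^{d \times t}\), binary labels \(y_i \in \{-1,+1\}\), and a hinge loss); the shared/private space projections and the cross-modal fusion terms of MMHIM itself are omitted. Factorizing the parameter matrix \(W\) into a row matrix \(U\) and a column matrix \(V\) bounds its rank by construction:

\[
f(X) = \operatorname{tr}(W^{\top} X) = \operatorname{tr}(V U^{\top} X),
\qquad W = U V^{\top},\; U \in \mathbb{R}^{d \times r},\; V \in \mathbb{R}^{t \times r},\; \operatorname{rank}(W) \le r .
\]

Minimizing the rank of \(W\) is then typically relaxed to its convex surrogate, the nuclear norm \(\lVert W \rVert_{*}\), giving a max-margin objective of the form

\[
\min_{U,V}\; \lVert U V^{\top} \rVert_{*} \;+\; C \sum_{i} \max\bigl(0,\; 1 - y_i\, f(X_i)\bigr),
\]

which trades classifier complexity (rank) against margin violations on the training set.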


Footnotes
1
https://forge.lip6.fr/projects/nrbm
 
2
Please refer to the supplemental material for details.
 
3
Please refer to the supplemental material for formulations of bilinear SVM, BHIM, and MMHIM in single-modality learning.
 
4
Technically, the feature O here is not shared between the two modalities, as it is computed only from RGB data.
 
Metadata
Title
Max-Margin Heterogeneous Information Machine for RGB-D Action Recognition
Authors
Yu Kong
Yun Fu
Publication date
21.01.2017
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 3/2017
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-016-0982-6
