Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation

Authors: Sijin Li, Weichen Zhang, Antoni B. Chan

Published in: International Journal of Computer Vision | Issue 1/2017

Publication date: 1 October 2016

Abstract

This paper focuses on structured-output learning using deep neural networks for 3D human pose estimation from monocular images. Our network takes an image and 3D pose as inputs and outputs a score value, which is high when the image-pose pair matches and low otherwise. The network structure consists of a convolutional neural network for image feature extraction, followed by two sub-networks for transforming the image features and pose into a joint embedding. The score function is then the dot-product between the image and pose embeddings. The image-pose embedding and score function are jointly trained using a maximum-margin cost function. Our proposed framework can be interpreted as a special form of structured support vector machines where the joint feature space is discriminatively learned using deep neural networks. We also propose an efficient recurrent neural network for performing inference with the learned image-embedding. We test our framework on the Human3.6m dataset and obtain state-of-the-art results compared to other recent methods. Finally, we present visualizations of the image-pose embedding space, demonstrating the network has learned a high-level embedding of body-orientation and pose-configuration.
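The scoring and training idea in the abstract can be illustrated with a minimal sketch. It assumes the two sub-networks have already produced an image embedding and a pose embedding, and it uses the standard margin-rescaled hinge form from structured SVMs as a stand-in, since the abstract does not spell out the exact cost. The function names, the toy vectors, and the choice of mean per-joint distance as the margin are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score(image_emb, pose_emb):
    """Image-pose score: dot product of the two embeddings."""
    return dot(image_emb, pose_emb)


def pose_distance(pose_a, pose_b):
    """Assumed margin term: mean Euclidean distance between matching 3D joints."""
    return sum(math.dist(p, q) for p, q in zip(pose_a, pose_b)) / len(pose_a)


def max_margin_loss(image_emb, true_emb, other_emb, margin):
    """Hinge loss: zero once the true pose outscores the other pose by `margin`."""
    return max(0.0, score(image_emb, other_emb) + margin - score(image_emb, true_emb))


# Toy values standing in for the outputs of the two embedding sub-networks.
image_emb = [1.0, 0.0]
true_emb = [2.0, 0.0]   # embedding of the ground-truth pose
other_emb = [1.0, 1.0]  # embedding of a competing pose

# Toy 3D poses with two joints each, used only to compute the margin.
true_pose = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
other_pose = [(0.0, 0.0, 0.0), (3.0, 5.0, 0.0)]

margin = pose_distance(true_pose, other_pose)
loss = max_margin_loss(image_emb, true_emb, other_emb, margin)
print(margin, loss)
```

In this sketch the loss is positive whenever a competing pose scores within the margin of the true pose, so minimizing it pushes matching image-pose pairs toward higher scores than mismatched ones. Scaling the margin by pose distance means that poses far from the ground truth must be separated by a larger score gap.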

Note that \({\hat{y}}\) depends on the input (x, y) and network parameters \(\theta \). To reduce clutter, we write \({\hat{y}}\) instead of \({\hat{y}}(x,y,\theta )\) when no confusion arises.

The action “Direction” is not included due to video corruption.

For better visualization, we only use the images from a single subject.

Title: Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation
Authors: Sijin Li, Weichen Zhang, Antoni B. Chan
Publication date: 1 October 2016
Publisher: Springer US
Published in: International Journal of Computer Vision / Issue 1/2017
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-016-0962-x
