Published in: International Journal of Computer Vision 1/2017

01.10.2016

Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation

Authors: Sijin Li, Weichen Zhang, Antoni B. Chan

Abstract

This paper focuses on structured-output learning using deep neural networks for 3D human pose estimation from monocular images. Our network takes an image and a 3D pose as inputs and outputs a score value, which is high when the image-pose pair matches and low otherwise. The network structure consists of a convolutional neural network for image feature extraction, followed by two sub-networks for transforming the image features and pose into a joint embedding. The score function is then the dot product between the image and pose embeddings. The image-pose embedding and score function are jointly trained using a maximum-margin cost function. Our proposed framework can be interpreted as a special form of structured support vector machines where the joint feature space is discriminatively learned using deep neural networks. We also propose an efficient recurrent neural network for performing inference with the learned image embedding. We test our framework on the Human3.6m dataset and obtain state-of-the-art results compared to other recent methods. Finally, we present visualizations of the image-pose embedding space, demonstrating that the network has learned a high-level embedding of body orientation and pose configuration.
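The dot-product score and the maximum-margin idea described in the abstract can be illustrated with a minimal sketch. It assumes the two sub-networks have already produced embedding vectors, and it uses the standard margin-rescaled hinge form from structured SVMs; the function names, the `margin` argument, and the toy vectors are hypothetical, and the paper's exact cost function may differ.

```python
def score(image_emb, pose_emb):
    """Score of an image-pose pair: dot product of the two embeddings."""
    return sum(a * b for a, b in zip(image_emb, pose_emb))


def margin_cost(image_emb, true_pose_emb, other_pose_emb, margin):
    """Hinge-style cost (assumed form): zero once the true pose outscores
    the other pose by at least `margin`, positive otherwise."""
    violation = (score(image_emb, other_pose_emb) + margin
                 - score(image_emb, true_pose_emb))
    return max(0.0, violation)


# Toy embeddings (hypothetical values, not taken from the paper).
image_emb = [1.0, 0.0]
true_pose_emb = [2.0, 0.0]
other_pose_emb = [1.5, 1.0]

print(score(image_emb, true_pose_emb))
print(margin_cost(image_emb, true_pose_emb, other_pose_emb, margin=1.0))
```

In this form the required margin would typically grow with the error between the two poses, so that poses far from the ground truth must score much lower than poses close to it.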

Footnotes
1
Note that \({\hat{y}}\) depends on the input \((x, y)\) and network parameters \(\theta \). To reduce clutter, we write \({\hat{y}}\) instead of \({\hat{y}}(x,y,\theta )\) when no confusion arises.
 
2
The action “Direction” is not included due to video corruption.
 
3
For better visualization, we only use the images from a single subject.
 
Metadata
Title
Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation
Authors
Sijin Li
Weichen Zhang
Antoni B. Chan
Publication date
01.10.2016
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 1/2017
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-016-0962-x
