Published in: Pattern Analysis and Applications 1/2022

24 January 2022 | Theoretical Advances

3D hand pose estimation from a single RGB image through semantic decomposition of VAE latent space

Authors: Xinru Guo, Song Xu, Xiangbo Lin, Yi Sun, Xiaohong Ma


Abstract

Based on disentangled representation learning theory and the cross-modal variational autoencoder (VAE) model, we derive a “Single Input Multiple Output” (SIMO) disentangled model, \(\text{cmSIMO-}\beta\text{VAE}\). Guided by this derived model, we design a new VAE network, named da-VAE, for the challenging task of 3D hand pose estimation from a single RGB image. The da-VAE network has a multi-head encoder with attention modules. Combined with specific supervision, the latent space is decomposed into subspaces with explicit semantics that correspond to the generative factors of hand pose, shape, appearance and others. The performance of the proposed da-VAE network is evaluated on the RHD and STB datasets. The experimental results show accuracy competitive with state-of-the-art methods.
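
The abstract does not state the cmSIMO-βVAE objective or the da-VAE architecture, so the following is only a minimal sketch of the general idea: a latent vector sliced into named subspaces, and a generic β-VAE-style loss that sums several output reconstruction errors and adds a β-weighted KL term. The subspace sizes, the function names and the value of `beta` are all assumptions made for illustration, not the authors' settings.

```python
import math

# Hypothetical subspace sizes: the abstract names the factors
# (pose, shape, appearance, others) but not their dimensions.
SUBSPACES = {"pose": 4, "shape": 2, "appearance": 2, "other": 2}


def split_latent(z):
    """Slice a flat latent vector into named subspaces, in SUBSPACES order."""
    total = sum(SUBSPACES.values())
    if len(z) != total:
        raise ValueError(f"expected {total} latent dimensions, got {len(z)}")
    parts, start = {}, 0
    for name, dim in SUBSPACES.items():
        parts[name] = z[start:start + dim]
        start += dim
    return parts


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def beta_vae_loss(recon_errors, mu, log_var, beta=4.0):
    """Generic multi-output beta-VAE-style loss (an assumed form).

    recon_errors maps each output name to its reconstruction error,
    mirroring the single-input, multiple-output setting.
    """
    return sum(recon_errors.values()) + beta * kl_to_standard_normal(mu, log_var)
```

In a setup like this, each output head would read only the subspace it is supervised on, for example `split_latent(z)["pose"]` for a 3D joint decoder; whether da-VAE wires its decoders this way is not stated in the abstract.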

Metadata
Title
3D hand pose estimation from a single RGB image through semantic decomposition of VAE latent space
Authors
Xinru Guo
Song Xu
Xiangbo Lin
Yi Sun
Xiaohong Ma
Publication date
24 January 2022
Publisher
Springer London
Published in
Pattern Analysis and Applications / Issue 1/2022
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-021-01048-x