2024 | OriginalPaper | Book chapter

RAGT: Learning Robust Features for Occluded Human Pose and Shape Estimation with Attention-Guided Transformer

Authors: Ziqing Li, Yang Li, Shaohui Lin

Published in: Computer-Aided Design and Computer Graphics

Publisher: Springer Nature Singapore

Abstract

3D human pose and shape estimation from monocular images is a fundamental task in computer vision, but it is highly ill-posed and becomes especially challenging under occlusion. Occlusion arises when other objects block parts of the body from view. When this happens, the image features become incomplete and ambiguous, leading to inaccurate or even wrong predictions. In this paper, we propose a novel method, named RAGT, that handles occlusion robustly and recovers the complete 3D pose and shape of humans. Our study focuses on learning a robust feature representation for human pose and shape estimation in the presence of occlusion. To this end, we introduce a dual-branch architecture that learns incorporation weights, which carry information from visible parts to occluded parts, and suppression weights, which inhibit the integration of background features. To further improve the quality of the visible and occluded maps, we leverage pseudo ground-truth maps generated by DensePose for pixel-level supervision. Additionally, we propose a novel transformer-based module called COAT (Contextual Occlusion-Aware Transformer) to effectively incorporate visible features into occluded regions. The COAT module is guided by an Occlusion-Guided Attention Loss (OGAL), which explicitly encourages it to fuse the more important and relevant features, namely those that are semantically and spatially closer to the occluded regions. We conduct experiments on various benchmarks and demonstrate the robustness of RAGT across different kinds of occluded scenes, both quantitatively and qualitatively.
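The abstract does not specify the COAT architecture, so the following is only a toy sketch of the general idea it describes: occluded locations attend to visible locations, while background locations are excluded from the fusion. It uses plain scaled dot-product attention with hard 0/1 masks. The function name `fuse_visible_into_occluded`, the mask format, and the residual addition are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_visible_into_occluded(features, visible, occluded):
    """Toy occlusion-aware fusion (hypothetical, not the paper's COAT module).

    features: list of N feature vectors, each a list of D floats.
    visible:  list of N flags (1 = visible body part, 0 = otherwise).
    occluded: list of N flags (1 = occluded body part, 0 = otherwise).
    Locations with both flags at 0 are treated as background.

    Each occluded location attends only to visible locations and adds the
    attention-weighted average of their features to its own feature.
    Visible and background locations are returned unchanged, and background
    features are never used as attention keys.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    visible_idx = [j for j, flag in enumerate(visible) if flag]
    fused = [list(f) for f in features]  # copy so the input is not mutated
    if not visible_idx:
        return fused
    for i, is_occluded in enumerate(occluded):
        if not is_occluded:
            continue
        # Scaled dot-product similarity between the occluded query
        # and every visible key.
        logits = [
            scale * sum(a * b for a, b in zip(features[i], features[j]))
            for j in visible_idx
        ]
        weights = softmax(logits)
        for w, j in zip(weights, visible_idx):
            for d in range(dim):
                fused[i][d] += w * features[j][d]
    return fused
```

In the method described above, the visible and occluded maps are learned (with DensePose pseudo ground truth as supervision) and the attention itself is shaped by OGAL; this sketch replaces both with fixed hard masks purely to show the direction of information flow.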

Metadata
Title
RAGT: Learning Robust Features for Occluded Human Pose and Shape Estimation with Attention-Guided Transformer
Authors
Ziqing Li
Yang Li
Shaohui Lin
Copyright year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-99-9666-7_22
