2024 | OriginalPaper | Book chapter

RAGT: Learning Robust Features for Occluded Human Pose and Shape Estimation with Attention-Guided Transformer

Authors: Ziqing Li, Yang Li, Shaohui Lin

Published in: Computer-Aided Design and Computer Graphics

Publisher: Springer Nature Singapore

Abstract

3D human pose and shape estimation from monocular images is a fundamental task in computer vision, but it is highly ill-posed and challenging, particularly under occlusion. Occlusion arises when other objects block parts of the body from view. When it occurs, the image features become incomplete and ambiguous, leading to inaccurate or outright wrong predictions. In this paper, we propose RAGT, a novel method that handles occlusion robustly and recovers the complete 3D pose and shape of humans. Our study focuses on learning a robust feature representation for human pose and shape estimation in the presence of occlusion. To this end, we introduce a dual-branch architecture that learns incorporation weights, which carry features from visible parts to occluded parts, and suppression weights, which inhibit the integration of background features. To further improve the quality of the visible and occluded maps, we use pseudo ground-truth maps generated by DensePose for pixel-level supervision. Additionally, we propose a novel transformer-based module, the Contextual Occlusion-Aware Transformer (COAT), to incorporate visible features into occluded regions effectively. The COAT module is guided by an Occlusion-Guided Attention Loss (OGAL), which explicitly encourages it to fuse the more important and relevant features, those that are semantically and spatially closer to the occluded regions. Experiments on various benchmarks demonstrate, both quantitatively and qualitatively, the robustness of RAGT across different kinds of occluded scenes.
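
The abstract does not give the exact formulation of COAT or OGAL, so the following is only a minimal illustrative sketch of the general idea: occluded positions borrow features from other positions through attention, and an auxiliary loss rewards attention that lands on visible positions. All names here (`fuse_visible_into_occluded`, `attention_guidance_loss`, the per-position `visible` and `occluded` weights) are hypothetical, and the loss is a simple stand-in rather than the paper's OGAL, which also accounts for semantic and spatial closeness.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_visible_into_occluded(features, occluded):
    """Blend attended context into occluded positions (hypothetical sketch).

    features: one feature vector per spatial position.
    occluded: per-position weight in [0, 1]; 1 means fully occluded.
    Returns (fused_features, attention_rows).
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    fused, attn_rows = [], []
    for i, query in enumerate(features):
        # Scaled dot-product attention of this position over all positions.
        attn = softmax([dot(query, key) * scale for key in features])
        context = [sum(a * f[d] for a, f in zip(attn, features))
                   for d in range(dim)]
        # Occluded positions take the attended context; others keep their own.
        w = occluded[i]
        fused.append([(1 - w) * q + w * c for q, c in zip(query, context)])
        attn_rows.append(attn)
    return fused, attn_rows


def attention_guidance_loss(attn_rows, visible, occluded):
    """Stand-in guidance loss (an assumption, not the paper's OGAL).

    Penalises occluded queries whose attention mass falls on non-visible keys.
    """
    total, weight = 0.0, 0.0
    for i, attn in enumerate(attn_rows):
        mass_on_visible = sum(a * v for a, v in zip(attn, visible))
        total += occluded[i] * -math.log(mass_on_visible + 1e-8)
        weight += occluded[i]
    return total / weight if weight > 0 else 0.0


# Toy example: position 0 is visible, 1 is occluded, 2 is background.
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
visible = [1.0, 0.0, 0.0]
occluded = [0.0, 1.0, 0.0]
fused, attn = fuse_visible_into_occluded(features, occluded)
loss = attention_guidance_loss(attn, visible, occluded)
```

In this toy setup the loss shrinks as the occluded query shifts its attention toward the visible position, which is the kind of soft guidance the abstract describes, as opposed to a hard mask over background keys.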

Print ISBN: 978-981-9996-65-0
Electronic ISBN: 978-981-9996-66-7
Copyright year: 2024
DOI: https://doi.org/10.1007/978-981-99-9666-7_22
