Published in: Pattern Analysis and Applications 2/2023

01-02-2023 | Theoretical Advances

MSRT: multi-scale representation transformer for regression-based human pose estimation

Authors: Beiguang Shan, Qingxuan Shi, Fang Yang


Abstract

This paper addresses human pose estimation, with a focus on learning discriminative pose features. Recent pose estimation methods concentrate on extracting high-level features but ignore low-level details, which reduces prediction accuracy. To mitigate this issue, we propose an end-to-end method called the multi-scale representation transformer network (MSRT). The network consists of two key components: a feature aggregation module (FAM) and transformers. The FAM splits and stacks feature maps of different scales, then fuses them to achieve multi-scale representation learning, which compensates for the lack of detailed information in the high-level features. We further use transformers to model long-range interactions among feature maps and to capture implicit body structure information, which allows the network to refine the locations of terminal and occluded joints. Compared with existing regression-based methods, MSRT achieves superior results on the COCO2017 and MPII datasets.
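
The two components named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the abstract does not specify how the FAM splits, stacks, or fuses feature maps, so the sketch assumes a space-to-depth rearrangement followed by channel concatenation, and it stands in for the transformer with a single self-attention step that uses identity projections (no learned weights). The function names `space_to_depth`, `fuse`, and `self_attention` are hypothetical.

```python
import math


def space_to_depth(fmap, r):
    """Split an H x W x C map into r x r blocks and stack each block on the channel axis.

    fmap is a nested list indexed as [row][col][channel]; the result has
    shape (H // r) x (W // r) x (C * r * r). Assumed reading of "splits and stacks".
    """
    h, w = len(fmap), len(fmap[0])
    assert h % r == 0 and w % r == 0, "spatial size must be divisible by r"
    out = []
    for i in range(h // r):
        row = []
        for j in range(w // r):
            cell = []
            for di in range(r):
                for dj in range(r):
                    cell.extend(fmap[i * r + di][j * r + dj])
            row.append(cell)
        out.append(row)
    return out


def fuse(maps):
    """Fuse maps of equal spatial size by concatenating their channels (assumed fusion)."""
    h, w = len(maps[0]), len(maps[0][0])
    assert all(len(m) == h and len(m[0]) == w for m in maps)
    return [[sum((m[i][j] for m in maps), []) for j in range(w)] for i in range(h)]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (no learned projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [x / total for x in weights]
        out.append([sum(x * v[c] for x, v in zip(weights, tokens)) for c in range(d)])
    return out


# Toy inputs: a 4 x 4 x 1 low-level map and a 2 x 2 x 2 high-level map.
low = [[[float(i * 4 + j)] for j in range(4)] for i in range(4)]
high = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]]

stacked = space_to_depth(low, 2)   # 2 x 2 x 4, same resolution as `high`
fused = fuse([stacked, high])      # 2 x 2 x 6 multi-scale representation
tokens = [fused[i][j] for i in range(2) for j in range(2)]
refined = self_attention(tokens)   # every token attends to every other token
```

In this toy setup the rearrangement brings the low-level map to the resolution of the high-level map without discarding any values, and the attention step lets every spatial location draw on all others, which is the kind of long-range interaction the abstract attributes to the transformers.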


Metadata
Title
MSRT: multi-scale representation transformer for regression-based human pose estimation
Authors
Beiguang Shan
Qingxuan Shi
Fang Yang
Publication date
01-02-2023
Publisher
Springer London
Published in
Pattern Analysis and Applications / Issue 2/2023
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-023-01130-6
