Introduction
- We propose GITPose for 2D human pose estimation (HPE), one of the first studies to use a vision transformer to extract feature representations for this task.
- GITPose introduces a hierarchical transformer that uses MLPs to encode fine-grained local feature tokens in the shallow stages, self-attention (SA) blocks to encode long-range dependencies in the deeper stages, and a decoder layer for keypoint detection.
- We also propose a novel deformable token association (DTA) module that flexibly fuses the most informative keypoint tokens, yielding hierarchical representations with greater transformation-modeling capacity.
- In extensive experiments, GITPose outperforms current keypoint-detection counterparts both with and without a convolutional neural network (CNN) backbone, and sets new state-of-the-art (SOTA) results on the MS COCO and MPII benchmarks.
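The shallow-MLP / deep-attention split described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the weights are random, a single attention head stands in for the full SA block, and the token-grid sizes and channel widths are taken from the GITPose-B configuration reported in the ablation study.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_block(x, hidden_ratio=4):
    # Token-wise MLP (shallow stages): mixes channels only, cheap on large grids.
    n, c = x.shape
    W1 = rng.standard_normal((c, c * hidden_ratio)) / np.sqrt(c)
    W2 = rng.standard_normal((c * hidden_ratio, c)) / np.sqrt(c * hidden_ratio)
    h = np.maximum(x @ W1, 0.0)          # ReLU stands in for GELU for brevity
    return x + h @ W2                    # residual connection

def self_attention(x):
    # Single-head self-attention (deep stages): models long-range dependencies
    # between all token pairs, affordable once the grid has been downsampled.
    n, c = x.shape
    Wq, Wk, Wv = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = q @ k.T / np.sqrt(c)
    a = np.exp(a - a.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)   # softmax over tokens
    return x + a @ v                     # residual connection

# Stage 1 (shallow): 56x56 tokens, 96 channels -> MLP mixing only.
x1 = rng.standard_normal((56 * 56, 96))
y1 = mlp_block(x1)

# Stage 3 (deep): 14x14 tokens, 384 channels -> full self-attention.
x3 = rng.standard_normal((14 * 14, 384))
y3 = self_attention(x3)
print(y1.shape, y3.shape)  # each stage preserves its token grid and width
```

The design rationale is that full SA on the 56 × 56 grid would cost attention over 3136² token pairs, whereas channel-only MLP mixing is linear in the number of tokens; SA is reserved for the coarser deep stages where global context matters most.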
Related work
Transformers in vision
Human pose estimation
Proposed model
Revisiting transformers in vision
Overall architecture of GITPose
Design of shallow and deeper blocks of GITPose
Deformable token association (DTA)
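The contributions describe DTA as flexibly fusing the most informative keypoint tokens between stages. Since the exact formulation is not reproduced here, the following is only an assumed sketch: each output token samples a 2 × 2 window at a predicted (integer, for brevity) offset, weighs the window tokens by a learned importance score, and projects the fused window so that the grid is halved and the channel width doubled, matching the stage transitions in the ablation table.

```python
import numpy as np

rng = np.random.default_rng(0)

def dta_merge(x, h, w):
    """Simplified deformable token association (assumed sketch): sample a
    2x2 window at a predicted offset, fuse it with importance weights,
    halve the resolution, and double the channels."""
    n, c = x.shape
    grid = x.reshape(h, w, c)
    W_off = rng.standard_normal((c, 2)) * 0.01        # offset head (dy, dx)
    W_imp = rng.standard_normal((c, 1)) * 0.01        # importance head
    W_proj = rng.standard_normal((4 * c, 2 * c)) / np.sqrt(4 * c)
    out = np.empty((h // 2, w // 2, 2 * c))
    for i in range(h // 2):
        for j in range(w // 2):
            anchor = grid[2 * i, 2 * j]
            dy, dx = (anchor @ W_off).round().astype(int)   # integer offsets
            y0 = int(np.clip(2 * i + dy, 0, h - 2))
            x0 = int(np.clip(2 * j + dx, 0, w - 2))
            win = grid[y0:y0 + 2, x0:x0 + 2].reshape(4, c)  # deformed window
            s = win @ W_imp                                  # token importance
            s = np.exp(s - s.max()); s /= s.sum()            # softmax weights
            out[i, j] = (win * s).reshape(-1) @ W_proj       # fuse and project
    return out.reshape(-1, 2 * c)

tokens = rng.standard_normal((28 * 28, 192))   # stage-2 output, base variant
merged = dta_merge(tokens, 28, 28)
print(merged.shape)                            # 14x14 tokens, doubled channels
```

A real deformable module would use fractional offsets with bilinear sampling and learn all heads end to end; the integer offsets here only keep the sketch short.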
Transformer decoder
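The decoder layer for keypoint detection mentioned in the contributions is not spelled out in this excerpt; a common design (used, e.g., by ViTPose-style heads) projects the final token grid to one heatmap per keypoint, upsamples, and decodes each keypoint as its heatmap argmax. The sketch below assumes that pattern with random weights and nearest-neighbor upsampling, using the 7 × 7 stage-4 grid of the base variant and the 17 COCO keypoints.

```python
import numpy as np

rng = np.random.default_rng(0)

def keypoint_head(tokens, h, w, num_keypoints=17):
    """Assumed decoder sketch: token grid -> per-keypoint heatmaps ->
    upsample -> argmax coordinates."""
    n, c = tokens.shape
    W = rng.standard_normal((c, num_keypoints)) / np.sqrt(c)
    heat = (tokens @ W).reshape(h, w, num_keypoints)
    heat = heat.repeat(4, axis=0).repeat(4, axis=1)  # 7x7 -> 28x28 upsample
    # Decode each keypoint as the argmax location of its heatmap.
    flat = heat.reshape(-1, num_keypoints).argmax(axis=0)
    ys, xs = np.unravel_index(flat, heat.shape[:2])
    return heat, np.stack([xs, ys], axis=1)

tokens = rng.standard_normal((7 * 7, 768))       # stage-4 output, base variant
heatmaps, coords = keypoint_head(tokens, 7, 7)
print(heatmaps.shape, coords.shape)              # (28, 28, 17) (17, 2)
```

In practice the upsampling would be learned (deconvolution or bilinear plus convolution) and trained against Gaussian target heatmaps; only the overall data flow is shown here.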
Experiments
Setup
Implementation details
Comparison with the state-of-the-art models
MPII keypoint detection
Method | Backbone | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean |
---|---|---|---|---|---|---|---|---|---|
CPM [47] | CPM | 96.1 | 95.2 | 87.3 | 82.3 | 87.7 | 82.8 | 78.5 | 87.1 |
SBL [13] | Res 152 | 86.7 | 95.3 | 89.3 | 85.1 | 89.2 | 84.7 | 82.1 | 88.3 |
HRNet [40] | HRNet w32 | 97.4 | 85.6 | 92.5 | 85.7 | 89.3 | 86.2 | 82.7 | 89.3 |
HRNet [40] | HRNet w48 | 97.3 | 96.0 | 90.6 | 85.7 | 89.1 | 86.8 | 82.3 | 89.7 |
CFA [67] | Res 101 | 96.4 | 95.8 | 91.3 | 87.0 | 90.0 | 87.9 | 84.0 | 90.3 |
ASDA [68] | HRNet w48 | 96.3 | 96.7 | 92.0 | 88.1 | 91.0 | 89.1 | 85.2 | 91.2 |
ViTPose [25] | ViT-B | 97.6 | 97.4 | 93.7 | 90.1 | 92.4 | 91.9 | 88.3 | 93.4 |
ViTPose [25] | ViT-L | 97.7 | 97.4 | 94.1 | 91.5 | 93.1 | 92.2 | 89.7 | 93.9 |
ViTPose [25] | ViT-H | 97.7 | 97.6 | 94.3 | 91.2 | 93.3 | 92.5 | 90.1 | 94.1 |
GITPose | ViT-B | 98.5 | 97.9 | 94.1 | 90.3 | 92.8 | 91.9 | 89.7 | 93.7 |
GITPose* | ViT-L | 98.5 | 98.5 | 95.3 | 92.4 | 93.2 | 93.5 | 90.9 | 94.8 |
GITPose | ViT-H | 98.6 | 98.4 | 95.0 | 91.1 | 93.3 | 92.6 | 90.6 | 94.3 |
Performance gain | | + 0.8 | + 0.9 | + 1.0 | + 0.9 | + 1.0 | + 0.8 | + 0.7 | |
COCO keypoint detection
Model | Backbone | Input resolution | GFLOPs | AP | AR |
---|---|---|---|---|---|
HigherHRNet [14] | HRNet-w48 | 384 × 288 | – | 72.1 | – |
SBL [13] | Res-152 | 256 × 192 | – | 73.6 | 79.0 |
HRNet [40] | HRNet-w32 | 256 × 192 | – | 74.4 | 78.9 |
HRNet [40] | HRNet-w32 | 384 × 288 | 39.6 | 75.8 | 81.0 |
HRNet [40] | HRNet-w48 | 256 × 192 | – | 75.1 | 80.4 |
HRNet [40] | HRNet-w48 | 384 × 288 | – | 76.3 | 81.2 |
UDP [66] | HRNet-w48 | 384 × 288 | – | 77.2 | 82.0 |
TokenPose-L/D24 [70] | HRNet-w48 | 256 × 192 | – | 75.8 | 80.9 |
HRFormer [69] | HRNet-w48 | 256 × 192 | 44.0 | 75.6 | 80.8 |
HRFormer [69] | HRFormer-B | 384 × 288 | 45.5 | 77.2 | 82.0 |
ViTPose [25] | ViT-B | 256 × 192 | 76.5 | 75.8 | 81.1 |
ViTPose [25] | ViT-L | 256 × 192 | 77.2 | 78.3 | 83.5 |
ViTPose [25] | ViT-H | 256 × 192 | 78.5 | 79.1 | 84.1 |
GITPose | ViT-B | 256 × 192 | 71.1 | 76.7 | 83.0 |
GITPose | ViT-L | 256 × 192 | 75.7 | 78.8 | 84.1 |
GITPose* | ViT-H | 256 × 192 | 79.0 | 80.0 | 84.3 |
Performance gain | | | | + 0.9 | + 0.2 |
Model | Backbone | #Params | Input resolution | AP | AP50 | AP75 | APM | APL | AR |
---|---|---|---|---|---|---|---|---|---|
HigherHRNet [14] | HRNet-W48 | 21.0 M | 384 × 288 | 71.3 | 92.2 | 81.2 | 70.0 | 77.2 | 74.9 |
HRNet [40] | HRNet-W32 | 29.0 M | 384 × 288 | 76.0 | 94.5 | 82.5 | 71.2 | 80.3 | 80.1 |
HRNet [40] | HRNet-W48 | 64.0 M | 256 × 192 | – | – | – | – | – | – |
HRNet [40] | HRNet-W48 | 64.0 M | 384 × 288 | 76.4 | 95.3 | 83.6 | 72.1 | 81.8 | 80.5 |
UDP [66] | HRNet-W48 | 64.0 M | 384 × 288 | – | – | – | – | – | – |
TokenPose-L/D24 [70] | HRNet-W48 | 28.0 M | 256 × 192 | 76.1 | 94.4 | 81.7 | 70.5 | 81.1 | 80.2 |
TransPose-H/A6 [70] | HRNet-W48 | 18.0 M | 256 × 192 | 75.9 | 93.0 | 80.8 | 68.9 | 77.9 | – |
HRFormer [69] | HRNet-W48 | 43.0 M | 256 × 192 | – | – | – | – | – | – |
HRFormer [69] | HRFormer-B | 43.0 M | 384 × 288 | 76.2 | 95.7 | 82.8 | 71.6 | 82.3 | 81.2 |
ViTPose [25] | ViT-B | 86.0 M | 256 × 192 | 76.9 | 94.1 | 82.1 | 71.1 | 79.9 | 80.3 |
ViTPose [25] | ViT-L | 307.0 M | 256 × 192 | 78.6 | 95.8 | 83.0 | 72.3 | 81.1 | 82.4 |
ViTPose [25] | ViT-H | 632.0 M | 256 × 192 | 80.0 | 96.0 | 84.1 | 72.4 | 82.2 | 83.1 |
GITPose | ViT-B | 90.1 M | 256 × 192 | 77.1 | 94.8 | 83.2 | 72.1 | 79.8 | 80.5 |
GITPose | ViT-L | 401.0 M | 256 × 192 | 79.4 | 95.5 | 83.9 | 72.5 | 82.1 | 82.3 |
GITPose* | ViT-H | 604.3 M | 256 × 192 | 81.1 | 96.4 | 84.3 | 73.9 | 84.0 | 83.3 |
Performance gain | | | | + 1.1 | + 0.4 | + 0.2 | + 1.5 | + 1.8 | + 0.2 |
Ablation study
Layer | Stage (output size) | GITPose-B | GITPose-L | GITPose-H |
---|---|---|---|---|
Patch embedding | S1, 56 × 56 | P1 = 4, C1 = 96 | P1 = 4, C1 = 96 | P1 = 4, C1 = 128 |
MLP block | S1 | L1 = 2 | L1 = 2 | L1 = 2 |
DTA | S2, 28 × 28 | P2 = 2, C2 = 192 | P2 = 2, C2 = 192 | P2 = 2, C2 = 256 |
MLP block | S2 | L2 = 2 | L2 = 2 | L2 = 2 |
DTA | S3, 14 × 14 | P3 = 2, C3 = 384 | P3 = 2, C3 = 384 | P3 = 2, C3 = 512 |
Transformer block | S3 | N3 = 12, L3 = 6 | N3 = 12, L3 = 18 | N3 = 16, L3 = 18 |
DTA | S4, 7 × 7 | P4 = 2, C4 = 768 | P4 = 2, C4 = 768 | P4 = 2, C4 = 1024 |
Transformer block | S4 | N4 = 24, L4 = 2 | N4 = 24, L4 = 2 | N4 = 32, L4 = 2 |
Model | Inference speed (fps) | AP |
---|---|---|
HRNet-W48 | 309 | 76.3 |
HRNet-W32 | 428 | 74.4 |
TokenPose-L | 602 | 75.8 |
Transpose-H | 309 | 75.8 |
ViTPose-H | 241 | 79.1 |
GITPose-H | 288 | 80.0 |