4.1 Datasets
Three gait datasets are employed: CASIA-B [36], a widely used indoor dataset; Gait3D [39], known for its diversity in real-life settings; and GREW [38], commonly used for in-the-wild evaluation. Table 1 summarizes the number of identities and sequences in each dataset. The following subsections outline the data collection procedure for each dataset, highlighting the notable distinctions between the indoor and outdoor datasets.
Table 1
Count of labels (Lab) and sequences (Seq) contained in the CASIA-B, GREW, and Gait3D datasets
| Dataset | Year | Train Lab | Train Seq | Test Lab | Test Seq |
|---|---|---|---|---|---|
| CASIA-B | 2006 | 74 | 8140 | 50 | 5500 |
| GREW | 2021 | 20,000 | 102,887 | 6000 | 24,000 |
| Gait3D | 2022 | 3000 | 18,940 | 1000 | 6369 |
4.1.1 CASIA-B [36]
Is a widely used gait recognition dataset with 124 subjects. Each subject walks under 10 conditions: 6 normal (NM), 2 carrying a bag (BG), and 2 wearing different clothes (CL). Viewing angles range from 0\(^\circ\) to 180\(^\circ\) at 18\(^\circ\) intervals, captured by 11 cameras, so the dataset contains 13,640 gait sequences in total. Since CASIA-B has no official train/test split, we adopt the popular protocol of existing studies [44]: training data come from 74 subjects (001-074) and test data from the remaining 50 subjects (075-124). In the test set, the first four NM sequences of each subject serve as the gallery, and the remaining gait sequences serve as probes. To extract pose information from CASIA-B, we use a pretrained HRNet pose estimation model [45].
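For concreteness, a minimal sketch of this split is given below; the sequence-naming convention (e.g., `nm-01`) and the tuple format are illustrative assumptions, not our exact data-loading code.

```python
# Illustrative sketch of the popular CASIA-B evaluation protocol described above.
TRAIN_IDS = {f"{i:03d}" for i in range(1, 75)}    # subjects 001-074
TEST_IDS = {f"{i:03d}" for i in range(75, 125)}   # subjects 075-124

GALLERY_CONDITIONS = {"nm-01", "nm-02", "nm-03", "nm-04"}  # first 4 NM walks

def split_test_sequences(sequences):
    """Partition test sequences into gallery and probe sets.

    `sequences` is assumed to be an iterable of (subject_id, condition, view)
    tuples, e.g., ("075", "nm-05", "090").
    """
    gallery, probe = [], []
    for subject_id, condition, view in sequences:
        if subject_id not in TEST_IDS:
            continue  # training subjects are excluded from evaluation
        if condition in GALLERY_CONDITIONS:
            gallery.append((subject_id, condition, view))
        else:
            probe.append((subject_id, condition, view))  # nm-05/06, bg-*, cl-*
    return gallery, probe
```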
4.1.2 Gait3D [39]
Is a large-scale in-the-wild gait recognition dataset captured by 39 cameras in a supermarket; it contains 4000 subjects and more than 25,000 gait sequences. We use 3000 of these subjects for training. During testing, for each of the remaining 1000 subjects, one sequence is randomly selected as the probe, and the subject's remaining sequences form the gallery.
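The following sketch illustrates this probe/gallery construction; the `(subject_id, sequence_id)` input format and function name are assumptions for illustration.

```python
import random
from collections import defaultdict

def build_gait3d_eval_sets(test_sequences, seed=0):
    """Sketch of the Gait3D evaluation split described above: for each test
    subject, one sequence is randomly chosen as the probe and the subject's
    remaining sequences go to the gallery."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for subject_id, sequence_id in test_sequences:
        by_subject[subject_id].append(sequence_id)

    probe, gallery = [], []
    for subject_id, seqs in by_subject.items():
        chosen = rng.choice(seqs)                # one probe per subject
        probe.append((subject_id, chosen))
        gallery.extend((subject_id, s) for s in seqs if s != chosen)
    return probe, gallery
```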
4.1.3 GREW [38]
Is a dataset derived from real-world video streams captured by hundreds of cameras over thousands of hours in an open environment. With 26K identities and 128K sequences, GREW covers a wide range of attributes, view variations, and natural challenge factors, making it well suited for training, validation, and testing. Each test identity has two sequences: one serves as the probe and the other as the gallery.
In all our experiments, we follow the official or most widely adopted protocols for training, testing, and gallery/probe partitioning, and we report Rank-1, Rank-5, Rank-10, and Rank-20 accuracies as the main evaluation metrics.
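As a reference point, the sketch below shows how such Rank-k metrics can be computed from extracted embeddings, assuming Euclidean-distance matching; it is illustrative rather than our exact evaluation code.

```python
import numpy as np

def rank_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids,
                    ks=(1, 5, 10, 20)):
    """A probe counts as correct at rank k if any of its k nearest gallery
    embeddings (by Euclidean distance) shares its identity."""
    # Pairwise Euclidean distances, shape (n_probe, n_gallery).
    diff = probe_feats[:, None, :] - gallery_feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    order = np.argsort(dist, axis=1)             # nearest gallery first
    ranked_ids = np.asarray(gallery_ids)[order]

    hits = ranked_ids == np.asarray(probe_ids)[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```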
4.3 Comparison with state of the art
This section compares GaitDLF with state-of-the-art skeleton-based gait recognition methods. Systematic experiments were conducted on three datasets, CASIA-B, Gait3D, and GREW, to validate GaitDLF's effectiveness. We first compare GaitDLF with representative skeleton-based methods, PoseGait, GaitGraph, and GaitGraph2, which share our evaluation protocol.
Table 3
Averaged Rank-1 accuracies in percent on CASIA-B per probe angle compared with other model-based methods
| Condition | Method | 0° | 18° | 36° | 54° | 72° | 90° | 108° | 126° | 144° | 162° | 180° | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NM | PoseGait | 55.3 | 69.6 | 73.9 | 75.0 | 68.0 | 68.2 | 71.1 | 72.9 | 76.1 | 70.4 | 55.4 | 68.7 |
| | GaitGraph | 85.3 | 88.5 | 91.0 | 92.5 | 87.2 | 86.5 | 88.4 | 89.2 | 87.9 | 85.9 | 81.9 | 87.7 |
| | GaitGraph2 | 78.5 | 82.9 | 85.8 | 85.6 | 83.1 | 81.5 | 84.3 | 83.2 | 84.2 | 81.6 | 71.8 | 82.0 |
| | GaitDLF | 80.1 | 87.1 | 87.7 | 89.2 | 84.4 | 84.4 | 84.2 | 85.2 | 85.2 | 85.4 | 80.6 | 84.86 |
| BG | PoseGait | 35.3 | 47.2 | 52.4 | 46.9 | 45.5 | 43.9 | 46.1 | 48.1 | 49.4 | 43.6 | 31.1 | 44.5 |
| | GaitGraph | 75.8 | 76.7 | 75.9 | 76.1 | 71.4 | 73.9 | 78.0 | 74.7 | 75.4 | 75.4 | 69.2 | 74.8 |
| | GaitGraph2 | 69.9 | 75.9 | 78.1 | 79.3 | 71.4 | 71.7 | 74.3 | 76.2 | 73.2 | 73.4 | 61.7 | 73.2 |
| | GaitDLF | 68.2 | 71.0 | 74.2 | 75.8 | 69.0 | 71.2 | 71.5 | 71.2 | 72.0 | 70.5 | 63.5 | 70.74 |
| CL | PoseGait | 24.3 | 29.7 | 41.3 | 38.8 | 38.2 | 38.5 | 41.6 | 44.9 | 42.2 | 33.4 | 22.5 | 36.0 |
| | GaitGraph | 69.6 | 66.1 | 68.8 | 67.2 | 64.5 | 62.0 | 69.5 | 65.6 | 65.7 | 66.1 | 64.3 | 66.3 |
| | GaitGraph2 | 57.1 | 61.1 | 68.9 | 66.0 | 67.8 | 65.4 | 68.1 | 67.2 | 63.7 | 63.6 | 50.4 | 63.6 |
| | GaitDLF | 67.0 | 68.1 | 69.1 | 68.9 | 64.4 | 68.0 | 69.2 | 71.3 | 69.3 | 70.1 | 62.1 | 67.95 |
As shown in Table 3, on the indoor CASIA-B dataset GaitDLF reaches only \(84.86\%\) in the NM condition, below GaitGraph's \(87.7\%\), and \(70.74\%\) in the BG condition, below GaitGraph's \(74.8\%\). Among these skeleton-based methods, however, ours obtains the highest accuracy in the clothing (CL) condition, which suggests that methods focusing on localized limb movements are more robust when the overall skeleton is heavily occluded by clothing. CASIA-B, however, was collected in a laboratory; to bridge the gap between laboratory studies and real-world applications, we need methods that remain effective in wild environments.
Table 4
Rank-1, Rank-5, Rank-10, and Rank-20 accuracies in percent on GREW and Rank-1 accuracies on Gait3D, compared with other model-based methods
| Method | Venue | GREW Rank-1 | GREW Rank-5 | GREW Rank-10 | GREW Rank-20 | Gait3D Rank-1 |
|---|---|---|---|---|---|---|
| PoseGait | PR 2020 | 0.23 | 1.05 | 2.23 | 4.28 | 0.24 |
| GaitGraph | ICIP 2021 | 1.31 | 3.46 | 5.08 | 7.51 | 6.25 |
| GaitGraph2 | CVPRW 2022 | 33.54 | 49.45 | 56.28 | 61.92 | 11.1 |
| GaitDLF | Ours | 68.77 | 82.25 | 87.25 | 90.3 | 27.2 |
As shown in Table 4, our method achieves the highest recognition accuracy among skeleton-based methods on both GREW and Gait3D. Notably, on GREW our Rank-1 accuracy is about 35 percentage points higher than that of GaitGraph2 (\(68.77\%\) vs. \(33.54\%\)). This result highlights our method's ability to learn complex, discriminative gait features from large-scale data, which makes it well suited for real-world gait recognition tasks. It further shows that, compared with methods that pool over the whole skeleton to obtain global features, our modeling and dynamically adaptive fusion of local limbs is clearly advantageous.
Given the characteristics of these three datasets, our method performs best on the larger, more challenging datasets (Gait3D and the in-the-wild GREW). On the smaller CASIA-B dataset, our method does not achieve the best results, ranking second to GaitGraph. GaitDLF has a more complex design than other skeleton-based networks so that it can cope with complex scenarios; on a small dataset, this may lead to overfitting and prevent GaitDLF from achieving the best performance. On large datasets, however, GaitDLF achieves very high performance.
4.4 Comparison with appearance-based methods
Appearance-based methods have dominated gait recognition, and the silhouette images they use contain not only gait information but also other recognizable visual cues (e.g., clothing, handbags). Skeleton-based methods, in contrast, extract gait features only from the subject's dynamic skeleton sequence. We therefore compare GaitDLF with various state-of-the-art appearance-based methods, namely GaitSet, GaitPart, GaitGL, and GaitBase, to assess its performance.
Table 5
Averaged Rank-1 accuracies in percent on CASIA-B per probe type compared with other appearance-based methods
| Condition | Method | 0° | 18° | 36° | 54° | 72° | 90° | 108° | 126° | 144° | 162° | 180° | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NM | GaitSet | 90.8 | 97.9 | 99.4 | 96.9 | 93.6 | 91.7 | 95.0 | 97.8 | 98.9 | 96.8 | 85.8 | 95.0 |
| | GaitPart | 94.1 | 98.6 | 99.3 | 98.5 | 94.0 | 92.3 | 95.9 | 98.4 | 99.2 | 97.8 | 90.4 | 96.1 |
| | GaitGL | 96.0 | 98.3 | 99.0 | 97.9 | 96.9 | 95.4 | 97.0 | 98.9 | 99.3 | 98.8 | 94.0 | 97.4 |
| | GaitBase | 93.9 | 98.8 | 99.6 | 98.1 | 94.0 | 91.6 | 94.9 | 98.4 | 99.3 | 98.5 | 91.8 | 97.6 |
| | GaitDLF | 80.1 | 87.1 | 87.7 | 89.2 | 84.4 | 84.4 | 84.2 | 85.2 | 85.2 | 85.4 | 80.6 | 84.86 |
| BG | GaitSet | 83.8 | 91.2 | 91.8 | 88.8 | 83.3 | 81.0 | 84.1 | 90.0 | 92.2 | 94.4 | 79.0 | 87.2 |
| | GaitPart | 89.1 | 94.8 | 96.7 | 95.1 | 88.3 | 94.9 | 89.0 | 93.5 | 96.1 | 93.8 | 85.8 | 90.7 |
| | GaitGL | 92.6 | 96.6 | 96.8 | 95.5 | 93.5 | 89.3 | 92.2 | 96.5 | 98.2 | 96.9 | 91.5 | 94.5 |
| | GaitBase | 91.9 | 95.5 | 96.8 | 94.7 | 90.9 | 88.9 | 91.7 | 94.9 | 96.2 | 95.5 | 86.3 | 94.0 |
| | GaitDLF | 68.2 | 71.0 | 74.2 | 75.8 | 69.0 | 71.2 | 71.5 | 71.2 | 72.0 | 70.5 | 63.5 | 70.74 |
| CL | GaitSet | 61.4 | 75.4 | 80.7 | 77.3 | 72.1 | 70.1 | 71.5 | 73.5 | 73.5 | 68.4 | 50.0 | 70.4 |
| | GaitPart | 70.7 | 85.5 | 86.9 | 83.3 | 77.1 | 72.5 | 76.9 | 82.2 | 83.8 | 80.2 | 66.5 | 78.7 |
| | GaitGL | 76.6 | 90.0 | 90.3 | 87.1 | 84.5 | 79.0 | 84.1 | 87.0 | 87.3 | 84.4 | 69.5 | 83.8 |
| | GaitBase | 60.2 | 77.6 | 82.8 | 78.7 | 74.8 | 72.2 | 76.1 | 78.2 | 76.8 | 72.0 | 56.9 | 77.4 |
| | GaitDLF | 67.0 | 68.1 | 69.1 | 68.9 | 64.4 | 68.0 | 69.2 | 71.3 | 69.3 | 70.1 | 62.1 | 67.95 |
As shown in Table 5, the state-of-the-art appearance-based methods outperform GaitDLF on the CASIA-B dataset. It is worth noting, however, that CASIA-B was collected under constrained conditions in 2006 and contains only 124 subjects and approximately 13K video sequences, with predefined viewing angles and occlusions. In such indoor environments, free of many complicating factors, appearance-based approaches can exhibit the highest gait recognition accuracy.
Tables 3 and 5 show that the performance of both skeleton-based and silhouette-based methods on the indoor dataset is susceptible to changes in walking condition. Moreover, the recognition performance of appearance-based methods fluctuates strongly as the viewing angle changes, and such fluctuations can be expected to grow in complex environments. In wild environments, therefore, skeleton-based methods are likely to be more advantageous owing to their greater robustness.
Table 6
Averaged Rank-1, Rank-5, Rank-10, and Rank-20 accuracies in percent on GREW and averaged Rank-1 accuracies in percent on Gait3D compared with other appearance-based methods
| Method | Venue | GREW Rank-1 | GREW Rank-5 | GREW Rank-10 | GREW Rank-20 | Gait3D Rank-1 |
|---|---|---|---|---|---|---|
| GaitSet | AAAI 2019 | 46.28 | 63.58 | 70.26 | 76.82 | 36.7 |
| GaitPart | CVPR 2020 | 44.01 | 60.68 | 67.25 | 73.47 | 28.2 |
| GaitGL | ICCV 2021 | 47.28 | 63.56 | 69.32 | 74.18 | 29.7 |
| GaitBase | CVPR 2023 | 60.1 | 75.4 | 80.38 | 84.16 | 64.6 |
| GaitDLF | Ours | 68.77 | 82.25 | 87.25 | 90.3 | 27.2 |
A notable feature of the GREW dataset, released in 2021, is that it was collected in a highly complex, unconstrained environment and covers 26K individuals and approximately 128K video sequences. The dataset is completely unconstrained, exhibits many uncontrolled variations, and offers diverse viewpoints. GREW also includes challenging factors such as complex backgrounds, occlusion, carried objects, and varied dress. As shown in Table 6, GaitDLF achieves a Rank-1 accuracy of \(68.77\%\), while the state-of-the-art silhouette-based method reaches \(60.1\%\). GaitDLF thus outperforms existing state-of-the-art appearance-based methods in an unconstrained wild environment.
Skeleton-based gait recognition prioritizes pose, angle, and other directly gait-relevant information, whereas appearance-based networks tend to emphasize appearance cues such as color and texture. For a large dataset collected in unconstrained, complex environments, a skeleton-based approach that extracts gait features only from skeleton sequences is therefore more appropriate than an appearance-based one.
The Gait3D dataset contains only 4K individuals and 25K video sequences collected in a supermarket, far less data than GREW. As shown in Table 6, the Rank-1 accuracy of GaitDLF is only \(27.2\%\), versus \(64.6\%\) for the state-of-the-art silhouette-based method. This shows that silhouette-based and skeleton-based methods each have their own advantages and usage scenarios.
The above results show that under strict constraints, good results can be achieved with an appearance-based approach, while for unconstrained wild environments with large numbers of samples, a skeleton-based approach is more appropriate. We attribute this to the fact that a skeleton sequence carries much less information than a silhouette sequence: it contains only the coordinates of each joint and their confidence scores, whereas silhouette maps contain many visual cues. Larger datasets provide the volume and diversity that help skeleton-based models capture the many variations and characteristics of gait; on smaller datasets, the diversity may be insufficient for a skeleton-based model to learn a wide range of gait variations, leading to degraded performance.
In conclusion, a larger and more diverse dataset helps a skeleton-based model capture a wider variety of gait variations and features. Unlike appearance-based methods, model-based methods focus only on human gait information and ignore rich appearance information. Although this hurts performance on small datasets, on large datasets with many unpredictable influences, skeleton-based methods can ignore highly sensitive appearance features and still learn discriminative gait features. Such methods can therefore better realize their potential on large, complex datasets and in real-world gait recognition tasks.
4.5 Ablation studies
In this section, we verify the effectiveness of each designed module through an extensive ablation analysis of GaitDLF. Using GaitGraph2 as the baseline, we first adjust its loss function, data augmentation, and test-time augmentation (TTA) strategies. During testing, GaitGraph2 uses the left-right-flipped sample and the time-reversed sample as two supplementary samples, concatenates the three resulting embeddings for the subsequent distance calculation, and measures the distance between gallery and probe samples with cosine similarity. Our approach, in contrast, uses no supplementary samples: all frames are fed to the network once, and the distance between gallery and probe samples is the Euclidean distance between their feature vectors. As shown in Table 7, the adjusted baseline achieves a recognition accuracy of approximately \(58\%\), an increase of roughly 25 percentage points over the original GaitGraph2.
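The two matching schemes can be summarized as follows; the function names and feature shapes are illustrative, assuming embeddings have already been extracted.

```python
import numpy as np

def cosine_distance_with_tta(embed_orig, embed_flip, embed_reverse, gallery):
    """GaitGraph2-style matching (as described above): embeddings of the
    original, left-right-flipped, and time-reversed samples are concatenated
    and compared to the gallery by cosine similarity. Inputs are assumed to
    be feature matrices of shape (n, d); `gallery` has shape (m, 3*d)."""
    probe = np.concatenate([embed_orig, embed_flip, embed_reverse], axis=1)
    p = probe / np.linalg.norm(probe, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return 1.0 - p @ g.T  # cosine distance

def euclidean_distance_no_tta(probe, gallery):
    """Our adjusted baseline: no auxiliary samples, all frames fed to the
    network once, plain Euclidean distance between feature vectors."""
    diff = probe[:, None, :] - gallery[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))
```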
Table 7
Rank-1, Rank-5, Rank-10, and Rank-20 accuracies in percent on the GREW dataset, with each of the proposed modules removed in turn to assess its contribution
| Method | GAS | SLM | DFF | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
|---|---|---|---|---|---|---|---|
| GaitGraph2 | | | | 33.54 | 49.45 | 56.28 | 61.92 |
| GaitBase | | | | 60.1 | 75.4 | 80.38 | 84.16 |
| Ours | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | 68.77 | 82.77 | 87.25 | 90.3 |
| | \(\checkmark\) | \(\checkmark\) | | 66.1 | 81.8 | 86.6 | 90 |
| | \(\checkmark\) | | | 58.2 | 76.2 | 82.1 | 86.2 |
| | | \(\checkmark\) | \(\checkmark\) | 68.1 | 82.3 | 86.6 | 89.8 |
4.5.1 Evaluation of the SLM block
Inspired by appearance-based approaches, the SLM module captures distinct motion information from different limb parts to exploit richer gait characteristics. As shown in Table 7, with only the GAS left in the network, the Rank-1 accuracy drops to \(58.2\%\), 10.57 percentage points below the full GaitDLF. This result validates the SLM module's ability to model different limb parts and shows that focusing on the fine details of localized limb movement is effective not only for appearance-based approaches but also for model-based ones.
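To make the idea concrete, the sketch below slices joint-level features into five limb parts; the COCO-17 joint grouping shown is a plausible assumption for illustration, not the paper's exact definition.

```python
import torch

# Hypothetical grouping of the 17 COCO keypoints (as produced by HRNet) into
# five limb parts; the indices below are an illustrative assumption.
LIMB_PARTS = {
    "head_torso": [0, 1, 2, 3, 4, 5, 6, 11, 12],
    "left_arm":   [5, 7, 9],
    "right_arm":  [6, 8, 10],
    "left_leg":   [11, 13, 15],
    "right_leg":  [12, 14, 16],
}

def split_into_limbs(joint_feats: torch.Tensor) -> dict:
    """Slice joint-level features (shape: batch, channels, frames, joints)
    into five limb-level feature tensors, one per part."""
    return {name: joint_feats[..., idx] for name, idx in LIMB_PARTS.items()}
```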
4.5.2 Evaluation of the DFF block
After feature extraction by the SLM module, limb-level motion features for the five limb parts are obtained. Several simple but effective methods could fuse these five parts; to fuse them more effectively, we propose the attention-based DFF block, which dynamically adjusts the attention weights of the five limb-level motion features so that essential information is preserved and redundant details are discarded.
It is worth noting that both global and local channel context, computed over the five limb parts, are used when generating the attention weights. As shown in Table 7, removing the DFF block from the LDS reduces the Rank-1 accuracy to \(66.1\%\), 2.67 percentage points below the full GaitDLF, which further confirms the efficacy of the proposed DFF block in fusing localized limb features.
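A minimal sketch of such attention-based fusion is given below; the layer sizes, score functions, and module structure are assumptions in the spirit of the DFF block, not its exact implementation.

```python
import torch
import torch.nn as nn

class DynamicFusionSketch(nn.Module):
    """Illustrative attention-weighted fusion of five limb-level features.
    Input: tensor of shape (batch, n_limbs=5, channels)."""

    def __init__(self, channels: int, n_limbs: int = 5, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        # Local context: a score from each limb's own channel statistics.
        self.local = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        # Global context: per-limb scores from the limb-averaged descriptor.
        self.globl = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_limbs))

    def forward(self, limb_feats: torch.Tensor) -> torch.Tensor:
        local_score = self.local(limb_feats).squeeze(-1)   # (B, 5)
        global_score = self.globl(limb_feats.mean(dim=1))  # (B, 5)
        weights = torch.softmax(local_score + global_score, dim=1)
        return (weights.unsqueeze(-1) * limb_feats).sum(dim=1)  # fused: (B, C)
```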
4.5.3 Evaluation of the GAS
Having verified the effectiveness of the two modules in the LDS, we next remove the original GAS. According to Table 7, the network with only the LDS achieves \(68.1\%\), 0.67 percentage points below the full GaitDLF, showing that the GAS and LDS together allow the network to achieve better performance.
4.7 Complexity analysis
We compare the complexity of GaitDLF with that of state-of-the-art skeleton-based methods. As shown in Table 8, GaitDLF has 0.94 M parameters, just 0.18 M more than GaitGraph2's 0.76 M, while their FLOPs are comparable, indicating that GaitDLF maintains a similar computational burden. More importantly, GaitDLF shows excellent performance on the two important outdoor datasets, reaching a Rank-1 accuracy of \(27.2\%\) on Gait3D and \(68.77\%\) on GREW, clearly outperforming the other comparison methods. GaitGraph, by contrast, is more lightweight at only 0.35 M parameters but performs far worse in practice.
Table 8
Comparison of performance and computational complexity of skeleton-based gait recognition methods on Gait3D and GREW outdoor datasets
| Method | Gait3D Rank-1 | GREW Rank-1 | Params | FLOPs |
|---|---|---|---|---|
| GaitGraph | 6.25 | 1.31 | 0.35 M | 0.0753 G |
| GaitGraph2 | 11.1 | 33.54 | 0.76 M | 0.1936 G |
| GaitDLF | 27.2 | 68.77 | 0.94 M | 0.1939 G |
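For reference, parameter counts like those in Table 8 can be reproduced with a short PyTorch helper such as the one below; measuring FLOPs additionally requires a profiler such as `thop` or `fvcore`, which is omitted here.

```python
import torch.nn as nn

def count_parameters_millions(model: nn.Module) -> float:
    """Trainable parameter count in millions, as reported in Table 8
    (e.g., approximately 0.94 M for GaitDLF)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```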