
Open Access 03.02.2023 | Original Article

Spatial–temporal transformer for end-to-end sign language recognition

Authors: Zhenchao Cui, Wenbo Zhang, Zhaoxin Li, Zhaoqi Wang

Published in: Complex & Intelligent Systems | Issue 4/2023


Abstract

Continuous sign language recognition (CSLR) is an essential task for communication between hearing-impaired and hearing people; it aims at aligning low-density video sequences with high-density text sequences. Current CSLR methods are mainly based on convolutional neural networks. However, these methods balance spatial and temporal features poorly during visual feature extraction, which makes it difficult to improve recognition accuracy. To address this issue, we designed an end-to-end CSLR network: the Spatial–Temporal Transformer Network (STTN). The model encodes and decodes the sign language video as a predicted sequence that is aligned with a given text sequence. First, since the image sequences are too long for the model to handle directly, we chunk the sign language video frames ("image to patch"), which reduces the computational complexity. Second, global features of the sign language video are modeled at the beginning of the network: the spatial action features of the current video frame and the semantic features of consecutive frames in the temporal dimension are extracted separately, so that visual features are extracted more fully. Finally, the model aligns video and text with a simple cross-entropy loss. We extensively evaluated the proposed network on two publicly available datasets, CSL and RWTH-PHOENIX-Weather multi-signer 2014 (PHOENIX-2014), and the results demonstrate the superior performance of our work on the CSLR task compared with state-of-the-art methods.
Notes
Zhenchao Cui and Wenbo Zhang have contributed equally to this work.


Introduction

Sign language is the primary language of the hearing-impaired community and consists of various gestural movements, facial expressions, and head movements. According to the World Health Organization (WHO) [1], 466 million people worldwide suffer from hearing loss, accounting for more than 5% of the world's population, and nearly 2.5 billion people are expected to suffer from hearing impairment by 2050. Therefore, the development of sign language recognition (SLR) technology is of great importance for daily communication between hearing-impaired and hearing people as well as for social development. Traditional SLR methods are limited to static gestures and isolated words [2–4]. In contrast, continuous sign language recognition (CSLR) better meets the communication needs of hearing-impaired people. Compared with isolated-word SLR, CSLR methods process sign language videos that contain rich semantic movements [5, 6], and the magnitude of these movements is more localized and detailed.
Video sequences for CSLR are longer and more complex, and require feature and semantic learning in sequential frame sequences [6, 7]. It is challenging to map low-density sign language video sequences to the corresponding high-density natural language sequences. In real-world scenarios, sign language videos contain complex life scenes [8], and thus, there are long-term semantic dependencies in the videos. Each video frame is correlated not only with adjacent video frames but also with distant video frames. Typically, CSLR requires the detection of key frames in sign language videos. Spatial feature sequences are extracted from key frames using convolutional neural networks (CNN), and then, temporal features are fused by recurrent neural networks (RNN).
To achieve high recognition accuracy, the feature extraction of sign language sequences is especially critical. However, existing methods [7–10] have difficulties in capturing detailed temporal dynamics over long intervals due to insufficient feature extraction. Therefore, adequately capturing visual features in sign language videos, especially long-term semantic dependencies, and extracting the corresponding video contextual features are key issues in CSLR. In addition, all values in the output coding vector C of an RNN encoder [11] contribute equally, which leads to information loss for long sequences, and the model cannot be executed in parallel, which is another major problem. In contrast, the Transformer [12] has strong semantic feature extraction and long-range feature capture capabilities: it not only focuses on local information but also captures global information from low-level features, and then constructs global connections between key points.
To solve the long-term semantic dependence of sign language video, we propose a temporal and spatial feature extraction method based on Transformer. The proposed model can capture the spatial feature information of video frames while focusing on the contextual semantic information of consecutive frames. The model can extract the rich sign language features more efficiently and thus improve the recognition accuracy. The work in this paper is based on the traditional Transformer model and combines the characteristics of sign language video sequences for network design. Specifically, we perform a patch chunking operation on sign language video frames to facilitate model learning and training and propose a Spatial–Temporal Transformer model for CSLR (shown in Fig. 1).
The main contributions of this paper are as follows:
1.
We propose a convolution-free sign language recognition network that contains a spatial–temporal (ST) feature encoder and a dynamic decoder. The ST feature encoder distinguishes temporal from spatial features: part of the attention module focuses only on contextual features in the temporal dimension, while the other part extracts the spatial dynamic features of the video frames. With this design, the extraction of sign language video features is enhanced by aggregating the attention results from different heads.
 
2.
For the long frame sequences of sign language videos, a patch operation is designed to map them into easy-to-process sequences. This operation reduces the computational complexity and facilitates the processing of sign language videos.
 
3.
We designed a progressive learning strategy to explore the effect of frame size and patch size on recognition results. We conducted experimental evaluations on two widely used datasets, CSL and PHOENIX-2014, and our method obtained competitive performance compared with several recently proposed methods.
 
The remainder of this paper is organized as follows: after presenting the related work in Section II, we describe the proposed architecture in Section III. In Section IV, we present the experimental results. Finally, in Section V, we draw conclusions and outline future work.
Related work

Sign language recognition can be classified into isolated word recognition [3–5] and continuous sign language recognition [6, 7] depending on whether the input is continuous. Early SLR relied on manual extraction of features, including handshape, appearance, and motion trajectory [8]. The sign language video is first converted into a high-dimensional feature vector by a visual encoder, and then a decoder learns the mapping from this feature vector to semantic text. Initially, CSLR was also based on the recognition of individual isolated words. Such isolated-word-based CSLR involves temporal segmentation algorithms [9], which is a complex process and suffers from a high misclassification rate caused by the segmentation. With the development of deep learning, recent CSLR methods have turned to the automatic extraction of sign language features using deep neural networks.
According to the input modality of recognition, methods can be divided into single modality and mixed modality, where single modality refers to RGB video as the input and mixed modality adds skeleton, depth, optical flow, and other information to RGB video [8]. Current recognition methods mainly focus on a single modality. To extract the visual features of sign language videos, most research used convolutional networks to extract feature sequences from videos, which generally means extracting spatial features using two-dimensional convolution (2DCNN) or three-dimensional convolutional networks (3DCNN), and then modeling temporal information dependence using RNN.
Oscar Koller et al. [28] embedded a model combining a CNN and Long Short-Term Memory (LSTM) in each Hidden Markov Model (HMM) stream, relying on the sequence constraints of independent HMM streams to learn sign language, mouth-shape, and hand-shape classifiers in a sequentially parallel manner, which reduced the single-stream Word Error Rate (WER) to 26.5% and the dual-stream WER to 24.1% on the RWTH-PHOENIX-Weather 2014T dataset (PHOENIX-2014-T, an extended version of PHOENIX-2014 mainly used for sign language translation tasks). Cihan Camgoz et al. [7] proposed a deep, end-to-end CSLR framework, SubUNets, to improve intermediate representation learning. Cui et al. [13] developed a CSLR framework that combines a CNN with a Bi-directional Long Short-Term Memory (Bi-LSTM) model and uses an iterative optimization strategy to obtain representative features from the CNN; experiments on the PHOENIX-2014 database and the SIGNUM signer-dependent set reduced the WERs to 24.43% and 3.58%, respectively. The VAC network proposed by Min et al. [14] uses a 2DCNN to extract frame features and then applies one-dimensional convolutional networks (1DCNN) for local feature extraction, with two auxiliary modules added for alignment supervision; on the PHOENIX-2014 dataset, the WER was reduced to 21.2%.
Since sign language sequences require strong temporal correlation between frames, 3D convolution has been adopted for temporal feature extraction. Pu et al. [27] proposed a CNN-based model for continuous dynamic CSLR from RGB video input. They generated pseudo-labels for video clips from a sequence learning model with Connectionist Temporal Classification (CTC) and fine-tuned a 3D-ResNet under the supervision of these pseudo-labels for a better feature representation. Their method was evaluated on the PHOENIX-2014 dataset and reduced the WER to 38.0%. Huang et al. [15] proposed a video-based CSLR method without temporal segmentation, built on a 3DCNN and a hierarchical attention network. Yang et al. [16] proposed a shallow hybrid CNN that uses both 2D and 3D convolutions, coupled with two LSTM networks for gloss- and sentence-level sequence modeling, respectively.
Although CNNs have strong feature extraction capability, they are limited to feature extraction from single-frame images, and the limited receptive field of 3D convolution leads to insufficient extraction of long-term temporal dependencies. A convolutional network that summarizes the whole video into a single feature vector through average pooling completely ignores the sequential relationship of video frames and therefore loses the temporal and contextual information of the sign language video. Moreover, convolutional networks must stack many layers to aggregate global information, which leads to problems such as a low learning rate and difficulty in propagating information over long distances.
With the explosive application of the Transformer [17] in machine translation, its strength in modeling long-range sequences has been widely adopted in the vision field, which can alleviate some of the feature extraction problems in CSLR. From 2020, the Transformer started to make a splash in the Computer Vision (CV) field: ViT [18] for image classification, DETR [19] for object detection, semantic segmentation (SETR [20], MedT [21]), image generation (GANsformer [22]), and video understanding (TimeSformer [23]), among others. M. Rosso et al. [24] employed ViT for the first time in the road tunnel assessment field, where the vision transformer achieved compelling results for automatic road tunnel defect classification. L. Tanzi et al. [25] applied a ViT architecture to femur fracture classification, outperforming state-of-the-art CNN-based approaches. Camgoz et al. [26] introduced the Transformer architecture for joint end-to-end CSLR and translation, with superior translation performance on the PHOENIX-2014-T dataset.
The length of sign language sequences makes the Transformer's computation highly complex, and it is impractical to flatten a sign language video sequence directly as Transformer input. To this end, we improved the Transformer structure. We propose a patch operation for video frames, which reduces the input dimension of video frame sequences; this alleviates the computational burden in the first place while facilitating feature extraction. Since the standard Transformer cannot distinguish between the temporal and spatial features of sign language videos, we designed an ST dual-channel feature extraction network that extracts contextual features and dynamic features separately, yielding more adequate visual feature extraction.

Spatial–temporal transformer networks for sign language recognition

We propose to treat CSLR as a vector mapping from a low-density video sequence to a high-density sign language text sequence. The mapping is presented in Eq. (1)
$$\left\{ Y_{1}^{s},Y_{2}^{s},Y_{3}^{s},\ldots ,Y_{m}^{s}\mid Y_{i}^{s}\in R^{d_{Y}} \right\} = F\left( \left\{ X_{1}^{t},X_{2}^{t},X_{3}^{t},\ldots ,X_{n}^{t}\mid X_{i}^{t}\in R^{d_{X}} \right\} \right),$$
(1)
where X and Y represent the video sequence and the text sequence, t and s denote their dimensions, and n and m denote their lengths.
In this work, we propose a new end-to-end Spatial–Temporal fusion Transformer Network (STTN) for CSLR. Its architecture is shown in Fig. 1 and was introduced in the fourth paragraph of the introduction. The model consists of three main parts: sign language video sequence vectorization, ST feature extraction, and feature decoding. First, video frames are sampled uniformly, and the extracted frames are patched and position encoded. Second, the patch sequences with position information are fed into the encoder part of the model. Third, temporal and spatial features are extracted and fused by the ST encoder. Finally, the fused features are fed to the decoder for decoding and prediction.

Vectorization processing

Patch embedding

The standard Transformer input is a one-dimensional token embedding. The input in our method is a vector sequence of dimension \(f\in R^{B\times T\times C\times H\times W} \) (where B is the batch size, T is the number of frames, C is the number of channels, and H and W are the height and width of each sign language frame, respectively). We reshape each frame \(Z_{i}^{C\times H\times W}\) of the T-frame sign language video into a 2D block of dimension \(\left( h\times w \right) \times \left( p_{1} \times p_{2} \times C \right) \), where \(H=h\times {{p}_{1}}\) and \(W=w\times {{p}_{2}}\). Here, \(h\times w\) is the number of blocks per frame, which directly determines the length of the input sequence. A constant hidden size \({{d}_{model}}\) is used in all layers, and the flattened blocks are projected to the size \({{d}_{model}}=D\), where D is the embedding dimension. The output of this projection is the patch embedding. At this point, the size of the feature map is \(B\times T\times N\times D\), where N is the product of h and w. The output of patch embedding is denoted \({{X}_{(p,t)}}\), where p indexes patches and t indexes frames.
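To make the patch operation concrete, the following is a minimal PyTorch sketch (not the authors' released code) of splitting a batch of frames into non-overlapping \(p_1 \times p_2\) patches and projecting them to \(d_{model}\) with a learnable linear layer; the class and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split each frame into (p1 x p2) patches and project them to D dimensions.
    Minimal illustrative sketch; names and defaults are assumptions."""
    def __init__(self, p1=32, p2=32, channels=3, d_model=512):
        super().__init__()
        self.p1, self.p2 = p1, p2
        self.proj = nn.Linear(p1 * p2 * channels, d_model)

    def forward(self, f):                       # f: (B, T, C, H, W)
        B, T, C, H, W = f.shape
        h, w = H // self.p1, W // self.p2       # patches per column / row
        # cut H and W into (h, p1) and (w, p2), then gather the N = h*w patches
        x = f.reshape(B, T, C, h, self.p1, w, self.p2)
        x = x.permute(0, 1, 3, 5, 2, 4, 6)      # (B, T, h, w, C, p1, p2)
        x = x.reshape(B, T, h * w, C * self.p1 * self.p2)
        return self.proj(x)                     # (B, T, N, D)

# usage: frames = torch.randn(1, 48, 3, 224, 224)
# PatchEmbedding()(frames).shape -> (1, 48, 49, 512)
```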

Positional embedding

To prevent the loss of position information in the network, the feature map with dimension \({{f}_{0}}\in {{R}^{B\times T\times N\times D}}\) needs to be position encoded before entering the encoder for feature extraction. Position encoding requires that each position has unique position information and that the relationship between two positions can be modeled by an affine transformation of their encodings. The sinusoidal position embedding functions in Eqs. (2) and (3) have been experimentally verified to satisfy these requirements
$$PE_{\left( pos,2i \right) } = \sin \left( \frac{pos}{10000^{2i/d_{model}}} \right),$$
(2)
$$PE_{\left( pos,2i+1 \right) } = \cos \left( \frac{pos}{10000^{2i/d_{model}}} \right),$$
(3)
where pos indicates the position of the token in the sequence (the starting token position is 0), 2i and \(2i+1\) indicate the even and odd dimensions, and i takes values in the range \([0,\ldots ,{{d}_{model}}/{2})\). We denote the resulting positional encoding information as \(e_{(p,t)}^{pos}\). As shown in Fig. 2, the left image is without position coding: since the dimension values at each position are the same, information at different positions cannot be distinguished. The right image shows the result after adding position coding, where the dimension values at each position are unique, so the information at each position can be labeled.
Positional Encoding and Patch-Embedding have the same dimensionality \({{d}_{model}}\), so these two can be directly summed. The vector after positional-embedding is noted as \(Z_{(p,t)}^{i}\) as specified below, in Eq. (4)
$$\begin{aligned} Z_{\left( p,t \right) }^{i} = E\cdot X_{\left( p,t \right) } + e_{\left( p,t \right) }^{pos}, \end{aligned}$$
(4)
where X denotes the vector corresponding to each patch, and X is multiplied with a learnable matrix E. The result of the multiplication is added to the position code \(e_{(p,t)}^{pos}\).
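For reference, the sinusoidal encoding of Eqs. (2) and (3) can be generated as follows; this is a generic sketch of the standard formulation, not code taken from the paper.

```python
import torch

def sinusoidal_positional_encoding(length, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))  (standard form)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # (length, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dimensions
    div = torch.pow(10000.0, i / d_model)
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe                                                      # (length, d_model)

# e.g. added to the projected patches: z = x_proj + sinusoidal_positional_encoding(N, 512)
```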

Encoder

The original Transformer, like most seq2seq models [32], consists of an encoder and a decoder. The encoder and decoder each consist of \(N=6\) identical layers, each layer containing two sub_layers: a multi-head self-attention (MHA) mechanism and a fully connected feedforward network (FFN). Each sub_layer is wrapped with a residual connection and layer normalization, as expressed in Eq. (5)
$$sub\_layer\_output = LayerNormalization\left( x+ sub\_layer\left( x \right) \right). $$
(5)
The Transformer's attention is a linear combination \(\sum \nolimits _{i}{{{a}_{i}}{{v}_{i}}}\) of all word vectors \({{v}_{i}}\) in the encoded sentence, weighted by the learned attention weights \({{a}_{i}}\), which is then used for decoding and prediction. Multi-head self-attention projects the “query” Q, “key” K, and “value” V with a different linear transformation for each head (“heads” is the number of attention heads). The projection of V is illustrated in Fig. 3.
The results of the different heads are concatenated as shown in Eq. (6) and schematically visualized in Fig. 4
$$\begin{aligned} MHA\left( Q, K, V \right) =Concat\left( head_{1},\ldots ,head_{h} \right) W^{o}, \end{aligned}$$
(6)
where \(head_{i}=Attention(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})\). In addition, the attention is usually computed with the scaled dot-product (the calculation process is shown in Fig. 5), as given in Eq. (7)
$$\begin{aligned} Attention\left( Q, K, V \right) =softmax\left( \frac{QK^{T} }{\sqrt{d_{k} } } \right) V. \end{aligned}$$
(7)
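A minimal sketch of the scaled dot-product attention of Eq. (7), with an optional mask argument as used later in the decoder; tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Eq. (7): softmax(Q K^T / sqrt(d_k)) V.  q, k, v: (..., seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (..., seq_q, seq_k)
    if mask is not None:
        # positions where mask == 0 are not allowed to be attended to
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1) @ v                 # (..., seq_q, d_k)
```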
In general, CSLR is highly ST dependent, and it is difficult to capture temporal features while also capturing spatial features. In this paper, to address the modeling of temporal and spatial features, we propose an ST encoder structure for the dynamic spatial correlation and long-term temporal correlation of sign language videos. As presented in Fig. 6, the proposed ST encoder is composed of a spatial-attention block and a temporal-attention block. The incoming sign language video vector is split into two channels for temporal and spatial attention, and the extracted features are then combined. By exploiting dynamic directed spatial correlation and long-term temporal correlation, the model's ability to extract and encode the features of sign language video frames is enhanced.
The spatial self-attention block performs the MHA calculation only among the tokens of the same frame; the spatial attention value of each patch (p, t) is calculated as
$$\left[ Z_{\left( p,t \right) }^{i} \right] ^{space} =softmax\left( \left( \frac{q_{\left( p,t \right) }^{i} }{\sqrt{D_{h} } } \right) ^{\!\top } \cdot \left[ k_{\left( 0,0 \right) }^{i},\ \left\{ k_{\left( p^{'} ,t \right) }^{i} \right\} _{p^{'}=1,\ldots ,N } \right] \right),$$
(8)
where N is the number of patches.
The temporal self-attention block calculates MHA only for tokens at the same position in different frames, i.e., attention in the time dimension
$$\left[ Z_{\left( p,t \right) }^{i} \right] ^{time} =softmax\left( \left( \frac{q_{\left( p,t \right) }^{i} }{\sqrt{D_{h} } } \right) ^{\!\top } \cdot \left[ k_{\left( 0,0 \right) }^{i},\ \left\{ k_{\left( p ,t^{'} \right) }^{i} \right\} _{t^{'}=1,\ldots ,M } \right] \right),$$
(9)
where M is the number of frames. The calculated temporal attention and spatial attention are then combined
$$\begin{aligned} Z_{\left( p,t \right) }^{i}=\left[ Z_{\left( p,t \right) }^{i} \right] ^{space} +\left[ Z_{\left( p,t \right) }^{i} \right] ^{time}. \end{aligned}$$
(10)
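The divided attention of Eqs. (8)–(10) can be sketched as below, assuming the spatial branch attends over the N patches of each frame and the temporal branch over the T frames at each patch position; the module uses PyTorch's built-in multi-head attention and is an illustrative approximation rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SpatialTemporalAttention(nn.Module):
    """Divided attention: one MHA over the patches within a frame (spatial)
    and one MHA over the frames at the same patch position (temporal).
    Minimal sketch; hyperparameters are the defaults reported in Table 2."""
    def __init__(self, d_model=512, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, z):                                # z: (B, T, N, D)
        B, T, N, D = z.shape
        # spatial branch: attend over the N patches of each frame, Eq. (8)
        zs = z.reshape(B * T, N, D)
        zs, _ = self.spatial(zs, zs, zs)
        zs = zs.reshape(B, T, N, D)
        # temporal branch: attend over the T frames at each patch position, Eq. (9)
        zt = z.permute(0, 2, 1, 3).reshape(B * N, T, D)
        zt, _ = self.temporal(zt, zt, zt)
        zt = zt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        return zs + zt                                   # Eq. (10): sum of both branches
```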

Decoder

The role of the decoder is to predict the next possible 'value' based on the encoder output and the previous predictions. As shown in Fig. 7, each decoder layer consists of three \(sub\_layers\): the first \(sub\_layer\) includes a multi-head self-attention layer, a normalization layer, and a residual connection; the second \(sub\_layer\) includes a multi-head cross-attention layer, a normalization layer, and a residual connection; the third \(sub\_layer\) contains an FFN, a normalization layer, and a residual connection. The encoder side produces a single output, which is passed into every decoder layer and acts as the K and V of the multi-head attention mechanism in the second of these \(sub\_layers\). The three sub_layers are computed as shown in Eqs. (11)–(13)
$$\begin{aligned} self\_attn:Q_{i}^{1} =&MHA\left( \tilde{Q_{i-1}}, \tilde{Q_{i-1}},\tilde{Q_{i-1}} \right) , \end{aligned}$$
(11)
$$\begin{aligned} cross\_attn:Q_{i}^{2} =&MHA\left( \tilde{Q_{i-1}}, {\tilde{F}},F \right) , \end{aligned}$$
(12)
$$\begin{aligned} FFN:Q_{i} =&FFN\left( Q_{i}^{2} \right) . \end{aligned}$$
(13)
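A sketch of one decoder layer following Eqs. (11)–(13), assuming post-norm residual sub_layers and PyTorch's built-in multi-head attention; layer names and hyperparameters are illustrative, not the released implementation.

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Masked self-attention, cross-attention over the encoder memory F,
    then a position-wise feed-forward block (Eqs. (11)-(13))."""
    def __init__(self, d_model=512, heads=4, d_ff=2048, dropout=0.5):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Dropout(dropout), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, q, memory, causal_mask=None):
        # Eq. (11): masked self-attention over the partially generated sequence
        x, _ = self.self_attn(q, q, q, attn_mask=causal_mask)
        q = self.norm1(q + x)
        # Eq. (12): cross-attention, the encoder output acts as K and V
        x, _ = self.cross_attn(q, memory, memory)
        q = self.norm2(q + x)
        # Eq. (13): feed-forward network with residual connection and normalization
        return self.norm3(q + self.ffn(q))
```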
To decode the encoded feature vectors, we apply three operations to them: self-attention, cross-attention, and linear mapping. In addition, for the subsequent alignment operation, we perform positional encoding of the sign language text to preserve the natural word order, as shown in Eqs. (14) and (15)
$$PE_{\left( pos,2i \right) } = \sin \left( \frac{pos}{10000^{2i/d_{model}}} \right),$$
(14)
$$PE_{\left( pos,2i+1 \right) } = \cos \left( \frac{pos}{10000^{2i/d_{model}}} \right).$$
(15)
The position encoding here is the same as the position encoding module used for video frames in the encoder. To ensure that each token can only attend to its predecessors when extracting contextual information, we apply a mask to the attention computation. To facilitate the probability calculation, the vector produced by the decoder stack is linearly mapped to a larger vector, the 'logits' vector. We then apply the softmax function and select the word corresponding to the highest-probability cell as the output of the current time step, as shown in Fig. 8.
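The masking and greedy selection described above might look as follows; `generator` is an assumed linear projection to the gloss vocabulary, not a name from the paper.

```python
import torch

def causal_mask(size):
    """Boolean mask that blocks attention to future positions (True = masked),
    so each target token only uses its predecessors."""
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

def greedy_step(decoder_out, generator):
    """Map the last decoder state to vocabulary logits ('logits' vector),
    apply softmax, and keep the highest-probability cell.
    `generator` is an assumed nn.Linear(d_model, vocab_size)."""
    logits = generator(decoder_out[:, -1])           # (B, vocab_size)
    probs = torch.softmax(logits, dim=-1)
    return probs.argmax(dim=-1)                      # index of the predicted word
```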

Experiment

Dataset

As previously stated, to validate the proposed method, this study conducts experiments on two publicly available datasets: PHOENIX-2014 [6], and CSL [9]. The data composition of the two public datasets and the division of training and test samples are shown in Table 1.
PHOENIX-2014 is a sign language dataset recorded over 6 years (2009-2014) at RWTH Aachen University in Germany, captured from the sign language interpretation of the daily news and weather forecast programs of the German public broadcaster Phoenix. All sign language videos were recorded at 25 frames per second, and the corpus is divided into two versions: 2012 and 2014. The 2014 dataset is an extension of the 2012 dataset, and we experimented on the 2014 version. It contains 190 sign language samples with 965,940 frames and a vocabulary of 1558 words combined into 6861 consecutive utterances.
The Chinese Sign Language dataset (CSL) contains video instances from 50 signers, each performing every sentence 5 times, with 25K labeled samples totaling more than 100 hours of video. The dataset is divided into isolated words and continuous sentences and contains RGB, depth, and skeleton-joint data, with 500 word classes of 250 samples each and sequences of 21 skeleton-joint coordinates. There are 100 sentences and a total of 25,000 videos; each sentence contains 4 to 8 words on average. Each video example is labeled by a professional CSL teacher.
Table 1
Statistical data on the PHOENIX-2014 and CSL datasets

Statistics  | PHOENIX-2014 (Train / Dev / Test) | CSL (Train / Dev / Test)
Signers     | 9 / 9 / 9                         | 30 / 10 / 10
Vocabulary  | 1231 / 460 / 460                  | 504 / 504 / 504
Videos      | 5672 / 540 / 629                  | 15000 / 5000 / 5000

Evaluation metric

For CSLR, substitution, deletion, or insertion of certain words is necessary to make the recognized word sequence consistent with the reference word sequence. The WER is the standard metric for measuring the performance of a CSLR system. It compares the model output at the current parameter values with the correct sign language sentence and is defined as \(\text {WER}=\frac{S+D+I}{N}\), where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the number of words in the reference; an example is shown in Fig. 9.
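For reference, WER can be computed with a standard edit-distance dynamic program over the hypothesis and reference word sequences; a minimal sketch:

```python
def wer(reference, hypothesis):
    """WER = (S + D + I) / N via Levenshtein distance on word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```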
Table 2
The default parameters for the experiment

Image size | Patch size | Batch size | Encoder layers | Decoder layers
224*224    | 32         | 1          | 4              | 4

Heads | D_model | FF_model | Learning_ratio | Dropout
4     | 512     | 2048     | 0.0001         | 0.5

Implementation detail

Our model is built on the PyTorch platform [33], and the experiments were run on an Nvidia RTX 3060 GPU with 12 GB of memory. On the CSL dataset, we extract 60 frames from each sign language video by uniform sampling, then randomly discard 12 of them and use the remaining 48 frames as valid inputs. The size of each frame is first adjusted to \(256 \times 256\). The Adam optimizer is used, with the learning rate and weight decay set to \({{10}^{-4}}\) and \({{10}^{-5}}\), respectively. Table 2 lists the default parameters for our experiments (D_model is the dimensionality of the patch embeddings). Dropout is set to 0.5 to mitigate overfitting. We apply the cross-entropy classification loss on the predictions p with ground-truth targets t to train the model as follows (Eq. (16)):
$$Loss=-[t\log p+(1-t)\log (1-p)],$$
(16)
where t is the true label value and p is the predicted probability value. It characterizes the difference between the true sample label and the predicted probability.
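The sampling and optimization settings described above can be sketched as follows; the helper and the commented model name are illustrative assumptions, not the released training script.

```python
import random
import torch

def sample_frames(video_len, n_uniform=60, n_keep=48):
    """Uniformly pick 60 frame indices, then randomly discard 12 of them
    so that 48 valid frames remain (illustrative helper)."""
    idx = torch.linspace(0, video_len - 1, n_uniform).long().tolist()
    keep = sorted(random.sample(range(n_uniform), n_keep))
    return [idx[k] for k in keep]

# Optimizer settings reported in the text (model class name is assumed):
# model = STTN(...)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
```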

Ablation experiments

Impact of video size

The size of the extracted video frames affects the extracted features, which in turn affects the final results of the model. To explore the behavior of the model under different cropping ratios, we compared three sizes: \(224 \times 224\), \(112 \times 112\), and \(256 \times 256\). The comparison covers three aspects: model size, WER, and experiment time. According to the results in Table 3, the model size does not change across the three scales. Compared with \(224 \times 224\), \(112 \times 112\) occupies the least video memory (40.5% lower) and takes the shortest time per 200 iterations (47.11% lower), but its WER is 34.2% worse than that of \(224 \times 224\). The \(256 \times 256\) setting could not be run successfully because it requires too much memory. According to the experimental results, the best results were achieved with the \(224 \times 224\) setting, where the WER was reduced to 19.94.

Impact of patch size

Sign language videos often contain long sequences, which cause computational difficulties when fed directly into the network. To solve this problem, we divide the sign language video frames into small pieces, i.e., image to patch. To investigate the effect of different patch sizes P (the size of the blocks into which each video frame is divided) on the experimental results, we set P to 8, 16, 28, and 32 for comparison. The experiments are evaluated from two perspectives: accuracy and time. The extracted video frames are uniformly rescaled to \(224 \times 224\).
Table 3
Performance comparison for the three cropping sizes

Rescale            | para(MB) | Memory(GB) | WER   | Times(ms)
\(112 \times 112\) | 38.908   | 4.9        | 26.76 | 3,867.32
\(224 \times 224\) | 38.908   | 11.5       | 19.94 | 6,827.97
\(256 \times 256\) | 38.908   | –          | –     | –
Table 4
Performance comparison for the four patch sizes

Patch size | para(MB) | Memory(GB) | WER   | Times(ms)
8          | 42.924   | –          | –     | –
16         | 38.098   | 11.5       | 20.98 | 6,924.03
28         | 38.909   | 9.7        | 20.51 | 6,538.18
32         | 39.278   | 6.3        | 19.94 | 6,642.84
According to Table 4, the patch size directly affects the sequence length but has a small effect on the number of model parameters. When the patch size is 32, the number of patches is the smallest and the computation occupies the least memory. Although the computation time increases slightly, the model can be computed faster by increasing the batch size. Through experimental comparison, we conclude that the best result is achieved when the patch size is set to 32.

Impact of the proposed modules

Table 5
Ablation experiments on ST. The baseline network is compared with the baseline+designed ST. The metrics include WER and the running time taken to complete 200 iterations
Method      | Dev(%) | Test(%) | Time(s)
baseline    | 25.11  | 24.74   | 3995.11
baseline+ST | 19.94  | 19.98   | 2394.14
In this section, we further verify the effectiveness of the ST module in the STTN architecture. In this experiment, we set the video size to \(224 \times 224\) and the patch size to 32. As shown in Table 5, the first row represents the original Transformer network, whose best WER is 25.11%; the second row represents the architecture after adding our ST encoder, where the WER drops to 19.94%, an improvement of 5.17 percentage points, and the running time is also reduced. Figure 10 shows that the curve after adding the ST module is significantly better than before: it not only reaches a lower WER but also converges faster. These results demonstrate the effectiveness of our method.
Table 6
Performance comparison on PHOENIX-2014 dataset
Methods           | Backbone    | Dev del/ins | Dev WER(%) | Test del/ins | Test WER(%)
SubUNet [7]       | CaffeNet    | 14.6/4.0    | 40.8       | 14.3/4.0     | 40.7
Dilates [27]      | 3D-ResNet   | 8.3/4.8     | 38.0       | 7.6/4.8      | 37.3
CNN-LSTM-HMM [28] | GoogLeNet   | –           | 26.0       | –            | 26.0
SLT [26]          | Transformer | 11.7/6.5    | 24.9       | 11.2/6.1     | 24.6
FCN [10]          | Custom      | –           | 23.7       | –            | 23.9
CMA [9]           | GoogLeNet   | 7.3/2.7     | 21.3       | 7.3/2.4      | 21.9
VAC [29]          | ResNet18    | 7.9/2.5     | 21.2       | 8.4/2.6      | 22.3
STMC* [11]        | Custom      | 7.7/3.4     | 21.1       | 7.4/2.6      | 20.7
SMKD [30]         | ResNet18    | 6.8/2.5     | 20.8       | 6.3/2.3      | 21.0
STTN (ours)       | Transformer | 4.6/2.5     | 19.94      | 4.8/2.4      | 19.98
The best results are marked in bold
The entries denoted by “*” used extra clues (such as keypoints and tracked face regions)

Comparison with state-of-the-art

We compared the proposed algorithm with state-of-the-art CSLR methods (Min et al. [29] 2021; Pu et al. [9] 2020; Hao et al. [30] 2021; Camgöz et al. [26] 2020) using the most common metric, WER; the results on PHOENIX-2014 are shown in Table 6 and the results on CSL in Table 7. Pu et al. enhance features by editing real video–text pairs and generating corresponding pseudo video–text pairs, which achieves a good result (WER dropped to 21.3%) but does not take full advantage of the visual properties of the sign language videos themselves. Min et al. proposed a visual alignment constraint (VAC) method based on the ResNet18 network that enhances feature extraction through additional alignment supervision, and Hao et al. proposed a Self-Mutual Knowledge Distillation (SMKD) method that enforces the visual and contextual modules to focus on short-term and long-term information and enhances the discriminative power of both modules simultaneously. VAC reduces the WER to 21.2% and SMKD reduces it to 20.8%, which shows that their methods are effective and also demonstrates the need to pay more attention to the visual features of the sign language video itself. Camgöz et al. proposed the SLT model for joint end-to-end sign language recognition and translation, but this method cannot distinguish and fully extract the temporal and spatial features of sign language videos, although these features are crucial.
Table 7
Performance comparison on the CSL dataset. The entries denoted by “*” used extra clues (such as keypoints and tracked face regions)
Methods         | WER(%)
LS-HAN [15]     | 17.3
SubUNet [7]     | 11.0
HLSTM-attn [31] | 7.1
SF-Net [16]     | 3.8
FCN [10]        | 3.0
STMC* [11]      | 2.1
VAC [29]        | 1.6
Ours (STTN)     | 1.2
As can be seen in Tables 6 and 7, the proposed method (STTN) achieves good performance, with WERs falling to 19.94% and 1.2% on the PHOENIX-2014 and CSL datasets, respectively. This demonstrates that jointly modeling the long-term temporal dependence and dynamic spatial dependence of sign language videos leads to better learning of their visual properties.

Results’ visualization

To better understand the learning process, we selected a sentence from the CSL dataset for sequence visualization, as shown in Fig. 9 and mentioned in the previous sections, where different predicted sequences correspond to different WER values. As can be seen from Fig. 11, after the 12th epoch the WER fluctuates slightly around 1.2, which indicates that we achieve better results than previous models. We also selected a random sample from the PHOENIX-2014 dataset, shown in Fig. 12, in which the movements of the signer can be seen in each frame. In addition, we visualized the training behavior (the WER curves during training and validation) in Fig. 13: the WER decreases faster during training and reaches a low of 14.26 at the 29th epoch, while on the validation set the decline of the WER slows down and stops after the 30th epoch, settling at 19.94%.

Conclusions

Inadequate feature extraction is one of the major problems in current CSLR tasks and directly leads to poor recognition of sign language. In this study, we propose to enhance the feature extraction capability of the network model in both the temporal and spatial dimensions, and to patch the sign language video frames to reduce the computational effort while enhancing the generalization capability of the model, thus allowing the CSLR network to be trained end-to-end. The proposed method does not require a text-related inductive bias module and aligns video and text using a simple cross-entropy loss. The experiments show that our proposed method achieves state-of-the-art performance on the CSL and PHOENIX-2014 datasets, offering new perspectives on vision and natural language processing.

Acknowledgements

This work was supported by National Key Research and Development Program of China (No. 2020YFC1523302), the Research Initiation Project for High-Level Talents of Hebei University, Contract No. 521100221081, National Natural Science Foundation of China under Grant No. 62172392, Provincial Science and Technology Program of Hebei Province (No. 22370301D), and compute services from Hebei Artificial Intelligence Computing Center.

Declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
9. Pu J, Zhou W, Hu H, et al (2020) Boosting continuous sign language recognition via cross modality augmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1497-1505
10. Cheng KL, Yang Z, Chen Q, Tai YW (2020) Fully convolutional networks for continuous sign language recognition. In: Vedaldi A, Bischof H, Brox T, Frahm JM (eds) Computer Vision - ECCV 2020. Lecture Notes in Computer Science, vol 12369. Springer, Cham. https://doi.org/10.1007/978-3-030-58586-0_41
12. Dai Z, Yang Z, Yang Y, Carbonell J, Le Q, Salakhutdinov R (2019) Transformer-XL: attentive language models beyond a fixed-length context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp 2978-2988. https://doi.org/10.18653/v1/P19-1285
14. Xie P, Cui Z, Du Y, et al (2021) Multi-scale local-temporal similarity fusion for continuous sign language recognition. arXiv preprint arXiv:2107.12762
16. Yang Z, Shi Z, Shen X, et al (2019) SF-Net: structured feature network for continuous sign language recognition. arXiv preprint arXiv:1908.01341
17. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, pp 6000-6010
18. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR
19. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: Vedaldi A, Bischof H, Brox T, Frahm JM (eds) Computer Vision - ECCV 2020. Lecture Notes in Computer Science, vol 12346. Springer, Cham. https://doi.org/10.1007/978-3-030-58452-8_13
21. Valanarasu JMJ, Oza P, Hacihaliloglu I, Patel VM (2021) Medical Transformer: gated axial-attention for medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention - MICCAI 2021. Lecture Notes in Computer Science, vol 12901. Springer, Cham. https://doi.org/10.1007/978-3-030-87193-2_4
23. Bertasius G, Wang H, Torresani L (2021) Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095
25. Tanzi L, Audisio A, Cirrincione G, Aprato A, Vezzetti E (2021) Vision Transformer for femur fracture classification. arXiv preprint arXiv:2108.03414
27. Pu J, Zhou W, Li H (2018) Dilated convolutional network with iterative optimization for continuous sign language recognition. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI'18). AAAI Press, pp 885-891
28. Koller O, Camgoz NC, Ney H, Bowden R (2020) Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(9):2306-2320. https://doi.org/10.1109/TPAMI.2019.2911077
31. Guo D, Zhou W, Li H, Wang M (2018) Hierarchical LSTM for sign language translation. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI'18). AAAI Press, pp 6845-6852
32. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp 1724-1734
33. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in PyTorch