1 Introduction
2 Related Work
3 Continuous Sign Language Recognition
3.1 Legacy GMM-HMM Approach
3.2 Hybrid CNN-HMM Approach
3.3 Tandem Approach
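The two modelling styles in Sects. 3.2 and 3.3 differ in how the CNN outputs enter the HMM: in the tandem approach they serve as observation features for GMM emission models, whereas in the hybrid approach the frame-wise state posteriors \(p(s|x)\) are converted to scaled likelihoods \(p(x|s) \propto p(s|x)/p(s)\) and replace the GMM emission probabilities directly. A minimal sketch of the hybrid conversion, with illustrative function and variable names (the prior-scaling exponent is a common tuning knob, not a value taken from the paper):

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, prior_scale=1.0):
    """Convert CNN softmax posteriors to scaled log likelihoods.

    posteriors: (T, S) array of per-frame state posteriors p(s|x)
    priors:     (S,)  array of state priors p(s)
    Returns log p(s|x) - prior_scale * log p(s), usable as HMM
    emission scores in place of GMM log likelihoods.
    """
    eps = 1e-10  # guard against log(0)
    return np.log(posteriors + eps) - prior_scale * np.log(priors + eps)

# Dummy example: 4 frames, 3 HMM states.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(3), size=4)   # rows sum to 1
priors = np.array([0.5, 0.3, 0.2])
print(scaled_log_likelihoods(post, priors).shape)  # (4, 3)
```

Dividing by the priors compensates for the class imbalance the CNN sees during frame-wise training, so frequent states do not dominate decoding.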
4 Data Sets
| | PHOENIX 2012 Train | PHOENIX 2012 Test | PHOENIX 2014 Train | PHOENIX 2014 Dev | PHOENIX 2014 Test | SIGNUM Train | SIGNUM Test |
|---|---|---|---|---|---|---|---|
| # Signers | 1 | 1 | 9 | 9 | 9 | 1 | 1 |
| Hours | 0.51 | 0.07 | 8.88 | 0.84 | 0.99 | 3.86 | 1.06 |
| Frames | 46282 | 6751 | 799006 | 75186 | 89472 | 416620 | 114230 |
| \(\sim \) Still frames | – | – | 20% | – | – | 38% | – |
| Running words | 3309 | 487 | 65227 | 5540 | 6504 | 11127 | 2805 |
| \(\varnothing \) Frames/word | 14.0 | – | 9.8 | – | – | 23.2 | – |
| Vocabulary | 266 | – | 1080 | – | – | 465 | – |
| OOVs running | – | 8 | – | 28 | 35 | – | 9 |
| OOVs [%] | – | 1.6 | – | 0.5 | 0.5 | – | 0.3 |
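The derived rows of the table can be sanity-checked from the raw counts. One assumption, which the table does not state explicitly but which matches the numbers, is that the average frames-per-word figure is computed on the non-still frames only:

```python
# Sanity-check the derived statistics in the data-set table.
# Assumption: \varnothing frames/word counts non-still frames only.

def frames_per_word(frames, still_fraction, running_words):
    """Average number of moving frames per running word."""
    return frames * (1.0 - still_fraction) / running_words

def oov_rate(oov_running, running_words):
    """Out-of-vocabulary rate in percent."""
    return 100.0 * oov_running / running_words

# PHOENIX 2014 train: 799006 frames, ~20% still, 65227 running words
print(round(frames_per_word(799_006, 0.20, 65_227), 1))  # 9.8
# SIGNUM train: 416620 frames, ~38% still, 11127 running words
print(round(frames_per_word(416_620, 0.38, 11_127), 1))  # 23.2
# PHOENIX 2014 dev: 28 OOVs among 5540 running words
print(round(oov_rate(28, 5_540), 1))  # 0.5
```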
5 Implementation Details
5.1 Image Preprocessing
5.2 Convolutional Neural Network Training
5.3 CNN Inference
5.4 Continuous Sign Language Recognition
| Type of pruning | PHOENIX 2012 | PHOENIX 2014 | SIGNUM |
|---|---|---|---|
| Visual threshold | None | 2000 | 2000 |
| Visual histogram | None | 20,000 | 20,000 |
| LM threshold | None | 4000 | 4000 |
| LM histogram | None | 10,000 | 10,000 |
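The two pruning types in the table correspond to standard beam-search pruning in HMM decoding: threshold (beam) pruning discards hypotheses scoring worse than the current best by more than a fixed margin, and histogram pruning caps the number of surviving hypotheses per frame. A minimal sketch under those definitions; function and variable names are illustrative, not taken from the recognizer used here:

```python
def prune(hyps, threshold, histogram_limit):
    """Apply threshold and histogram pruning to a hypothesis list.

    hyps: list of (label, score) pairs, lower score = better
          (e.g. negative log probability).
    """
    if not hyps:
        return hyps
    best = min(score for _, score in hyps)
    # Threshold pruning: keep hypotheses within `threshold` of the best.
    kept = [(h, s) for h, s in hyps if s <= best + threshold]
    # Histogram pruning: keep at most `histogram_limit` best hypotheses.
    kept.sort(key=lambda pair: pair[1])
    return kept[:histogram_limit]

hyps = [("a", 10.0), ("b", 11.5), ("c", 25.0), ("d", 10.2)]
print(prune(hyps, threshold=5.0, histogram_limit=2))
# [('a', 10.0), ('d', 10.2)]
```

"None" for PHOENIX 2012 means the search space is small enough to decode without pruning.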
5.5 Computational Requirements
6 Experiments
| NN structure | Input size [px] | # Layers | PHOENIX 2012 # params (last fc) [\(10^6\)] | PHOENIX 2014 # params (last fc) [\(10^6\)] | SIGNUM # params (last fc) [\(10^6\)] |
|---|---|---|---|---|---|
| LeNet | \(227 \times 227\) | 4 | 73.6 (0.7) | 74.7 (1.8) | 73.6 (0.6) |
| AlexNet | \(227 \times 227\) | 8 | 62.7 (5.9) | 72.0 (15.1) | 62.4 (5.5) |
| GoogLeNet | \(224 \times 224\) | 22 | 14.7 (1.4) | 21.6 (3.7) | 14.5 (1.4) |
6.1 Effect of CNN Structure
Word error rates (WER, %):

| CNN structure | PHOENIX 2012 Test | PHOENIX 2014 Dev | PHOENIX 2014 Test | SIGNUM Test |
|---|---|---|---|---|
| LeNet (\(227 \times 227\) input) | 47.8 | 69.5 | 68.4 | 17.9 |
| AlexNet | 51.5 | 45.5 | 44.5 | 10.6 |
| GoogLeNet | 34.1 | 43.1 | 42.7 | 8.9 |
Word error rates (WER, %):

| AlexNet | PHOENIX 2012 Test | PHOENIX 2014 Dev | PHOENIX 2014 Test | SIGNUM Test |
|---|---|---|---|---|
| Randomly initialised | 51.5 | 45.5 | 44.5 | 10.6 |
| Fine-tuned | 39.2 | 42.2 | 41.1 | 8.7 |
6.2 Effect of Finetuning
Word error rates (WER, %):

| GoogLeNet | PHOENIX 2012 Test | PHOENIX 2014 Dev | PHOENIX 2014 Test | SIGNUM Test |
|---|---|---|---|---|
| Randomly initialised | 34.1 | 43.1 | 42.7 | 8.9 |
| Fine-tuned | 30.0 | 38.3 | 38.8 | 7.4 |
6.3 Hybrid Compared to Tandem Modelling
6.4 Effect of Hidden States
HMM structure on PHOENIX 2014:

| States \(\times \) Repetitions | Total states | Parameters [\(10^6\)] |
|---|---|---|
| 1 \(\times \) 2 | 1232 | 14.1 |
| 2 \(\times \) 2 | 2463 | 17.9 |
| 3 \(\times \) 2 | 3694 | 21.7 |
| 4 \(\times \) 2 | 4925 | 25.4 |
| 5 \(\times \) 2 | 6156 | 29.2 |
| 6 \(\times \) 2 | 7387 | 33.0 |
| 7 \(\times \) 2 | 8618 | 36.8 |
| 8 \(\times \) 2 | 9849 | 40.6 |
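The parameter counts in the table grow linearly with the total number of HMM states, at roughly 3072 parameters per state. That slope is consistent with GoogLeNet's three softmax classifiers (one main, two auxiliary), each fed by a 1024-dimensional layer whose output weights scale with the number of target states; this is an inference from the table's numbers, not a figure stated in the text:

```python
# Reconstruct the table's parameter counts from a linear model:
# fixed backbone parameters plus an assumed 3 * 1024 output weights
# per HMM state (three 1024-dim classifier heads in GoogLeNet).

PER_STATE = 3 * 1024                      # assumed weights per added state
BASE = int(14.1e6) - PER_STATE * 1232     # fixed parameters, fit to row 1

def total_params(states):
    """Estimated total CNN parameters for a given number of HMM states."""
    return BASE + PER_STATE * states

for states in (1232, 2463, 4925, 9849):
    print(states, round(total_params(states) / 1e6, 1))
# 1232 14.1 / 2463 17.9 / 4925 25.4 / 9849 40.6 -- matching the table
```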
6.5 Effortless Ensemble of Models
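One plausible reading of an "effortless" ensemble in this setup: average the frame-wise state posteriors of several independently obtained CNNs (e.g. snapshots from different training iterations) before HMM decoding, with no joint training required. A minimal sketch under that assumption; names are illustrative:

```python
import numpy as np

def ensemble_posteriors(model_outputs):
    """Average per-frame posteriors from several models.

    model_outputs: list of (T, S) posterior matrices, one per model.
    Returns a (T, S) matrix renormalised so each frame sums to 1.
    """
    avg = np.mean(np.stack(model_outputs), axis=0)
    return avg / avg.sum(axis=1, keepdims=True)

# Two dummy models, 2 frames, 2 states.
a = np.array([[0.7, 0.3], [0.2, 0.8]])
b = np.array([[0.5, 0.5], [0.4, 0.6]])
print(ensemble_posteriors([a, b]))  # rows [0.6 0.4] and [0.3 0.7]
```

The averaged posteriors then feed the same prior-scaling and HMM decoding pipeline as a single model's outputs.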
Word error rates (WER, %):

| Method | Model | PHOENIX 2012 Test | PHOENIX 2014 Dev | PHOENIX 2014 Test | SIGNUM Test |
|---|---|---|---|---|---|
| von Agris et al. (2008a) | GMM-HMM | – | – | – | 12.7 |
| Gweth et al. (2012) | GMM-HMM (MLP feat.) | – | – | – | 11.9 |
| Forster et al. (2013) | GMM-HMM | 41.9 | – | – | 10.7 |
| Forster et al. (2013a) | GMM-HMM | 38.6 | – | – | 10.7 |
| Koller et al. (2015a) | GMM-HMM | 34.3 | 57.3 | 55.6 | 10.0 |
| Koller et al. (2015a) | GMM-HMM (CMLLR) | – | 55.0 | 53.0 | – |
| Koller et al. (2016a) | GMM-HMM (CNN feat.) | 31.2 | 47.1 | 45.1 | 7.6 |
| Koller et al. (2016b) | Tandem CNN-HMM | 31.0 | 39.9 | 38.8 | 10.0 |
| Camgoz et al. (2017) | CNN-LSTM with CTC | – | 40.8 | 40.7 | – |
| Cui et al. (2017) | CNN-LSTM with CTC | – | 39.4 | 38.7 | – |
| Proposed approach | Hybrid CNN-HMM | 30.0 | 31.6 | 32.5 | 7.4 |