1 Introduction
2 Related Work
2.1 Visual Feature Selection and Blending
2.2 Synthesis Based on Hidden Markov Models
2.3 Synthesis Based on Deep Neural Networks
2.4 GAN-Based Video Synthesis
3 Speech-Driven Facial Synthesis
3.1 Generator
3.1.1 Identity Encoder
3.1.2 Content Encoder
3.1.3 Noise Generator
3.1.4 Frame Decoder
3.2 Discriminators
3.2.1 Frame Discriminator
3.2.2 Sequence Discriminator
3.2.3 Synchronization Discriminator
3.3 Training
4 Datasets
Dataset | Test subjects |
---|---|
GRID | 2, 4, 11, 13, 15, 18, 19, 25, 31, 33 |
TCD TIMIT | 8, 9, 15, 18, 25, 28, 33, 41, 55, 56 |
CREMA-D | 15, 20, 21, 30, 33, 52, 62, 81, 82, 89 |
Dataset | Samples/hours (train) | Samples/hours (validation) | Samples/hours (test) |
---|---|---|---|
GRID | 31639/26.4 | 6999/5.8 | 9976/8.31 |
TCD | 8218/9.1 | 686/0.8 | 977/1.2 |
CREMA | 11594/9.7 | 819/0.7 | 820/0.68 |
LRW | 112658/36.3 | 5870/1.9 | 5980/1.9 |
5 Metrics
- Reconstruction Metrics We use common reconstruction metrics such as the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index to evaluate the generated videos. When interpreting these metrics it is important to take into account that they penalize videos for any facial expression that does not match the one in the ground truth video.
- Sharpness Metrics Frame sharpness is evaluated using the cumulative probability blur detection (CPBD) measure (Narvekar and Karam 2009), which determines blur based on the presence of edges in the image. For this metric, as well as for the reconstruction metrics, larger values imply better quality.
- Content Metrics The content of the videos is evaluated based on how well the video captures the identity of the target and on the accuracy of the spoken words. We verify the identity of the speaker using the average content distance (ACD) (Tulyakov et al. 2018), which measures the average Euclidean distance between the representation of the still image, obtained using OpenFace (Amos et al. 2016), and the representations of the generated frames; a minimal sketch of this computation is given after this list. The accuracy of the spoken message is measured using the word error rate (WER) achieved by a pre-trained lip-reading model. We use the LipNet model (Assael et al. 2016), which surpasses the performance of human lip-readers on the GRID dataset. For both content metrics lower values indicate better accuracy.
- Audio-Visual Synchrony Metrics Synchrony is quantified using the methods proposed in Chung and Zisserman (2016b). In that work the authors propose the SyncNet network, which calculates the Euclidean distance between the audio and video encodings of short (0.2 s) sections of the video. The audio-visual offset is obtained with a sliding-window approach that finds where this distance is minimized. The offset is measured in frames and is positive when the audio leads the video. For audio and video pairs that correspond to the same content, the distance increases on either side of the point where the minimum occurs, whereas for uncorrelated audio and video the distance is expected to remain stable. Based on this fluctuation, Chung and Zisserman (2016b) further propose using the difference between the minimum and the median of the Euclidean distances as an audio-visual (AV) confidence score that quantifies the audio-visual correlation. Higher scores indicate stronger correlation, whereas confidence scores smaller than 0.5 indicate that the audio and video are uncorrelated. A sketch of the offset and confidence computation is shown after this list.
- Expression Evaluation We investigate the generation of spontaneous expressions since it is one of the main factors that affect our perception of how natural a video looks. According to the study presented in Bentivoglio et al. (1997), the average person blinks 17 times per minute (0.28 blinks/s), although this rate increases during conversation and decreases when reading. We use a blink detector based on the one proposed in Soukupova and Cech (2016), which relies on the eye aspect ratio (EAR) to detect the occurrence of blinks in videos. The EAR is calculated per frame according to the formula shown in Eq. (8), using facial landmarks \(p_1\) to \(p_6\) shown in Fig. 7. The blink detector first calculates the EAR signal for the entire video and then identifies blink locations by detecting sharp drops in that signal; a corresponding sketch is included after this list.
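To make the ACD computation concrete, here is a minimal sketch assuming the OpenFace embedding of the reference still image and the embeddings of the generated frames have already been extracted; the function and variable names are ours and purely illustrative.

```python
import numpy as np

def average_content_distance(still_embedding, frame_embeddings):
    """ACD: mean Euclidean distance between the OpenFace embedding of the
    reference still image (shape (D,)) and the embeddings of the generated
    frames (shape (T, D))."""
    dists = np.linalg.norm(frame_embeddings - still_embedding, axis=1)
    return dists.mean()
```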
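The offset and confidence computation from the synchrony item can be illustrated as follows. This sketch assumes SyncNet embeddings for consecutive 0.2 s windows of audio and video have been precomputed; the offset search range, variable names, and sign convention are illustrative assumptions rather than the exact SyncNet pipeline.

```python
import numpy as np

def av_offset_and_confidence(audio_emb, video_emb, max_offset=10):
    """Estimate the audio-visual offset (in frames) and the AV confidence.

    audio_emb, video_emb: (T, D) SyncNet embeddings of aligned 0.2 s
    windows (assumed precomputed). Confidence is the difference between
    the median and the minimum of the per-offset mean distances."""
    offsets = np.arange(-max_offset, max_offset + 1)
    mean_dists = []
    for o in offsets:
        # Pair audio window i + o with video window i (sliding window).
        if o >= 0:
            a, v = audio_emb[o:], video_emb[:len(video_emb) - o]
        else:
            a, v = audio_emb[:o], video_emb[-o:]
        n = min(len(a), len(v))
        if n == 0:                        # not enough overlap at this offset
            mean_dists.append(np.inf)
            continue
        mean_dists.append(np.linalg.norm(a[:n] - v[:n], axis=1).mean())
    mean_dists = np.array(mean_dists)
    best = int(mean_dists.argmin())
    offset = int(offsets[best])                       # offset in frames
    confidence = float(np.median(mean_dists) - mean_dists[best])
    return offset, confidence
```

Following the thresholds reported above, a confidence below 0.5 would then be read as uncorrelated audio and video.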
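Finally, the EAR-based blink detection can be sketched as below. The EAR formula follows Soukupova and Cech (2016), as referenced by Eq. (8); the landmark ordering and the relative drop threshold used to flag a blink are assumptions made for illustration, not the exact settings of our detector.

```python
import numpy as np

def eye_aspect_ratio(p):
    """EAR of one eye from six landmarks p[0]..p[5] (p1..p6 in Eq. 8),
    ordered as in Soukupova and Cech (2016): p1 and p4 are the horizontal
    eye corners, (p2, p6) and (p3, p5) are the vertical landmark pairs."""
    vertical = np.linalg.norm(p[1] - p[5]) + np.linalg.norm(p[2] - p[4])
    horizontal = np.linalg.norm(p[0] - p[3])
    return vertical / (2.0 * horizontal)

def detect_blinks(ear_signal, rel_drop=0.25):
    """Flag frames whose EAR falls sharply below the video's median eye
    openness and count each contiguous run as one blink (threshold assumed)."""
    ear = np.asarray(ear_signal, dtype=float)
    closed = ear < (1.0 - rel_drop) * np.median(ear)
    prev = np.concatenate(([False], closed[:-1]))
    starts = np.flatnonzero(closed & ~prev)   # first frame of each blink
    return len(starts), starts
```

Counting each run of low-EAR frames once yields the number of blinks per video, from which blink rate and duration statistics such as those in the tables below can be derived.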
Accuracy | Precision | Recall | MAE (Start) | MAE (End) |
---|---|---|---|---|
80% | 100% | 80% | 1.4 | 2.1 |
Statistic | GRID | TIMIT | CREMA | LRW |
---|---|---|---|---|
Blinks/s | 0.39 | 0.28 | 0.26 | 0.53 |
Median duration (s) | 0.4 | 0.2 | 0.36 | 0.32 |
Method | PSNR | SSIM | CPBD | ACD | WER (%) | AV Offset | AV Confidence | Blinks/s | Blink dur. (s) |
---|---|---|---|---|---|---|---|---|---|
GT | \(\infty\) | 1.00 | 0.276 | \(0.98 \cdot 10^{-4}\) | 21.76 | 1 | 7.0 | 0.39 | 0.41 |
w/o \(\mathcal{L}_{adv}\) | 28.467 | 0.855 | 0.210 | \(1.92 \cdot 10^{-4}\) | 26.6 | 1 | 7.1 | 0.02 | 0.16 |
w/o \(\mathcal{L}_{L_1}\) | 26.516 | 0.805 | 0.270 | \(\mathbf{1.03} \cdot 10^{-4}\) | 56.4 | 1 | 6.3 | 0.41 | 0.32 |
w/o \(\mathcal{L}^{img}_{adv}\) | 26.474 | 0.804 | 0.252 | \(1.96 \cdot 10^{-4}\) | 23.2 | 1 | 7.3 | 0.16 | 0.28 |
w/o \(\mathcal{L}^{sync}_{adv}\) | 27.548 | 0.829 | 0.263 | \(1.19 \cdot 10^{-4}\) | 27.8 | 1 | 7.2 | 0.21 | 0.32 |
w/o \(\mathcal{L}^{seq}_{adv}\) | 27.590 | 0.829 | 0.259 | \(1.13 \cdot 10^{-4}\) | 27.0 | 1 | 7.4 | 0.03 | 0.16 |
Full Model | 27.100 | 0.818 | 0.268 | \(1.47 \cdot 10^{-4}\) | 23.1 | 1 | 7.4 | 0.45 | 0.36 |
6 Experiments
6.1 Ablation Study
6.2 Qualitative Results
6.3 Quantitative Results
6.4 User Study
Dataset | Method | PSNR | SSIM | CPBD | ACD | WER | AV Offset | AV Confidence | Blinks/s | Blink dur. (s) |
---|---|---|---|---|---|---|---|---|---|---|
GRID | Proposed model | 27.100 | 0.818 | 0.268 | \(1.47 \cdot 10^{-4}\) | 23.1% | 1 | 7.4 | 0.45 | 0.36 |
 | Baseline | 27.023 | 0.811 | 0.249 | \(\mathbf{1.42} \cdot 10^{-4}\) | 36.4% | 2 | 6.5 | 0.04 | 0.29 |
 | Speech2Vid | 22.662 | 0.720 | 0.255 | \(1.48 \cdot 10^{-4}\) | 58.2% | 1 | 5.3 | 0.00 | 0.00 |
TCD | Proposed model | 24.243 | 0.730 | 0.308 | \(1.76 \cdot 10^{-4}\) | N/A | 1 | 5.5 | 0.19 | 0.33 |
 | Baseline | 24.187 | 0.711 | 0.231 | \(1.77 \cdot 10^{-4}\) | N/A | 8 | 1.4 | 0.08 | 0.13 |
 | Speech2Vid | 20.305 | 0.658 | 0.211 | \(1.81 \cdot 10^{-4}\) | N/A | 1 | 4.6 | 0.00 | 0.00 |
CREMA | Proposed model | 23.565 | 0.700 | 0.216 | \(1.40 \cdot 10^{-4}\) | N/A | 2 | 5.5 | 0.25 | 0.26 |
 | Baseline | 22.933 | 0.685 | 0.212 | \(1.65 \cdot 10^{-4}\) | N/A | 2 | 5.2 | 0.11 | 0.13 |
 | Speech2Vid | 22.190 | 0.700 | 0.217 | \(1.73 \cdot 10^{-4}\) | N/A | 1 | 4.7 | 0.00 | 0.00 |
LRW | Proposed model | 23.077 | 0.757 | 0.260 | \(1.53 \cdot 10^{-4}\) | N/A | 1 | 7.4 | 0.52 | 0.28 |
 | Baseline | 22.884 | 0.746 | 0.218 | \(1.02 \cdot 10^{-4}\) | N/A | 2 | 6.0 | 0.42 | 0.13 |
 | Speech2Vid | 22.302 | 0.709 | 0.199 | \(2.61 \cdot 10^{-4}\) | N/A | 2 | 6.2 | 0.00 | 0.00 |
 | ATVGNet | 20.107 | 0.743 | 0.189 | \(2.14 \cdot 10^{-4}\) | N/A | 2 | 7.0 | 0.00 | 0.00 |