Skip to main content
Top

2020 | OriginalPaper | Chapter

SpotFast Networks with Memory Augmented Lateral Transformers for Lipreading

Author : Peratham Wiriyathammabhum

Published in: Neural Information Processing

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

This paper presents a novel deep learning architecture for word-level lipreading. Previous works suggest a potential for incorporating a pretrained deep 3D Convolutional Neural Networks as a front-end feature extractor. We introduce SpotFast networks, a variant of the state-of-the-art SlowFast networks for action recognition, which utilizes a temporal window as a spot pathway and all frames as a fast pathway. The spot pathway uses word boundaries information while the fast pathway implicitly models other contexts. Both pathways are fused with dual temporal convolutions, which speed up training. We further incorporate memory augmented lateral transformers to learn sequential features for classification. We evaluate the proposed model on the LRW dataset. The experiments show that our proposed model outperforms various state-of-the-art models, and incorporating the memory augmented lateral transformers makes a \(3.7\%\) improvement to the SpotFast networks and \(16.1\%\) compared to finetuning the original SlowFast networks. The temporal window utilizing word boundaries helps improve the performance up to \(12.1\%\) by eliminating visual silences from coarticulations.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Afouras, T., Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2018) Afouras, T., Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2018)
2.
go back to reference Bear, H.L., Harvey, R.: Phoneme-to-viseme mappings: the good, the bad, and the ugly. Speech Commun. 95, 40–67 (2017)CrossRef Bear, H.L., Harvey, R.: Phoneme-to-viseme mappings: the good, the bad, and the ugly. Speech Commun. 95, 40–67 (2017)CrossRef
3.
go back to reference Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
4.
go back to reference Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3444–3453. IEEE (2017) Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3444–3453. IEEE (2017)
6.
go back to reference Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 6202–6211 (2019) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 6202–6211 (2019)
7.
go back to reference Lample, G., Sablayrolles, A., Ranzato, M., Denoyer, L., Jégou, H.: Large memory layers with product keys. In: Advances in Neural Information Processing Systems, pp. 8546–8557 (2019) Lample, G., Sablayrolles, A., Ranzato, M., Denoyer, L., Jégou, H.: Large memory layers with product keys. In: Advances in Neural Information Processing Systems, pp. 8546–8557 (2019)
8.
go back to reference Mann, V.A., Repp, B.H.: Influence of vocalic context on perception of the [\(\int \)]-[s] distinction. Percept. Psychophys. 28(3), 213–228 (1980)CrossRef Mann, V.A., Repp, B.H.: Influence of vocalic context on perception of the [\(\int \)]-[s] distinction. Percept. Psychophys. 28(3), 213–228 (1980)CrossRef
9.
go back to reference McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264(5588), 746 (1976)CrossRef McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264(5588), 746 (1976)CrossRef
10.
go back to reference Petridis, S., Stafylakis, T., Ma, P., Cai, F., Tzimiropoulos, G., Pantic, M.: End-to-end audiovisual speech recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6548–6552. IEEE (2018) Petridis, S., Stafylakis, T., Ma, P., Cai, F., Tzimiropoulos, G., Pantic, M.: End-to-end audiovisual speech recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6548–6552. IEEE (2018)
11.
go back to reference Stafylakis, T., Khan, M.H., Tzimiropoulos, G.: Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs. Comput. Vis. Image Underst. 176, 22–32 (2018)CrossRef Stafylakis, T., Khan, M.H., Tzimiropoulos, G.: Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs. Comput. Vis. Image Underst. 176, 22–32 (2018)CrossRef
12.
go back to reference Stafylakis, T., Tzimiropoulos, G.: Combining residual networks with LSTMs for lipreading. Proc. Interspeech 2017, 3652–3656 (2017)CrossRef Stafylakis, T., Tzimiropoulos, G.: Combining residual networks with LSTMs for lipreading. Proc. Interspeech 2017, 3652–3656 (2017)CrossRef
13.
go back to reference Taylor, S.L., Mahler, M., Theobald, B.J., Matthews, I.: Dynamic units of visual speech. In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 275–284. Eurographics Association (2012) Taylor, S.L., Mahler, M., Theobald, B.J., Matthews, I.: Dynamic units of visual speech. In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 275–284. Eurographics Association (2012)
14.
go back to reference Thangthai, K.: Computer lipreading via hybrid deep neural network hidden Markov models. Ph.D. thesis, University of East Anglia (2018) Thangthai, K.: Computer lipreading via hybrid deep neural network hidden Markov models. Ph.D. thesis, University of East Anglia (2018)
15.
go back to reference Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015) Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)
16.
go back to reference Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017) Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
18.
go back to reference Zhang, X., Cheng, F., Wang, S.: Spatio-temporal fusion based convolutional sequence learning for lip reading. In: The IEEE International Conference on Computer Vision (ICCV), October 2019 Zhang, X., Cheng, F., Wang, S.: Spatio-temporal fusion based convolutional sequence learning for lip reading. In: The IEEE International Conference on Computer Vision (ICCV), October 2019
Metadata
Title
SpotFast Networks with Memory Augmented Lateral Transformers for Lipreading
Author
Peratham Wiriyathammabhum
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-63820-7_63

Premium Partner