2017 | OriginalPaper | Chapter

Lip Reading in the Wild

Authors: Joon Son Chung, Andrew Zisserman

Published in: Computer Vision – ACCV 2016

Publisher: Springer International Publishing

Abstract

Our aim is to recognise the words being spoken by a talking face, given only the video but not the audio. Existing work in this area has focused on recognising a small number of utterances in controlled environments (e.g. digits and the letters of the alphabet), partly due to the shortage of suitable datasets.
We make two novel contributions: first, we develop a pipeline for fully automated, large-scale data collection from TV broadcasts, with which we have generated a dataset of over a million word instances spoken by more than a thousand different people; second, we develop CNN architectures that can effectively learn and recognise hundreds of words from this large-scale dataset.
We also demonstrate recognition performance that exceeds the state of the art on a standard public benchmark dataset.
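
To make the second contribution concrete, below is a minimal sketch of a word-level lip-reading classifier: a CNN that maps a fixed-length clip of mouth-region frames to a score for each word in a closed vocabulary. It assumes PyTorch, and all hyperparameters (25 grayscale frames, 112x112 mouth crops, 500 word classes) are illustrative placeholders, not the architecture proposed in the chapter.

    import torch
    import torch.nn as nn

    class WordCNN(nn.Module):
        """Sketch only: fixed-length mouth-region clip -> per-word logits.

        Hypothetical hyperparameters; not the chapter's architecture.
        """
        def __init__(self, num_frames=25, num_classes=500):
            super().__init__()
            # Early fusion: stack the input frames as channels, so the
            # first convolution sees short-term mouth motion directly.
            self.features = nn.Sequential(
                nn.Conv2d(num_frames, 96, kernel_size=7, stride=2), nn.ReLU(),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),   # global average pool over space
            )
            self.classifier = nn.Linear(512, num_classes)

        def forward(self, clip):
            # clip: (batch, num_frames, height, width), grayscale crops
            h = self.features(clip).flatten(1)
            return self.classifier(h)      # one logit per word class

    model = WordCNN()
    logits = model(torch.randn(2, 25, 112, 112))   # two dummy clips
    print(logits.shape)                            # torch.Size([2, 500])

Stacking frames as input channels ("early fusion") is only the simplest way to let a 2D CNN see motion; 3D convolutions over time, or separate per-frame towers whose features are fused later, are common alternatives for this kind of video classification.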

Metadata

Title: Lip Reading in the Wild
Authors: Joon Son Chung, Andrew Zisserman
Copyright Year: 2017
DOI: https://doi.org/10.1007/978-3-319-54184-6_6
