
2017 | Original Paper | Book Chapter

Out of Time: Automated Lip Sync in the Wild

Authors: Joon Son Chung, Andrew Zisserman

Published in: Computer Vision – ACCV 2016 Workshops

Publisher: Springer International Publishing


Abstract

The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video.
We propose a two-stream ConvNet architecture that enables the mapping between the sound and the mouth images to be trained end-to-end from unlabelled data. The trained network is used to determine the lip-sync error in a video.
We apply the network to two further tasks: active speaker detection and lip reading. On both tasks we set a new state of the art on standard benchmark datasets.
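The trained network determines the lip-sync error by scoring candidate audio-video offsets. As a hedged illustration of this offset-search idea (not the authors' implementation), the sketch below slides per-frame audio features against per-frame video features and picks the offset with the smallest mean embedding distance. The `feat` function here is a deterministic stand-in for the learned two-stream embeddings.

```python
import math

def feat(t):
    # Stand-in for a learned per-frame embedding; a real system would use
    # ConvNet features of the audio (e.g. MFCCs) and of mouth-region crops.
    return (float(t), float(t * t))

def dist(a, b):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sync_offset(audio, video, max_offset):
    """Return the audio offset (in frames) that best aligns the two
    streams: the offset minimising the mean audio-video distance."""
    best_mean, best_off = None, 0
    for off in range(-max_offset, max_offset + 1):
        dists = [dist(audio[t + off], video[t])
                 for t in range(len(video))
                 if 0 <= t + off < len(audio)]
        if dists:
            mean = sum(dists) / len(dists)
            if best_mean is None or mean < best_mean:
                best_mean, best_off = mean, off
    return best_off

# Synthetic example: the audio track lags the video by 3 frames.
video = [feat(t) for t in range(20)]
audio = [feat(t - 3) for t in range(20)]
print(sync_offset(audio, video, max_offset=5))  # → 3
```

In the paper the embeddings come from the trained two-stream ConvNet and the distance is averaged over many samples in a clip; this sketch only illustrates the search over candidate offsets.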


Metadata
Title: Out of Time: Automated Lip Sync in the Wild
Authors: Joon Son Chung, Andrew Zisserman
Copyright year: 2017
DOI: https://doi.org/10.1007/978-3-319-54427-4_19
