
2017 | Original Paper | Book Chapter

Out of Time: Automated Lip Sync in the Wild

Authors: Joon Son Chung, Andrew Zisserman

Published in: Computer Vision – ACCV 2016 Workshops

Publisher: Springer International Publishing


Abstract

The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video.
We propose a two-stream ConvNet architecture that enables the mapping between the sound and the mouth images to be trained end-to-end from unlabelled data. The trained network is used to determine the lip-sync error in a video.
We apply the network to two further tasks: active speaker detection and lip reading. On both tasks we set a new state of the art on standard benchmark datasets.
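The trained network determines the lip-sync error by scoring candidate audio-video offsets. As a hedged illustration of this offset-search idea (not the authors' implementation), the sketch below slides per-frame audio features against per-frame video features and picks the offset with the smallest mean embedding distance. The `feat` function here is a deterministic stand-in for the learned two-stream embeddings.

```python
import math

def feat(t):
    # Stand-in for a learned per-frame embedding; a real system would use
    # ConvNet features of the audio (e.g. MFCCs) and of mouth-region crops.
    return (float(t), float(t * t))

def dist(a, b):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sync_offset(audio, video, max_offset):
    """Return the audio offset (in frames) that best aligns the two
    streams: the offset minimising the mean audio-video distance."""
    best_mean, best_off = None, 0
    for off in range(-max_offset, max_offset + 1):
        dists = [dist(audio[t + off], video[t])
                 for t in range(len(video))
                 if 0 <= t + off < len(audio)]
        if dists:
            mean = sum(dists) / len(dists)
            if best_mean is None or mean < best_mean:
                best_mean, best_off = mean, off
    return best_off

# Synthetic example: the audio track lags the video by 3 frames.
video = [feat(t) for t in range(20)]
audio = [feat(t - 3) for t in range(20)]
print(sync_offset(audio, video, max_offset=5))  # → 3
```

In the paper the embeddings come from the trained two-stream ConvNet and the distance is averaged over many samples in a clip; this sketch only illustrates the search over candidate offsets.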


Metadata
Title: Out of Time: Automated Lip Sync in the Wild
Authors: Joon Son Chung, Andrew Zisserman
Copyright year: 2017
DOI: https://doi.org/10.1007/978-3-319-54427-4_19
