Research Article | Public Access

VisemeNet: audio-driven animator-centric speech animation

Published: 30 July 2018

Abstract

We present a novel deep-learning-based approach to producing animator-centric speech motion curves that drive a JALI or standard FACS-based production face-rig directly from input audio. Our three-stage Long Short-Term Memory (LSTM) network architecture is motivated by psycho-linguistic insights: segmenting speech audio into a stream of phonetic groups is sufficient for viseme construction; speech styles such as mumbling or shouting are strongly correlated with the motion of facial landmarks; and animator style is encoded in viseme motion-curve profiles. Our contribution is an automatic, real-time, audio-driven lip-synchronization solution that integrates seamlessly into existing animation pipelines. We evaluate our results by cross-validation against ground-truth data, by animator critique and edits, by visual comparison to recent deep-learning lip-synchronization solutions, and by showing our approach to be resilient to diversity in speaker and language.
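
To make the staged decomposition above concrete, here is a minimal, hypothetical PyTorch sketch of a three-stage LSTM arrangement in the spirit of the abstract: one LSTM classifies per-frame audio features into phonetic groups, a second regresses facial-landmark motion (carrying speech style), and a third combines both streams into viseme motion-curve values for a face rig. All layer sizes, feature dimensions, and counts of phoneme groups, landmarks, and visemes are illustrative assumptions, not the authors' implementation.

# A rough sketch only, assuming audio is already converted to per-frame
# feature vectors (e.g. MFCC-like features). Dimensions are hypothetical.
import torch
import torch.nn as nn

class ThreeStageLipSyncSketch(nn.Module):
    def __init__(self, audio_dim=65, n_phoneme_groups=20,
                 n_landmarks=76, n_visemes=34, hidden=256):
        super().__init__()
        # Stage 1: segment audio frames into a stream of phonetic groups.
        self.phoneme_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.phoneme_head = nn.Linear(hidden, n_phoneme_groups)
        # Stage 2: predict facial-landmark motion, which reflects speech
        # style (e.g. mumbling vs. shouting) in the same audio.
        self.landmark_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.landmark_head = nn.Linear(hidden, n_landmarks)
        # Stage 3: fuse both streams into viseme motion-curve values that
        # could drive rig controls.
        self.viseme_lstm = nn.LSTM(n_phoneme_groups + n_landmarks, hidden,
                                   batch_first=True)
        self.viseme_head = nn.Linear(hidden, n_visemes)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, audio_dim)
        p, _ = self.phoneme_lstm(audio_feats)
        phoneme_logits = self.phoneme_head(p)
        l, _ = self.landmark_lstm(audio_feats)
        landmarks = self.landmark_head(l)
        fused = torch.cat([phoneme_logits.softmax(dim=-1), landmarks], dim=-1)
        v, _ = self.viseme_lstm(fused)
        viseme_curves = torch.sigmoid(self.viseme_head(v))
        return phoneme_logits, landmarks, viseme_curves

# Usage on dummy features: one clip of 100 audio frames.
if __name__ == "__main__":
    net = ThreeStageLipSyncSketch()
    feats = torch.randn(1, 100, 65)
    _, _, curves = net(feats)
    print(curves.shape)  # torch.Size([1, 100, 34])

One motivation for such a staged design, as the abstract suggests, is that phonetic-group segmentation and landmark motion can be supervised separately before the final stage maps them to animator-editable motion curves.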


Supplemental Material

161-171.mp4 (mp4, 98.2 MB)
a161-zhou.mp4 (mp4, 212.5 MB)



      • Published in

        ACM Transactions on Graphics, Volume 37, Issue 4 (August 2018), 1670 pages
        ISSN: 0730-0301
        EISSN: 1557-7368
        DOI: 10.1145/3197517

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 30 July 2018
        • Published in ACM Transactions on Graphics (TOG), Volume 37, Issue 4


        Qualifiers

        • research-article
