Abstract
We present a novel deep-learning-based approach that produces animator-centric speech motion curves, directly from input audio, to drive a JALI or standard FACS-based production face rig. Our three-stage Long Short-Term Memory (LSTM) network architecture is motivated by psycho-linguistic insights: segmenting speech audio into a stream of phonetic groups is sufficient for viseme construction; speech styles such as mumbling or shouting are strongly correlated with the motion of facial landmarks; and animator style is encoded in viseme motion-curve profiles. Our contribution is an automatic, real-time solution for lip synchronization from audio that integrates seamlessly into existing animation pipelines. We evaluate our results by cross-validation against ground-truth data, animator critique and edits, visual comparison to recent deep-learning lip-synchronization solutions, and demonstrations that our approach is resilient to diversity in speaker and language.
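The three-stage architecture described above can be pictured as cascaded recurrent networks: one stage mapping per-frame audio features to phoneme-group activations, one mapping the same features to facial-landmark motion, and a final stage combining both into viseme motion-curve activations. The sketch below is a hypothetical, minimal illustration of such a cascade using a hand-rolled LSTM cell; all dimensions, stage wiring, and names are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell (Hochreiter & Schmidhuber 1997) with one stacked gate matrix."""
    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Stacked weights for the input, forget, cell, and output gates.
        self.W = rng.standard_normal((4 * hid_dim, in_dim + hid_dim)) * 0.1
        self.b = np.zeros(4 * hid_dim)
        self.hid_dim = hid_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

    def run(self, xs):
        """Run the cell over a (T, in_dim) sequence; return (T, hid_dim) hidden states."""
        h = np.zeros(self.hid_dim)
        c = np.zeros(self.hid_dim)
        out = []
        for x in xs:
            h, c = self.step(x, h, c)
            out.append(h)
        return np.stack(out)

def three_stage(audio_feats):
    """Hypothetical cascade: audio -> phoneme groups, audio -> landmark motion,
    then both fused -> per-frame viseme curve activations. Dimensions are arbitrary."""
    phoneme_net  = LSTMCell(in_dim=26, hid_dim=16, seed=1)  # stage 1
    landmark_net = LSTMCell(in_dim=26, hid_dim=16, seed=2)  # stage 2
    viseme_net   = LSTMCell(in_dim=32, hid_dim=8,  seed=3)  # stage 3
    pg = phoneme_net.run(audio_feats)                 # (T, 16)
    lm = landmark_net.run(audio_feats)                # (T, 16)
    fused = np.concatenate([pg, lm], axis=1)          # (T, 32)
    return viseme_net.run(fused)                      # (T, 8) viseme activations

# 50 frames of 26-dimensional audio features (e.g. MFCC-like), randomly generated here.
curves = three_stage(np.random.default_rng(0).standard_normal((50, 26)))
print(curves.shape)  # (50, 8)
```

The key structural point the sketch illustrates is that the final stage consumes the outputs of the first two rather than raw audio, so phonetic content and landmark-level style can jointly shape the output curves.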
References
- Robert Anderson, Björn Stenger, Vincent Wan, and Roberto Cipolla. 2013. Expressive Visual Text-to-Speech Using Active Appearance Models. In Proc. CVPR.
- Gérard Bailly. 1997. Learning to speak. Sensori-motor control of speech movements. Speech Communication 22, 2-3 (1997).
- Gérard Bailly, Pascal Perrier, and Eric Vatikiotis-Bateson. 2012. Audiovisual Speech Processing. Cambridge University Press.
- Christoph Bregler, Michele Covell, and Malcolm Slaney. 1997. Video Rewrite: Driving Visual Speech with Audio. In Proc. SIGGRAPH.
- Carlos Busso, Sungbok Lee, and Shrikanth S. Narayanan. 2007. Using neutral speech models for emotional speech analysis. In Proc. InterSpeech.
- Rich Caruana. 1997. Multi-task Learning. Machine Learning 28, 1 (1997).
- Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. EMNLP.
- Michael M. Cohen and Dominic W. Massaro. 1993. Modeling Coarticulation in Synthetic Visual Speech. Models and Techniques in Computer Animation 92 (1993).
- Martin Cooke, Jon Barker, Stuart Cunningham, and Xu Shao. 2006. An audio-visual corpus for speech perception and automatic speech recognition. J. Acoustical Society of America 120, 5 (2006).
- Steven B. Davis and Paul Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoustics, Speech and Signal Processing 28, 4 (1980).
- Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. 2016. JALI: An Animator-centric Viseme Model for Expressive Lip Synchronization. ACM Trans. Graphics 35, 4 (2016).
- Paul Ekman and Wallace V. Friesen. 1978. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press.
- Tony Ezzat, Gadi Geiger, and Tomaso Poggio. 2002. Trainable Videorealistic Speech Animation. In Proc. SIGGRAPH.
- Faceware. 2017. Analyzer. http://facewaretech.com/products/software/analyzer.
- G. Fanelli, J. Gall, H. Romsdorfer, T. Weise, and L. Van Gool. 2010. A 3-D Audio-Visual Corpus of Affective Communication. IEEE Trans. Multimedia 12, 6 (2010).
- Cletus G. Fisher. 1968. Confusions among visually perceived consonants. J. Speech, Language, and Hearing Research 11, 4 (1968).
- Jennifer M. B. Fugate. 2013. Categorical perception for emotional faces. Emotion Review 5, 1 (2013).
- Google. 2017. Google Cloud Voice. https://cloud.google.com/speech.
- Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proc. ICML.
- Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 5 (2005).
- S. Haq and P. J. B. Jackson. 2009. Speaker-dependent audio-visual emotion recognition. In Proc. AVSP.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997).
- Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar Digitization from a Single Image for Real-time Rendering. ACM Trans. Graphics 36, 6 (2017).
- Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2016. Image-to-image Translation with Conditional Adversarial Networks. arXiv abs/1611.07004 (2016).
- Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven Facial Animation by Joint End-to-end Learning of Pose and Emotion. ACM Trans. Graphics 36, 4 (2017).
- Davis E. King. 2009. Dlib-ml: A Machine Learning Toolkit. J. of Machine Learning Research 10 (2009).
- Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. 2013. Realtime Facial Animation with On-the-Fly Correctives. ACM Trans. Graphics 32, 4 (2013).
- Alvin M. Liberman, Katherine Safford Harris, Howard S. Hoffman, and Belver C. Griffith. 1957. The discrimination of speech sounds within and across phoneme boundaries. J. Experimental Psychology 54, 5 (1957).
- Karl F. MacDorman, Robert D. Green, Chin-Chang Ho, and Clinton T. Koch. 2009. Too real for comfort? Uncanny responses to computer generated faces. Computers in Human Behavior 25, 3 (2009).
- Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. 2017. Montreal Forced Aligner: trainable text-speech alignment using Kaldi. In Proc. Interspeech.
- Christopher Olah. 2015. Understanding LSTM Networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs.
- Kuldip K. Paliwal. 1998. Spectral subband centroid features for speech recognition. In Proc. ICASSP.
- Roger Blanco i Ribera, Eduard Zell, J. P. Lewis, Junyong Noh, and Mario Botsch. 2017. Facial Retargeting with Automatic Range of Motion Alignment. ACM Trans. Graphics 36, 4 (2017).
- Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing Obama: Learning Lip Sync from Audio. ACM Trans. Graphics 36, 4 (2017).
- Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. 2017. A Deep Learning Approach for Generalized Speech Animation. ACM Trans. Graphics 36, 4 (2017).
- Sarah L. Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. 2012. Dynamic Units of Visual Speech. In Proc. SCA.
- Lijuan Wang, Wei Han, and Frank K. Soong. 2012. High Quality Lip-Sync Animation for 3D Photo-Realistic Talking Head. In Proc. ICASSP.
- Wenwu Wang. 2010. Machine Audition: Principles, Algorithms and Systems. IGI Global.
- Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. 2011. Realtime Performance-based Facial Animation. ACM Trans. Graphics 30, 4 (2011).
- Lance Williams. 1990. Performance-driven Facial Animation. In Proc. SIGGRAPH.