ABSTRACT
The results reported in this article are part of a larger project aimed at perceptually realistic animation of three-dimensional human faces driven by speech, including each speaker's individual nuances. We describe the audiovisual system developed to learn the spatio-temporal relationship between speech acoustics and facial animation, covering video and speech processing, pattern analysis, and MPEG-4 compliant facial animation for a given speaker. In particular, we propose a perceptual transformation of the speech spectral envelope, which is shown to capture the dynamics of articulatory movements. An efficient nearest-neighbor algorithm then predicts novel articulatory trajectories from these speech dynamics. The results are very promising and suggest a new way to model the synthetic lip motion of a given speaker driven by his or her speech; they also provide clues toward more general cross-speaker realistic animation.
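The nearest-neighbor prediction mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, the use of plain Euclidean distance, and the toy corpus are all assumptions; in the paper the features come from a perceptual transformation of the spectral envelope and the targets are MPEG-4 facial animation parameters.

```python
# Hypothetical sketch: map acoustic feature frames to facial animation
# parameters (FAPs) by nearest-neighbor lookup in a training corpus.
# Feature extraction and distance metric are illustrative assumptions.
import numpy as np

def nearest_neighbor_predict(train_feats, train_faps, query_feats):
    """For each query frame, return the FAP vector of the closest
    training frame (Euclidean distance in feature space)."""
    preds = []
    for q in query_feats:
        d = np.linalg.norm(train_feats - q, axis=1)  # distance to every stored frame
        preds.append(train_faps[np.argmin(d)])       # reuse its articulatory parameters
    return np.array(preds)

# Toy corpus: 100 frames of 12-dim spectral features paired with 6 FAPs.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(100, 12))
train_faps = rng.normal(size=(100, 6))

# Queries that are tiny perturbations of stored frames should recover
# the corresponding stored parameter vectors.
query = train_feats[3:5] + 0.01 * rng.normal(size=(2, 12))
pred = nearest_neighbor_predict(train_feats, train_faps, query)
```

In practice such a lookup is made efficient with a spatial index (e.g. a k-d tree) rather than a linear scan, and the predicted trajectories are typically smoothed over time before driving the face model.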
Index Terms
- Speech driven facial animation