ABSTRACT
We describe how to create with machine learning techniques a generative, speech animation module. A human subject is first recorded using a videocamera as he/she utters a predetermined speech corpus. After processing the corpus automatically, a visual speech module is learned from the data that is capable of synthesizing the human subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence which contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned.The two key contributions of this paper are 1) a variant of the multidimensional morphable model (MMM) to synthesize new, previously unseen mouth configurations from a small set of mouth image prototypes; and 2) a trajectory synthesis technique based on regularization, which is automatically trained from the recorded video corpus, and which is capable of synthesizing trajectories in MMM space corresponding to any desired utterance.
- BARRON, J. L., FLEET, D. J., AND BEAUCHEMIN, S. S. 1994. Performance of optical flow techniques. International Journal of Computer Vision 12, 1, 43-77. Google ScholarDigital Library
- BEIER, T., AND NEELY, S. 1992. Feature-based image metamorphosis. In Computer Graphics (Proceedings of ACM SIGGRAPH 92), vol. 26(2), ACM, 35-42. Google ScholarDigital Library
- BERGEN, J., ANANDAN, P., HANNA, K., AND HINGORANI, R. 1992. Hierarchical model-based motion estimation. In Proceedings of the European Conference on Computer Vision, 237-252. Google ScholarDigital Library
- BEYMER, D., AND POGGIO, T. 1996. Image representations for visual learning. Science 272, 1905-1909.Google ScholarCross Ref
- BEYMER, D., SHASHUA, A., AND POGGIO, T. 1993. Example based image analysis and synthesis. Tech. Rep. 1431, MIT AI Lab. Google ScholarDigital Library
- BISHOP, C. M. 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford. Google ScholarDigital Library
- BLACK, A., AND TAYLOR, P. 1997. The Festival Speech Synthesis System. University of Edinburgh.Google Scholar
- BLACK, M., FLEET, D., AND YACOOB, Y. 2000. Robustly estimating changes in image appearance. Computer Vision and Image Understanding, Special Issue on Robust Statistical Techniques in Image Understanding, 8-31. Google ScholarDigital Library
- BLANZ, V., AND VETTER, T. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of SIGGRAPH 2001, ACM Press / ACM SIGGRAPH, Los Angeles, A. Rockwood, Ed., Computer Graphics Proceedings, Annual Conference Series, ACM, 187-194. Google ScholarDigital Library
- BRAND, M., AND HERTZMANN, A. 2000. Style machines. In Proceedings of SIGGRAPH 2000, ACM Press / ACM SIGGRAPH, K. Akeley, Ed., Computer Graphics Proceedings, Annual Conference Series, ACM, 183-192. Google ScholarDigital Library
- BRAND, M. 1999. Voice puppetry. In Proceedings of SIGGRAPH 1999, ACM Press / ACM SIGGRAPH, Los Angeles, A. Rockwood, Ed., Computer Graphics Proceedings, Annual Conference Series, ACM, 21-28. Google ScholarDigital Library
- BREGLER, C., COVELL, M., AND SLANEY, M. 1997. Video rewrite: Driving visual speech with audio. In Proceedings of SIGGRAPH 1997, ACM Press / ACM SIGGRAPH, Los Angeles, CA, Computer Graphics Proceedings, Annual Conference Series, ACM, 353-360. Google ScholarDigital Library
- BROOKE, N., AND SCOTT, S. 1994. Computer graphics animations of talking faces based on stochastic models. In Intl. Symposium on Speech, Image Processing, and Neural Networks.Google ScholarCross Ref
- BURT, P. J., AND ADELSON, E. H. 1983. The laplacian pyramid as a compact image code. IEEE Trans. on Communications COM-31, 4 (Apr.), 532-540.Google ScholarCross Ref
- CHEN, S. E., AND WILLIAMS, L. 1993. View interpolation for image synthesis. In Proceedings of SIGGRAPH 1993, ACM Press / ACM SIGGRAPH, Anaheim, CA, Computer Graphics Proceedings, Annual Conference Series, ACM, 279-288. Google ScholarDigital Library
- COHEN, M. M., AND MASSARO, D. W. 1993. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation, N. M. Thalmann and D. Thalmann, Eds. Springer-Verlag, Tokyo, 139-156.Google Scholar
- COOTES, T. F., EDWARDS, G. J., AND TAYLOR, C. J. 1998. Active appearance models. In Proceedings of the European Conference on Computer Vision. Google ScholarDigital Library
- CORMEN, T. H., LEISERSON, C. E., AND RIVEST, R. L. 1989. Introduction to Algorithms. The MIT Press and McGraw-Hill Book Company. Google ScholarDigital Library
- COSATTO, E., AND GRAF, H. 1998. Sample-based synthesis of photorealistic talking heads. In Proceedings of Computer Animation '98, 103-110. Google ScholarDigital Library
- EZZAT, T., AND POGGIO, T. 2000. Visual speech synthesis by morphing visemes. International Journal of Computer Vision 38, 45-57. Google ScholarDigital Library
- GIROSI, F., JONES, M., AND POGGIO, T. 1993. Priors, stabilizers, and basis functions: From regularization to radial, tensor, and additive splines. Tech. Rep. 1430, MIT AI Lab, June. Google ScholarDigital Library
- GUENTER, B., GRIMM, C., WOOD, D., MALVAR, H., AND PIGHIN, F. 1998. Making faces. In Proceedings of SIGGRAPH 1998, ACM Press / ACM SIGGRAPH, Orlando, FL, Computer Graphics Proceedings, Annual Conference Series, ACM, 55-66. Google ScholarDigital Library
- HORN, B. K. P., AND SCHUNCK, B. G. 1981. Determining optical flow. Artificial Intelligence 17, 185-203.Google ScholarDigital Library
- HUANG, X., ALLEVA, F., HON, H.-W., HWANG, M.-Y., LEE, K.-F., AND ROSENFELD, R. 1993. The SPHINX-II speech recognition system: an overview (http://sourceforge.net/projects/cmusphinx/). Computer Speech and Language 7, 2, 137-148.Google ScholarCross Ref
- JONES, M., AND POGGIO, T. 1998. Multidimensional morphable models: A framework for representing and maching object classes. In Proceedings of the International Conference on Computer Vision. Google ScholarDigital Library
- LEE, S. Y., CHWA, K. Y., SHIN, S. Y., AND WOLBERG, G. 1995. Image metemorphosis using snakes and free-form deformations. In Proceedings of SIGGRAPH 1995, ACM Press / ACM SIGGRAPH, vol. 29 of Computer Graphics Proceedings, Annual Conference Series, ACM, 439-448. Google ScholarDigital Library
- LEE, Y., TERZOPOULOS, D., AND WATERS, K. 1995. Realistic modeling for facial animation. In Proceedings of SIGGRAPH 1995, ACM Press / ACM SIGGRAPH, Los Angeles, California, Computer Graphics Proceedings, Annual Conference Series, ACM, 55-62. Google ScholarDigital Library
- LEE, S. Y., WOLBERG, G., AND SHIN, S. Y. 1998. Polymorph: An algorithm for morphing among multiple images. IEEE Computer Graphics Applications 18, 58-71. Google ScholarDigital Library
- LEGOFF, B., AND BENOIT, C. 1996. A text-to-audiovisual-speech synthesizer for french. In Proceedings of the International Conference on Spoken Language Processing (ICSLP).Google ScholarCross Ref
- MASUKO, T., KOBAYASHI, T., TAMURA, M., MASUBUCHI, J., AND TOKUDA, K. 1998. Text-to-visual speech synthesis based on parameter generation from hmm. In ICASSP.Google Scholar
- MOULINES, E., AND CHARPENTIER, F. 1990. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication 9, 453-467. Google ScholarDigital Library
- PARKE, F. I. 1974. A parametric model of human faces. PhD thesis, University of Utah. Google ScholarDigital Library
- PEARCE, A., WYVILL, B., WYVILL, G., AND HILL, D. 1986. Speech and expression: A computer solution to face animation. In Graphics Interface. Google ScholarDigital Library
- PIGHIN, F., HECKER, J., LISCHINSKI, D., SZELISKI, R., AND SALESIN, D. 1998. Synthesizing realistic facial expressions from photographs. In Proceedings of SIGGRAPH 1998, ACM Press / ACM SIGGRAPH, Orlando, FL, Computer Graphics Proceedings, Annual Conference Series, ACM, 75-84. Google ScholarDigital Library
- POGGIO, T., AND VETTER, T. 1992. Recognition and structure from one 2D model view: observations on prototypes, object classes and symmetries. Tech. Rep. 1347, Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Google ScholarDigital Library
- ROWEIS, S. 1998. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, The MIT Press, M. I. Jordan, M. J. Kearns, and S. A. Solla, Eds., vol. 10. Google ScholarDigital Library
- SCOTT, K., KAGELS, D., WATSON, S., ROM, H., WRIGHT, J., LEE, M., AND HUSSEY, K. 1994. Synthesis of speaker facial movement to match selected speech sequences. In Proceedings of the Fifth Australian Conference on Speech Science and Technology, vol. 2, 620-625.Google Scholar
- SJLANDER, K., AND BESKOW, J. 2000. Wavesurfer - an open source speech tool. In Proc of ICSLP, vol. 4, 464-467.Google Scholar
- TENENBAUM, J. B., DE SILVA, V., AND LANGFORD, J. C. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290 (Dec), 2319-2323.Google ScholarCross Ref
- TIPPING, M. E., AND BISHOP, C. M. 1999. Mixtures of probabilistic principal component analyzers. Neural Computation 11, 2, 443-482. Google ScholarDigital Library
- WAHBA, G. 1900. Splines Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia.Google Scholar
- WATERS, K. 1987. A muscle model for animating three-dimensional facial expressions. In Computer Graphics (Proceedings of ACM SIGGRAPH 87), vol. 21(4), ACM, 17-24. Google ScholarDigital Library
- WATSON, S., WRIGHT, J., SCOTT, K., KAGELS, D., FREDA, D., AND HUSSEY, K. 1997. An advanced morphing algorithm for interpolating phoneme images to simulate speech. Jet Propulsion Laboratory, California Institute of Technology.Google Scholar
- WOLBERG, G. 1990. Digital Image Warping. IEEE Computer Society Press, Los Alamitos, CA. Google ScholarDigital Library
Index Terms
- Trainable videorealistic speech animation
Recommendations
Trainable videorealistic speech animation
We describe how to create with machine learning techniques a generative, speech animation module. A human subject is first recorded using a videocamera as he/she utters a predetermined speech corpus. After processing the corpus automatically, a visual ...
Visual Speech Synthesis by Morphing Visemes
special issue on learning and vision at the center for biological and computational learning, Massachusetts Institute of TechnologyWe present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired ...
Trainable videorealistic speech animation
FGR' 04: Proceedings of the Sixth IEEE international conference on Automatic face and gesture recognitionWe describe how to create with machine learning techniques a generative, videorealistic, speech animation module. A human subject is first recorded using a videocamera as he/she utters a pre-determined speech corpus. After processing the corpus ...
Comments