Pattern Recognition

Volume 40, Issue 8, August 2007, Pages 2325-2340

A coupled HMM approach to video-realistic speech animation

https://doi.org/10.1016/j.patcog.2006.12.001

Abstract

We propose a coupled hidden Markov model (CHMM) approach to video-realistic speech animation, which realizes realistic facial animation driven by speaker-independent continuous speech. Different from hidden Markov model (HMM)-based animation approaches that use a single state chain, we use CHMMs to explicitly model the subtle characteristics of audio–visual speech, e.g., the asynchrony, temporal dependency (synchrony), and different speech classes between the two modalities. We derive an expectation maximization (EM)-based A/V conversion algorithm for the CHMMs, which converts acoustic speech into facial animation parameters. We also present a video-realistic speech animation system. The system transforms the facial animation parameters into a mouth animation sequence, refines the animation with a performance refinement process, and finally stitches the animated mouth seamlessly onto a background facial sequence. We have compared the animation performance of the CHMMs with that of HMMs, multi-stream HMMs and factorial HMMs, both objectively and subjectively. Results show that the CHMMs achieve superior animation performance. The ph-vi-CHMM system, which adopts different state variables (phoneme states and viseme states) in the audio and visual modalities, performs best. The proposed approach indicates that explicitly modelling audio–visual speech is a promising direction for speech animation.

Introduction

Speech-driven talking faces are playing an increasingly indispensable role in multimedia applications such as computer games, online virtual characters, video telephony, and other interactive human–machine interfaces. For example, talking faces can provide visual speech perceptual information that helps hearing-impaired people communicate with machines through lipreading [1]. Recent studies have shown that the trust and attention of humans towards machines can be increased by 30% if humans interact with a human face instead of text only [2]. Current Internet videophones can transmit video, but due to bandwidth limitations and network congestion, the facial motion accompanying the audio often appears jerky because many frames are lost during transmission. Therefore, a video-realistic speech-driven talking face may provide a good alternative.

The essential problem of speech-driven talking faces is speech animation—synthesizing speech-related facial animation from audio. Despite decades of extensive research, realistic speech animation remains one of the most challenging tasks due to the variability of human speech, most notably the coarticulation phenomenon [3]. Various approaches have been proposed during the last decade, which significantly improve animation performance. Some approaches use a 3D mesh to define the head shape and map a face texture onto the mesh [4], [5], [6]. Others realize photo- or video-realistic animation from recorded image sequences of a face and render facial movements directly at the image level [6], [7], [8], [9], [10].

Regardless of the head model, speech animation approaches can be categorized, according to their audio/visual conversion method, into those that derive speech classes from audio and those that derive animation parameters from audio [6]. In the former, the audio is first segmented into a string of speech classes (e.g., phonemes), either manually or automatically by a speech recognizer. These units are then mapped directly to lip poses, ignoring dynamic factors such as speech rate and prosody. The latter approaches derive animation parameters directly from the speech acoustics, so speech dynamics are preserved. During the last two decades, machine learning methods, such as neural networks [11], Gaussian mixture models (GMMs) and hidden Markov models (HMMs), have been extensively used for the audio/visual conversion.

In GMM-based approaches [12], [13], Gaussian mixtures are used to model the probability distribution of the audio–visual data. After the GMM is learned from the audio–visual training data using the expectation maximization (EM) algorithm, the visual parameters are mapped analytically from the audio. However, this universal mapping ignores the context cues that are inherent in speech. To utilize context cues, a mapping can be tailored to a specific linguistic unit, e.g., a word. Hence, HMM-based approaches have recently been explored.
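
For concreteness, the sketch below illustrates one common form of such an analytic mapping (not necessarily the exact formulation of [12], [13]): a joint GMM is fitted on stacked audio–visual feature vectors, and each visual frame is predicted as the conditional expectation of the visual part given the audio part. The function names and the use of scikit-learn are our own illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative sketch (not the authors' code): GMM-based audio-to-visual
# regression.  A joint GMM is fitted on stacked [audio; visual] vectors and
# the visual parameters are predicted as the conditional mean E[v | a].

def fit_joint_gmm(audio_feats, visual_feats, n_components=16):
    """audio_feats: (T, Da) and visual_feats: (T, Dv) training frames."""
    joint = np.hstack([audio_feats, visual_feats])
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm.fit(joint)
    return gmm

def gmm_audio_to_visual(gmm, a, dim_a):
    """Conditional expectation of the visual part given one audio frame a."""
    means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
    resp = np.empty(len(weights))
    cond_means = np.empty((len(weights), means.shape[1] - dim_a))
    for k in range(len(weights)):
        mu_a, mu_v = means[k, :dim_a], means[k, dim_a:]
        S_aa = covs[k, :dim_a, :dim_a]
        S_va = covs[k, dim_a:, :dim_a]
        diff = a - mu_a
        inv_S_aa = np.linalg.inv(S_aa)
        # Responsibility of component k for the audio observation
        log_p = -0.5 * (diff @ inv_S_aa @ diff
                        + np.linalg.slogdet(S_aa)[1]
                        + dim_a * np.log(2 * np.pi))
        resp[k] = weights[k] * np.exp(log_p)
        # Component-wise conditional mean E[v | a, k]
        cond_means[k] = mu_v + S_va @ inv_S_aa @ diff
    resp /= resp.sum()
    return resp @ cond_means          # responsibility-weighted average over components
```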

To the best of our knowledge, Yamamoto et al. [14] were the first to introduce HMMs into speech animation, which has led the way to a number of new developments [13], [15], [16], [17], [18], [19], [20]. Their approach, namely the Viterbi single-state approach, trains HMMs on audio data and aligns the corresponding animation parameters to the HMM states. During the synthesis stage, an optimal HMM state sequence is selected for novel audio using the Viterbi alignment algorithm [21], and the visual parameters associated with each state are retrieved. Such approaches produce jerky animations, since the predicted visual parameter set for each frame is an average of the Gaussian mixture components associated with the current single state and is only indirectly related to the current audio input. In some other techniques, e.g., the mixture-based HMM [13] and the remapping HMM in Voice Puppetry [15], the visual output is made dependent not only on the current state but also on the audio input, resulting in improved performance. The mixture-based HMM technique [13] trains joint audio–visual HMMs which encapsulate the synchronization between the two modalities of speech. More recently, Aleksic et al. [20] proposed a correlation-HMM system using MPEG-4 visual features, which integrates independently trained acoustic and visual HMMs, allowing for increased flexibility in model topologies.
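
As a rough illustration of the Viterbi single-state baseline described above, the sketch below trains an audio-only HMM, associates each state with the mean of the visual frames aligned to it, and synthesizes by looking up those means along the Viterbi path. The function names and the use of the hmmlearn toolkit are our own choices, not the original implementation.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # third-party HMM toolkit, used only for illustration

def train_single_chain(audio_feats, visual_feats, n_states=5):
    """Train an audio HMM and attach a visual mean to each of its states."""
    hmm = GaussianHMM(n_components=n_states, covariance_type='diag')
    hmm.fit(audio_feats)                      # audio-only HMM training
    _, states = hmm.decode(audio_feats)       # Viterbi alignment of the training audio
    state_visual_mean = np.vstack(
        [visual_feats[states == s].mean(axis=0) for s in range(n_states)])
    return hmm, state_visual_mean

def viterbi_synthesis(hmm, state_visual_mean, new_audio):
    """Look up one 'average' mouth shape per frame along the Viterbi path."""
    _, states = hmm.decode(new_audio)         # Viterbi alignment of novel audio
    return state_visual_mean[states]
```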

However, all the above methods rely heavily on the Viterbi algorithm, which lacks robustness to noise [17]. If the speech is contaminated by ambient noise, the animation quality suffers greatly [15]. Moreover, the Viterbi sequence is deficient for speech animation in that it represents only a small fraction of the total probability mass, while many other, slightly different state sequences potentially have nearly equal likelihoods [22].

Moon et al. [23] proposed a hidden Markov model inversion (HMMI) method for robust speech recognition. Choi et al. [16], [17] extended this method to the audio–visual domain for speech animation, in which audio and video are jointly modelled by phoneme HMMs. They were able to generate animation parameters directly from the audio input by a conversion algorithm regarded as an inversion of EM-based parameter training. In this way, they avoided the Viterbi algorithm and made use of all possible state sequences, which together represent a much larger fraction of the total probability mass. More recently, Fu et al. [22] have demonstrated that the HMMI method outperforms the remapping HMMs [15] and the mixture-based HMMs [13] on a common test bed.

The conventional one-chain HMMs nevertheless have limitations in describing audio–visual speech: (1) due to the difference in discrimination abilities, audio speech and visual speech are categorized into different speech classes—phonemes and visemes [24]—so the two modalities are better modelled explicitly by different speech units; (2) speech production and perception are inherently coupled processes with both synchrony and asynchrony between the audio and visual modalities [25]. In previous HMM-based speech animations, the bimodal speech is modelled by a single Markov chain, which cannot reflect these important facts.

These two facts are important in audio–visual speech recognition (AVSR), or automatic lipreading [26]. They also affect the performance of the inverse problem (i.e., speech animation), since the human perceptual system is sensitive to artifacts induced by improper audio–visual synchrony or asynchrony. Therefore, to make the animation look more natural, it is necessary to take these facts into consideration. In this paper, we propose a coupled HMM (CHMM) approach to video-realistic speech animation, in which CHMMs are used to model the above characteristics of audio–visual speech.

In the following section, we give an overview of our speech animation system. Section 3 presents the AV-CHMMs used in the system, including our motivations, the model structures and the model training procedure. Section 4 derives the EM-based A/V conversion algorithm for the AV-CHMMs. Section 5 describes the audio–visual front-end. Our facial animation unit is presented in Section 6. Section 7 gives comparative evaluations, both objective and subjective. Finally, conclusions and future work are given in Section 8.

Section snippets

Speech animation system overview

Fig. 1 shows the block diagram of the proposed speech animation system. The system is composed of two main phases—the AV modelling phase (offline) and the speech-to-video synthesis phase (online). The offline phase models the audio–visual speech and learns the correspondences between the two modalities from the AV facial recordings. Given the AV models, the online synthesizer converts acoustic speech to visual parameters (i.e., animation parameters) and synthesizes facial

Asynchrony and synchrony

Asynchrony arises naturally in both audio–visual speech perception and speech production. From the speech production point of view, it has been shown that visual speech activity usually precedes the audio signal by as much as 120 ms [27], which is close to the average duration of a phoneme. Lavagetto [28] has shown that the visible articulators (i.e., lips, tongue and jaw), during an utterance, start and complete their trajectories asynchronously, resulting in both forward and backward

EM-based A/V conversion on AV-CHMMs

Given the trained AV models, a simple and common A/V conversion approach is to derive sub-phonemic transcriptions from novel audio via the Viterbi algorithm and to use the visual Gaussian mean associated with each state label as the visual parameter vector of the corresponding frame. As indicated in Section 1, this approach has a major defect: the facial animation performance relies heavily on the Viterbi state sequence, which is not robust to acoustic degradation, e.g., additive noise.
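
To make the contrast concrete, the following sketch shows the posterior-weighting idea for a single-chain HMM: every state contributes to every frame in proportion to its posterior probability, instead of committing to a single Viterbi path. The paper derives the full EM-based conversion for CHMMs; this single-chain version, reusing the per-state visual means from the earlier sketch, is only meant to illustrate the principle.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def posterior_weighted_synthesis(audio_hmm: GaussianHMM, state_visual_mean, new_audio):
    """Posterior-weighted visual trajectory for novel audio.

    Instead of one Viterbi path, each frame's output is the average of the
    per-state visual means weighted by the state posteriors gamma_t(j)
    obtained from the forward-backward algorithm, which is less sensitive
    to acoustic degradation than a hard alignment.
    """
    gamma = audio_hmm.predict_proba(new_audio)   # (T, N) state posteriors
    return gamma @ state_visual_mean             # (T, Dv) visual parameters
```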

Audio visual front-end

Prior to AV modelling, front-end processing is performed to obtain representative features of the audio and visual speech (see Fig. 1). The speech signal, sampled at 16 kHz mono, is processed in frames of 25 ms with a 15 ms overlap (frame rate = 100 Hz). We first pre-emphasize the speech frames with an FIR filter (H(z) = 1 − az⁻¹, a = 0.97) and weight them with a Hamming window to avoid spectral distortions. After pre-processing, we extract Mel Frequency Cepstral Coefficients (MFCCs) [42] as the acoustic features.
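
A minimal sketch of this acoustic front-end is given below, assuming 13 cepstral coefficients per frame (a common choice; the exact number and analysis settings used in the paper may differ) and using the librosa library for the MFCC computation.

```python
import numpy as np
import librosa  # third-party audio library, used here only for illustration

def acoustic_front_end(wav_path, sr=16000, n_mfcc=13, a=0.97):
    """16 kHz mono speech -> (T, n_mfcc) MFCC matrix at a 100 Hz frame rate."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    y = np.append(y[0], y[1:] - a * y[:-1])          # pre-emphasis: H(z) = 1 - a z^-1
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=512, win_length=400,                   # 25 ms Hamming analysis window
        hop_length=160,                              # 10 ms step -> 100 frames/s
        window='hamming')
    return mfcc.T                                    # one feature vector per frame
```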

Facial animation unit

The facial animation unit first smoothes the predicted visual parameters by a moving average filter (3 frames wide) to remove jitters, and then augments the fine details of the mouth appearance using a performance refinement process. Subsequently, mouth animation is generated from the visual parameters by the PCA expansion process [43]. Finally, we overlay the synthesized mouth animation onto a background sequence which contains natural head and eye movements.
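
The sketch below illustrates two of these steps, the 3-frame moving-average smoothing and the PCA expansion (mean appearance plus a weighted sum of eigen-images); the variable names and the PCA basis are placeholders for whatever is learned during training, not the authors' code.

```python
import numpy as np

def smooth_parameters(visual_params, width=3):
    """visual_params: (T, D) trajectory; simple centred moving-average filter."""
    kernel = np.ones(width) / width
    return np.apply_along_axis(
        lambda track: np.convolve(track, kernel, mode='same'), 0, visual_params)

def pca_expand(params_t, mean_image, eigen_images):
    """Reconstruct one mouth frame: mean image + sum_k params_t[k] * eigen_images[k]."""
    return mean_image + np.tensordot(params_t, eigen_images, axes=1)
```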

Experiment setup

We have compared the CHMMs with three models—HMMs, MSHMMs and FHMMs—in terms of speech animation performance. The tested systems are summarized in Table 2, where C() denotes the cardinality of the state variables. The systems named ‘ph-*’ adopt phoneme states for both the audio and visual modalities, while the systems named ‘ph-vi-*’ adopt phoneme states for audio and viseme states for video. Each phoneme (viseme) is modelled by five states, and the 13 visemes are mapped from the 47 phonemes using

Conclusions and future work

In this paper, we have proposed a CHMM approach to video-realistic speech animation. Motivated by the subtle relationships between audio speech and mouth movement, we use the CHMMs to explicitly model the synchrony, asynchrony, temporal coupling and different speech classes between the audio speech and visual speech. The CHMMs use two Markov chains to model the audio–visual asynchrony, while still preserving the natural correlations (i.e., synchrony) through inter-modal dependencies.
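
The sketch below is a minimal illustration of such a coupled transition structure, assuming the standard CHMM factorization in which each chain's next state is conditionally independent of the other chain's next state given both previous states; the state counts and transition tables are random placeholders, not model parameters from the paper.

```python
import numpy as np

# Two coupled chains (audio and visual): each chain's next state depends on
# the previous states of BOTH chains, i.e. P(q_t^a | q_{t-1}^a, q_{t-1}^v)
# and P(q_t^v | q_{t-1}^a, q_{t-1}^v).

rng = np.random.default_rng(0)
Na, Nv = 5, 5                                    # states per audio / visual chain

def random_coupled_table(n_next, n_a, n_v):
    """Table T[i, j, k] = P(next state = k | prev audio = i, prev visual = j)."""
    table = rng.random((n_a, n_v, n_next))
    return table / table.sum(axis=-1, keepdims=True)

A_audio = random_coupled_table(Na, Na, Nv)       # audio-chain transitions
A_visual = random_coupled_table(Nv, Na, Nv)      # visual-chain transitions

def propagate_joint(prior_av, A_audio, A_visual):
    """One exact step of the joint state distribution P(q_t^a, q_t^v), assuming the
    two next states are conditionally independent given both previous states."""
    return np.einsum('ij,ijk,ijl->kl', prior_av, A_audio, A_visual)

prior = np.full((Na, Nv), 1.0 / (Na * Nv))       # uniform initial joint distribution
next_joint = propagate_joint(prior, A_audio, A_visual)   # still sums to 1
```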

We have


References (44)

  • L. Yin et al.

    Synthesizing realistic facial animations using energy minimization for model-based coding

    Pattern Recognition

    (2001)
  • E. Yamamoto et al.

    Lip movement synthesis from speech based on Hidden Markov Models

    Speech Commun.

    (1998)
  • R. Lippmann

    Speech recognition by machines and humans

    Speech Commun.

    (1997)
  • J. Ostermann, A. Weissenfeld, Talking faces—technologies and applications, in: Proceedings of ICPR’04, vol. 3, 2004,...
  • M.M. Cohen et al.

    Modeling coarticulation in synthetic visual speech

  • F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, D.H. Salesin, Synthesizing realistic facial expressions from...
  • E. Cosatto et al.

    Lifelike talking faces for interactive services

    Proc. IEEE

    (2003)
  • C. Bregler, M. Covell, M. Slaney, Video rewrite: driving visual speech with audio, in: Proceedings of ACM SIGGRAPH’97,...
  • T. Ezzat, G. Geiger, T. Poggio, Trainable videorealistic speech animation, in: Proceedings of ACM SIGGRAPH, 2002, pp....
  • E. Cosatto, H. Graf, Sample-based synthesis of photo-realistic talking heads, in: Proceedings of IEEE Computer...
  • E. Cosatto et al.

    Photo-realistic talking heads from image samples

    IEEE Trans. Multimedia

    (2000)
  • P. Hong et al.

    Real-time speech-driven face animation with expressions using neural networks

    IEEE Trans. Neural Networks

    (2002)
  • F.J. Huang, T. Chen, Real-time lip-synch face animation driven by human voice, in: IEEE Second Workshop on Multimedia...
  • R.R. Rao et al.

    Audio-to-visual conversion for multimedia communication

    IEEE Trans. Ind. Electron.

    (1998)
  • M. Brand, Voice puppetry, in: SIGGRAPH’99, Los Angeles, 1999, pp....
  • K. Choi, J. N. Hwang, Baum–Welch hidden Markov model inversion for reliable audio-to-visual conversion, in: Proceedings...
  • K. Choi et al.

    Hidden Markov model inversion for audio-to-visual conversion in an MPEG-4 facial animation system

    J. VLSI Signal Process.

    (2001)
  • S. Lee, D. Yook, Audio-to-visual conversion using hidden Markov models, in: M. Ishizuka, S. A. (Eds.), Proceedings of...
  • L. Xie, D.-M. Jiang, I. Ravyse, W. Verhelst, H. Sahli, V. Slavova, R.-C. Zhao, Context dependent viseme models for...
  • P.S. Aleksic et al.

    Speech-to-video synthesis using MPEG-4 compliant visual features

    IEEE Trans. Circuits Systems Video Technol.

    (2004)
  • L.R. Rabiner

    A tutorial on hidden Markov models and selected applications in speech recognition

    Proc. IEEE

    (1989)
  • S. Fu et al.

    Audio/visual mapping with cross-modal hidden Markov models

    IEEE Trans. Multimedia

    (2005)

About the Author—LEI XIE received the B.Eng., M.Eng. and Ph.D. degrees from Northwestern Polytechnical University, Xi’an, China, in 1999, 2001 and 2004, respectively, all in computer science. He was granted IBM Excellent Chinese Student Awards twice, in 1999 and 2002. From 2001 to 2002, he was with the Department of Electronics and Information Processing, Vrije Universiteit Brussel (VUB), Brussels, Belgium, as a Visiting Scientist. From 2004 to 2006, he was a Senior Research Associate in the Center for Media Technology (RCMT), School of Creative Media, City University of Hong Kong, Hong Kong SAR, China.

Dr. Xie is currently a Postdoctoral Fellow in the Human–Computer Communications Laboratory (HCCL), Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong. His current research interests include talking faces, multimedia retrieval, speech recognition, multimedia signal processing and pattern recognition.

About the Author—ZHI-QIANG LIU received the M.A.Sc. degree in Aerospace Engineering from the Institute for Aerospace Studies, The University of Toronto, and the Ph.D. degree in Electrical Engineering from The University of Alberta, Canada. He is currently with the School of Creative Media, City University of Hong Kong. He has taught computer architecture, computer networks, artificial intelligence, programming languages, machine learning, pattern recognition, computer graphics, and art & technology.

His interests are scuba diving, neural-fuzzy systems, painting, gardening, machine learning, mountain/beach trekking, human–media systems, horse riding, computer vision, serving the community, mobile computing, computer networks, and fishing.

This work is supported by the Hong Kong RGC CERG project CityU 1247/03E.
