A coupled HMM approach to video-realistic speech animation☆
Introduction
Speech-driven talking faces are playing an increasingly indispensable role in multimedia applications such as computer games, online virtual characters, video telephony, and other interactive human–machine interfaces. For example, talking faces can provide visual speech perceptual information that helps hearing-impaired people communicate with machines through lipreading [1]. Recent studies have shown that the trust and attention of humans towards machines can increase by as much as 30% when humans interact with a human face instead of text only [2]. Current Internet videophones can transmit video, but due to bandwidth limitations and network congestion, facial motion accompanied with audio often appears jerky because many frames are lost during transmission. A video-realistic speech-driven talking face may therefore provide a good alternative.
The essential problem of speech-driven talking faces is speech animation—synthesizing speech-related facial animation from audio. Despite decades of extensive research, realistic speech animation remains one of the most challenging tasks due to the variability of human speech, most notably the coarticulation phenomenon [3]. Various approaches proposed during the last decade have significantly improved animation performance. Some approaches use a 3D mesh to define the head shape and map a face texture onto the mesh [4], [5], [6]. Others achieve photo- or video-realistic animation from recorded image sequences of the face and render facial movements directly at the image level [6], [7], [8], [9], [10].
Regardless of the head model, speech animation approaches can be categorized by their audio/visual conversion method into two families: those that derive speech classes from audio and those that derive animation parameters from audio [6]. In the former, audio is first segmented into a string of speech classes (e.g., phonemes), either manually or automatically by a speech recognizer. These units are then mapped directly to lip poses, ignoring dynamic factors such as speech rate and prosody. The latter approaches derive animation parameters directly from speech acoustics, so speech dynamics are preserved. During the last two decades, machine learning methods, such as neural networks [11], Gaussian mixture models (GMMs) and hidden Markov models (HMMs), have been extensively used for the audio/visual conversion.
In GMM-based approaches [12], [13], Gaussian mixtures are used to model the probability distribution of audio–visual data. After the GMM is learned using the expectation maximization (EM) algorithm and the audio–visual training data, the visual parameters are mapped analytically from the audio. The universal mapping of the GMM ignores the context cues that are inherent in speech. To utilize the context cues, a mapping can be tailored to a specific linguistic unit, e.g., a word. Hence, HMM-based approaches have been recently explored.
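The analytic mapping described above is commonly realized as a minimum mean-square error (MMSE) estimate: the visual parameters are a posterior-weighted sum of per-component conditional means. The following is a minimal sketch of this idea with toy, hand-set parameters (one audio and one visual dimension; the values are illustrative, not trained ones from the paper):

```python
import numpy as np

# Minimal sketch of GMM-based audio-to-visual mapping (MMSE estimate).
# All model parameters below are toy values for illustration, not trained ones.

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmm_audio_to_visual(a, weights, mu_a, mu_v, var_a, cov_av):
    """MMSE visual estimate: sum_k p(k|a) * E[v | a, k]."""
    # Posterior responsibility of each mixture component given the audio frame
    lik = weights * gaussian_pdf(a, mu_a, var_a)
    post = lik / lik.sum()
    # Per-component conditional mean of the visual feature given the audio
    cond_mean = mu_v + cov_av / var_a * (a - mu_a)
    return float(np.dot(post, cond_mean))

# Two toy components: audio near 0 maps to visual near -1, audio near 4 to +1
weights = np.array([0.5, 0.5])
mu_a = np.array([0.0, 4.0])
mu_v = np.array([-1.0, 1.0])
var_a = np.array([1.0, 1.0])
cov_av = np.array([0.2, 0.2])

v_hat = gmm_audio_to_visual(0.0, weights, mu_a, mu_v, var_a, cov_av)
```

Because the same mapping is applied to every frame regardless of linguistic context, the estimate cannot exploit the context cues that the HMM-based approaches below are designed to capture.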
To the best of our knowledge, Yamamoto et al. [14] were the first to introduce HMMs into speech animation, which has led the way to several new developments [13], [15], [16], [17], [18], [19], [20]. Their approach, the Viterbi single-state approach, trains HMMs from audio data and aligns the corresponding animation parameters to the HMM states. During the synthesis stage, an optimal HMM state sequence is selected for a novel audio input using the Viterbi alignment algorithm [21], and the visual parameters associated with each state are retrieved. Such approaches produce jerky animations because the predicted visual parameter set for each frame is an average of the Gaussian mixture components associated with the current single state, and is only indirectly related to the current audio input. In some other techniques, e.g., the mixture-based HMM [13] and the remapping HMM in Voice Puppetry [15], the visual output depends not only on the current state but also on the audio input, resulting in improved performance. The mixture-based HMM technique [13] trains joint audio–visual HMMs which encapsulate the synchronization between the two modalities of speech. More recently, Aleksic et al. [20] proposed a correlation-HMM system using MPEG-4 visual features, which integrates independently trained acoustic and visual HMMs, allowing for increased flexibility in model topologies.
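The Viterbi single-state look-up can be sketched as follows: decode the single best state path from the audio, then emit the visual mean attached to each state. The tiny HMM below uses discrete toy observations and hand-set parameters (none of these values come from the paper); the piecewise-constant output illustrates why such animations appear jerky.

```python
import numpy as np

# Sketch of the Viterbi single-state approach: decode an optimal state
# sequence from audio, then output the visual mean attached to each state.

def viterbi(obs, log_pi, log_A, log_B):
    """Standard Viterbi decoding; obs indexes discrete observation symbols."""
    T, N = len(obs), len(log_pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A        # (from_state, to_state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1][states[t + 1]]
    return states

# Two audio states emitting two symbols; each state carries one visual mean.
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
log_B = np.log([[0.8, 0.2], [0.2, 0.8]])
visual_means = np.array([0.0, 1.0])      # e.g., mouth-opening value per state

audio = [0, 0, 0, 1, 1, 1]
path = viterbi(audio, log_pi, log_A, log_B)
animation = visual_means[path]           # piecewise-constant, hence jerky
```

The output trajectory jumps between the two state means with no intermediate values, which is exactly the source of the jerkiness criticized above.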
However, all the above methods heavily rely on the Viterbi algorithm that lacks robustness to noise [17]. If speech is contaminated by ambient noise, the animation quality will suffer greatly [15]. Moreover, the Viterbi sequence is deficient for speech animation in that it represents only a small fraction of the total probability mass, and many other slightly different state sequences potentially have nearly equal likelihoods [22].
Moon et al. [23] proposed a hidden Markov model inversion (HMMI) method for robust speech recognition. Choi et al. [16], [17] extended this method to the audio–visual domain for speech animation, in which audio and video are jointly modelled by phoneme HMMs. They generate animation parameters directly from the audio input by a conversion algorithm that can be considered an inversion of EM-based parameter training. In this way, they avoid the Viterbi algorithm and make use of all possible state sequences, representing a much larger fraction of the total probability mass. More recently, Fu et al. [22] demonstrated that the HMMI method outperforms the remapping HMMs [15] and the mixture-based HMMs [13] on a common test bed.
Conventional one-chain HMMs, however, have limitations in describing audio–visual speech: (1) due to their different discrimination abilities, audio speech and visual speech are categorized into different speech classes—phonemes and visemes, respectively [24]—so the bimodal speech is better modelled by explicitly different atoms; (2) speech production and perception are inherently coupled processes with both synchrony and asynchrony between the audio and visual modalities [25]. In previous HMM-based speech animation, the bimodal speech is modelled by a single Markov chain, which cannot reflect these important facts.
These two facts are important in audio–visual speech recognition (AVSR), or automatic lipreading [26]. They also affect the performance of the inverse problem (i.e., speech animation), since the human perceptual system is sensitive to artifacts induced by a loss of the natural synchrony and asynchrony. Therefore, to make the animation look more natural, it is necessary to take these facts into consideration. In this paper, we propose a coupled HMM (CHMM) approach to video-realistic speech animation, in which we use CHMMs to model the above characteristics of audio–visual speech.
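The coupling idea can be summarized in the standard CHMM factorization, in which each modality keeps its own hidden chain (allowing different speech classes and asynchrony) while each chain's transition is conditioned on both chains' previous states (preserving the coupling). In a generic notation (ours, introduced here for exposition: $q_t^a$ and $q_t^v$ are the audio and visual hidden states, $o_t^a$ and $o_t^v$ the observations):

```latex
P\bigl(q_t^a, q_t^v \mid q_{t-1}^a, q_{t-1}^v\bigr)
  = P\bigl(q_t^a \mid q_{t-1}^a, q_{t-1}^v\bigr)\,
    P\bigl(q_t^v \mid q_{t-1}^a, q_{t-1}^v\bigr),
\qquad
P\bigl(o_t^a, o_t^v \mid q_t^a, q_t^v\bigr)
  = P\bigl(o_t^a \mid q_t^a\bigr)\,P\bigl(o_t^v \mid q_t^v\bigr).
```

The cross-chain terms $P(q_t^a \mid q_{t-1}^v)$-style dependencies are what distinguish the CHMM from two independent HMMs: each chain may drift ahead of or behind the other, but never independently of it.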
In the following section, we give an overview of our speech animation system. Section 3 presents the AV-CHMMs used in our speech animation system, including our motivations, the model structures and the model training procedure. Section 4 derives the EM-based A/V conversion algorithm for the AV-CHMMs. Section 5 describes the audio–visual front-end. Our facial animation unit is presented in Section 6. Section 7 gives the comparative evaluations, both objective and subjective. Finally, conclusions and future work are given in Section 8.
Speech animation system overview
Fig. 1 shows the block diagram of the proposed speech animation system. The system is composed of two main phases—the AV modelling phase (offline) and the speech-to-video synthesis phase (online). The offline phase is used to model the audio–visual speech as well as learn the correspondences between the two modalities from the AV facial recordings. Given the AV models, the online synthesizer converts acoustic audio to visual parameters (i.e., animation parameters) and synthesizes facial
Asynchrony and synchrony
Asynchrony arises naturally in both audio–visual speech perception and speech production. From the production point of view, it has been shown that visual speech activity usually precedes the audio signal by as much as [27], which is close to the average duration of a phoneme. Lavagetto [28] has shown that the visible articulators (i.e., lips, tongue and jaw) start and complete their trajectories asynchronously during uttering, resulting in both forward and backward
EM-based A/V conversion on AV-CHMMs
Given the trained AV models, a simple and common A/V conversion approach is to derive a sub-phonemic transcription from a novel audio track via the Viterbi algorithm, and to use the visual Gaussian mean associated with each state label as the visual parameter vector of the corresponding frame. As indicated in Section 1, this approach has a major defect: the facial animation performance relies heavily on the Viterbi state sequence, which is not robust to acoustic degradation, e.g., additive noise.
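The contrast between the hard Viterbi look-up and a soft, probability-mass-preserving estimate can be sketched as follows: instead of a single state path, use the per-frame state posteriors $\gamma_t(j)$ from the forward–backward algorithm to blend the visual means. This is a simplified, single-chain illustration of the principle behind EM-based conversion, not the paper's AV-CHMM algorithm; all parameters are toy values.

```python
import numpy as np

# Soft, posterior-weighted visual estimate: v_t = sum_j gamma_t(j) * mu_v(j),
# where gamma is the state occupancy from the forward-backward algorithm.

def forward_backward(obs, pi, A, B):
    """Return per-frame state posteriors gamma_t(j) for a discrete HMM."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = alpha[t - 1] @ A * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
visual_means = np.array([0.0, 1.0])

gamma = forward_backward([0, 0, 1, 1], pi, A, B)
trajectory = gamma @ visual_means   # smooth blend instead of hard switches
```

Because every state sequence contributes in proportion to its probability, a few noise-corrupted frames shift the posteriors only slightly, instead of re-routing an entire Viterbi path.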
Audio visual front-end
Prior to AV modelling, front-end processing is performed to obtain representative features of the audio and visual speech (see Fig. 1). The monaural speech signal is processed in short overlapping frames. We first pre-emphasize the speech frames with an FIR filter and weight them with a Hamming window to avoid spectral distortions. After pre-processing, we extract Mel Frequency Cepstral Coefficients (MFCCs) [42] as the acoustic features.
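The framing and pre-processing steps can be sketched as below. The exact sample rate, frame length, overlap and pre-emphasis coefficient are not given in the text, so typical values (16 kHz, 25 ms frames with a 10 ms hop, coefficient 0.97) are assumed here purely for illustration.

```python
import numpy as np

def preprocess_frames(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
                      preemph=0.97):
    """Pre-emphasize, split into overlapping frames, apply Hamming window."""
    # First-order FIR pre-emphasis: y[n] = x[n] - preemph * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames  # ready for FFT, mel filterbank and DCT to obtain MFCCs

signal = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
frames = preprocess_frames(signal)
```

Each windowed frame would then pass through an FFT, a mel filterbank and a DCT to yield the MFCC vector used as the acoustic feature.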
Facial animation unit
The facial animation unit first smoothes the predicted visual parameters by a moving average filter (3 frames wide) to remove jitters, and then augments the fine details of the mouth appearance using a performance refinement process. Subsequently, mouth animation is generated from the visual parameters by the PCA expansion process [43]. Finally, we overlay the synthesized mouth animation onto a background sequence which contains natural head and eye movements.
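The smoothing and PCA-expansion steps can be sketched as follows. The PCA basis below is a random orthonormal stand-in for illustration only; in the actual system it would be learned from the training mouth images.

```python
import numpy as np

def moving_average(params, width=3):
    """3-frame moving average over time to remove frame-to-frame jitter."""
    kernel = np.ones(width) / width
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="same"), 0, params)

def pca_expand(params, mean_image, basis):
    """Reconstruct mouth images: image = mean + basis @ coefficients."""
    return mean_image + params @ basis.T

rng = np.random.default_rng(0)
T, n_comp, n_pix = 50, 5, 64          # frames, PCA coeffs, pixels (toy sizes)
params = rng.normal(size=(T, n_comp))        # predicted visual parameters
basis, _ = np.linalg.qr(rng.normal(size=(n_pix, n_comp)))  # stand-in basis
mean_image = rng.normal(size=n_pix)

smoothed = moving_average(params)            # jitter removal
mouth_frames = pca_expand(smoothed, mean_image, basis)  # one image per frame
```

The reconstructed mouth frames would then be composited onto the background sequence, as described above.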
Experiment setup
We have compared the CHMMs with three models—HMMs, MSHMMs and FHMMs—in terms of speech animation performance. The tested systems, together with the cardinalities of their state variables, are summarized in Table 2. Some systems adopt only phoneme states for both the audio and visual modalities, while the others adopt phoneme states for audio and viseme states for video. Each phoneme (viseme) is modelled by five states, and the 13 visemes are mapped from the 47 phonemes using
Conclusions and future work
In this paper, we have proposed a CHMM approach to video-realistic speech animation. Motivated by the subtle relationships between audio speech and mouth movement, we use the CHMMs to explicitly model the synchrony, asynchrony, temporal coupling and different speech classes between the audio speech and visual speech. The CHMMs use two Markov chains to model the audio–visual asynchrony, while still preserving the natural correlations (i.e., synchrony) through inter-modal dependencies.
We have
References
- et al., Synthesizing realistic facial animations using energy minimization for model-based coding, Pattern Recognition (2001).
- et al., Lip movement synthesis from speech based on Hidden Markov Models, Speech Commun. (1998).
- Speech recognition by machines and humans, Speech Commun. (1997).
- J. Ostermann, A. Weissenfeld, Talking faces—technologies and applications, in: Proceedings of ICPR'04, vol. 3, 2004, ...
- et al., Modeling coarticulation in synthetic visual speech.
- F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, D.H. Salesin, Synthesizing realistic facial expressions from ...
- et al., Lifelike talking faces for interactive services, Proc. IEEE (2003).
- C. Bregler, M. Covell, M. Slaney, Video rewrite: driving visual speech with audio, in: Proceedings of ACM SIGGRAPH'97, ...
- T. Ezzat, G. Geiger, T. Poggio, Trainable videorealistic speech animation, in: Proceedings of ACM SIGGRAPH, 2002, pp. ...
- E. Cosatto, H. Graf, Sample-based synthesis of photo-realistic talking heads, in: Proceedings of IEEE Computer ...
- Photo-realistic talking heads from image samples, IEEE Trans. Multimedia.
- Real-time speech-driven face animation with expressions using neural networks, IEEE Trans. Neural Networks.
- Audio-to-visual conversion for multimedia communication, IEEE Trans. Ind. Electron.
- Hidden Markov model inversion for audio-to-visual conversion in an MPEG-4 facial animation system, J. VLSI Signal Process.
- Speech-to-video synthesis using MPEG-4 compliant visual features, IEEE Trans. Circuits Systems Video Technol.
- A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE.
- Audio/visual mapping with cross-modal hidden Markov models, IEEE Trans. Multimedia.
About the Author—LEI XIE received the B.Eng., M.Eng. and Ph.D. degrees from Northwestern Polytechnical University, Xi’an, China, in 1999, 2001 and 2004, respectively, all in computer science. He was granted IBM Excellent Chinese Student Awards twice in 1999 and 2002. From 2001 to 2002, he was with the Department of Electronics and Information Processing, Vrije Universiteit Brussel (VUB), Brussels, Belgium as a Visiting Scientist. From 2004 to 2006, he was a Senior Research Associate in the Center for Media Technology (RCMT), School of Creative Media, City University of Hong Kong, Hong Kong SAR, China.
Dr. Xie is currently a Postdoctoral Fellow in the Human–Computer Communications Laboratory (HCCL), Department of Systems Engineering & Engineering Management, the Chinese University of Hong Kong. His current research interest includes talking face, multimedia retrieval, speech recognition, multimedia signal processing and pattern recognition.
About the Author—ZHI-QIANG LIU received the M.A.Sc. degree in Aerospace Engineering from the Institute for Aerospace Studies, The University of Toronto, and the Ph.D. degree in Electrical Engineering from The University of Alberta, Canada. He is currently with School of Creative Media, City University of Hong Kong. He has taught computer architecture, computer networks, artificial intelligence, programming languages, machine learning, pattern recognition, computer graphics, and art & technology.
His interests are scuba diving, neural-fuzzy systems, painting, gardening, machine learning, mountain/beach trekking, human–media systems, horse riding, computer vision, serving the community, mobile computing, computer networks, and fishing.
☆ This work is supported by the Hong Kong RGC CERG project CityU 1247/03E.