
2018 | OriginalPaper | Chapter

Comparing Cascaded LSTM Architectures for Generating Head Motion from Speech in Task-Oriented Dialogs

Authors : Duc-Canh Nguyen, Gérard Bailly, Frédéric Elisei

Published in: Human-Computer Interaction. Interaction Technologies

Publisher: Springer International Publishing


Abstract

To generate action events for a humanoid robot in human-robot interaction (HRI), multimodal interactive behavioral models are typically conditioned on the observed actions of the human partner(s). In previous work, we built an interactive model that generates discrete gaze and arm-gesture events to drive our iCub humanoid robot [19, 20]. In this paper, we investigate how to generate continuous head motion in a collaborative scenario where head motion contributes to verbal as well as nonverbal functions. We show that in this scenario the fundamental frequency of speech (the F0 feature) alone is not sufficient to drive head motion, whereas gaze contributes significantly to head motion generation. We propose a cascaded Long Short-Term Memory (LSTM) model that first estimates gaze from the speech content and the hand gestures performed by the partner, and then uses this estimate as an additional input for generating head motion. The results show that the proposed method outperforms a single-task model with the same inputs.
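The cascade described above can be sketched as two stacked recurrent stages: a first LSTM maps speech and partner-gesture features to an estimated gaze signal, and a second LSTM consumes the original features together with that estimate to produce head motion. The following PyTorch sketch illustrates only the wiring of such a cascade; the feature dimensions, hidden sizes, and framework are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CascadedLSTM(nn.Module):
    """Two-stage cascade: features -> gaze estimate -> head motion.

    Dimensions are hypothetical placeholders, not taken from the paper.
    """
    def __init__(self, feat_dim=8, gaze_dim=2, head_dim=3, hidden=32):
        super().__init__()
        # Stage 1: estimate gaze from speech + partner-gesture features.
        self.gaze_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.gaze_out = nn.Linear(hidden, gaze_dim)
        # Stage 2: generate head motion from features + estimated gaze.
        self.head_lstm = nn.LSTM(feat_dim + gaze_dim, hidden, batch_first=True)
        self.head_out = nn.Linear(hidden, head_dim)

    def forward(self, feats):
        h1, _ = self.gaze_lstm(feats)
        gaze = self.gaze_out(h1)
        h2, _ = self.head_lstm(torch.cat([feats, gaze], dim=-1))
        head = self.head_out(h2)
        return gaze, head

model = CascadedLSTM()
x = torch.randn(4, 100, 8)   # 4 sequences of 100 frames, 8 features each
gaze, head = model(x)
print(gaze.shape, head.shape)  # torch.Size([4, 100, 2]) torch.Size([4, 100, 3])
```

In a multi-task setup of this kind, the intermediate gaze estimate can also be supervised directly against recorded gaze, which is what distinguishes the cascade from a single-task model with the same inputs.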


Literature
1.
Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social LSTM: human trajectory prediction in crowded spaces. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–971 (2016)
2.
Ben Youssef, A., Shimodaira, H., Braude, D.A.: Articulatory features for speech-driven head motion synthesis. In: Interspeech, pp. 2758–2762 (2013)
3.
Boersma, P., Weenik, D.: PRAAT: a system for doing phonetics by computer. Report of the Institute of Phonetic Sciences of the University of Amsterdam. University of Amsterdam, Amsterdam (1996)
4.
Brimijoin, W.O., Boyd, A.W., Akeroyd, M.A.: The contribution of head movement to the externalization and internalization of sounds. PLoS One 8(12), e83068 (2013)
5.
Busso, C., Deng, Z., Grimm, M., Neumann, U., Narayanan, S.: Rigid head motion in expressive speech animation: analysis and synthesis. IEEE Trans. Audio Speech Lang. Process. 15(3), 1075–1086 (2007)
6.
Cassell, J., Pelachaud, C., Badler, N., Steedman, M., Achorn, B., Becket, T., Douville, B., Prevost, S., Stone, M.: Animated conversation: rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents. In: Annual Conference on Computer Graphics and Interactive Techniques, pp. 413–420. ACM (1994)
7.
Dehon, C., Filzmoser, P., Croux, C.: Robust methods for canonical correlation analysis. In: Kiers, H.A.L., Rasson, J.P., Groenen, P.J.F., Schader, M. (eds.) Data Analysis, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 321–326. Springer, Heidelberg (2000). https://doi.org/10.1007/978-3-642-59789-3_51
9.
Graf, H.P., Cosatto, E., Strom, V., Jie Huang, F.: Visual prosody: facial movements accompanying speech. In: Automatic Face and Gesture Recognition (FG), pp. 396–401. IEEE (2002)
10.
Guitton, D., Volle, M.: Gaze control in humans: eye-head coordination during orienting movements to targets within and beyond the oculomotor range. J. Neurophysiol. 58(3), 427–459 (1987)
11.
Haag, K., Shimodaira, H.: Bidirectional LSTM networks employing stacked bottleneck features for expressive speech-driven head motion synthesis. In: Traum, D., Swartout, W., Khooshabeh, P., Kopp, S., Scherer, S., Leuski, A. (eds.) IVA 2016. LNCS (LNAI), vol. 10011, pp. 198–207. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47665-0_18
13.
Levine, S., Theobalt, C., Koltun, V.: Real-time prosody-driven synthesis of body language. In: ACM Transactions on Graphics (TOG), vol. 28, Article no. 172. ACM (2009)
14.
Liu, C., Ishi, C.T., Ishiguro, H., Hagita, N.: Generation of nodding, head tilting and eye gazing for human-robot dialogue interaction. In: Human-Robot Interaction (HRI), pp. 285–292. IEEE (2012)
15.
Mariooryad, S., Busso, C.: Generating human-like behaviors using joint, speech-driven models for conversational agents. IEEE Trans. Audio Speech Lang. Process. 20(8), 2329–2340 (2012)
16.
May, T., Ma, N., Brown, G.J.: Robust localisation of multiple speakers exploiting head movements and multi-conditional training of binaural cues. In: Acoustics, Speech and Signal Processing (ICASSP), pp. 2679–2683. IEEE (2015)
17.
Mihoub, A., Bailly, G., Wolf, C., Elisei, F.: Graphical models for social behavior modeling in face-to-face interaction. Pattern Recogn. Lett. 74, 82–89 (2016)
18.
Munhall, K.G., Jones, J.A., Callan, D.E., Kuratate, T., Vatikiotis-Bateson, E.: Visual prosody and speech intelligibility: head movement improves auditory speech perception. Psychol. Sci. 15(2), 133–137 (2004)
19.
Nguyen, D.-C., Bailly, G., Elisei, F.: Conducting neuropsychological tests with a humanoid robot: design and evaluation. In: Cognitive Infocommunications (CogInfoCom), pp. 337–342. IEEE (2016)
20.
Nguyen, D.-C., Bailly, G., Elisei, F.: Learning off-line vs. on-line models of interactive multimodal behaviors with recurrent neural networks. Pattern Recognition Letters (PRL) (accepted with minor revision)
22.
Thórisson, K.R.: Natural turn-taking needs no manual: computational theory and model, from perception to action. In: Granström, B., House, D., Karlsson, I. (eds.) Multimodality in Language and Speech Systems. Text, Speech and Language Technology, vol. 19, pp. 173–207. Springer, Dordrecht (2002). https://doi.org/10.1007/978-94-017-2367-1_8
23.
Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., Sloetjes, H.: ELAN: a professional framework for multimodality research. In: International Conference on Language Resources and Evaluation (LREC) (2006)
24.
Yehia, H., Kuratate, T., Vatikiotis-Bateson, E.: Facial animation and head motion driven by speech acoustics. In: 5th Seminar on Speech Production: Models and Data, pp. 265–268. Kloster Seeon, Germany (2000)
25.
Wolpert, D.M., Doya, K., Kawato, M.: A unifying computational framework for motor control and social interaction. Philos. Trans. R. Soc. B Biol. Sci. 358(1431), 593–602 (2003)
Metadata
Title
Comparing Cascaded LSTM Architectures for Generating Head Motion from Speech in Task-Oriented Dialogs
Authors
Duc-Canh Nguyen
Gérard Bailly
Frédéric Elisei
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-91250-9_13