
2018 | OriginalPaper | Chapter

Comparing Cascaded LSTM Architectures for Generating Head Motion from Speech in Task-Oriented Dialogs

Authors : Duc-Canh Nguyen, Gérard Bailly, Frédéric Elisei

Published in: Human-Computer Interaction. Interaction Technologies

Publisher: Springer International Publishing


Abstract

To generate action events for a humanoid robot in human-robot interaction (HRI), multimodal interactive behavioral models are typically conditioned on the observed actions of the human partner(s). In previous work, we built an interactive model that generates discrete gaze and arm-gesture events to drive our iCub humanoid robot [19, 20]. In this paper, we investigate how to generate continuous head motion in a collaborative scenario where head motion contributes to verbal as well as nonverbal functions. We show that in this scenario the fundamental frequency of speech (the F0 feature) alone is not sufficient to drive head motion, whereas gaze contributes significantly to head motion generation. We propose a cascaded Long Short-Term Memory (LSTM) model that first estimates gaze from the speech content and the hand gestures performed by the partner, and then uses this estimate as an additional input for generating head motion. The results show that the proposed method outperforms a single-task model with the same inputs.
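The cascade described above can be sketched as two stacked recurrent stages: a first LSTM maps speech and partner-gesture features to an estimated gaze signal, and a second LSTM consumes the original features together with that estimate to produce head motion. The following PyTorch sketch illustrates only the wiring of such a cascade; the feature dimensions, hidden sizes, and framework are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CascadedLSTM(nn.Module):
    """Two-stage cascade: features -> gaze estimate -> head motion.

    Dimensions are hypothetical placeholders, not taken from the paper.
    """
    def __init__(self, feat_dim=8, gaze_dim=2, head_dim=3, hidden=32):
        super().__init__()
        # Stage 1: estimate gaze from speech + partner-gesture features.
        self.gaze_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.gaze_out = nn.Linear(hidden, gaze_dim)
        # Stage 2: generate head motion from features + estimated gaze.
        self.head_lstm = nn.LSTM(feat_dim + gaze_dim, hidden, batch_first=True)
        self.head_out = nn.Linear(hidden, head_dim)

    def forward(self, feats):
        h1, _ = self.gaze_lstm(feats)
        gaze = self.gaze_out(h1)
        h2, _ = self.head_lstm(torch.cat([feats, gaze], dim=-1))
        head = self.head_out(h2)
        return gaze, head

model = CascadedLSTM()
x = torch.randn(4, 100, 8)   # 4 sequences of 100 frames, 8 features each
gaze, head = model(x)
print(gaze.shape, head.shape)  # torch.Size([4, 100, 2]) torch.Size([4, 100, 3])
```

In a multi-task setup of this kind, the intermediate gaze estimate can also be supervised directly against recorded gaze, which is what distinguishes the cascade from a single-task model with the same inputs.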


Literature
1.
Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social LSTM: human trajectory prediction in crowded spaces. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–971 (2016)
2.
Ben Youssef, A., Shimodaira, H., Braude, D.A.: Articulatory features for speech-driven head motion synthesis. In: Interspeech, pp. 2758–2762 (2013)
3.
Boersma, P., Weenik, D.: PRAAT: a system for doing phonetics by computer. Report of the Institute of Phonetic Sciences of the University of Amsterdam. University of Amsterdam, Amsterdam (1996)
4.
Brimijoin, W.O., Boyd, A.W., Akeroyd, M.A.: The contribution of head movement to the externalization and internalization of sounds. PLoS One 8(12), e83068 (2013)
5.
Busso, C., Deng, Z., Grimm, M., Neumann, U., Narayanan, S.: Rigid head motion in expressive speech animation: analysis and synthesis. IEEE Trans. Audio Speech Lang. Process. 15(3), 1075–1086 (2007)
6.
Cassell, J., Pelachaud, C., Badler, N., Steedman, M., Achorn, B., Becket, T., Douville, B., Prevost, S., Stone, M.: Animated conversation: rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents. In: Annual Conference on Computer Graphics and Interactive Techniques, pp. 413–420. ACM (1994)
7.
Dehon, C., Filzmoser, P., Croux, C.: Robust methods for canonical correlation analysis. In: Kiers, H.A.L., Rasson, J.P., Groenen, P.J.F., Schader, M. (eds.) Data Analysis, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 321–326. Springer, Heidelberg (2000). https://doi.org/10.1007/978-3-642-59789-3_51
9.
Graf, H.P., Cosatto, E., Strom, V., Jie Huang, F.: Visual prosody: facial movements accompanying speech. In: Automatic Face and Gesture Recognition (FG), pp. 396–401. IEEE (2002)
10.
Guitton, D., Volle, M.: Gaze control in humans: eye-head coordination during orienting movements to targets within and beyond the oculomotor range. J. Neurophysiol. 58(3), 427–459 (1987)
11.
Haag, K., Shimodaira, H.: Bidirectional LSTM networks employing stacked bottleneck features for expressive speech-driven head motion synthesis. In: Traum, D., Swartout, W., Khooshabeh, P., Kopp, S., Scherer, S., Leuski, A. (eds.) IVA 2016. LNCS (LNAI), vol. 10011, pp. 198–207. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47665-0_18
13.
Levine, S., Theobalt, C., Koltun, V.: Real-time prosody-driven synthesis of body language. In: ACM Transactions on Graphics (TOG), vol. 28, Article no. 172. ACM (2009)
14.
Liu, C., Ishi, C.T., Ishiguro, H., Hagita, N.: Generation of nodding, head tilting and eye gazing for human-robot dialogue interaction. In: Human-Robot Interaction (HRI), pp. 285–292. IEEE (2012)
15.
Mariooryad, S., Busso, C.: Generating human-like behaviors using joint, speech-driven models for conversational agents. IEEE Trans. Audio Speech Lang. Process. 20(8), 2329–2340 (2012)
16.
May, T., Ma, N., Brown, G.J.: Robust localisation of multiple speakers exploiting head movements and multi-conditional training of binaural cues. In: Acoustics, Speech and Signal Processing (ICASSP), pp. 2679–2683. IEEE (2015)
17.
Mihoub, A., Bailly, G., Wolf, C., Elisei, F.: Graphical models for social behavior modeling in face-to-face interaction. Pattern Recogn. Lett. 74, 82–89 (2016)
18.
Munhall, K.G., Jones, J.A., Callan, D.E., Kuratate, T., Vatikiotis-Bateson, E.: Visual prosody and speech intelligibility: head movement improves auditory speech perception. Psychol. Sci. 15(2), 133–137 (2004)
19.
Nguyen, D.-C., Bailly, G., Elisei, F.: Conducting neuropsychological tests with a humanoid robot: design and evaluation. In: Cognitive Infocommunications (CogInfoCom), pp. 337–342. IEEE (2016)
20.
Nguyen, D.-C., Bailly, G., Elisei, F.: Learning off-line vs. on-line models of interactive multimodal behaviors with recurrent neural networks. Pattern Recognition Letters (PRL) (accepted with minor revision)
22.
Thórisson, K.R.: Natural turn-taking needs no manual: computational theory and model, from perception to action. In: Granström, B., House, D., Karlsson, I. (eds.) Multimodality in Language and Speech Systems. Text, Speech and Language Technology, vol. 19, pp. 173–207. Springer, Dordrecht (2002). https://doi.org/10.1007/978-94-017-2367-1_8
23.
Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., Sloetjes, H.: ELAN: a professional framework for multimodality research. In: International Conference on Language Resources and Evaluation (LREC) (2006)
24.
Yehia, H., Kuratate, T., Vatikiotis-Bateson, E.: Facial animation and head motion driven by speech acoustics. In: 5th Seminar on Speech Production: Models and Data, pp. 265–268. Kloster Seeon, Germany (2000)
25.
Wolpert, D.M., Doya, K., Kawato, M.: A unifying computational framework for motor control and social interaction. Philos. Trans. R. Soc. B Biol. Sci. 358(1431), 593–602 (2003)
Metadata
Title
Comparing Cascaded LSTM Architectures for Generating Head Motion from Speech in Task-Oriented Dialogs
Authors
Duc-Canh Nguyen
Gérard Bailly
Frédéric Elisei
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-91250-9_13