ABSTRACT
In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions, it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which each modality should be modeled. We design a multiscale RNN architecture that models modalities at separate timescales in a continuous manner. Our results show that modeling linguistic and acoustic features at separate temporal rates is beneficial for turn-taking modeling. We also show that our approach can be used to incorporate gaze features into turn-taking models.
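The core idea of the abstract — running different modalities through recurrent networks ticking at different rates and fusing their states for a continuous prediction — can be illustrated with a minimal sketch. This is not the authors' architecture: the dimensions, update rate, simple Elman cells, and random weights below are all illustrative assumptions standing in for a trained multiscale RNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(x, h, Wx, Wh, b):
    # Simple Elman RNN cell: h' = tanh(Wx x + Wh h + b)
    return np.tanh(Wx @ x + Wh @ h + b)

# Hypothetical setup: 40-d acoustic frames every 10 ms, 20-d linguistic
# features updated 10x more slowly (roughly word-rate).
T_fast, rate = 50, 10
d_ac, d_ling, h_fast, h_slow = 40, 20, 16, 8

acoustic = rng.standard_normal((T_fast, d_ac))
linguistic = rng.standard_normal((T_fast // rate, d_ling))

# Random weights stand in for trained parameters.
Wx_f = 0.1 * rng.standard_normal((h_fast, d_ac))
Wh_f = 0.1 * rng.standard_normal((h_fast, h_fast))
b_f = np.zeros(h_fast)
Wx_s = 0.1 * rng.standard_normal((h_slow, d_ling))
Wh_s = 0.1 * rng.standard_normal((h_slow, h_slow))
b_s = np.zeros(h_slow)
W_out = 0.1 * rng.standard_normal((1, h_fast + h_slow))

hf, hs = np.zeros(h_fast), np.zeros(h_slow)
preds = []
for t in range(T_fast):
    if t % rate == 0:
        # Slow (linguistic) RNN ticks once per `rate` acoustic frames.
        hs = rnn_step(linguistic[t // rate], hs, Wx_s, Wh_s, b_s)
    # Fast (acoustic) RNN ticks every frame; the slow state is held
    # constant between its updates and fused into the prediction.
    hf = rnn_step(acoustic[t], hf, Wx_f, Wh_f, b_f)
    logit = (W_out @ np.concatenate([hf, hs]))[0]
    preds.append(1.0 / (1.0 + np.exp(-logit)))  # speech-activity probability
```

The fusion point is the main design choice: holding the slow state between updates lets the model emit one continuous prediction per acoustic frame while still treating linguistic input at its natural, coarser timescale.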