DOI: 10.1145/3242969.3242997
short-paper

Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs

Published: 02 October 2018

ABSTRACT

In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions, it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which each modality should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that modeling linguistic and acoustic features at separate temporal rates can be beneficial for turn-taking modeling. We also show that our approach can be used to incorporate gaze features into turn-taking models.
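The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea of modeling two modalities at separate temporal rates, not the authors' implementation. It assumes a plain tanh RNN cell, untrained random weights, a hypothetical rate ratio of 5 fast frames per slow frame, and hypothetical names (`rnn_step`, `make_params`, `multiscale_forward`). A slow sub-network updates only on every `ratio`-th frame, its state is held in between, and a fusing network combines it with a fast sub-network's state to emit one prediction per fast frame.

```python
import math
import random


def init_matrix(rows, cols, rng):
    """Small random weight matrix as a list of rows (untrained, illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_step(x, h, w_x, w_h):
    """One step of a plain tanh RNN cell: h' = tanh(W_x x + W_h h)."""
    return [
        math.tanh(
            sum(w * v for w, v in zip(w_x[j], x))
            + sum(w * v for w, v in zip(w_h[j], h))
        )
        for j in range(len(h))
    ]


def make_params(fast_dim, slow_dim, hidden, rng):
    """Weights for a fast sub-network, a slow sub-network, and a fusing network."""
    return {
        "fast": (init_matrix(hidden, fast_dim, rng), init_matrix(hidden, hidden, rng)),
        "slow": (init_matrix(hidden, slow_dim, rng), init_matrix(hidden, hidden, rng)),
        "fuse": (init_matrix(hidden, 2 * hidden, rng), init_matrix(hidden, hidden, rng)),
        "out": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
    }


def multiscale_forward(fast_seq, slow_seq, ratio, params, hidden):
    """Return one prediction in (0, 1) per fast frame.

    slow_seq must hold at least ceil(len(fast_seq) / ratio) frames.
    """
    h_fast = [0.0] * hidden
    h_slow = [0.0] * hidden
    h_fuse = [0.0] * hidden
    predictions = []
    for t, x_fast in enumerate(fast_seq):
        if t % ratio == 0:
            # The slow modality is updated only at its own, coarser rate;
            # between updates its hidden state is simply held.
            h_slow = rnn_step(slow_seq[t // ratio], h_slow, *params["slow"])
        h_fast = rnn_step(x_fast, h_fast, *params["fast"])
        # Fuse the two states (list concatenation) at the fast rate.
        h_fuse = rnn_step(h_fast + h_slow, h_fuse, *params["fuse"])
        logit = sum(w * v for w, v in zip(params["out"], h_fuse))
        predictions.append(1.0 / (1.0 + math.exp(-logit)))
    return predictions


# Hypothetical toy data: 20 fast frames (3 features), 4 slow frames (2 features).
rng = random.Random(0)
HIDDEN, RATIO = 4, 5
params = make_params(fast_dim=3, slow_dim=2, hidden=HIDDEN, rng=rng)
fast_seq = [[rng.random() for _ in range(3)] for _ in range(20)]
slow_seq = [[rng.random() for _ in range(2)] for _ in range(4)]
predictions = multiscale_forward(fast_seq, slow_seq, RATIO, params, HIDDEN)
print(len(predictions))
```

In this sketch, a change to one slow frame can only affect predictions from the fast frame at which that slow frame is consumed onward, which is the property that lets each modality keep its own timescale while the output remains continuous.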

  • Published in

    ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
    October 2018
    687 pages
    ISBN: 9781450356923
    DOI: 10.1145/3242969

    Copyright © 2018 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    ICMI '18 Paper Acceptance Rate: 63 of 149 submissions, 42%. Overall Acceptance Rate: 453 of 1,080 submissions, 42%.
