research-article

Optimizing the turn-taking behavior of task-oriented spoken dialog systems

Authors:
Antoine Raux

Honda Research Institute USA, Mountain View, CA

Maxine Eskenazi

Carnegie Mellon University, Pittsburgh, PA

ACM Transactions on Speech and Language Processing, Volume 9, Issue 1, Article No. 1, pp. 1–23. https://doi.org/10.1145/2168748.2168749

Published: 16 May 2012

Abstract

Even as progress in speech technologies and task and dialog modeling has allowed the development of advanced spoken dialog systems, the low-level interaction behavior of those systems often remains rigid and inefficient. Based on an analysis of human-human and human-computer turn-taking in naturally occurring task-oriented dialogs, we define a set of features that can be automatically extracted and show that they can be used to inform efficient end-of-turn detection. We then frame turn-taking as decision making under uncertainty and describe the Finite-State Turn-Taking Machine (FSTTM), a decision-theoretic model that combines data-driven machine learning methods and a cost structure derived from Conversation Analysis to control the turn-taking behavior of dialog systems. Evaluation results on CMU Let's Go, a publicly deployed bus information system, confirm that the FSTTM significantly improves the responsiveness of the system compared to a standard threshold-based approach, as well as previous data-driven methods.
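
To make the contrast between a threshold-based approach and decision making under uncertainty concrete, here is a minimal sketch. It is not the FSTTM as published: the abstract gives no equations, so the cost structure, the class `Costs`, the function names, the 0.7 s default threshold, and all numeric values are assumptions chosen only for illustration. The sketch assumes some upstream classifier supplies `p_turn_ended`, the estimated probability that the user has finished speaking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical cost structure (illustrative values, not from the article)."""
    cut_in: float        # cost of speaking while the user still holds the floor
    gap_per_sec: float   # cost per second of silence after the user has finished


def threshold_endpoint(pause_sec: float, threshold_sec: float = 0.7) -> bool:
    """Baseline: take the floor once the pause reaches a fixed threshold."""
    return pause_sec >= threshold_sec


def expected_cost_endpoint(p_turn_ended: float, pause_sec: float, costs: Costs) -> bool:
    """Take the floor when waiting is expected to cost more than speaking.

    Assumed model: speaking risks a cut-in with probability (1 - p_turn_ended);
    waiting incurs a latency cost that grows linearly with the elapsed pause
    if the user has in fact finished.
    """
    cost_of_speaking = (1.0 - p_turn_ended) * costs.cut_in
    cost_of_waiting = p_turn_ended * costs.gap_per_sec * pause_sec
    return cost_of_waiting > cost_of_speaking


if __name__ == "__main__":
    costs = Costs(cut_in=1.0, gap_per_sec=1.0)
    # Confident end of turn after a short pause: the cost-based rule responds
    # early, while the fixed threshold keeps waiting.
    print(expected_cost_endpoint(0.9, 0.2, costs), threshold_endpoint(0.2))
    # Uncertain end of turn after a longer pause: the cost-based rule holds back.
    print(expected_cost_endpoint(0.2, 0.5, costs), threshold_endpoint(0.5))
```

Under these assumed costs, the cost-based rule can respond sooner than a fixed threshold when the evidence for an end of turn is strong, and wait longer when it is weak, which is the kind of responsiveness trade-off the abstract describes.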

Index Terms

Optimizing the turn-taking behavior of task-oriented spoken dialog systems
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Human-centered computing
  1. Human computer interaction (HCI)

Published in

ACM Transactions on Speech and Language Processing Volume 9, Issue 1
May 2012
44 pages
ISSN:1550-4875
EISSN:1550-4883
DOI:10.1145/2168748

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 May 2012
- Accepted: 1 February 2012
- Revised: 1 January 2012
- Received: 1 September 2011
Author Tags
Spoken dialog systems
machine learning
turn-taking
Qualifiers
- research-article
- Research
- Refereed