Abstract
Even as progress in speech technologies and task and dialog modeling has allowed the development of advanced spoken dialog systems, the low-level interaction behavior of those systems often remains rigid and inefficient. Based on an analysis of human-human and human-computer turn-taking in naturally occurring task-oriented dialogs, we define a set of features that can be automatically extracted and show that they can be used to inform efficient end-of-turn detection. We then frame turn-taking as decision making under uncertainty and describe the Finite-State Turn-Taking Machine (FSTTM), a decision-theoretic model that combines data-driven machine learning methods and a cost structure derived from Conversation Analysis to control the turn-taking behavior of dialog systems. Evaluation results on CMU Let's Go, a publicly deployed bus information system, confirm that the FSTTM significantly improves the responsiveness of the system compared to a standard threshold-based approach, as well as previous data-driven methods.
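The idea of treating turn-taking as decision making under uncertainty can be illustrated with a minimal sketch. It is not the FSTTM itself: the function names, the linear cost of silence, and the fixed cost of cutting in are hypothetical assumptions made for illustration, and the probability that the user has finished is assumed to come from some external classifier over features such as those the abstract mentions.

```python
def expected_costs(p_user_done, pause_s, gap_cost_per_s=1.0, cut_in_cost=1.0):
    """Return (cost_of_waiting, cost_of_speaking) for the current pause.

    p_user_done    -- assumed probability that the user has finished the turn
    pause_s        -- silence observed so far, in seconds
    gap_cost_per_s -- hypothetical cost per second of unnecessary silence
    cut_in_cost    -- hypothetical cost of interrupting a user who is not done
    """
    # Waiting is only costly if the user has in fact finished speaking.
    cost_wait = p_user_done * gap_cost_per_s * pause_s
    # Speaking is only costly if the user still holds the floor.
    cost_speak = (1.0 - p_user_done) * cut_in_cost
    return cost_wait, cost_speak


def should_take_turn(p_user_done, pause_s, **cost_params):
    """Take the floor when speaking has the lower expected cost."""
    cost_wait, cost_speak = expected_costs(p_user_done, pause_s, **cost_params)
    return cost_speak < cost_wait


if __name__ == "__main__":
    # Confident the user is done after half a second of silence.
    print(should_take_turn(0.9, 0.5))
    # Short pause and little evidence that the turn has ended.
    print(should_take_turn(0.2, 0.3))
```

Under these assumed costs, the longer a pause lasts and the more confident the system is that the user has finished, the more attractive taking the floor becomes. A fixed-threshold approach, by contrast, waits the same amount of time regardless of the evidence.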