research-article

Optimizing the turn-taking behavior of task-oriented spoken dialog systems

Published: 16 May 2012

Abstract

Even as progress in speech technologies and task and dialog modeling has allowed the development of advanced spoken dialog systems, the low-level interaction behavior of those systems often remains rigid and inefficient. Based on an analysis of human-human and human-computer turn-taking in naturally occurring task-oriented dialogs, we define a set of features that can be automatically extracted and show that they can be used to inform efficient end-of-turn detection. We then frame turn-taking as decision making under uncertainty and describe the Finite-State Turn-Taking Machine (FSTTM), a decision-theoretic model that combines data-driven machine learning methods and a cost structure derived from Conversation Analysis to control the turn-taking behavior of dialog systems. Evaluation results on CMU Let's Go, a publicly deployed bus information system, confirm that the FSTTM significantly improves the responsiveness of the system compared to a standard threshold-based approach, as well as previous data-driven methods.
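
The framing of turn-taking as decision making under uncertainty can be illustrated with a minimal sketch. This is not the FSTTM itself: the two-action setup ("grab" the floor or "wait"), the cost names `interrupt` and `wait_per_sec`, and the linear waiting cost are all assumptions chosen for illustration, and the probability that the user has finished speaking is taken as a given input rather than estimated from the features the article describes.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical cost structure (illustrative only, not taken from the article)."""
    interrupt: float      # cost of speaking while the user still holds the floor
    wait_per_sec: float   # cost per second of silence once the user has finished


def expected_costs(p_user_done: float, pause_sec: float, costs: Costs) -> dict:
    """Expected cost of each action given the belief that the user is done."""
    # Grabbing the floor is only costly if the user was NOT actually done.
    grab = (1.0 - p_user_done) * costs.interrupt
    # Waiting is only costly if the user WAS done; assumed to grow linearly
    # with the length of the pause so far.
    wait = p_user_done * costs.wait_per_sec * pause_sec
    return {"grab": grab, "wait": wait}


def choose_action(p_user_done: float, pause_sec: float, costs: Costs) -> str:
    """Pick the action with the lower expected cost."""
    ec = expected_costs(p_user_done, pause_sec, costs)
    return "grab" if ec["grab"] < ec["wait"] else "wait"


if __name__ == "__main__":
    costs = Costs(interrupt=1.0, wait_per_sec=1.0)
    # Confident the user is done after a short pause.
    print(choose_action(p_user_done=0.9, pause_sec=0.5, costs=costs))
    # Unsure the user is done after the same pause.
    print(choose_action(p_user_done=0.2, pause_sec=0.5, costs=costs))
```

Under these assumed costs, a fixed silence threshold is replaced by a comparison that shifts with the system's belief: the more likely it is that the user has finished, the shorter the pause needed before responding.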

      • Published in

        ACM Transactions on Speech and Language Processing   Volume 9, Issue 1
        May 2012
        44 pages
        ISSN:1550-4875
        EISSN:1550-4883
        DOI:10.1145/2168748

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 16 May 2012
        • Accepted: 1 February 2012
        • Revised: 1 January 2012
        • Received: 1 September 2011

        Qualifiers

        • research-article
        • Research
        • Refereed
