ABSTRACT
One of the many skills required to engage properly in a conversation is knowing how to apply the rules of engagement appropriately. A virtual human or robot should, for instance, be able to tell when it is being addressed or when the speaker is about to hand over the turn. This paper presents a multimodal approach to end-of-speaker-turn prediction that uses sequential probabilistic models (Conditional Random Fields) to learn a model from observations of real-life multi-party meetings. Although the results are not as good as expected, we draw on the literature and our own results to provide insight into which modalities are important when taking a multimodal approach to the problem.
Index Terms
- Multimodal end-of-turn prediction in multi-party meetings