ABSTRACT
During face-to-face conversation, people use visual feedback such as head nods to communicate relevant information and to synchronize rhythm between participants. In this paper, we describe how contextual information from other participants can be used to predict visual feedback and improve recognition of head gestures in human-human interactions. For example, in a dyadic interaction, the speaker's contextual cues, such as gaze shifts or changes in prosody, influence the listener's backchannel feedback (e.g., head nods). To automatically learn how to integrate this contextual information into the listener gesture recognition framework, this paper addresses two main challenges: finding an optimal feature representation using an encoding dictionary, and automatically selecting the optimal feature-encoding pairs. Multimodal integration between context and visual observations is performed using a discriminative sequential model (Latent-Dynamic Conditional Random Fields) trained on previous interactions. In our experiments involving 38 storytelling dyads, our context-based recognizer significantly improved head gesture recognition performance over a vision-only recognizer.
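To make the encoding-dictionary idea concrete, below is a minimal Python sketch of how speaker contextual events (e.g., pauses or gaze shifts) could be turned into candidate feature-encoding pairs. The template names, parameters, and helper functions are illustrative assumptions, not the paper's actual dictionary or selection procedure.

import numpy as np

# Hypothetical encoding dictionary: each template maps a binary contextual
# event stream (e.g., speaker pauses, gaze shifts) to a real-valued feature
# stream aligned frame-by-frame with the visual observations.

def binary_encoding(events):
    # 1 while the contextual event is active, 0 otherwise
    return events.astype(float)

def step_encoding(events, width=15, delay=0):
    # hold a constant value for `width` frames after each event onset,
    # optionally shifted by `delay` frames
    encoded = np.zeros(len(events))
    onsets = np.flatnonzero(np.diff(np.concatenate(([0], events))) == 1)
    for t in onsets:
        start = min(len(events), t + delay)
        encoded[start:start + width] = 1.0
    return encoded

def ramp_encoding(events, width=30, delay=0):
    # linearly decaying influence after each event onset
    encoded = np.zeros(len(events))
    onsets = np.flatnonzero(np.diff(np.concatenate(([0], events))) == 1)
    for t in onsets:
        start = min(len(events), t + delay)
        for i in range(width):
            if start + i < len(events):
                encoded[start + i] = max(encoded[start + i], 1.0 - i / width)
    return encoded

ENCODING_DICTIONARY = {
    "binary": binary_encoding,
    "step": step_encoding,
    "ramp": ramp_encoding,
}

def encode_context(contextual_events):
    # Apply every encoding template to every contextual event stream,
    # producing the candidate feature-encoding pairs to select from.
    features = {}
    for name, events in contextual_events.items():
        for enc_name, enc_fn in ENCODING_DICTIONARY.items():
            features[(name, enc_name)] = enc_fn(np.asarray(events))
    return features

# Example: a 100-frame sequence where the speaker pauses around frame 40.
context = {"speaker_pause": np.zeros(100, dtype=int)}
context["speaker_pause"][40:45] = 1
candidate_features = encode_context(context)  # three feature-encoding pairs

In this sketch, the selected feature streams would then be concatenated frame-by-frame with the visual head-tracking observations and passed to a sequential model such as an LDCRF; the paper's actual selection criterion and model training are not reproduced here.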