
Context-based recognition during human interactions: automatic feature selection and encoding dictionary

Published: 20 October 2008

ABSTRACT

During face-to-face conversation, people use visual feedback such as head nods to communicate relevant information and to synchronize rhythm between participants. In this paper we describe how contextual information from other participants can be used to predict visual feedback and improve recognition of head gestures in human-human interactions. For example, in a dyadic interaction, contextual cues from the speaker, such as gaze shifts or changes in prosody, will influence the listener's backchannel feedback (e.g., a head nod). To automatically learn how to integrate this contextual information into the listener's gesture recognition framework, this paper addresses two main challenges: optimal feature representation using an encoding dictionary, and automatic selection of optimal feature-encoding pairs. Multimodal integration between context and visual observations is performed using a discriminative sequential model (Latent-Dynamic Conditional Random Fields) trained on previous interactions. In our experiments involving 38 storytelling dyads, our context-based recognizer significantly improved head gesture recognition performance over a vision-only recognizer.
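To make the pipeline described in the abstract concrete, the sketch below illustrates one plausible reading of the encoding-dictionary and feature-selection steps in Python. The encoding templates (binary, step, ramp), their widths, the correlation-based selection criterion, and the synthetic data are all illustrative assumptions made for this sketch; they are not the paper's actual dictionary, selection procedure, or LDCRF training.

```python
import numpy as np

# Illustrative encoding dictionary: each template turns a binary contextual
# cue (e.g., "speaker pauses") into a real-valued feature stream. The
# template names and widths are assumptions made for this sketch.

def encode_binary(cue):
    """Use the raw 0/1 cue directly as the feature."""
    return cue.astype(float)

def encode_step(cue, width=30):
    """Hold the feature at 1 for `width` frames after each cue onset."""
    out = np.zeros(len(cue))
    for t in np.flatnonzero(np.diff(np.r_[0, cue]) == 1):  # cue onsets
        out[t:t + width] = 1.0
    return out

def encode_ramp(cue, width=30):
    """Decay linearly from 1 to 0 over `width` frames after each onset."""
    out = np.zeros(len(cue))
    ramp = np.linspace(1.0, 0.0, width, endpoint=False)
    for t in np.flatnonzero(np.diff(np.r_[0, cue]) == 1):
        n = min(width, len(cue) - t)
        out[t:t + n] = np.maximum(out[t:t + n], ramp[:n])
    return out

ENCODING_DICTIONARY = {"binary": encode_binary,
                       "step": encode_step,
                       "ramp": encode_ramp}

def build_candidate_features(contextual_cues):
    """Apply every encoding template to every cue, producing one candidate
    feature per (cue, encoding) pair."""
    return {(cue_name, enc_name): enc(cue)
            for cue_name, cue in contextual_cues.items()
            for enc_name, enc in ENCODING_DICTIONARY.items()}

def select_feature_encoding_pairs(candidates, listener_labels, k=3):
    """Rank candidate pairs by absolute correlation with the listener's
    head-nod labels and keep the top k -- a simple stand-in for the
    paper's automatic selection step."""
    scores = {key: abs(np.nan_to_num(np.corrcoef(feat, listener_labels)[0, 1]))
              for key, feat in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy usage: 300 frames of synthetic speaker cues and listener nod labels.
rng = np.random.default_rng(0)
cues = {"speaker_pause": (rng.random(300) < 0.05).astype(int),
        "speaker_gaze_shift": (rng.random(300) < 0.05).astype(int)}
nods = (rng.random(300) < 0.10).astype(int)

candidates = build_candidate_features(cues)
selected = select_feature_encoding_pairs(candidates, nods)
print("Selected (cue, encoding) pairs:", selected)

# The selected contextual features would then be stacked with the visual
# (head-tracker) observations and passed to a sequential model such as an
# LDCRF for frame-level head-gesture recognition.
```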

Published in

ICMI '08: Proceedings of the 10th international conference on Multimodal interfaces
October 2008, 322 pages
ISBN: 9781605581989
DOI: 10.1145/1452392
Copyright © 2008 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

