
Multimodal Fusion using Respiration and Gaze for Predicting Next Speaker in Multi-Party Meetings

Published: 9 November 2015
DOI: 10.1145/2818346.2820755

ABSTRACT

Techniques that use nonverbal behaviors to predict turn-taking in multi-party meetings, such as who the next speaker will be and when the next utterance will start, have been receiving much attention recently. It has long been known that gaze is a physical behavior that plays an important role in transferring the speaking turn between humans. More recently, a line of research has focused on the relationship between turn-taking and respiration, a biological signal that conveys information about the intention, or preliminary action, to start speaking. Respiration and gaze behavior have each been shown to have the potential to predict the next speaker and the next utterance timing in multi-party meetings. As a multimodal fusion for predicting the next speaker in multi-party meetings, we integrated respiration and gaze behavior, which come from different modalities and are qualitatively quite different, and implemented a model that uses both to predict the next speaker at the end of an utterance. The model uses two-step processing: the first step predicts whether turn-keeping or turn-taking will happen; the second predicts the next speaker when the turn is taken. We constructed prediction models using respiration alone, gaze behavior alone, and both together as features, and compared their performance. The results suggest that the model using both respiration and gaze behavior performs better than the models using either alone, showing that multimodal fusion of respiration and gaze behavior is effective for predicting the next speaker in multi-party meetings. We also found that gaze behavior is more useful than respiration for predicting turn-keeping versus turn-taking, whereas respiration is more useful for predicting the next speaker in turn-taking.
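The abstract gives no implementation details for the two-step scheme, so the following is a minimal sketch of how such a pipeline could look, assuming SVM classifiers and entirely synthetic data. The feature layout, the label encoding, and the helper predict_next_speaker are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the two-step next-speaker prediction scheme described
# above. SVM classifiers and all feature/label definitions are assumptions
# made for illustration; they are not the paper's actual features or data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical per-utterance feature vectors: respiration features (e.g.
# inhalation amplitude and timing) concatenated with gaze features (e.g.
# gaze-transition pattern indicators) for the participants.
n_samples, n_features = 200, 12
X = rng.normal(size=(n_samples, n_features))

# Step-1 labels: 0 = turn-keeping (speaker continues), 1 = turn-taking.
y_keep_take = rng.integers(0, 2, size=n_samples)
# Step-2 labels, meaningful only for turn-taking utterances: which of
# three listeners takes the floor.
y_next_speaker = rng.integers(0, 3, size=n_samples)

# Step 1: predict turn-keeping vs. turn-taking at the end of an utterance.
keep_take_clf = make_pipeline(StandardScaler(), SVC())
keep_take_clf.fit(X, y_keep_take)

# Step 2: among utterances where the turn changes, predict the next speaker.
taking = y_keep_take == 1
next_speaker_clf = make_pipeline(StandardScaler(), SVC())
next_speaker_clf.fit(X[taking], y_next_speaker[taking])

def predict_next_speaker(x):
    """Two-step prediction for a single end-of-utterance feature vector."""
    if keep_take_clf.predict(x[None, :])[0] == 0:
        return "turn-keeping: current speaker continues"
    return f"turn-taking: listener {next_speaker_clf.predict(x[None, :])[0]}"

print(predict_next_speaker(X[0]))
```

Comparing single-modality models with the fused model, as the abstract describes, would then amount to training the same pipeline on the respiration columns only, the gaze columns only, and their concatenation.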


Published in

ICMI '15: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction
November 2015, 678 pages
ISBN: 9781450339124
DOI: 10.1145/2818346

      Copyright © 2015 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


Acceptance Rates

ICMI '15 paper acceptance rate: 52 of 127 submissions (41%). Overall acceptance rate: 453 of 1,080 submissions (42%).
