DOI: 10.1145/1631272.1631301

Research Article

Visual speaker localization aided by acoustic models

Published: 19 October 2009

ABSTRACT

This paper presents a novel audio-visual approach to unsupervised speaker locationing. Using recordings from a single, low-resolution room-overview camera and a single far-field microphone, a state-of-the-art audio-only speaker localization system (traditionally called speaker diarization) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically determines the number of speakers and estimates "who spoke when"; in a second step, the visual models are used to infer the location of the speakers in the video. The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system exploits audio-visual integration not only to improve the accuracy of a state-of-the-art (audio-only) speaker diarization system, but also to add visual speaker locationing at little incremental engineering and computation cost.
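The two-step pipeline the abstract describes can be illustrated with a minimal sketch: step 1 (audio-only diarization) is stood in for by a precomputed per-frame speaker label, and step 2 matches each speaker's speech/non-speech pattern against per-region visual activity to infer where that speaker sits. All data, the region grid, and the `localize` helper are synthetic illustrations, not the paper's actual models.

```python
# Hedged sketch of the abstract's pipeline: (1) diarization gives
# "who spoke when"; (2) visual activity is correlated with each
# speaker's talking pattern to infer location. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_regions = 1000, 4

# Step 1 (stand-in): assume diarization already produced a per-frame
# speaker label (0..2); a real system would cluster acoustic features.
speaker = rng.integers(0, 3, size=n_frames)

# Synthetic visual activity: each speaker occupies one image region and
# shows more motion while talking; region 3 is empty background.
activity = rng.normal(0.0, 1.0, size=(n_frames, n_regions))
for s in range(3):
    activity[speaker == s, s] += 2.0  # talking adds visible motion

def localize(speaker_labels, activity):
    """For each speaker, pick the region whose activity correlates
    best with that speaker's speech/non-speech indicator."""
    locations = {}
    for s in np.unique(speaker_labels):
        talking = (speaker_labels == s).astype(float)
        corrs = [np.corrcoef(talking, activity[:, r])[0, 1]
                 for r in range(activity.shape[1])]
        locations[int(s)] = int(np.argmax(corrs))
    return locations

print(localize(speaker, activity))  # -> {0: 0, 1: 1, 2: 2}
```

Because talking and visible motion co-occur in the synthetic data, each speaker is correctly mapped to their own region, which is the intuition behind inferring location from acoustic models.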


Published in

MM '09: Proceedings of the 17th ACM International Conference on Multimedia
October 2009, 1202 pages
ISBN: 9781605586083
DOI: 10.1145/1631272

Copyright © 2009 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
