ABSTRACT
This paper presents a novel audio-visual approach to unsupervised speaker localization. Using recordings from a single low-resolution room-overview camera and a single far-field microphone, a state-of-the-art audio-only speaker localization system (traditionally called speaker diarization) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically determines the number of speakers and estimates "who spoke when"; in a second step, the visual models are used to infer the locations of the speakers in the video. The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system exploits audio-visual integration not only to improve the accuracy of a state-of-the-art (audio-only) speaker diarization system, but also to add visual speaker localization at little incremental engineering and computational cost.
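The two-step pipeline described above — diarize the audio first, then use the resulting speaker turns to locate speakers in the video — can be illustrated with a minimal sketch. This is not the paper's joint-optimization method; it is a simplified, hypothetical association step that links each diarized speaker to the video region whose visual activity best correlates with that speaker's speech activity. All function names and data are illustrative.

```python
# Hypothetical sketch (not the paper's implementation): given per-frame
# speech-activity labels from a diarization system and per-region visual
# activity (e.g., block motion) from a single overview camera, assign each
# speaker to the region whose activity best correlates with their speech.

def associate_speakers_with_regions(speech, motion):
    """speech: {speaker: [0/1 speech activity per frame]}
    motion: {region: [visual activity per frame]}
    Returns {speaker: best-matching region} by Pearson correlation."""
    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = sum((x - ma) ** 2 for x in a) ** 0.5
        sb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (sa * sb) if sa and sb else 0.0
    return {spk: max(motion, key=lambda r: corr(act, motion[r]))
            for spk, act in speech.items()}

# Toy example: speaker A talks while the left region moves, B while the right.
speech = {"A": [1, 1, 0, 0, 1, 0], "B": [0, 0, 1, 1, 0, 1]}
motion = {"left":  [0.9, 0.8, 0.1, 0.2, 0.7, 0.1],
          "right": [0.1, 0.2, 0.9, 0.8, 0.2, 0.9]}
print(associate_speakers_with_regions(speech, motion))  # {'A': 'left', 'B': 'right'}
```

In the paper's actual system the visual models are estimated jointly with the acoustic models during the unsupervised optimization, rather than in a separate correlation pass as sketched here.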