ABSTRACT
This paper presents a novel audio-visual approach to unsupervised speaker localization. Using recordings from a single low-resolution room-overview camera and a single far-field microphone, a state-of-the-art audio-only speaker localization system (traditionally called speaker diarization) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically determines the number of speakers and estimates "who spoke when"; in a second step, the visual models are used to infer the locations of the speakers in the video. The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system exploits audio-visual integration not only to improve the accuracy of a state-of-the-art (audio-only) speaker diarization system, but also to add visual speaker localization at little incremental engineering and computational cost.
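The two-step pipeline described above — diarize the audio first, then use the resulting speaker turns to locate speakers in the video — can be illustrated with a minimal sketch. This is not the paper's joint-optimization method; it is a simplified, hypothetical association step that links each diarized speaker to the video region whose visual activity best correlates with that speaker's speech activity. All function names and data are illustrative.

```python
# Hypothetical sketch (not the paper's implementation): given per-frame
# speech-activity labels from a diarization system and per-region visual
# activity (e.g., block motion) from a single overview camera, assign each
# speaker to the region whose activity best correlates with their speech.

def associate_speakers_with_regions(speech, motion):
    """speech: {speaker: [0/1 speech activity per frame]}
    motion: {region: [visual activity per frame]}
    Returns {speaker: best-matching region} by Pearson correlation."""
    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = sum((x - ma) ** 2 for x in a) ** 0.5
        sb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (sa * sb) if sa and sb else 0.0
    return {spk: max(motion, key=lambda r: corr(act, motion[r]))
            for spk, act in speech.items()}

# Toy example: speaker A talks while the left region moves, B while the right.
speech = {"A": [1, 1, 0, 0, 1, 0], "B": [0, 0, 1, 1, 0, 1]}
motion = {"left":  [0.9, 0.8, 0.1, 0.2, 0.7, 0.1],
          "right": [0.1, 0.2, 0.9, 0.8, 0.2, 0.9]}
print(associate_speakers_with_regions(speech, motion))  # {'A': 'left', 'B': 'right'}
```

In the paper's actual system the visual models are estimated jointly with the acoustic models during the unsupervised optimization, rather than in a separate correlation pass as sketched here.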