Simultaneous Recognition of Distant-Talking Speech of Multiple Talkers Based on the 3-D N-Best Search Method

Heracleous, Panikos; Nakamura, Satoshi; Shikano, Kiyohiro

doi:10.1023/B:VLSI.0000015090.87686.bd

Published: 01 February 2004

Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, Volume 36, pages 105–116 (2004)

Panikos Heracleous¹,², Satoshi Nakamura¹ & Kiyohiro Shikano²


Abstract

This paper describes a novel method for hands-free speech recognition, and in particular for the simultaneous recognition of distant-talking speech from multiple sound sources (talkers or noise sources). The method extends the 3-D Viterbi search to a 3-D N-best search so that multiple talkers can be recognized simultaneously. The baseline system integrates two existing technologies, the 3-D Viterbi search and the conventional N-best search, into a complete system. Initial evaluation of this baseline showed, however, that new ideas were needed to recognize multiple sound sources simultaneously. Two factors were found to play an important role in system performance: the different likelihood ranges of the talkers and the direction-based separation of the hypotheses. First, hypotheses originating from different talkers must be compared with one another, but the talkers' likelihoods have different dynamic ranges, so the comparison is not accurate. Second, the talkers are located in different directions, so separating the hypotheses by direction provides an efficient means of accurate recognition. To address these problems, we added a likelihood normalization technique and a path distance-based clustering technique to the baseline 3-D N-best search-based system. We evaluated the system in experiments on recognizing the distant-talking speech of two talkers, using simulated data with time delay only and reverberant data, both simulated and real. For reverberant environments, we report results at several reverberation times as well as results obtained in a real environment. The experiments showed that implementing the two techniques produced significant improvements.
The best results for simulated data were obtained by implementing both techniques and using a microphone array of 32 channels. In that case, the Simultaneous Word Accuracy (the rate at which both talkers are correctly recognized simultaneously) was 72.49% in the ‘top 1’ hypothesis and 86.25% in the ‘top 3’ hypotheses, which are very promising results.
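As an illustration of the evaluation metric only, the following minimal sketch shows one way a Simultaneous Word Accuracy over the top-N hypotheses could be computed. It assumes that, after the hypotheses are separated by direction, each of the two talkers has its own ranked list of word hypotheses, and that a test case counts as correct only when both reference words appear within the first N entries of their respective lists. The function name, the data layout, and the toy word lists are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def simultaneous_word_accuracy(
    references: Sequence[Tuple[str, str]],
    hypotheses: Sequence[Tuple[List[str], List[str]]],
    top_n: int,
) -> float:
    """Percentage of test cases in which BOTH talkers are recognized.

    references: one (word of talker A, word of talker B) pair per test case.
    hypotheses: one (ranked list for A, ranked list for B) pair per test case,
                best hypothesis first. This per-talker layout is an assumption.
    top_n:      how many hypotheses per talker are inspected.
    """
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    if not references or len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must be non-empty and aligned")

    hits = 0
    for (ref_a, ref_b), (hyps_a, hyps_b) in zip(references, hypotheses):
        # A case is a hit only if both talkers are correct at the same time.
        if ref_a in hyps_a[:top_n] and ref_b in hyps_b[:top_n]:
            hits += 1
    return 100.0 * hits / len(references)


# Toy data (hypothetical): four simultaneous utterances by two talkers.
refs = [("tokyo", "osaka"), ("kyoto", "nara"), ("kobe", "nagoya"), ("sendai", "fukuoka")]
hyps = [
    (["tokyo", "kyoto"], ["osaka", "nara"]),           # both correct at rank 1
    (["tokyo", "kyoto"], ["nara", "osaka"]),           # talker A correct only at rank 2
    (["kobe", "kyoto"], ["osaka", "nara", "nagoya"]),  # talker B correct only at rank 3
    (["tokyo", "osaka"], ["fukuoka"]),                 # talker A never correct
]

top1 = simultaneous_word_accuracy(refs, hyps, top_n=1)
top3 = simultaneous_word_accuracy(refs, hyps, top_n=3)
print(f"top 1: {top1:.2f}%  top 3: {top3:.2f}%")
```

Because a hit requires both talkers to be correct, this measure can never exceed the accuracy of either talker alone, which is why widening the list from the top 1 to the top 3 hypotheses raises it.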



Author information

Authors and Affiliations

ATR Spoken Language Translation Research Labs, 2-2-2 Hikaridai Seika-Cho Soraku-gun, Kyoto, 619-0288, Japan
Panikos Heracleous & Satoshi Nakamura
Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Ikoma Takayama, Nara, 630-0101, Japan
Panikos Heracleous & Kiyohiro Shikano


About this article

Cite this article

Heracleous, P., Nakamura, S. & Shikano, K. Simultaneous Recognition of Distant-Talking Speech of Multiple Talkers Based on the 3-D N-Best Search Method. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 36, 105–116 (2004). https://doi.org/10.1023/B:VLSI.0000015090.87686.bd

Issue Date: February 2004
