Published in: International Journal of Speech Technology, Issue 4/2016

4 August 2016

Stream fusion for multi-stream automatic speech recognition

Authors: Hesam Sagha, Feipeng Li, Ehsan Variani, José del R. Millán, Ricardo Chavarriaga, Björn Schuller


Abstract

Multi-stream automatic speech recognition (MS-ASR) has been shown to improve recognition performance in noisy conditions. In such a system, the generation and the fusion of the streams are the essential parts, and both need to be designed to reduce the effect of noise on the final decision. This paper shows how to improve the performance of MS-ASR by addressing two questions: (1) how many streams should be combined, and (2) how should they be combined. First, we propose a novel approach based on stream reliability to select the number of streams to be fused. Second, we introduce a fusion method based on Parallel Hidden Markov Models. Applying the method to two datasets (TIMIT and RATS) with different types of noise, we show an improvement in MS-ASR performance.
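The two questions in the abstract (how many streams to fuse, and how to fuse them) can be illustrated with a minimal sketch. It is not the method of the paper: the abstract does not specify how reliability is computed, and the paper's fusion uses Parallel Hidden Markov Models. The sketch below instead assumes that a reliability score per stream is already available, keeps the `k` most reliable streams, and fuses their per-class posteriors for a single frame with a reliability-weighted average. The function names `select_streams` and `fuse_posteriors`, the parameter `k`, and the example numbers are all hypothetical.

```python
from typing import List, Sequence


def select_streams(reliability: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k most reliable streams, in ascending order.

    Hypothetical selection rule: the reliability scores are assumed to be
    given; the abstract does not say how they are estimated.
    """
    if not 1 <= k <= len(reliability):
        raise ValueError("k must be between 1 and the number of streams")
    ranked = sorted(range(len(reliability)),
                    key=lambda i: reliability[i], reverse=True)
    return sorted(ranked[:k])


def fuse_posteriors(posteriors: Sequence[Sequence[float]],
                    selected: Sequence[int],
                    reliability: Sequence[float]) -> List[float]:
    """Fuse per-class posteriors of the selected streams for one frame.

    Uses a reliability-weighted average, a simple stand-in for the
    Parallel HMM fusion named in the abstract.
    """
    total = sum(reliability[i] for i in selected)
    if total <= 0:
        raise ValueError("selected streams must have positive total reliability")
    n_classes = len(posteriors[0])
    return [
        sum(reliability[i] * posteriors[i][c] for i in selected) / total
        for c in range(n_classes)
    ]


if __name__ == "__main__":
    # Three streams, two classes; all values are made up for illustration.
    reliability = [0.9, 0.2, 0.6]
    posteriors = [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]
    selected = select_streams(reliability, k=2)
    print(selected)
    print(fuse_posteriors(posteriors, selected, reliability))
```

In this toy setup the least reliable stream is excluded before fusion, so its confident but possibly noise-corrupted posterior does not influence the combined decision.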

Footnotes
1. We used the Quicknet toolbox developed at the International Computer Science Institute (http://www1.icsi.berkeley.edu/Speech/qn.html).
 
Metadata
Title
Stream fusion for multi-stream automatic speech recognition
Authors
Hesam Sagha
Feipeng Li
Ehsan Variani
José del R. Millán
Ricardo Chavarriaga
Björn Schuller
Publication date
4 August 2016
Publisher
Springer US
Published in
International Journal of Speech Technology, Issue 4/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-016-9357-1
