
28.08.2021

Exploiting variable length segments with coarticulation effect in online speech recognition based on deep bidirectional recurrent neural network and context-sensitive segment

Authors: Song-Il Mun, Chol-Jin Han, Hye-Song Hong

Published in: International Journal of Speech Technology | Issue 1/2022

Abstract

The deep bidirectional recurrent neural network (DBRNN) is a powerful acoustic model that can capture the dynamics and coarticulation effects of the speech signal. It can model temporal sequences that depend on both left and right contexts, whereas a deep unidirectional recurrent neural network can model only temporal sequences that depend on past information. When traditional DBRNNs are used for online speech recognition, context-sensitive segments of a carefully selected fixed length are exploited to balance recognition accuracy and latency, because decoding that depends on the whole input sequence at each evaluation incurs recognition latency. On the other hand, the acoustic realization of a phoneme depends not only on the preceding phoneme but also on the following phoneme, which should be taken into account in acoustic modeling for speech recognition. In this paper, we propose a DBRNN-based online speech recognition method that selects and exploits variable-length chunks to account for the coarticulation effects that appear in speech production. To select variable-length segments that preserve these coarticulation effects, the vowel identification points predicted by a deep unidirectional recurrent neural network are used, and the resulting variable-length segments are used to train the DBRNN for online recognition. The deep unidirectional recurrent neural network that predicts the variable-length segments is trained with the connectionist temporal classification (CTC) method. In experiments on Korean speech recognition, we show that the online DBRNN acoustic model built from variable-length chunks with coarticulation effect effectively limits recognition latency, achieves performance comparable to a traditional offline DBRNN, and outperforms online recognition based on fixed-length context-sensitive chunks.
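The segment-selection idea described in the abstract can be illustrated with a short sketch. The Python snippet below is only a minimal, hypothetical illustration, not the authors' implementation: it assumes per-frame label posteriors from the CTC-trained unidirectional network are already available, picks frames where a vowel label dominates as vowel identification points, and cuts the utterance into variable-length chunks bounded by those points, padded with a fixed number of context frames so coarticulation context is preserved at the boundaries. The label inventory, threshold, and context sizes are all assumptions.

```python
import numpy as np

VOWEL_IDS = {1, 2, 3, 4, 5}  # hypothetical indices of vowel labels (0 = CTC blank)

def vowel_identification_points(posteriors, threshold=0.5):
    """Pick one frame per run of vowel-dominant frames in the CTC posteriors.

    posteriors: (T, L) array of per-frame label probabilities from the
    unidirectional CTC network.  A frame counts as vowel-dominant when its
    best label is a vowel and its probability exceeds the threshold.
    """
    best = posteriors.argmax(axis=1)
    conf = posteriors.max(axis=1)
    points, prev_is_vowel = [], False
    for t, (lab, p) in enumerate(zip(best, conf)):
        is_vowel = (lab in VOWEL_IDS) and (p >= threshold)
        if is_vowel and not prev_is_vowel:
            points.append(t)
        prev_is_vowel = is_vowel
    return points

def variable_length_chunks(features, points, left_ctx=10, right_ctx=10):
    """Split an utterance into variable-length segments bounded by vowel
    identification points, padding each segment with a few context frames on
    both sides so the bidirectional model still sees cross-boundary
    coarticulation context."""
    num_frames = len(features)
    bounds = [0] + list(points) + [num_frames]
    chunks = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if end <= start:
            continue
        lo = max(0, start - left_ctx)
        hi = min(num_frames, end + right_ctx)
        chunks.append((start, end, features[lo:hi]))
    return chunks

# Toy usage with random stand-in data: 200 frames, 40-dim features, 6 labels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 40))
post = rng.random((200, 6))
post /= post.sum(axis=1, keepdims=True)
segs = variable_length_chunks(feats, vowel_identification_points(post))
print(len(segs), [(s, e) for s, e, _ in segs[:3]])
```

In the paper's setting, chunks produced this way would be used to train and decode the online DBRNN; the sketch only prints the segment boundaries.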

Metadata
Title
Exploiting variable length segments with coarticulation effect in online speech recognition based on deep bidirectional recurrent neural network and context-sensitive segment
Authors
Song-Il Mun
Chol-Jin Han
Hye-Song Hong
Publication date
28.08.2021
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2022
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-021-09885-1
