
28.08.2021

Exploiting variable length segments with coarticulation effect in online speech recognition based on deep bidirectional recurrent neural network and context-sensitive segment

Authors: Song-Il Mun, Chol-Jin Han, Hye-Song Hong

Published in: International Journal of Speech Technology | Issue 1/2022

Abstract

The deep bidirectional recurrent neural network (DBRNN) is a powerful acoustic model that can capture the dynamics and coarticulation effects of the speech signal. It can model temporal sequences that depend on both left and right contexts, whereas a deep unidirectional recurrent neural network can model only temporal sequences that depend on past information. When traditional DBRNNs are used for online speech recognition, context-sensitive segments of a carefully selected fixed length are exploited to balance recognition accuracy and latency, because decoding that depends on the whole input sequence at each evaluation incurs recognition latency. On the other hand, the acoustic realization of a phoneme depends not only on the preceding phoneme but also on the following phoneme, which should be taken into account in acoustic modeling for speech recognition. In this paper, we propose a DBRNN-based online speech recognition method that selects and exploits variable-length chunks to account for the coarticulation effects that appear in speech production. To select variable-length segments that preserve these coarticulation effects, the vowel identification points predicted by a deep unidirectional recurrent neural network are used, and the resulting variable-length segments are used to train the DBRNN for online recognition. The deep unidirectional recurrent neural network that predicts the variable-length segments is trained with the connectionist temporal classification (CTC) method. In experiments on Korean speech recognition, we show that the online DBRNN acoustic model built from variable-length chunks with coarticulation effect effectively limits recognition latency, achieves performance comparable to a traditional offline DBRNN, and outperforms online recognition based on fixed-length context-sensitive chunks.
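The segment-selection idea described in the abstract can be illustrated with a short sketch. The Python snippet below is only a minimal, hypothetical illustration, not the authors' implementation: it assumes per-frame label posteriors from the CTC-trained unidirectional network are already available, picks frames where a vowel label dominates as vowel identification points, and cuts the utterance into variable-length chunks bounded by those points, padded with a fixed number of context frames so coarticulation context is preserved at the boundaries. The label inventory, threshold, and context sizes are all assumptions.

```python
import numpy as np

VOWEL_IDS = {1, 2, 3, 4, 5}  # hypothetical indices of vowel labels (0 = CTC blank)

def vowel_identification_points(posteriors, threshold=0.5):
    """Pick one frame per run of vowel-dominant frames in the CTC posteriors.

    posteriors: (T, L) array of per-frame label probabilities from the
    unidirectional CTC network.  A frame counts as vowel-dominant when its
    best label is a vowel and its probability exceeds the threshold.
    """
    best = posteriors.argmax(axis=1)
    conf = posteriors.max(axis=1)
    points, prev_is_vowel = [], False
    for t, (lab, p) in enumerate(zip(best, conf)):
        is_vowel = (lab in VOWEL_IDS) and (p >= threshold)
        if is_vowel and not prev_is_vowel:
            points.append(t)
        prev_is_vowel = is_vowel
    return points

def variable_length_chunks(features, points, left_ctx=10, right_ctx=10):
    """Split an utterance into variable-length segments bounded by vowel
    identification points, padding each segment with a few context frames on
    both sides so the bidirectional model still sees cross-boundary
    coarticulation context."""
    num_frames = len(features)
    bounds = [0] + list(points) + [num_frames]
    chunks = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if end <= start:
            continue
        lo = max(0, start - left_ctx)
        hi = min(num_frames, end + right_ctx)
        chunks.append((start, end, features[lo:hi]))
    return chunks

# Toy usage with random stand-in data: 200 frames, 40-dim features, 6 labels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 40))
post = rng.random((200, 6))
post /= post.sum(axis=1, keepdims=True)
segs = variable_length_chunks(feats, vowel_identification_points(post))
print(len(segs), [(s, e) for s, e, _ in segs[:3]])
```

In the paper's setting, chunks produced this way would be used to train and decode the online DBRNN; the sketch only prints the segment boundaries.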

Metadata
Title
Exploiting variable length segments with coarticulation effect in online speech recognition based on deep bidirectional recurrent neural network and context-sensitive segment
Authors
Song-Il Mun
Chol-Jin Han
Hye-Song Hong
Publication date
28.08.2021
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2022
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-021-09885-1
