Published in: International Journal of Speech Technology 2/2023

02.08.2022

Hybrid end-to-end model for Kazakh speech recognition

Authors: Orken Zh. Mamyrbayev, Dina O. Oralbekova, Keylan Alimhan, Bulbul M. Nuranbayeva


Abstract

Modern automatic speech recognition systems based on end-to-end (E2E) models achieve high recognition accuracy when trained on large corpora containing several thousand hours of speech. Such models require very large amounts of training data, which is problematic for low-resource languages like Kazakh. However, many studies have shown that combining connectionist temporal classification (CTC) with other E2E models improves system performance even with limited training data. To this end, a speech corpus of the Kazakh language was assembled and then expanded using data augmentation. This work presents a joint CTC and attention-mechanism model for Kazakh speech recognition, which addresses the problem of fast decoding and training of the system. The results demonstrate that the proposed E2E model, combined with language models, improved system performance and achieved the best result on our Kazakh dataset, yielding competitive accuracy in Kazakh speech recognition.
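The joint model described above trains a shared encoder against two objectives at once: a CTC branch and an attention-decoder branch, interpolated as L = λ·L_CTC + (1 − λ)·L_att. The following PyTorch sketch illustrates this multi-task loss; the tensor shapes, the weight lam = 0.3, and the padding convention are illustrative assumptions rather than the authors' exact configuration.

    import torch
    import torch.nn as nn

    class JointCTCAttentionLoss(nn.Module):
        """Multi-task loss: L = lam * L_ctc + (1 - lam) * L_att."""

        def __init__(self, blank: int = 0, lam: float = 0.3):
            super().__init__()
            self.lam = lam  # interpolation weight between the two branches
            self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
            self.att = nn.CrossEntropyLoss(ignore_index=-1)  # -1 = padding

        def forward(self, ctc_log_probs, enc_lens, dec_logits, targets, target_lens):
            # ctc_log_probs: (T, B, V) log-softmax outputs of the encoder's CTC head
            # dec_logits:    (B, U, V) attention-decoder outputs (pre-softmax)
            # targets:       (B, U) label ids, padded with -1
            loss_ctc = self.ctc(ctc_log_probs, targets.clamp(min=0),
                                enc_lens, target_lens)
            loss_att = self.att(dec_logits.reshape(-1, dec_logits.size(-1)),
                                targets.reshape(-1))
            return self.lam * loss_ctc + (1.0 - self.lam) * loss_att

    # Toy check: 2 utterances, 50 encoder frames, 30 output units
    criterion = JointCTCAttentionLoss(lam=0.3)
    ctc_log_probs = torch.randn(50, 2, 30).log_softmax(-1)  # stand-in model outputs
    dec_logits = torch.randn(2, 8, 30)
    targets = torch.randint(1, 30, (2, 8))                  # no blanks in targets
    loss = criterion(ctc_log_probs, torch.tensor([50, 50]),
                     dec_logits, targets, torch.tensor([8, 8]))
    print(loss.item())

In the joint CTC–attention literature, the weight λ is typically tuned on a development set; values around 0.2–0.3 are common starting points, with the CTC branch regularizing the attention decoder's alignments.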


Metadata
Title
Hybrid end-to-end model for Kazakh speech recognition
Authors
Orken Zh. Mamyrbayev
Dina O. Oralbekova
Keylan Alimhan
Bulbul M. Nuranbayeva
Publication date
02.08.2022
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2023
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-022-09983-8
