Published in: International Journal of Speech Technology 2/2023

02.08.2022

Hybrid end-to-end model for Kazakh speech recognition

Authors: Orken Zh. Mamyrbayev, Dina O. Oralbekova, Keylan Alimhan, Bulbul M. Nuranbayeva


Abstract

Modern automatic speech recognition systems based on end-to-end (E2E) models achieve high recognition accuracy when trained on large corpora containing several thousand hours of speech. Such models require very large amounts of training data, which is problematic for low-resource languages like Kazakh. However, many studies have shown that combining connectionist temporal classification (CTC) with other E2E models improves system performance even with limited training data. To this end, a speech corpus of the Kazakh language was assembled and then expanded using data augmentation. This work presents a joint CTC and attention-mechanism model for Kazakh speech recognition, which addresses the problem of fast decoding and training of the system. The results demonstrate that the proposed E2E model, combined with language models, improved system performance and achieved the best result on our Kazakh dataset, yielding competitive accuracy in Kazakh speech recognition.
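The joint model described above trains a shared encoder against two objectives at once: a CTC branch and an attention-decoder branch, interpolated as L = λ·L_CTC + (1 − λ)·L_att. The following PyTorch sketch illustrates this multi-task loss; the tensor shapes, the weight lam = 0.3, and the padding convention are illustrative assumptions rather than the authors' exact configuration.

    import torch
    import torch.nn as nn

    class JointCTCAttentionLoss(nn.Module):
        """Multi-task loss: L = lam * L_ctc + (1 - lam) * L_att."""

        def __init__(self, blank: int = 0, lam: float = 0.3):
            super().__init__()
            self.lam = lam  # interpolation weight between the two branches
            self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
            self.att = nn.CrossEntropyLoss(ignore_index=-1)  # -1 = padding

        def forward(self, ctc_log_probs, enc_lens, dec_logits, targets, target_lens):
            # ctc_log_probs: (T, B, V) log-softmax outputs of the encoder's CTC head
            # dec_logits:    (B, U, V) attention-decoder outputs (pre-softmax)
            # targets:       (B, U) label ids, padded with -1
            loss_ctc = self.ctc(ctc_log_probs, targets.clamp(min=0),
                                enc_lens, target_lens)
            loss_att = self.att(dec_logits.reshape(-1, dec_logits.size(-1)),
                                targets.reshape(-1))
            return self.lam * loss_ctc + (1.0 - self.lam) * loss_att

    # Toy check: 2 utterances, 50 encoder frames, 30 output units
    criterion = JointCTCAttentionLoss(lam=0.3)
    ctc_log_probs = torch.randn(50, 2, 30).log_softmax(-1)  # stand-in model outputs
    dec_logits = torch.randn(2, 8, 30)
    targets = torch.randint(1, 30, (2, 8))                  # no blanks in targets
    loss = criterion(ctc_log_probs, torch.tensor([50, 50]),
                     dec_logits, targets, torch.tensor([8, 8]))
    print(loss.item())

In the joint CTC–attention literature, the weight λ is typically tuned on a development set; values around 0.2–0.3 are common starting points, with the CTC branch regularizing the attention decoder's alignments.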


Metadata
Title
Hybrid end-to-end model for Kazakh speech recognition
Authors
Orken Zh. Mamyrbayev
Dina O. Oralbekova
Keylan Alimhan
Bulbul M. Nuranbayeva
Publication date
02.08.2022
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2023
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-022-09983-8
