Published in: International Journal of Speech Technology 4/2023

11.11.2023

End-to-end ASR framework for Indian-English accent: using speech CNN-based segmentation

Authors: Ghayas Ahmed, Aadil Ahmad Lawaye

Abstract

The performance of Automatic Speech Recognition (ASR) has improved significantly over time, with the focus shifting from short utterances to longer audio signals. In short utterances, speech endpoints are distinct, ensuring a good user experience. In long-form scenarios, however, these endpoints are less clear, leading to unnecessary resource consumption and undermining ASR's primary goal of generating highly readable, well-formatted transcriptions. In this study, we introduce an ASR framework tailored to the Indian English accent. We employ Speech Segment Endpoint Detection (SSED), built on Mel-spectrogram features, the short-time energy signal, and a hybrid Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) model. Our experiments on a 29-hour audio dataset of Indian English accented speech achieved strong results: the CNN-BiLSTM classification model for speech endpoint detection attained 98.67% training accuracy and 93.62% validation accuracy, and the resulting ASR system achieved a Word Error Rate (WER) of 11.63%. Notably, the segmentation model reduced the dataset length by 16.4%, making it a valuable contribution to ASR technology.
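The abstract rests on two classical building blocks that can be illustrated compactly: short-time energy (one cue the SSED front end uses for endpoint detection) and the Word Error Rate used to evaluate the final transcriptions. The sketch below is not the authors' implementation; frame length, hop size, and the energy threshold are illustrative assumptions, and the Mel-spectrogram/CNN-BiLSTM stages are omitted.

```python
def short_time_energy(signal, frame_len=400, hop=160):
    """Per-frame short-time energy: sum of squared samples in each frame.

    Defaults assume 16 kHz audio (25 ms frames, 10 ms hop) -- an
    illustrative choice, not taken from the paper.
    """
    return [sum(s * s for s in signal[i:i + frame_len])
            for i in range(0, max(len(signal) - frame_len + 1, 1), hop)]

def speech_frames(energies, threshold):
    """Crude per-frame speech/non-speech decision via an energy threshold."""
    return [e > threshold for e in energies]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

In a full SSED pipeline the thresholded energy would only be one input stream; the learned CNN-BiLSTM classifier makes the actual endpoint decision from Mel-spectrogram frames.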


Metadata
Title
End-to-end ASR framework for Indian-English accent: using speech CNN-based segmentation
Authors
Ghayas Ahmed
Aadil Ahmad Lawaye
Publication date
11.11.2023
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 4/2023
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-023-10053-w
