Published in: International Journal of Speech Technology 4/2023

11.11.2023

End-to-end ASR framework for Indian-English accent: using speech CNN-based segmentation

Authors: Ghayas Ahmed, Aadil Ahmad Lawaye

Abstract

The performance of Automatic Speech Recognition (ASR) has improved significantly over time, with the focus shifting from short utterances to longer audio signals. In short utterances, speech endpoints are distinct, ensuring a good user experience. In long-form scenarios, however, these endpoints are less clear, leading to unnecessary resource consumption and undermining ASR's primary goal of generating highly readable, well-formatted transcriptions. In this study, we introduce an ASR framework tailored to the Indian English accent. We employ Speech Segment Endpoint Detection (SSED), built on Mel-spectrogram features, the short-time energy signal, and a hybrid Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) model. Our experiments on a 29-hour audio dataset of Indian English accented speech achieved strong results: the CNN-BiLSTM classification model for speech endpoint detection attained 98.67% training accuracy and 93.62% validation accuracy, and the resulting ASR system achieved a Word Error Rate (WER) of 11.63%. Notably, the segmentation model reduced the dataset length by 16.4%, making it a valuable contribution to ASR technology.
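The abstract rests on two classical building blocks that can be illustrated compactly: short-time energy (one cue the SSED front end uses for endpoint detection) and the Word Error Rate used to evaluate the final transcriptions. The sketch below is not the authors' implementation; frame length, hop size, and the energy threshold are illustrative assumptions, and the Mel-spectrogram/CNN-BiLSTM stages are omitted.

```python
def short_time_energy(signal, frame_len=400, hop=160):
    """Per-frame short-time energy: sum of squared samples in each frame.

    Defaults assume 16 kHz audio (25 ms frames, 10 ms hop) -- an
    illustrative choice, not taken from the paper.
    """
    return [sum(s * s for s in signal[i:i + frame_len])
            for i in range(0, max(len(signal) - frame_len + 1, 1), hop)]

def speech_frames(energies, threshold):
    """Crude per-frame speech/non-speech decision via an energy threshold."""
    return [e > threshold for e in energies]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

In a full SSED pipeline the thresholded energy would only be one input stream; the learned CNN-BiLSTM classifier makes the actual endpoint decision from Mel-spectrogram frames.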


Metadata
Title
End-to-end ASR framework for Indian-English accent: using speech CNN-based segmentation
Authors
Ghayas Ahmed
Aadil Ahmad Lawaye
Publication date
11.11.2023
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 4/2023
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-023-10053-w
