
16.10.2020

Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation

Authors: Ankit Kumar, Rajesh Kumar Aggarwal

Published in: International Journal of Speech Technology | Issue 1/2022


Abstract

Building Automatic Speech Recognition (ASR) systems for low- and limited-resource languages is a pressing need. For the last two decades, Indian-language ASR systems have typically been built with statistical techniques such as Hidden Markov Models (HMMs). In this work, we select Time-Delay Neural Network (TDNN) based acoustic modeling with i-vector adaptation for limited-resource Hindi ASR. A TDNN can capture the extended temporal context of acoustic events; to reduce training time, we use a sub-sampled TDNN architecture. Further, data augmentation techniques are applied to extend the size of the training data developed by TIFR, Mumbai. The results show that data augmentation significantly improves the performance of the Hindi ASR system, and i-vector adaptation contributes an average improvement of approximately 4%. The best system achieves 89.9% accuracy with TDNN-based acoustic modeling and i-vector adaptation.
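
To make the modeling pipeline concrete, the sketch below shows how a sub-sampled TDNN can be realized as a stack of time-dilated 1-D convolutions, with a per-utterance i-vector appended to every input frame for speaker adaptation. This is a minimal PyTorch sketch for illustration only; the paper's models are built with the Kaldi toolkit, and all dimensions, splicing contexts, and layer counts here are assumptions rather than the authors' exact configuration.

```python
# Minimal PyTorch sketch of a sub-sampled TDNN acoustic model with
# i-vector adaptation. The paper uses Kaldi (nnet3); the feature
# dimensions, hidden sizes, and dilations below are illustrative.
import torch
import torch.nn as nn

class SubsampledTDNN(nn.Module):
    def __init__(self, feat_dim=40, ivector_dim=100, hidden_dim=512,
                 num_targets=3000):
        super().__init__()
        in_dim = feat_dim + ivector_dim  # i-vector appended per frame
        # Each layer is a 1-D convolution over time. Increasing dilation
        # realizes sub-sampled splicing contexts in the spirit of
        # Peddinti et al. (2015), e.g. {-1,0,1}, then sparser offsets.
        self.layers = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=3, dilation=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=2, dilation=2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=2, dilation=3),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=2, dilation=6),
            nn.ReLU(),
        )
        self.output = nn.Conv1d(hidden_dim, num_targets, kernel_size=1)

    def forward(self, feats, ivector):
        # feats:   (batch, time, feat_dim) acoustic features (e.g. MFCCs)
        # ivector: (batch, ivector_dim), one i-vector per utterance
        ivec = ivector.unsqueeze(2).expand(-1, -1, feats.size(1))
        x = torch.cat([feats.transpose(1, 2), ivec], dim=1)
        return self.output(self.layers(x))  # (batch, targets, time')

# Usage: one utterance of 200 frames of 40-dim features + a 100-dim i-vector.
model = SubsampledTDNN()
logits = model(torch.randn(1, 200, 40), torch.randn(1, 100))
```

Because the higher layers look at widely spaced frames rather than every frame, the network covers a long temporal context with far fewer activations to compute, which is the source of the training-time savings the abstract mentions.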

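Data augmentation can likewise be sketched briefly. The abstract does not name the specific techniques used, so the example below assumes the common three-way speed-perturbation recipe (factors 0.9, 1.0, 1.1) popularized in Kaldi setups; warping each waveform in time triples the effective amount of training audio.

```python
# Sketch of 3-way speed perturbation (an assumed recipe, not confirmed
# by the abstract). Speed factor a warps the waveform x(t) -> x(a*t);
# resampling by 1/a while keeping the original sample rate changes
# both tempo and pitch.
import numpy as np
from scipy.signal import resample_poly

def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Return the waveform warped by the given speed factor."""
    # resample_poly needs an integer up/down ratio: 1/0.9 = 10/9, etc.
    up, down = {0.9: (10, 9), 1.0: (1, 1), 1.1: (10, 11)}[factor]
    return resample_poly(waveform, up, down)

# Usage: one second of audio at 16 kHz becomes three training copies.
x = np.random.randn(16000).astype(np.float32)
augmented = [speed_perturb(x, f) for f in (0.9, 1.0, 1.1)]
print([len(a) for a in augmented])  # approximately [17778, 16000, 14546]
```
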

Metadata
Title
Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation
Authors
Ankit Kumar
Rajesh Kumar Aggarwal
Publication date
16.10.2020
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2022
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-020-09757-0
