
2022 | Original Paper | Book Chapter

Investigation of CNN-Based Acoustic Modeling for Continuous Hindi Speech Recognition

Authors: Tripti Choudhary, Atul Bansal, Vishal Goyal

Published in: IoT and Analytics for Sensor Networks

Publisher: Springer Singapore


Abstract

Recently, Convolutional Neural Networks (CNNs) have gained popularity over hybrid acoustic models based on Deep Neural Networks (DNNs) and Hidden Markov Models (HMMs). CNNs are well suited to speech signals, which makes them an appropriate choice for Automatic Speech Recognition (ASR) systems. Sparse connectivity, weight sharing, and pooling allow a CNN to tolerate slight positional shifts in the frequency domain, a property that helps manage speaker and environment variations. Although CNNs work well for speech recognition, they have not been thoroughly examined for Hindi speech recognition. Activation functions and optimization techniques play a vital role in enabling a CNN to achieve high accuracy. In this work, we investigate the impact of various activation functions and optimization techniques on a Hindi ASR system. All experiments were performed on the Hindi speech dataset developed by TIFR, using the Kaldi and PyTorch-Kaldi toolkits. The results show that the ELU activation function combined with RMSprop optimization yields the best Word Error Rate (WER), 14.56%.
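The two quantities at the heart of the abstract, the ELU activation and the Word Error Rate, have standard textbook definitions. The sketch below is not taken from the chapter itself; the function names and the toy word sequences are illustrative only, and the WER routine is the usual word-level Levenshtein-distance formulation.

```python
import math

def elu(x, alpha=1.0):
    """ELU activation: identity for positive inputs,
    smooth exponential saturation toward -alpha below zero."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def wer(reference, hypothesis):
    """Word Error Rate = (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

# One substitution and one deletion against a 4-word reference: WER = 2/4 = 0.5
print(wer("speech recognition for hindi", "speech recognitions for"))
```

The reported 14.56% WER corresponds to this ratio computed over the full TIFR test transcripts rather than a single utterance.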


References
1. Dahl, G. E., Yu, D., Deng, L., & Acero, A. (2011). Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 30–42.
2. Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–97.
3. Mohamed, A. R., Dahl, G., & Hinton, G. (2009). Deep belief networks for phone recognition. In Proceedings of the NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 1(9), 39.
4. Passricha, V., & Aggarwal, R. K. (2019). A hybrid of deep CNN and bidirectional LSTM for automatic speech recognition. Journal of Intelligent Systems, 29(1), 1261–1274.
5. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249–256).
6. Rumelhart, D. E. (1986). Learning internal representations by error propagation. Parallel Distributed Processing, 1, 318–362.
7. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
8. Samudravijaya, K., Rao, P. V., & Agrawal, S. S. (2000). Hindi speech database. In Sixth International Conference on Spoken Language Processing.
9. Sahu, P., Dua, M., & Kumar, A. (2018). Challenges and issues in adopting speech recognition. In Speech and Language Processing for Human-Machine Communications (pp. 209–215). Singapore: Springer.
10. Kuamr, A., Dua, M., & Choudhary, A. (2014). Implementation and performance evaluation of continuous Hindi speech recognition. In 2014 International Conference on Electronics and Communication Systems (ICECS) (pp. 1–5). IEEE.
11. Kumar, A., & Aggarwal, R. K. (2020). A time delay neural network acoustic modeling for Hindi speech recognition. In Advances in Data and Information Sciences (pp. 425–432). Singapore: Springer.
12. Dua, M., Aggarwal, R. K., & Biswas, M. (2019). Discriminatively trained continuous Hindi speech recognition system using interpolated recurrent neural network language modeling. Neural Computing and Applications, 31(10), 6747–6755.
13. Dua, M., Aggarwal, R. K., & Biswas, M. (2019). GFCC based discriminatively trained noise robust continuous ASR system for Hindi language. Journal of Ambient Intelligence and Humanized Computing, 10(6), 2301–2314.
14. Kumar, A., & Aggarwal, R. K. (2020). A hybrid CNN-LiGRU acoustic modeling using raw waveform SincNet for Hindi ASR. Computer Science, 21(4).
15. Kumar, A., & Aggarwal, R. K. (2020). Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation. International Journal of Speech Technology, 1–12.
16. Kumar, A., & Aggarwal, R. K. (2020). Discriminatively trained continuous Hindi speech recognition using integrated acoustic features and recurrent neural network language modeling. Journal of Intelligent Systems, 30(1), 165–179.
17. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., & Silovsky, J. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.
18. Ravanelli, M., Parcollet, T., & Bengio, Y. (2019). The PyTorch-Kaldi speech recognition toolkit. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6465–6469). IEEE.
19. Stolcke, A. (2002). SRILM: An extensible language modeling toolkit. In Seventh International Conference on Spoken Language Processing.
Metadata
Title
Investigation of CNN-Based Acoustic Modeling for Continuous Hindi Speech Recognition
Authors
Tripti Choudhary
Atul Bansal
Vishal Goyal
Copyright year
2022
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-16-2919-8_38
