Published in: Neural Computing and Applications 10/2019

28 April 2018 | Original Article

Discriminatively trained continuous Hindi speech recognition system using interpolated recurrent neural network language modeling

Authors: Mohit Dua, R. K. Aggarwal, Mantosh Biswas


Abstract

This paper implements and evaluates a discriminatively trained continuous speech recognition system for Hindi. The automatic speech recognition (ASR) system is trained with two discriminative techniques, maximum mutual information (MMI) and minimum phone error (MPE), using various numbers of Gaussian mixtures. The training dataset consists of transcribed Hindi speech. The experiments show a significant performance gain over a maximum likelihood-based Hindi speech recognition system. The system also uses efficient recurrent neural network (RNN)-based language modeling, and the results indicate that this enhances the performance of the ASR system. Further, interpolating an n-gram language model (LM) with the RNN language model (RNNLM) yields an additional increase in recognition performance. The proposed system also incorporates speaker adaptation using the maximum likelihood linear regression (MLLR) technique. The paper also gives an overview of the techniques used for discriminative training, along with the practical issues involved in implementing them.
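The abstract does not give the interpolation formula, so the following is only a minimal sketch assuming the common linear interpolation of language models, P(w | h) = lam * P_RNN(w | h) + (1 - lam) * P_ngram(w | h). The function names `interpolate` and `perplexity`, the weight `lam`, and all probability values are hypothetical illustrations, not taken from the paper.

```python
import math


def interpolate(p_ngram: float, p_rnn: float, lam: float = 0.5) -> float:
    """Linearly interpolate two word probabilities.

    lam is the weight given to the RNNLM; (1 - lam) goes to the n-gram LM.
    This is an assumed, commonly used scheme, not the paper's stated method.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_rnn + (1.0 - lam) * p_ngram


def perplexity(word_probs):
    """Perplexity of a word sequence given its per-word probabilities."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Hypothetical per-word probabilities for a three-word utterance.
ngram_probs = [0.10, 0.02, 0.30]
rnn_probs = [0.20, 0.08, 0.25]

# Mix the two models word by word with an equal weight.
mixed = [interpolate(pn, pr, lam=0.5) for pn, pr in zip(ngram_probs, rnn_probs)]

for name, probs in [("n-gram", ngram_probs), ("RNNLM", rnn_probs), ("mixed", mixed)]:
    print(f"{name}: perplexity = {perplexity(probs):.2f}")
```

In practice the weight would typically be tuned on held-out data, for example by choosing the `lam` that minimizes perplexity on a development set; whether the authors did so is not stated in the abstract.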

Metadata
Title
Discriminatively trained continuous Hindi speech recognition system using interpolated recurrent neural network language modeling
Authors
Mohit Dua
R. K. Aggarwal
Mantosh Biswas
Publication date
28 April 2018
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 10/2019
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-018-3499-9
