Published in: International Journal of Speech Technology 2/2012

1 June 2012

Integration of multiple acoustic and language models for improved Hindi speech recognition system

Written by: R. K. Aggarwal, M. Dave


Abstract

Despite significant progress in automatic speech recognition (ASR) over the past three decades, it has not reached the level of human performance, particularly in adverse conditions. To improve ASR performance, various approaches have been studied, which differ in feature extraction method, classification method, and training algorithm. Different approaches often exploit complementary information, so combining them can be a better option. In this paper, we propose a novel approach that uses the best characteristics of conventional, hybrid, and segmental HMMs by integrating them with the ROVER system combination technique. In the proposed framework, three different recognizers are created and combined, each having its own feature set and classification technique. The complete system is built from three separate acoustic models, used with three different feature sets, and two language models. Experimental results show that the word error rate (WER) can be reduced by about 4% using the proposed technique compared with conventional methods. The various modules are implemented and tested for Hindi ASR, in typical field conditions as well as in noisy environments.
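The two ideas the abstract relies on, voting across recognizer outputs and scoring by WER, can be illustrated with a minimal sketch. This is not the authors' implementation and not the full ROVER procedure: ROVER first aligns the hypotheses into a word transition network by dynamic programming and can weight votes by confidence, whereas the sketch below assumes the hypotheses are already aligned slot by slot and uses plain majority voting. The names `NULL`, `vote`, and `wer` are hypothetical, introduced only for illustration.

```python
from collections import Counter

NULL = "@"  # hypothetical marker for "no word" in an aligned slot


def vote(aligned_hyps):
    """Majority vote per slot over pre-aligned hypotheses (simplified, not full ROVER).

    Ties are resolved in favour of the word seen first, i.e. the
    earliest-listed recognizer.
    """
    if len({len(h) for h in aligned_hyps}) != 1:
        raise ValueError("hypotheses must be aligned to the same length")
    result = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word != NULL:  # a winning NULL means "emit nothing here"
            result.append(word)
    return result


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Toy data, invented for illustration (not taken from the paper).
reference = ["mera", "naam", "ravi", "hai"]
hyp_a = ["mera", "kaam", "ravi", "hai"]  # one substitution
hyp_b = ["mera", "naam", "ravi", NULL]   # one deletion
hyp_c = ["tera", "naam", "ravi", "hai"]  # one substitution
combined = vote([hyp_a, hyp_b, hyp_c])
```

In this toy case each recognizer makes a different single error, so the per-slot majority recovers the reference even though every individual hypothesis has a nonzero WER. That is the sense in which complementary systems can help; when the recognizers make the same error, voting cannot correct it.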


Metadata
Title
Integration of multiple acoustic and language models for improved Hindi speech recognition system
Written by
R. K. Aggarwal
M. Dave
Publication date
1 June 2012
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2012
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-012-9131-y
