International Journal of Speech Technology

01-06-2012

Integration of multiple acoustic and language models for improved Hindi speech recognition system

Authors: R. K. Aggarwal, M. Dave

Published in: International Journal of Speech Technology | Issue 2/2012

Abstract

Despite significant progress in automatic speech recognition (ASR) over the past three decades, it has not reached the level of human performance, particularly in adverse conditions. To improve ASR performance, various approaches have been studied that differ in feature extraction method, classification method, and training algorithm. Different approaches often exploit complementary information, so combining them can be a better option. In this paper, we propose a novel approach that uses the best characteristics of conventional, hybrid, and segmental HMMs by integrating them with the ROVER system combination technique. In the proposed framework, three different recognizers are created and combined, each with its own feature set and classification technique. The complete system uses three separate acoustic models with three different feature sets and two language models. Experimental results show that the word error rate (WER) can be reduced by about 4% with the proposed technique compared to conventional methods. Various modules are implemented and tested for Hindi ASR in typical field conditions as well as in noisy environments.
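
To make the two central ideas of the abstract concrete, the following is a minimal Python sketch of WER scoring and of ROVER-style word voting across three recognizer outputs. It is an illustration under simplifying assumptions, not the authors' implementation: the hypotheses are assumed to be already aligned slot by slot (real ROVER first builds a word transition network by dynamic-programming alignment and can also use confidence scores), and the names `word_error_rate`, `vote`, and the `NULL` placeholder are hypothetical. The toy English sentences stand in for recognizer output and do not reproduce the paper's Hindi data or its reported results.

```python
from collections import Counter

NULL = "@"  # hypothetical marker for "no word" in an aligned slot


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def vote(aligned_hypotheses):
    """Keep the most frequent word in each aligned slot.

    Ties go to the earliest system in the list, because max() returns the
    first maximal element it encounters.
    """
    result = []
    for slot in zip(*aligned_hypotheses):
        counts = Counter(slot)
        best = max(slot, key=lambda w: counts[w])
        if best != NULL:  # a winning NULL means "emit nothing for this slot"
            result.append(best)
    return " ".join(result)


reference = "the cat sat on the mat"
# Three toy recognizer outputs, each with one different error.
system_a = ["the", "cat", "sat", "in", "the", "mat"]   # substitution
system_b = ["the", "bat", "sat", "on", "the", "mat"]   # substitution
system_c = ["the", "cat", "sat", "on", NULL, "mat"]    # deletion

combined = vote([system_a, system_b, system_c])
for name, words in [("A", system_a), ("B", system_b), ("C", system_c)]:
    text = " ".join(w for w in words if w != NULL)
    print(name, round(word_error_rate(reference, text), 3))
print("combined", word_error_rate(reference, combined))
```

In this toy case each system makes a different single error, so the majority vote recovers the reference sentence. That is the complementary-information effect the abstract relies on; when the systems tend to make the same errors, voting cannot correct them.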

Title: Integration of multiple acoustic and language models for improved Hindi speech recognition system
Authors: R. K. Aggarwal, M. Dave
Publication date: 01-06-2012
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 2/2012
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-012-9131-y
