Published in: International Journal of Speech Technology 2/2018

16-02-2018

Continuous Punjabi speech recognition model based on Kaldi ASR toolkit

Authors: Jyoti Guglani, A. N. Mishra


Abstract

In this paper, a continuous Punjabi speech recognition model built with the Kaldi toolkit is presented. Mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) features were extracted from continuous Punjabi speech samples. The performance of the automatic speech recognition (ASR) system is reported for both the monophone model and the triphone models (tri1, tri2, and tri3) using an N-gram language model, and is measured in terms of word error rate (WER). A significant reduction in WER was observed with the triphone models over the monophone model; moreover, the tri3 model outperformed the tri2 model, which in turn outperformed the tri1 model. It was also found that MFCC features yield higher recognition accuracy than PLP features for continuous Punjabi speech.
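WER, the metric reported in the abstract, is the word-level edit distance (substitutions + deletions + insertions) between the recognizer's output and the reference transcript, divided by the number of reference words. The following minimal sketch is illustrative only — it is not from the paper and is independent of Kaldi's own scoring tools:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One word deleted from a six-word reference gives WER = 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

In practice, Kaldi computes this for whole test sets during scoring; the sketch above shows only the per-utterance arithmetic behind the reported numbers.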


Metadata
Title
Continuous Punjabi speech recognition model based on Kaldi ASR toolkit
Authors
Jyoti Guglani
A. N. Mishra
Publication date
16-02-2018
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 2/2018
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-018-9497-6
