
2017 | Original Paper | Book Chapter

Language Model Optimization for a Deep Neural Network Based Speech Recognition System for Serbian

Authors: Edvin Pakoci, Branislav Popović, Darko Pekar

Published in: Speech and Computer

Publisher: Springer International Publishing


Abstract

This paper presents the results obtained using several variants of trigram language models in a large vocabulary continuous speech recognition (LVCSR) system for the Serbian language, based on the deep neural network (DNN) framework implemented within the Kaldi speech recognition toolkit. This training approach allows parallelization across several threads on either multiple GPUs or multiple CPUs, and provides a natural-gradient modification of the stochastic gradient descent (SGD) optimization method. Acoustic models are trained over a fixed number of training epochs, with parameter averaging at the end. The paper discusses recognition using different language models trained with the Kneser-Ney or Good-Turing smoothing methods, as well as several pruning parameter values. The results on a test set containing more than 120,000 words and different utterance types are analyzed and compared to the reference results obtained with GMM-HMM speaker-adapted models on the same speech database. Online and offline recognition results are also compared to each other. Finally, the effect of additional discriminative training using a language model prior to the DNN stage is explored.
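As an illustration of the kind of language model variants the abstract describes, the sketch below shows how pruned trigram models with Kneser-Ney or Good-Turing smoothing might be built with the SRILM toolkit (ngram-count), which is commonly used to prepare ARPA language models for Kaldi. The corpus and output file names, the pruning threshold, and the wrapper function itself are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: build pruned trigram ARPA language models with SRILM's
# ngram-count, one with modified Kneser-Ney discounting and one relying on
# SRILM's default Good-Turing discounting. File names and the pruning
# threshold are illustrative assumptions, not values from the paper.
import subprocess

def build_trigram_lm(corpus, out_lm, smoothing="kneser-ney", prune=1e-7):
    """Train a trigram ARPA LM from a plain-text corpus (one sentence per line)."""
    cmd = [
        "ngram-count",
        "-text", corpus,      # training text
        "-order", "3",        # trigram model
        "-lm", out_lm,        # output ARPA file
        "-prune", str(prune), # drop n-grams whose removal barely changes perplexity
    ]
    if smoothing == "kneser-ney":
        # Modified Kneser-Ney discounting, interpolated with lower-order estimates
        cmd += ["-kndiscount", "-interpolate"]
    # Otherwise ngram-count falls back to its default Good-Turing discounting.
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    build_trigram_lm("train_text.txt", "lm_kn_pruned.arpa", "kneser-ney")
    build_trigram_lm("train_text.txt", "lm_gt_pruned.arpa", "good-turing")
```

In a Kaldi recipe, such ARPA models would then typically be converted into a decoding graph with the standard formatting and graph-compilation scripts; varying the smoothing method and the pruning threshold trades language model size against perplexity and, ultimately, word error rate.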


Metadata
Title
Language Model Optimization for a Deep Neural Network Based Speech Recognition System for Serbian
Authors
Edvin Pakoci
Branislav Popović
Darko Pekar
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-66429-3_48