2017 | Original Paper | Book Chapter

Experimenting with Hybrid TDNN/HMM Acoustic Models for Russian Speech Recognition

Authored by: Irina Kipyatkova

Published in: Speech and Computer

Publisher: Springer International Publishing

Abstract

In this paper, we study the application of time delay neural networks (TDNNs) to acoustic modeling for large vocabulary continuous Russian speech recognition. We created TDNNs with various numbers of hidden layers and of units per hidden layer, using the p-norm nonlinearity. The acoustic models were trained on our own Russian speech corpus of phonetically balanced phrases, with a total duration of more than 30 h. The TDNN-based acoustic models were tested on a very large vocabulary continuous Russian speech recognition task. The experiments showed that the TDNN models outperformed the baseline deep neural network models in terms of word error rate.
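
The abstract mentions TDNN layers with a p-norm nonlinearity, of the kind popularized by Kaldi's nnet2 p-norm networks. The Python/NumPy sketch below illustrates the general idea of one such layer: temporal splicing over fixed frame offsets, an affine transform, and a group p-norm that reduces dimensionality. All dimensions, offsets, the group size, and the helper names (splice, pnorm, tdnn_layer) are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def splice(frames, offsets):
    """Concatenate each frame with its neighbours at the given offsets.

    frames: (T, D) array of feature vectors; edge frames are clamped,
    as is common in TDNN input layers.
    """
    T = frames.shape[0]
    spliced = [frames[np.clip(np.arange(T) + o, 0, T - 1)] for o in offsets]
    return np.concatenate(spliced, axis=1)  # shape (T, D * len(offsets))

def pnorm(x, group_size=10, p=2.0):
    """Group p-norm nonlinearity: y_j = (sum over group j of |x_i|^p)^(1/p)."""
    T, D = x.shape
    groups = x.reshape(T, D // group_size, group_size)
    return np.sum(np.abs(groups) ** p, axis=2) ** (1.0 / p)

def tdnn_layer(frames, weight, bias, offsets=(-2, 0, 2), group_size=10):
    """One TDNN layer: temporal splicing -> affine transform -> p-norm."""
    spliced = splice(frames, offsets)   # widen the temporal context
    pre_act = spliced @ weight + bias   # affine transform
    return pnorm(pre_act, group_size)   # dimension-reducing nonlinearity

# Toy usage with made-up dimensions: 40-dim features, a 3-frame context
# (giving a 120-dim spliced input), 1000 affine outputs reduced to 100
# p-norm units in groups of 10.
rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 40))
W = 0.01 * rng.standard_normal((120, 1000))
b = np.zeros(1000)
out = tdnn_layer(feats, W, b)           # out.shape == (200, 100)
```

Stacking several such layers with progressively wider offsets lets the upper layers see a long temporal context while keeping each individual layer narrow, which is the efficiency argument behind TDNNs.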

Metadata
Title
Experimenting with Hybrid TDNN/HMM Acoustic Models for Russian Speech Recognition
Authored by
Irina Kipyatkova
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-66429-3_35