2015 | Original Paper | Book Chapter

Improving Acoustic Models for Russian Spontaneous Speech Recognition

Authors: Alexey Prudnikov, Ivan Medennikov, Valentin Mendelev, Maxim Korenevsky, Yuri Khokhlov

Published in: Speech and Computer

Publisher: Springer International Publishing

Abstract

This paper investigates ways to improve acoustic models for Russian spontaneous speech recognition. We applied the main steps of the Kaldi Switchboard recipe to a Russian dataset but obtained lower accuracy than the results reported for English spontaneous telephone speech. Two methods proved especially useful for Russian spontaneous speech: i-vector based deep neural network adaptation and speaker-dependent bottleneck features, which provide 8.6% and 11.9% relative word error rate reduction over the baseline system, respectively.
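The relative word error rate (WER) reductions quoted above follow the usual definition: the absolute WER drop divided by the baseline WER. The sketch below illustrates that arithmetic only. The abstract does not state the baseline WER, so the value of 40.0% used here is a hypothetical placeholder, and the function names are chosen for illustration.

```python
def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Return the relative WER reduction in percent of improved_wer over baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


def wer_after_relative_reduction(baseline_wer: float, reduction_percent: float) -> float:
    """Return the WER obtained by applying a relative reduction (in percent) to a baseline WER."""
    return baseline_wer * (1.0 - reduction_percent / 100.0)


# ASSUMPTION: hypothetical baseline WER in percent; this number is NOT taken from the paper.
HYPOTHETICAL_BASELINE_WER = 40.0

# Relative reductions reported in the abstract.
reported_reductions = {
    "i-vector based DNN adaptation": 8.6,
    "speaker-dependent bottleneck features": 11.9,
}

for method, reduction in reported_reductions.items():
    wer = wer_after_relative_reduction(HYPOTHETICAL_BASELINE_WER, reduction)
    print(f"{method}: {reduction}% relative reduction -> {wer:.2f}% WER (hypothetical baseline)")
```

With a different baseline, the absolute WER values change while the relative reductions stay the same, which is why relative figures are commonly used to compare systems across datasets.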

Metadata
Title
Improving Acoustic Models for Russian Spontaneous Speech Recognition
Authors
Alexey Prudnikov
Ivan Medennikov
Valentin Mendelev
Maxim Korenevsky
Yuri Khokhlov
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-23132-7_29