Skip to main content
Erschienen in: International Journal of Speech Technology 1/2017

01.02.2017

Domain adaptation of lattice-free MMI based TDNN models for speech recognition

verfasst von: Yanhua Long, Yijie Li, Hone Ye, Hongwei Mao

Erschienen in: International Journal of Speech Technology | Ausgabe 1/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The recent proposed time-delay deep neural network (TDNN) acoustic models trained with lattice-free maximum mutual information (LF-MMI) criterion have been shown to give significant performance improvements over other deep neural network (DNN) models in variety speech recognition tasks. Meanwhile, the Kullback–Leibler divergence (KLD) regularization has been validated as an effective adaptation method for DNN acoustic models. However, to our best knowledge, no work has been reported on investigating whether the KLD-based method is also effective for LF-MMI based TDNN models, especially for the domain adaptation. In this study, we generalized the KLD regularized model adaptation to train domain-specific TDNN acoustic models. A few distinct and important observations have been obtained. Experiments were performed on the Cantonese accent, in-car and far-field noise Mandarin speech recognition tasks. Results demonstrated that the proposed domain adapted models can achieve around relative 7–29% word error rate reduction on these tasks, even when the adaptation utterances are only around 1 K.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Bell, P., Gales, M., Lanchantin, P., Liu, X., Long, Y., Renals, S., et al. (2012). Transcription of multi-genre media archives using out-of-domain data. In Proceedings of Workshop on Spoken Language Technology, IEEE (pp. 324–329). Bell, P., Gales, M., Lanchantin, P., Liu, X., Long, Y., Renals, S., et al. (2012). Transcription of multi-genre media archives using out-of-domain data. In Proceedings of Workshop on Spoken Language Technology, IEEE (pp. 324–329).
Zurück zum Zitat Christensen, H., Aniol, M. B., Bell, P., Green, P., Hain, T., King, S., et al. (2013). Combining in-domain and out-of-domain speech data for automatic recognition of disordered speech. In Proceedings of Interspeech, ISCA (pp. 3642–3645). Christensen, H., Aniol, M. B., Bell, P., Green, P., Hain, T., King, S., et al. (2013). Combining in-domain and out-of-domain speech data for automatic recognition of disordered speech. In Proceedings of Interspeech, ISCA (pp. 3642–3645).
Zurück zum Zitat Fainberg, J., Bell, P., Lincoln, M., & Renals, S. (2016). Improving children’s speech recognition through out-of-domain data augmentation. In Proceedings of Interspeech, ISCA (pp. 1598–1602). Fainberg, J., Bell, P., Lincoln, M., & Renals, S. (2016). Improving children’s speech recognition through out-of-domain data augmentation. In Proceedings of Interspeech, ISCA (pp. 1598–1602).
Zurück zum Zitat Gauvain, J., & Lee, C. (1992). MAP estimation of continuous density HMM: Theory and applications. In Proceedings of Workshop on Speech and Natural Language, Association for Computational Linguistics (pp. 185–190). Gauvain, J., & Lee, C. (1992). MAP estimation of continuous density HMM: Theory and applications. In Proceedings of Workshop on Speech and Natural Language, Association for Computational Linguistics (pp. 185–190).
Zurück zum Zitat Huang, Y., Yu, D., Liu, C., & Gong, Y. (2014). Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation. In Proceedings of Interspeech, ISCA (pp. 2977–2981). Huang, Y., Yu, D., Liu, C., & Gong, Y. (2014). Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation. In Proceedings of Interspeech, ISCA (pp. 2977–2981).
Zurück zum Zitat Huang, Z., Tang, J., Xue, S., & Dai, L. (2016). Speaker adaptation of RNN-BLSTM for speech recognition based on speaker code. In Proceedings of ICASSP, IEEE (pp. 5305–5309). Huang, Z., Tang, J., Xue, S., & Dai, L. (2016). Speaker adaptation of RNN-BLSTM for speech recognition based on speaker code. In Proceedings of ICASSP, IEEE (pp. 5305–5309).
Zurück zum Zitat Legetter, c, & Woodland, P. (1995). Maximum likelihood linear regression for speaker adaptation of continuous density Hidden Markov models. Computer Speech and Language, 9, 171–185.CrossRef Legetter, c, & Woodland, P. (1995). Maximum likelihood linear regression for speaker adaptation of continuous density Hidden Markov models. Computer Speech and Language, 9, 171–185.CrossRef
Zurück zum Zitat Mirsamadi, S., & Hansen, J. (2015). A study on deep neural network acoustic model adaptation for robust far-field speech recognition. In Proceedings of Interspeech, ISCA (pp. 2430–2434). Mirsamadi, S., & Hansen, J. (2015). A study on deep neural network acoustic model adaptation for robust far-field speech recognition. In Proceedings of Interspeech, ISCA (pp. 2430–2434).
Zurück zum Zitat Peddinti, V., Povey, D., & Khudanpur, S. (2015). A time delay neural network architecture for different modeling of long temporal contexts. In Proceedings of Interspeech, ISCA (pp. 3214–3218). Peddinti, V., Povey, D., & Khudanpur, S. (2015). A time delay neural network architecture for different modeling of long temporal contexts. In Proceedings of Interspeech, ISCA (pp. 3214–3218).
Zurück zum Zitat Povey, D., (2005). Discriminative training for large vocabulary speech recognition. PhD dissertation, Cambridge University. Povey, D., (2005). Discriminative training for large vocabulary speech recognition. PhD dissertation, Cambridge University.
Zurück zum Zitat Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., et. al. (2011). The Kaldi speech recognition toolkit. In Proceedings of ASRU, IEEE (pp. No. EPFL–CONF–192584). Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., et. al. (2011). The Kaldi speech recognition toolkit. In Proceedings of ASRU, IEEE (pp. No. EPFL–CONF–192584).
Zurück zum Zitat Povey, D., Peddinti, V., Galvez, D., Ghahrmani, P., Manohar, V., Na, X., et al. (2016). Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Proceedings of Interspeech, ISCA (pp. 2751–2755). Povey, D., Peddinti, V., Galvez, D., Ghahrmani, P., Manohar, V., Na, X., et al. (2016). Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Proceedings of Interspeech, ISCA (pp. 2751–2755).
Zurück zum Zitat Qian, Y., Tan, T., Yu, D., & Zhang, Y. (2016). Integrated adaptation with multi-factor joint-learning for far-field speech recognition. In Proceedings of ICASSP, IEEE (pp. 5770–5774). Qian, Y., Tan, T., Yu, D., & Zhang, Y. (2016). Integrated adaptation with multi-factor joint-learning for far-field speech recognition. In Proceedings of ICASSP, IEEE (pp. 5770–5774).
Zurück zum Zitat Sak, H., Senior, A., Rao, K., & Beaufays, F. (2015). Fast and accurate recurrent neural network acoustic models for speech recognition. In Proceedings of Interspeech, ISCA (pp. 1468–1472). Sak, H., Senior, A., Rao, K., & Beaufays, F. (2015). Fast and accurate recurrent neural network acoustic models for speech recognition. In Proceedings of Interspeech, ISCA (pp. 1468–1472).
Zurück zum Zitat Saon, G., Soltau, H., Nahamoo, D., & Picheny, M. (2013). Speaker adaptation of neural network acoustic models using i-vectors. In Proceedings of ASRU, Olomouc (pp. 55–59). Saon, G., Soltau, H., Nahamoo, D., & Picheny, M. (2013). Speaker adaptation of neural network acoustic models using i-vectors. In Proceedings of ASRU, Olomouc (pp. 55–59).
Zurück zum Zitat Senior, A., & Lopez-Moreno, I. (2014). Improving DNN speaker independence with i-vector inputs. In Proceedings of ICASSP, IEEE (pp. 225–229). Senior, A., & Lopez-Moreno, I. (2014). Improving DNN speaker independence with i-vector inputs. In Proceedings of ICASSP, IEEE (pp. 225–229).
Zurück zum Zitat Senior, A., Sak, H., de Chaumont Quitry, F., Sainath, T., & Rao, K. (2015). Acoustic modeling with CD-CTC-SMBR LSTM RNNs. In Proceedings of ASRU, IEEE (pp. 604–609). Senior, A., Sak, H., de Chaumont Quitry, F., Sainath, T., & Rao, K. (2015). Acoustic modeling with CD-CTC-SMBR LSTM RNNs. In Proceedings of ASRU, IEEE (pp. 604–609).
Zurück zum Zitat Toth, L., & Gosztolya, G. (2016). Adaptation of DNN acoustic models using KL-divergence regularization and multi-task training, In Proceedings of SPECOM. (pp. 108–115). Toth, L., & Gosztolya, G. (2016). Adaptation of DNN acoustic models using KL-divergence regularization and multi-task training, In Proceedings of SPECOM. (pp. 108–115).
Zurück zum Zitat Xue, S., Abdel-Hamid, O., Jiang, H., Dai, L., & Liu, Q. (2014). Fast adaptation of deep neural network based on discriminant codes for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 1713–1725.CrossRef Xue, S., Abdel-Hamid, O., Jiang, H., Dai, L., & Liu, Q. (2014). Fast adaptation of deep neural network based on discriminant codes for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 1713–1725.CrossRef
Zurück zum Zitat Yu, D., & Deng, L. (2014). Automatic speech recognition: A deep learning approach (1st ed.). New York: Springer.MATH Yu, D., & Deng, L. (2014). Automatic speech recognition: A deep learning approach (1st ed.). New York: Springer.MATH
Zurück zum Zitat Yu, D., Yao, K., Su, H., Li, G., & Seide, F. (2013). KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition. In Proceedings of ICASSP, IEEE (pp. 7893–7897). Yu, D., Yao, K., Su, H., Li, G., & Seide, F. (2013). KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition. In Proceedings of ICASSP, IEEE (pp. 7893–7897).
Metadaten
Titel
Domain adaptation of lattice-free MMI based TDNN models for speech recognition
verfasst von
Yanhua Long
Yijie Li
Hone Ye
Hongwei Mao
Publikationsdatum
01.02.2017
Verlag
Springer US
Erschienen in
International Journal of Speech Technology / Ausgabe 1/2017
Print ISSN: 1381-2416
Elektronische ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-017-9399-z

Weitere Artikel der Ausgabe 1/2017

International Journal of Speech Technology 1/2017 Zur Ausgabe

Neuer Inhalt