Published in: International Journal of Speech Technology 1/2017

01-02-2017

Domain adaptation of lattice-free MMI based TDNN models for speech recognition

Authors: Yanhua Long, Yijie Li, Hone Ye, Hongwei Mao


Abstract

The recently proposed time-delay deep neural network (TDNN) acoustic models trained with the lattice-free maximum mutual information (LF-MMI) criterion have been shown to give significant performance improvements over other deep neural network (DNN) models in a variety of speech recognition tasks. Meanwhile, Kullback–Leibler divergence (KLD) regularization has been validated as an effective adaptation method for DNN acoustic models. However, to the best of our knowledge, no work has been reported on whether the KLD-based method is also effective for LF-MMI based TDNN models, especially for domain adaptation. In this study, we generalized KLD-regularized model adaptation to train domain-specific TDNN acoustic models and obtained a few distinct and important observations. Experiments were performed on Cantonese-accented, in-car, and far-field noisy Mandarin speech recognition tasks. The results demonstrate that the proposed domain-adapted models achieve relative word error rate reductions of around 7–29% on these tasks, even when only around 1K adaptation utterances are available.
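As a rough illustration of the KLD regularization mentioned in the abstract, the sketch below shows the form in which it is commonly described for frame-level cross-entropy training of DNNs: the one-hot training target is interpolated with the posteriors of the unadapted model, so the adapted model is penalized for drifting away from it. This is not the LF-MMI generalization proposed in the paper, whose details are not given here. The function names, the weight `rho`, and the toy values are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kld_regularized_targets(label_index, unadapted_posteriors, rho):
    """Interpolate the one-hot label with the unadapted model's posteriors.

    rho = 0 gives plain cross-entropy targets (no regularization);
    rho = 1 makes the targets equal to the unadapted model's output.
    """
    return [
        (1.0 - rho) * (1.0 if k == label_index else 0.0) + rho * p
        for k, p in enumerate(unadapted_posteriors)
    ]


def cross_entropy(targets, posteriors):
    """Cross-entropy between soft targets and model posteriors."""
    return -sum(t * math.log(p) for t, p in zip(targets, posteriors) if t > 0.0)


# Toy single-frame example with three output classes (values are made up).
unadapted = softmax([2.0, 1.0, 0.1])   # output of the unadapted model
adapted = softmax([1.5, 1.8, 0.2])     # output of the model being adapted
targets = kld_regularized_targets(label_index=1,
                                  unadapted_posteriors=unadapted,
                                  rho=0.5)
loss = cross_entropy(targets, adapted)
```

A larger `rho` keeps the adapted model closer to the original, which is the usual motivation when adaptation data is scarce; the value that works best has to be tuned on held-out data.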


Metadata

Publisher: Springer US
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-017-9399-z
