Published in: International Journal of Speech Technology

01-02-2017

Domain adaptation of lattice-free MMI based TDNN models for speech recognition

Authors: Yanhua Long, Yijie Li, Hone Ye, Hongwei Mao

Published in: International Journal of Speech Technology | Issue 1/2017

Abstract

The recently proposed time-delay deep neural network (TDNN) acoustic models trained with the lattice-free maximum mutual information (LF-MMI) criterion have been shown to give significant performance improvements over other deep neural network (DNN) models in a variety of speech recognition tasks. Meanwhile, Kullback–Leibler divergence (KLD) regularization has been validated as an effective adaptation method for DNN acoustic models. However, to the best of our knowledge, no work has been reported on whether the KLD-based method is also effective for LF-MMI based TDNN models, especially for domain adaptation. In this study, we generalized KLD-regularized model adaptation to train domain-specific TDNN acoustic models, and obtained a few distinct and important observations. Experiments were performed on Cantonese-accented, in-car, and far-field noisy Mandarin speech recognition tasks. The results demonstrate that the proposed domain-adapted models achieve a relative word error rate reduction of around 7–29% on these tasks, even when only around 1K adaptation utterances are available.
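As a rough illustration of the KLD regularization idea mentioned in the abstract, the sketch below shows the commonly used frame-level cross-entropy form, in which the adaptation target is an interpolation between the hard label and the posterior of the unadapted model. This is not the authors' LF-MMI generalization, whose details are not given here; the function names, the toy posteriors, and the weight `rho` are all assumptions made for the example.

```python
import math


def kld_regularized_targets(hard_label, si_posteriors, rho):
    """Interpolate a one-hot label with the unadapted model's posteriors.

    rho = 0 reduces to ordinary training on the adaptation labels;
    rho = 1 makes the target equal to the unadapted model's output.
    (Hypothetical helper; names are illustrative only.)
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    targets = []
    for state, p_si in enumerate(si_posteriors):
        one_hot = 1.0 if state == hard_label else 0.0
        targets.append((1.0 - rho) * one_hot + rho * p_si)
    return targets


def cross_entropy(targets, adapted_posteriors, eps=1e-12):
    """Cross-entropy of the adapted model's output against soft targets."""
    return -sum(
        t * math.log(max(p, eps))
        for t, p in zip(targets, adapted_posteriors)
    )


# Toy single-frame example with three output states (made-up numbers).
si = [0.7, 0.2, 0.1]        # posterior from the unadapted (source-domain) model
adapted = [0.3, 0.6, 0.1]   # posterior from the model being adapted
targets = kld_regularized_targets(hard_label=1, si_posteriors=si, rho=0.5)
loss = cross_entropy(targets, adapted)
```

A larger `rho` keeps the adapted model closer to the original one, which is the usual motivation for this regularizer when adaptation data is scarce.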


Title: Domain adaptation of lattice-free MMI based TDNN models for speech recognition
Authors: Yanhua Long, Yijie Li, Hone Ye, Hongwei Mao
Publication date: 01-02-2017
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 1/2017
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-017-9399-z
