
2018 | OriginalPaper | Book chapter

Combining Articulatory Features with End-to-End Learning in Speech Recognition

Authors: Leyuan Qu, Cornelius Weber, Egor Lakomkin, Johannes Twiefel, Stefan Wermter

Published in: Artificial Neural Networks and Machine Learning – ICANN 2018

Publisher: Springer International Publishing


Abstract

End-to-end neural networks have shown promising results in large vocabulary continuous speech recognition (LVCSR). However, it is challenging to integrate domain knowledge into such systems. In particular, articulatory features (AFs), which are inspired by the human speech production mechanism, can help in speech recognition. This paper presents two approaches for incorporating this domain knowledge into end-to-end training: (a) fine-tuning networks, which reuse the hidden-layer representations of AF extractors as input for ASR tasks; and (b) progressive networks, which bring in articulatory knowledge through lateral connections from AF extractors. We evaluate the proposed approaches on the Wall Street Journal speech corpus, testing on the standard eval92 evaluation set. Results show that both fine-tuning and progressive networks can integrate articulatory information into end-to-end learning and outperform previous systems.
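The difference between the two approaches is mainly one of wiring, which a toy sketch may make easier to picture. The code below is an illustration only, not the architecture used in the paper: it assumes small feed-forward tanh layers, random weights standing in for a pretrained AF extractor, and hypothetical names and sizes (`W_af`, `U_lateral`, `N_FEATS`, and so on). In variant (a) the ASR output layer reads the AF extractor's hidden representation directly; in variant (b) the ASR column keeps its own layers and receives the AF hidden activations through an extra lateral weight matrix.

```python
import math
import random


def linear(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def tanh_layer(weights, vector):
    """Linear map followed by an element-wise tanh."""
    return [math.tanh(v) for v in linear(weights, vector)]


def random_matrix(rows, cols, rng):
    """Random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N_FEATS, N_HIDDEN, N_CHARS = 4, 3, 5  # hypothetical toy sizes

# Hidden layer of the AF extractor, assumed pretrained on AF labels.
W_af = random_matrix(N_HIDDEN, N_FEATS, rng)

# (a) Fine-tuning style: the ASR output layer takes the AF hidden
# representation as its input.
W_out_a = random_matrix(N_CHARS, N_HIDDEN, rng)


def finetune_forward(frame):
    h_af = tanh_layer(W_af, frame)
    return linear(W_out_a, h_af)


# (b) Progressive style: a separate ASR column with its own layers,
# plus a lateral connection from the AF column's hidden layer.
W_asr1 = random_matrix(N_HIDDEN, N_FEATS, rng)
W_asr2 = random_matrix(N_HIDDEN, N_HIDDEN, rng)
U_lateral = random_matrix(N_HIDDEN, N_HIDDEN, rng)
W_out_b = random_matrix(N_CHARS, N_HIDDEN, rng)


def progressive_forward(frame):
    h_af = tanh_layer(W_af, frame)        # AF column, treated as fixed
    h1 = tanh_layer(W_asr1, frame)        # ASR column, first layer
    own = linear(W_asr2, h1)              # ASR column, second layer input
    lateral = linear(U_lateral, h_af)     # articulatory knowledge injected here
    h2 = [math.tanh(o + l) for o, l in zip(own, lateral)]
    return linear(W_out_b, h2)


frame = [0.1, -0.2, 0.3, 0.05]  # one made-up acoustic feature frame
scores_a = finetune_forward(frame)
scores_b = progressive_forward(frame)
```

Both functions return one unnormalised score per output symbol for a single frame. In this sketch the only structural difference is whether the AF representation replaces the ASR network's own features (a) or is added alongside them through `U_lateral` (b).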


Metadata
Title
Combining Articulatory Features with End-to-End Learning in Speech Recognition
Authors
Leyuan Qu
Cornelius Weber
Egor Lakomkin
Johannes Twiefel
Stefan Wermter
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-030-01424-7_49
