
2019 | OriginalPaper | Chapter

12. End-to-End Speech Recognition

Authors: Uday Kamath, John Liu, James Whitaker

Published in: Deep Learning for NLP and Speech Recognition

Publisher: Springer International Publishing


Abstract

In Chap. 8, we aimed to create an ASR system by dividing the fundamental equation
$$\displaystyle W^* = \operatorname*{argmax}_{W \in V^*} P(W|X) $$
into an acoustic model, a lexicon model, and a language model using Bayes' theorem. This approach relies heavily on conditional independence assumptions and on separate optimization procedures for the different models.
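
For reference, the factorization the abstract alludes to can be written out explicitly. What follows is a standard sketch of the noisy-channel decomposition, not a reproduction of the chapter's own derivation; the hidden phone-sequence variable $Q$ is introduced here purely for illustration. Since $P(X)$ does not depend on $W$, Bayes' theorem gives

$$\displaystyle W^* = \operatorname*{argmax}_{W \in V^*} \frac{P(X|W)\,P(W)}{P(X)} = \operatorname*{argmax}_{W \in V^*} P(X|W)\,P(W), $$

and the likelihood is typically expanded over pronunciations as

$$\displaystyle P(X|W) = \sum_{Q} P(X|Q)\,P(Q|W) \approx \max_{Q} P(X|Q)\,P(Q|W), $$

where $P(X|Q)$ is the acoustic model, $P(Q|W)$ the lexicon (pronunciation) model, and $P(W)$ the language model. The expansion over $Q$ assumes $X$ is conditionally independent of $W$ given $Q$. End-to-end systems, the subject of this chapter, instead model $P(W|X)$ directly with a single network trained under one objective.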


Metadata
Title
End-to-End Speech Recognition
Authors
Uday Kamath
John Liu
James Whitaker
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-14596-5_12
