
2019 | OriginalPaper | Chapter

12. End-to-End Speech Recognition

Authors: Uday Kamath, John Liu, James Whitaker

Published in: Deep Learning for NLP and Speech Recognition

Publisher: Springer International Publishing


Abstract

In Chap. 8, we aimed to create an ASR system by dividing the fundamental equation
$$\displaystyle W^* = \operatorname*{argmax}_{W \in V^*} P(W|X) $$
into an acoustic model, a lexicon model, and a language model using Bayes' theorem. This approach relies heavily on conditional independence assumptions and on separate optimization procedures for the different models.
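
For reference, the factorization the abstract alludes to can be written out explicitly. What follows is a standard sketch of the noisy-channel decomposition, not a reproduction of the chapter's own derivation; the hidden phone-sequence variable $Q$ is introduced here purely for illustration. Since $P(X)$ does not depend on $W$, Bayes' theorem gives

$$\displaystyle W^* = \operatorname*{argmax}_{W \in V^*} \frac{P(X|W)\,P(W)}{P(X)} = \operatorname*{argmax}_{W \in V^*} P(X|W)\,P(W), $$

and the likelihood is typically expanded over pronunciations as

$$\displaystyle P(X|W) = \sum_{Q} P(X|Q)\,P(Q|W) \approx \max_{Q} P(X|Q)\,P(Q|W), $$

where $P(X|Q)$ is the acoustic model, $P(Q|W)$ the lexicon (pronunciation) model, and $P(W)$ the language model. The expansion over $Q$ assumes $X$ is conditionally independent of $W$ given $Q$. End-to-end systems, the subject of this chapter, instead model $P(W|X)$ directly with a single network trained under one objective.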


Metadata
Title
End-to-End Speech Recognition
Authors
Uday Kamath
John Liu
James Whitaker
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-14596-5_12
