
2015 | Original Paper | Book Chapter

13. Recurrent Neural Networks and Related Models

Authors: Dong Yu, Li Deng

Published in: Automatic Speech Recognition

Publisher: Springer London


Abstract

A recurrent neural network (RNN) is a class of neural network models in which many connections among the neurons form a directed cycle. This gives rise to internal states or memory in the RNN, endowing it with the dynamic temporal behavior not exhibited by the DNN discussed in earlier chapters. In this chapter, we first present the state-space formulation of the basic RNN as a nonlinear dynamical system, where the recurrent matrix governing the system dynamics is largely unstructured. For such basic RNNs, we describe two algorithms for learning their parameters in some detail: (1) the most popular algorithm of backpropagation through time (BPTT); and (2) a more rigorous primal-dual optimization technique, in which constraints on the RNN’s recurrent matrix are imposed to guarantee stability during learning. Going beyond basic RNNs, we further study an advanced version of the RNN that exploits the structure called long short-term memory (LSTM), and analyze its strengths over the basic RNN both in model construction and in practical applications, including some of the latest speech recognition results. Finally, we analyze the RNN as a bottom-up, discriminative, dynamic system model against its top-down, generative counterpart discussed in Chap. 4. The analysis and discussion point toward potentially more effective and advanced RNN-like architectures and learning paradigms in which the strengths of discriminative and generative modeling are integrated while their respective weaknesses are overcome.
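To make the state-space view concrete, the following is a minimal sketch of the basic RNN in a generic notation (the symbols are illustrative and not necessarily those adopted in the chapter). With input x_t, hidden state h_t, and output y_t at frame t, the model is

  h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)   (state equation)
  y_t = g(W_{hy} h_t + b_y)                    (observation equation)

where f is a pointwise nonlinearity such as tanh and g is, for classification, typically a softmax. BPTT unrolls the state equation over the utterance and applies the chain rule to the sum of per-frame losses, so the gradient for W_{hh} accumulates contributions from all time steps. The primal-dual learning method mentioned in the abstract can then be read as minimizing this training loss subject to a stability constraint of the echo-state type on the recurrent matrix, for example a bound such as ||W_{hh}||_2 <= lambda < 1, handled through Lagrange multipliers.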


Footnotes
1
Most of the contrasts discussed here can be generalized to the differences between learning general deep generative models (i.e., those with latent variables) and learning deep discriminative models with neural network architectures.
 
2
With less careful engineering, the basic RNN taking raw speech features as input achieved only 71.8 % accuracy, as reported in [21], before DNN-derived input features were used.
 
3
For example, in [15], second-order dynamics with critical damping were used to incorporate such constraints.
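As an illustration of what such a constraint can look like (a generic critically damped form, not necessarily the exact parameterization used in [15]), a discrete-time second-order target-directed dynamic can be written as

  z_t = 2 gamma z_{t-1} - gamma^2 z_{t-2} + (1 - gamma)^2 T_s,   0 < gamma < 1,

where T_s is the segment-dependent target; the repeated pole at gamma gives critical damping, so the hidden trajectory approaches T_s smoothly without oscillation.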
 
References
1.
Bazzi, I., Acero, A., Deng, L.: An expectation-maximization approach for formant tracking using a parameter-free nonlinear predictor. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2003)
2.
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
3.
Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. In: Neural Networks: Tricks of the Trade, pp. 437–478. Springer (2012)
4.
Bengio, Y.: Estimating or propagating gradients through stochastic neurons. CoRR (2013)
5.
Bengio, Y., Boulanger, N., Pascanu, R.: Advances in optimizing recurrent networks. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, Canada (2013)
6.
Bengio, Y., Boulanger-Lewandowski, N., Pascanu, R.: Advances in optimizing recurrent networks. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, Canada (2013)
7.
Boden, M.: A guide to recurrent neural networks and backpropagation. Technical Report T2002:03, SICS (2002)
8.
Boyd, S.P., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004)
9.
Bridle, J., Deng, L., Picone, J., Richards, H., Ma, J., Kamm, T., Schuster, M., Pike, S., Reagan, R.: An investigation of segmental hidden dynamic models of speech coarticulation for automatic speech recognition. Final Report for the 1998 Workshop on Language Engineering, CLSP, Johns Hopkins (1998)
10.
Chen, J., Deng, L.: A primal-dual method for training recurrent neural networks constrained by the echo-state property. In: Proceedings of the ICLR (2014)
11.
Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
12.
Dahl, G.E., Yu, D., Deng, L., Acero, A.: Large vocabulary continuous speech recognition with context-dependent DBN-HMMs. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4688–4691 (2011)
13.
Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio, Speech Lang. Process. 20(1), 30–42 (2012)
14.
Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: Proceedings of the International Conference on Machine Learning (ICML) (2014)
15.
Deng, L.: A dynamic, feature-based approach to the interface between phonology and phonetics for speech modeling and recognition. Speech Commun. 24(4), 299–323 (1998)
16.
Deng, L.: Computational models for speech production. In: Computational Models of Speech Pattern Processing, pp. 199–213. Springer, New York (1999)
17.
Deng, L.: Switching dynamic system models for speech articulation and acoustics. In: Mathematical Foundations of Speech and Language Processing, pp. 115–134. Springer, New York (2003)
18.
Deng, L.: Dynamic Speech Models—Theory, Algorithm, and Applications. Morgan and Claypool (2006)
19.
Deng, L., Attias, H., Lee, L., Acero, A.: Adaptive Kalman smoothing for tracking vocal tract resonances using a continuous-valued hidden dynamic model. IEEE Trans. Audio, Speech Lang. Process. 15, 13–23 (2007)
20.
Deng, L., Bazzi, I., Acero, A.: Tracking vocal tract resonances using an analytical nonlinear predictor and a target-guided temporal constraint. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2003)
21.
Deng, L., Chen, J.: Sequence classification using high-level features extracted from deep neural networks. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014)
22.
Deng, L., Hassanein, K., Elmasry, M.: Analysis of correlation structure for a neural predictive model with application to speech recognition. Neural Netw. 7, 331–339 (1994)
23.
Deng, L., Hinton, G., Kingsbury, B.: New types of deep neural network learning for speech recognition and related applications: An overview. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, Canada (2013)
24.
Deng, L., Hinton, G., Yu, D.: Deep learning for speech recognition and related applications. In: NIPS Workshop. Whistler, Canada (2009)
25.
Deng, L., Lee, L., Attias, H., Acero, A.: Adaptive Kalman filtering and smoothing for tracking vocal tract resonances using a continuous-valued hidden dynamic model. IEEE Trans. Audio, Speech Lang. Process. 15(1), 13–23 (2007)
26.
Deng, L., Li, J., Huang, J.T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., Gong, Y., Acero, A.: Recent advances in deep learning for speech research at Microsoft. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, Canada (2013)
27.
Deng, L., Li, X.: Machine learning paradigms in speech recognition: An overview. IEEE Trans. Audio, Speech Lang. Process. 21(5), 1060–1089 (2013)
28.
Deng, L., Ma, J.: Spontaneous speech recognition using a statistical coarticulatory model for the hidden vocal-tract-resonance dynamics. J. Acoust. Soc. Am. 108, 3036–3048 (2000)
29.
Deng, L., O’Shaughnessy, D.: Speech Processing—A Dynamic and Optimization-Oriented Approach. Marcel Dekker Inc, NY (2003)
30.
Deng, L., Ramsay, G., Sun, D.: Production models as a structural basis for automatic speech recognition. Speech Commun. 33(2–3), 93–111 (1997)
31.
Deng, L., Togneri, R.: Deep dynamic models for learning hidden representations of speech features. In: Speech and Audio Processing for Coding, Enhancement and Recognition. Springer (2014)
32.
Deng, L., Yu, D.: Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 445–448 (2007)
33.
Deng, L., Yu, D., Acero, A.: A bidirectional target filtering model of speech coarticulation: two-stage implementation for phonetic recognition. IEEE Trans. Speech Audio Process. 14, 256–265 (2006)
34.
Deng, L., Yu, D., Acero, A.: Structured speech modeling. IEEE Trans. Speech Audio Process. 14, 1492–1504 (2006)
35.
Divenyi, P., Greenberg, S., Meyer, G.: Dynamics of Speech Production and Perception. IOS Press (2006)
36.
Fan, Y., Qian, Y., Xie, F., Soong, F.K.: TTS synthesis with bidirectional LSTM based recurrent neural networks. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2014)
37.
Fernandez, R., Rendel, A., Ramabhadran, B., Hoory, R.: Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2014)
38.
Geiger, J., Zhang, Z., Weninger, F., Schuller, B., Rigoll, G.: Robust speech recognition using long short-term memory recurrent neural networks for hybrid acoustic modelling. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2014)
39.
Gers, F., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. Neural Comput. 12, 2451–2471 (2000)
40.
Gers, F., Schraudolph, N., Schmidhuber, J.: Learning precise timing with LSTM recurrent networks. J. Mach. Learn. Res. 3, 115–143 (2002)
41.
Ghahramani, Z., Hinton, G.E.: Variational learning for switching state-space models. Neural Comput. 12, 831–864 (2000)
42.
Gonzalez, J., Lopez-Moreno, I., Sak, H., Gonzalez-Rodriguez, J., Moreno, P.: Automatic language identification using long short-term memory recurrent neural networks. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2014)
43.
Graves, A.: Sequence transduction with recurrent neural networks. In: ICML Representation Learning Workshop (2012)
45.
Graves, A., Jaitly, N., Mohamed, A.: Hybrid speech recognition with deep bidirectional LSTM. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, Canada (2013)
46.
Graves, A., Mohamed, A., Hinton, G.: Speech recognition with deep recurrent neural networks. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, Canada (2013)
47.
Heigold, G., Vanhoucke, V., Senior, A., Nguyen, P., Ranzato, M., Devin, M., Dean, J.: Multilingual acoustic models using distributed deep neural networks. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013)
48.
Hermans, M., Schrauwen, B.: Training and analysing deep recurrent neural networks. In: Proceedings of the Neural Information Processing Systems (NIPS) (2013)
49.
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., Kingsbury, B.: Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
50.
Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
51.
52.
53.
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
54.
Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic variational inference
55.
Jaeger, H.: Short term memory in echo state networks. GMD Report 152, GMD—German National Research Institute for Computer Science (2001)
56.
Jaeger, H.: Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the “echo state network” approach. GMD Report 159, GMD—German National Research Institute for Computer Science (2002)
57.
Jordan, M., Sudderth, E., Wainwright, M., Willsky, A.: Major advances and emerging developments of graphical models, special issue. IEEE Signal Process. Mag. 27(6), 17,138 (2010)
59.
Kingma, D., Welling, M.: Efficient gradient-based inference through transformations between Bayes nets and neural nets. In: Proceedings of the International Conference on Machine Learning (ICML) (2014)
60.
Kingsbury, B., Sainath, T.N., Soltau, H.: Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2012)
61.
Lee, L., Attias, H., Deng, L.: Variational inference and learning for segmental switching state space models of hidden speech dynamics. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. I-872–I-875 (2003)
62.
Ma, J., Deng, L.: A path-stack algorithm for optimizing dynamic regimes in a statistical hidden dynamic model of speech. Comput. Speech Lang. 14, 101–104 (2000)
63.
Ma, J., Deng, L.: Efficient decoding strategies for conversational speech recognition using a constrained nonlinear state-space model. IEEE Trans. Audio Speech Process. 11(6), 590–602 (2003)
64.
Ma, J., Deng, L.: Efficient decoding strategies for conversational speech recognition using a constrained nonlinear state-space model. IEEE Trans. Audio, Speech Lang. Process. 11(6), 590–602 (2004)
65.
Ma, J., Deng, L.: Target-directed mixture dynamic models for spontaneous speech recognition. IEEE Trans. Audio Speech Process. 12(1), 47–58 (2004)
66.
Maas, A.L., Le, Q., O’Neil, T.M., Vinyals, O., Nguyen, P., Ng, A.Y.: Recurrent neural networks for noise reduction in robust ASR. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH). Portland, OR (2012)
67.
Mesnil, G., He, X., Deng, L., Bengio, Y.: Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH). Lyon, France (2013)
69.
Mikolov, T.: Statistical Language Models Based on Neural Networks. Ph.D. thesis, Brno University of Technology (2012)
70.
Mikolov, T., Deoras, A., Povey, D., Burget, L., Cernocky, J.: Strategies for training large scale neural network language models. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 196–201. IEEE, Honolulu, HI (2011)
71.
Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., Khudanpur, S.: Recurrent neural network based language model. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1045–1048. Makuhari, Japan (2010)
72.
Mikolov, T., Kombrink, S., Burget, L., Cernocky, J., Khudanpur, S.: Extensions of recurrent neural network language model. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5528–5531. Prague, Czech Republic (2011)
73.
Mikolov, T., Zweig, G.: Context dependent recurrent neural network language model. In: Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 234–239 (2012)
74.
Mnih, A., Gregor, K.: Neural variational inference and learning in belief networks. In: Proceedings of the International Conference on Machine Learning (ICML) (2014)
75.
Mohamed, A.r., Dahl, G.E., Hinton, G.: Deep belief networks for phone recognition. In: NIPS Workshop on Deep Learning for Speech Recognition and Related Applications (2009)
76.
Ozkan, E., Ozbek, I., Demirekler, M.: Dynamic speech spectrum representation and tracking variable number of vocal tract resonance frequencies with time-varying Dirichlet process mixture models. IEEE Trans. Audio, Speech Lang. Process. 17(8), 1518–1532 (2009)
77.
Pascanu, R., Gulcehre, C., Cho, K., Bengio, Y.: How to construct deep recurrent neural networks. In: The 2nd International Conference on Learning Representations (ICLR) (2014)
78.
Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: Proceedings of the International Conference on Machine Learning (ICML). Atlanta, GA (2013)
79.
Pavlovic, V., Frey, B., Huang, T.: Variational learning in mixed-state dynamic graphical models. In: UAI, pp. 522–530. Stockholm (1999)
80.
Picone, J., Pike, S., Regan, R., Kamm, T., Bridle, J., Deng, L., Ma, Z., Richards, H., Schuster, M.: Initial evaluation of hidden dynamic models on conversational speech. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1999)
81.
Robinson, A.J.: An application of recurrent nets to phone probability estimation. IEEE Trans. Neural Netw. 5(2), 298–305 (1994)
82.
Robinson, A.J., Cook, G., Ellis, D.P., Fosler-Lussier, E., Renals, S., Williams, D.: Connectionist speech recognition of broadcast news. Speech Commun. 37(1), 27–45 (2002)
83.
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986)
84.
Sainath, T., Kingsbury, B., Soltau, H., Ramabhadran, B.: Optimization techniques to improve training speed of deep neural networks for large speech tasks. IEEE Trans. Audio, Speech, Lang. Process. 21(11), 2267–2276 (2013)
85.
Sainath, T.N., Kingsbury, B., Mohamed, A.r., Dahl, G.E., Saon, G., Soltau, H., Beran, T., Aravkin, A.Y., Ramabhadran, B.: Improvements to deep convolutional neural networks for LVCSR. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 315–320 (2013)
86.
Sainath, T.N., Kingsbury, B., Mohamed, A.r., Ramabhadran, B.: Learning filter banks within a deep neural network framework. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (2013)
87.
Sainath, T.N., Kingsbury, B., Sindhwani, V., Arisoy, E., Ramabhadran, B.: Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6655–6659 (2013)
88.
Sak, H., Senior, A., Beaufays, F.: Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2014)
89.
Sak, H., Vinyals, O., Heigold, G., Senior, A., McDermott, E., Monga, R., Mao, M.: Sequence discriminative distributed training of long short-term memory recurrent neural networks. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2014)
90.
Schmidhuber, J.: Deep learning in neural networks: an overview. CoRR abs/1404.7828 (2014)
91.
Seide, F., Fu, H., Droppo, J., Li, G., Yu, D.: On parallelizability of stochastic gradient descent for speech DNNs. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014)
92.
Seide, F., Li, G., Yu, D.: Conversational speech transcription using context-dependent deep neural networks. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 437–440 (2011)
93.
Shen, X., Deng, L.: Maximum likelihood in statistical estimation of dynamical systems: Decomposition algorithm and simulation results. Signal Process. 57, 65–79 (1997)
94.
Stevens, K.: Acoustic Phonetics. MIT Press (2000)
95.
Stoyanov, V., Ropson, A., Eisner, J.: Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) (2011)
96.
Sutskever, I.: Training Recurrent Neural Networks. Ph.D. thesis, University of Toronto (2013)
97.
Togneri, R., Deng, L.: Joint state and parameter estimation for a target-directed nonlinear dynamic system model. IEEE Trans. Signal Process. 51(12), 3061–3070 (2003)
98.
Togneri, R., Deng, L.: A state-space model with neural-network prediction for recovering vocal tract resonances in fluent speech from mel-cepstral coefficients. Speech Commun. 48(8), 971–988 (2006)
99.
Triefenbach, F., Jalalvand, A., Demuynck, K., Martens, J.P.: Acoustic modeling with hierarchical reservoirs. IEEE Trans. Audio, Speech, Lang. Process. 21(11), 2439–2450 (2013)
100.
Vanhoucke, V., Devin, M., Heigold, G.: Multiframe deep neural networks for acoustic modeling
101.
Vanhoucke, V., Senior, A., Mao, M.Z.: Improving the speed of neural networks on CPUs. In: Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
102.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989)
103.
Weninger, F., Geiger, J., Wollmer, M., Schuller, B., Rigoll, G.: Feature enhancement by deep LSTM networks for ASR in reverberant multisource environments. Comput. Speech Lang. 888–902 (2014)
104.
Xing, E., Jordan, M., Russell, S.: A generalized mean field algorithm for variational inference in exponential families. In: Proceedings of the Uncertainty in Artificial Intelligence (2003)
105.
Yu, D., Deng, L.: Speaker-adaptive learning of resonance targets in a hidden trajectory model of speech coarticulation. Comput. Speech Lang. 27, 72–87 (2007)
106.
Yu, D., Deng, L.: Deep-structured hidden conditional random fields for phonetic recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2010)
107.
Yu, D., Deng, L., Acero, A.: A lattice search technique for a long-contextual-span hidden trajectory model of speech. Speech Commun. 48, 1214–1226 (2006)
108.
Yu, D., Deng, L., Dahl, G.: Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition. In: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Deep Learning and Unsupervised Feature Learning (2010)
Metadata
Title
Recurrent Neural Networks and Related Models
Authors
Dong Yu
Li Deng
Copyright Year
2015
Publisher
Springer London
DOI
https://doi.org/10.1007/978-1-4471-5779-3_13
