2015 | Original Paper | Book Chapter

6. Deep Dynamic Models for Learning Hidden Representations of Speech Features

Authors: Li Deng, Roberto Togneri

Published in: Speech and Audio Processing for Coding, Enhancement and Recognition

Publisher: Springer New York

Abstract

Deep hierarchical structure with multiple layers of hidden space in human speech is intrinsically connected to its dynamic characteristics, which manifest at all levels of speech production and perception. The desire to capitalize on even a superficial understanding of this deep speech structure helped ignite the recent surge of interest in the deep learning approach to speech recognition and related applications, and a more thorough understanding of the deep structure of speech dynamics and the related computational representations is expected to further advance research in speech technology. In this chapter, we first survey a series of studies on representing speech in a hidden space using dynamic systems and recurrent neural networks, emphasizing the different ways of learning the model parameters and, subsequently, the hidden feature representations of time-varying speech data. We analyze and group this rich set of deep, dynamic speech models into two major categories: (1) top-down, generative models adopting localist representations of speech classes and features in the hidden space; and (2) bottom-up, discriminative models adopting distributed representations. Through detailed examinations of and comparisons between these two types of models, we focus on the localist versus distributed representations as their respective hallmarks and defining characteristics. Finally, we discuss future directions and potential strategies for leveraging the strengths of both localist and distributed representations while overcoming their respective weaknesses, going beyond the blind integration of the two in which the generative model merely pre-trains the discriminative one, as in the popular method of training deep neural networks.
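
For illustration only (this sketch does not appear in the chapter), the short Python example below contrasts the two hidden-representation styles summarized above: a target-directed hidden dynamic model that generates speech-feature frames top-down from a localist phonetic target, and a simple recurrent network that computes a distributed hidden state bottom-up from those frames. All dimensions, parameter values, and variable names are assumptions made for the example.

import numpy as np

# Illustrative sketch only; all dimensions and parameters are assumed,
# not taken from the chapter.
rng = np.random.default_rng(0)
T, obs_dim, hid_dim = 100, 13, 4        # frames, feature dimension (e.g. MFCCs), hidden size

# (1) Top-down, generative view: a target-directed hidden dynamic model.
#     The hidden state z_t moves toward a per-phone "target" (the localist element)
#     and emits an observation y_t = C z_t + noise.
target = rng.standard_normal(hid_dim)   # hidden target for one phone-like unit
C = rng.standard_normal((obs_dim, hid_dim))
rate = 0.2                              # speed of movement toward the target
z = np.zeros(hid_dim)
hidden_traj, frames = [], []
for t in range(T):
    z = z + rate * (target - z) + 0.01 * rng.standard_normal(hid_dim)
    hidden_traj.append(z.copy())
    frames.append(C @ z + 0.05 * rng.standard_normal(obs_dim))

# (2) Bottom-up, discriminative view: a plain recurrent network whose hidden vector
#     h_t = tanh(W_h h_{t-1} + W_x y_t) is a distributed representation of the input.
W_h = 0.1 * rng.standard_normal((hid_dim, hid_dim))
W_x = 0.1 * rng.standard_normal((hid_dim, obs_dim))
h = np.zeros(hid_dim)
rnn_states = []
for y_t in frames:
    h = np.tanh(W_h @ h + W_x @ y_t)
    rnn_states.append(h.copy())

print(len(hidden_traj), len(rnn_states))  # both trace a hidden trajectory over the T frames

In the generative model the hidden trajectory would in practice be recovered from the observations by inference (for example Kalman filtering or variational methods), whereas the recurrent network computes its distributed hidden representation directly from the data.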

Metadata
Title
Deep Dynamic Models for Learning Hidden Representations of Speech Features
Authors
Li Deng
Roberto Togneri
Copyright Year
2015
Publisher
Springer New York
DOI
https://doi.org/10.1007/978-1-4939-1456-2_6
