2015 | Book

Automatic Speech Recognition

A Deep Learning Approach

About this Book

This book provides a comprehensive overview of recent advances in automatic speech recognition, with a focus on deep learning models, including deep neural networks and many of their variants. It is the first automatic speech recognition book dedicated to the deep learning approach. In addition to a rigorous mathematical treatment of the subject, the book presents insights into, and the theoretical foundations of, a series of highly successful deep learning models.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
Automatic speech recognition (ASR) is an important technology for enabling and improving human–human and human–computer interaction. In this chapter, we introduce the main application areas of ASR systems, describe their basic architecture, and then outline the organization of the book.
Dong Yu, Li Deng

Conventional Acoustic Models

Frontmatter
Chapter 2. Gaussian Mixture Models
Abstract
In this chapter we first introduce the basic concepts of random variables and their associated distributions. These concepts are then applied to Gaussian random variables and mixture-of-Gaussian random variables. Both scalar and vector-valued cases are discussed, and the probability density functions for these random variables are given with their parameters specified. This introduction leads to the Gaussian mixture model (GMM), obtained when the distribution of mixture-of-Gaussian random variables is used to fit real-world data such as speech features. The GMM, as a statistical model for Fourier-spectrum-based speech features, plays an important role in acoustic modeling of conventional speech recognition systems. We discuss some key advantages of GMMs in acoustic modeling, among them the ease with which they can be fit to a wide range of speech features using the EM algorithm. We describe the principle of maximum likelihood and the related EM algorithm for estimating GMM parameters in some detail, as it is still a widely used method in speech recognition. We finally discuss a serious weakness of GMMs in acoustic modeling for speech recognition, motivating the new models and methods that form the bulk of this book.
Dong Yu, Li Deng
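
As a concrete reference for the EM re-estimation described in this chapter's abstract, the GMM density and the corresponding update equations can be written as follows (a standard textbook formulation in generic notation, not necessarily the notation used in the chapter):

    p(\mathbf{x}) = \sum_{m=1}^{M} c_m \,\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m), \qquad \sum_{m=1}^{M} c_m = 1

    \text{E-step:}\quad \gamma_m(t) = \frac{c_m \,\mathcal{N}(\mathbf{x}_t;\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}{\sum_{k=1}^{M} c_k \,\mathcal{N}(\mathbf{x}_t;\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}

    \text{M-step:}\quad c_m \leftarrow \frac{1}{T}\sum_{t=1}^{T}\gamma_m(t), \quad \boldsymbol{\mu}_m \leftarrow \frac{\sum_t \gamma_m(t)\,\mathbf{x}_t}{\sum_t \gamma_m(t)}, \quad \boldsymbol{\Sigma}_m \leftarrow \frac{\sum_t \gamma_m(t)\,(\mathbf{x}_t-\boldsymbol{\mu}_m)(\mathbf{x}_t-\boldsymbol{\mu}_m)^{\top}}{\sum_t \gamma_m(t)}

Each EM iteration is guaranteed not to decrease the training-data likelihood, which is one reason the procedure remains standard for fitting GMMs to speech features.
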
Chapter 3. Hidden Markov Models and the Variants
Abstract
This chapter builds upon the previous chapter's review of probability theory and statistics, including random variables and Gaussian mixture models, and extends it to the Markov chain and the hidden Markov sequence or model (HMM). Central to the HMM is the concept of state, which is itself a random variable typically taking discrete values. Extending a Markov chain to an HMM involves adding uncertainty, or a statistical distribution, to each of the states of the Markov chain. Hence, an HMM is a doubly stochastic process, or a probabilistic function of a Markov chain. When the states of the HMM are confined to be discrete and the distributions associated with the states do not overlap, the HMM reduces to a Markov chain. This chapter covers several key aspects of the HMM: its parametric characterization, its simulation by random number generators, its likelihood evaluation, its parameter estimation via the EM algorithm, and its state decoding via the Viterbi algorithm, a dynamic programming procedure. We then discuss the use of the HMM as a generative model for speech feature sequences and as the basis for speech recognition. Finally, we discuss the limitations of the HMM, which lead to its various extended versions, where each state is associated with a dynamic system or a hidden time-varying trajectory instead of a temporally independent stationary distribution such as a Gaussian mixture. These variants of the HMM, with state-conditioned dynamic systems expressed in the state-space formulation, are introduced as a generative counterpart of the recurrent neural networks described in detail in Chap. 13.
Dong Yu, Li Deng
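
As a minimal illustration of the Viterbi decoding mentioned above, the following log-domain sketch (NumPy; the array names and shapes are our own choices, not the chapter's notation) finds the most likely state path for a discrete-state HMM:

    import numpy as np

    def viterbi(log_pi, log_A, log_B):
        """Most likely state path for an HMM, computed in the log domain.

        log_pi : (N,)    log initial state probabilities
        log_A  : (N, N)  log transition probabilities, A[i, j] = p(j | i)
        log_B  : (T, N)  log observation likelihoods per frame and state
        """
        T, N = log_B.shape
        delta = np.full((T, N), -np.inf)   # best log score ending in each state
        psi = np.zeros((T, N), dtype=int)  # backpointers

        delta[0] = log_pi + log_B[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A   # (from-state, to-state)
            psi[t] = np.argmax(scores, axis=0)
            delta[t] = np.max(scores, axis=0) + log_B[t]

        # Backtrack from the best final state.
        path = np.zeros(T, dtype=int)
        path[-1] = int(np.argmax(delta[-1]))
        for t in range(T - 2, -1, -1):
            path[t] = psi[t + 1, path[t + 1]]
        return path, float(np.max(delta[-1]))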

Deep Neural Networks

Frontmatter
Chapter 4. Deep Neural Networks
Abstract
In this chapter, we introduce deep neural networks (DNNs)—multilayer perceptrons with many hidden layers. DNNs play an important role in modern speech recognition systems and are the focus of the rest of the book. We depict the architecture of DNNs, describe the popular activation functions and training criteria, illustrate the famous backpropagation algorithm for learning DNN model parameters, and introduce practical tricks that make the training process robust.
Dong Yu, Li Deng
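
As a minimal sketch of the forward pass and backpropagation for a DNN with sigmoid hidden layers and a softmax output trained with cross-entropy (one common configuration; the function names, layer shapes, and learning rate here are illustrative assumptions, not the book's recipe):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def backprop_step(x, y, weights, biases, lr=0.1):
        """One gradient step for an MLP with sigmoid hidden layers and a softmax
        output layer trained with cross-entropy.
        x: (B, D) minibatch of inputs; y: (B, K) one-hot targets;
        weights[l]: (n_in, n_out); biases[l]: (n_out,)."""
        # Forward pass, caching the activation of every layer.
        acts = [x]
        for l, (W, b) in enumerate(zip(weights, biases)):
            z = acts[-1] @ W + b
            acts.append(softmax(z) if l == len(weights) - 1 else sigmoid(z))

        # Output error: softmax with cross-entropy gives (posterior - target).
        delta = (acts[-1] - y) / x.shape[0]

        # Backward pass: gradients per layer, then propagate delta through the sigmoid.
        for l in reversed(range(len(weights))):
            grad_W = acts[l].T @ delta
            grad_b = delta.sum(axis=0)
            if l > 0:
                delta = (delta @ weights[l].T) * acts[l] * (1.0 - acts[l])
            weights[l] -= lr * grad_W
            biases[l] -= lr * grad_b
        return acts[-1]   # posteriors, e.g. for monitoring the training criterion
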
Chapter 5. Advanced Model Initialization Techniques
Abstract
In this chapter, we introduce several advanced deep neural network (DNN) model initialization, or pretraining, techniques. These techniques played important roles in the early days of deep learning research and continue to be useful under many conditions. We focus our presentation of DNN pretraining on the following topics: the restricted Boltzmann machine (RBM), which is by itself an interesting generative model; the deep belief network (DBN); the denoising autoencoder; and discriminative pretraining.
Dong Yu, Li Deng
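
One building block mentioned above, the RBM, is typically trained with contrastive divergence. Below is a minimal CD-1 update for a binary-binary RBM (NumPy; the hyperparameters and names are placeholders, not recommendations from the chapter):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cd1_update(v0, W, b_vis, b_hid, lr=0.01, rng=None):
        """One contrastive-divergence (CD-1) step for a binary-binary RBM.

        v0: (B, V) batch of visible vectors; W: (V, H) weights;
        b_vis: (V,) visible biases; b_hid: (H,) hidden biases."""
        if rng is None:
            rng = np.random.default_rng()

        # Positive phase: hidden probabilities and a binary sample given the data.
        p_h0 = sigmoid(v0 @ W + b_hid)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(v0.dtype)

        # Negative phase: one Gibbs step back to a reconstruction, then to hidden.
        p_v1 = sigmoid(h0 @ W.T + b_vis)
        p_h1 = sigmoid(p_v1 @ W + b_hid)

        # Gradient approximation: data statistics minus reconstruction statistics.
        B = v0.shape[0]
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / B
        b_vis += lr * (v0 - p_v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)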

Deep Neural Network-Hidden Markov Model Hybrid Systems for Automatic Speech Recognition

Frontmatter
Chapter 6. Deep Neural Network-Hidden Markov Model Hybrid Systems
Abstract
In this chapter, we describe one of the several possible ways of exploiting deep neural networks (DNNs) in automatic speech recognition systems—the deep neural network-hidden Markov model (DNN-HMM) hybrid system. The DNN-HMM hybrid system takes advantage of the DNN's strong representation-learning power and the HMM's sequential modeling ability, and significantly outperforms conventional Gaussian mixture model (GMM)-HMM systems on many large-vocabulary continuous speech recognition tasks. We describe the architecture and the training procedure of the DNN-HMM hybrid system and point out the key components of such systems by comparing a range of system setups.
Dong Yu, Li Deng
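
A key relation behind hybrid decoding (a standard hybrid-system formulation in generic notation) converts the DNN's senone posteriors into the scaled likelihoods used by the HMM decoder:

    p(\mathbf{x}_t \mid s) \;=\; \frac{p(s \mid \mathbf{x}_t)\, p(\mathbf{x}_t)}{p(s)} \;\propto\; \frac{p(s \mid \mathbf{x}_t)}{p(s)}

Here p(s | x_t) is the DNN output for senone s, p(s) is the senone prior typically estimated from the training alignment, and p(x_t) can be dropped because it is constant across competing hypotheses for a given frame.
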
Chapter 7. Training and Decoding Speedup
Abstract
Deep neural networks (DNNs) have many hidden layers, each of which has many neurons. This greatly increases the total number of parameters in the model and slows down both training and decoding. In this chapter, we discuss algorithms and engineering techniques that speed up training and decoding. Specifically, we describe parallel training algorithms such as the pipelined backpropagation algorithm, the asynchronous stochastic gradient descent algorithm, and the augmented Lagrange multiplier algorithm. We also introduce model-size reduction techniques based on low-rank approximation, which can speed up both training and decoding, and techniques such as quantization, lazy evaluation, and frame skipping that significantly speed up decoding.
Dong Yu, Li Deng
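
As a sketch of the low-rank model-size reduction mentioned above, a large weight matrix can be replaced by the product of two thin matrices obtained from a truncated SVD (the choice of rank is an assumption to be tuned, not a value from the chapter):

    import numpy as np

    def low_rank_factorize(W, rank):
        """Approximate W (m x n) by U_r @ V_r with U_r: (m, rank), V_r: (rank, n).

        Replacing one m x n layer by two layers of sizes m x rank and rank x n
        cuts parameters and multiplications when rank << min(m, n)."""
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        U_r = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
        V_r = Vt[:rank, :]
        return U_r, V_r

    # Usage: W is approximated by U_r @ V_r; the parameter count drops
    # from m*n to rank*(m + n).
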
Chapter 8. Deep Neural Network Sequence-Discriminative Training
Abstract
The cross-entropy criterion discussed in the previous chapters treats each frame independently. Speech recognition, however, is a sequence classification problem. In this chapter, we introduce sequence-discriminative training techniques that better match this problem. We describe the popular maximum mutual information (MMI), boosted MMI (BMMI), minimum phone error (MPE), and minimum Bayes risk (MBR) training criteria, and discuss practical techniques, including lattice generation, lattice compensation, frame dropping, frame smoothing, and learning-rate adjustment, that make DNN sequence-discriminative training effective.
Dong Yu, Li Deng
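
For reference, the MMI objective named above can be written (in generic notation, with the denominator sum restricted to a lattice in practice) as:

    J_{\mathrm{MMI}}(\theta) \;=\; \sum_{u} \log \frac{p_{\theta}(\mathbf{X}_u \mid \mathbf{S}_{W_u})^{\kappa}\, P(W_u)}{\sum_{W} p_{\theta}(\mathbf{X}_u \mid \mathbf{S}_{W})^{\kappa}\, P(W)}

where u indexes training utterances, W_u is the reference transcription, S_W is the state sequence for hypothesis W, P(W) is the language-model probability, and \kappa is the acoustic scaling factor. BMMI, MPE, and MBR modify the denominator weighting or replace the sentence-level criterion with a finer-grained expected accuracy or risk.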

Representation Learning in Deep Neural Networks

Frontmatter
Chapter 9. Feature Representation Learning in Deep Neural Networks
Abstract
In this chapter, we show that deep neural networks jointly learn the feature representation and the classifier. Through many layers of nonlinear processing, DNNs transform the raw input features into a more invariant and discriminative representation that can be better classified by the log-linear model in the final layer. In addition, DNNs learn a hierarchy of features. The lower-level features typically capture local patterns, which are very sensitive to changes in the raw features. The higher-level features, built upon the low-level features, are more abstract and invariant to variations in the raw features. We demonstrate that the learned high-level features are robust to speaker and environment variations.
Dong Yu, Li Deng
Chapter 10. Fuse Deep Neural Network and Gaussian Mixture Model Systems
Abstract
In this chapter, we introduce techniques that fuse deep neural networks (DNNs) and Gaussian mixture models (GMMs). We first describe the Tandem and bottleneck approaches, in which DNNs are used as feature extractors: the activations of a hidden layer, which form a better representation than the raw input features, are used as the features of the GMM system. We then introduce techniques that fuse the recognition results and frame-level scores of the DNN-HMM hybrid system with those of the GMM-HMM system.
Dong Yu, Li Deng
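
To make the bottleneck idea concrete, a hidden layer's activations can be extracted and passed on as features for the GMM-HMM system; a minimal sketch (the function and variable names are our own, and the layer index is an assumption):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bottleneck_features(x, weights, biases, bottleneck_layer):
        """Run the DNN forward only up to the (narrow) bottleneck layer and
        return its activations; these replace or augment the raw features
        fed to a conventional GMM-HMM system."""
        h = x
        for l in range(bottleneck_layer + 1):
            h = sigmoid(h @ weights[l] + biases[l])
        return h
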
Chapter 11. Adaptation of Deep Neural Networks
Abstract
Adaptation techniques compensate for the difference between training and testing conditions and thus can further improve speech recognition accuracy. Unlike Gaussian mixture models (GMMs), which are generative models, deep neural networks (DNNs) are discriminative models. For this reason, the adaptation techniques developed for GMMs cannot be directly applied to DNNs. In this chapter, we first introduce the concept of adaptation. We then describe the important adaptation techniques developed for DNNs, which fall into the categories of linear transformation, conservative training, and subspace methods. We further show that adaptation of DNNs can bring significant error-rate reductions, at least for some speech recognition tasks, and thus is as important as adaptation in GMM systems.
Dong Yu, Li Deng
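
To make the first two categories concrete: a linear-transformation method inserts a trainable affine map on the input (or on another layer) while the original DNN weights stay fixed, and one representative conservative-training criterion regularizes the adapted model toward the speaker-independent (SI) model. In generic notation (our formulation of these widely used ideas, not necessarily the chapter's exact equations):

    \tilde{\mathbf{x}}_t = \mathbf{A}\,\mathbf{x}_t + \mathbf{b}

    J(\theta) = (1-\rho)\,\mathrm{CE}(\text{labels},\, p_{\theta}) + \rho\,\mathrm{CE}(p_{\mathrm{SI}},\, p_{\theta}), \qquad \mathrm{CE}(q, p) = -\sum_{t}\sum_{s} q(s \mid \mathbf{x}_t) \log p_{\theta}(s \mid \mathbf{x}_t)

where \rho controls how strongly the adapted outputs are kept close to the SI model, which is what makes the training conservative when adaptation data are scarce.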

Advanced Deep Models

Frontmatter
Chapter 12. Representation Sharing and Transfer in Deep Neural Networks
Abstract
We have emphasized in the previous chapters that in deep neural networks (DNNs) each hidden layer is a new representation of the raw input to the DNN, with higher layers being more abstract than lower layers. In this chapter, we show that these feature representations can be shared and transferred across related tasks through techniques such as multitask and transfer learning. We use multilingual and crosslingual speech recognition, built on a shared-hidden-layer DNN architecture, as the main example to demonstrate these techniques.
Dong Yu, Li Deng
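
A minimal sketch of the shared-hidden-layer idea: several languages share the hidden stack while each keeps its own softmax output layer (all names and shapes below are illustrative, not the book's implementation):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def multilingual_forward(x, shared_weights, shared_biases, heads, lang):
        """Shared hidden layers plus one language-specific softmax head.

        heads: dict mapping language id -> (W_out, b_out).
        During training, each minibatch updates the shared stack and only the
        head of the language it came from; for a new language, the shared
        stack can be transferred and only a new head needs to be trained."""
        h = x
        for W, b in zip(shared_weights, shared_biases):
            h = sigmoid(h @ W + b)
        W_out, b_out = heads[lang]
        return softmax(h @ W_out + b_out)
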
Chapter 13. Recurrent Neural Networks and Related Models
Abstract
A recurrent neural network (RNN) is a class of neural network models in which connections among neurons form directed cycles. This gives rise to internal states, or memory, in the RNN, endowing it with dynamic temporal behavior not exhibited by the DNNs discussed in earlier chapters. In this chapter, we first present the state-space formulation of the basic RNN as a nonlinear dynamical system, where the recurrent matrix governing the system dynamics is largely unstructured. For such basic RNNs, we describe two learning algorithms in some detail: (1) the most popular algorithm, backpropagation through time (BPTT); and (2) a more rigorous primal-dual optimization technique, in which constraints on the RNN's recurrent matrix are imposed to guarantee stability during learning. Going beyond basic RNNs, we further study an advanced version of the RNN that exploits the structure called long short-term memory (LSTM), and analyze its strengths over the basic RNN in terms of both model construction and practical applications, including some recent speech recognition results. Finally, we analyze the RNN as a bottom-up, discriminative, dynamic system model against the top-down, generative dynamic system counterpart discussed in Chap. 3. The analysis and discussion lead to potentially more effective and advanced RNN-like architectures and learning paradigms in which the strengths of discriminative and generative modeling are integrated while their respective weaknesses are overcome.
Dong Yu, Li Deng
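
For reference, the state-space form of the basic RNN mentioned above can be written (in generic notation) as:

    \mathbf{h}_t = f\!\left(\mathbf{W}_{xh}\,\mathbf{x}_t + \mathbf{W}_{hh}\,\mathbf{h}_{t-1} + \mathbf{b}_h\right), \qquad \mathbf{y}_t = g\!\left(\mathbf{W}_{hy}\,\mathbf{h}_t + \mathbf{b}_y\right)

where W_{hh} is the recurrent matrix whose properties govern the stability constraints discussed in the chapter, f is the hidden nonlinearity, and g is typically a softmax for classification. BPTT unrolls these equations over time and backpropagates gradients through the unrolled graph; the LSTM replaces the simple recurrence with gated memory cells.
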
Chapter 14. Computational Network
Abstract
In the previous chapters, we have discussed various deep learning models for automatic speech recognition (ASR). In this chapter, we introduce the computational network (CN), a unified framework for describing arbitrary learning machines, such as deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, logistic regression, and maximum entropy models, that can be expressed as a series of computational steps. A CN is a directed graph in which each leaf node represents an input value or a parameter and each nonleaf node represents a matrix operation upon its children. We describe algorithms for carrying out forward computation and gradient calculation in a CN and introduce the most popular computation node types used in a typical CN.
Dong Yu, Li Deng
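
A minimal sketch of the computational-network abstraction: leaf nodes hold inputs or parameters, nonleaf nodes compute on their children, and gradients are accumulated in reverse topological order (the tiny node set and API below are our own illustration, not the framework described in the chapter):

    import numpy as np

    class Node:
        def __init__(self, children=()):
            self.children = list(children)
            self.value = None
            self.grad = None

    class Input(Node):                # leaf node: holds data or a parameter matrix
        def __init__(self, value):
            super().__init__()
            self.value = np.asarray(value, dtype=float)
        def forward(self): pass
        def backward(self): pass

    class Times(Node):                # nonleaf node: matrix product of its two children
        def forward(self):
            a, b = self.children
            self.value = a.value @ b.value
        def backward(self):
            if self.grad is None:     # node not on a path to the root
                return
            a, b = self.children
            a.grad = (a.grad if a.grad is not None else 0) + self.grad @ b.value.T
            b.grad = (b.grad if b.grad is not None else 0) + a.value.T @ self.grad

    def evaluate(order):
        """order: topological ordering of nodes, with the objective node last."""
        for n in order:
            n.forward()
        order[-1].grad = np.ones_like(order[-1].value)   # seed the root gradient
        for n in reversed(order):                        # parents before children
            n.backward()
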
Chapter 15. Summary and Future Directions
Abstract
In this chapter, we summarize the book by first listing and analyzing what we view as the major milestone studies in the recent history of developing deep learning-based ASR techniques and systems. We describe the motivations of these studies, the innovations they have engendered, the improvements they have provided, and the impact they have generated. In this road map, we first cover the historical context in which the DNN technology made inroads into ASR around 2009, resulting from academic and industry collaborations. We then select seven main themes in which innovations flourished across the board in ASR industry and academic research after the early debut of DNNs. Finally, we give our view of the current state of the art of speech recognition systems and discuss our thoughts on future research directions.
Dong Yu, Li Deng
Backmatter
Metadata
Title
Automatic Speech Recognition
Authors
Dong Yu
Li Deng
Copyright Year
2015
Publisher
Springer London
Electronic ISBN
978-1-4471-5779-3
Print ISBN
978-1-4471-5778-6
DOI
https://doi.org/10.1007/978-1-4471-5779-3
