
1994 | Book

Connectionist Speech Recognition

A Hybrid Approach

Authors: Hervé A. Bourlard, Nelson Morgan

Publisher: Springer US

Book Series: The International Series in Engineering and Computer Science


About this book

Connectionist Speech Recognition: A Hybrid Approach describes the theory and implementation of a method to incorporate neural network approaches into state-of-the-art continuous speech recognition systems based on hidden Markov models (HMMs) to improve their performance. In this framework, neural networks (and in particular, multilayer perceptrons or MLPs) have been restricted to well-defined subtasks of the whole system, i.e., HMM emission probability estimation and feature extraction.
The book describes a successful five-year international collaboration between the authors. The lessons learned form a case study that demonstrates how hybrid systems can be developed to combine neural networks with more traditional statistical approaches. The book illustrates both the advantages and limitations of neural networks in the framework of a statistical system.
Using standard databases and comparison with some conventional approaches, it is shown that MLP probability estimation can improve recognition performance. Other approaches are discussed, though there is no such unequivocal experimental result for these methods.
Connectionist Speech Recognition is of use to anyone intending to use neural networks for speech recognition, or within the framework provided by an existing successful statistical approach. This includes research and development groups working in the field of speech recognition, with both standard and neural network approaches, as well as other pattern recognition and/or neural network researchers. The book is also suitable as a text for advanced courses on neural networks or speech processing.

Table of Contents

Frontmatter

Background

Frontmatter
Chapter 1. Introduction
Abstract
For thirty years, Artificial Neural Networks (ANNs) have been used for difficult problems in pattern recognition [Viglione, 1970]. Some of these problems, such as the pattern analysis of brain waves, have been characterized by a low signal-to-noise ratio; in some cases it was not even known what was signal and what was noise.
Hervé A. Bourlard, Nelson Morgan
Chapter 2. Statistical Pattern Classification
Abstract
No new engineering or scientific technique, however novel, evolves in isolation. Both Hidden Markov Model (HMM) and Multilayer Perceptron (MLP) based approaches have been developed in the context of a long history of pattern recognition technology. Though specific methods are changing, the pattern recognition perspective continues to be useful for the description of many problems and their proposed solutions.
Hervé A. Bourlard, Nelson Morgan
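As a compact reminder of the classical framework this chapter reviews, the Bayes decision rule (written here in standard notation, not quoted from the book) selects the class with the largest posterior probability:

```latex
% Bayes decision rule: pick the class q_k with the largest posterior
% probability given the observation x. By Bayes' rule the posterior
% factors into a class-conditional likelihood and a prior; p(x) is
% common to all classes and can be dropped during classification.
\[
  \hat{q} = \arg\max_{q_k} P(q_k \mid x)
          = \arg\max_{q_k} \frac{p(x \mid q_k)\, P(q_k)}{p(x)}
          = \arg\max_{q_k} \, p(x \mid q_k)\, P(q_k)
\]
```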
Chapter 3. Hidden Markov Models
Abstract
The most efficient approach developed so far to deal with the statistical variations of speech (in both the frequency and temporal domains) consists in modeling the lexicon words or the constituent speech units by Hidden Markov Models (HMMs) [Baker, 1975a; Jelinek, 1976; Bahl et al., 1983; Levinson et al., 1983; Rabiner & Juang, 1986]. First described in [Baum & Petrie, 1966; Baum & Eagon, 1967; Baum, 1972], this formalism was proposed as a statistical method of estimating the probabilistic functions of Markov chains. Shortly afterwards, it was extended to automatic speech recognition independently at CMU [Baker, 1975b] and IBM [Bakis, 1976; Jelinek, 1976].
Hervé A. Bourlard, Nelson Morgan
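For orientation, the core quantity an HMM assigns to an observation sequence can be written as follows (standard HMM notation, assuming a first-order, discrete-state model; this is a summary, not an excerpt from the chapter):

```latex
% Likelihood of an observation sequence X = x_1 ... x_N given model M:
% a sum over all hidden state paths of the product of transition
% probabilities p(q_n | q_{n-1}) and emission probabilities p(x_n | q_n)
% (p(q_1 | q_0) denotes the initial-state probability).
\[
  P(X \mid M) \;=\; \sum_{q_1, \dots, q_N} \prod_{n=1}^{N}
    p(q_n \mid q_{n-1})\, p(x_n \mid q_n)
\]
```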
Chapter 4. Multilayer Perceptrons
Abstract
In this section, we will describe the perceptron and Multilayer Perceptron (MLP) classes of Artificial Neural Networks. MLPs can be used for tasks such as feature extraction (see Chapter 14) and prediction (see Section 6.8 and Chapter 13) with applications ranging from signal processing to stock market forecasting. For reviews and further reading on the fundamentals of neural networks, see [Rumelhart, Hinton, & Williams, 1986b; Pao, 1989; Beale & Jackson, 1990; Hertz, Krogh, & Palmer, 1991; Zurada, 1992]. For more information on learning algorithms, performance evaluation, and applications, see [Karayiannis & Venetsanopoulos, 1993]. For more references and application areas, see [Simpson, 1991].
Hervé A. Bourlard, Nelson Morgan
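The following is a minimal NumPy sketch of the forward pass of a one-hidden-layer MLP with sigmoid hidden units and softmax outputs; the function and variable names are illustrative, not taken from the book:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer MLP.

    x  : (n_inputs,) input feature vector
    W1 : (n_hidden, n_inputs) input-to-hidden weights, b1 hidden biases
    W2 : (n_classes, n_hidden) hidden-to-output weights, b2 output biases
    Returns a vector of class probabilities (softmax outputs).
    """
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))  # sigmoid hidden activations
    z = W2 @ h + b2                           # output pre-activations
    z -= z.max()                              # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()                        # softmax posteriors
```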

Hybrid HMM/MLP Systems

Frontmatter
Chapter 5. Speech Recognition Using ANNs
Abstract
Given all the difficulties presented in Chapter 1, Automatic Speech Recognition (ASR) remains a challenging problem in pattern recognition. After half a century of research, the performance currently achieved by state of the art systems is not yet at the level of a mature technology. Over the years, many technological innovations have boosted the level of performance for more and more difficult tasks. Some of the most significant of these innovations include: (1) pattern matching approaches (e.g., DTW), (2) statistical pattern recognition (e.g., HMMs), (3) better use of a priori phonological knowledge, and (4) integration of syntactic constraints in Continuous Speech Recognition (CSR) algorithms. However, despite impressive improvements, performance on realistic (i.e., fairly unconstrained) tasks is still far too low for effective use. It seems likely that new technological breakthroughs will be needed to achieve the major performance improvements that are still required. Even if one assumes infinite computational power, infinite storage and corresponding memory bandwidth, and an infinite amount of training data, it is still not certain that one could solve the ASR problem in a satisfactory way. It has also become clear that the use of higher level knowledge during the recognition process (or more generally, the efficient interaction between multiple knowledge sources) is required to overcome the limitations of current ASR systems.
Hervé A. Bourlard, Nelson Morgan
Chapter 6. Statistical Inference in MLPs
Abstract
In Chapter 3, we showed that HMMs were stochastic models that dealt efficiently with the statistical and sequential character of the speech signal, but which also suffer from several limiting assumptions that are required for tractable solutions. In Chapter 4, we discussed ANNs and showed that they had their own attractive properties; in particular, they appear to rely on fewer basic assumptions. Chapter 5 briefly reviewed the most popular ANN approaches currently used for sequence processing in general and speech recognition in particular. We concluded that none of these were able to solve CSR properly using ANNs by themselves. Given these tradeoffs, we have been interested in using ANNs to overcome some HMM drawbacks while staying within the latter’s formalism. This kind of hybrid is frequently not straightforward, however; for instance, it is difficult to optimally incorporate rule-based speech knowledge in an HMM-based ASR system.1
Hervé A. Bourlard, Nelson Morgan
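The chapter's central result can be summarized in standard notation: an MLP trained as a classifier estimates the class posteriors, and dividing by the class priors yields "scaled likelihoods" that can stand in for HMM emission probabilities:

```latex
% An MLP trained with 1-of-K class targets estimates the posterior
% P(q_k | x_n). Dividing by the class prior P(q_k) (estimated from the
% training data) gives the emission likelihood up to the factor p(x_n),
% which is the same for every state at frame n and therefore leaves
% Viterbi decoding unchanged.
\[
  \frac{P(q_k \mid x_n)}{P(q_k)} \;=\; \frac{p(x_n \mid q_k)}{p(x_n)}
\]
```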
Chapter 7. The Hybrid HMM/MLP Approach
Abstract
As described earlier, HMMs are now widely used for automatic speech recognition, and inherently incorporate the sequential and statistical character of the speech signal. However, notwithstanding their efficiency, standard HMM-based recognizers suffer from several weaknesses, mainly due to the many assumptions required to make their optimization possible (see Chapter 3):
Hervé A. Bourlard, Nelson Morgan
Chapter 8. Experimental Systems
Abstract
The work presented in this chapter (and indeed, some of the writing as well) has been done in collaboration with other members of the Realization group at ICSI (Berkeley, CA), particularly Chuck Wooters, Phil Kohn, and Steve Renals (currently at Cambridge); with Hynek Hermansky of US West Advanced Technologies (Denver, CO); and more recently with Michael Cohen, Horacio Franco, and Victor Abrash of SRI (Menlo Park, CA). The results of this continuing work are presented here to show that the hybrid HMM/MLP approach can improve state-of-the-art large vocabulary, continuous speech recognition systems.
Hervé A. Bourlard, Nelson Morgan
Chapter 9. Context-Dependent MLPs
Abstract
Chapters 7 and 8 have shown the ability of Multilayer Perceptrons (MLPs) to estimate emission probabilities for Hidden Markov Models (HMMs). In these chapters, we have shown that these estimates led to improved performance over standard estimation techniques when a fairly simple HMM was used. However, current state-of-the-art continuous speech recognizers require HMMs with greater complexity, e.g., multiple densities per phone and/or context-dependent phone models. Will the consistent improvement we have seen in these tests be washed out in systems with more detailed models?
Hervé A. Bourlard, Nelson Morgan
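One way to extend posterior estimation to context-dependent models, in the spirit of what this chapter investigates (the book's exact formulation may differ), is to split the joint posterior into two separately estimable terms:

```latex
% Joint posterior of phone class q_k and context class c_l, factored into
% a context-independent term and a conditional context term; each factor
% can be estimated by an MLP.
\[
  P(q_k, c_l \mid x_n) \;=\; P(q_k \mid x_n)\, P(c_l \mid q_k, x_n)
\]
```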
Chapter 10. System Tradeoffs
Abstract
The results described in the preceding chapters suggest the utility of connectionist approaches for probabilistic estimation in speech recognition. However, word accuracy is only one measure of a practical speech recognition technique. Any computational method requires resources in the form of storage and communication (memory) bandwidth, as well as the ability to do the required arithmetic. The number of parameters used for a particular technique also has consequences for training. A particular design choice implies some tradeoff between these requirements. Additionally, trained systems such as those considered here may require entirely different resources for training and recognition modes, and these will be traded off in different techniques.
Hervé A. Bourlard, Nelson Morgan
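To make the storage-and-computation accounting concrete, here is a back-of-the-envelope parameter count for a one-hidden-layer MLP; the layer sizes below are invented for the example and are not figures from the book:

```python
def mlp_params(n_in, n_hidden, n_out):
    """Weights + biases of a one-hidden-layer MLP; each parameter also
    costs roughly one multiply-accumulate per forward pass."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

# Illustrative numbers only: 9 frames of 26 features as input,
# 512 hidden units, 61 phone classes.
print(mlp_params(9 * 26, 512, 61))  # -> 151613 parameters
```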
Chapter 11. Training Hardware and Software
Abstract
The previous chapter showed that the MLP probability estimation approach appears to be conservative in system resource requirements for the recognition process. In particular, a speech recognizer using MLP-based approaches scales well with more phonetic categories for requirements of storage, memory bandwidth, and numerical computation. This is particularly true when one takes into account the parameter-sharing that occurs in the hidden layer(s) of a continuous input system.
Hervé A. Bourlard, Nelson Morgan

Additional Topics

Frontmatter
Chapter 12. Cross-validation in MLP Training
Abstract
It is well known that system models which have too many parameters (with respect to the number of measurements) do not generalize well to new measurements. For instance, an autoregressive (AR) model can be derived which will represent the training data with no error by using as many parameters as there are data points. This would generally be of no value, as it would only represent the training data. Criteria such as the Akaike Information Criterion (AIC) [Akaike, 1974, 1986] can be used to penalize both the complexity of AR models and their training error variance. In feedforward nets, we do not currently have such a measure. In fact, given the aim of building systems which are biologically plausible, there is a temptation to assume the usefulness of indefinitely large adaptive networks. In contrast to our best guess at Nature’s tricks, man-made systems for pattern recognition seem to require nasty amounts of data for training. In short, the design of massively parallel systems is limited by the number of parameters that can be learned with available training data. It is likely that the only way truly massive systems can be built is with the help of prior information, e.g., connection topology and weights that need not be learned [Feldman et al., 1988]. Learning theory [Valiant, 1984; Pearl, 1978] has begun to establish what is possible for trained systems. Order-of-magnitude lower bounds have been established for the number of required measurements to train a desired size feedforward net [Baum & Haussler, 1988]. Rules of thumb suggesting the number of samples required for specific distributions could be useful for practical problems. Widrow has suggested having a training sample size that is 10 times the number of weights in a network (“Uncle Bernie’s Rule”) [Widrow, 1987].
Hervé A. Bourlard, Nelson Morgan
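A minimal sketch of the kind of cross-validation stopping rule this chapter motivates; the method names, improvement threshold, and epoch budget are illustrative assumptions, and practical variants also reduce the learning rate before stopping rather than halting outright:

```python
def train_with_cross_validation(net, train_set, cv_set,
                                max_epochs=50, min_gain=0.005):
    """Train one epoch at a time; stop when accuracy on the held-out
    cross-validation set stops improving by at least `min_gain`.
    `net` is assumed to expose train_epoch() and accuracy() methods."""
    best = net.accuracy(cv_set)
    for epoch in range(max_epochs):
        net.train_epoch(train_set)
        acc = net.accuracy(cv_set)
        if acc - best < min_gain:   # no meaningful improvement on CV set
            break                   # stop: further epochs risk overfitting
        best = acc
    return net
```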
Chapter 13. HMM/MLP and Predictive Models
Abstract
Hidden Markov models are widely used for speech recognition. However, as shown in Chapter 3, strong assumptions have to be made to render the model computationally tractable. One of these assumptions is the observation independence of the acoustic vectors. Indeed, it is usually assumed that the probability that a particular acoustic vector is emitted at a given time only depends on the current state and the current acoustic vector. As a consequence, this model does not take account of the dynamic nature of the speech signal.1
Hervé A. Bourlard, Nelson Morgan
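Written out in standard notation, the observation-independence assumption referred to above is:

```latex
% Observation independence: given the current state q_n, the emitted
% acoustic vector x_n is assumed independent of all previous vectors
% and states, discarding the temporal correlations of the speech signal.
\[
  p(x_n \mid x_1, \dots, x_{n-1},\, q_1, \dots, q_n) \;\approx\; p(x_n \mid q_n)
\]
```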
Chapter 14. Feature Extraction by MLP
Abstract
In the preceding chapters, emphasis was put on the use of MLPs as discriminant pattern classifiers for speech recognition applications. Although pattern classification plays a crucial role, it is only part of the vast speech recognition task. In spite of the spectacular progress made over the last decade, unrestricted speech recognition is still out of reach, and it is suspected that part of the difficulty lies in the use of inappropriate features for recognizing speech. A priori phonetic knowledge seems of little practical use in this respect. The elementary sounds composing speech can indeed be described by place and manner of articulation, for instance, but it seems difficult to translate this knowledge to a precise characterization at the signal level. On the other hand, one can consider that the hidden units of an MLP develop an internal representation of the input signal which is the most appropriate for the classification task. From this point of view, the MLP performs some type of feature extraction which is given by the activity levels of the hidden units. This view of an MLP as a trainable feature extractor for speech processing was described in [Rumelhart et al., 1986a], was systematically investigated in [Elman & Zipser, 1988], and was more generally the original perspective in some of the work of Rosenblatt and his students.
Hervé A. Bourlard, Nelson Morgan
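The view of hidden units as learned feature detectors can be sketched in a few lines of NumPy (reusing the illustrative weight shapes from the Chapter 4 sketch; this is not the book's code):

```python
import numpy as np

def hidden_features(x, W1, b1):
    """Return the hidden-layer activations of a trained MLP, used as a
    learned feature vector for input frame x instead of the class
    posteriors produced by the output layer."""
    return 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))  # sigmoid hidden units
```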

Finale

Frontmatter
Chapter 15. Final System Overview
Abstract
In this book, several theoretical and experimental developments related to HMMs, neural networks, and hybrid HMM/MLP systems have been presented. While one of our goals was to discuss relationships between neural networks, statistics, and linear algebra, the main aim of this book was to present the theories, experiments, and hardware that were required to develop our hybrid HMM/MLP approach to improving large vocabulary, continuous speech recognition systems.
Hervé A. Bourlard, Nelson Morgan
Chapter 16. Conclusions
Abstract
As of this writing (1993), it is still too early to describe the long-term impact of neural networks on future ASR systems. Many of us have been attracted to ANNs at least partly because they often are useful for problems in which we have little prior knowledge. However, for a problem that has been investigated for decades like speech recognition, it is quite difficult to improve state-of-the-art systems with such a simple approach. We have yet to develop specific instances where such a completely “ignorance-based” approach can be used to successfully solve difficult problems. We do, however, now have a number of examples (in addition to the one described in this book) in which neural network techniques are successfully applied to practical problems such as recognizing handwritten postal codes or predicting time series.1 Nonetheless, progress in any of these areas still requires an extensive knowledge of relevant fields; we cannot disregard what has been achieved by more traditional techniques.
Hervé A. Bourlard, Nelson Morgan
Backmatter
Metadata
Title
Connectionist Speech Recognition
Authors
Hervé A. Bourlard
Nelson Morgan
Copyright Year
1994
Publisher
Springer US
Electronic ISBN
978-1-4615-3210-1
Print ISBN
978-1-4613-6409-2
DOI
https://doi.org/10.1007/978-1-4615-3210-1