
1996 | Book

Automatic Speech and Speaker Recognition

Advanced Topics

Edited by: Chin-Hui Lee, Frank K. Soong, Kuldip K. Paliwal

Publisher: Springer US

Book series: The International Series in Engineering and Computer Science


About this book

Research in the field of automatic speech and speaker recognition has made a number of significant advances in the last two decades, influenced by advances in signal processing, algorithms, architectures, and hardware. These advances include: the adoption of a statistical pattern recognition paradigm; the use of the hidden Markov modeling framework to characterize both the spectral and the temporal variations in the speech signal; the use of a large set of speech utterance examples from a large population of speakers to train the hidden Markov models of some fundamental speech units; the organization of speech and language knowledge sources into a structural finite state network; and the use of dynamic-programming-based heuristic search methods to find the best word sequence in the lexical network corresponding to the spoken utterance.
Automatic Speech and Speaker Recognition: Advanced Topics brings together in a single volume a number of topics on speech and speaker recognition that are of fundamental importance but not yet covered in detail in existing textbooks. Although no explicit partition is given, the book is divided into five parts: Chapters 1-2 are devoted to technology overviews; Chapters 3-12 discuss acoustic modeling of fundamental speech units and lexical modeling of words and pronunciations; Chapters 13-15 address the issues related to flexibility and robustness; Chapters 16-18 concern the theoretical and practical issues of search; Chapters 19-20 give two examples of algorithm and implementation aspects for recognition system realization.
Audience: A reference book for speech researchers and graduate students pursuing research on the topic. May also be used as a text for advanced courses on the subject.

Table of contents

Frontmatter
1. An Overview of Automatic Speech Recognition
Abstract
For the past two decades, research in speech recognition has been intensively carried out worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small vocabulary keyword recognition over dial-up telephone lines, to medium size vocabulary voice interactive command and control systems on personal computers, to large vocabulary speech dictation, spontaneous speech understanding, and limited-domain speech translation. In this chapter we review some of the key advances in several areas of automatic speech recognition. We also briefly discuss the requirements in designing successful real-world applications and address technical challenges that need to be faced in order to reach the ultimate goal of providing an easy-to-use, natural, and flexible voice interface between people and machines.
L. R. Rabiner, B.-H. Juang, C.-H. Lee
2. An Overview of Speaker Recognition Technology
Abstract
This chapter overviews recent advances in speaker recognition technology. The first part of the chapter discusses general topics and issues. Speaker recognition can be categorized along two dimensions: (a) speaker identification versus verification, and (b) text-dependent versus text-independent methods. The second part of the chapter is devoted to discussion of more specific topics of recent interest which have led to interesting new approaches and techniques. They include parameter/distance normalization techniques, model adaptation techniques, VQ-/ergodic-HMM-based text-independent recognition methods, and a text-prompted recognition method. The chapter concludes with a short discussion assessing the current status and possibilities for the future.
Sadaoki Furui
3. Maximum Mutual Information Estimation of Hidden Markov Models
Abstract
This chapter describes ways in which the concept of maximum mutual information estimation (MMIE) can be used to improve the performance of HMM-based speech recognition systems. First, the basic MMIE concept is introduced with some intuition on how it works. Then we show how the concept can be extended to improve the power of the basic models. Since estimating HMM parameters with MMIE training can be computationally expensive, this problem is studied at length and some solutions proposed and demonstrated. Experiments are presented to demonstrate the usefulness of the MMIE technique.
Yves Normandin
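The MMIE criterion the chapter builds on can be illustrated numerically. The sketch below is illustrative only (the function and the toy scores are hypothetical, not taken from the chapter): it computes the log posterior of the correct word sequence against all competing hypotheses, which is exactly the quantity MMIE training increases by adjusting the HMM parameters.

```python
import math

def mmie_objective(log_lik, log_prior, correct):
    """Per-utterance MMI criterion: log posterior of the correct word
    sequence against all competing hypotheses (log-sum-exp for stability).
    log_lik[w] = log P(O | w); log_prior[w] = log P(w)."""
    terms = {w: log_lik[w] + log_prior[w] for w in log_lik}
    m = max(terms.values())
    log_denom = m + math.log(sum(math.exp(t - m) for t in terms.values()))
    return terms[correct] - log_denom

# Toy two-word vocabulary: raising the correct model's likelihood
# raises the criterion, which is what MMIE training does implicitly.
scores = {"yes": -10.0, "no": -12.0}
priors = {"yes": math.log(0.5), "no": math.log(0.5)}
obj = mmie_objective(scores, priors, "yes")
```

Unlike maximum likelihood, the denominator ties the objective to the competing hypotheses, which is where the computational expense discussed in the chapter comes from.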
4. Bayesian Adaptive Learning and MAP Estimation of HMM
Abstract
A mathematical framework for Bayesian adaptive learning of the parameters of stochastic models is presented. Maximum a posteriori (MAP) estimation algorithms are then developed for hidden Markov models and for a number of useful parametric densities commonly used in automatic speech recognition and natural language processing. The MAP formulation offers a way to combine existing prior knowledge and a small set of newly acquired task-specific data in an optimal manner. Other techniques can also be combined with Bayesian learning to improve adaptation efficiency and effectiveness.
Chin-Hui Lee, Jean-Luc Gauvain
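One of the simplest instances of the MAP formulation described above is the estimate of a Gaussian mean under a conjugate Gaussian prior: the result interpolates between the prior mean and the sample mean. A minimal sketch (illustrative; the `prior_weight` parameter, playing the role of an equivalent number of prior observations, is a simplification of the chapter's hyperparameters):

```python
def map_mean(prior_mean, prior_weight, data):
    """MAP estimate of a Gaussian mean under a conjugate Gaussian prior:
    an interpolation between the prior mean and the sample mean, with
    prior_weight acting as an equivalent number of prior observations."""
    n = len(data)
    sample_mean = sum(data) / n
    return (prior_weight * prior_mean + n * sample_mean) / (prior_weight + n)
```

With only a few adaptation samples the estimate stays close to the prior (e.g., a speaker-independent model); as data accumulates it converges to the maximum-likelihood estimate, which is the "optimal combination" behavior the abstract refers to.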
5. Statistical and Discriminative Methods for Speech Recognition
Abstract
A critical component in the pattern matching approach to speech recognition is the training algorithm, which aims at producing typical (reference) patterns or models for accurate pattern comparison. In this chapter, we discuss the issue of speech recognizer training from a broad perspective with roots in classical Bayes decision theory. We differentiate the method of classifier design by way of distribution estimation and the method of discriminative training based on the fact that in many realistic applications, such as speech recognition, the real signal distribution form is rarely known precisely. We argue that traditional methods relying on distribution estimation are suboptimal when the assumed distribution form is not the true one, and that “optimality” in distribution estimation does not automatically translate into “optimality” in classifier design. We compare the two different methods in the context of hidden Markov modeling for speech recognition. We show the superiority of the discriminative method over the distribution estimation method by providing the results of several key speech recognition experiments.
B.-H. Juang, W. Chou, C.-H. Lee
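One formulation associated with this line of work is the smoothed minimum-classification-error (MCE) loss: a sigmoid of a misclassification measure that compares the correct class score with a soft-max over competitors. The sketch below is a simplified, illustrative version (the function and parameter names are hypothetical):

```python
import math

def mce_loss(scores, correct, eta=1.0, gamma=1.0):
    """Smoothed minimum-classification-error loss for one token:
    a sigmoid of the misclassification measure d, which compares the
    correct class score with a soft-max over the competing classes."""
    comp = [s for c, s in scores.items() if c != correct]
    d = -scores[correct] + (1.0 / eta) * math.log(
        sum(math.exp(eta * s) for s in comp) / len(comp))
    return 1.0 / (1.0 + math.exp(-gamma * d))
```

Because the loss is a smooth, differentiable surrogate for the error count, it can be minimized by gradient descent, directly targeting classification error rather than distribution fit.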
6. Context-Dependent Vector Clustering for Speech Recognition
Abstract
The performance of a large vocabulary speech recognition system is critically tied to the quality of the acoustic prototypes that are established in the relevant feature space(s). This is especially true in continuous speech and/or for speaker-independent tasks, where pronunciation variability is the greatest. In this chapter, we will discuss a number of clustering techniques which can be used to derive high quality acoustic prototypes.
Jerome R. Bellegarda
7. Hidden Markov Network for Precise and Robust Acoustic Modeling
Abstract
This chapter discusses the structure of acoustic models and training algorithms for speech recognition. As is generally recognized, high acoustic model complexity demands more training data. One effective solution is tying at multiple levels such as allophone, state, distribution, or parameter levels. Tied structures such as generalized triphones, state tying, and tied mixtures have been one of the main streams of research in acoustic modeling of speech. They offer not only precise and robust modeling, but also significant computational advantage. This chapter introduces the Hidden Markov Network (HMnet) which is derived by the Successive State Splitting algorithm. The ultimate goal is an acoustic model with a fully tied acoustic structure in four levels. Vector Field Smoothing (VFS) for speaker adaptation is also discussed for more efficient training of acoustic models.
Shigeki Sagayama
8. From HMMs to Segment Models: Stochastic Modeling for CSR
Abstract
In recent years, several alternative models have been proposed to address some of the shortcomings of the hidden Markov model (HMM), currently the most popular approach to speech recognition. Many of these models, which attempt to represent trends or correlation of observations over time, can broadly be classified as segment models. This chapter describes a general probabilistic framework for segment models, including HMMs as a special case, giving options for modeling assumptions in terms of correlation structure and parameter tying and outlining the extensions to HMM recognition and training algorithms needed to handle segment models.
Mari Ostendorf
9. Voice Identification Using Nonparametric Density Matching
Abstract
Text-independent speaker recognition is often based on the premise that acoustic measurements derived from the speech utterances of an individual are characterized by stable, speaker-unique probability density functions (PDFs). This chapter describes a method of comparing speech utterances to determine whether or not the underlying PDFs are the same, hence likely to have been spoken by the same person. The method is independent of assumptions about the form of the PDFs. Based on a conjecture regarding the local relationship between probability density and nearest-neighbor distance, the algorithm is shown to measure global differences between the speakers’ underlying feature distributions. Experimental results are presented for the King telephone database.
A. Higgins, L. Bahler, J. Porter
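The nearest-neighbor idea underlying the chapter can be illustrated with a toy score (this is an illustrative sketch, not the chapter's actual algorithm): if two utterances come from the same underlying density, nearest-neighbor distances across the two feature sets look like nearest-neighbor distances within each set; if the speakers differ, cross-set distances grow.

```python
import math

def nn_distance(x, pts):
    """Euclidean distance from x to its nearest neighbor in pts
    (excluding x itself when x is a member of pts)."""
    return min(sum((a - b) ** 2 for a, b in zip(x, p)) ** 0.5
               for p in pts if p is not x)

def nn_divergence(A, B):
    """Crude symmetric score: average log ratio of cross-set to within-set
    nearest-neighbor distances.  Small when A and B appear to be drawn
    from the same density, large when they differ."""
    eps = 1e-12
    s = 0.0
    for x in A:
        s += math.log((nn_distance(x, B) + eps) / (nn_distance(x, A) + eps))
    for y in B:
        s += math.log((nn_distance(y, A) + eps) / (nn_distance(y, B) + eps))
    return s / (len(A) + len(B))
```

The appeal, as the abstract notes, is that no parametric form for the PDFs is ever assumed; only local distances are compared.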
10. The Use of Recurrent Neural Networks in Continuous Speech Recognition
Abstract
This chapter describes a use of recurrent neural networks (i.e., networks in which feedback is incorporated in the computation) as an acoustic model for continuous speech recognition. The form of the recurrent neural network is described along with an appropriate parameter estimation procedure. For each frame of acoustic data, the recurrent network generates an estimate of the posterior probability of each of the possible phones given the observed acoustic signal. The posteriors are then converted into scaled likelihoods and used as the observation probabilities within a conventional decoding paradigm (e.g., Viterbi decoding). The advantages of using recurrent networks are that they require a small number of parameters and provide a fast decoding capability (relative to conventional, large-vocabulary, HMM systems).
Tony Robinson, Mike Hochberg, Steve Renals
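The posterior-to-scaled-likelihood conversion mentioned in the abstract follows from Bayes' rule: P(acoustics | phone) is proportional to P(phone | acoustics) / P(phone), so dividing the network outputs by the phone priors gives likelihoods up to a constant that is the same for all phones. A minimal sketch (hypothetical function, toy numbers):

```python
def scaled_likelihoods(posteriors, priors):
    """Convert per-frame phone posteriors P(phone | acoustics), as produced
    by the network, into scaled likelihoods P(acoustics | phone) / P(acoustics)
    by dividing out the phone priors (Bayes' rule up to a common constant)."""
    return {ph: posteriors[ph] / priors[ph] for ph in posteriors}

# Toy frame: the frame-independent constant P(acoustics) cancels in decoding.
sl = scaled_likelihoods({"a": 0.7, "b": 0.3}, {"a": 0.5, "b": 0.5})
```

These scaled likelihoods then take the place of the HMM observation probabilities inside a conventional Viterbi decoder.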
11. Hybrid Connectionist Models for Continuous Speech Recognition
Abstract
The dominant technology for the recognition of continuous speech is based on Hidden Markov Models (HMMs). These models provide a fundamental structure that is powerful and flexible, but the probability estimation techniques used with these models typically suffer from a number of significant limitations. Over the last few years, we have demonstrated that fairly simple Multi-Layered Perceptrons (MLPs) can be discriminatively trained to estimate emission probabilities for HMMs. Simple context-independent systems based on this approach have performed very well on large vocabulary continuous speech recognition. This chapter will briefly review the fundamentals of HMMs and MLPs, and will then describe a form of hybrid system that has some discriminant properties.
Hervé Bourlard, Nelson Morgan
12. Automatic Generation of Detailed Pronunciation Lexicons
Abstract
We explore different ways of “spelling” a word in a speech recognizer’s lexicon and how to obtain those spellings. In particular, we compare using as the source of sub-word units for which we build acoustic models (1) a coarse phonemic representation, (2) a single, fine phonetic realization, and (3) multiple phonetic realizations with associated likelihoods. We describe how we obtain these different pronunciations from text-to-speech systems and from procedures that build decision trees trained on phonetically-labeled corpora. We evaluate these methods applied to speech recognition with the DARPA Resource Management (RM) and the North American Business News (NAB) tasks. For the RM task (with perplexity 60 grammar), we obtain 93.4% word accuracy using phonemic pronunciations, 94.1% using a single phonetic pronunciation per word, and 96.3% using multiple phonetic pronunciations per word with associated likelihoods. For the NAB task (with 60K vocabulary and 34M 1–5 grams), we obtain 87.3% word accuracy with phonemic pronunciations and 90.0% using multiple phonetic pronunciations.
Michael D. Riley, Andrej Ljolje
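Option (3) in the abstract, multiple phonetic realizations with associated likelihoods, can be pictured as a weighted lexicon. The entries and phone symbols below are hypothetical, just to show the data shape:

```python
import math

# Hypothetical lexicon: each word maps to phonetic "spellings"
# with associated log probabilities, as in option (3) above.
lexicon = {
    "data": [("d ey t ax", math.log(0.7)), ("d ae t ax", math.log(0.3))],
}

def pronunciation_logprob(word, phones):
    """Log probability the lexicon assigns to a particular phonetic
    realization of a word (-inf if the variant is not listed)."""
    for variant, logp in lexicon.get(word, []):
        if variant == phones:
            return logp
    return float("-inf")
```

During decoding, the variant's log probability is simply added to the acoustic score of the corresponding phone sequence, so likely pronunciations are preferred without excluding rarer ones.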
13. Word Spotting from Continuous Speech Utterances
Abstract
There are many speech recognition applications that require only partial information to be extracted from a speech utterance. These applications include human-machine interactions where it may be difficult to constrain users’ utterances to be within the domain of the machine. Other types of applications that are of interest are those where speech utterances arise from human-human interaction, interaction with speech messaging systems, or any other domain that can be characterized as being unconstrained or spontaneous. This chapter is concerned with the problem of spotting keywords in continuous speech utterances. Many important speech input applications involving word spotting will be described. The chapter will also discuss Automatic Speech Recognition (ASR) problems that are particularly important in word spotting applications. These problems include rejection of out-of-vocabulary utterances, derivation of measures of confidence, and the development of efficient and flexible search algorithms.
Richard C. Rose
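A common form of the confidence measures the abstract mentions is a duration-normalized log likelihood ratio between a keyword model and a background ("filler") model. The sketch below is illustrative, not the chapter's specific method:

```python
def keyword_confidence(frame_log_liks_kw, frame_log_liks_bg):
    """Duration-normalized log likelihood ratio used as a keyword
    confidence score: the average per-frame keyword-vs-background
    log score over the putative keyword segment."""
    T = len(frame_log_liks_kw)
    return sum(k - b for k, b in zip(frame_log_liks_kw, frame_log_liks_bg)) / T
```

Putative hits whose confidence falls below a threshold are rejected, which is how out-of-vocabulary speech is filtered out without a full transcription.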
14. Spectral Dynamics for Speech Recognition Under Adverse Conditions
Abstract
Significant improvements in automatic speech recognition performance have been obtained through front-end feature representations which exploit the time varying properties of speech spectra. Various techniques have been developed to incorporate “spectral dynamics” into the speech representation, including temporal derivative features, spectral mean normalization and, more generally, spectral parameter filtering. This chapter describes the implementation and interrelationships of these techniques and illustrates their use in automatic speech recognition under different types of adverse conditions.
Brian A. Hanson, Ted H. Applebaum, Jean-Claude Junqua
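The temporal derivative ("delta") features mentioned above are conventionally computed by linear regression over a window of frames. A minimal sketch over scalar coefficients (real front-ends apply this per cepstral dimension; edge handling by repetition is one common choice):

```python
def delta_features(frames, K=2):
    """Regression-based temporal derivative ('delta') coefficients:
    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    with the sequence padded by repeating the edge frames."""
    denom = 2 * sum(k * k for k in range(1, K + 1))
    T = len(frames)
    def at(t):
        return frames[min(max(t, 0), T - 1)]  # clamp to valid range
    return [sum(k * (at(t + k) - at(t - k)) for k in range(1, K + 1)) / denom
            for t in range(T)]
```

Appending such deltas (and often delta-deltas) to the static features is one of the simplest ways to inject the spectral dynamics the chapter describes.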
15. Signal Processing for Robust Speech Recognition
Abstract
This chapter compares several different approaches to robust automatic speech recognition. We review ongoing research in the use of acoustical pre-processing to achieve robust speech recognition, discussing and comparing approaches based on direct cepstral comparisons, on parametric models of environmental degradation, and on cepstral high-pass filtering. We also describe and compare the effectiveness of two complementary methods of signal processing for robust speech recognition: microphone array processing and the use of physiologically-motivated models of peripheral auditory processing. This chapter includes comparisons of recognition error rates obtained when the various signal processing algorithms considered are used to process inputs to CMU’s SPHINX speech recognition system.
Richard M. Stern, Alejandro Acero, Fu-Hua Liu, Yoshiaki Ohshima
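A simple baseline among the cepstral compensation methods the chapter compares is utterance-level cepstral mean normalization: subtracting the per-coefficient mean removes a stationary convolutional channel, since a linear filter adds a constant offset in the cepstral domain. A minimal sketch (illustrative, operating on lists of cepstral vectors):

```python
def cepstral_mean_normalize(frames):
    """Subtract the utterance-level mean from each cepstral coefficient,
    removing a stationary convolutional channel component."""
    dims = len(frames[0])
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]
```

This is a crude, batch form of the cepstral high-pass filtering discussed in the abstract; the parametric environmental models the chapter covers go further by modeling additive noise as well.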
16. Dynamic Programming Search Strategies: From Digit Strings to Large Vocabulary Word Graphs
Abstract
This chapter gives an overview of the dynamic programming (DP) search strategy for large-vocabulary, continuous-speech recognition. Starting with the basic one-pass algorithm for word string recognition, we extend the search strategy to vocabularies of 20,000 words and more by using a tree organization of the vocabulary. Finally, we describe how this predecessor-conditioned algorithm can be refined to produce high-quality word graphs. This method has been tested successfully on the Wall Street Journal task (American English, continuous speech, 20,000 words, speaker independent).
Hermann Ney, Xavier Aubert
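The time-synchronous DP recursion at the core of the one-pass strategy is the Viterbi algorithm: at each frame, each state keeps the best-scoring predecessor. The sketch below is a minimal state-level version (real decoders add word boundaries, tree-organized lexica, and pruning on top of this recursion):

```python
def viterbi(obs_logprobs, log_trans, log_init):
    """Time-synchronous DP over a small state space: obs_logprobs[t][s] is
    the log observation score of state s at frame t.  Returns the best
    state sequence and its log score."""
    T, S = len(obs_logprobs), len(log_init)
    delta = [log_init[s] + obs_logprobs[0][s] for s in range(S)]
    backptr = []
    for t in range(1, T):
        row, new_delta = [], []
        for s in range(S):
            best = max(range(S), key=lambda p: delta[p] + log_trans[p][s])
            row.append(best)
            new_delta.append(delta[best] + log_trans[best][s] + obs_logprobs[t][s])
        backptr.append(row)
        delta = new_delta
    last = max(range(S), key=lambda s: delta[s])
    path = [last]
    for row in reversed(backptr):  # trace back through the best predecessors
        path.append(row[path[-1]])
    path.reverse()
    return path, delta[last]
```

Because the recursion is strictly left-to-right in time, beam pruning and word-graph construction can be layered on without changing its structure.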
17. Fast Match Techniques
Abstract
Real-time speech recognition systems often make use of a fast approximate match to quickly prune the search space to a manageable size. In this chapter we discuss several issues in connection with such fast match techniques.
P. S. Gopalakrishnan, L. R. Bahl
18. Multiple-Pass Search Strategies
Abstract
Large vocabulary speech recognition is very expensive computationally. We explore multi-pass search strategies as a way to reduce computation substantially, without any increase in error rate. We consider two basic strategies: the N-best Paradigm, and the Forward-Backward search. Both of these strategies operate on the entire sentence in (at least) two passes. The N-best Paradigm computes alternative hypotheses for a sentence, which can later be rescored using more detailed and more expensive knowledge sources. We present and compare many algorithms for finding the N-best sentence hypotheses, and suggest which are the most efficient and accurate. The Forward-Backward Search performs a time-synchronous forward search that finds all of the words that are likely to end at each frame within an utterance. Then, a second more expensive search can be performed in the backward direction, restricting its attention to those words found in the forward pass.
Richard Schwartz, Long Nguyen, John Makhoul
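For a tiny hypothesis space, the N-best idea can be shown by exhaustively scoring all hypotheses and keeping the top N; the chapter's algorithms achieve the same result within the DP search itself, without enumeration. An illustrative sketch (hypothetical function and toy scores):

```python
import heapq
import itertools
import math

def n_best(word_alternatives, n):
    """Exhaustive N-best over a tiny hypothesis space: each sentence
    position offers (word, log score) alternatives; returns the n
    highest-scoring word sequences with their total log scores."""
    hyps = []
    for combo in itertools.product(*word_alternatives):
        words = [w for w, _ in combo]
        score = sum(s for _, s in combo)
        hyps.append((score, words))
    return heapq.nlargest(n, hyps)
```

The resulting hypothesis list is exactly what gets rescored in a later pass with more detailed, more expensive knowledge sources.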
19. Issues in Practical Large Vocabulary Isolated Word Recognition: The IBM Tangora System
Abstract
The IBM TANGORA was the first real-time PC-based large vocabulary isolated word dictation system [14]. Its development and eventual productization in the form of the IBM Personal Dictation System required substantial innovation in all areas of speech recognition, from signal processing to language modeling. This chapter describes some of the algorithmic techniques that had to be developed in order to create a dictation system that could actually be used by real users to create text.
S. K. Das, M. A. Picheny
20. From Sphinx-II to Whisper — Making Speech Recognition Usable
Abstract
In this chapter, we first review Sphinx-II, a large-vocabulary speaker-independent continuous speech recognition system developed at Carnegie Mellon University, summarizing the techniques that helped Sphinx-II achieve state-of-the-art recognition performance. We then review Whisper, a system developed at Microsoft Corporation, focusing on recognition accuracy, efficiency, and usability. These three issues are critical to the success of commercial speech applications, and Whisper has significantly improved its performance in all three areas. It can be configured as a spoken language front-end (telephony or desktop) or as a dictation application.
X. Huang, A. Acero, F. Alleva, M. Hwang, L. Jiang, M. Mahajan
Backmatter
Metadata
Title
Automatic Speech and Speaker Recognition
Edited by
Chin-Hui Lee
Frank K. Soong
Kuldip K. Paliwal
Copyright year
1996
Publisher
Springer US
Electronic ISBN
978-1-4613-1367-0
Print ISBN
978-1-4612-8590-8
DOI
https://doi.org/10.1007/978-1-4613-1367-0