
About this book

The term speech processing refers to the scientific discipline concerned with the analysis and processing of speech signals to obtain the greatest benefit in various practical scenarios. These practical scenarios correspond to a large variety of applications of speech processing research. Examples of such applications include enhancement, coding, synthesis, recognition and speaker recognition. The field has grown very rapidly, particularly during the past ten years, owing to the efforts of many leading scientists. The ideal aim is to develop algorithms for a given task that maximize performance, are computationally feasible and are robust to a wide class of conditions. The purpose of this book is to provide a cohesive collection of articles that describe recent advances in various branches of speech processing. The main focus is on describing specific research directions through a detailed analysis and review of both the theoretical and practical settings. The intended audience includes graduate students who are embarking on speech research as well as experienced researchers already working in the field. For graduate students taking a course, this book serves as a supplement to the course material. As the student focuses on a particular topic, the corresponding set of articles in this book will serve as an initiation through exposure to research issues and by providing an extensive reference list with which to begin a literature survey. Experienced researchers can use this book as a reference guide and can expand their horizons in this rather broad area.



Speech Coding


Chapter 1. The Use of Pitch Prediction in Speech Coding

Two major types of correlations are present in a speech signal. These are known as near-sample redundancies and distant-sample redundancies. Near-sample redundancies are those which are present among speech samples that are close together. Distant-sample redundancies are due to the inherent periodicity of voiced speech. Predictive speech coders make use of these correlations in the speech signal to enhance coding efficiency. In predictive speech coders, a cascade of two nonrecursive prediction error filters processes the original speech signal. The formant filter removes near-sample redundancies. The pitch filter acts on distant-sample waveform similarities. The result is a residual signal with little sample-to-sample correlation. The parameters that are quantized and coded for transmission include the filter coefficients and the residual signal. From the coded parameters, the receiver decodes the speech by passing the quantized residual through a pitch synthesis filter and a formant synthesis filter. The filtering steps at the receiver can be viewed in the frequency domain as first inserting the fine pitch structure and then shaping the spectral envelope to insert the formant structure. The formant and pitch filters are adaptive in that the analysis to determine the coefficients is carried out frame by frame. Also, the bits representing the quantized parameters are transmitted on a frame-by-frame basis. The bit rate of the coder is the total number of bits transmitted in one frame divided by the time duration of the analysis frame.
Ravi P. Ramachandran
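The pitch filtering and bit-rate arithmetic described in the Chapter 1 abstract can be illustrated with a minimal Python sketch. It is not taken from the chapter: the function names, the single-tap filter form, and the fixed `lag` and `gain` values are assumptions made for illustration. A real coder would estimate these parameters frame by frame and would also include the formant filter.

```python
def bit_rate_bps(bits_per_frame, frame_duration_s):
    """Bit rate = bits transmitted in one frame / duration of the frame."""
    return bits_per_frame / frame_duration_s


def pitch_prediction_error(signal, lag, gain):
    """One-tap nonrecursive pitch prediction error filter (assumed form):
    e[n] = x[n] - gain * x[n - lag], with x[n] = 0 for n < 0."""
    residual = []
    for n, x in enumerate(signal):
        past = signal[n - lag] if n >= lag else 0.0
        residual.append(x - gain * past)
    return residual


def pitch_synthesis(residual, lag, gain):
    """Recursive inverse of the filter above:
    y[n] = e[n] + gain * y[n - lag]."""
    output = []
    for n, e in enumerate(residual):
        past = output[n - lag] if n >= lag else 0.0
        output.append(e + gain * past)
    return output


# A toy "voiced" signal that repeats exactly every 4 samples.
x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
e = pitch_prediction_error(x, lag=4, gain=1.0)   # second period is predicted away
y = pitch_synthesis(e, lag=4, gain=1.0)          # receiver restores the periodicity
rate = bit_rate_bps(bits_per_frame=160, frame_duration_s=0.020)
```

With a perfectly periodic input and a matching lag, the residual is zero after the first period, which is the redundancy removal the abstract refers to. The 160 bits and 20 ms frame are illustrative numbers only.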

Chapter 2. Vector Quantization of Linear Predictor Coefficients

Quantization of data is performed to reduce the bit rate required either for storage of data, or transmission between two or more communicators. Linear prediction (LP) is an efficient way to represent the short term spectrum of the speech signal. The line spectral frequency (LSF) transformation provides benefits over other linear prediction representations such as reflection coefficients, arc sine reflection coefficients, or log area ratios. Vector quantization of the filter parameters allows for a larger reduction in the bit rate needed to represent a set of parameters than scalar quantization does. This reduction comes at the expense of greater computational complexity and greater amounts of storage. To reduce the burden of both storage and computation, various techniques and procedures have been developed. These techniques include full, multistage and split vector quantization, as well as the adaptive, variable dimension and finite state vector quantizers [1-8].
John S. Collura
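As a rough illustration of the storage trade-off mentioned in the Chapter 2 abstract, the following Python sketch encodes a parameter vector with split vector quantization. It is a sketch under stated assumptions, not the chapter's method: the tiny codebooks, the squared-error distance, and all function names are hypothetical. A practical LSF quantizer would use trained codebooks and often a weighted distance.

```python
def squared_distance(a, b):
    """Plain squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(vector, codebook):
    """Exhaustive search for the codeword closest to the input vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def split_vq_encode(vector, codebooks):
    """Quantize consecutive sub-vectors, each with its own codebook.
    The sub-vector length is taken from the codeword length."""
    indices, start = [], 0
    for codebook in codebooks:
        size = len(codebook[0])
        indices.append(nearest_index(vector[start:start + size], codebook))
        start += size
    return indices


def split_vq_decode(indices, codebooks):
    """Concatenate the selected codewords to rebuild the full vector."""
    decoded = []
    for index, codebook in zip(indices, codebooks):
        decoded.extend(codebook[index])
    return decoded


# Two hypothetical 1-bit codebooks, each covering half of a 4-dimensional vector.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
indices = split_vq_encode([0.9, 1.2, 0.1, 0.3], codebooks)
reconstruction = split_vq_decode(indices, codebooks)
```

The motivation for splitting is codebook size: a single codebook addressed by B bits holds 2^B codewords, whereas two codebooks addressed by B/2 bits each hold only 2 * 2^(B/2) codewords in total, at the cost of ignoring dependencies between the two halves.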

Chapter 3. Linear Predictive Analysis by Synthesis Coding

The availability of inexpensive signal processing chips and a demand for efficient digital representations of speech signals have led to an increase in applications for speech coding. Some examples are: wired and wireless networks, voice encryption, videophones, simultaneous voice and data transmission, multimedia, announcements, and solid state answering machines. In the past 10 years, many digital speech coding standards have been defined for network and wireless applications. Most of these standards are based on the linear prediction based analysis by synthesis (LPAS) paradigm. LPAS coders provide state-of-the-art performance for bit rates in the range between 4 and 16 kb/s. This chapter will discuss this paradigm and related topics, and will focus on issues not discussed in some of the accompanying chapters in this book [1, 2, 3].
Peter Kroon, W. Bastiaan Kleijn

Chapter 4. Waveform Interpolation

Waveform interpolation (WI) has proved to be an efficient procedure for high quality coding of speech at low bit rates. In this method, the speech signal is described by a sequence of characteristic waveforms, which are interpolated during reconstruction. Originally, the characteristic waveform was identified with a pitch cycle, and WI was applied to voiced speech segments only. A number of implementations, which use CELP for unvoiced signal segments, showed that the procedure can provide high performance. Recently, the method was extended to include unvoiced speech and background noise. For this purpose, the characteristic waveform is decomposed into a slowly evolving waveform (representing the periodic component of the signal) and a rapidly evolving waveform (representing the other components of the signal). The rapidly evolving waveform requires high time resolution and only low quantization accuracy, while the slowly evolving waveform requires less time resolution and a more precise description. With this decomposition, switching between different coding models is avoided, and a robust coding method results.
Jesper Haagen, W. Bastiaan Kleijn

Chapter 5. Variable Rate Speech Coding

An important goal in the design of voice communication networks and storage systems is to maximize capacity while maintaining an acceptable level of voice quality. Conventional speech coding systems use a fixed bit rate regardless of factors such as local speech statistics, transmission channel conditions, or network load. One method of maximizing capacity while maintaining an acceptable level of speech quality is to allow the bit rate to vary as a function of these factors. Variable rate speech coders exploit two important characteristics of speech communications: the large percentage of silence during conversations, and the large local changes in the minimal rate required to achieve a given speech reproduction quality.
Vladimir Cuperman, Peter Lupini

Speech Recognition


Chapter 6. Word Spotting

Word spotting has been an active area of speech recognition for over twenty years. Although it initially addressed applications requiring the scanning of audio data for occurrences of particular keywords, the technology has become an effective approach to speech recognition for a wide range of applications. The term “word spotting” is now used to refer to a variety of techniques that are useful in speech recognition applications where relevant information, such as a command, must be recognized even when it is embedded in irrelevant speech input or other audio interference, or when the desired information may not be present. The related areas of filler modeling and out-of-set rejection share many of the same underlying technical problems and approaches to word spotting. Depending on the particular application, different types and combinations of word spotting techniques are appropriate and effective. Most recently, a variety of statistical modeling techniques have provided higher accuracy than previous approaches. Many of these techniques share aspects, such as use of hidden Markov models (HMMs) and statistical language models, with other areas of speech recognition. This chapter presents a survey of various approaches to word spotting and related areas, suggests appropriate applications of these approaches, and identifies unresolved research problems.
Jan Robin Rohlicek

Chapter 7. Speech Recognition Using Neural Networks

The field of artificial neural networks has grown rapidly in recent years. This has been accompanied by a resurgence of work in speech recognition. Most speech recognition research has centered on stochastic models, in particular the use of hidden Markov models (HMMs) [9][28][29][30][45][47]. Alternate techniques have focused on applying neural networks to classify speech signals [6][11][48]. The inspiration for using neural networks as classifiers stems from the fact that neural networks within the human brain are used for speech recognition. This analogy unfortunately falls short of being close to an actual model of the brain, but the modeling mechanism and the training procedures allow the possibility of using a neural network as a stochastic model that can be discriminatively trained.
Stephen V. Kosonocky

Chapter 8. Current Methods in Continuous Speech Recognition

Several significant advances have been made in continuous speech recognition over the last few years. In this chapter, we will discuss some of the current techniques in feature extraction and modeling for large vocabulary continuous speech recognition.
P. S. Gopalakrishnan

Chapter 9. Large Vocabulary Isolated Word Recognition

Many applications require recognition of spoken isolated words or phrases from a large vocabulary. For example, the goal of the 86000-word recognizer at INRS-Télécommunications [14] is to transcribe speech spoken as a sequence of isolated words. The sentences to be read are chosen arbitrarily from a variety of sources, including newspapers, books, magazines, etc. Another example is the StockTalk system running at BNR Montreal [24] which dispenses real time stock quotes by voice over the telephone for stocks traded in New York, Toronto and NASDAQ stock exchanges. The vocabulary for this system consists of words or phrases spoken in isolation. This system requires speaker-independent recognition over the telephone, while the first example requires speaker-dependent recognition over high quality microphones.
Vishwa Gupta, Matthew Lennig

Chapter 10. Recent Developments in Robust Speech Recognition

Robust speech recognition refers to the problem of designing an automatic speech recognizer that works well in a wide range of unexpected or adverse environments. As the technology of automatic speech recognition moves out of the laboratories into field applications, the issue of robustness becomes a key element that distinguishes a successful deployment from a failed one.
B. H. Juang

Chapter 11. How do Humans Process and Recognize Speech?

Until the performance of automatic speech recognition (ASR) hardware surpasses human performance in accuracy and robustness, we stand to gain by understanding the basic principles behind human speech recognition (HSR). This problem was studied exhaustively at Bell Labs between the years of 1918 and 1950 by Harvey Fletcher and his colleagues. The motivation for these studies was to quantify the quality of speech sounds in the telephone plant to improve both speech intelligibility and preference. To do this, he and his group studied the effects of filtering and noise on speech recognition accuracy for nonsense consonant-vowel-consonant (CVC) syllables, words, and sentences. Fletcher used the term articulation as the probability of correct recognition for nonsense speech sounds, and intelligibility as the probability of correct recognition for words (sounds having meaning). In 1919 Fletcher sought a nonlinear transformation A(s) of the articulation s for filtered and unfiltered speech defined to give an additive articulation density function D(f) over frequency. The area under D(f) is called the Articulation Index. The resulting transformation was shown to accurately predict the average articulation. Fletcher then went on to find relationships between the recognition errors for the nonsense speech sounds, words, and sentences. This work has recently been reviewed and partially replicated by Boothroyd and by Bronkhorst et al. Taken as a whole, these studies tell us a great deal about how humans process and recognize speech sounds.
Jont B. Allen

Speaker Recognition


Chapter 12. Data Fusion Techniques for Speaker Recognition

Speaker recognition refers to the capability of recognizing a person based on his or her voice. Specifically, this consists of either speaker verification or speaker identification. The objective of speaker verification is to verify a person’s claimed identity based on a sample of speech from that person. The objective of speaker identification is to use a person’s voice to identify that person among a predetermined set of people.
Kevin R. Farrell, Richard J. Mammone

Chapter 13. Speaker Recognition Over Telephone Channels

Being able to verify or determine the identity of a person by voice is very useful in many applications. For example, in telephone banking or calling card charging, user identity must be verified before the transaction can be authorized. Most systems use a PIN for authorization, but it can be forgotten or stolen. Although other methods of authorization, such as fingerprints or retinal scans, may be more secure than a PIN, they are not practical in many situations.
Yu-Hung Kao, Lorin Netsch, P. K. Rajasekaran

Text to Speech Synthesis


Chapter 14. Approaches to Improve Automatic Speech Synthesis

For several years now, there have been automatic text-to-speech systems for several languages which yield intelligible but unnatural synthetic speech. Quality inferior to that of human speech is usually due to inadequate modeling of human speech production in coarticulation, intonation, and vocal-tract excitation. We will examine the current approaches in these areas, discuss the compromises that are often made, and suggest ways for improvement.
Douglas O’Shaughnessy

Applications of Models


Chapter 15. Microphone Array for Hands-Free Voice Communication in a Car

We present the results of our research on developing a speech acquisition and enhancement system so that a speech recognizer can be used reliably inside a noisy automobile environment for the digital cellular telephone application. Our research results have demonstrated that a beamforming method with a microphone array is a reliable approach for hands-free voice dialing applications in the car noise environment. Two beamforming algorithms were investigated in our research: the delay-and-sum beamformer and the generalized sidelobe canceler. Performance evaluation results for these two algorithms are presented for a speech database collected in real cars under different noise conditions. Signal-to-noise ratio and speech recognition error rate were used as performance measures in these evaluations.
Stephen Oh, Vishu Viswanathan
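To make the delay-and-sum idea named in the Chapter 15 abstract concrete, here is a minimal Python sketch. It is not the authors' system: it assumes the per-microphone arrival delays are already known and are whole numbers of samples, and the function name and toy signals are hypothetical. The generalized sidelobe canceler is not shown.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its known arrival delay (in samples)
    and average the aligned channels.

    channels: list of equal-length sample lists, one per microphone
    delays:   how many samples each channel lags the reference channel
    Samples that would fall outside a channel are treated as zero.
    """
    length = len(channels[0])
    output = []
    for t in range(length):
        total = 0.0
        for channel, delay in zip(channels, delays):
            index = t + delay            # advance the lagging channel
            if 0 <= index < length:
                total += channel[index]
        output.append(total / len(channels))
    return output


# Toy example: the same source reaches microphone 1 one sample later.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 0.0]
mic0 = source
mic1 = [0.0] + source[:-1]
enhanced = delay_and_sum([mic0, mic1], delays=[0, 1])
```

After alignment the desired speech adds coherently across microphones, while noise that is uncorrelated between microphones is averaged down, which is why the output signal-to-noise ratio tends to improve as microphones are added.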

Chapter 16. The Pitch Mode Modulation Model and Its Application in Speech Processing

The techniques currently used for speech coding or enhancement critically depend upon some form of statistical stationarity, either in the speech signal, the noise signal, or both, in order to accomplish the coding or enhancement. Virtually all speech processing techniques utilize a speech model to reduce the amount of information necessary to characterize the speech signal. Although the speech signal is known to be highly redundant, it is also non-stationary. This non-stationarity requires that the parameters of these models be extracted from short duration signal segments, where the stationarity assumption in the models is not seriously violated. Unfortunately, the use of short speech frames makes the estimation of the model parameters difficult and sometimes obscures the very redundancy the model was based on. The use of a longer frame size is desirable for many signal processing techniques that require increased frequency domain resolution.
Michael A. Ramalho, Richard J. Mammone

Chapter 17. Auditory Models and Human Performance in Tasks Related to Speech Coding and Speech Recognition

Auditory models that are capable of achieving human performance in tasks related to speech perception would provide a basis for realizing effective speech processing systems. Saving bits in speech coders, for example, relies on a perceptual tolerance to acoustic deviations from the original speech. Perceptual invariance to adverse signal conditions (noise, microphone and channel distortions, room reverberations) and to phonemic variability (due to non-uniqueness of articulatory gestures) may provide a basis for robust speech recognition. In this paper we describe a state-of-the-art auditory model that simulates, in considerable detail, the outer parts of the auditory periphery up through the auditory nerve level. Speech information is extracted from the simulated auditory nerve firings, and used in place of the conventional input to several speech coding and recognition systems. The performance of these systems improves as a result of this replacement, but is still short of achieving human performance. The shortcomings occur, in particular, in tasks related to low bit-rate coding and to speech recognition. Since schemes for low bit-rate coding rely on signal manipulations that spread over durations of several tens of ms, and since schemes for speech recognition rely on phonemic/articulatory information that extends over similar time intervals, it is concluded that the shortcomings are due mainly to a lack of perceptually related integration rules over durations of 50-100 ms. These observations suggest a need for a study aimed at understanding how auditory nerve activity is integrated over time intervals of that duration. We discuss preliminary experimental results that confirm human usage of such integration, with different integration rules for different time-frequency regions depending on the phoneme-discrimination task.
Oded Ghitza

Chapter 18. Applications of Wavelets to Speech Processing: A Case Study of a CELP Coder

Wavelets are a new family of orthogonal basis functions for representing finite energy signals. In this chapter, we provide a brief review of wavelets and their properties. We cite a number of applications of wavelets to speech processing that have been proposed recently. As a detailed case study in wavelet applications, we present our work on a wavelet-transform-based CELP coder design for high-quality speech coding at about 4.8 kbits/s. The coder quantizes the second residual using a wavelet transform approach instead of the stochastic-codebook-based vector quantization normally used in CELP coders, including the U.S. Federal Standard FS 1016 coder at 4.8 kbits/s. The wavelet coder improves the computational efficiency for encoding the second residual by requiring only 1.2 MIPS instead of 8.3 MIPS required by FS 1016. Subjective speech quality tests involving pairwise comparisons show that the wavelet coder was preferred 61% of the time over FS 1016.
James Ooi, Vishu Viswanathan

