
2008 | Book

Springer Handbook of Speech Processing

Edited by: Prof. Jacob Benesty, Dr., Prof. M. Mohan Sondhi, Ph.D., Prof. Yiteng (Arden) Huang, Dr.

Publisher: Springer Berlin Heidelberg


About this Book

From common consumer products such as cell phones and MP3 players to more sophisticated projects such as human-machine interfaces and responsive robots, speech technologies are now everywhere. Many think that it is just a matter of time before more applications of the science of speech become inescapable in our daily life. This handbook is meant to play a fundamental role in supporting sustainable progress in speech research and development. Springer Handbook of Speech Processing targets three categories of readers: graduate students, professors and active researchers in academia and research labs, and engineers in industry who need to understand or implement specific algorithms for their speech-related products. The handbook can also be used as a sourcebook for one or more graduate courses on signal processing for speech and different aspects of speech processing and applications. A quickly accessible source of application-oriented, authoritative, and comprehensive information about these technologies, it combines the established knowledge derived from research in such fast-evolving disciplines as signal processing and communications, acoustics, computer science, and linguistics.

Table of Contents

Frontmatter

Introduction

1. Introduction to Speech Processing

In this brief introduction we outline some major highlights in the history of speech processing. We briefly describe some of the important applications of speech processing. Finally, we introduce the reader to the various parts of this handbook.

Jacob Benesty, M. Mohan Sondhi, Yiteng (Arden) Huang

Production, Perception, and Modeling of Speech

Frontmatter
2. Physiological Processes of Speech Production

Speech sound is a wave of air that originates from complex actions of the human body, supported by three functional units: generation of air pressure, regulation of vibration, and control of resonators. The lung air pressure for speech results from functions of the respiratory system during a prolonged phase of expiration after a short inhalation. Vibrations of air for voiced sounds are introduced by the vocal folds in the larynx; they are controlled by a set of laryngeal muscles and airflow from the lungs. The oscillation of the vocal folds converts the expiratory air into intermittent airflow pulses that result in a buzzing sound. The narrow constrictions of the airway along the tract above the larynx also generate transient source sounds; their pressure gives rise to an airstream with turbulence or burst sounds. The resonators are formed in the upper respiratory tract by the pharyngeal, oral, and nasal cavities. These cavities act as resonance chambers to transform the laryngeal buzz or turbulence sounds into sounds with special linguistic functions. The main articulators are the tongue, lower jaw, lips, and velum. They generate patterned movements to alter the resonance characteristics of the supra-laryngeal airway. In this chapter, contemporary views on phonatory and articulatory mechanisms are summarized to illustrate the physiological processes of speech production, with brief notes on their observation techniques.

Kiyoshi Honda
3. Nonlinear Cochlear Signal Processing and Masking in Speech Perception

There are many classes of masking, but two major classes are easily defined: neural masking and dynamic masking. Neural masking characterizes the internal noise associated with the neural representation of the auditory signal, a form of loudness noise. Dynamic masking is strictly cochlear, and is associated with cochlear outer-hair-cell processing. This form is responsible for the dynamic nonlinear cochlear gain changes associated with sensorineural hearing loss, the upward spread of masking, two-tone suppression, and forward masking. The impact of these various forms of masking is critical to our understanding of speech and music processing. In this review, the details of what we know about nonlinear cochlear and basilar-membrane (BM) signal processing are reviewed, and the implications of neural masking are modeled, with a comprehensive historical review of the masking literature. This review is appropriate for a series of graduate lectures on nonlinear cochlear speech and music processing, from an auditory point of view.

Jont B. Allen
4. Perception of Speech and Sound

The transformation of acoustical signals into auditory sensations can be characterized by psychophysical quantities such as loudness, tonality, or perceived pitch. The resolution limits of the auditory system produce spectral and temporal masking phenomena and impose constraints on the perception of amplitude modulations. Binaural hearing (i.e., utilizing the acoustical difference across both ears) employs interaural time and intensity differences to produce localization and binaural unmasking phenomena such as the binaural intelligibility level difference, i.e., the speech reception threshold difference between listening to speech in noise monaurally versus listening with both ears. The acoustical information available to the listener for perceiving speech even under adverse conditions can be characterized using the articulation index (AI), the speech transmission index (STI), and the speech intelligibility index (SII). These measures can objectively predict speech reception thresholds as a function of spectral content, signal-to-noise ratio, and preservation of amplitude modulations in the speech waveform that enters the listenerʼs ear. The articulatory or phonetic information available to and received by the listener can be characterized by speech feature sets. Transinformation analysis allows one to detect the relative transmission error connected with each of these speech features. The comparison between man and machine in speech recognition allows one to test hypotheses and models of human speech perception. Conversely, automatic speech recognition may be improved by introducing human signal-processing principles into machine processing algorithms.

Birger Kollmeier, Thomas Brand, Bernd Meyer
5. Speech Quality Assessment

In this chapter, we provide an overview of methods for speech quality assessment. First, we define the term speech quality and outline in Sect. 5.1 the main causes of degradation of speech quality. Then, we discuss subjective test methods for quality assessment, with a focus on standardized methods. Section 5.3 is dedicated to objective algorithms for quality assessment. We conclude the chapter with a reference table containing common quality assessment scenarios and the corresponding most suitable methods for quality assessment.

Volodya Grancharov, W. Bastiaan Kleijn

Signal Processing for Speech

Frontmatter
6. Wiener and Adaptive Filters

The Wiener filter, named after its inventor, has been an extremely useful tool since its invention in the early 1930s. This optimal filter is popular not only in different aspects of speech processing but also in many other applications. This chapter presents the most fundamental results of the Wiener theory, with an emphasis on the Wiener-Hopf equations, which are not convenient to solve in practice. An alternative to solving these equations directly is the use of an adaptive filter, which is why this chapter also describes the most classical adaptive algorithms that are able to converge, in a reasonable amount of time, to the optimal Wiener filter.
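As an illustration of the adaptive route to the Wiener solution mentioned above, the following minimal Python sketch implements the classical least-mean-squares (LMS) algorithm; the signal names x (filter input) and d (desired response), the filter order, and the step size mu are illustrative choices, not values taken from the chapter.

import numpy as np

def lms(x, d, order=16, mu=0.01):
    # Adapt an FIR filter w so that its output tracks the desired signal d(n);
    # for a suitable step size mu, w converges toward the Wiener solution.
    w = np.zeros(order)
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for n in range(order, len(x)):
        x_n = x[n - order + 1:n + 1][::-1]   # x[n], x[n-1], ..., x[n-order+1]
        y[n] = np.dot(w, x_n)                # filter output
        e[n] = d[n] - y[n]                   # estimation error
        w = w + mu * e[n] * x_n              # stochastic-gradient (LMS) update
    return w, y, e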

Jacob Benesty, Yiteng (Arden) Huang, Jingdong Chen
7. Linear Prediction

Linear prediction (LP) plays a fundamental role in all aspects of speech. Its use seems natural and obvious in this context since, for a speech signal, the value of the current sample can be well modeled as a linear combination of its past values. In this chapter, we attempt to present the most important ideas on linear prediction. We derive the principal results, widely recognized by speech experts, in a very intuitive way without sacrificing mathematical rigor.
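For reference, the linear model alluded to above can be written in its standard textbook form (this is a generic formulation, not a derivation taken from the chapter). The current sample is approximated from the p previous ones,

\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k), \qquad e(n) = s(n) - \hat{s}(n),

and minimizing the mean-squared prediction error E[e^2(n)] leads to the normal (Yule-Walker) equations

\sum_{k=1}^{p} a_k\, r(|i-k|) = r(i), \quad i = 1, \dots, p,

where r(i) is the autocorrelation of s(n); in matrix form, R a = r, which the Levinson-Durbin recursion solves efficiently.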

Jacob Benesty, Jingdong Chen, Yiteng (Arden) Huang
8. The Kalman Filter

The Kalman filter and its variants are some of the most popular tools in statistical signal processing and estimation theory. In this chapter, we introduce the Kalman filter, providing a succinct, yet rigorous derivation thereof, which is based on the orthogonality principle. We also introduce several important variants of the Kalman filter, namely various Kalman smoothers, a Kalman predictor, a nonlinear extension (the extended Kalman filter), and an adaptation to cases of temporally correlated measurement noise. The application of the Kalman filter to two important speech processing problems, namely speech enhancement and speaker localization, is demonstrated.
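As a pointer to the structure of the filter discussed here, the standard predict/update recursion for a linear state-space model x_n = A x_{n-1} + w_n, y_n = H x_n + v_n (process noise covariance Q, measurement noise covariance R) reads as follows; the notation is generic textbook notation rather than that of the chapter.

\hat{x}_{n|n-1} = A\,\hat{x}_{n-1|n-1}, \qquad P_{n|n-1} = A P_{n-1|n-1} A^{\top} + Q,

K_n = P_{n|n-1} H^{\top} \left( H P_{n|n-1} H^{\top} + R \right)^{-1},

\hat{x}_{n|n} = \hat{x}_{n|n-1} + K_n \left( y_n - H\,\hat{x}_{n|n-1} \right), \qquad P_{n|n} = (I - K_n H)\, P_{n|n-1}.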

Sharon Gannot, Arie Yeredor
9. Homomorphic Systems and Cepstrum Analysis of Speech

In 1963, Bogert, Healy, and Tukey published a chapter with one of the most unusual titles to be found in the literature of science and engineering [9.1]. In this chapter, they observed that the logarithm of the power spectrum of a signal plus its echo (delayed and scaled replica) consists of the logarithm of the signal spectrum plus a periodic component due to the echo. They suggested that further spectrum analysis of the log spectrum could highlight the periodic component in the log spectrum and thus lead to a new indicator of the occurrence of an echo. Specifically, they made the following observation:

In general, we find ourselves operating on the frequency side in ways customary on the time side and vice versa.

As an aid in formalizing this new point of view, they introduced a number of paraphrased words. For example, they defined the cepstrum of a signal as the power spectrum of the logarithm of the power spectrum of a signal. (In fact, they used discrete-time spectrum estimates based on the discrete Fourier transform.) Similarly, the term quefrency was introduced for the independent variable of the cepstrum [9.1].

In this chapter we will explore why the cepstrum has emerged as a central concept in digital speech processing. We will start with definitions appropriate for discrete-time signal processing and develop some of the general properties and computational approaches for the cepstrum of speech. Using this basis, we will explore the many ways that the cepstrum has been used in speech processing applications.
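As a concrete anchor for these definitions, the short Python sketch below computes the real cepstrum commonly used in modern discrete-time speech processing (the inverse DFT of the log magnitude spectrum), which differs slightly from Bogert, Healy, and Tukey's original power-spectrum-of-a-log-power-spectrum formulation; the FFT size is an arbitrary illustrative value.

import numpy as np

def real_cepstrum(frame, nfft=512):
    # Real cepstrum of a (windowed) speech frame: inverse DFT of the
    # log magnitude spectrum; the result lives in the quefrency domain.
    spectrum = np.fft.rfft(frame, nfft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # small offset avoids log(0)
    return np.fft.irfft(log_mag, nfft)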

Ronald W. Schafer
10. Pitch and Voicing Determination of Speech with an Extension Toward Music Signals

This chapter reviews selected methods for pitch determination of speech and music signals. As both these signals are time variant, we first define what is subsumed under the term pitch. Then we subdivide pitch determination algorithms (PDAs) into short-term analysis algorithms, which apply some spectral transform and derive pitch from a frequency- or lag-domain representation, and time-domain algorithms, which analyze the signal directly and apply structural analysis, determine individual periods from the first partial, or compute the instants of glottal closure in speech. In the 1970s, when many of these algorithms were developed, the main application in speech technology was the vocoder, whereas nowadays prosody recognition in speech understanding systems and high-accuracy pitch period determination for speech synthesis corpora are emphasized. In musical acoustics, pitch determination is applied in melody recognition and automatic musical transcription, where we also have the problem that several pitches can exist simultaneously.
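To make the short-term-analysis class of PDAs concrete, here is a minimal autocorrelation-based pitch estimator in Python; the 60-400 Hz search range and the assumption that the frame is voiced are illustrative simplifications, not recommendations from the chapter.

import numpy as np

def acf_pitch(frame, fs, fmin=60.0, fmax=400.0):
    # Estimate F0 of a (presumed voiced) frame from the strongest peak
    # of its short-term autocorrelation within a plausible lag range.
    frame = frame - np.mean(frame)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(acf) - 1)
    lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return fs / lag   # fundamental frequency estimate in Hz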

Wolfgang J. Hess
11. Formant Estimation and Tracking

This chapter deals with the estimation and tracking of the movements of the spectral resonances of human vocal tracts, also known as formants. The representation or modeling of speech in terms of formants is useful in several areas of speech processing: coding, recognition, synthesis, and enhancement, as formants efficiently describe essential aspects of speech using a very limited set of parameters. However, estimating formants is more difficult than simply searching for peaks in an amplitude spectrum, as the spectral peaks of the vocal-tract output depend upon a variety of factors in complicated ways: vocal-tract shape, excitation, and periodicity. We describe in detail the formal task of formant tracking, and explore its successes and difficulties, as well as giving reasons for the various approaches.
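One classical route from spectrum to formants, the LP-based approach, is sketched below in Python to illustrate why formant estimation goes beyond simple peak picking: resonance candidates are taken from the angles of the roots of the prediction-error polynomial. The model order, the Hamming window, and the omission of bandwidth-based pruning are illustrative simplifications, not the chapter's recommended procedure.

import numpy as np

def lpc_formants(frame, fs, order=12):
    # Rough formant candidates from the roots of an LP polynomial fitted to one frame.
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    # Solve the normal (Yule-Walker) equations R a = r for the predictor coefficients.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    roots = np.roots(np.concatenate(([1.0], -a)))   # roots of A(z) = 1 - sum a_k z^-k
    roots = roots[np.imag(roots) > 0]               # keep one root per conjugate pair
    return np.sort(np.angle(roots) * fs / (2 * np.pi))   # candidate frequencies in Hz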

Douglas OʼShaughnessy
12. The STFT, Sinusoidal Models, and Speech Modification

Frequency-domain signal representations are used for a wide variety of applications in speech processing. In this chapter, we first consider the short-time Fourier transform (STFT), presenting a number of interpretations of the analysis-synthesis process in a consistent mathematical framework. We then develop the sinusoidal model as a parametric extension of the STFT wherein the data in the STFT are compacted, sacrificing perfect reconstruction for the benefit of achieving a sparser and essentially more meaningful representation. We discuss several methods for sinusoidal parameter estimation and signal reconstruction, and present a detailed treatment of a matching pursuit algorithm for sinusoidal modeling. The final part of the chapter addresses speech modifications such as filtering, enhancement, and time-scaling, for which both the STFT and the sinusoidal model are effective tools.
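For readers who want the analysis side of the STFT in code form, here is a minimal Python sketch; the Hann window, frame length, and hop size are arbitrary illustrative parameters, and no attempt is made at perfect-reconstruction synthesis.

import numpy as np

def stft(x, win_len=512, hop=128):
    # Short-time Fourier transform: slide a window along the signal,
    # window each frame, and take its DFT; each row of the result is one frame.
    window = np.hanning(win_len)
    frames = [x[i:i + win_len] * window
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)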

Michael M. Goodwin
13. Adaptive Blind Multichannel Identification

Blind multichannel identification was first introduced in the mid 1970s and was initially studied in the communications community with the intention of designing more-efficient communication systems by avoiding a training phase. Recently this idea has become increasingly interesting for acoustics and speech processing research, driven by the fact that in most acoustic applications for speech processing and communication very little or nothing is known about the source signals. Since human ears have an extremely wide dynamic range and are much more sensitive to the weak tails of acoustic impulse responses, these impulse responses need to be modeled using fairly long filters. Attempting to identify such a multichannel system blindly with a batch method therefore incurs a prohibitive computational load, which makes batch processing impractical, particularly for real-time systems. Adaptive blind multichannel identification algorithms are therefore preferable and pragmatically useful. This chapter describes some fundamental issues in blind multichannel identification and reviews a number of state-of-the-art adaptive algorithms.
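The idea that identification is possible without knowing the source can be made concrete with the classical cross-relation exploited by many adaptive algorithms of this kind (stated here for two channels in a generic form, not as a specific algorithm from the chapter). If x_1(n) = h_1 * s(n) and x_2(n) = h_2 * s(n), then

x_1 * h_2 = s * h_1 * h_2 = x_2 * h_1,

so the cross-relation error e(n) = \mathbf{h}_2^{\top} \mathbf{x}_1(n) - \mathbf{h}_1^{\top} \mathbf{x}_2(n) vanishes at the true channel impulse responses and can be driven toward zero adaptively using only the observed microphone signals.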

Yiteng (Arden) Huang, Jacob Benesty, Jingdong Chen

Speech Coding

Frontmatter
14. Principles of Speech Coding

Speech coding is the art of reducing the bit rate required to describe a speech signal. In this chapter, we discuss the attributes of speech coders as well as the underlying principles that determine their behavior and their architecture. The ubiquitous class of linear-prediction-based coders is used as an illustration. Speech is generally modeled as a sequence of stationary signal segments, each having unique statistics. Segments are encoded using a two-step procedure: (1) find a model describing the speech segment, (2) encode the segment assuming it is generated by the model. We show that the bit allocation for the model (the predictor parameters) is independent of overall rate and of perception, which is consistent with existing experimental results. The modeling of perception is an important aspect of efficient coding and we discuss how various perceptual distortion measures can be integrated into speech coders.

W. Bastiaan Kleijn
15. Voice over IP: Speech Transmission over Packet Networks

The emergence of packet networks for both data and voice traffic has introduced new challenges for speech transmission designs that differ significantly from those encountered and handled in traditional circuit-switched telephone networks, such as the public switched telephone network (PSTN). In this chapter, we present the many aspects that affect speech quality in a voice over IP (VoIP) conversation. We also present design techniques for coding systems that aim to overcome the deficiencies of the packet channel. By properly utilizing speech codecs tailored for packet networks, VoIP can in fact produce a quality higher than that possible with the PSTN.

Jan Skoglund, Ermin Kozica, Jan Linden, Roar Hagen, W. Bastiaan Kleijn
16. Low-Bit-Rate Speech Coding

Low-bit-rate speech coding, at rates below 4 kb/s, is needed for both communication and voice storage applications. At such low rates, full encoding of the speech waveform is not possible; therefore, low-rate coders rely instead on parametric models to represent only the most perceptually relevant aspects of speech. While there are a number of different approaches to this modeling, all can be related to the basic linear model of speech production, where an excitation signal drives a vocal-tract filter. The basic properties of the speech signal and of human speech perception can explain the principles of parametric speech coding as applied in early vocoders. Current speech modeling approaches, such as mixed excitation linear prediction, sinusoidal coding, and waveform interpolation, use more-sophisticated versions of these same concepts. Modern techniques for encoding the model parameters, in particular using the theory of vector quantization, allow the encoding of the model information with very few bits per speech frame. Successful standardization of low-rate coders has enabled their widespread use for both military and satellite communications, at rates from 4 kb/s all the way down to 600 b/s. However, the goal of toll-quality low-rate coding continues to provide a research challenge.

Alan V. McCree
17. Analysis-by-Synthesis Speech Coding

Since the early 1980s, advances in speech coding technologies have enabled speech coders to achieve bit-rate reductions of a factor of 4 to 8 while maintaining roughly the same high speech quality. One of the most important driving forces behind this feat is the so-called analysis-by-synthesis paradigm for coding the excitation signal of predictive speech coders. In this chapter, we give an overview of many variations of the analysis-by-synthesis excitation coding paradigm, as exemplified by various speech coding standards around the world. We describe the variations of the same basic theme in the context of the different coder structures where these techniques are employed. We also attempt to show the relationships between them in the form of a family tree. The goal of this chapter is to give the reader a big-picture understanding of the dominant types of analysis-by-synthesis excitation coding techniques for predictive speech coding.

Juin-Hwey Chen, Jes Thyssen
18. Perceptual Audio Coding of Speech Signals

Traditionally, algorithms for speech coding exploit the features of speech signals by employing algorithmic models of the human vocal tract. More recently, the use of generic audio coders for coding speech signals has gained increasing importance. Based on the properties of human hearing, such perceptual audio coders offer attractive properties, including full-bandwidth audio output, increased naturalness, and good handling of any type of non-speech material. The chapter discusses the principles of perceptual audio coding, some relevant standards, and a number of perceptual audio coders that find application in speech and audio transmission and storage.

Jürgen Herre, Manfred Lutzky

Text-to-Speech Synthesis

Frontmatter
19. Basic Principles of Speech Synthesis

Speech synthesis enables voice output by machines or devices. Text-to-speech (TTS) synthesis does so by using text as input. Ever since the talking machine by von Kempelen in 1791 [19.1], researchers and technologists have endeavored to make machines talk. The first electronic synthesizer, Homer Dudleyʼs Voder (voice coder), was demonstrated at the 1939 Worldʼs Fair in New York City [19.2]. Today, TTS systems enjoy wide use in assistive technologies, telecommunications, entertainment, and education. In this chapter we review the basic principles of this technology; this serves as an introduction to later chapters that provide the reader with more-detailed information.

Juergen Schroeter
20. Rule-Based Speech Synthesis

In this chapter, we review some of the issues in rule-based synthesis and specifically discuss formant synthesis. Formant synthesis and the theory behind it have played an important role both in the scientific progress toward understanding how humans talk and in the development of the first speech technology applications. Its flexibility and small footprint make the approach still of interest and a valuable complement to the currently dominant methods based on concatenative, data-driven synthesis. As already mentioned in the overview by Schroeter (Chap. 19), we also see a new trend to combine the rule-based and data-driven approaches: formant features extracted from a database can be used both to optimize a rule-based formant synthesis system and to improve the search for good units in a concatenative system.

Rolf Carlson, Björn Granström
21. Corpus-Based Speech Synthesis

In this chapter, we present the main trends in corpus-based speech synthesis, assuming a stream of phonemes and prosodic targets as input. From the early diphone-based speech synthesizers to the state-of-the-art unit-selection-based synthesizers, to the promising statistical parametric techniques, we emphasize the engineering trade-offs that arise when designing such systems. In particular, we examine the mathematical foundations of available methods for modifying the fundamental frequency and the duration of speech units for concatenative synthesis, as well as for smoothing discontinuities at concatenation points. For each of these problems, we analyze time- and frequency-domain processing, using algorithms such as time-domain pitch-synchronous overlap-add (TD-PSOLA), multiband resynthesis overlap-add (MBROLA), and the harmonic-plus-noise model (HNM). We then provide a comprehensive description of how and why concatenative speech synthesis has progressively adopted large speech corpora, using the principle of context-oriented clustering as a smooth transition from fixed-inventory synthesis to unit selection and statistical parametric synthesis. Our description of unit selection emphasizes important issues related to the definition of optimal target and concatenation costs, as well as to the design of the speech corpus (including memory cost issues) and the reduction of computational costs. We conclude the chapter with the mathematical framework underlying HMM-based speech synthesis and an outline of its main perspectives.
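For orientation, target and concatenation costs are usually combined in the following way (a generic formulation in the spirit of standard unit-selection synthesis, not notation taken from the chapter): for a target sequence t_1, ..., t_N and a candidate unit sequence u_1, ..., u_N, the selected units minimize

C(u_{1:N}) = \sum_{n=1}^{N} C^{t}(t_n, u_n) + \sum_{n=2}^{N} C^{c}(u_{n-1}, u_n),

where C^t measures how well a unit matches its target specification and C^c measures the smoothness of joining consecutive units; the minimization is typically carried out with a Viterbi-style dynamic-programming search over the candidate lattice.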

Thierry Dutoit
22. Linguistic Processing for Speech Synthesis

An important part of any text-to-speech (TTS) synthesis system is the linguistic processing component, which takes input text and converts it into a feature representation from which actual synthesis can proceed. Linguistic analysis is hard, in large measure because written language massively underspecifies linguistic information. This chapter reviews several issues in linguistic analysis, starting from low-level text normalization issues and ending with higher-level problems such as accent prediction and document-level analysis. We end with some prognosis of the prospects for improvements over current technology.

Richard Sproat
23. Prosodic Processing

Speech synthesis systems have to generate natural-sounding speech output from text. One of the key aspects of speech is prosody, which must be both natural (i.e., sounding like a human) and meaningful (i.e., sounding like a human who understands the contents of the text). The computation of prosody from text can be divided into the computation of prosodic tags from text and the computation of acoustic speech features from these tags. This chapter focuses on the latter. It provides an overview of prosody in human-human communication, including the communicative functions of prosody and their acoustic correlates. Discussed next is a historical overview of the various methods that have been used for prosody generation in speech synthesis, as well as of current methods. Special attention is paid to prosody generation in unit selection synthesis methods, in which large corpora are searched for fragments of speech that match the phonemes and prosodic tags computed from text and that optimize various cost functions, and in which prosody is not explicitly modeled and the speech is not modified. We conclude the chapter by advocating hybrid approaches in which the search capabilities of unit selection methods are combined with the speech modification methods of more-traditional approaches.

Jan van Santen, Taniya Mishra, Esther Klabbers
24. Voice Transformation

Voice transformation refers to the various modifications one may apply to the sound produced by a person, speaking or singing. In this chapter we describe various ways in which one can modify a voice and provide details on how to implement these modifications using a simple, but quite efficient, parametric model based on a harmonic representation of speech. By discussing the quality issues of current voice transformation algorithms in conjunction with the properties of the speech production and perception systems, we try to pave the way for more-natural voice transformation algorithms in the future.
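As a reminder of what a harmonic representation looks like, a voiced frame can be written, under quasi-stationarity assumptions, in the generic harmonic-plus-residual form (not necessarily the chapter's exact parameterization)

s(t) \approx \sum_{k=1}^{K} A_k \cos\!\left(2\pi k f_0 t + \phi_k\right) + r(t),

where f_0 is the fundamental frequency, A_k and \phi_k are the amplitude and phase of the k-th harmonic, and r(t) is a noise-like residual; transformations of pitch, timbre, or duration then amount to remapping or resampling the parameters A_k, \phi_k, and f_0 before resynthesis.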

Yannis Stylianou
25. Expressive/Affective Speech Synthesis

The focus of speech synthesis research has recently shifted from read speech towards more conversational styles of speech, in order to reproduce those situations where speech synthesis is used as part of a dialogue. When a speech synthesizer is used to represent the voice of a cognisant agent, whether human or simulated, there is a need for more than just the intelligible portrayal of linguistic information; there is also a need for the expression of affect. This chapter reviews some recent advances in the synthesis of expressive speech and shows how the technology can be adapted to include the display of affect in conversational speech. The chapter discusses how the presence of an interactive and active partner in a conversation can greatly affect the style of human speech, and presents a model of the cognitive processes that result in these differences, which concern not just the acoustic prosody and phonation quality of an utterance, but also its lexical selection and phrasing. It proposes a measure of the ratio of paralinguistic to linguistic content in an utterance as a means of quantifying the expressivity of a speaking style, and closes with a description of a phrase-level concatenative speech synthesis system that is currently in development.

Nick Campbell

Speech Recognition

Frontmatter
26. Historical Perspective of the Field of ASR/NLU

The quest for a machine that can recognize and understand speech, from any speaker, and in any environment has been the holy grail of speech recognition research for more than 70 years. Although we have made great progress in understanding how speech is produced and analyzed, and although we have made enough advances to build and deploy in the field a number of viable speech recognition systems, we still remain far from the ultimate goal of a machine that communicates naturally with any human being. It is the goal of this section to document the history of research in speech recognition and natural language understanding, and to point out areas where great progress has been made, along with the challenges that remain to be solved in the future.

Lawrence Rabiner, Biing-Hwang Juang
27. HMMs and Related Speech Recognition Technologies

Almost all present-day continuous speech recognition (CSR) systems are based on hidden Markov models (HMMs). Although the fundamentals of HMM-based CSR have been understood for several decades, there has been steady progress in refining the technology, both in terms of reducing the impact of the inherent assumptions and in adapting the models for specific applications and environments. The aim of this chapter is to review the core architecture of an HMM-based CSR system and then outline the major areas of refinement incorporated into modern systems.
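For readers new to HMMs, the core computation underlying recognition is the forward recursion (standard textbook notation, not specific to this chapter): with transition probabilities a_{ij}, output densities b_j(\cdot), and observations o_1, ..., o_T,

\alpha_t(j) = b_j(o_t) \sum_{i} \alpha_{t-1}(i)\, a_{ij}, \qquad p(O \mid \lambda) = \sum_{j} \alpha_T(j),

and replacing the sum over i with a maximization yields the Viterbi recursion used for decoding.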

Steve Young
28. Speech Recognition with Weighted Finite-State Transducers

This chapter describes a general representation and algorithmic framework for speech recognition based on weighted finite-state transducers. These transducers provide a common and natural representation for the major components of speech recognition systems, including hidden Markov models (HMMs), context-dependency models, pronunciation dictionaries, statistical grammars, and word or phone lattices. General algorithms for building and optimizing transducer models are presented, including composition for combining models, weighted determinization and minimization for optimizing time and space requirements, and a weight-pushing algorithm for redistributing transition weights optimally for speech recognition. The application of these methods to large-vocabulary recognition tasks is explained in detail, and experimental results are given, in particular for the North American Business News (NAB) task, in which these methods were used to combine HMMs, full cross-word triphones, a lexicon of 40,000 words, and a large trigram grammar into a single weighted transducer that is only somewhat larger than the trigram word grammar and that runs NAB in real time on a very simple decoder. Another example demonstrates that the same methods can be used to optimize lattices for second-pass recognition.
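The central operation referred to above, composition, can be summarized in one line (written generically over a semiring, following common WFST practice rather than the chapter's notation): for weighted transducers T_1 and T_2,

(T_1 \circ T_2)(x, y) = \bigoplus_{z} T_1(x, z) \otimes T_2(z, y),

so cascading, for example, a context-dependency transducer, a pronunciation lexicon, and a grammar amounts to composing them into a single transducer that can then be determinized and minimized.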

Mehryar Mohri, Fernando Pereira, Michael Riley
29. A Machine Learning Framework for Spoken-Dialog Classification

One of the key tasks in the design of large-scale dialog systems is classification. This consists of assigning, out of a finite set, a specific category to each spoken utterance, based on the output of a speech recognizer. Classification in general is a standard machine learning problem, but the objects to classify in this particular case are word lattices, or weighted automata, and not the fixed-size vectors for which learning algorithms were originally designed. This chapter presents a general kernel-based learning framework for the design of classification algorithms for weighted automata. It introduces a family of kernels, rational kernels, that, combined with support vector machines, form powerful techniques for spoken-dialog classification and other classification tasks in text and speech processing. It describes efficient algorithms for their computation and reports the results of their use in several difficult spoken-dialog classification tasks based on deployed systems. Our results show that rational kernels are easy to design and implement, and lead to substantial improvements in classification accuracy. The chapter also provides some theoretical results helpful for the design of rational kernels.

Corinna Cortes, Patrick Haffner, Mehryar Mohri
30. Towards Superhuman Speech Recognition

After over 40 years of research, human speech recognition performance still substantially outstrips machine performance. Although enormous progress has been made, the ultimate goal of achieving or exceeding human performance - superhuman speech recognition - eludes us. On a more-prosaic level, many industrial concerns have been trying to make a go of various speech recognition businesses for many years, yet there is no clear killer app for speech. If the technology were as reliable as human perception, would such killer apps emerge? Either way, there would be enormous value in producing a recognizer with superhuman capabilities. This chapter describes an ongoing research program at IBM that aims to achieve superhuman speech recognition performance as measured by word error rate. First, a multidomain conversational test set that drives the research program is described. Then, a series of human listening experiments and speech recognition experiments based on the test set is presented. Large improvements in recognition performance can be achieved through a combination of adaptation, discriminative training, the combination of knowledge sources, and the simple addition of more data. Unfortunately, devising a set of informative listening tests synchronized with the multidomain test set proved to be more difficult than expected because of the highly informal nature of the underlying speech. The problems encountered in performing the listening tests are presented along with suggestions for future listening tests. The chapter concludes with a set of speculations on the best way for speech recognition research in this area to proceed in the future.

Michael Picheny, David Nahamoo
31. Natural Language Understanding

We describe several algorithms for developing natural language understanding (NLU) applications. The algorithms include a rule-based system and several statistical systems. We consider two major types of NLU applications: dialog systems and speech mining. For dialog systems, the NLU function aims to understand the full meaning of a userʼs request in the context of a human-machine interaction in a narrow domain such as travel reservation. For speech mining applications, the NLU function aims at detecting the presence of a limited set of concepts and some of their relations in unrestricted human-human conversations, such as in a call center or an oral history digital library. We describe in more detail a statistical parsing algorithm using decision trees for dialog systems and two word-tagging algorithms for speech mining.

Salim Roukos
32. Transcription and Distillation of Spontaneous Speech

Automatic transcription of spontaneous human-to-human speech is expected to expand the applications of speech technology, enabling efficient access to audio archives such as broadcast programs, lectures, and meetings. Compared with utterances in human-machine interfaces, which have been the focus of most conventional speech recognition research, spontaneous speech has greater variation in both its acoustic and linguistic characteristics. Therefore, it is necessary to explore more-elaborate and more-flexible modeling techniques for spontaneous speech recognition. Moreover, spontaneous speech processing requires the development of a different paradigm, in that a faithful transcription is not necessarily useful because of the existence of disfluencies and the lack of sentence and paragraph markers. Studies on the automatic detection of sentence/discourse boundaries and disfluencies are also needed. Speech summarization is an approach that generates effective output from a transcript. This chapter gives an overview of major research activities and a number of recent findings on these topics.

Sadaoki Furui, Tatsuya Kawahara
33. Environmental Robustness

When a speech recognition system is deployed outside the laboratory setting, it needs to handle a variety of signal variabilities. These may be due to many factors, including additive noise, acoustic echo, and speaker accent. If the speech recognition accuracy does not degrade very much under these conditions, the system is called robust. Even though there are several reasons why real-world speech may differ from clean speech, in this chapter we focus on the influence of the acoustical environment, defined as the transformations that affect the speech signal from the time it leaves the mouth until it is in digital format. Specifically, we discuss strategies for dealing with additive noise. Some of the techniques, like feature normalization, are general enough to provide robustness against several forms of signal degradation. Others, such as feature enhancement, provide superior noise robustness at the expense of being less general. A good system will implement several techniques to provide a strong defense against acoustical variabilities.
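As a concrete example of the feature-normalization family mentioned above, cepstral mean and variance normalization (CMVN) is sketched below in Python; the per-utterance statistics and the (frames x coefficients) layout are illustrative assumptions rather than details taken from the chapter.

import numpy as np

def cmvn(features):
    # Cepstral mean and variance normalization over one utterance.
    # features: array of shape (num_frames, num_coefficients), e.g. MFCCs.
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-12   # avoid division by zero
    return (features - mu) / sigma          # per-coefficient zero mean, unit variance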

Jasha Droppo, Alex Acero
34. The Business of Speech Technologies

With the fast pace of development of communications networks and devices, immediate and easy access to information and services is now the expected norm. Several critical technologies have entered the marketplace as key enablers to help make this a reality. In particular, speech technologies such as speech recognition and natural language understanding (NLU) have forever changed the landscape of how businesses provide services to consumers. In 30 short years, speech has progressed from an idea in research laboratories across the world to a multibillion-dollar industry of software, hardware, service hosting, and professional services. Speech is now almost ubiquitous in cell phones. Yet the industry is still very much in its infancy, with its focus being on simple, low-hanging-fruit applications where the current state of technology actually fits a specific market need, such as voice enabling of call center services or voice dialing over a cell phone. With broadband access to networks (and therefore data) anywhere, anytime, and on any device almost a reality, speech technologies will continue to be essential for unlocking the potential that such access provides. However, to unlock this potential, advances in basic speech technologies beyond the current state of the art are essential. In this chapter, we review the business of speech technologies and its development since the 1980s. How did it start? What were the key inventions that got us where we are, and the service innovations that supported the industry over the past few decades? What are the future trends in how speech technologies will be used? And what are the key technical challenges researchers must address and resolve for the industry to move forward to meet this vision of the future? This chapter is by no means meant to be exhaustive, but it gives the reader an understanding of speech technologies, the speech business, and areas where continued technical invention and innovation will be needed before the ubiquitous use of speech technologies can be seen in the marketplace.

Jay Wilpon, Mazin E. Gilbert, Jordan Cohen
35. Spoken Dialogue Systems

Spoken dialogue systems are a new breed of interfaces that enable humans to communicate with machines naturally and efficiently using a conversational paradigm. Such a system makes use of many human language technology (HLT) components, including speech recognition and synthesis, natural language understanding and generation, discourse modeling, and dialogue management. In this contribution, we introduce the nature of these interfaces, describe the underlying HLTs on which they are based, and discuss some of the development issues. After providing a historical perspective, we outline some new research directions.

Victor Zue, Stephanie Seneff

Speaker Recognition

Frontmatter
36. Overview of Speaker Recognition

An introduction to automatic speaker recognition is presented in this chapter. The identifying characteristics of a personʼs voice that make it possible to automatically identify a speaker are discussed. Subtasks such as speaker identification, verification, and detection are described. An overview of the techniques used to build speaker models, as well as issues related to system performance, is presented. Finally, a few selected applications of speaker recognition are introduced to demonstrate the wide range of applications of speaker recognition technologies. Details of text-dependent and text-independent speaker recognition and their applications are covered in the following two chapters.

Aaron E. Rosenberg, Frédéric Bimbot, Sarangarajan Parthasarathy
37. Text-Dependent Speaker Recognition

Text-dependent speaker recognition characterizes a speaker recognition task, such as verification or identification, in which the set of words (or lexicon) used during the testing phase is a subset of those present during the enrollment phase. The restricted lexicon enables very short enrollment (or registration) and testing sessions to deliver an accurate solution but, at the same time, represents scientific and technical challenges. Because of the short enrollment and testing sessions, text-dependent speaker recognition technology is particularly well suited for deployment in large-scale commercial applications. These are the bases for presenting an overview of the state of the art in text-dependent speaker recognition as well as emerging research avenues. In this chapter, we will demonstrate the intrinsic dependence of recognition accuracy on the lexical content of the password phrase. Several research results will be presented and analyzed to show key techniques used in text-dependent speaker recognition systems from different sites. Among these, we mention multichannel speaker model synthesis and continuous adaptation of speaker models with threshold tracking. Since text-dependent speaker recognition is the most widely used voice biometric in commercial deployments, several results drawn from realistic deployment scenarios are also included.

Matthieu Hébert
38. Text-Independent Speaker Recognition

In this chapter, we focus on the area of text-independent speaker verification, with an emphasis on unconstrained telephone conversational speech. We begin by providing a general likelihood ratio detection framework to describe the various components in modern text-independent speaker verification systems. We next describe the general hierarchy of speaker information conveyed in the speech signal and the issues involved in reliably exploiting these levels of information for practical speaker verification systems. We then describe specific implementations of state-of-the-art text-independent speaker verification systems utilizing low-level spectral information and high-level token sequence information with generative and discriminative modeling techniques. Finally, we provide a performance assessment of these systems using the National Institute of Standards and Technology (NIST) speaker recognition evaluation telephone corpora.
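The likelihood ratio detection framework mentioned above reduces, in its simplest generative (e.g., GMM-UBM) form, to the following test (generic formulation, not the chapter's notation): for a sequence of feature vectors X and a claimed speaker with model \lambda_{spk},

\Lambda(X) = \log p(X \mid \lambda_{\mathrm{spk}}) - \log p(X \mid \lambda_{\mathrm{UBM}}) \;\gtrless\; \theta,

where \lambda_{UBM} is a universal background model representing the alternative hypothesis and \theta is a decision threshold chosen to trade off false acceptances against false rejections.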

Douglas A. Reynolds, William M. Campbell

Language Recognition

Frontmatter
39. Principles of Spoken Language Recognition

In this introductory chapter to Part G of this Handbook on spoken language recognition, we provide a brief overview of the principles of state-of-the-art language recognition approaches and a general discriminative training framework for improving the performance and robustness of language recognition systems. It is followed by three chapters. The first of these addresses issues related to spoken language characterization, in which knowledge sources can be utilized to distinguish one language from another. The second chapter deals with language identification based on phone recognition followed by language modeling, using either spectral or token-based approaches. The third chapter presents vector-space characterization approaches for converting speech utterances into spoken document vectors for modeling and classification. With recent progress in speech processing, machine learning, and text categorization, we expect significant technology advances in spoken language recognition in the years to come. This chapter is organized as follows. Section 39.2 briefly describes the principles of spoken language recognition. Sections 39.3 and 39.4 formulate the popular parallel phone recognition followed by language modeling (P-PRLM) and vector-space characterization (VSC) approaches to spoken language identification. In Sect. 39.5 we extend these formulations to spoken language verification. Finally, a general discriminative training framework for non-support-vector-machine (non-SVM) classifiers is presented in Sect. 39.6, followed by a brief summary in Sect. 39.7.

Chin-Hui Lee
40. Spoken Language Characterization

This chapter describes the types of information that can be used to characterize spoken languages. Automatic spoken language identification (LID) systems, which are tasked with determining the identity of the language of speech samples, can utilize a variety of information sources in order to distinguish among languages. In this chapter, we first define what we mean by a language (as opposed to a dialect). We then describe some of the language collections that have been used to investigate spoken language identification, followed by a discussion of the types of features that have been or could be utilized by automatic systems and people. In general, the approaches used by people and machines differ, perhaps sufficiently to suggest building a partnership between human and machine. We finish with a discussion of the conditions under which textual materials could be used to augment our ability to characterize a spoken language.

Mary P. Harper, Michael Maxwell
41. Automatic Language Recognition Via Spectral and Token Based Approaches

Automatic language recognition from speech consists of algorithms and techniques that model and classify the language being spoken. Current state-of-the-art language recognition systems fall into two broad categories: spectral- and token-sequence-based approaches. In this chapter, we describe algorithms for extracting features and models representing these types of language cues, and systems for making recognition decisions using one or more of these language cues. A performance assessment of these systems is also provided, in terms of both accuracy and computational considerations, using the National Institute of Standards and Technology (NIST) language recognition evaluation benchmarks.

Douglas A. Reynolds, William M. Campbell, Wade Shen, Elliot Singer
42. Vector-Based Spoken Language Classification

This chapter presents a vector-space characterization (VSC) approach to automatic spoken language classification. It is assumed that the space of all spoken utterances can be represented by a universal set of fundamental acoustic units common to all languages. We address research issues related to defining the set of fundamental acoustic units, modeling these units, transcribing speech utterances with these unit models, and designing vector-based decision rules for spoken language classification. The proposed VSC approach is evaluated on the 1996 and 2003 National Institute of Standards and Technology (NIST) language recognition evaluation tasks. It is shown that the VSC framework is capable of incorporating any combination of existing vector-based feature representations and classifier designs. We demonstrate that the VSC-based classification systems achieve competitively low error rates for both spoken language identification and verification. The chapter is organized as follows. In Sect. 42.1, we introduce the concept of vector-space characterization of spoken utterances and establish the notions of acoustic letter, acoustic word, and spoken document. In Sect. 42.2 we discuss acoustic segment modeling in relation to an augmented phoneme inventory. In Sect. 42.3, we discuss voice tokenization and spoken document vectorization. In Sect. 42.4, we discuss vector-based classifier design strategies. In Sect. 42.5, we report several experiments as a case study of classifier design, together with an analytic study of the front- and back-ends. Finally, in Sect. 42.6, we summarize the discussion.

Haizhou Li, Bin Ma, Chin-Hui Lee

Speech Enhancement

Frontmatter
43. Fundamentals of Noise Reduction

The existence of noise is inevitable. In all applications related to voice and speech, from sound recording, telecommunications, and telecollaboration to human-machine interfaces, the signal of interest that is picked up by a microphone is generally contaminated by noise. As a result, the microphone signal has to be cleaned up with digital signal-processing tools before it is stored, analyzed, transmitted, or played out. The cleaning process, which is often referred to as either noise reduction or speech enhancement, has attracted a considerable amount of research and engineering attention for several decades. Remarkable advances have already been made, and this area continues to progress, with the aim of creating processors that can extract the desired speech signal as if there were no noise. This chapter presents a methodical overview of the state of the art in noise-reduction algorithms. Based on their theoretical origins, the algorithms are categorized into three fundamental classes: filtering techniques, spectral restoration, and model-based methods. We outline the basic ideas underlying these approaches, discuss their characteristics, explain their intrinsic relationships, and review their advantages and disadvantages.
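To give a flavor of the filtering class of techniques, the Python sketch below computes a per-frequency-bin Wiener-like suppression gain from estimates of the noisy-speech and noise power spectra; the crude maximum-likelihood a priori SNR estimate and the spectral floor are common illustrative choices rather than a specific algorithm from the chapter.

import numpy as np

def spectral_gain(noisy_power, noise_power, gain_floor=0.05):
    # Per-bin noise-reduction gain: strongly attenuate bins dominated by noise.
    # noisy_power, noise_power: per-bin power spectra of the noisy signal and the noise estimate.
    xi = np.maximum(noisy_power / (noise_power + 1e-12) - 1.0, 0.0)  # crude a priori SNR estimate
    gain = xi / (xi + 1.0)                                           # Wiener-like gain
    return np.maximum(gain, gain_floor)                              # flooring limits musical noise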

Jingdong Chen, Jacob Benesty, Yiteng (Arden) Huang, Eric J. Diethorn
44. Spectral Enhancement Methods

In this chapter, we focus on the statistical methods that constitute a speech spectral enhancement system and describe some of their fundamental components. We begin in Sect. 44.2 by formulating the problem of spectral enhancement. In Sect. 44.3, we address the time-frequency correlation of spectral coefficients for speech and noise signals, and present statistical models that conform with these characteristics. In Sect. 44.4, we present estimators for speech spectral coefficients under speech presence uncertainty based on various fidelity criteria. In Sect. 44.5, we address the problem of speech presence probability estimation. In Sect. 44.6, we present useful estimators for the a priori signal-to-noise ratio (SNR) under speech presence uncertainty. We present the decision-directed approach, which is heuristically motivated, and the recursive estimation approach, which is based on statistical models and follows the rationale of Kalman filtering. In Sect. 44.7, we describe the improved minima-controlled recursive averaging (IMCRA) approach for noise power spectrum estimation. In Sect. 44.8, we provide a detailed example of a speech enhancement algorithm and demonstrate its performance in environments with various noise types. In Sect. 44.9, we survey the main types of spectral enhancement components, and discuss the significance of the choice of statistical model, fidelity criterion, a priori SNR estimator, and noise spectrum estimator. Some concluding comments are made in Sect. 44.10.
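For reference, the decision-directed a priori SNR estimator mentioned above is usually written as follows (the standard form due to Ephraim and Malah; the symbols are the common ones rather than necessarily those of the chapter):

\hat{\xi}(k,l) = \alpha\, \frac{|\hat{S}(k,l-1)|^2}{\sigma_v^2(k,l-1)} + (1-\alpha)\, \max\{\gamma(k,l) - 1,\; 0\},

where \hat{S}(k,l-1) is the previous frame's speech spectral estimate, \sigma_v^2 the noise power, \gamma(k,l) the a posteriori SNR, and \alpha (typically close to 1, e.g. 0.98) a smoothing weight that trades off musical noise against responsiveness.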

Israel Cohen, Sharon Gannot
45. Adaptive Echo Cancelation for Voice Signals

This chapter deals with modern methods for handling echoes that may arise during a telephone conversation. Echoes may be generated due to impedance mismatches at various points in a telephone connection, or they may be generated acoustically due to coupling between microphones and loudspeakers placed in the same room or enclosure. If unchecked, such echoes can seriously disrupt a conversation. The device most successful in dealing with such echoes is the adaptive echo canceler. Several million such devices are deployed in telephone networks around the world. This chapter discusses the principles on which echo cancelers are based, and describes their applications to network telephony and to single- and multichannel teleconferencing.

M. Mohan Sondhi
46. Dereverberation

Room reverberation is one of the two major causes of speech degradation (the other is background noise), and there has been an increasing need for speech dereverberation in various speech processing and communication applications. Although it has been studied for decades, speech dereverberation remains a challenging problem from both theoretical and practical perspectives. This chapter provides a methodical overview of speech dereverberation algorithms. They are classified into three categories: source-model-based speech enhancement algorithms for dereverberation, separation of speech and reverberation via homomorphic transformation, and speech dereverberation by channel inversion and equalization. We outline the basic ideas behind these approaches, explain the assumptions they make, and discuss their characteristics as well as their performance.

Yiteng (Arden) Huang, Jacob Benesty, Jingdong Chen
47. Adaptive Beamforming and Postfiltering

In this chapter, we explore many of the basic concepts of array processing, with an emphasis on adaptive beamforming for speech enhancement applications. We begin in Sect. 47.1 by formulating the microphone array problem in a noisy and reverberant environment. In Sect. 47.2, we derive the frequency-domain linearly constrained minimum-variance (LCMV) beamformer and its generalized sidelobe canceller (GSC) variant. The GSC components are explored in Sect. 47.3, and several commonly used special cases of these blocks are presented. As the GSC structure necessitates an estimate of the speech-related acoustical transfer functions (ATFs), several alternative system identification methods are addressed in Sect. 47.4. Beamformers often suffer from sensitivity to signal mismatch. We analyze this phenomenon in Sect. 47.5 and explore several cures for this problem. Although the GSC beamformer yields a significant improvement in speech quality, when the noise field is spatially incoherent or diffuse the noise reduction is insufficient and additional postfiltering is normally required. In Sect. 47.6, we present multi-microphone postfilters based on either minimum mean-squared error (MMSE) or log-spectral amplitude estimation criteria. An interesting relation between the GSC and the Wiener filter is derived in this section as well. In Sect. 47.7, we analyze the performance of the transfer-function GSC (TF-GSC), and in Sect. 47.8 we demonstrate the advantage of multichannel postfiltering over single-channel postfiltering in nonstationary noise conditions.
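As a compact point of reference for the constrained beamforming discussed above, the narrowband minimum-variance distortionless-response (MVDR) solution, a special case of the LCMV criterion with a single constraint, is (in generic notation, not necessarily the chapter's)

\mathbf{w}_{\mathrm{MVDR}} = \arg\min_{\mathbf{w}} \mathbf{w}^{H} \mathbf{\Phi}_{vv} \mathbf{w} \;\; \text{s.t.} \;\; \mathbf{w}^{H}\mathbf{d} = 1
\quad\Longrightarrow\quad
\mathbf{w}_{\mathrm{MVDR}} = \frac{\mathbf{\Phi}_{vv}^{-1}\mathbf{d}}{\mathbf{d}^{H}\mathbf{\Phi}_{vv}^{-1}\mathbf{d}},

where \Phi_{vv} is the noise covariance matrix and d the steering vector (or acoustical transfer function) of the desired source; the GSC is an adaptive, unconstrained implementation of this constrained problem.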

Sharon Gannot, Israel Cohen
48. Feedback Control in Hearing Aids

Acoustic feedback limits the maximum amplification that can be used in a hearing aid without making it unstable. This chapter gives an overview of existing techniques for feedback suppression and, in particular, adaptive feedback cancellation. Because of the presence of a closed signal loop, standard adaptive filtering techniques for open-loop systems fail to provide a reliable feedback path estimate if the desired signal is spectrally colored. Several approaches for improving the estimation accuracy of the adaptive feedback canceller will be reviewed and evaluated for acoustic feedback paths measured in a commercial behind-the-ear hearing aid. This chapter is organized as follows. Section 48.1 gives a mathematical formulation of the acoustic feedback problem in hearing aids and briefly describes the two possible approaches for reducing its negative effects, i.e., feedforward suppression and feedback cancellation. In addition, performance measures for feedback cancellation are defined. Section 48.2 discusses the standard continuous-adaptation feedback (CAF) cancellation algorithm that is widely studied for application in hearing aids. We demonstrate that the standard CAF suffers from a model error, or bias, when the desired signal is spectrally colored (e.g., a speech signal). In the literature, several solutions have been proposed to reduce the bias of the CAF. A common approach is to incorporate signal-decorrelating operations (such as delays) in the signal processing path of the hearing aid or to reduce the adaptation speed of the adaptive feedback canceller. Other techniques, discussed in Sect. 48.3, exploit prior knowledge of the acoustic feedback path to improve the adaptation of the feedback canceller. In Sect. 48.4, a final class of techniques is presented that views the feedback path as part of a closed-loop system and applies closed-loop system identification theory [48.1]. Among the different closed-loop identification methods, the direct method in particular is an appealing approach for feedback cancellation. In contrast to the other methods, this technique does not require the use of an external probe signal. The direct method reduces the bias of the feedback canceller by incorporating a (stationary or time-varying) model of the desired signal x[k] in the identification. Finally, Sect. 48.5 compares the steady-state performance as well as the tracking performance of the different algorithms for acoustic feedback paths measured in a commercial behind-the-ear hearing aid.

Ann Spriet, Simon Doclo, Marc Moonen, Jan Wouters
49. Active Noise Control

This chapter introduces principles, algorithms, and applications of active noise control (ANC) systems. We emphasize the practical aspects of ANC systems in terms of adaptive algorithms for real-world applications. The basic algorithm for ANC is first developed and analyzed based on broadband feedforward control. This algorithm is then modified for narrowband feedforward and adaptive feedback ANC systems. Finally, these single-channel ANC algorithms are expanded to multichannel systems for three-dimensional applications. An ANC system can be categorized as either feedforward or feedback control. Feedforward ANC systems are classified into (1) broadband control with a reference sensor, which will be discussed in Sect. 49.1, and (2) narrowband control with a reference sensor that is not influenced by the control field, which will be introduced in Sect. 49.2. The concept of adaptive feedback ANC will be developed in Sect. 49.3 from the standpoint of reference signal synthesis. These single-channel ANC systems will be expanded to multichannel systems in Sect. 49.4.
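As a minimal sketch of broadband feedforward control, the code below implements the filtered-x LMS (FxLMS) algorithm, the standard adaptive algorithm for this configuration. The secondary path is assumed equal to its model here, and all signals, filter lengths, and step sizes are illustrative choices rather than values from the chapter.

import numpy as np

def fxlms(x, d, s_hat, L=64, mu=0.001):
    # Broadband feedforward ANC with the filtered-x LMS algorithm (sketch).
    # x: reference-sensor signal, d: primary noise at the error sensor,
    # s_hat: secondary-path model (taken here as the true secondary path).
    w = np.zeros(L)                              # adaptive control filter
    xbuf = np.zeros(L)                           # recent reference samples
    ybuf = np.zeros(len(s_hat))                  # recent anti-noise samples
    xf = np.convolve(x, s_hat)[:len(x)]          # filtered reference x'(n)
    xfbuf = np.zeros(L)
    e = np.zeros(len(x))
    for n in range(len(x)):
        xbuf = np.concatenate(([x[n]], xbuf[:-1]))
        xfbuf = np.concatenate(([xf[n]], xfbuf[:-1]))
        y = w @ xbuf                             # anti-noise driving the loudspeaker
        ybuf = np.concatenate(([y], ybuf[:-1]))
        e[n] = d[n] - s_hat @ ybuf               # residual at the error microphone
        w += mu * e[n] * xfbuf                   # FxLMS coefficient update
    return e, w

Filtering the reference through the secondary-path model is what keeps the gradient estimate aligned with the true error surface despite the secondary path between the adaptive filter output and the error sensor.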

Sen M. Kuo, Dennis R. Morgan

Multichannel Speech Processing

Frontmatter
50. Microphone Arrays

This chapter introduces various types of microphone array beamforming systems and discusses some of the fundamental theory of their operation, design, implementation, and limitations. It is shown that microphone arrays can offer directional gains that significantly improve the quality of signal pickup in reverberant and noisy environments. Hands-free audio communication is now a major feature in mobile communication systems as well as audio and video conferencing systems. One problem that becomes evident to users of these systems is the decrease in communication quality due to the pickup of room reverberation and background noise. In the past, this problem was dealt with by placing microphones close to the desired talker or source. Although this simple solution has proven quite effective, it has several drawbacks. First, it is not always possible or desirable to place the microphone very close to the talkerʼs mouth. Second, with the microphone close to the talkerʼs mouth, one has to deal with rapid level variations as the talker moves his or her mouth relative to the microphone. Third, speech plosives (airflow transients generated by plosive sounds) have a negative impact, and fourth, microphone structure-borne handling noise has a detrimental effect. Finally, for directional microphone elements, there is also a near-field proximity effect, whereby the frequency response of the microphone is modulated by the relative position of the microphone to the mouth. With these issues in mind, it is of interest to investigate other potential solutions. One solution is to use beamforming microphone arrays, which can offer significant directional gain and thereby achieve audio performance similar to that of closely placed microphones.
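As a small numerical illustration of the directional gain such arrays can provide, the sketch below evaluates the response of a simple delay-and-sum beamformer on a uniform linear array. The array geometry, frequency, and function name are assumptions made for this example only.

import numpy as np

def delay_and_sum_response(theta_deg, steer_deg=0.0, M=8, d=0.05, f=2000.0, c=343.0):
    # Magnitude response of a delay-and-sum beamformer on a uniform linear array
    # for a plane wave arriving from theta_deg (degrees from broadside).
    m = np.arange(M)
    k = 2 * np.pi * f / c
    a = np.exp(-1j * k * m * d * np.sin(np.deg2rad(theta_deg)))      # array manifold
    w = np.exp(-1j * k * m * d * np.sin(np.deg2rad(steer_deg))) / M  # steering weights
    return abs(w.conj() @ a)

print(delay_and_sum_response(0.0))     # unity gain toward the look direction
print(delay_and_sum_response(60.0))    # strongly attenuated off-axis pickup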

Gary W. Elko, Jens Meyer
51. Time Delay Estimation and Source Localization

A fundamental requirement of microphone arrays is the capability of instantaneously locating and continuously tracking a speech sound source. The problem is challenging in practice because speech is a nonstationary random process with a wideband spectrum, and because of the simultaneous presence of noise, room reverberation, and other interfering speech sources. This chapter presents an overview of the research and development on this technology over the last three decades. Focusing on a two-stage framework for speech source localization, we survey and analyze state-of-the-art time delay estimation (TDE) and source localization algorithms. This chapter is organized into two sections. In Sect. 51.2, we will study the TDE problem and review a number of cutting-edge TDE algorithms, ranging from the generalized cross-correlation methods and blind multichannel-identification-based algorithms to the second-order-statistics-based multichannel cross-correlation coefficient method and the higher-order-statistics-based entropy-minimization approach. In Sect. 51.3, we will investigate the source localization problem from the perspective of estimation theory. The emphasis is on least-squares estimators with closed-form estimates. The spherical intersection, spherical interpolation, and linear-correction spherical interpolation algorithms will be presented.
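To make the first stage of this framework concrete, the following sketch estimates the relative delay between two microphone signals with the generalized cross-correlation method under the phase transform (GCC-PHAT) weighting. The signals, sampling rate, and function name are toy assumptions rather than material from the chapter.

import numpy as np

def gcc_phat_delay(x1, x2, fs, max_tau=None):
    # Delay of x2 relative to x1 via generalized cross-correlation with PHAT weighting.
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    G = X2 * np.conj(X1)                               # cross-power spectrum
    cc = np.fft.irfft(G / (np.abs(G) + 1e-12), n)      # PHAT: keep phase, whiten magnitude
    max_shift = n // 2 if max_tau is None else int(max_tau * fs)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Toy example: the second channel lags the first by five samples.
fs = 16000
rng = np.random.default_rng(1)
x1 = rng.standard_normal(4096)
x2 = np.concatenate((np.zeros(5), x1[:-5]))
print(gcc_phat_delay(x1, x2, fs) * fs)                 # approximately 5.0

In the two-stage framework, pairwise delay estimates of this kind then feed the least-squares localization step.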

Yiteng (Arden) Huang, Jacob Benesty, Jingdong Chen
52. Convolutive Blind Source Separation Methods

In this chapter, we provide an overview of existing algorithms for blind source separation of convolutive audio mixtures. We organize many of the existing algorithms within a taxonomy and present published results from those algorithms that have been applied to real-world audio separation tasks.
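To make the term convolutive mixture concrete, this short sketch synthesizes a two-source, two-microphone mixture in which each source reaches each microphone through its own FIR impulse response; the random decaying filters and signal lengths are purely illustrative, not measured room responses.

import numpy as np

rng = np.random.default_rng(2)
n, L = 16000, 256                          # signal length and mixing-filter length
s = rng.standard_normal((2, n))            # two toy source signals
H = rng.standard_normal((2, 2, L)) * np.exp(-np.arange(L) / 40.0)  # decaying FIR filters

# Each microphone observes a sum of filtered sources: x_i[k] = sum_j (h_ij * s_j)[k].
x = np.stack([sum(np.convolve(H[i, j], s[j])[:n] for j in range(2)) for i in range(2)])
print(x.shape)                             # (2, 16000)

Separation algorithms for such mixtures must undo the filtering as well as the mixing, which is what distinguishes the convolutive problem from the instantaneous one.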

Michael Syskind Pedersen, Jan Larsen, Ulrik Kjems, Lucas C. Parra
53. Sound Field Reproduction

Multichannel sound field reproduction aims at the physically correct synthesis of acoustical wave fields with a large number of loudspeakers. It goes beyond stereophony by extending or eliminating the so-called sweet spot. This chapter presents the physical and mathematical background of several methods for multichannel sound field reproduction. Section 53.2 introduces the mathematical representation of sound fields. The panning laws of stereophony are given in Sect. 53.3 as an introduction to vector-based amplitude panning in Sect. 53.4. Mathematically more-involved methods are then described in the sections on ambisonics (Sect. 53.5) and wave field synthesis (Sect. 53.6).
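As a small worked example of the amplitude-panning ideas leading up to Sect. 53.4, the sketch below computes two-dimensional vector-based amplitude-panning gains for a standard loudspeaker pair; the loudspeaker angles, the power normalization, and the function name are assumptions made for this illustration.

import numpy as np

def vbap_2d(pan_deg, spk1_deg=-30.0, spk2_deg=30.0):
    # 2-D vector-based amplitude panning: express the virtual-source direction
    # in the basis of the two loudspeaker unit vectors, then normalize the gains.
    def unit(a):
        a = np.deg2rad(a)
        return np.array([np.cos(a), np.sin(a)])
    Lmat = np.column_stack((unit(spk1_deg), unit(spk2_deg)))  # loudspeaker directions
    g = np.linalg.solve(Lmat, unit(pan_deg))                  # solve p = L g
    return g / np.linalg.norm(g)                              # constant-power normalization

print(vbap_2d(0.0))     # centre image: roughly equal gains of about 0.707
print(vbap_2d(30.0))    # image at +30 degrees: only the right loudspeaker is active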

Rudolf Rabenstein, Sascha Spors
Backmatter
Metadata
Title
Springer Handbook of Speech Processing
Edited by
Prof. Jacob Benesty, Dr.
Prof. M. Mohan Sondhi, Ph.D.
Prof. Yiteng Arden Huang, Dr.
Copyright year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-49127-9
Print ISBN
978-3-540-49125-5
DOI
https://doi.org/10.1007/978-3-540-49127-9
