2015 | Book

Speech and Audio Processing for Coding, Enhancement and Recognition

Edited by: Tokunbo Ogunfunmi, Roberto Togneri, Madihally (Sim) Narasimha

Publisher: Springer New York

About this book

This book describes the basic principles underlying the generation, coding, transmission and enhancement of speech and audio signals, including advanced statistical and machine learning techniques for speech and speaker recognition, with an overview of the key innovations in these areas. Key research undertaken in speech coding, speech enhancement, speech recognition, emotion recognition and speaker diarization is also presented, along with recent advances and new paradigms in these areas.

Table of Contents

Frontmatter

Overview of Speech and Audio Coding

Frontmatter
Chapter 1. From “Harmonic Telegraph” to Cellular Phones
Abstract
It all started with two patents issued to Alexander Graham Bell in March 1876, and the world changed forever. Vast distances began to shrink. Soon, nobody was isolated. The invention produced a new industrial giant whose research laboratories supported the best in scientific research and engineering, leading to major technical advances of the twentieth century. The desire for communication anytime, anywhere spread fast; stationary phones connected by wires started fading away, replaced by mobile phones, or “cellular phones,” reflecting the cell structure of the wireless medium. This chapter provides a history of the telephone, from Alexander Graham Bell’s “harmonic telegraph” to modern cellular phones.
Bishnu S. Atal
Chapter 2. Challenges in Speech Coding Research
Abstract
Speech and audio coding underlie many of the products and services that we have come to rely on and enjoy today. In this chapter, we discuss speech and audio coding, including a concise background summary, key coding methods, and the latest standards, with an eye toward current limitations and possible future research directions.
Jerry D. Gibson
Chapter 3. Scalable and Multi-Rate Speech Coding for Voice-over-Internet Protocol (VoIP) Networks
Abstract
Communication by speech remains a very popular and effective means of transmitting information from one person to another; speech signals form the basic method of human communication, conveying verbal and auditory information. The methods used for speech coding are extensive and continuously evolving.
Speech coding can be defined as the means by which the information-bearing speech signal is coded to remove redundancy, thereby reducing transmission bandwidth requirements, improving storage efficiency, and making possible a myriad of other applications that rely on speech coding techniques.
The medium of speech transmission has also been changing over the years. Currently, a large percentage of speech is communicated over channels using Internet protocols. Voice-over-Internet-Protocol (VoIP) channels present challenges that must be overcome to enable robust, error-free speech communication.
There are several advantages to using bit streams that are multi-rate and scalable over time-varying VoIP channels. In this chapter, we present methods for scalable, multi-rate speech coding for VoIP channels.
Tokunbo Ogunfunmi, Koji Seto
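
To make the redundancy-removal idea concrete: coders in this family typically rely on short-term linear prediction, where each sample is predicted from previous samples so that only the low-energy residual (plus slowly varying coefficients) needs to be coded. The following minimal NumPy sketch is our own illustration, not code from the chapter; it estimates LPC coefficients with the Levinson-Durbin recursion and shows the residual carrying far less energy than the frame itself:

import numpy as np

def lpc_coefficients(frame, order=10):
    # Autocorrelation of the frame at lags 0..order.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient of the Levinson-Durbin recursion.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a  # prediction-error filter A(z), with a[0] == 1

# One 30 ms frame of a vowel-like signal at 8 kHz (synthetic stand-in).
fs = 8000
t = np.arange(240) / fs
frame = np.sin(2 * np.pi * 200 * t) + 0.05 * np.random.randn(240)
a = lpc_coefficients(frame, order=10)
residual = np.convolve(frame, a)[:len(frame)]  # filter through A(z)
print(frame.var(), residual.var())  # residual energy is far lower

A coder then spends its bits on the residual and the coefficients rather than on the raw waveform, which is where the bandwidth saving comes from.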
Chapter 4. Recent Speech Coding Technologies and Standards
Abstract
This chapter presents an overview of recent developments in conversational speech coding technologies, important new algorithmic advances, and recent standardization activities in ITU-T, 3GPP, 3GPP2, MPEG and IETF that offer a significantly improved user experience during voice calls on existing and future communication systems. User experience is determined by speech quality, so network operators care greatly about the quality of speech coders; they are also concerned about capacity, making coding efficiency another important measure. Advanced speech coding technologies can improve both coding efficiency and user experience. One way to improve quality is to extend the audio bandwidth from traditional narrowband to wideband (16 kHz sampling) and super-wideband (32 kHz sampling). Another is to increase the robustness of the coder against transmission errors: error concealment algorithms substitute the missing parts of the audio signal as far as possible. In packet-switched applications (VoIP systems), special mechanisms in jitter buffer management (JBM) algorithms maximize sound quality. Ensuring the standardization and deployment of speech coders that meet quality expectations is highly important. As an example, we refer to the Enhanced Voice Services (EVS) project in 3GPP, which is developing the next-generation 3GPP speech coder. The basic motivation for starting the EVS project was to extend the path of codec evolution by providing a super-wideband experience at around 13 kb/s and better quality for music and mixed content in conversational applications. Optimized behavior in VoIP applications is achieved through high error robustness, jitter buffer management, source-controlled variable bit rate operation, and support for various audio bandwidths and stereo.
Daniel J. Sinder, Imre Varga, Venkatesh Krishnan, Vivek Rajendran, Stéphane Villette
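
As a toy illustration of error concealment (deliberately far simpler than the concealment in standardized coders such as EVS), the sketch below replaces each lost frame with an attenuated copy of the last good frame, fading out over consecutive losses. The function name and attenuation factor are our own choices:

import numpy as np

def conceal_losses(frames, frame_len, attenuation=0.5):
    # frames: list of decoded frames, with None marking a lost packet.
    out, last = [], None
    for f in frames:
        if f is not None:
            last = f.copy()
        elif last is not None:
            last = attenuation * last  # fade on consecutive losses
        else:
            last = np.zeros(frame_len)  # loss before any good frame
        out.append(last)
    return out

# Two good frames around a two-frame loss burst (20 ms frames at 8 kHz).
good = 0.3 * np.ones(160)
decoded = conceal_losses([good, None, None, good], frame_len=160)
print([round(float(f[0]), 3) for f in decoded])  # [0.3, 0.15, 0.075, 0.3]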

Review and Challenges in Speech, Speaker and Emotion Recognition

Frontmatter
Chapter 5. Ensemble Learning Approaches in Speech Recognition
Abstract
This chapter gives an overview of the ensemble learning efforts that have emerged in automatic speech recognition in recent years. Approaches based on different machine learning techniques, targeting various levels and components of speech recognition, are described, and their effectiveness is discussed in terms of the direct performance measure of word error rate and the indirect measures of classification margin, diversity, and bias and variance. In addition, methods for reducing the storage and computation costs of ensemble models in practical deployments of speech recognition systems are discussed. Ensemble learning for speech recognition has been largely fruitful and is expected to continue to progress along with advances in machine learning, speech and language modeling, and computing technology.
Yunxin Zhao, Jian Xue, Xin Chen
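
One of the simplest hypothesis-level combination schemes is word-by-word majority voting, shown in the sketch below. This is our simplification: the hypotheses are assumed pre-aligned to equal length, whereas a real system such as ROVER first builds an alignment network before voting:

from collections import Counter

def majority_vote(hypotheses):
    # Pick the most frequent word at each aligned position.
    return [Counter(words).most_common(1)[0][0] for words in zip(*hypotheses)]

hyps = [
    "the cat sat on the mat".split(),
    "the cat sat in the mat".split(),
    "a cat sat on the mat".split(),
]
print(" ".join(majority_vote(hyps)))  # -> the cat sat on the mat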
Chapter 6. Deep Dynamic Models for Learning Hidden Representations of Speech Features
Abstract
Deep hierarchical structure with multiple layers of hidden space in human speech is intrinsically connected to its dynamic characteristics manifested in all levels of speech production and perception. The desire and an attempt to capitalize on a (superficial) understanding of this deep speech structure helped ignite the recent surge of interest in the deep learning approach to speech recognition and related applications, and a more thorough understanding of the deep structure of speech dynamics and the related computational representations is expected to further advance the research progress in speech technology. In this chapter, we first survey a series of studies on representing speech in a hidden space using dynamic systems and recurrent neural networks, emphasizing different ways of learning the model parameters and subsequently the hidden feature representations of time-varying speech data. We analyze and summarize this rich set of deep, dynamic speech models into two major categories: (1) top-down, generative models adopting localist representations of speech classes and features in the hidden space; and (2) bottom-up, discriminative models adopting distributed representations. With detailed examinations of and comparisons between these two types of models, we focus on the localist versus distributed representations as their respective hallmarks and defining characteristics. Future directions are discussed, including potential strategies for leveraging the strengths of both the localist and distributed representations while overcoming their respective weaknesses, going beyond the blind integration of the two in which the generative model pre-trains the discriminative one, a popular method of training deep neural networks.
Li Deng, Roberto Togneri
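
For readers unfamiliar with the bottom-up models discussed above, the sketch below shows the forward pass of a simple, untrained Elman recurrent network in NumPy: each speech feature frame is mapped to a distributed hidden vector that also depends on the network's past state. Dimensions and names are illustrative only; training (e.g. by backpropagation through time) is omitted:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 13, 32  # e.g. 13 cepstral features per frame

# Randomly initialised weights, standing in for a trained model.
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

def hidden_states(x):
    # x: (T, n_in) sequence of feature frames -> (T, n_hidden) states.
    h, states = np.zeros(n_hidden), []
    for x_t in x:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b)  # distributed representation
        states.append(h)
    return np.stack(states)

features = rng.normal(size=(100, n_in))  # stand-in for real speech features
print(hidden_states(features).shape)     # (100, 32)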
Chapter 7. Speech Based Emotion Recognition
Abstract
This chapter examines current approaches to speech-based emotion recognition. Following a brief introduction that describes the widely utilised approaches to building such systems, it broadly segregates the components commonly involved in emotion recognition systems by function (e.g., feature extraction, normalisation, classification) to give a broad view of the landscape. The next section then explains in more detail the components that are part of the most current systems. The chapter also presents a broad overview of how phonetic and speaker variability are dealt with in emotion recognition systems. Finally, it presents the authors’ views on the current and future research challenges in the field.
Vidhyasaharan Sethu, Julien Epps, Eliathamby Ambikairajah
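
The function-based segregation described above can be made concrete with a deliberately minimal NumPy pipeline. The features, the speaker-level z-normalisation, and the nearest-centroid classifier below are toy stand-ins of our own choosing, not the systems surveyed in the chapter:

import numpy as np

def extract_features(frames):
    # Toy utterance-level features: statistics of per-frame energy and
    # zero-crossing rate (real systems use far richer descriptors).
    energy = (frames ** 2).mean(axis=1)
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    return np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])

def z_normalise(x, mean, std):
    # Per-speaker normalisation, one common way to reduce speaker variability.
    return (x - mean) / (std + 1e-8)

def nearest_centroid(x, centroids, labels):
    # Minimal classifier stage: label of the closest class centroid.
    return labels[int(np.argmin(np.linalg.norm(centroids - x, axis=1)))]

rng = np.random.default_rng(1)
train = np.stack([extract_features(rng.normal(size=(50, 160)))
                  for _ in range(20)])           # 20 training utterances
mean, std = train.mean(axis=0), train.std(axis=0)
centroids = np.stack([z_normalise(train[:10], mean, std).mean(axis=0),
                      z_normalise(train[10:], mean, std).mean(axis=0)])
x = z_normalise(extract_features(rng.normal(size=(50, 160))), mean, std)
# Toy data carries no real emotion signal, so the printed label is arbitrary;
# the point is the feature -> normalisation -> classifier flow.
print(nearest_centroid(x, centroids, ["neutral", "angry"]))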
Chapter 8. Speaker Diarization: An Emerging Research
Abstract
Speaker diarization is the task of determining “Who spoke when?”: the objective is to annotate a continuous audio recording with speaker labels corresponding to the time regions in which each speaker spoke. The labels need not be the actual speaker identities (that would be speaker identification), as long as the same label is assigned to all regions uttered by the same speaker. These regions may overlap, since multiple speakers can talk simultaneously. Speaker diarization is thus essentially the combination of two processes: segmentation, in which speaker turns are detected, and unsupervised clustering, in which segments from the same speaker are grouped. The clustering is an unsupervised problem because there is no prior information about the number of speakers, their identities or the acoustic conditions (Meignier et al., Comput Speech Lang 20(2–3):303–330, 2006; Zhou and Hansen, IEEE Trans Speech Audio Process 13(4):467–474, 2005). This chapter presents the fundamentals of speaker diarization and the most significant recent work on this topic.
Trung Hieu Nguyen, Eng Siong Chng, Haizhou Li
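
As a toy version of the unsupervised clustering stage (segmentation is assumed already done; real systems often use BIC-based change detection followed by agglomerative clustering), the sketch below greedily assigns each segment embedding to the nearest existing cluster, or opens a new anonymous speaker label when nothing is close enough. The names and the threshold are ours:

import numpy as np

def cluster_segments(embeddings, threshold):
    # Greedy online clustering: each cluster index is an anonymous
    # speaker label ("who spoke when", not who the speaker is).
    centroids, counts, labels = [], [], []
    for e in embeddings:
        if centroids:
            d = np.linalg.norm(np.stack(centroids) - e, axis=1)
            k = int(np.argmin(d))
            if d[k] < threshold:
                counts[k] += 1
                centroids[k] += (e - centroids[k]) / counts[k]  # running mean
                labels.append(k)
                continue
        centroids.append(e.astype(float).copy())
        counts.append(1)
        labels.append(len(centroids) - 1)
    return labels

rng = np.random.default_rng(2)
spk_a = rng.normal(0.0, 0.1, size=(4, 8))   # segments from speaker A
spk_b = rng.normal(3.0, 0.1, size=(4, 8))   # segments from speaker B
segments = np.vstack([spk_a[:2], spk_b[:2], spk_a[2:], spk_b[2:]])
print(cluster_segments(segments, threshold=2.0))  # [0, 0, 1, 1, 0, 0, 1, 1]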

Current Trends in Speech Enhancement

Frontmatter
Chapter 9. Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement
Abstract
When speech signals are captured in real acoustical environments, the captured signals are distorted by certain types of interference, such as ambient noise, reverberation, and extraneous speakers’ utterances. There are two important approaches to speech enhancement that reduce such interference in the captured signals. One approach is based on the spatial features of the signals, such as direction of arrival and acoustic transfer functions, and enhances speech using multichannel audio signal processing. The other approach is based on speech spectral models that represent the probability density function of the speech spectra, and it enhances speech by distinguishing between speech and noise based on the spectral models. In this chapter, we propose a new approach that integrates the above two approaches. The proposed approach uses the spatial and spectral features of signals in a complementary manner to achieve reliable and accurate speech enhancement. The approach can be applied to various speech enhancement problems, including denoising, dereverberation, and blind source separation (BSS). In particular, in this chapter, we focus on applying the approach to BSS. We show experimentally that the proposed integration can improve the performance of BSS compared with a conventional approach.
Yasuaki Iwata, Tomohiro Nakatani, Takuya Yoshioka, Masakiyo Fujimoto, Hirofumi Saito
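
The two ingredient families the chapter integrates can each be illustrated in isolation. The sketch below is a toy of our own, not the chapter's MAP estimator: it pairs a spatial step (a delay-and-sum beamformer with known integer delays) with a spectral step (a Wiener-style gain), using a crude noise-floor estimate that is adequate only for this synthetic single-tone example:

import numpy as np

def delay_and_sum(mics, delays):
    # Spatial step: align the channels toward the source and average.
    n = min(len(m) - d for m, d in zip(mics, delays))
    return np.mean([m[d:d + n] for m, d in zip(mics, delays)], axis=0)

def wiener_gain(power, noise_power):
    # Spectral step: per-bin gain from an estimated a-priori SNR.
    snr = np.maximum(power / (noise_power + 1e-12) - 1.0, 0.0)
    return snr / (snr + 1.0)

rng = np.random.default_rng(3)
fs = 16000
clean = np.sin(2 * np.pi * 300 * np.arange(fs) / fs)
# Two microphones: the second receives the source 3 samples later.
mics = [np.concatenate([np.zeros(d), clean]) + 0.3 * rng.normal(size=fs + d)
        for d in (0, 3)]
y = delay_and_sum(mics, delays=(0, 3))
Y = np.fft.rfft(y)
P = np.abs(Y) ** 2
noise_floor = np.median(P)  # crude: works here because speech is one tone
enhanced = np.fft.irfft(wiener_gain(P, noise_floor) * Y, n=len(y))
print(enhanced.shape)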
Chapter 10. Modulation Processing for Speech Enhancement
Abstract
Many traditional speech enhancement methods reduce noise from corrupted speech by processing the magnitude spectrum in a short-time Fourier analysis-modification-synthesis (AMS) framework. More recently, use of the modulation domain for speech processing has been investigated; however, early efforts in this direction did not account for the changing properties of the modulation spectrum across time. Motivated by this and by evidence of the significance of the modulation domain, we investigated processing the modulation spectrum on a short-time basis for speech enhancement. For this purpose, a modulation-domain-based AMS framework was used, in which the trajectories of each acoustic frequency bin were processed frame-wise in a secondary AMS framework. A number of different enhancement algorithms were investigated for enhancing speech in the short-time modulation domain, including spectral subtraction and MMSE magnitude estimation. In each case, the respective algorithm was used to modify the short-time modulation magnitude spectrum within the modulation AMS framework. Here we review the findings of this investigation, comparing the quality of stimuli enhanced using these modulation-based approaches to stimuli enhanced using the corresponding modification algorithms applied in the acoustic domain. The results show that modulation-domain-based approaches improve quality compared to their acoustic-domain counterparts. Further, MMSE modulation magnitude estimation (MME) yields better speech quality than modulation spectral subtraction (ModSSub). MME stimuli show good noise removal without introducing the musical noise that is problematic in spectral-subtraction-based enhancement, and ModSSub exhibits minimal musical noise compared to acoustic spectral subtraction for appropriately selected modulation frame durations. For modulation-domain methods, the modulation frame duration is an important parameter, with quality generally improved by shorter frame durations. From these experiments, we conclude that the short-time modulation domain provides an effective alternative to the short-time acoustic domain for speech processing, and that within it, MME provides effective noise suppression without introducing musical noise distortion.
Kuldip Paliwal, Belinda Schwerin
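
To fix ideas, the sketch below applies a toy version of modulation spectral subtraction to the magnitude trajectory of a single acoustic frequency bin: a secondary short-time FFT across time (the modulation domain), magnitude subtraction with spectral flooring, and overlap-add resynthesis. Frame sizes, the flooring factor and the noise estimate are illustrative choices of ours, not the chapter's tuned parameters:

import numpy as np

def modssub_trajectory(traj, noise_mod_mag, frame=32, hop=16, floor=0.002):
    # Secondary AMS framework over one acoustic-bin magnitude trajectory.
    win = np.hanning(frame)
    out = np.zeros(len(traj))
    norm = np.zeros(len(traj))
    for s in range(0, len(traj) - frame + 1, hop):
        spec = np.fft.rfft(traj[s:s + frame] * win)
        mag = np.abs(spec)
        # Subtract the noise modulation magnitude, with spectral flooring.
        mag_hat = np.maximum(mag - noise_mod_mag, floor * mag)
        seg = np.fft.irfft(mag_hat * np.exp(1j * np.angle(spec)), n=frame)
        out[s:s + frame] += seg * win           # overlap-add synthesis
        norm[s:s + frame] += win ** 2
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(4)
t = np.arange(512)
clean = 1.0 + 0.5 * np.sin(2 * np.pi * t / 64)  # slowly modulated magnitude
traj = clean + 0.2 * rng.normal(size=512)       # noisy bin trajectory
noise = 0.2 * rng.normal(size=512)              # noise-only stretch
win = np.hanning(32)
noise_mag = np.mean([np.abs(np.fft.rfft(noise[s:s + 32] * win))
                     for s in range(0, 512 - 32 + 1, 16)], axis=0)
denoised = modssub_trajectory(traj, noise_mag)
# Compare mean-squared error against the clean trajectory before and after.
print(np.mean((traj - clean) ** 2), np.mean((denoised - clean) ** 2))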
Metadata
Title
Speech and Audio Processing for Coding, Enhancement and Recognition
Edited by
Tokunbo Ogunfunmi
Roberto Togneri
Madihally (Sim) Narasimha
Copyright year
2015
Publisher
Springer New York
Electronic ISBN
978-1-4939-1456-2
Print ISBN
978-1-4939-1455-5
DOI
https://doi.org/10.1007/978-1-4939-1456-2
