
2017 | Book

Speech Coding

Code-Excited Linear Prediction

About this book

This book provides a scientific understanding of the most central techniques used in speech coding, both for advanced students and for professionals with a background in speech, audio and/or digital signal processing. It provides a clear connection between the whys, hows and whats, such that the necessity, purpose and solutions provided by the tools are always within sight, as well as their strengths and weaknesses in each respect. Equivalently, this book sheds light on the following perspectives for each technology presented:

Objective: What do we want to achieve and especially why is this goal important?

Resource / Information: What information is available and how can it be useful?

Resource / Platform: What kind of platforms are we working with and what are the capabilities/restrictions of those platforms? This includes properties such as computational, memory, acoustic and transmission capacity of devices used.

Solutions: Which solutions have been proposed and how can they be used to reach the stated goals?

Strengths and weaknesses: In which ways do the solutions fulfill the objectives and where are they insufficient? Are resources used efficiently?

This book concentrates solely on code-excited linear prediction and its derivatives, since mainstream speech codecs are based on linear prediction. It also concentrates exclusively on time-domain techniques, because frequency-domain tools are to a large extent common with audio codecs.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
The objective of speech coding technologies is primarily to enable spoken communication between geographically separated people and also to allow storage of speech signals. The performance of such technologies can be measured by both the perceived quality of the communication experience and the amount of resources required. For efficient performance, speech codecs are based on two types of modelling techniques applied in parallel: (1) they model the signal source by a model of speech production and (2) for optimisation of quality, they apply a perceptual model. These models also include entropy coding to remove statistical redundancy.
Tom Bäckström

Basic Properties of Speech Signals

Frontmatter
Chapter 2. Speech Production and Modelling
Abstract
Humans produce speech sounds by pushing air out of the lungs, whereby the airflow sets the vocal folds into oscillation and becomes turbulent at constrictions in the vocal tract. The flow waveform thus created is further modulated by the resonances of the vocal tract. These features form the characteristic properties of phones. For efficient coding, we must model these features with a minimum number of parameters without altering the perceptual impression.
Tom Bäckström
Chapter 3. Principles of Entropy Coding with Perceptual Quality Evaluation
Abstract
The objective of speech coding is to transmit speech at the highest possible quality with the lowest possible amount of resources. To achieve the best compromise, we can use available information about 1. the source, which is the speech production system, 2. the quality measure or evaluation criteria, which depends on the performance of the human hearing system and 3. the statistical frequency and distribution of the involved parameters. By developing models for all such information, we can optimise the system to perform efficiently. In practice, the three methods are overlapping in the sense that it is often difficult to make a clear-cut separation between them. While source modelling was already discussed in Chap. 2, this chapter reviews entropy coding methods and the associated perceptual modelling methods.
Tom Bäckström
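The role of entropy coding mentioned in the abstract above can be illustrated with a small calculation. The following is a minimal sketch, assuming a sequence of already quantised parameter indices: it estimates the Shannon entropy of their empirical distribution, which is a lower bound on the average number of bits per symbol that a lossless entropy coder can reach for that distribution. The function name and the example data are illustrative and not taken from the book.

```python
import math
from collections import Counter


def empirical_entropy(symbols):
    """Shannon entropy, in bits per symbol, of the empirical distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical quantiser output: small indices occur more often than large ones.
quantised = [0, 0, 0, 0, 1, 1, 2, 3]
bits_per_symbol = empirical_entropy(quantised)
print(bits_per_symbol)  # 1.75, compared with 2 bits for a fixed-length code
```

The skewed distribution is what makes the saving possible: with four equally likely symbols the entropy would be the full 2 bits.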

Core Tools

Frontmatter
Chapter 4. Spectral Envelope and Perceptual Masking Models
Abstract
Envelope models describe the gross shape of a signal, such as the magnitude spectrum of a speech signal. An envelope model of the spectrum is thus a source model of the speech signal. Perceptual (frequency) masking models are evaluation models, which describe the magnitude of the perceptually detrimental effect of errors in different parts of the spectrum. The two models tend to have similar shapes, whereby they are described jointly in this chapter. In CELP-type codecs, envelope models are usually based on linear prediction, whereby that will be the main theme of this chapter.
Tom Bäckström
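As a rough illustration of an envelope model based on linear prediction, the sketch below estimates predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion. This is a standard textbook approach and is assumed here for illustration; it is not necessarily the formulation used in the chapter. Function names are hypothetical, and windowing and lag-windowing, which practical codecs typically apply, are omitted.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for A(z) = 1 + a[1] z^-1 + ... + a[M] z^-M.

    Returns (a, err), where a[0] == 1 and err is the prediction error energy.
    Assumes r[0] > 0 and a positive definite autocorrelation sequence.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        reflection = -acc / err
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + reflection * prev[m - k]
        a[m] = reflection
        err *= 1.0 - reflection * reflection
    return a, err


# Autocorrelation of a first-order process with correlation 0.9 between
# neighbouring samples: the recursion should find a[1] close to -0.9 and
# a[2] close to 0.
a, err = levinson_durbin([1.0, 0.9, 0.81], order=2)
print(a, err)
```

The spectral envelope then follows as the magnitude response of the synthesis filter 1/A(z), scaled by the square root of the error energy.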
Chapter 5. Windowing and the Zero Input Response
Abstract
Speech processing algorithms usually segment signals into finite-length blocks or windows, since block operations are generally more efficient in terms of both bitrate and computational complexity. Speech codecs model temporal correlations with linear prediction both for coding efficiency and to enable smooth transitions between frames. This chapter describes this framing or windowing process based on linear prediction. A central feature of this windowing process is the zero input response of the corresponding linear predictive model, which corresponds to the smooth overlap between processing windows.
Tom Bäckström
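The zero input response named in the abstract above can be sketched as follows: the all-pole synthesis filter 1/A(z) keeps running into the next frame with its excitation set to zero, so that only the filter memory from the previous frame contributes. This is a minimal sketch under that assumption; the function name and the convention a = [1, a1, ..., aM] are illustrative choices, not taken from the book.

```python
def zero_input_response(a, past_output, length):
    """Output of 1/A(z) for zero excitation, given the filter memory.

    a           -- predictor polynomial [1, a1, ..., aM]
    past_output -- previous output samples, oldest first (at least M of them)
    length      -- number of zero input response samples to compute
    """
    order = len(a) - 1
    history = list(past_output)
    response = []
    for _ in range(length):
        # y[n] = -sum_k a[k] * y[n - k], since the input x[n] is zero
        y = -sum(a[k] * history[-k] for k in range(1, order + 1))
        response.append(y)
        history.append(y)
    return response


# First-order example: the previous frame ended at 1.0 and the response
# decays geometrically, approximately 0.9, 0.81, 0.729.
print(zero_input_response([1.0, -0.9], [1.0], 3))
```

In an encoder of this type, the zero input response is typically subtracted from the target signal of the current frame, so that only the contribution of the new excitation remains to be quantised.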
Chapter 6. Fundamental Frequency
Abstract
A model for the pitch is a central part of any source model of speech. It corresponds to the oscillations of the vocal folds and is usually modelled with a long-term predictor. In speech codecs it is generally implemented as an adaptive vector codebook, where the residual of the linear predictive filter is modelled using the past residual. The model parameters are therefore the lag and gain.
Tom Bäckström
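The lag and gain parameters of the long-term predictor can be illustrated with a simple open-loop search. The sketch below assumes that the past residual is available as a list and, for each candidate lag, builds an adaptive codebook vector by copying the past residual (repeating it when the lag is shorter than the frame). It then picks the lag that maximises the normalised correlation with the target and computes the matching least-squares gain. Practical codecs usually perform this search in a perceptually weighted domain and with fractional lags; both are omitted here, and all names are hypothetical.

```python
def adaptive_codebook_vector(past, lag, length):
    """Copy the past residual at the given lag (requires 1 <= lag <= len(past))."""
    buf = list(past)
    for _ in range(length):
        buf.append(buf[-lag])  # lags shorter than the frame repeat themselves
    return buf[len(past):]


def search_lag_and_gain(past, target, min_lag, max_lag):
    """Return (lag, gain) minimising the squared error ||target - gain * c_lag||^2."""
    best_lag, best_gain, best_score = None, 0.0, 0.0
    for lag in range(min_lag, max_lag + 1):
        c = adaptive_codebook_vector(past, lag, len(target))
        cross = sum(t * v for t, v in zip(target, c))
        energy = sum(v * v for v in c)
        if energy <= 0.0 or cross <= 0.0:
            continue  # only positive gains are considered in this sketch
        score = cross * cross / energy
        if score > best_score:
            best_lag, best_gain, best_score = lag, cross / energy, score
    return best_lag, best_gain


# Synthetic residual with a period of 5 samples; the target is the next
# period at half the amplitude.
period = [1.0, 0.0, -0.5, 0.25, 0.0]
past = period * 4
target = [0.5 * v for v in period]
print(search_lag_and_gain(past, target, min_lag=3, max_lag=12))  # (5, 0.5)
```

Lag 10 matches the target equally well in this example; the strict comparison keeps the shortest matching lag.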
Chapter 7. Residual Coding
Abstract
The spectral envelope and fundamental frequency of a speech signal are generally modelled by linear, short- and long-term predictive synthesis filters. The residual from these two filters is a signal with almost no temporal correlation. In this chapter we describe modelling and optimisation of the residual quantisation. The most famous approach is algebraic coding, which has also given the name to algebraic code-excited linear prediction (ACELP). It assumes that the residual signal follows the Laplace distribution and provides an enumeration method, the algebraic code, with which every possible quantisation can be encoded.
Tom Bäckström
Chapter 8. Signal Gain and Harmonics to Noise Ratio
Abstract
The final basic component of the speech source model is that of the intensity, volume or energy of the two main components, voiced and unvoiced parts of the speech signal. Usually, signal output energy is modelled by simple gain factors applied to the excitations of the filters. The two parts, unvoiced excitation corresponding to the residual codebook and the voiced excitation corresponding to the adaptive codebook, are most commonly multiplied with two scalar gain factors which are coded jointly. Alternatively, it is possible to encode a single output-energy gain and steer the proportion of components with a ratio-factor corresponding to the harmonics to noise ratio (HNR).
Tom Bäckström
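The two parametrisations described in the abstract above can be sketched in a few lines. The example below mixes an adaptive codebook vector and a residual codebook vector with two scalar gains, and computes the ratio of the energies of the two scaled components in decibels as a stand-in for the harmonics to noise ratio. Treating this energy ratio as the HNR is an assumption made for illustration; the exact definition used in the chapter may differ, and the function names are hypothetical.

```python
import math


def mix_excitation(adaptive, fixed, gain_pitch, gain_code):
    """Total excitation: gain_pitch * adaptive + gain_code * fixed."""
    return [gain_pitch * a + gain_code * c for a, c in zip(adaptive, fixed)]


def harmonics_to_noise_ratio_db(adaptive, fixed, gain_pitch, gain_code):
    """Energy ratio of the scaled voiced and unvoiced components, in dB.

    Assumes both scaled components have non-zero energy.
    """
    voiced = sum((gain_pitch * a) ** 2 for a in adaptive)
    unvoiced = sum((gain_code * c) ** 2 for c in fixed)
    return 10.0 * math.log10(voiced / unvoiced)


adaptive = [1.0, -1.0, 1.0, -1.0]  # voiced contribution (adaptive codebook)
fixed = [1.0, 0.0, 0.0, -1.0]      # unvoiced contribution (residual codebook)
excitation = mix_excitation(adaptive, fixed, gain_pitch=0.8, gain_code=0.4)
hnr_db = harmonics_to_noise_ratio_db(adaptive, fixed, gain_pitch=0.8, gain_code=0.4)
print(excitation, hnr_db)
```

Given a total energy and such a ratio, the two individual gains can be recovered, which is the sense in which the two parametrisations carry the same information.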

Advanced Tools and Extensions

Frontmatter
Chapter 9. Pre- and Postfiltering
Abstract
At low bitrates, basic CELP codecs have an easily recognisable distortion characteristic, often described as noisiness or roughness. To reduce the perceptual distortion, we can try to identify and remove typical artefacts by filtering the decoded signal. In this chapter we present the most typical postfiltering techniques, such as formant enhancement and bass postfiltering.
Tom Bäckström
Chapter 10. Frequency Domain Coding
Abstract
Signals which are sufficiently stationary permit highly efficient coding in the frequency domain. Such signals include speech signals such as sustained vowels and prolonged fricatives, as well as generic audio signals such as music and mixed material. The main components of frequency domain coding methods include windowing, a time-frequency transform, perceptual modelling and entropy coding of the spectral components. This chapter gives an overview of such transform domain coding methods.
Tom Bäckström
Chapter 11. Bandwidth Extension
Abstract
Perceptual audio coding at low bit rates often relies on semi-parametric or parametric techniques to efficiently transmit and restore audio content that, after receiving, may be very different to the original in its waveform, but is perceptually still very close to it. Audio bandwidth extension exploits the limited resolution of the human auditory perception at high frequencies to recreate a spectral high band from the transmitted spectral low band and post-processing parameters, which elicits the sensation of plausible high frequency content that perceptually fuses with the low band into a decent broadband audio perception. The following chapter details the underlying thoughts, design criteria, perceptual trade-offs and signal processing techniques found in contemporary low bit rate audio codecs using audio bandwidth extension.
Sascha Disch, Tom Bäckström
Chapter 12. Packet Loss and Concealment
Abstract
Transmission over real-world networks will occasionally suffer from transmission errors, which can significantly deteriorate the perceived quality of a speech codec. This chapter addresses the problem of transmission errors in packet-based voice applications, such as voice over Internet protocol (VoIP). A broad range of techniques for recovery from packet loss on the channel is presented, from channel coding to techniques using speech signal processing methods, as well as both sender-driven and receiver-based methods. The sender-based methods include, for example, retransmission, interleaving and forward error correction (both media-specific and media-independent), whereas receiver-based techniques include noise substitution, repetition and synchronisation methods.
Jérémie Lecomte, Tom Bäckström
Chapter 13. Voice Activity Detection
Abstract
Voice Activity Detection (VAD) provides the information whether an audio signal contains speech or not. Besides speech coding and transmission, there are many other applications in speech and audio processing that benefit from this information, and their performance is crucially dependent on the accuracy and robustness of the applied VAD. Various approaches to detect speech have been developed in the past, but when considering the challenging scenarios in which speech needs to be detected, e.g. hands-free communication in noisy environments or dialogue over background music, there is still room for improvement. In this chapter, we describe the problem and the environments of VAD, and discuss the procedure, example methods and their evaluation. Especially the more challenging application scenarios illustrate how superior human hearing can be compared to implementations of audio signal processing.
Christian Uhle, Tom Bäckström
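To make the task concrete, the sketch below shows the simplest kind of detector: a frame-wise energy threshold. It is a deliberately naive baseline, assumed here for illustration only, and it would fail in exactly the noisy and music-laden scenarios that the abstract above calls challenging. The frame length, threshold and function names are hypothetical.

```python
import math


def frame_energies_db(signal, frame_length):
    """Mean power of each complete, non-overlapping frame, in dB."""
    energies = []
    for start in range(0, len(signal) - frame_length + 1, frame_length):
        frame = signal[start:start + frame_length]
        power = sum(s * s for s in frame) / frame_length
        energies.append(10.0 * math.log10(power + 1e-12))  # avoid log of zero
    return energies


def energy_vad(signal, frame_length, threshold_db):
    """Mark a frame as active when its energy exceeds the threshold."""
    return [e > threshold_db for e in frame_energies_db(signal, frame_length)]


# Quiet background, a louder burst, then quiet background again.
signal = [0.001] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.001] * 4
print(energy_vad(signal, frame_length=4, threshold_db=-30.0))
# [False, True, False]
```

More robust detectors typically combine several features and adapt their thresholds to an estimate of the background noise level.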
Chapter 14. Relaxed Code-Excited Linear Prediction (RCELP)
Abstract
While code-excited linear prediction shows good performance for bitrates above 8 kbits/s, the quality of the speech-specific waveform coding scheme drops noticeably at lower rates. At such low rates, all information included in the waveform of the speech cannot be correctly coded. On the other hand, parametric speech coders, also called vocoders, concentrate only on key perceptual features of speech, rather than the entire waveform. By encoding parameters of a linear model of speech, parametric speech coding can reach rates of 4 kbits/s down to 0.5 kbits/s, though the quality is often described as synthetic. Relaxed code-excited linear prediction (RCELP) is a way of extending the code-excited linear prediction coding scheme towards low rates while keeping a natural speech quality. RCELP uses the generalised analysis-by-synthesis paradigm, which relaxes the waveform-matching constraints without affecting speech quality. The basic principle is to ease the encoding of the signal by appropriately modifying the input signal.
Guillaume Fuchs, Tom Bäckström

Standards and Specifications

Frontmatter
Chapter 15. Quality Evaluation
Abstract
For evaluating the performance of speech codecs or their algorithms, we need metrics for quality. Since humans are the ultimate users of most speech codecs, subjective testing is the gold standard in quality measurement. In this chapter we present a range of typical subjective evaluation methods and discuss their strengths and weaknesses. Subjective evaluation requires in general plenty of work and time, whereby there exists a selection of automated objective evaluation methods, which simulate subjective measurements. Objective quality estimation methods therefore provide fast and reproducible measures of quality, even if their reliability can never reach that of extensive subjective measurements.
Tom Bäckström
Backmatter
Metadata
Title
Speech Coding
Written by
Tom Bäckström
Copyright year
2017
Electronic ISBN
978-3-319-50204-5
Print ISBN
978-3-319-50202-1
DOI
https://doi.org/10.1007/978-3-319-50204-5