
About this Book

Over the last decade, remarkable advances in computing and networking have sparked enormous interest in deploying automatic speech recognition in devices and networks, and the trend is accelerating.

This book brings together leading academic researchers and industrial practitioners to address the issues in this emerging realm. It covers networked, distributed and embedded speech recognition systems, which are expected to co-exist in the future. The book is divided into four parts: networked speech recognition, distributed speech recognition, embedded speech recognition, and systems and applications. It provides a thorough and unified introduction to the area and its latest developments, supplies the working knowledge needed for research and for practical deployment of applications, and covers the most up-to-date standards and a number of complete systems.

This all-inclusive reference is an essential read for graduate students, scientists and engineers working or researching in the field of speech recognition and processing.

Table of Contents

Frontmatter

Network Speech Recognition

1. Network, Distributed and Embedded Speech Recognition: An Overview

As mobile devices become pervasive and small, the design of efficient user interfaces is rapidly developing into a major issue. The expectation of speech-centric interfaces has stimulated great interest in deploying automatic speech recognition (ASR) on devices like mobile phones, PDAs and automobiles. Mobile devices are characterised by limited computational power, memory size and battery life, whereas state-of-the-art ASR systems are computationally intensive. To circumvent these restrictions, a great deal of effort has been spent on enabling efficient ASR implementation on embedded platforms, primarily through fixed-point arithmetic and algorithm optimisation for low computational complexity and a small memory footprint. The restrictions can also be largely bypassed on the architecture side: distributed speech recognition (DSR) splits ASR processing into client-based feature extraction and server-based recognition. The relief of the computational burden on mobile devices, however, comes at the cost of network degradations and additional components such as feature quantisation, error recovery and concealment. An alternative to DSR is network speech recognition, which uses a conventional speech coder for speech transmission from client to server. Over the past decade, these areas have undergone substantial development. This chapter gives a comprehensive overview of the areas and discusses the pros and cons of the different approaches. The optimal choice is made according to the complexity of the ASR components, the resources available on the device and in the network, and the location of the associated applications.
Zheng-Hua Tan, Imre Varga

2. Speech Coding and Packet Loss Effects on Speech and Speaker Recognition

This chapter addresses the speech coding and packet loss problems that occur in network speech recognition, where speech is transmitted (and usually coded) from a client terminal to a recognition server. The first part describes some commonly used speech coding standards and presents a packet loss model useful for evaluating different channel degradation conditions in a controlled fashion. The second part evaluates the influence of different speech and audio codecs on the performance of a continuous speech recognition engine. It is shown that MPEG transcoding degrades speech recognition performance at low bit rates, whereas performance remains acceptable for specialized speech coders like G.723.1. The same system is also evaluated under different simulated and real packet loss conditions; in that case, the significant degradation of automatic speech recognition (ASR) performance is analyzed. The third part presents an overview of joint compression and packet loss effects on speech biometrics. In contrast to the ASR task, it is experimentally demonstrated that the adverse effects of packet loss alone are negligible, while the encoding of speech, particularly at a low bit rate, coupled with packet loss, can reduce speaker recognition accuracy considerably. The fourth part discusses these experimental observations and refers to robustness approaches.
Laurent Besacier
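A packet loss model of the kind used for controlled channel-degradation experiments is commonly a two-state Gilbert model: a "good" and a "bad" channel state alternate, and the two transition probabilities jointly control the average loss rate and the burstiness of the losses. As a rough illustration (the transition probabilities and target loss rate below are illustrative assumptions, not values from the chapter), such a channel can be simulated as:

```python
import random

def gilbert_loss_pattern(n_packets, p_good_to_bad, p_bad_to_good, seed=0):
    """Simulate a two-state Gilbert packet-loss channel.

    In the 'good' state packets are received; in the 'bad' state they
    are lost. The stationary loss rate is
    p_good_to_bad / (p_good_to_bad + p_bad_to_good).
    """
    rng = random.Random(seed)
    state_bad = False
    pattern = []
    for _ in range(n_packets):
        if state_bad:
            if rng.random() < p_bad_to_good:
                state_bad = False
        else:
            if rng.random() < p_good_to_bad:
                state_bad = True
        pattern.append(state_bad)  # True means the packet is lost
    return pattern

# Roughly 10% average loss with bursty behaviour
losses = gilbert_loss_pattern(10_000, p_good_to_bad=0.05, p_bad_to_good=0.45)
loss_rate = sum(losses) / len(losses)
```

Lowering `p_bad_to_good` while keeping the ratio fixed keeps the same average loss rate but makes the bursts longer, which is exactly the distinction between "simulated" loss conditions that such a model lets one explore.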

3. Speech Recognition Over Mobile Networks

This chapter addresses issues associated with automatic speech recognition (ASR) over mobile networks, and introduces several techniques for improving speech recognition performance. One of these issues is the performance degradation of ASR over mobile networks that results from distortions produced by speech coding algorithms employed in mobile communication systems, transmission errors occurring over mobile telephone channels, and ambient background noise that can be particularly severe in mobile domains. In particular, speech coding algorithms have difficulty in modeling speech in ambient noise environments. To overcome this problem, noise reduction techniques can be integrated into speech coding algorithms to improve reconstructed speech quality under ambient noise conditions, or speech coding parameters can be made more robust with respect to ambient noise. As an alternative to mitigating the effects of speech coding distortions in the received speech signal, a bitstream-based framework has been proposed. In this framework, the direct transformation of speech coding parameters to speech recognition parameters is performed as a means of improving ASR performance. Furthermore, it is suggested that the receiver-side enhancement of speech coding parameters can be performed using either an adaptation algorithm or model compensation. Finally, techniques for reducing the effects of channel errors are also discussed in this chapter. These techniques include frame erasure concealment for ASR, soft-decoding, and missing feature theory-based ASR decoding.
Hong Kook Kim, Richard C. Rose

4. Speech Recognition Over IP Networks

This chapter introduces the basic features of speech recognition over an IP-based network. First, we review typical lossy packet channel models and several speech coders used for voice over IP, where the performance of a network speech recognition (NSR) system can significantly degrade. Second, several techniques for maintaining the performance of NSR against packet loss are addressed. The techniques are classified into client-based and server-based techniques; the former include rate control approaches, forward error correction and interleaving, while the latter include packet loss concealment and ASR-decoder based concealment. The last part of this chapter is devoted to explaining a new framework for NSR over IP networks. In particular, a speech coder that is optimized for automatic speech recognition (ASR) is presented, which provides speech quality comparable to the conventional standard speech coders used in IP networks. In addition, we compare the performance of NSR using the ASR-optimized speech coder to that using a conventional speech coder.
Hong Kook Kim

Distributed Speech Recognition

5. Distributed Speech Recognition Standards

This chapter provides an overview of the industry standards for distributed speech recognition (DSR) developed in ETSI, 3GPP and IETF. These standards were created to ensure interoperability between the feature extraction running on a client device and a compatible recogniser running on a remote server. They are intended for use in the implementation of commercial speech and multimodal services over mobile networks. In the process of developing and agreeing on the standards, substantial performance testing was conducted, and these results are also summarised here. While other chapters provide more general information about feature extraction and channel error processing for DSR, this chapter focuses on introducing the specifics of the standards.
David Pearce

6. Speech Feature Extraction and Reconstruction

This chapter is concerned with feature extraction and back-end speech reconstruction and is particularly aimed at distributed speech recognition (DSR) and the work carried out by the ETSI Aurora group. Feature extraction is examined first and begins with a basic implementation of mel-frequency cepstral coefficients (MFCCs). Additional processing, in the form of noise and channel compensation, is explained and has the aim of increasing speech recognition accuracy in real-world environments. Source and channel coding issues relevant to DSR are also briefly discussed. Back-end speech reconstruction using a sinusoidal model is explained and it is shown how this is possible by transmitting additional source information (voicing and fundamental frequency) from the terminal device. An alternative method of back-end speech reconstruction is then explained, where the voicing and fundamental frequency are predicted from the received MFCC vectors. This enables speech to be reconstructed solely from the MFCC vector stream and requires no explicit voicing and fundamental frequency transmission.
Ben Milner
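The basic MFCC pipeline described above (windowing, power spectrum, mel filterbank, log compression and DCT) can be sketched in a few lines. The frame size, sampling rate and filterbank parameters below are illustrative assumptions rather than the exact settings of the ETSI Aurora front-end:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular filters spaced uniformly on the mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc_frame(frame, fs=8000, n_mels=23, n_ceps=13):
    """MFCCs for a single speech frame."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    log_mel = np.log(mel_filterbank(n_mels, n_fft, fs) @ spectrum + 1e-10)
    # DCT-II decorrelates the log filterbank energies
    n = np.arange(n_mels)
    return np.array([np.sum(log_mel * np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels)))
                     for k in range(n_ceps)])

# 256-sample frame of a 1 kHz tone sampled at 8 kHz
ceps = mfcc_frame(np.sin(2 * np.pi * 1000 * np.arange(256) / 8000))
```

The noise and channel compensation steps the chapter adds on top of this would operate on the power spectrum and on the cepstral vectors, respectively; they are omitted here for brevity.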

7. Quantization of Speech Features: Source Coding

In this chapter, we describe various schemes for quantizing speech features to be used in distributed speech recognition (DSR) systems. We analyze the statistical properties of mel frequency-warped cepstral coefficients (MFCCs) that are most relevant to quantization, namely the correlation and the shape of the probability density function, in order to determine the type of quantization scheme that would be most suitable for quantizing them efficiently. We also determine empirically the relationship between mean squared error and recognition accuracy, in order to verify that quantization schemes that minimize mean squared error also improve recognition performance. Furthermore, we highlight the importance of noise robustness in DSR and describe the use of a perceptually weighted distance measure to enhance spectral peaks in vector quantization. Finally, we present some experimental results on the quantization schemes in a DSR framework and compare their relative recognition performances.
Stephen So, Kuldip K. Paliwal
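A minimal vector quantizer of the general kind discussed above can be built with plain k-means (Lloyd's algorithm). The codebook size and training data below are illustrative assumptions, and the chapter's perceptually weighted distance measure is replaced here by plain squared error:

```python
import numpy as np

def train_codebook(data, n_codewords=8, n_iters=25, seed=0):
    """Train a VQ codebook by k-means on a set of feature vectors."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), n_codewords, replace=False)].copy()
    for _ in range(n_iters):
        # nearest codeword for every vector (squared Euclidean distance)
        dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        for k in range(n_codewords):
            members = data[idx == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(data, codebook):
    """Map each vector to its nearest codeword."""
    dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook[dists.argmin(axis=1)]

rng = np.random.default_rng(1)
features = rng.normal(size=(500, 13))      # stand-in for MFCC vectors
codebook = train_codebook(features)
reconstructed = quantize(features, codebook)
mse = ((features - reconstructed) ** 2).mean()
```

In a DSR setting only the codeword indices (here, 3 bits per vector for an 8-entry codebook) would be transmitted; the empirical MSE computed above is the quantity whose relationship to recognition accuracy the chapter examines.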

8. Error Recovery: Channel Coding and Packetization

Distributed Speech Recognition (DSR) systems rely on efficient transmission of speech information from distributed clients to a centralized server. Wireless or network communication channels within DSR systems are typically noisy and bursty. Thus, DSR systems must utilize efficient Error Recovery (ER) schemes during transmission of speech information. Some ER strategies, referred to as forward error control (FEC), aim to create redundancy in the source coded bitstream to overcome the effect of channel errors, while others are designed to create spread or delay in the feature stream in order to overcome the effect of bursty channel errors. Furthermore, ER strategies may be designed as a combination of the previously described techniques. This chapter presents an array of error recovery techniques for remote speech recognition applications.
This chapter is organized as follows. First, channel characterization and modeling are discussed. Next, media-specific FEC is presented for packet erasure applications, followed by a discussion on media-independent FEC techniques for bit error applications, including general linear block codes, cyclic codes, and convolutional codes. The application of unequal error protection (UEP) strategies utilizing combinations of the aforementioned FEC methods is also presented. Finally, frame-based interleaving is discussed as an alternative to overcoming the effect of bursty channel erasures. The chapter concludes with examples of modern standards for channel coding strategies for distributed speech recognition (DSR).
Bengt J. Borgström, Alexis Bernard, Abeer Alwan
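Frame-based interleaving, the last technique listed above, can be illustrated with a simple block interleaver: frames are written row-wise into a depth-by-width matrix and transmitted column-wise, so a burst of consecutive channel erasures maps to frames that are far apart in the original stream. The depth and frame counts below are illustrative:

```python
def interleave(frames, depth):
    """Block interleaver: row-wise write, column-wise read."""
    width = len(frames) // depth
    return [frames[row * width + col]
            for col in range(width)
            for row in range(depth)]

def deinterleave(frames, depth):
    """Invert the block interleaver at the receiver."""
    width = len(frames) // depth
    out = [None] * len(frames)
    i = 0
    for col in range(width):
        for row in range(depth):
            out[row * width + col] = frames[i]
            i += 1
    return out

frames = list(range(12))
sent = interleave(frames, depth=3)
# A burst erasing the first 3 transmitted frames hits originals 0, 4 and 8,
# which end up 4 frames apart after deinterleaving, so each gap can be
# concealed from its intact neighbours.
restored = deinterleave(sent, depth=3)
```

The cost of this burst-spreading is the end-to-end delay of buffering `depth * width` frames before transmission, which is the trade-off the chapter weighs against FEC-based approaches.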

9. Error Concealment

In distributed and network speech recognition the actual recognition task is not carried out on the user’s terminal but rather on a remote server in the network. While there are good reasons for doing so, a disadvantage of this client-server architecture is clearly that the communication medium may introduce errors, which then impair speech recognition accuracy. Even sophisticated channel coding cannot completely prevent the occurrence of residual bit errors under temporarily adverse channel conditions, and in packet-oriented transmission, packets of data may arrive too late for the given real-time constraints and have to be declared lost. The goal of error concealment is to reduce the detrimental effect that such errors may have on the recipient of the transmitted speech signal by exploiting residual redundancy in the bit stream at the source coder output. In classical speech transmission a human is the recipient, and erroneous data are reconstructed so as to reduce the subjectively annoying effect of corrupted bits or lost packets. Here, however, a statistical classifier is at the receiving end, which can benefit from knowledge about the quality of the reconstruction. In this chapter we show how the classical Bayesian decision rule needs to be modified to account for uncertain features, and illustrate how the required feature posterior density can be estimated in the case of distributed speech recognition. Some other error concealment techniques can be related to this approach. Experimental results are given for both a small and a medium vocabulary recognition task, and for both a channel exhibiting bit errors and a packet erasure channel.
Reinhold Haeb-Umbach, Valentin Ion
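The modified Bayesian decision rule mentioned above can be illustrated in the simplest possible setting: scalar Gaussian class models and a Gaussian feature posterior delivered by the concealment stage. Under these assumptions the marginalization over the clean feature has a closed form, and the observation uncertainty simply adds to each class variance. The class parameters and observation values below are made-up illustrations, not taken from the chapter:

```python
import math

def gaussian_loglik(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def classify_uncertain(obs_mean, obs_var, classes):
    """Bayes decision with an uncertain feature.

    `classes` maps a label to (prior, mean, variance) of a scalar
    Gaussian model; `obs_mean`/`obs_var` parameterize the Gaussian
    feature posterior from the concealment stage. Marginalizing over
    the clean feature adds obs_var to each class variance before
    scoring, so a very uncertain feature influences the decision less.
    """
    scores = {label: math.log(prior) + gaussian_loglik(obs_mean, mu, var + obs_var)
              for label, (prior, mu, var) in classes.items()}
    return max(scores, key=scores.get)

classes = {"A": (0.5, 0.0, 1.0), "B": (0.5, 4.0, 1.0)}
reliable = classify_uncertain(0.5, 0.01, classes)   # feature well received
```

With `obs_var` near zero this reduces to the classical decision rule; as `obs_var` grows, the likelihood terms flatten and the decision is increasingly driven by the priors, which is the intended behaviour for heavily corrupted frames.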

Embedded Speech Recognition

10. Algorithm Optimizations: Low Computational Complexity

Advances in ASR are driven both by scientific achievements in the field and by the availability of more powerful hardware. While very powerful CPUs allow us to use ever more complex algorithms in server-based large vocabulary ASR systems (e.g. in telephony applications), the capability of embedded platforms will always lag behind. Nevertheless, as the popularity of ASR applications grows, we can expect an increasing demand for functionality on embedded platforms as well. For example, replacing simple command-and-control grammar-based applications with natural language understanding (NLU) systems leads to increased vocabulary sizes and thus the need for greater CPU performance. In this chapter we present an overview of ASR decoder design options, with an emphasis on techniques suitable for embedded platforms. One needs to keep in mind that there is no one-size-fits-all solution; specific algorithmic improvements may be best applied only to highly restricted applications or scenarios. The optimal solution is usually achieved by choosing algorithms that maximize specific benefits for a particular platform and task.
Miroslav Novak

11. Algorithm Optimizations: Low Memory Footprint

For speech recognition algorithms targeting mobile devices, the memory footprint is a critical parameter. Although memory consumption can be both static (long-term) and dynamic (run-time), in this chapter we focus mainly on the long-term memory requirements and, more specifically, on techniques for acoustic model compression. Like all compression methods, acoustic model compression exploits redundancies within the data as well as the limits of parameter representation accuracy. Considering data redundancies specific to hidden Markov models (HMMs), parameter tying and state or density clustering algorithms are presented, with cases such as semicontinuous HMMs (SCHMMs) and subspace distribution clustered HMMs (SDCHMMs). Regarding parameter representation, a simple scalar quantized representation is shown for the case of quantized HMMs (qHMMs). The effects on computational complexity are also reviewed for all the compression methods presented.
Marcel Vasilache

12. Fixed-Point Arithmetic

There are two main requirements for embedded/mobile systems: one is low power consumption, for long battery life and miniaturization; the other is low unit cost, for components produced in very large numbers (cell phones, set-top boxes). Both requirements are addressed by CPUs with integer-only arithmetic units, which motivates the fixed-point implementation of automatic speech recognition (ASR) algorithms. Large vocabulary continuous speech recognition (LVCSR) can greatly enhance the usability of devices whose small size and typical on-the-go use hinder more traditional interfaces. The increasing computational power of embedded CPUs will soon allow real-time LVCSR on portable and low-cost devices. This chapter reviews problems concerning the fixed-point implementation of ASR algorithms and presents fixed-point methods yielding the same recognition accuracy as their floating-point counterparts. In particular, the chapter illustrates a practical approach to the implementation of the frame-synchronous beam-search Viterbi decoder, n-gram language models, HMM likelihood computation and the mel-cepstrum front-end. The fixed-point recognizer is shown to be as accurate as the floating-point recognizer in several LVCSR experiments, on the DARPA Switchboard task and on an AT&T proprietary task, using different types of acoustic front-ends, HMMs and language models. Experiments on the DARPA Resource Management task, using the StrongARM-1100 206 MHz and XScale PXA270 624 MHz CPUs, show that the fixed-point implementation enables real-time performance: the floating-point recognizer, relying on software emulation of floating-point arithmetic, is several times slower at the same accuracy.
Enrico Bocchieri
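A common building block in such implementations is Q-format arithmetic, in which real values are stored as scaled integers so that all operations run on the integer unit. The Q15 format sketched below (15 fractional bits, with rounding after multiplication) illustrates the idea; it is a generic sketch, not the specific scaling scheme used in the chapter's decoder:

```python
Q = 15                 # Q15: 1 sign bit, 15 fractional bits
SCALE = 1 << Q

def to_q15(x):
    """Convert a float in roughly [-1, 1) to a Q15 integer."""
    return int(round(x * SCALE))

def q15_mul(a, b):
    """Multiply two Q15 values: full-width product, round, shift back."""
    return (a * b + (1 << (Q - 1))) >> Q

def from_q15(a):
    """Convert a Q15 integer back to a float (for inspection only)."""
    return a / SCALE

a, b = to_q15(0.5), to_q15(0.25)
product = from_q15(q15_mul(a, b))   # 0.5 * 0.25 = 0.125
```

The engineering work in a real fixed-point recognizer lies in choosing a Q format per quantity (log-likelihoods, feature values, language model scores) so that intermediate results neither overflow nor lose too much precision.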

Systems and Applications

13. Software Architectures for Networked Mobile Speech Applications

We examine architectures for mobile speech applications. These use speech engines for synthesizing audio output and for recognizing audio input; a key architectural decision is whether to embed these speech engines on the mobile device or to locate them in the network. While both approaches have advantages, our focus here is on networked speech application architectures. Because user experience with speech is greatly improved when the speech modality is coupled with a visual modality, mobile speech applications will increasingly tend to be multimodal, so speech architectures therefore must support multimodal user interaction. Good architectures must reflect commercial reality and be economical, efficient, robust, reliable, and scalable. They must leverage existing commercial ecosystems if possible, and we contend that speech and multimodal applications must build on both the web model of application development and deployment, and the large ecosystem that has grown up around the W3C’s web speech standards.
James C. Ferrans, Jonathan Engelsma

14. Speech Recognition in Mobile Phones

Speech input implemented in a voice user interface (voice UI) plays an important role in enhancing the usability of small portable devices, such as mobile phones. In these devices, more traditional means of interaction (e.g. keyboard and display) are limited by small size, battery life and cost. Speech is considered a natural way of interaction for man-machine interfaces. After decades of research and development, voice UIs are becoming widely deployed and accepted in commercial applications. It is expected that the global proliferation of embedded devices will further strengthen this trend in the coming years. A core technology enabler of voice UIs is automatic speech recognition (ASR). Example applications in mobile phones relying on embedded ASR are name dialling, phone book search, command-and-control and, more recently, large vocabulary dictation. In the mobile context, several technological challenges have to be overcome, concerning ambient noise in the environment, the constraints of available hardware platforms, cost limitations and the necessity of wide language coverage. In addition, mobile ASR systems need to achieve a virtually perfect performance level for user acceptance. This chapter reviews the application of embedded ASR in mobile phones, and describes specific issues related to language development, noise robustness, and embedded implementation and platforms. Several practical solutions are presented throughout the chapter, with supporting experimental results.
Imre Varga, Imre Kiss

15. Handheld Speech to Speech Translation System

Recent advances in the processing capabilities of handheld devices (PDAs or mobile phones) have made it possible to run speech recognition systems, and even end-to-end speech translation systems, on these devices. However, two-way free-form speech-to-speech translation (as opposed to fixed-phrase translation) is a highly complex task, and a large amount of computation is involved in achieving reliable translation performance. Resource limitations concern not just CPU speed but also the memory and storage requirements and the audio input and output requirements, all of which tax current systems to their limits. When the resource demand exceeds the computational capability of available state-of-the-art handheld devices, a common technique for mobile speech-to-speech translation systems is to use a client-server approach, in which the handheld device (a mobile phone or PDA) is treated simply as a system client. While we briefly describe the client/server approach, we mainly focus on the approach in which the end-to-end speech-to-speech translation system is hosted entirely on the handheld device. We describe the challenges, and the algorithm and code optimization solutions we developed, for the handheld MASTOR (Multilingual Automatic Speech-to-Speech Translator) systems translating between English and Mandarin Chinese, and between English and Arabic, on embedded Linux and Windows CE operating systems. The system includes an HMM-based large vocabulary continuous speech recognizer using statistical n-grams, a translation module, and a multi-language speech synthesis system.
Yuqing Gao, Bowen Zhou, Weizhong Zhu, Wei Zhang

16. Automotive Speech Recognition

In the coming years, speech recognition will be a commodity feature in cars. Communication systems integrated in the car infotainment system, including telephony, audio devices and destination input for navigation, can be controlled by voice. For speech recognition technology, the biggest challenge is the recognition of large vocabularies in noisy environments on cost-sensitive hardware platforms. Furthermore, intuitive dialogue design coupled with natural-sounding text-to-speech systems has to be provided to achieve smooth man-machine interaction. This chapter describes commercially driven activities to develop and produce speech technology components for various automotive applications, covering the speech recognition, speaker characterization, speech synthesis and dialogue technology used, the platforms employed, and a methodology for the evaluation of recognition performance.
Harald Höge, Sascha Hohenner, Bernhard Kämmerer, Niels Kunstmann, Stefanie Schachtl, Martin Schönle, Panji Setiawan

17. Energy Aware Speech Recognition for Mobile Devices

As portable electronic devices move to smaller form-factors with more features, one challenge is managing and optimizing battery lifetime. Unfortunately, battery technology has not kept up with the rapid pace of semiconductor and wireless technology improvements over the years. In this chapter, we present a study of speech recognition with respect to energy consumption. Our analysis considers distributed speech recognition on hardware platforms with PDA-like functionality. We investigate quality of service and energy trade-offs in this context. We present software optimizations on a speech recognition front-end that can reduce the energy consumption by over 80% compared to the original implementation. A power on/off scheduling algorithm for the wireless interface is presented. This scheduling of the wireless interface can increase the battery lifetime by an order of magnitude. We study the effects of wireless networking and fading channel characteristics on distributed speech recognition using Bluetooth and IEEE 802.11b networks. When viewed as a whole, the optimized distributed speech recognition system can reduce the total energy consumption by over 95% compared to a software client-side ASR implementation. Error concealment techniques can be used to provide further energy savings in low channel SNR conditions.
Brian Delaney
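The impact of on/off scheduling of the wireless interface can be seen from a simple average-power model. The battery capacity and power figures below are illustrative assumptions, not measurements from the chapter:

```python
def battery_lifetime_hours(capacity_mwh, p_active_mw, p_sleep_mw, duty_cycle):
    """Battery lifetime under on/off scheduling of the wireless interface.

    duty_cycle is the fraction of time the interface is powered on;
    the rest of the time it draws only the sleep power.
    """
    p_avg = duty_cycle * p_active_mw + (1.0 - duty_cycle) * p_sleep_mw
    return capacity_mwh / p_avg

# Illustrative numbers: 5 Wh battery, 1.4 W active radio, 50 mW sleep
always_on = battery_lifetime_hours(5000, 1400, 50, duty_cycle=1.0)
scheduled = battery_lifetime_hours(5000, 1400, 50, duty_cycle=0.05)
gain = scheduled / always_on
```

Even with these rough numbers the radio dominates the energy budget, so keeping it powered only while feature packets are actually being transmitted yields roughly an order-of-magnitude lifetime gain, consistent with the chapter's claim for the scheduling algorithm.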

Backmatter
