Speech Pattern Processing

The study of ‘speech’ is a fragmented, multi-disciplinary area of science which sits somewhere between acoustics, linguistics, engineering and psychology. The one unifying force which links all of the practitioners in the field is the study and exploitation of speech patterning. The ability to process speech patterns is thus central to the capabilities of all forms of speech communication, whether performed by a human or by a machine. To create a truly unified ‘theory of speech pattern processing’ it is necessary to focus on the positive contributions that can be made by both speech science and speech technology. Great strides have already been made using a paradigm based on stochastic modelling, and the prospects for further significant developments are good, provided that communication between all sectors of the R&D community is suitably encouraged.

Roger K. Moore

Psycho-acoustics and Speech Perception

Computational models of speech pattern processing could benefit greatly from what is known about human sound and speech perception. Psycho-acoustics has given us insight into the limits and the capabilities of peripheral hearing for, mainly, simple stationary sounds. Threshold phenomena and temporal and spectral resolution for such stimuli are a first indication of how the front end of a recognizer should be modeled, and what level of precision is required in rule synthesis. Much less is known about the ear’s sensitivity to dynamic events in complex signals, such as formant-like transitions. Once the signal becomes a syllable or a meaningful word or sentence, our ear’s behavior and our brain’s interpretations become even more complex. A good example is our perception of stressed and unstressed syllables, including schwas. I will claim that vowel reduction manifests itself as contextual assimilation, rather than as a form of centralization, which again has implications for our phone and word models in ASR and for our coarticulation rules in a synthesizer.

Louis C. W. Pols

Acoustic Modelling for Large Vocabulary Continuous Speech Recognition

This chapter describes acoustic modelling in modern HMM-based LVCSR systems. The presentation emphasises the need to carefully balance model complexity with available training data, and the methods of state-tying and mixture-splitting are described as examples of how this can be done. Iterative parameter re-estimation using the forward-backward algorithm is then reviewed and the importance of the component occupation probabilities is emphasised. Using this as a basis, two powerful methods are presented for dealing with the inevitable mis-match between training and test data. Firstly, MLLR adaptation allows a set of HMM parameter transforms to be robustly estimated using small amounts of adaptation data. Secondly, MMI training based on lattices can be used to increase the inherent discrimination of the HMMs.

Steve Young
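
The iterative parameter re-estimation discussed in this abstract rests on the forward-backward algorithm and the state occupation probabilities it yields. As a minimal illustrative sketch (toy parameters, not taken from the chapter), the occupation probabilities for a discrete-observation HMM can be computed as:

```python
# Forward-backward for a discrete-observation HMM, returning the per-frame
# state occupation probabilities gamma_t(j) that drive re-estimation.
# All model parameters below are hypothetical, chosen only for illustration.
import numpy as np

def forward_backward(pi, A, B, obs):
    """Return per-frame state occupation probabilities gamma[t, j]."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # forward initialisation
    for t in range(1, T):                            # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0                                # backward initialisation
    for t in range(T - 2, -1, -1):                   # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)  # normalise per frame

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3], [0.0, 1.0]])               # left-to-right topology
B = np.array([[0.9, 0.1], [0.2, 0.8]])               # toy emission table
gamma = forward_backward(pi, A, B, [0, 0, 1])
print(gamma)
```

In a real trainer these occupation probabilities would weight the accumulation of sufficient statistics for the Gaussian mixture parameters; here the normalisation per frame makes each row of gamma a proper distribution over states.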

Tree-based Dependence Models for Speech Recognition

The independence assumptions typically used to make speech recognition practical ignore the fact that different sounds in speech are highly correlated. Tree-structured dependence models make it possible to represent cross-class acoustic dependence in recognition when used in conjunction with hidden Markov or other such models. These models have Markov-like assumptions on the branches of a tree, which lead to efficient recursive algorithms for state estimation. This paper will describe general approaches to topology design and parameter estimation of tree-based models and outline more specific solutions for two examples: discrete-state hidden dependence trees and continuous-state multiscale models, drawing analogies to results for time series models. Initial results for both cases will be described, followed by a discussion of questions raised by the experiments.

Mari Ostendorf, Ashvin Kannan, Orith Ronen

Connectionist and Hybrid Models for Automatic Speech Recognition

Automatic speech recognition (ASR) has now reached the point where practical applications can be envisaged. However, the models that are presently used still have to be enhanced, especially to improve the robustness of recognition in real conditions. Most present systems are based on stochastic models, especially hidden Markov models (HMMs). In the past few years, a large number of projects have been directed toward the development of a new class of models: connectionist artificial neural networks (ANNs). The present chapter proposes an overview of ANNs and their various uses in the field of speech signal analysis, ASR, and speaker verification. After a brief review of the basic principles of these models and of their properties, the different models used in ASR are presented and compared: multi-layer perceptrons, self-organizing maps, recurrent ANNs, etc. The modifications made to these models in order to take into account the dynamic aspects of speech are then discussed. Various types of hybrid models are finally presented. Such models combine, in different ways, connectionist models with other models, especially stochastic ones.

Jean-Paul Haton

Computational Models for Auditory Speech Processing

Auditory processing of speech is an important stage in the closed-loop human speech communication system. A computational auditory model for temporal processing of speech is described with details of numerical solution and of the temporal information extraction method given. The model is used to process fluent speech utterances and is applied to phonetic classification using both clean and noisy speech materials. The need for integrating auditory speech processing and phonetic modeling components in machine speech recognizer design is discussed within a proposed computational framework of speech recognition motivated by the closed-loop speech chain model for integrated human speech production and perception behaviors.

Li Deng

Speaker Adaptation of CDHMMs Using Bayesian Learning

We investigate the Bayesian learning approach (also known as Maximum A Posteriori, or MAP, estimation) to the speaker adaptation of Continuous Density Hidden Markov Models (CDHMMs). The parameters of the Gaussian mixture output densities are adapted using an exponential forgetting mechanism, with the a priori parameter estimation performed in a model-based framework. Moreover, channel adaptation is carried out by means of the cepstral mean normalization (CMN) method.

Claudio Vair, Luciano Fissore
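
The channel adaptation step mentioned in this abstract is simple to illustrate: a stationary convolutional channel is additive in the cepstral domain, so subtracting the per-utterance cepstral mean removes it. A minimal sketch of CMN, with made-up frame values:

```python
# Cepstral mean normalization (CMN): subtract the utterance-level mean
# from every cepstral frame.  The frame values here are invented purely
# to demonstrate the operation.
import numpy as np

def cmn(cepstra):
    """Subtract the per-utterance mean from each cepstral frame (T x D)."""
    return cepstra - cepstra.mean(axis=0)

frames = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])          # toy T x D cepstral matrix
normalized = cmn(frames)
print(normalized.mean(axis=0))           # each coefficient now has zero mean
```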

Discriminative Improvement of the Representation Space for Continuous Speech Recognition

Signal representation is a very important issue in the design of speech recognizers. An appropriate representation of the speech signal improves recognizer performance. Recently, the Discriminative Feature Extraction (DFE) method has been applied to estimating transformations of the representation space for speech recognizers. In this work, a variant of the DFE method is applied in order to improve the representation space for continuous speech recognition.

Ángel de la Torre, Antonio M. Peinado, Antonio J. Rubio, José C. Segura

Dealing with Loss of Synchronism in Multi-Band Continuous Speech Recognition Systems

In multi-band systems, the signal is decomposed into several frequency bands, which are processed separately. Then, the recombination part must compute a unique sentence from all these different solutions. The task is quite easy in isolated word recognition, since each word ends at the same time in every band, but it becomes more difficult in continuous speech recognition, where each band has a different segmentation. The problem here is to decide when the recombination should be done. Two major solutions have been tested: the first introduces synchronism between the bands, and recombination is done when all the bands are synchronous. The second leaves the sub-recognizers totally independent and tries to extract from their solutions a phonetic structure which will allow us to perform the recombination. We will briefly present an example of the first solution, then focus on the algorithm we have developed for the second one.

Christophe Cerisara

K-Nearest Neighbours Estimator in a HMM-Based Recognition System

For many years, the K-Nearest Neighbours method (K-NN) has been known as one of the best probability density function (pdf) estimators [2]. The development of fast K-NN algorithms makes it possible to reconsider its use in applications with large sample sets. With this in mind, the K-NN decision principle has been assessed on frame-by-frame phonetic identification on the TIMIT database. Thereafter, a method to integrate the K-NN pdf estimator into an HMM-based system is proposed and tested on an acoustic-phonetic decoding task.

Fabrice Lefèvre, Claude Montacié, Marie-José Caraty
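
The K-NN pdf estimator named in this abstract approximates the density at a point x as k divided by N times the volume of the smallest ball around x that contains its k nearest samples. A minimal one-dimensional sketch on synthetic data (not the chapter's TIMIT setup):

```python
# K-NN probability density estimate: p(x) ~ k / (N * V_k(x)), where
# V_k(x) is the volume of the smallest ball around x holding k samples.
# The sample set below is synthetic, drawn from a standard normal.
import numpy as np

def knn_density(x, samples, k):
    """1-D K-NN density estimate at point x."""
    dists = np.sort(np.abs(samples - x))
    radius = dists[k - 1]            # distance to the k-th nearest neighbour
    volume = 2.0 * radius            # interval length in one dimension
    return k / (len(samples) * volume)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 1000)
est = knn_density(0.0, samples, k=50)
print(est)  # should be roughly the N(0,1) density at 0, i.e. near 0.4
```

Unlike a parametric fit, the estimate adapts its resolution to the local sample density, which is the property that makes it attractive as an HMM output density, at the cost of searching the sample set at every frame.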

Robust Speech Recognition

This paper overviews the main technologies that have recently been developed for making speech recognition systems more robust against acoustic variations. These technologies are reviewed from the viewpoint of a stochastic pattern matching paradigm for speech recognition. Improved robustness enables better speech recognition over a wide range of unexpected and adverse conditions by reducing mismatches between training and testing speech utterances.

Sadaoki Furui

Channel Adaptation

Any mismatch between training and test conditions can cause difficulty for current automatic speech recognition systems. In recent years many approaches have been proposed for resolving this mismatch problem. These approaches can be divided broadly into three classes: model adaptation, channel adaptation and robust features. This paper presents a review and discussion of methods for channel adaptation and their relationship to methods in the other classes.

Keith M. Ponting

Speaker Characterization, Speaker Adaptation and Voice Conversion

This paper discusses recent advances in and perspectives of research on speaker-dependent-feature extraction from speech waves, automatic speaker identification and verification, speaker adaptation in speech recognition, and voice conversion. Both supervised and unsupervised speaker adaptation algorithms for speech recognition have recently been actively investigated, and remarkable progress has been achieved in this field. Improving synthesized speech quality by adding natural characteristics of voice individuality, and converting synthesized voice individuality from one speaker to another, are still little-explored research fields to be studied in the near future.

Sadaoki Furui

Speaker Recognition

This paper introduces recent advances in speaker recognition technology. The first part discusses general topics and issues. The second part is devoted to a discussion of more specific topics of recent interest that have led to interesting new approaches and techniques. They include VQ- and ergodic-HMM-based text-independent recognition methods, a text-prompted recognition method, parameter/distance normalization and model adaptation techniques, and methods of updating models and a priori thresholds in speaker verification. The paper concludes with 16 open questions about speaker recognition and a short discussion assessing the current status and future possibilities.

Sadaoki Furui

Application of Acoustic Discriminative Training in an Ergodic HMM for Speaker Identification

We present a novel architecture for a speaker recognition system over the telephone. The proposed system introduces acoustic information into an HMM-based recognizer. This is achieved by using a phonetic classifier during the training phase. Three broad phonetic classes are defined: voiced frames, unvoiced frames and transitions. We design speaker templates by combining four single-state HMMs into a four-state HMM after re-estimation of the transition probabilities. Experiments conducted with two databases are reported, and the results show that this architecture performs better than others without phonetic classification.

Leandro Rodríguez Liñares, Carmen García Mateo

Comparison of Several Compensation Techniques for Robust Speaker Verification

It is well known that the performance of speaker recognition systems degrades rapidly as the mismatch between the training and test conditions increases. For example, in real-world telephone-based speaker recognition systems, both additive and convolutional noise influence the error rate considerably. In this paper, different techniques which make a speaker verification system more robust against noise are described and compared. Some of these techniques have already been successfully applied in robust speech recognition, and our preliminary results show that they are also very encouraging for robust speaker verification.

Laura Docío-Fernández, Carmen García-Mateo

Segmental Acoustic Modeling for Speech Recognition

In recent years, several alternative acoustic models have been proposed that attempt to represent trends or correlation of observations over time. These models, which can be broadly classified as segment models, are surveyed in this chapter and presented in a general probabilistic framework that includes the hidden Markov model (HMM) as a special case. The overview gives options for modeling assumptions in terms of correlation structure and parameter tying and outlines the extensions to HMM recognition and training algorithms needed to handle segment models.

Mari Ostendorf

Trajectory Representations and Acoustic Descriptions for a Segment-Modelling Approach to Automatic Speech Recognition

This paper discusses some of the possibilities for modelling speech segment trajectories in a domain which is more directly correlated with the mechanisms of speech production than the typical mel-cepstrum representation. Initial developments are described towards using linear dynamic segmental HMMs [12] to model underlying (unobserved) trajectories of features which closely reflect the nature of articulation. So far, this work has involved calculating segment probabilities using an approach which is different from that used in earlier studies (e.g. [4]), and is more consistent with the idea of treating the trajectory as unobserved. In parallel, experiments have demonstrated that formant features can be useful for HMM-based automatic speech recognition [3].

Wendy J. Holmes

Suprasegmental Modelling

We show how prosody can be used in speech understanding systems. This is demonstrated with the VERBMOBIL speech-to-speech translation system, the first complete system worldwide to successfully use prosodic information in the linguistic analysis. Prosody is used by computing probabilities for clause boundaries, accentuation, and different types of sentence mood for each of the word hypotheses computed by the word recognizer. These probabilities guide the search of the linguistic analysis. Disambiguation is achieved during the analysis itself rather than by prosodic verification of different linguistic hypotheses. So far, the most useful prosodic information is provided by clause boundaries, which are detected with a recognition rate of 94%. For the parsing of word hypothesis graphs, the use of clause boundary probabilities yields a speed-up of 92% and a 96% reduction in alternative readings.

E. Nöth, A. Batliner, A. Kießling, R. Kompe, H. Niemann

Computational Models for Speech Production

Major speech production models from speech science literature and a number of popular statistical “generative” models of speech used in speech technology are surveyed. Strengths and weaknesses of these two styles of speech models are analyzed, pointing to the need to integrate the respective strengths while eliminating the respective weaknesses. As an example, a statistical task-dynamic model of speech production is described, motivated by the original deterministic version of the model and targeted for integrated-multilingual speech recognition applications. Methods for model parameter learning (training) and for likelihood computation (recognition) are described based on statistical optimization principles integrated in neural network and dynamic system theories.

Li Deng

Articulatory Features and Associated Production Models in Statistical Speech Recognition

A statistical approach to speech recognition is outlined which draws a close parallel with closed-loop human speech communication, schematized as a joint process of encoding and decoding of linguistic messages. The encoder consists of the symbolically-valued overlapping articulatory feature model and of its interface to a nonlinear task-dynamic model of speech production. A general speech recognizer architecture based on an optimal decoding strategy incorporating encoder-decoder interactions is described and discussed.

Li Deng

Talker Normalization with Articulatory Analysis-by-Synthesis

Internal articulatory models are used in analysis-by-synthesis to recover the movement of the speech articulators from speech acoustics. The kind of articulatory information that is recovered depends on the application and the available data. In the laboratory some articulatory data may be available along with acoustic data, while in automatic speech recognition only acoustic data is available. While there is more data available in the former than in the latter case, the amount of information sought in recovery is different in the two cases. In the laboratory physically realistic articulatory trajectories are sought, while recovery in automatic speech recognition may simply require transforming the acoustic signal to an abstract articulatory representation employed by statistical models for subsequent categorization. Both applications require that the internal articulatory models be normalized for each talker, either for realistic recovery or for robust statistical behavior. A method for constructing mappings between the human talker and the internal model, while simultaneously adjusting the internal model for acoustic matching, is presented. The method is tested on x-ray microbeam data taken from human subjects.

Richard S. McGowan

The Psycholinguistics of Spoken Word Recognition

The process of mapping acoustic-phonetic level input to a lexical representation is multi-faceted. Models of spoken word recognition provide a variety of processing architectures and make different assumptions regarding the unit(s) of representation used in the exchange of information from signal to word and the nature of information flow through the system. The current models provide a backdrop for a discussion of some of the advances and debates in the field. Some of the issues considered are: early versus delayed commitment to a lexical hypothesis, consequences of multiple activation, segmentation and lexical access, the processing and representation of phonological variants, and the role of attention in spoken word recognition.

Cynthia M. Connine, Thomas Deelman

Issues in Using Models for Self Evaluation and Correction of Speech

The design of computer-based systems for training purposes requires taking into account the different worlds in which the actors operate. In this paper, we deal with speech correction involving a therapist (the “orthophonist”), the trainee, and a technical aid performing extraction and visual display of speech features. We want to point out some common-sense issues that are essential for designing such computer-based systems: the matching between those different worlds; the choice of reference patterns to which the subject’s utterances will be compared; the orthophonic check and the definition of a norm which could be considered as a target to be reached by the subject; the matching of the norm with the reference patterns; the matching between the vocal utterances and their technical equivalent; and the establishment, and then the management, of a speech education program adapted to the trainee.

Marie-Christine Haton

The Use of the Maximum Likelihood Criterion in Language Modelling

This paper gives an overview of the use of the maximum likelihood criterion in stochastic language modelling. This criterion and its associated estimation techniques provide a unifying framework for various approaches that seem unrelated and very different at first glance, such as smoothing and cross-validation, decision trees (CART), word classes obtained by clustering, word trigger pairs, and maximum entropy models.

Hermann Ney
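
For an n-gram model, the maximum likelihood criterion discussed above reduces to relative-frequency estimation, which a toy bigram example makes concrete (and which also shows why the smoothing mentioned in the abstract is needed: unseen bigrams receive probability zero):

```python
# Maximum likelihood bigram estimation: P(w | h) = count(h, w) / count(h).
# The corpus is a toy example invented for illustration.
from collections import Counter

def ml_bigram(corpus):
    """Maximum likelihood bigram probabilities from a token list."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    histories = Counter(corpus[:-1])     # each token acts as a history once
    return {(h, w): c / histories[h] for (h, w), c in bigrams.items()}

corpus = "the cat sat on the mat".split()
p = ml_bigram(corpus)
print(p[("the", "cat")])  # count(the, cat) / count(the) = 1/2 = 0.5
```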

Language Model Adaptation

This paper reviews methods for language model adaptation. Paradigms and basic methods are first introduced. Basic theory is presented for maximum a-posteriori estimation, mixture based adaptation, and minimum discrimination information. Models to cope with long distance dependencies are also introduced. Applications and results from the recent literature are finally surveyed.

Renato DeMori, Marcello Federico

Using Natural-Language Knowledge Sources in Speech Recognition

High-accuracy speech recognition requires a language model to specify what word sequences are possible, or at least likely. Standard n-gram language models for speech recognition ignore linguistic structure, but more linguistically sophisticated language models are possible. Unification grammars are widely used in natural language processing, and these can be compiled into non-left-recursive context-free grammars that can then be used in real-time speech recognizers by dynamically expanding them into state-transition networks. A hybrid language model incorporating both a unification grammar and n-gram statistics has been shown to increase speech recognition accuracy. Probabilistic context-free grammars and probabilistic unification grammars are also possible.

Robert C. Moore

How May I Help You?

We are interested in providing automated services via natural spoken dialog systems. By natural, we mean that the machine understands and acts upon what people actually say, in contrast to what one would like them to say. There are many issues that arise when such systems are targeted for large populations of non-expert users. In this paper, we focus on the task of automatically routing telephone calls based on a user’s fluently spoken response to the open-ended prompt of “How may I help you?”. We first describe a database generated from 10,000 spoken transactions between customers and human agents. We then describe methods for automatically acquiring language models for both recognition and understanding from such data. Experimental results evaluating call-classification from speech are reported for that database. These methods have been embedded within a spoken dialog system, with subsequent processing for information retrieval and form-filling.

A. L. Gorin, G. Riccardi, J. H. Wright

Introduction of Rules into a Stochastic Approach for Language Modelling

Automatic morpho-syntactic tagging is an area where statistical approaches have been more successful than rule-based methods. Nevertheless, available statistical systems appear unable to capture long-span dependencies and to model infrequent structures. In fact, part of the weakness of statistical techniques may be compensated for by rule-based methods. Furthermore, applying rules during the probabilistic process inhibits error propagation; such an improvement could not be obtained by post-processing analysis. In order to take advantage of the complementary features of the two approaches, a hybrid approach has been followed in the design of an improved tagger called ECSta. In ECSta, as shown in this paper, a stack-decoding algorithm is combined with the classical Viterbi algorithm.

Thierry Spriet, Marc El-Bèze

History Integration into Semantic Classification

In spoken language systems, the classification of coherent linguistic/semantic phrases in terms of semantic classes is an important part of the whole understanding process. Basically, it relies on the plain text of the segment to be classified. Nevertheless, another important source of useful information is the dialogue context. In this paper, a number of different ways to integrate the dialogue history into the semantic classification are presented and tested on a corpus of person-to-person dialogues. The best result gives a 3.6% reduction of the error rate with respect to the performance obtained without using history.

Mauro Cettolo, Anna Corazza

Multilingual Speech Recognition

We present two concepts for systems with language identification in the context of multilingual information retrieval dialogs. The first has an explicit module for language identification. It is based on training a common codebook for all the languages and integrating over the output probabilities of language-specific n-gram models trained on the codebook sequences. The system can decide for one language either after a predefined time interval or once the difference between the probabilities of the languages exceeds a certain threshold. This approach makes it possible to recognize languages that the system cannot process and to play a prerecorded message in that language. In the second approach, the trained recognizers of the languages to be recognized, the lexicons, and the language models are combined into one multilingual recognizer. Since transitions are allowed only between words of the same language, each hypothesized word chain contains words from just one language, and language identification is an implicit by-product of the speech recognizer. First results for both language identification approaches are presented.

E. Nöth, S. Harbeck, H. Niemann
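
The explicit identification scheme in this abstract can be sketched in miniature: accumulate, symbol by symbol, the log-probability of the shared codebook sequence under each language-specific bigram model, and decide once the gap between the best and second-best language exceeds a threshold. The models, symbols, and threshold below are toy stand-ins, not the system's actual parameters:

```python
# Early-decision language identification over a shared codebook sequence.
# Each language is a bigram model over codebook symbols; unseen bigrams
# get a small floor probability.  All values are illustrative only.
import math

def identify(sequence, models, threshold=2.0):
    """Return the winning language once its log-prob lead exceeds threshold."""
    scores = {lang: 0.0 for lang in models}
    prev = None
    for sym in sequence:
        for lang, bigram in models.items():
            scores[lang] += math.log(bigram.get((prev, sym), 1e-3))
        prev = sym
        best, second = sorted(scores.values(), reverse=True)[:2]
        if best - second > threshold:          # confident enough: decide early
            return max(scores, key=scores.get)
    return max(scores, key=scores.get)         # otherwise decide at the end

models = {
    "de": {(None, "a"): 0.5, ("a", "b"): 0.8, ("b", "a"): 0.8},
    "en": {(None, "a"): 0.5, ("a", "b"): 0.1, ("b", "a"): 0.1},
}
result = identify(["a", "b", "a", "b", "a"], models)
print(result)  # -> de
```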

Toward ALISP: A proposal for Automatic Language Independent Speech Processing

The models used in current automatic speech recognition (or synthesis) systems generally rely on a representation based on phonetic symbols. The phonetic transcription of a word can be seen as an intermediate representation between the acoustic and the linguistic levels, but the a priori choice of phonemes (or phone-like units) can be questioned as probably non-optimal. Moreover, the phonetic representation has the drawback of being strongly language-dependent, which partly prevents reusability of acoustic resources across languages. In this article, we expose and develop the concept of ALISP (Automatic Language Independent Speech Processing), a general methodology which consists in inferring the intermediate representation between the acoustic and the linguistic levels from speech and linguistic data rather than from a priori knowledge, with as little supervision as possible. We discuss the benefits that can be expected from developing the ALISP approach, together with the key issues to be solved. We also present preliminary experiments that can be viewed as first steps towards the ALISP goal.

Gérard Chollet, Jan Černocký, Andrei Constantinescu, Sabine Deligne, Frédéric Bimbot

Interactive Translation of Conversational Speech

We present JANUS-II, a large scale system effort aimed at interactive spoken language translation. JANUS-II now accepts spontaneous conversational speech in a limited domain in English, German or Spanish and produces output in German, English, Spanish, Japanese and Korean. The challenges of coarticulated, disfluent, ill-formed speech are manifold, and have required advances in acoustic modeling, dictionary learning, language modeling, semantic parsing and generation, to achieve acceptable performance. A semantic “inter-lingua” that represents the intended meaning of an input sentence, facilitates the generation of culturally and contextually appropriate translation in the presence of irrelevant or erroneous information. Application of statistical, contextual, prosodic and discourse constraints permits a progressively narrowing search for the most plausible interpretation of an utterance. During translation, JANUS-II produces paraphrases that are used for interactive correction of translation errors. Beyond our continuing efforts to improve robustness and accuracy, we have also begun to study possible forms of deployment. Several system prototypes have been implemented to explore translation needs in different settings: speech translation in one-on-one video conferencing, as portable mobile interpreter, or as passive simultaneous conversation translator. We will discuss their usability and performance.

Alex Waibel

Multimodal Speech Systems

This chapter describes the various knowledge sources required to handle human-machine multimodal interaction efficiently: they constitute the task, user, dialogue, environment and system models. The first part of the chapter discusses the content of these models, emphasising the problems occurring when speech is combined with other modalities. The second part focuses on spoken language characteristics, describes different parsing methods (rule-based and stochastic) using a task model, and briefly presents the integration of the rule-based method in an end-to-end information retrieval system.

Françoise D. Néel, Wolfgang M. Minker

Multimodal Interfaces for Multimedia Information Agents

When humans communicate they take advantage of a rich spectrum of cues. Some are verbal and acoustic. Some are non-verbal and non-acoustic. Signal processing technology has devoted much attention to the recognition of speech, as a single human communication signal. Most other complementary communication cues, however, remain unexplored and unused in human-computer interaction. In this paper we show that the addition of non-acoustic or non-verbal cues can significantly enhance robustness, flexibility, naturalness and performance of human-computer interaction. We demonstrate computer agents that use speech, gesture, handwriting, pointing, spelling jointly for more robust, natural and flexible human-computer interaction in the various tasks of an information worker: information creation, access, manipulation or dissemination.

Alex Waibel, Bernhard Suhm, Minh Tue Vo, Jie Yang
