
1992 | Book

Speech Recognition and Understanding

Recent Advances, Trends and Applications

Edited by: Pietro Laface, Renato De Mori

Publisher: Springer Berlin Heidelberg

Book series: NATO ASI Series


About this book

The book collects the contributions to the NATO Advanced Study Institute on "Speech Recognition and Understanding: Recent Advances, Trends and Applications", held in Cetraro, Italy, during the first two weeks of July 1990. This Institute focused on three topics considered of particular interest and rich in innovation by researchers in the fields of speech recognition and understanding: advances in hidden Markov modeling, connectionist approaches to speech and language modeling, and linguistic processing including language and dialogue modeling. The purpose of any ASI is to encourage scientific communication between researchers of NATO countries through advanced tutorials and presentations: excellent tutorials were offered by invited speakers, who present in this book 15 papers that summarize or detail the topics covered in their lectures. The lectures were complemented by discussions, panel sessions and the presentation of related work carried out by some of the attending researchers: these presentations have been collected in 42 short contributions to the Proceedings. This volume, which the reader may find useful as an overview, although an incomplete one, of the state of the art in speech understanding, is divided into 6 Parts.

Table of contents

Frontmatter

Recent Results on Hidden Markov Models

Frontmatter

Invited Papers

Hidden Markov Models for Speech Recognition — Strengths and Limitations

The use of hidden Markov models for speech recognition has become predominant over the last several years, as evidenced by the number of published papers and talks at major speech conferences. The reasons why this method has become so popular are the inherent statistical (mathematically precise) framework; the ease and availability of training algorithms for estimating the parameters of the models from finite training sets of speech data; the flexibility of the resulting recognition system, in which one can easily change the size, type, or architecture of the models to suit particular words, sounds, etc.; and the ease of implementation of the overall recognition system. However, although hidden Markov model technology has brought speech recognition system performance to new high levels for a variety of applications, there remain some fundamental areas where aspects of the theory are either inadequate for speech, or where the assumptions that are made do not apply. Examples of such areas range from the fundamental modeling assumption, i.e. that a maximum likelihood estimate of the model parameters provides the best system performance, to issues involved with inadequate training data, which leads to the concepts of parameter tying across states, deleted interpolation and other smoothing methods, etc. Other aspects of the basic hidden Markov modeling methodology which are still not well understood include: ways of integrating new features (e.g. prosodic versus spectral features) into the framework in a consistent and meaningful way; the way to properly model sound durations (both within a state and across states of a model); the way to properly use the information in state transitions; and finally the way in which models can be split or clustered as warranted by the training data. It is the purpose of this paper to examine each of these strengths and limitations and discuss how they affect the overall performance of a typical speech recognition system.

L. R. Rabiner, B. H. Juang
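
Because the abstract leans on the "mathematically precise" likelihood framework, a worked sketch may help: the quantity that maximum likelihood training maximizes is P(O|λ), computed by the standard forward recursion. The code below is a minimal, textbook-style rendering in log space, not code from the paper; the toy values are illustrative only.

```python
import numpy as np

def forward_log_likelihood(log_A, log_B, log_pi):
    """Forward recursion: log P(O | lambda) for a discrete-observation HMM.

    log_A  : (N, N) log transition matrix, log_A[i, j] = log P(state j | state i)
    log_B  : (T, N) log emission scores,   log_B[t, j] = log P(o_t | state j)
    log_pi : (N,)   log initial state distribution
    """
    T, N = log_B.shape
    alpha = log_pi + log_B[0]                         # initialization (t = 0)
    for t in range(1, T):
        # log-sum-exp over predecessor states i, for each current state j
        alpha = log_B[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)                 # termination: sum over final states

# Toy example: 2 states, 3 observation frames (numbers are made up).
log_A  = np.log([[0.7, 0.3], [0.4, 0.6]])
log_pi = np.log([0.6, 0.4])
log_B  = np.log([[0.9, 0.2], [0.5, 0.5], [0.1, 0.8]])
print(forward_log_likelihood(log_A, log_B, log_pi))
```
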
Hidden Markov Models and Speaker Adaptation

In this chapter we first review the use of Hidden Markov Models for Continuous Speech Recognition. Then we discuss techniques for adapting speech recognizers to the speech of new speakers. In the first section particular attention is paid to the need to incorporate basic speech knowledge within the model. We present the N-Best Paradigm, which provides a simple way of integrating speech recognition with natural language processing. We describe several algorithms for computing the N best sentence hypotheses, and show that one of them is both efficient and empirically accurate. We also introduce the Forward-Backward Search Algorithm, which reduces the computation needed for the N-Best search by an additional factor of 40. We review many different speaker adaptation techniques and distinguish among their salient features. We present a new, more practical method for training a speaker-independent system. In particular, we show that it is not necessary to have training speech from a very large number of speakers. Finally, we describe a method for adapting a speaker-independent model to a new speaker. This speaker adaptation method results in half the error rate of the unadapted speaker-independent system.

Richard Schwartz, Francis Kubala
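
As a rough illustration of the bookkeeping behind the N-Best Paradigm (a simplified view, not the exact sentence-dependent algorithm the chapter describes), an N-best decoder keeps, at each state and frame, the N best-scoring hypotheses with distinct word histories instead of the single Viterbi survivor:

```python
import heapq

def merge_nbest(incoming, n):
    """Merge hypotheses reaching one state: deduplicate by word history,
    keeping the best score per history, then retain the n best.

    incoming : iterable of (log_score, word_history) pairs,
               word_history being a tuple of words.
    """
    best = {}
    for score, history in incoming:
        if history not in best or score > best[history]:
            best[history] = score
    return heapq.nlargest(n, ((s, h) for h, s in best.items()))

# Two paths share the history ('show', 'me'); only the better one survives.
hyps = [(-12.3, ('show', 'me')), (-14.1, ('show', 'me')), (-13.0, ('show', 'my'))]
print(merge_nbest(hyps, 2))
```
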

Contributed Papers

A 20,000 word automatic speech recognizer
Adaptation to French of the US TANGORA system

In this article we describe the adaptation to the French language of the Tangora system, developed by the IBM Yorktown Speech Group for American English. This work is a continuation of the activity of our group, which developed the PARSYFAL system [1], a 200,000 word speaker-dependent syllable based dictation system whose word recognition accuracy is about 90%. The French Tangora system uses exactly the same hardware as its US counterpart. It is a single box autonomous system running on an IBM PC AT machine with 4 signal processing cards (Albert) and an INTEL 80386 processor [2]. It handles a vocabulary of 20,000 words and is designed to take down dictated press agency dispatches. The dictation and training are done in isolated word mode.

H. Cerf-Danon, P. de La Noue, L. Diringer, M. El-Beze, J. C. Marcadet
Automatic adjustments of the Markov models topology for speech recognition applications over the telephone

This paper presents some automatic adjustments of the structure of Markov models with the objective of either reducing model complexity or improving recognition performance. These modifications are tested on a 36 word vocabulary recorded by more than 500 speakers over the telephone network. The reduction of the model complexity is carried out by merging similar Gaussian functions using an iterative procedure. A 40% reduction of the number of Gaussian functions is obtained on word based models without altering recognition performance. The improvement of the recognition performance is obtained by dynamically expanding the Markov model. This is achieved mainly by splitting the Gaussian functions which make the highest contribution to the observation probability of the training set and by discarding the infrequently used transitions. After some iterations (involving the splitting and discarding operators) a 30% reduction of the word error rate is achieved using pseudo-diphone based models.

Denis Jouvet, Laurent Mauuary, Jean Monné
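
One common way to realize the splitting step described above is to perturb a Gaussian's mean along its standard deviation, producing two offspring densities that are then re-estimated; the sketch below shows this standard device (the perturbation factor eps is an assumption, since the abstract does not give one):

```python
import numpy as np

def split_gaussian(mu, var, eps=0.2):
    """Split a diagonal-covariance Gaussian into two by shifting the mean
    +/- eps standard deviations; the offspring are refined by further
    training iterations."""
    delta = eps * np.sqrt(var)
    return (mu + delta, var.copy()), (mu - delta, var.copy())
```
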
Phonetic Structure Inference of Phonemic HMM

An information-based, two-stage DP procedure is proposed for phonemic HMM structure inference. It operates on the phonetic description of speech provided by means of an ergodic HMM. The resulting HMM, while comprising a quite large number of states, shares the observation densities originated by the EHMM.

Alessandro Falaschi
Phonetic Units and Phonotactical Structure Inference by Ergodic Hidden Markov Models

A general model for the representation of the spoken language is proposed. It is based on Ergodic Hidden Markov Models and allows the introduction of linguistically consistent sub-phonetic units, accounting for the stationary and transitional behaviour of the spoken message. The main characteristics of the proposed model are analyzed, and some applications to speech recognition and coding are presented.

Piero Pierucci, Alessandro Falaschi, Massimo Giustiniani
Clustering of Gaussian densities in hidden Markov models

In order to reduce the number of Gaussian densities in a hidden Markov model based speech recognition system, a clustering scheme based on the Kullback divergence and the k-means clustering algorithm is proposed. The approach is tested in speaker independent recognition experiments for a vocabulary of 23 German words. A reduction of 50% in the number of densities can be achieved without degradation of the recognition performance.

S. Euler
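
For diagonal-covariance Gaussians, the symmetric Kullback divergence used as a clustering distance has a simple closed form (the log-variance terms cancel in the symmetrized sum). The sketch below shows it together with a moment-matched merge of two clustered densities; the merge rule is a standard choice, as the abstract does not specify one.

```python
import numpy as np

def symmetric_kl_diag(mu1, var1, mu2, var2):
    """KL(p||q) + KL(q||p) for two diagonal-covariance Gaussians,
    summed over feature dimensions."""
    d2 = (mu1 - mu2) ** 2
    return 0.5 * np.sum((var1 + d2) / var2 + (var2 + d2) / var1 - 2.0)

def merge_gaussians(mu1, var1, w1, mu2, var2, w2):
    """Moment-matched merge of two weighted diagonal Gaussians into one."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    var = (w1 * (var1 + (mu1 - mu) ** 2) + w2 * (var2 + (mu2 - mu) ** 2)) / w
    return mu, var, w
```

A k-means-style pass would assign each density to its nearest centroid under this divergence and recompute centroids with the merge rule until the assignments stabilize.
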
Developments in High-Performance Connected Digit Recognition

Recent advances in Hidden Markov Model (HMM) based speaker-independent connected digit recognition have usually tended to make the models more complex. This paper concentrates on improving the training techniques in order to make the most of the available parameters. A new algorithm, Corrective MMI Training, is introduced. Use of this algorithm resulted in significant improvements in our recognition rates. We now obtain less than a 2% string error rate using semi-continuous HMMs with two models per digit.

Yves Normandin, Régis Cardin
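
For reference, MMI training maximizes the mutual-information criterion below (this is the standard form; how the "corrective" variant optimizes it is not detailed in the abstract):

$$F_{\mathrm{MMI}}(\lambda) = \sum_{r} \log \frac{P_\lambda(O_r \mid w_r)\, P(w_r)}{\sum_{w} P_\lambda(O_r \mid w)\, P(w)}$$

where $O_r$ is the $r$-th training utterance, $w_r$ its correct transcription, and the denominator sums over competing word sequences, so the criterion rewards the correct string at the expense of its confusions.
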
Robust Speaker-Independent Hidden Markov Model Based Word Spotter

Since January 1990, we have endeavoured, in conjunction with the Canadian Department of National Defense, to design a speaker-independent word-spotter. This paper describes the several aspects investigated over the past few months. The open-set nature of the problem requires different techniques than those used in continuous speech recognition. This paper presents the baseline word-spotting system, using hidden Markov models (HMM) to model both keyword and non-keyword speech. In the course of the research, we have found that the features and their transformations greatly affect the performance of the system. Since the task requires processing speech over long periods of time, various schemes were investigated to produce a system that is robust with respect to variations in background noise. Finally, the experimental results of these investigations are presented.

Louis C. Vroomen, Yves Normandin
Robust Speech Recognition in Noisy and Reverberant Environments

A method of digital speech signal processing is introduced to reduce room reverberation as well as noise. The ability of this method to improve speech recognition in such situations is demonstrated.

H. G. Hirsch
An ISDN speech server based on speaker independent continuous Hidden Markov Models

In this paper a real-time prototype dedicated to single word recognition over ISDN lines is described. The system is speaker independent for a fixed hierarchical command set of 61 words in total. Context-dependent continuous density Markov phoneme models are used. To improve recognition rates, a postprocessor based on information measures is proposed, which chooses the best word candidate with respect to transinformation. In the first part the speech recognition algorithms used are presented. The second part deals with the ISDN speech database, the recording conditions and the achieved recognition rates. In the last part the hardware configuration of the speech server and the implementation of the described algorithms are explained in more detail. An outlook on future work concludes this contribution.

Klaus Zünkler
RAMSES: A Spanish Demisyllable Based Continuous Speech Recognition System

A continuous speech recognition system (called RAMSES) has been built based on the demisyllable as phonetic unit and tools from connected speech recognition. Speech is parameterized by band-pass lifted LPC-cepstra and demisyllables are represented by hidden Markov models (HMM). In this paper, the application of this system to recognize integer numbers from zero to one thousand is described. The paper contains a general overview of the system, an outline of the grammar inference, a description of the HMM training procedure and an assessment on the recognition performance in a speaker independent experiment.

J. B. Mariño, C. Nadeu, A. Moreno, E. Lleida, E. Monte, A. Bonafonte
Speaker Independent 1000 Words Speech Recognition in Spanish

In this paper a speaker-independent speech recognition system for 1006 isolated words in Spanish is presented. The approach used is based on discrete Hidden Markov Models. It is a first effort to develop a system with these characteristics in our Department. The initial task has been to recognize isolated words, but the final objective is a speaker-independent continuous speech recognition system for large vocabularies (more than 1000 words) in our language. The different phases of building our system, its characteristics and the techniques used are described. The characteristics of the speech database and the process of collecting it are discussed. Finally, some relevant experiments are presented and various future improvements are outlined.

J. M. Pardo, H. Hasan, J. Colás
Continuously Variable Transition Probability HMM for Speech Recognition

A new duration-intrinsic model for improved speech recognition by HMM techniques is presented. Assuming an exponentially decaying time dependency of the state loop probability, the duration density can be factorized and an early path-pruning theorem demonstrated. As a consequence, computational complexity is greatly reduced with respect to explicit duration models, whereas recognition performance improves considerably.

Alessandro Falaschi
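
The abstract's factorization claim can be made concrete as follows (a reconstruction from its wording, not the paper's notation). With a constant self-loop probability $a$, an HMM state has the geometric duration density $P(d) = a^{d-1}(1-a)$. If instead the loop probability decays exponentially with the time $\tau$ already spent in the state, say $a(\tau) = e^{-\lambda \tau}$, then

$$P(d) = \bigl(1 - a(d)\bigr)\prod_{\tau=1}^{d-1} a(\tau) = \bigl(1 - e^{-\lambda d}\bigr)\, e^{-\lambda d(d-1)/2},$$

and because the density is a product of per-frame factors, partial-path scores can still be updated frame by frame; a path whose accumulated factor already falls below a competitor's can never recover, which is the basis of the early pruning.
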

Continuous Speech Recognition Systems

Frontmatter

Invited Papers

Context-Dependent Phonetic Hidden Markov Models for Speaker-Independent Continuous Speech Recognition

The effectiveness of context-dependent phone modeling for speaker-dependent continuous speech recognition has recently been demonstrated. In this study, we apply context-dependent phone models to speaker-independent continuous speech recognition, and show that they are equally effective in this domain. In addition to evaluating several previously proposed context-dependent models, we also introduce two new context-dependent phonetic units: 1) function-word-dependent phone models, which focus on the most difficult subvocabulary, and 2) generalized triphones, which combine similar triphones together based on an information-theoretic measure. The subword clustering procedure used for generalized triphones can find the optimal number of models given a fixed amount of training data. We demonstrate that context-dependent modeling reduces the error rate by as much as 60%.

Kai-Fu Lee
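
The information-theoretic measure behind generalized triphones can be sketched as an entropy-based merging criterion over discrete output distributions (the usual formulation for such agglomerative clustering; the paper's exact measure may differ in detail):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def merge_loss(counts1, counts2):
    """Increase in weighted entropy when the codebook histograms of two
    triphone models are pooled; a small loss means the two contexts
    affect the phone similarly, so the triphones can be merged."""
    n1, n2 = counts1.sum(), counts2.sum()
    pooled = (counts1 + counts2) / (n1 + n2)
    return (n1 + n2) * entropy(pooled) - n1 * entropy(counts1 / n1) - n2 * entropy(counts2 / n2)
```

Greedy clustering repeatedly merges the pair with the smallest loss until the number of models fits the available training data.
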
Speaker Independent Continuous Speech Recognition Using Continuous Density Hidden Markov Models

The field of large vocabulary continuous speech recognition has advanced to the point where there are several systems capable of providing greater than 95% word accuracy for speaker independent recognition of a 1000 word vocabulary, spoken fluently, for a task with a perplexity of about 60. There are several factors which account for the high performance achieved by these systems, including the use of effective feature analysis, the use of hidden Markov model (HMM) methodology, the use of context-dependent sub-word units to capture intra-word and inter-word phonemic variations, and the use of corrective training techniques to emphasize differences between acoustically similar words in the vocabulary. In this paper we describe a large vocabulary continuous speech recognition system developed at AT&T Bell Laboratories, and discuss the methods used to provide high word recognition accuracy. In particular we focus our discussion on the techniques adopted to select the set of fundamental speech units and to provide the acoustic models of these sub-word units based on a continuous density HMM (CDHMM) framework. Different modeling approaches, such as a discrete HMM and a tied-mixture HMM, will also be discussed and compared to the CDHMM approach.

Chin-Hui Lee, Lawrence R. Rabiner, Roberto Pieraccini

Contributed Papers

A Fast Lexical Selection Strategy for Large Vocabulary Continuous Speech Recognition

This paper describes an algorithm for performing fast selection of word candidates to speed up the decoding of continuous speech. Interacting with word transitions, it fits very well into standard Viterbi decoding. The main features are: a tree organization of the lexicon, coarse models deduced from the detailed ones, a selection rule compatible with unknown word boundaries, and an efficient DP implementation. Various applications show a global reduction of the search cost by a factor of 5 to 10 depending on the lexicon size, while the error rate increases by less than 1%.

Xavier L. Aubert
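
The tree organization of the lexicon can be illustrated with a minimal prefix-tree builder (a sketch with hypothetical phone strings, not the paper's data structure): words sharing an initial phone sequence share one path, so the coarse acoustic evaluation of a common prefix is done once for all words below the branch.

```python
def build_lexical_tree(lexicon):
    """Build a phonetic prefix tree from {word: [phones]} entries."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})   # shared path for shared prefixes
        node["#word"] = word                    # mark a word ending at this node
    return root

# Hypothetical entries: 'cat' and 'cap' share the k-ae prefix.
tree = build_lexical_tree({"cat": ["k", "ae", "t"], "cap": ["k", "ae", "p"]})
```
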
Performance of a Speaker-Independent Continuous Speech Recognizer

The speaker-independent word recognition component of a Speech Understanding System is described. It is based on Hidden Markov Models (HMM) of phonetic units and produces a lattice of word hypotheses suitable to be parsed by a linguistic analyzer that uses the acoustic scores of words to generate the best interpretation of the spoken utterance given a set of syntactic and semantic rules. This work presents the results of a set of experiments carried out with the aim of selecting a suitable set of sub-word unit models for an E-mail inquiry application. The inquiries have been recorded over a dial-up telephone line connected to the local PABX. The performance of the system is presented and comparative results are given for Discrete and Continuous Density HMMs.

L. Fissore, P. Laface, G. Micca, R. Pieraccini
Automatic Transformation of Speech Databases for Continuous Speech Recognition

In this paper a method is described to automatically generate the labels for a new speech database from an existing manually labeled speech database. This becomes necessary when new standards are introduced and the speech signals have to be resampled. A dynamic time warping algorithm is used to match the original and the resampled speech signals. The comparison is carried out on mel based features. To reduce computation time, the search space for the DTW algorithm is restricted. Several experiments were carried out with a normal density Bayes classifier to check the quality of the new labelings. The results showed only a slight decrease in performance when using the new labelings.

S. Rieck, E. G. Schukat-Talamazzini, T. Kuhn, S. Kunzmann, E. Nöth
Iterative Optimization of the Data Driven Analysis in Continuous Speech

We present an iterative method to optimize the word recognition rate for a data driven analysis in continuous speech by using a large set of speech samples. After a short description of our system environment, a bootstrapping method for iterative parameter estimation is discussed. The initialization of the bootstrapping procedure is done by using a limited amount of hand labeled training data to estimate the statistical parameters roughly. In the second step the statistical parameters are estimated more accurately on the basis of unlabeled training data. Some experimental results for the bootstrapping method performed on unlabeled training data are given, in comparison with results achieved by parameter estimation on labeled training data.

T. Kuhn, S. Kunzmann, E. Nöth, S. Rieck, E. Schukat-Talamazzini
Syllable-based stochastic models for continuous speech recognition

The paper describes an automatic speech recognition system which is based on syllabic segmentation of the speech signal. Stochastic models (HMMs) are used for representing demisyllable segments. The advantages of syllabic processing within the different stages of the system (i.e. segmentation, phonetic classification, word and sentence recognition) are demonstrated and discussed on the basis of experimental results. Word and sentence recognition with a perplexity of 27 reached 74% and 96%, respectively.

G. Ruske, W. Weigel
Word Hypothesization in Continuous Speech Recognition

In this work, we present the FUB point of view on word hypothesization in the context of an Automatic Continuous Speech Recognition System. Our starting point is the impossibility of modeling recognition as a sequential process: it is necessary to design an efficient device to locate words in the continuous context in which they are embedded. We present a quick review of our current efforts on the procedural definition of the Word Hypothesizer and its interaction with a syntactical parsing system developed in our laboratory. We put some emphasis on the analogy between the word lattice, the output of the word hypothesizer, and the chart, the working data structure of the parser. They are both defined as directed graphs giving alternate representations, at the lexical and syntactical levels, of portions of the input. This analogy is the basis of the interaction: the word lattice can be directly mapped into the chart, and this becomes the medium for information communication among components.

Andrea Di Carlo
Phone Recognition Using High Order Phonotactic Constraints

A new phone recognizer has been implemented which extends the (phonotactic) decoding constraint to sequences of three phones. It is based on a structure similar to a second order ergodic hidden Markov model (HMM). This kind of model assumes a direct correspondence between the model states and phones; thus constraints on possible state sequences are equivalent to phonotactic constraints. Very high coverage by both left and right context dependent phone models has been achieved using two methods. The first assumes that some contexts have the same or a very similar effect on the phone in question, so they are merged into the same contextual class. The outcome is a set of 19 left context classes and 18 right context classes. The second assumes that the left context mostly influences the beginning of a phone, whereas the right context influences the end of the phone. Each phone (a state in an ergodic HMM) is represented by a sequence of three probability density functions (pdfs), which is equivalent to a three state left-to-right HMM. We generate acoustic models such that the first pdf in the model is conditioned on the left context, the middle pdf is context independent, and the last pdf is conditioned on the right context. A large number of such quasi-triphonic acoustic models can be generated, providing good triphone coverage for a given task while efficiently utilizing the available training data set. The current implementation of the recognizer described here has been applied to the DARPA Resource Management Task. Since true phone sequences are not available, they are estimated from text using a phone realization regression tree trained on TIMIT database transcriptions. The estimates of the true phone sequences are used in training the models and generating reference phone sequences for scoring. The best phone recognition match between the most likely output of the regression tree and the phone recognizer for the DARPA February 89 test set was 75.5% correct with 79.5% accuracy.

Andrej Ljolje
An Efficient Structure for Continuous Speech Recognition

In this paper, we present an efficient data structure for implementing a continuous, large vocabulary, speech recognizer. The recognition system is based on hidden Markov models of phonetic units for representing both intraword and interword context dependent phones. Due to the large number of connections present in the decoding network, the structure of the recognizer must be carefully designed in order to perform experiments in a reasonable amount of computing time.

R. Pieraccini, C. H. Lee, E. Giachin, L. R. Rabiner
Search Organization for Large Vocabulary Continuous Speech Recognition

This paper describes the search organization of the Sphinx continuous speech recognition system. The search is organized as a hybrid Viterbi beam search that uses Viterbi beam search [2] to control the activation of words and sub-word units at the high level and Viterbi optimal search [3] to evaluate hidden Markov models at the low level. This hybrid Viterbi search allows the Sphinx search, running on a Sun 3/60, to evaluate the February 1988 Resource Management test set in 3.0 times real-time.

Fileno A. Alleva
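
The word-level half of such a hybrid search rests on beam pruning, which can be sketched in a few lines (a generic rendering, not Sphinx code): after each frame, only hypotheses scoring within a fixed log-width of the current best remain active.

```python
def prune_beam(active, beam_width):
    """Keep hypotheses within `beam_width` of the best log score.

    active : dict mapping hypothesis id -> accumulated log score
    """
    best = max(active.values())
    return {h: s for h, s in active.items() if s >= best - beam_width}

# Example: with a beam of 5.0, the -20.0 hypothesis is discarded.
print(prune_beam({"word_a": -10.0, "word_b": -12.5, "word_c": -20.0}, 5.0))
```
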

Connectionist Models of Speech

Frontmatter

Invited Papers

Neural Networks or Hidden Markov Models for Automatic Speech Recognition: Is there a Choice?

Various algorithms based on “neural network” (NN) ideas have been proposed as alternatives to hidden Markov models (HMMs) for automatic speech recognition. We first consider the conceptual differences and relative strengths of NN and HMM approaches, then examine a recurrent computation, motivated by HMMs, that can be regarded as a new kind of neural network specially suitable for dealing with patterns with sequential structure. This “alphanet” exposes interesting relationships between NNs and discriminative training of HMMs, and suggests methods for properly integrating the training of non-linear feed-forward data transformations with the rest of an HMM-style speech recognition system. We conclude that NNs and HMMs are not distinct, so there is no simple choice of one or the other. However, there are many detailed choices to be made, and many experiments to be done.

John S. Bridle
Neural Networks for Continuous Speech Recognition

The paper reviews a number of methods for continuous speech recognition, concentrating mostly on work at Cambridge University. The methods reviewed are a ‘sound’ and phoneme recogniser using duration sensitive nets; the modified Kanerva model for phoneme recognition; a recurrent net for phoneme recognition; a Classification and Regression Tree (CART) for phoneme recognition; together with methods for lexical access including the NET-gram, the modified Kanerva model, and the ‘Compositional Representation’ approach.

Frank Fallside
Connectionist Large Vocabulary Speech Recognition

In this paper, the problem of large vocabulary word recognition is addressed from a connectionist perspective. The problem is not only of practical interest but also of scientific importance, since a workable solution must integrate pattern recognition with sequential, symbolic constraints. We have developed two large vocabulary word recognition systems based on different speech recognition philosophies. One of the systems exploits the power of neural networks in performing accurate classification, the other their power in producing good non-linear function approximation and signal prediction. We present each system’s operation and evaluate its performance. Both achieved respectable recognition scores in excess of 90% correct for vocabularies of up to 5000 words. We suggest further avenues towards improving either system and in the process discuss the relative strengths of each approach.

Alex Waibel
The cortical column as a model for speech recognition: principles and first experiments

Connectionist models, also known as neural networks, have been widely studied during the past few years. Applications concern various tasks in the fields of pattern recognition and signal processing, especially automatic speech recognition. This chapter presents the basic properties of these models and the different problems in the area of speech recognition to which they have been applied so far. The classical models of neural networks are also briefly recalled. We then concentrate on a particular model grounded on neuro-biological data, the cortical column. The characteristics of the model are given and we then present the architectures of two systems based on the cortical column for solving two different problems of speech recognition, i.e. acoustic-phonetic decoding and isolated word recognition.

Frédéric Guyot, Frédéric Alexandre, Catherine Dingeon, Jean-Paul Haton

Contributed Papers

Radial Basis Functions for Speech Recognition

The purpose of this paper is to study the application of Radial Basis Functions (RBF) to automatic speech recognition. Results of several experiments with these networks on the recognition of phonemes from the TIMIT database are presented, including an experiment on a recurrent network of RBFs.

Yoshua Bengio
Phonetic features extraction using Time-Delay Neural Networks

A. Waibel introduced Time-Delay Neural Networks (TDNNs) as a specific neural network architecture that is especially well adapted to the “dynamic nature of speech”. We propose here to use low-dimensional TDNNs for discriminating between phonetic features. We evaluate and comment on their different performances. We also compare direct phoneme recognition scores using a sophisticated classical classifier to those obtained with a medium-size TDNN.

Frédéric Bimbot, Gérard Chollet, Jean-Pierre Tubach
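
The core of a TDNN layer is a time-delayed weighted sum whose weights are shared across all positions in time; the numpy sketch below shows one such layer (the shapes and the tanh nonlinearity are conventional choices, not taken from the paper):

```python
import numpy as np

def tdnn_layer(x, W, b, delay=2):
    """One time-delay layer: output frame t sees input frames t..t+delay,
    with the same weights W applied at every position in time.

    x : (T, F) input feature frames
    W : (delay + 1, F, H) shared weights, H hidden units
    b : (H,) bias
    returns (T - delay, H) activations
    """
    T, F = x.shape
    out = []
    for t in range(T - delay):
        window = x[t : t + delay + 1]                        # (delay+1, F)
        out.append(np.tanh(np.tensordot(window, W, axes=2) + b))
    return np.array(out)

# 10 frames of 16 coefficients through a layer with 8 hidden units.
rng = np.random.default_rng(0)
h = tdnn_layer(rng.normal(size=(10, 16)), rng.normal(size=(3, 16, 8)) * 0.1, np.zeros(8))
print(h.shape)  # (8, 8)
```
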
Improved Broad Phonetic Classification and Segmentation with an Auditory Model

We describe a broad phonetic classification and segmentation algorithm based on neural networks and dynamic programming. The basics of our algorithm are outlined in another paper [7], so here we will focus on the introduction of auditory model features replacing the mel-scale parameters. Our auditory model incorporates critical band filtering, short time adaptation and temporal analysis of the auditory nerve responses. Unlike previously proposed synchrony models, it emphasizes the envelope rather than the instantaneous frequency as the carrier of perceptually relevant information.

L. Depuydt, J. P. Martens, L. Van Immerseel
Automatic Learning of a Production Rule System for Acoustic-Phonetic Decoding

Results are reported of experiments in the use of Charade, a system for the automatic learning of production rules, for acoustic-phonetic decoding. The Charade system is evaluated on its performance in classifying phonetic macro-classes from the generated production rules. As points of comparison, the same experiments are carried out with a standard classifier (Hamming Distance Nearest Neighbor) and a neural net based technique (Modified Hopfield Net). The results can be summarized as follows: for a given reasonable error rate, the Charade classifier gives a higher accuracy rate than HDNN and performs as well as MHN. Moreover, the generated production rules can be analysed and interpreted for knowledge acquisition.

M.-J. Caraty, C. Montacié, X. Rodet

Stochastic Models for Language and Dialogue

Frontmatter

Invited Papers

Stochastic Grammars and Pattern Recognition

This paper presents a unifying framework of syntactic and statistical pattern recognition for one-dimensional observations and signals like speech. The syntactic constraints will be based upon stochastic extensions of the grammars in the Chomsky hierarchy. These extended stochastic grammars can be applied to both discrete and continuous observations. Neglecting the mathematical details and complications, we can convert a grammar of the Chomsky hierarchy to a stochastic grammar by attaching probabilities to the grammar rules and, for continuous observations, attaching probability density functions to the terminals of the grammar. In such a framework, a consistent integration of syntactic pattern recognition and statistical pattern recognition, which is typically based upon Bayes’ decision rule for minimum error rate, can be achieved such that no error correction or postprocessing after the recognition phase is required. Efficient algorithms and closed-form solutions for the parsing and recognition problem will be presented for the following types of stochastic grammars: regular, linear and context-free. It will be shown how these techniques can be applied to the task of continuous speech recognition.

Hermann Ney
Basic Methods of Probabilistic Context Free Grammars

In automatic speech recognition, language models can be represented by Probabilistic Context Free Grammars (PCFGs). In this lecture we review some known algorithms which handle PCFGs; in particular an algorithm for the computation of the total probability that a PCFG generates a given sentence (Inside), an algorithm for finding the most probable parse tree (Viterbi), and an algorithm for the estimation of the probabilities of the rewriting rules of a PCFG given a corpus (Inside-Outside). Moreover, we introduce the Left-to-Right Inside algorithm, which computes the probability that successive applications of the grammar rewriting rules (beginning with the sentence start symbol s) produce a word string whose initial substring is a given one.

F. Jelinek, J. D. Lafferty, R. L. Mercer
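
The Inside computation the lecture reviews can be sketched for a PCFG in Chomsky normal form (the CNF restriction is a simplifying assumption here; the grammar and probabilities in the example are made up):

```python
from collections import defaultdict

def inside_probability(words, lexical, binary, start="S"):
    """Inside algorithm: total probability that a CNF PCFG derives `words`.

    lexical : dict (A, word) -> P(A -> word)
    binary  : dict (A, B, C) -> P(A -> B C)
    """
    n = len(words)
    # inside[i][j][A] = P(A derives words[i..j])
    inside = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                inside[i][i][A] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                        # split point
                for (A, B, C), p in binary.items():
                    inside[i][j][A] += p * inside[i][k][B] * inside[k + 1][j][C]
    return inside[0][n - 1][start]

lexical = {("N", "time"): 0.5, ("N", "flies"): 0.5, ("V", "flies"): 1.0}
binary = {("S", "N", "V"): 1.0}
print(inside_probability(["time", "flies"], lexical, binary))  # 0.5 * 1.0 = 0.5
```

Replacing the sums over split points with maximizations gives the Viterbi parse, and accumulating expected rule counts instead gives the Inside-Outside re-estimation step.
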
A Probabilistic Approach to Person-Robot Dialogue

This paper surveys current work on systems that support spoken man-machine dialogue. Subsequently, we describe our current research on a natural language interface for robot control, indicating how our system uses probability scores to evaluate and combine hypotheses generated at different conceptual levels.

Renato De Mori, Jacqueline Bourdeau, Roland Kuhn

Contributed Papers

Experimenting Text Creation by Natural-Language, Large-Vocabulary Speech Recognition

In recent years the probabilistic approach to speech recognition has allowed the development of high-performance, large-vocabulary speech recognition systems [1] [2]. At the IBM Rome Scientific Center a speech-recognition prototype for the Italian language, based on this approach, has been built. The prototype is able to recognize in real time natural-language sentences built using a vocabulary containing up to 20,000 words [4]. The user has to perform, once and for all, an acoustic training phase (about 20 minutes long), during which he is required to utter a predefined text. Words must be uttered inserting small pauses (a few centiseconds) between them. The prototype architecture is based on a personal computer equipped with special hardware. The first system we developed was aimed at a business and finance lexicon. Many laboratory tests have shown the effectiveness of the prototype as a tool to create texts by voice. After a first phase during which in-house experiments were carried out [5], the need arose to test the system in real work environments and for different applications. Two applications were considered: the dictation of radiological reports and of insurance company documents. Due to their characteristics, these applications seemed to be very well suited for our purposes. Since the vocabulary of the recognizer must be predefined, we had to adapt the system to the lexicon required by the new applications. The paper describes the techniques developed to efficiently adapt the basic components of the recognizer: the acoustic and language models. The results obtained by experimenting with automatic text dictation during real work are also presented.

P. Alto, M. Brandetti, M. Ferretti, G. Maltese, F. Mancini, A. Mazza, S. Scarci, G. Vitillaro
DUALGRAM: An Efficient Method for Representing Limited-Domain Language Models

This paper describes DUALGRAM, a language model developed at CSTR to cope with certain problems arising from applications of speech recognition systems in limited domains. The essential feature of the model is the use of two finite state mechanisms running in parallel; one is a standard word Bigram, while the other is a tag Finite-State Machine which handles sequences of syntactic categories. Some advantages of the model are discussed in comparison to other approaches, and some results are presented on using the various models to describe unseen material.

C. A. Matheson, J. C. Foster, A. W. Black, I. A. Nairn
Strategies for Speech Recognition and Understanding using Layered Protocols

The theory of Layered Protocols, developed for more general description of intelligent communication and design of human-computer interfaces, is proposed as a framework for integrating syntactic, semantic, and pragmatic information of various types in a speech understanding system. A 9-element protocol structure is described, and the interaction of protocols in a processing network is suggested as a mechanism for understanding acoustic messages. Hidden Markov Models are a special case of the proposed structure.

M. Martin Taylor, Joyce van de Vegte

Understanding and Dialogue Systems

Frontmatter

Invited Papers

TINA: A Probabilistic Syntactic Parser for Speech Understanding Systems

A new natural language system, Tina, has been developed for applications involving speech understanding tasks; it integrates key ideas from context free grammars, Augmented Transition Networks (ATN’s) [12], and Lexical Functional Grammars (LFG’s) [2]. The parser uses a best-first search strategy, with probability assignments on all arcs obtained automatically from a set of example sentences. An initial context-free grammar, derived from the example sentences, is first converted to a probabilistic network structure. Control includes both top-down and bottom-up cycles, and key parameters are passed among nodes to deal with long-distance movement and agreement constraints. The probabilities provide a natural mechanism for exploring more common grammatical constructions first. Arc probabilities also reduced test-set perplexity by nearly an order of magnitude. Included is a new strategy for dealing with movement, which can efficiently handle nested and chained gaps, and rejects crossed gaps.

Stephanie Seneff
The Voyager Speech Understanding System: A Progress Report

As part of the DARPA Spoken Language System program, we recently initiated an effort in spoken language understanding. A spoken language system addresses applications in which speech is used for interactive problem solving between a person and a computer. In these applications, not only must the system convert the speech signal into text, it must also understand the linguistic structure of a sentence in order to generate the correct response. This paper describes our early experience with the development of the MIT VOYAGER spoken language system.

Victor Zue, James Glass, David Goodine, Hong Leung, Michael Phillips, Joseph Polifroni, Stephanie Seneff
The Interaction of Word Recognition and Linguistic Processing in Speech Understanding

This contribution describes an approach to integrate a speech understanding and dialog system into a homogeneous architecture based on semantic networks. The definition of the network as well as its use in speech understanding is described briefly. A scoring function for word hypotheses meeting the requirements of a graph search algorithm is presented. The main steps of the linguistic analysis, i.e. syntax, semantics, and pragmatics, are described and their realization in the semantic network is shown. The processing steps alternating between data- and model-driven phases are outlined using an example sentence which demonstrates a tight interaction between word recognition and linguistic processing.

H. Niemann, G. Sagerer, U. Ehrlich, E. G. Schukat-Talamazzini, F. Kummert
Linguistic Processing in a Speech Understanding System

The goal of a speech understanding system is to correctly identify the action to be taken in response to a user’s voiced request. To this purpose, the system has to rely on some type of linguistic knowledge besides merely recognizing words. Several approaches have been proposed to employ language modeling in speech understanding. They include unified architectures integrating modular knowledge sources that account for every level of knowledge from acoustics to linguistics, and two-level architectures in which the separation between recognition and linguistic processing is well defined. Within the latter approach, two main methods may be conceived: linguistic constraints are integrated into the recognizer, which decodes one string of words that is treated by a natural language interface; or the recognizer produces a scored word lattice that is subsequently processed by a suitable linguistic module. For the present study, this latter approach was considered the most promising one, provided a satisfactory solution to efficient word lattice parsing could be found.

Parsing a word lattice is a search activity whose space is extremely large. It may be performed in two basic modes, namely the left-to-right mode and the score-driven middle-out mode. Optimal algorithms based on the left-to-right mode induce a computation that grows polynomially with the lattice length, while those based on the middle-out mode grow exponentially with length. However, it is possible to devise score-driven middle-out methods such that the amount of computation they induce depends on the average likelihood score of the word sequence they are expected to output. Hence, if these words are recognized with a good score, computation may be lower than with left-to-right methods.

This paper describes in detail an algorithm that was experimentally proven to exhibit high parsing efficiency in the task it was designed for (1000-word continuous speech understanding, restricted semantic domain, and high syntactic freedom). Improved efficiency is reached through the use of heuristics which, exploiting the redundancy of the middle-out parsing approach, permit cutting down the search without appreciably invalidating the optimality of the method. Problems like the imperfect determination of the start and end points of words and the absence of short function words from the lattice are also taken into account.

Experimental results, evaluated on lattices produced by the speaker-dependent version of the recognizer available in 1988, show that high-speed speech understanding is feasible with habitable language models (for a specific application) and reasonable accuracy of comprehension. Parsing time is about 1.8 seconds on a Sun 4 workstation and correct sentence understanding is 82% for a language model of perplexity 25.

Egidio P. Giachin, Claudio Rullent

Contributed Papers

Linguistic Tools for Speech Recognition and Understanding

A realistic model of speech recognition and understanding should be heavily based on both linguistic and acoustic knowledge. While this fact seems to be acknowledged by most research people in the field, it is not yet clear how to thread that knowledge into acoustic evidence. We present a proposal that tries to solve some of the problems involved in this task. In particular, acoustic patterns made up of word models are substituted by a model which makes use of features and syllables. The phone (the phoneme is too abstract!) and the word are regarded as abstract objects which are built up, the former from feature matrices, and the latter from syllables and morphological parsing into morphemes. Thus the lexicon is not a list of word forms but is composed of root morphemes and affixes in a graph structure, and is traversed by a parser which makes use of rules for the composition of legal words of the language from subword units. Hypotheses are fired both at the phone and at the syllable level on the basis of feature extraction.

Rodolfo Delmonte
Evidential reasoning and the combination of knowledge and statistical techniques in syllable based speech recognition

In what follows we describe SYLK: a project which aims to deploy expert phonetic knowledge within an admissible statistical framework for syllable based speech recognition — hence ‘SYLK’, Statistical sYLlabic Knowledge. The project has been running at the Universities of Sheffield and Leeds, in the UK, since mid 1989 and has both HMM and Syllabic IKBS constituents. See the overview in Fig. 1. It is funded by the UK IED programme and follows on from previous Alvey-funded work on the Speech Sketch (Green et al [6]) and HMM-based broad phonetic classification (P. Roach [10]). In this paper an overview of the system is presented, with particular attention paid to its statistical and hypothesis refinement aspects.

Luke Boucher, Tony Simons, Phil Green

Speech Analysis, Coding and Segmentation

Frontmatter

Contributed Papers

Data Base Management for Use with Acoustic-Phonetic Speech Data Bases

In this paper, the requirements for an acoustic-phonetic data base are outlined, and three different management systems (flat files, relational DBMS’s, and object-oriented DBMS’s) are compared with respect to these demands.

Jan P. Hendriks, Sandra H. Swagten, Louis Boves
BPF Outputs Compared with Formant Frequencies and LPC’s for the Recognition of Vowels

The purpose of this paper is to compare different existing measurement methods for the recognition of Turkish vowels. These methods use the band-pass-filter (BPF) outputs, the formant frequencies, and the linear prediction coefficients (LPC’s). The recognition rates obtained by using the minimum distance principle show that the BPF outputs seem to be the best.

Atalay Barkana, Atila Barkana
A Codification of Error Signal by Splines Functions

One possible way of increasing the performance of low-bit-rate speech coders is to consider a parametric glottal model as the source for voiced sounds and to approximate the glottal waveform using trigonometric functions. The main problem of this procedure is how to determine the opening and closure times. This work proposes an alternative method based on an approximation of the glottal signal by spline functions (in particular, cubic splines). Another objective is to remove the influence of the non-white glottal source on the estimation of the parameters of the production filter. In the proposed algorithm, this is obtained using conventional predictive analysis.

M. C. Benitez, J. A. Galvez, A. Rubio, J. Diaz
Specific Distance for Feature Selection in Speech Recognition

In this paper, the use of a specific metric as a feature selection step is investigated. The feature selection step tries to model the correlation among adjacent feature vectors and the variability of the speech. We propose a new procedure which performs the feature selection in two steps. The first step takes into account the temporal correlation among the N feature vectors of a template in order to obtain a new set of feature vectors which are uncorrelated. This step gives a new template of M feature vectors, with M ≪ N. The second step defines a specific distance among feature vectors to take into account the frequency discrimination features which discriminate each word of the vocabulary from the others or a set of them. Thus, the new feature vectors are uncorrelated in time and discriminant in frequency.

E. Lleida, C. Nadeu, J. B. Mariño, E. Monte, A. Moreno
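
The first selection step can be read as a Karhunen-Loeve transform over the time axis; the sketch below is that interpretation, not necessarily the paper's exact algorithm: the N frames of a template are projected onto the M leading eigenvectors of their temporal correlation matrix, giving M mutually uncorrelated vectors.

```python
import numpy as np

def temporal_klt(template, m):
    """Reduce an (N, F) template to (m, F) uncorrelated vectors.

    The (N, N) correlation matrix between frames is eigen-decomposed
    and the template is projected onto the m leading eigenvectors.
    """
    X = template                                       # (N, F): N frames, F features
    R = (X @ X.T) / X.shape[1]                         # (N, N) temporal correlation
    eigvals, eigvecs = np.linalg.eigh(R)               # ascending eigenvalues
    lead = eigvecs[:, np.argsort(eigvals)[::-1][:m]]   # (N, m) leading eigenvectors
    return lead.T @ X                                  # (m, F) uncorrelated template
```
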
Multiple Template Modeling of Sublexical Units

Automatic training procedures are developed to obtain the model or models for a certain type of linguistic unit, under the framework of a Distance-based approach. The chosen units are phonetic-units and the models are templates. In a first approach, one prototype (centroid) per phonetic-unit is obtained through an iterative process and by using Dynamic Time Warping techniques. A refinement is performed through a Clustering procedure that obtains several prototypes per phonetic-unit. Another refinement, which is based on Multiedit Condensing techniques, is also proposed. Some preliminary experimental results for single-speaker and multi-speaker tasks are reported.

P. Aibar, M. J. Castro, F. Casacuberta, E. Vidal
Learning Structural Models of Sublexical Units

The use of structural-stochastic models has proved to be adequate for the Acoustic-Phonetic Decoding problem. Whereas good algorithms exist to estimate the parameters of the models, the structures are chosen according to a priori knowledge about the speech objects to be represented. This structural information is represented through a finite state network, and therefore the use of a Grammatical Inference algorithm is a natural way to infer the structure of the models. In this paper, we propose the use of a specific grammatical inference algorithm (the Error Correcting Grammatical Inference algorithm) to learn structural models of the phonetic units.

E. Sanchis, F. Casacuberta, S. Carpi
On the use of Negative Samples in the MGGI Methodology and its application for Difficult Vocabulary Recognition Tasks

The inference methods proposed in Syntactic Pattern Recognition in practice only make use of positive data and generate a heuristic generalization of the strings in the data. However, positive data become insufficient when very discriminatory models are needed. This is the case for Difficult Vocabularies in Isolated Word Recognition tasks. This paper is a first attempt at using positive and negative data that presents two main characteristics: it retains computational efficiency with moderate-sized training sets, and it is suitable for tasks in Syntactic Pattern Recognition, specifically in Automatic Speech Recognition.

Encarna Segarra, Pedro García, Jose M. Oncina, Armando Suarez
A New Method for Dynamic Time Alignment of Speech Waveforms

In this paper, a new method for dynamic time alignment of speech waveforms is introduced. The method attempts to address the shortcomings of traditional time alignment approaches, commonly based on dynamic programming algorithms. Such methods, usually called dynamic time warping (DTW) algorithms, make the assumption that the samples of the speech waveform under consideration are statistically independent. The proposed method makes no such assumption. Instead, the method is based on models of speech entities with Gaussian distributions and general covariance matrices. These ideas are implemented by employing the branch and bound search algorithm [1] coupled with the Mahalanobis distance measure as the matching criterion. Hence, the new method attempts to utilise more discriminatory information than is presently incorporated. Preliminary results on a spoken letter recognition problem are reported, validating the approach.

J. Kittler, A. E. Lucas
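
The matching criterion named in the abstract is the Mahalanobis distance under a Gaussian model with a general covariance matrix; for reference, a minimal implementation (the branch and bound search wrapped around it is omitted):

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance of observation x from a Gaussian with mean mu
    and full covariance cov; reduces to Euclidean distance when cov = I."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))
```
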
A New Technique for Automatic Segmentation of Continuous Speech

A new procedure is proposed to automatically segment speech signals in accordance with a given linguistic-unit description of the signals. The technique is “model-free” in that no models are required or used for the units. Segmentation is obtained by (locally) minimizing an appropriate distortion function that measures the linguistic inconsistency of a segmentation through DTW acoustic dissimilarities. Experimental results show the inferiority (greater distortion) of manual segmentations with respect to the automatic segmentations obtained with the proposed technique. These automatic segmentations could be useful to improve the performance of any training procedure for any type of linguistic-unit model.

E. Vidal, A. Marzal
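
Since the distortion function above is built from DTW acoustic dissimilarities, a minimal reference implementation of the classic DTW distance may be useful (a textbook version with symmetric local constraints, not the paper's exact recursion):

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic time warping distance between two sequences of feature
    vectors, using Euclidean local cost and steps (1,0), (0,1), (1,1)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```
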
Segmentation of speech based upon a linear model of the effects of coarticulation

Most acoustic-phonetic, knowledge based approaches to speech recognition are based on the phonemic hypothesis. According to this hypothesis it is possible to divide a speech utterance into a sequence of phonemes, each phoneme corresponding to a phonetic event or, as for instance in the case of plosives and diphthongs, to several phonetic events. The purpose of speech segmentation algorithms is to automatically carry out a division of an utterance into a sequence of events. Due to coarticulation, neighboring events are often poorly separated in time, and this makes segmentation a difficult task. In this article we describe a speech segmentation algorithm which is based on a linear model of the effects of coarticulation. The model assumes that for a suitable set of speech parameters the intersegment transition between two neighboring articulatory events can be described by a linear combination of two speech vectors.

P. J. Dix, G. Bloothooft
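
Written out, the linear model described above amounts to the following (a reconstruction from the abstract's wording): during the transition between two neighboring articulatory events with parameter vectors $\mathbf{v}_1$ and $\mathbf{v}_2$, the observed speech vector is

$$\mathbf{x}(t) = \alpha(t)\,\mathbf{v}_1 + \bigl(1 - \alpha(t)\bigr)\,\mathbf{v}_2, \qquad 0 \le \alpha(t) \le 1,$$

with $\alpha(t)$ decreasing monotonically through the intersegment transition; a segment boundary can then be placed, for instance, where $\alpha(t)$ crosses $1/2$.
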
Backmatter
Metadata
Title
Speech Recognition and Understanding
Edited by
Pietro Laface
Renato De Mori
Copyright year
1992
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-76626-8
Print ISBN
978-3-642-76628-2
DOI
https://doi.org/10.1007/978-3-642-76626-8