
1988 | Book

Recent Advances in Speech Understanding and Dialog Systems

Edited by: H. Niemann, M. Lang, G. Sagerer

Publisher: Springer Berlin Heidelberg

Book series: NATO ASI Series


About this book

This volume contains invited and contributed papers presented at the NATO Advanced Study Institute on "Recent Advances in Speech Understanding and Dialog Systems" held in Bad Windsheim, Federal Republic of Germany, July 5 to July 18, 1987. It is divided into the three parts Speech Coding and Segmentation, Word Recognition, and Linguistic Processing. Although this can only be a rough organization showing some overlap, the editors felt that it most naturally represents the bottom-up strategy of speech understanding and, therefore, should be useful for the reader. Part 1, SPEECH CODING AND SEGMENTATION, contains 4 invited and 14 contributed papers. The first invited paper summarizes basic properties of speech signals, reviews coding schemes, and describes a particular solution which guarantees high speech quality at low data rates. The second and third invited papers are concerned with acoustic-phonetic decoding. Techniques to integrate knowledge sources into speech recognition systems are presented and demonstrated by experimental systems. The fourth invited paper gives an overview of approaches for using prosodic knowledge in automatic speech recognition systems, and a method for assigning a stress score to every syllable in an utterance of German speech is reported in a contributed paper. A set of contributed papers treats the problem of automatic segmentation, and several authors successfully apply knowledge-based methods for interpreting speech signals and spectrograms. The last three papers investigate phonetic models, Markov models and fuzzy quantization techniques and provide a transition to Part 2.

Table of contents

Frontmatter

Speech Coding and Segmentation

Invited Papers

Recent Advances in Speech Coding

After a short summary of some basic properties of speech signals and of speech signal models, the effect of linear prediction and vector quantization for data compression in speech coding is outlined. Some well-known coding schemes are reviewed. The recently developed RELP-S schemes based on speech analysis by synthesis are discussed in more detail. In particular, a scheme using stochastic excitation sequences is expected to guarantee high speech quality at data rates far below 8 kb/s.

D. Wolf, H. Reininger
Acoustic-Phonetic Decoding of Speech
Statistical Modeling for Phonetic Recognition

Several methods for acoustic-phonetic decoding are reviewed. Emphasis is placed on the need for mathematical methods for speech recognition. Several examples of statistical methods are described. The author presents several techniques for incorporating “speech knowledge” into these statistical models, and provides a simple formalism for using multiple knowledge sources in a coherent speech recognition system.

Richard M. Schwartz, Y. Chow, M. Dunham, O. Kimball, M. Krasner, F. Kubala, J. Makhoul, P. Price, S. Roucos
Knowledge-Based Approaches in Acoustic-Phonetic Decoding of Speech

The level of acoustic-phonetic decoding (i.e. the transformation of the acoustic continuum of speech into a description in the form of discrete, linguistic units) represents an important step and a major bottleneck in the overall process of automatic speech recognition. A number of different approaches have been proposed so far for solving this problem. After a brief recall of these approaches based on pattern matching and/or stochastic modelling, we introduce a new class of methods based on Artificial Intelligence knowledge-based techniques that make explicit use of all available kinds of knowledge in order to carry out the phonetic decoding of a sentence. The various issues involved in this approach are discussed and illustrated by the presentation of APHODEX, an expert system for acoustic-phonetic decoding.

Jean-Paul Haton
The Use of Prosodic Parameters in Automatic Speech Recognition

The present communication concerns the use of prosodic parameters in automatic speech recognition (ASR), i.e. the feasibility of automatically extracting prosodic information from a set of acoustic measurements done on the signal, and the effect of integrating such information on the performance of ASR. Prosodic parameters include pauses and contrasts in pitch, duration and intensity between successive segments (mainly the vocalic parts). This notion is also extended to the number of syllables and to ratios of voiced to unvoiced portions of the words. Part one introduces the various aspects of prosody (linguistic and non-linguistic) and the main problems to be solved in automatically extracting linguistic messages conveyed by prosodic features. Part two deals with word level and lexical search: it presents work done (1) on the feasibility of word stress detection (primary stress, estimation of its magnitude, and evaluation of the complete word stress pattern) and (2) on the estimation of the amount of lexical constraints imposed by stress information in lexical search, complemented by other suprasegmental information (number of syllables, word boundaries, ratios between voiced and unvoiced portions in the word, etc.). Part three deals with phrase and sentence levels and syntactic constraints provided by the automatic detection of word, phrase and sentence boundaries. Part four relates a number of miscellaneous uses at the phonemic level: phonetic segmentation, identification of the voicing feature of consonants, and estimation of the “segmental quality” of the underlying segments. It is observed that prosodic parameters have been exploited rather poorly compared to segmental aspects of speech. Integration of prosodic knowledge with segmental knowledge in an ASR system is a difficult problem: to know when and where to integrate the prosodic knowledge into the system and how to combine the evidence and scores obtained from different sources.
The exact contribution of the use of prosody in ASR is still to be estimated in an ASR system flexible enough to efficiently test such an integration.

J. Vaissière

Contributed Papers

Prosodic Features in German Speech: Stress Assignment by Man and Machine

We present a method for automatically assigning a stress score to every syllable in an utterance. The syllables are detected by looking at the energy in three energy bands. Based on features representing the prosodic parameters duration, intensity, and pitch, a stress assignment number for each syllable is calculated. The algorithm is tested on material collected from real dialogs. The automatic stress assignment is compared with the result of a stress perception experiment by human listeners.

E. Nöth, H. Niemann, S. Schmölz
Recognition of Speech using Temporal Decomposition

A natural approach to speech analysis is to attempt a segmentation of the speech signal into phoneme units. However, decomposition into phonemes is difficult, as they are context-dependent. Instead we propose a larger segmental unit, the “polysons” /BIM 86/. A technique is proposed for automatic labelling of speech, using a robust temporal decomposition, a technique originally proposed for speech coding. The spectral targets are evaluated by successive iterations on the computation of a set of compact interpolation functions, Ø(t), and corresponding spectra. These spectra are characterized by vectors of Log Area Ratios, LAR. A discussion on the choice of the representation may be found in /BIM 87/. Applications to speech recognition strategies are discussed.

Gérard Chollet, Gunnar Ahlbom, Frédéric Bimbot
Long Term Analysis-Synthesis of Speech by Non-Stationary AR Methods

The use of non-stationary AR techniques for characterizing segments of speech of relatively long duration (e.g. 0.5 sec., a time duration that may include many bisyllabic words) has been recently considered. A method has been proposed for this purpose, based on a non-stationary lattice, representing the dependence on time by a linear combination of functions of a suitable orthogonal basis. An iterative procedure for removing the bias of the formant frequencies due to the fundamental frequency is proposed in connection with the parametric non-stationary estimation of long-term segments of speech. The non-stationary technique can be extended to the identification of the excitation source; a method is described for recovering this signal from the residual of the lattice predictor. The efficiency of the resulting analysis-synthesis method is illustrated by real speech examples.

Susanna Ragazzini
Using Contextual Information in View of Formant Speech Analysis Improvement

Parallel formant synthesis of speech is a well-known technique used in building high-performance synthesis systems. The fact that the synthesizer parameters can be closely controlled makes it possible to obtain very high quality synthetic speech. However, the extraction of the parameters is not an easy task; usually semi-automatic methods are used for this purpose. Our aim is to extract these parameters as automatically as possible. Hence, we use suitable signal processing procedures, together with a-priori and context knowledge on speech, in order to guide the analysis and correct possible errors where the automatic procedures are unable to give the correct results. The fact that we already have a whole French diphone dictionary, analysed by an operator-controlled semi-automatic method, will help us to gather the necessary a-priori knowledge of speech analysis. These analysis parameters are smoothed for all the diphones and then stored in a database; we can then study the parameters’ behavior and integrate them in the analysis system, the long-term objective being synthesis by rule.

O. Al-Dakkak, G. Murillo, G. Bailly, B. Guérin
A Speech Recognition Strategy Based on Making Acoustic Evidence and Phonetic Knowledge Explicit

We describe a prototype implementation of a representational approach to acoustic-phonetics in knowledge-based speech recognition. Our scheme is based on the ‘Speech Sketch’, a structure which enables acoustic evidence and phonetic knowledge to be represented in similar ways, so that like can be compared with like. The process of building the Speech Sketch begins with spectrogram image processing and goes on to exploit elementary phonetic constraints. A multiscale approach is used throughout. The process of interpreting the Speech Sketch makes use of an object-oriented phonetic knowledge base. Objects in the knowledge base can be matched against objects in the Speech Sketch in a manner directed by the incoming evidence. This technique promises to avoid a combinatorial explosion.

P. D. Green, M. P. Cooke, H. H. Lafferty, A. J. H. Simons
On Finding Objects in Spectrograms: A Multiscale Relaxation Labelling Approach

This paper describes a new technique for finding objects in spectrograms, and illustrates the idea with an application to the formant-tracking task. Starting with a multiscale representation of speech spectra, a probabilistic relaxation labelling algorithm is applied to determine primitive interpretations of the spectral components. Finally, a cross-scale integration procedure enables the scale space to be collapsed in a principled manner. The techniques are illustrated with an example of voiced speech.

M. P. Cooke, P. D. Green
Phonetic Segmentation Using Psychoacoustic Speech Parameters

In this paper we describe a phonetic segmentation-before-classification algorithm. It is based on the inspection of temporal loudness patterns extracted by an acoustic front end incorporating knowledge about the human auditory system. In a speaker-independent phonetic segmentation task more than 92.5% of the phonetic boundaries were correctly detected, and no more than 10% false alarms were counted.

J.-P. Martens
Morphological Representation of Speech Knowledge for Automatic Speech Recognition Systems

This work proposes a technique to capture speech knowledge which is available in spectrograms by considering the spectrogram as a scene. A simple pattern analysis technique applied to these patterns reveals significant properties which are relevant to transitions of the vocal tract as well as being speaker independent in nature. This process is labeled under Biological Vision, since a biological vision system uses a global recognition strategy by considering the image as a “whole”. The recognition processor, the brain, uses symbols and symbolic relationships in the image for image interpretation. Also, the knowledge base consists of symbols as well as symbolic relationships of objects in its long-term memory. In order to give the machine a capability similar to that of biological vision systems, the pattern of a speech spectrogram is described as a morphology of symbols and symbolic relationships. Such symbols are then used for final hypothesis generation by statistical means.

Mathew J. Palakal
Speaker-Independent Automatic Recognition of Plosive Sound in Letters and Digits

This paper reports an experiment on a speaker-independent automatic recognition system for plosive sounds. Experimental results in the recognition of plosive sounds for the set [bdev], which is the set that gives the worst result, on 20 speakers are presented.

Régis Cardin
A Real-Time Auditory Model: Description and Applications in an ASR Acoustic-Phonetic Front End

A model of peripheral auditory processing is implemented using a Texas Instruments TMS 320C25 Digital Signal Processor mounted on an IBM PC-AT compatible microcomputer. The TMS 320C25, running at 40 MHz, performs both the spectral analysis of a sampled signal and the auditory transformation at a speed that allows for real-time presentation of an “auditory spectrogram” of any audio signal. Such a representation would serve as the basis for parameterization, feature extraction, etc., in an ASR system.

Frank Gooding, Ian Shaw, Husein Mahdi
A New Phonematic Approach to Speech Recognition

A phonematic recognition system is presented. It works in only one step, without precategorical classification or segmentation. The system analyses a frame of 12.8 ms of voice and decides the phoneme to which it belongs. A piece of speech can be considered as a series of input symbols (phoneme frames) transmitted through an information channel (the recognition system) to give a series of output symbols. In this work the recognition system is characterized as an information channel in order to compute an error rate.

A. J. Rubio-Ayuso, J. M. Herrera-Garrido
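The channel view described in this abstract can be illustrated with a small sketch. It is only an illustration under stated assumptions: the abstract does not say how the error rate is computed, so the sketch assumes a table of input/output frame counts (a confusion table) and takes the error rate to be the fraction of frames whose output symbol differs from the input symbol. The function name `channel_error_rate` and the example counts are hypothetical.

```python
def channel_error_rate(counts):
    """Fraction of frames whose output symbol differs from the input symbol.

    `counts[x][y]` is the number of frames with true phoneme `x` that the
    recogniser labelled `y` (an assumed representation of the channel).
    """
    total = sum(sum(row.values()) for row in counts.values())
    if total == 0:
        raise ValueError("the count table is empty")
    # Frames on the diagonal (input symbol == output symbol) are correct.
    correct = sum(row.get(symbol, 0) for symbol, row in counts.items())
    return (total - correct) / total


# Hypothetical counts for two phonemes, 20 frames in all.
counts = {
    "a": {"a": 8, "e": 2},
    "e": {"a": 1, "e": 9},
}
rate = channel_error_rate(counts)  # 3 misrecognised frames out of 20
```

Normalising each row of such a table would give the channel's transition probabilities, from which information-theoretic quantities could also be derived.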
Primary Perceptual Units in Word Recognition

Results from different experiments concerning word similarity, word identification and manipulation of single sounds and findings from first language acquisition research are discussed as to their relevance for an understanding of which perceptual units are primary in word recognition. It is argued that units of different sizes may serve as primary perceptual units equally well depending on certain characteristics of the perceptual situation.

Walter F. Sendlmeier
Context-Dependent Phone Markov Models for Speech Recognition

In our automatic dictation system, which accepts sentences pronounced in isolated-syllable mode, the acoustic decoder is classically based on 40 phonetic Markov machines. These models obviously do not account for coarticulatory effects. In this paper, we deduce general principles about coarticulation and automatic phoneme recognition. This allows us to classify the right contexts of the consonants into relevant classes, and to keep the resulting number of models reasonable for training and robustness. The new contextual system achieves a significant improvement in phoneme recognition performance.

Anne-Marie Derouault
Speech Recognition Based on Speech Units

In a classical quantization system, each vector is represented by the nearest centroid; two vectors belonging to the same class are then indistinguishable. In order to mitigate this situation, we take into account the two nearest neighbours and define a “belonging degree” calculated from the distances between the vector and the two centroids. In the case of a speaker-independent speech recognition system, this “fuzzy” quantization gives better results.

G. Zanellato
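The two-nearest-centroid idea in this abstract can be sketched as follows. The abstract does not give the formula for the “belonging degree”, so the weighting used here (each of the two nearest centroids receives a degree proportional to the distance to the other one, so that the degrees sum to 1) is an assumption, as are the function name `fuzzy_quantize` and the toy codebook.

```python
import math


def fuzzy_quantize(vector, centroids):
    """Return the two nearest centroids with membership degrees summing to 1."""
    if len(centroids) < 2:
        raise ValueError("need at least two centroids")
    # Euclidean distance from the vector to every centroid, nearest first.
    dists = sorted((math.dist(vector, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = dists[0], dists[1]
    total = d1 + d2
    if total == 0.0:
        # The vector coincides with both nearest centroids.
        return [(i1, 0.5), (i2, 0.5)]
    # Assumed weighting: the closer centroid gets the larger degree.
    return [(i1, d2 / total), (i2, d1 / total)]


# Hypothetical 2-D codebook with three centroids.
codebook = [(0.0, 0.0), (4.0, 0.0), (0.0, 10.0)]
memberships = fuzzy_quantize((1.0, 0.0), codebook)
# memberships == [(0, 0.75), (1, 0.25)]
```

Unlike hard quantization, two vectors that fall in the same cell now get different membership pairs unless they are equally placed relative to the two nearest centroids.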

Word Recognition

Invited Papers

Mathematical Foundations of Hidden Markov Models

Stochastic methods of signal modeling have become increasingly popular. There are two strong reasons why this has occurred. First the models are very rich in mathematical structure and hence can form the theoretical basis for use in a wide range of applications. Second the models, when applied properly, work very well in practice for several important applications. In this paper we attempt to carefully and methodically review the theoretical aspects of one type of stochastic modelling, namely hidden Markov models (HMM’s), and show how they have been applied to a couple of problems in machine recognition of speech.

Lawrence R. Rabiner
Computer Recognition of Spoken Letters and Digits

Recent results on Automatic Speech Recognition (ASR) and Speech Analysis suggest that progress in designing recognition devices and in advancing speech science knowledge may arise from an integration of the so-called cognitive and information-theoretic approaches /LEVINSON 85/.

Renato De Mori
Recognition of Words in Very Large Vocabulary

Word pre-selection by means of partial phonetic descriptions is a method of lexical access in speech recognition systems for very large vocabularies that is receiving particular attention. It can be effective provided that segmentation errors are taken into account within the lexical access procedure, and that the resulting candidate word set is reasonably sized. As errors in the segmentation of input utterances are unavoidable, even if only a limited number of phonetic categories must be discriminated, a lattice of segmentation hypotheses is generated. Word pre-selection is obtained, therefore, by matching a lattice of phonetic hypotheses against a graph structure that represents a generic word. A Dynamic Programming procedure is introduced that solves this problem. A sub-optimal solution and heuristic constraints have been investigated that improve the algorithm efficiency. In the second step, word verification, a detailed representation of the phonemic structure of word candidates is used for estimating the most likely words. Words are modeled by sequences of sub-word units represented by Hidden Markov Models and a beam-search Viterbi algorithm estimates their likelihood. Experimental results on large vocabularies demonstrate the effectiveness of the method.

P. Laface, G. Micca, R. Pieraccini

Contributed Papers

Isolated Word Recognition Using Hidden Markov Models

An automatic speaker independent isolated word recognition system based on continuous hidden Markov models is presented using multidimensional spherically invariant density functions which describe the statistical properties of words. Different types of density functions are applied to represent the observed data. In simulations the recognition performance depending on these density functions is evaluated.

S. Euler, D. Wolf
Isolated Digit Recognition Using the Multi-Layer Perceptron

This paper introduces the multi-layer perceptron (MLP) as a new approach to isolated digit recognition. A comparison is made with hidden Markov modelling (HMM) techniques applied to the same data. The experimental results show that the performance of the multi-layer perceptron is comparable with that of hidden Markov modelling.

S. M. Peeling, R. K. Moore, A. P. Varga
Use of Procedural Knowledge for Spoken Letters and Digits Recognition

A Procedural Network (PN) can be described with a formalism similar to that used for an Augmented Transition Network Grammar (ATNG). This formalism has been successfully used for Natural Language and Pattern Recognition [8]. A PN is a 5-tuple $$PN = \{ j, Q, A, q_0, q_f \} \tag{1}$$ where $j$ is the network identifier, $Q$ is a finite set of states, $A$ is a finite set of directed arcs, $q_0 \in Q$ is the initial state and $q_f$ is the final state. Without any loss of generality we consider only PNs with a single initial state and a single final state.

Ettore Merlo
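The 5-tuple definition of a Procedural Network in this abstract maps naturally onto a small data structure. The sketch below is only one possible reading: the class names `Arc` and `ProceduralNetwork`, the arc `label` field, and the requirements that the final state and all arc endpoints belong to the state set are assumptions that the abstract does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    source: str  # state the arc leaves
    target: str  # state the arc enters
    label: str   # hypothetical payload; the abstract does not specify it


@dataclass(frozen=True)
class ProceduralNetwork:
    identifier: str    # j, the network identifier
    states: frozenset  # Q, a finite set of states
    arcs: tuple        # A, a finite collection of directed arcs
    initial: str       # q_0, the single initial state
    final: str         # q_f, the single final state

    def __post_init__(self):
        # The text states q_0 in Q; q_f in Q and arc endpoints in Q are
        # assumptions of this sketch.
        if self.initial not in self.states or self.final not in self.states:
            raise ValueError("initial and final states must belong to Q")
        for arc in self.arcs:
            if arc.source not in self.states or arc.target not in self.states:
                raise ValueError(f"arc {arc} leaves the state set")

    def successors(self, state):
        """States reachable from `state` by following one arc."""
        return [arc.target for arc in self.arcs if arc.source == state]


# Hypothetical three-state network.
pn = ProceduralNetwork(
    identifier="digit",
    states=frozenset({"q0", "q1", "qf"}),
    arcs=(Arc("q0", "q1", "onset"), Arc("q1", "qf", "vowel")),
    initial="q0",
    final="qf",
)
```

Calling `pn.successors("q0")` follows the single arc out of the initial state.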
Real-Time Large Vocabulary Word Recognition via Diphone Spotting and Multiprocessor Implementation

This paper describes Elsag’s Large Vocabulary Isolated Word Recognition system DSPELL. The system makes use of a diphone-based speech model and an extremely efficient word decoding algorithm, and is implemented on Elsag’s multiprocessor EMMA-21. DSPELL requires a very convenient training session and features a high recognition performance and real-time response on lexicons of up to 2,000 words.

C. Scagliola, A. Carossino, A. M. Colla, C. Favareto, P. Pedrazzi, D. Sciarra, C. Vicenzi
Speech Recognition With Difficult Dictionaries

In this paper we introduce a new generalization of the Generalized Linear Discriminant Function, in which the dimension of the representation space depends on the class. This generalization is applied to Speech Recognition with Difficult Dictionaries, leading to a (preliminary) error rate reduction of over 50% with respect to the standard approach.

F. Casacuberta, E. Vidal
Recent Results on the Application of a Metric — Space Search Algorithm (AESA) to Multispeaker Data

Isolated Word Recognition (IWR) is usually approached by Minimum Distance classifiers [3] which rely on Dynamic Time Warping (DTW) procedures to compute appropriate Dissimilarity Measures between the uttered test words and the reference utterances (prototypes) stored in a dictionary [6] [1]. As the time complexity of these methods is linear in the size of the dictionary, they are often considered “efficient” in the literature. However, even linear time is, in practice, fairly inadequate for the problem at hand; first, because of the relatively large time complexity of the DTW procedures, and second, because it is quite “unnatural” (and also undesirable) that the time response of a speech recognition device be proportional to the number of words of its vocabulary.

Enrique Vidal, M. José Lloret
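The baseline that this abstract criticises, a minimum-distance classifier that runs one DTW comparison per dictionary prototype, can be sketched as follows. This is the exhaustive linear search, not the AESA algorithm named in the title. The sketch uses a textbook DTW recurrence on one-dimensional sequences with an absolute-difference local cost; the function names, the cost choice and the toy prototypes are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Textbook DTW dissimilarity between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b frame repeated
                cost[i][j - 1],      # b advances, a frame repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def classify(test, prototypes):
    """Label of the nearest prototype; one DTW per dictionary entry."""
    return min(prototypes, key=lambda p: dtw_distance(test, p[1]))[0]


# Hypothetical dictionary of (label, feature sequence) prototypes.
prototypes = [("one", [1, 2, 3]), ("two", [3, 2, 1])]
label = classify([1, 2, 3, 3], prototypes)
```

Each call to `classify` costs one quadratic-time DTW per prototype, which is the linear dependence on vocabulary size that the abstract calls undesirable.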
Robust Features for Word Recognition

This paper presents a new approach to automatic recognition of spoken words. After discussing the demands upon appropriate subword units and reporting some experiments in using phone superclasses for word recognition, we will develop techniques of robust classification, segmentation, and lexical access, utilizing binary phonetic features as processing units.

E. Günter Schukat-Talamazzini
Statistical Analysis of Left-to-Right Parser for Word-Hypothesing

The recognition rate for fluently spoken speech achieved with beam search algorithms depends on factors such as the perplexity and the recombination structure of the language model, the difficulty of the vocabulary, the quality of the acoustic-phonetic decoder, and the chosen pruning strategy. In this paper a statistical model taking these factors into account is developed. An iterative algorithm to calculate the score distribution of beam search paths and the resulting sentence error rate is presented.

H. Höge, E. Marschall
Overview of Speech Recognition in the ‘SPICOS’ System

In this paper, a recognition technique used in the ‘SPICOS’ project is described. It is based on an integrated approach that combines the various knowledge sources, such as the inventory of subword units, the pronunciation lexicon and the language model, during the process of decision making in order to improve the reliability of the acoustic recognition. The recognition decision amounts to a search through a large state space with delayed decisions. The speaker-dependent recognition tests are performed on a speech data base comprising 3 sessions of each of 5 speakers. A session consists of 200 sentences and amounts to 1391 word samples.

H. Ney, D. Mergel, A. Noll, A. Paeseler
An Experimental Environment for Generating Word Hypotheses in Continuous Speech

We present a flexible environment for the generation of word hypotheses in continuous speech. After describing the interface to the other modules of our speech understanding system, a word spotting technique based on HMM will be discussed. The generation of reference models for the matching procedure is done automatically using the standard pronunciation of a word and a set of phonological rules about intra-word assimilation. These alternative pronunciations are represented by graphs with labeled edges. Some preliminary results for parameter training and the matching procedure are also given.

S. Kunzmann, T. Kuhn, H. Niemann
Application of the Error Correcting Grammatical Inference Method (ECGI) to Multi-Speaker Isolated Word Recognition

It is well known that speech signals constitute highly structured objects which are composed of different kinds of subobjects such as words, phonemes, etc. This fact has motivated several researchers to propose different models which more or less explicitly assume the structural nature of speech. Notable examples of these models are Markov models /Bak 75/, /Jel 76/; the famous Harpy /Low 76/; Scriber and Lafs /Kla 80/; and many other works in which the convenience of some structural model of the speech objects considered is explicitly claimed /Gup 82/, /Lev 83/, /Cra 84/, /Sca 85/, /Kam 85/, /Sau 85/, /Rab 85/, /Kop 85/, /Sch 85/, /Der 86/, /Tan 86/.

E. Vidal, N. Prieto, E. Sanchis, H. Rulot
Multi-Speaker Experiments with the Morphic Generator Grammatical Inference Methodology

Grammatical Inference (GI) is the learning or model estimation phase required by any Syntactic approach to Pattern Recognition (PR). Some fundamental results on GI have been known since the 60’s through the works by Gold (1967) and Feldman (1972), which established that the decidability of any (even regular) GI problem depends largely upon the availability of both an adequate positive sample R+ of strings known to have been generated by the unknown Grammar, and an equally adequate negative sample R- of strings not generated by that Grammar. Despite these results being commonly recognized, taking negative samples into account leads, in general, to intractable GI problems (see /Angluin,78/) and, consequently, most recent works on GI only use positive samples and aim just at giving practical solutions to specific PR problems (see e.g. /Angluin,83/ /Fu,75/). Clearly, this paradigm is not a very appealing one, and some general methodology seems to be strongly required.

E. Vidal, E. Segarra, P. García, I. Galiano
A New Approach to Template Selection for Speaker Independent Word Recognition

This study explores the possibility of using the Condensed Nearest Neighbor (CNN) rule for classification in various word recognition problems. A modified version of the “Condensing” algorithm combined with an “Editing” algorithm is implemented to select the reference templates for a speaker-independent isolated word recognition problem. It is shown that these algorithms improve the recognition rate in comparison to using clustering techniques for template selection.

N. Yalabik, F. Yarman-Vural, A. Mansur
Dynamic Spectral Adaptation of Automatic Speech Recognizers to New Speakers

This is an overview of approaches to automatically reduce the intra- and inter-speaker variability in the speech signal. Attention is focussed on particular methods to adapt standard Automatic Speech Recognizers (ASR) to new users, taking into account their specific acoustical characteristics. Speaker normalization and adaptation techniques for template-based recognizers are described. They are based on spectral normalization, on the learning of spectral transformations and on code book substitution for a VQ front end.

Gérard Chollet, Khalid Choukri
Towards Speaker-Independent Continuous Speech Recognition

Speaker-independent continuous speech recognition is an extremely difficult task. In this paper, we analyze the nature of its difficulty. Moreover, we propose a new approach to speaker-independent continuous speech recognition through the use of hidden Markov models, context-dependent phonetic units, perceptually motivated parameters, and two speaker-adaptation algorithms. Finally, we present some preliminary results, and outline future plans.

Kai-Fu Lee
Evaluating Speech Recognizers and Data Bases

The problems of evaluating speech recognizers are considered from several points of view. The value of making large data bases available to the research and development community is acknowledged. Some recommendations are made about the size and content of data bases distributed for test purposes. A test workstation implemented on a PC is described. Reference speech recognition algorithms are available on this workstation. Examples of several applications are given.

Gérard Chollet, Claude Montacie

Linguistic Processing

Invited Papers

On-line Interpretation in Speech Understanding and Dialogue Systems

This paper addresses syntactic, semantic and pragmatic aspects of central concern in the design of on-line language understanding systems. In the area of syntax, the paper focuses on the phenomenon of discontinuous constituents. A form of syntactic representation is defined, called discontinuous trees, which allows the constituent structure of sentences with discontinuous constituents to be represented without changing the word order in the sentence. A kind of phrase-structure grammar is defined which can generate such representations, and an on-line parsing algorithm is presented which parses sentences into these representations. In the area of semantics, the paper focuses on the effective resolution of ambiguity and vagueness, and on the on-line interpretation of anaphora. For dealing with ambiguous and vague expressions in an effective, on-line manner a cascaded model-theoretic approach is developed which makes use of intermediate semantic representations in a formal language that preserves some of the ambiguity and vagueness in natural language. For the interpretation of anaphora, an approach is outlined that has recently been suggested by Groenendijk and Stokhof, based on the use of dynamic logic. In the area of pragmatics, the paper focuses on the analysis of information-exchange dialogues as consisting of communicative actions. The notion function of a communicative action is defined in terms of the flow of information between speaker and addressee. It is argued that Groenendijk and Stokhof’s dynamic approach to semantic interpretation can be elegantly combined with this ‘functional’ approach to communicative action into an integrated theory of utterance meaning. The resulting ‘dynamic interpretation theory’ is outlined, which offers exciting perspectives for a full-fledged, on-line utterance interpretation process.

Harry Bunt
Semantic Processing in Speech Understanding

This paper will try to evaluate the state of the art in semantic representation techniques as far as they are used in speech understanding systems. It will not deal with knowledge representation techniques in general, because only some of them are used, or are even usable, in a speech environment. First, a sketch of the problem domain will be given: recognition levels, search spaces and system design decisions are outlined, and from this, the task of linguistic analysis procedures is derived. Second, existing techniques to incorporate semantic information into the Speech Understanding task are presented and discussed, such as statistics, semantic markers, semantic grammars etc. Third, two approaches for Speech Understanding are discussed: predicate logic as representation as used in the SPICOS system, and semantic frames as used in the CMU-DARPA project. At the end, two proposals of possible system architectures are presented.

G. Thurmair
Knowledge Based Systems for Speech Understanding

Based on a detailed discussion of system architectures for knowledge-based speech understanding and knowledge representation techniques, criteria for both a knowledge representation scheme and system architectures are developed. Against this background a system is introduced which is organized around a homogeneous knowledge base. Both the knowledge representation language and the content of the knowledge base are described. The knowledge representation language covers not only declarative but also procedural knowledge. Analysis processes are guided by a flexible bottom-up top-down strategy. Besides the procedural semantics of the language, the A*-Algorithm is used for this purpose. Search graph nodes are judged by a vector which reflects knowledge-dependent and acoustic scores and which is admissible for the A*-Algorithm.

G. Sagerer, F. Kummert

Contributed Papers

Recognition of Speaker-Dependent Continuous Speech with Keal-Nevezh

A description of the speaker-dependent continuous speech understanding system KEAL-NEVEZH is given. An unknown utterance is recognized by means of the following procedures: acoustic analysis, phonetic segmentation and identification, and word and sentence analysis. This new system is an extension of the KEAL system, connected to ALOEMDA, an active chart parser modifying its strategy and linguistic capabilities. A speaker adaptation module allows some of the system parameters to be adjusted by matching known utterances with their acoustical representation. The task to be performed, described by its vocabulary and its grammar, is given as a parameter of the system. The new configuration is under test and first results are given. Continuously spoken sentences extracted from a “pseudo-LOGO” language are analysed with two different linguistic modules and recognition figures are presented.

G. Mercier, A. Cozannet, J. Vaissiere
Modification of Earley’s Algorithm for Speech Recognition

This paper describes an adaptation of Earley’s algorithm for the recognition of spoken sentences. Earley’s algorithm is one of the most efficient parsing algorithms for written sentences. The modifications and extensions of Earley’s algorithm required by the variability and ambiguity of the speech signal concern the scoring of word and sentence hypotheses and the application of a beam search or pruning technique.
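
The pruning step mentioned above can be sketched in Python as follows. This is only an assumed, simplified form of beam pruning over scored hypotheses, not the method of the paper: the function name `beam_prune`, the use of log-scores, and the example hypotheses are all hypothetical.

```python
def beam_prune(hypotheses, beam_width):
    """Keep hypotheses whose log-score lies within beam_width of the best one.

    hypotheses is a list of (hypothesis, log_score) pairs; higher is better.
    """
    if not hypotheses:
        return []
    best = max(score for _, score in hypotheses)
    return [(hyp, score) for hyp, score in hypotheses if score >= best - beam_width]

# Hypothetical competing partial sentence hypotheses with log-scores.
candidates = [("show me", -2.0), ("so me", -7.5), ("show my", -3.1)]
survivors = beam_prune(candidates, beam_width=2.0)
print(survivors)
```

A parser that scores its hypotheses could apply such a filter after each input position, so that only the better-scoring parse items are extended further. A narrow beam saves work at the risk of discarding the hypothesis that would eventually have won.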

Annedore Paeseler
Expectation-Based Speech Recognition

Expectation-based speech recognition is a knowledge-based approach to recognizing continuously spoken German speech. The knowledge base mainly consists of a large, comprehensive lexicon providing various kinds of information about each word of the registered vocabulary. During the recognition process numerous weighted hypotheses are generated that enter into competition with each other. To succeed in recognition, the task is to keep the number of alternative hypotheses low by making good choices among all alternatives. The S*-algorithm performs an optimal search strategy regarding phrase hypotheses and using the information given by categorial search trees.

J. Mudler, E. Paulus
Merging Acoustics and Linguistics in Speech-Understanding

The paper concentrates on the interaction between acoustic and linguistic analysis. Some recent approaches have tried to limit the number of acoustic hypotheses by successively applying syntactic and semantic restrictions to them. Other approaches have tried to have the acoustic recognition fully guided by these restrictions. The problems of both will be discussed. The paper will sketch a proposal for a closely linked, controlled interaction between linguistic prediction, acoustic recognition and linguistic verification in speech understanding.

Gerh. Th. Niedermair
Using Semantic and Pragmatic Knowledge for the Interpretation of Syntactic Constituents

In the speech understanding and dialogue system EVAR /NIEMANN 85/ linguistic knowledge is used for interpreting the meaning of a sentence and for finding the user’s intention in uttering it /BRIETZMANN 86/. Another very important point, especially for semantic and pragmatic knowledge, is to use it as early as possible to filter hypotheses produced during the different steps of analysis, for example word recognition and syntactic analysis. This can also be done for hypotheses about single syntactic constituents such as nominal or prepositional phrases. In this paper it is shown how special semantic features in our system are used for compatibility checks among the words of a syntactic constituent. In addition, complex constituents built up of several simple syntactic constituents are sought (like “the train | to Hamburg”). For all “semantic” constituent hypotheses found in this way, the possible referential objects are determined using information about the dialogue context; if no referent can be found, the hypothesis is rejected.

Ute Ehrlich, Heinrich Niemann
Task-Oriented Dialogue Processing in Human-Computer Voice Communication

For three years we have been developing a voice dialogue system capable of continuous speech recognition, natural language understanding, and task-oriented dialogue management. First, the architecture of the system is surveyed. Then, the dialogue component is described. Finally, a dialogue execution trace produced by a prototype implementation is discussed by way of illustration. We are developing a knowledge-based system capable of understanding oral task-oriented dialogues in a multi-speaker environment. This system should be able to process pseudo-natural sublanguages, that is, subsets of natural language with few syntactic restrictions and large vocabularies (several thousand words). Information centers provide a wide range of potential applications for the general public. In order to test and validate design choices, we are implementing an automatic Administrative Information Center. We will first survey the overall architecture of the system, which is fully described in /CAR 86b/. Then we shall focus on the dialogue component.

N. Carbonell, J. M. Pierrel
Experimentation in the Specification of an Oral Dialogue

With the aim of achieving a system for oral interrogation of a telephone information type application, we have carried out experimental work on dialogue between subjects and a simulated machine. This work was carried out in two phases in order to yield a definition of the dialogue that makes it possible to undertake its implementation. In the first phase we used both a highly constrained dialogue and an unrestricted dialogue. The second phase was based on the conclusions drawn from the previous dialogues and on a strategy for cooperation. In this paper we describe and analyse both phases. Particular attention is paid to the dialogue strategy (cooperation, phatic management).

M. Guyomard, J. Siroux
Backmatter
Metadata
Title
Recent Advances in Speech Understanding and Dialog Systems
edited by
H. Niemann
M. Lang
G. Sagerer
Copyright Year
1988
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-83476-9
Print ISBN
978-3-642-83478-3
DOI
https://doi.org/10.1007/978-3-642-83476-9