Review of Basic Algorithms

An Overview of Digital Techniques for Processing Speech Signals

This paper discusses the major digital signal processing methods used in processing speech signals. Basic tools, such as the discrete Fourier transform, the z transform, and linear filter theory, are briefly introduced first. A general view of fast transformation algorithms is given, together with the most widely used fast transforms. Linear prediction is then described with a particular emphasis on its lattice structure. A brief introduction to homomorphic processing for multiplied and convolved signals and to its applications in speech processing is given. After recalling some fundamentals of the speech signal, various speech analysis and synthesis models are described, showing which kinds of processing methods are involved. Finally, two aspects of speech recognition are presented: feature extraction and pattern matching using dynamic time warping.
Murat Kunt, Heinz Hugli

Systems for Isolated and Connected Word Recognition

This lecture is intended to provide an insight into some of the algorithms and techniques that lie behind contemporary automatic speech recognition systems. It is noted that, due to the lack of success of earlier phonetically motivated approaches, the majority of current speech recognizers employ whole-word pattern matching techniques. It is pointed out that these techniques, although rather shallow in concept, have enabled the development of commercial recognizers which exhibit useful and practical capabilities. A range of whole-word pattern matching algorithms is discussed, and in particular, key techniques such as dynamic time warping and hidden Markov modelling are explained in some detail. It is also shown how techniques for isolated word recognition may be extended to recognize connected speech. Each of the various methods is reviewed in the context of its computational implications as well as its recognition performance. It is also shown how suitable modifications to the basic algorithms can facilitate real-time operation. Where possible, specific techniques are highlighted by reference to existing commercial recognition equipment. The lecture concludes by focusing on the key factors which limit the performance of current recognition techniques, and by outlining some of the research work which may be relevant to future automatic speech recognition systems.
Roger K. Moore
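
As an illustration of the dynamic time warping technique named in the abstract above, the following is a minimal sketch only, not the formulation used in the lecture. It assumes each frame is a single number and uses the absolute difference as the local distance; a real recognizer would compare vectors of spectral features. The function name dtw_distance is hypothetical.

```python
def dtw_distance(a, b):
    """Return the minimum cumulative distance over all monotonic alignments of a and b.

    Assumption for illustration: frames are scalars and the local distance
    is the absolute difference between two frames.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allowed predecessor moves: advance in a only, in b only, or in both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


# A template and a time-stretched version of it align at zero cost.
template = [1, 2, 3]
stretched = [1, 2, 2, 3]
print(dtw_distance(template, stretched))
```

In an isolated word recognizer of the whole-word kind, a distance like this would be computed between the input and every stored template, and the word with the smallest distance would be chosen.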

System Architecture and VLSI for Speech Processing

Systolic Architectures for Connected Speech Recognition

The concept of a systolic algorithm is introduced, and three speech recognition algorithms are presented, each together with a systolic version. The first algorithm is based on the dynamic time warping algorithm, which is applied directly to acoustic feature patterns. The second algorithm is the probabilistic matching algorithm, which requires that the input sentence be preprocessed by a phonetic analyzer. The third algorithm is the connected word recognition algorithm that finds the best matching word sequence. Finally, the architecture of a speech recognition machine using a VLSI chip, called API89, as the basic processor is presented.
Patrice Frison, Patrice Quinton

Computer Systems for High-Performance Speech Recognition

The object of this review and position paper is to draw a parallel between the field of high-performance speech recognition systems on one side and the field of computer systems (in particular computer systems for Artificial Intelligence applications) on the other. This paper gives a characterization of the computational requirements of speech recognition systems, and describes and exemplifies the classes of machines that could be useful in speeding up speech recognition systems.
Roberto Bisiani

VLSI Architectures for Recognition of Context-Free Languages

This paper presents two VLSI architectures for the recognition of context-free languages based on Cocke-Younger-Kasami’s and Earley’s algorithms. By restricting the context-free grammar to be free of null rules, it is possible to implement the two algorithms on triangular-shaped VLSI systems. For both parsing algorithms, the designed VLSI systems are capable of recognizing a string of length n in 2n time units. Extensions to the recognition of regular tree languages and finite-state languages are also discussed.
K. S. Fu

Implementation of an Acoustical Front-End for Speech Recognition

We describe the implementation of a programmable general-purpose acoustical front-end for speech recognition; its design takes into account, as an example, the algorithm of centisecond cepstrum extraction for an acoustical signal sampled at a maximum rate of 12.8 kHz.
It consists of three boards, a master board controlled by a general purpose microprocessor, a slave board containing two digital signal processors working in parallel and an input/output analog board.
The overall system is connected to a general-purpose minicomputer, which constitutes the system host. The implementation details and its rationale (mainly reprogrammability and performance) are outlined. In cases of more demanding applications, the system could also be hardware reconfigured with cascade or parallel sections.
Michele Cavazza, Alberto Ciaramella, Roberto Pacifici

Reconfigurable Modular Architecture for a Man-Machine Vocal Communication System in Real Time

Man-machine vocal communication requires autonomous, application-adaptable real-time systems whose cost is both reasonable and proportional to their efficiency.
Realising a vocal terminal with such characteristics makes it necessary to choose a parallel architecture.
We present an architecture which combines parallelism and pipelining. It makes it possible to exploit fully the execution parallelism inherent in the class of applications that process a continuous data flow.
It is a modular architecture, statically reconfigurable, functionally distributed, driven by the data, and with a multi-level hierarchical control.
D. Dours, R. Facca

A Survey of Algorithms & Architecture for Connected Speech Recognition

The paper surveys dynamic-programming-based connected speech recognition algorithms and architectures. A discussion of the computational complexities of the algorithms is given and suggests that the single pass algorithm of Bridle et al. is the most suitable for real-time operation. Currently available architectures for dynamic programming are discussed and it is shown that these are not suitable for the single pass algorithm in their present form. An alternative linear systolic architecture is described which is capable of matching vocabularies of up to 25000 words, in real time, using the single pass algorithm. The linear array has simpler data flows and uses fewer processing elements than most existing systolic structures.
D. Wood

Software Systems for Automatic Speech Recognition

Knowledge-Based and Expert Systems in Automatic Speech Recognition

Artificial Intelligence (AI) has recently advanced to the point that practical applications now exist in several domains. Most of the results obtained are not due to general problem solving techniques but to the use of specific, domain-dependent knowledge. Formalizing and incorporating specific knowledge into a system makes it possible to reach a level of expertise comparable to that of a human expert in some specialized field. Such knowledge-based and expert systems have been extensively used in various domains such as chemistry, medicine, and geology. The basic idea in these systems is to clearly distinguish between the knowledge base, which usually incorporates rules and meta-rules about the domain of expertise, and the control structures which manipulate this knowledge. That ensures great modularity and flexibility and makes it easy to modify and update a system [11].
Jean-Paul Haton

The Speech Understanding and Dialog System Evar

This paper gives an overview of a research effort whose goal is to develop a system which can carry out a dialog concerning a particular task domain using continuous German speech for input and output. The main processing phases are initial segmentation and labeling, finding words, understanding the meaning and giving an answer. Specialized processing modules for handling these four phases were developed or are being developed. The processing modules communicate via a common database.
H. Niemann, A. Brietzmann, R. Mühlfeld, P. Regel, G. Schukat

A New Rule-Based Expert System for Speech Recognition

SERAC (Expert System for Acoustic-phonetic Recognition) is an Expert System that applies Artificial Intelligence techniques to Automatic Speech Recognition.
G. Mercier, M. Gilloux, C. Tarridec, J. Vaissiere

SAY — A PC based Speech Analysis system

This paper is concerned with an experimental system of value to anyone interested in speech research in general, and in particular to those interested in speech input and output by computer. At the IBM UKSC we are building a system capable of converting text data to natural sounding speech. This embodies many of the features of an expert system since the system must understand and use the same rules of spelling, syntax, intonation, pronunciation and phonetics that a human speaker draws upon when talking. In building this system we must have a detailed understanding of normal human speech and a means of analysing synthetic speech to enable us to quantify the factors that determine intelligibility and acceptability. To achieve this we need a knowledge of both the physics and anatomy of speech production in the human articulatory system, and of the speech signal itself. We will need techniques for analysing synthetic speech and comparing it with its natural counterpart. An understanding of the process of speech perception, and of which parts of the speech signal carry the important perceptual information, is also relevant. A suitable system for the analysis of speech signals is thus an essential tool in this project and it is the development of such a speech analyser that is the subject of this paper.
P. R. Alderson, G. Kaye, S. G. C. Lawrence, D. A. Sinclair, B. J. Williams, G. J. Wolff

Automatic Generation of Linguistic, Phonetic and Acoustic Knowledge for a Diphone-Based Continuous Speech Recognition System

An important issue in template-matching continuous-speech recognition systems is the right choice of the language model, together with an appropriate definition of the basic units to be recognized. The advantages of using a hierarchical transition network model with diphones and diphone-like elements as basic units are illustrated in the paper. However, a severe drawback in the use of sub-word units is an increased complexity in producing and managing the overall knowledge relating to language representation and template definition and extraction. An efficient solution to this problem is required especially when the recognition system is to be used by unskilled users in actual applications. For this purpose we have developed an automatic procedure for generating the linguistic, phonetic and acoustic data bases expressing the whole information required by the diphone-based system.
Anna Maria Colla, Donatella Sciarra

The Use of Dynamic Frequency Warping in a Speaker-Independent Vowel Classifier

A dynamic frequency warping (DFW) algorithm has been used to match the spectra of the vowels of one speaker against the spectra of vowels of different speakers. Although the method resulted in a transformation which produced a good match, it was not accurate as a speaker-independent vowel classifier. With reference spectra from the vowels of male speakers and test spectra from the vowels of female speakers, and vice versa, the recognition scores were only 33%, whilst with reference and test spectra from different utterances of the same speaker the mean score was 96%.
Various parameters of the spectra and the DFW algorithm have been studied. It was found that limiting the frequency range of the spectra to approximately telephone bandwidth (250–3200 Hz) increased the male-female scores by about 6%. Changing the frequency scale to barks or reducing the order of the linear prediction analysis reduced the recognition scores. Adjusting the warping window in the dynamic programming algorithm so that it was 160 Hz wide above the diagonal raised the male-female recognition score to 48.6%.
W. A. Ainsworth, H. M. Foster

Dynamic Time Warping Algorithms for Isolated and Connected Word Recognition

In this paper we present a new formulation of the dynamic programming recursive relations both for word and connected word recognition that permits relaxation of boundary conditions imposed on the warping paths, while preserving the optimal character of the dynamic time warping algorithms.
J. di Martino

An Efficient Algorithm for Recognizing Isolated Turkish Words

In this study, a computationally efficient speaker-independent isolated word recognition system for the Turkish language is designed and implemented. The approach used is a combination of whole-word matching techniques with segmentation into phonetic units before classification. Linear Predictive Coding (LPC) coefficients for an eight-pole model of the short-time signal are used as feature vectors. Computational costs are reduced by a two-step classification strategy where unlikely words are eliminated in the first step by comparing only the first syllable. The Dynamic Time Warping (DTW) method is used in comparisons at both levels.
The CPU time spent on word comparisons is reduced by about 40% compared with a one-step whole-word classification, without degrading system performance.
Neşe Yalabik, Fatih Ünal

A General Fuzzy-Parsing Scheme for Speech Recognition

In this paper a Speech Recognition Methodology is proposed which is based on the general assumption of ‘fuzziness’ of both speech-data and knowledge-sources. Besides this general principle, there are other fundamental assumptions which are also the bases of the proposed methodology: ‘Modularity’ in the knowledge organization, ‘Homogeneity’ in the representation of data and knowledge, ‘Passiveness’ of the ‘understanding flow’ (no backtracking or feedback), and ‘Parallelism’ in the recognition activity.
The proposed methodology is formally presented, and algorithms to develop actual systems on general purpose hardware are given. An implementation example as well as the results obtained with it are also presented.
Enrique Vidal, Francisco Casacuberta, Emilio Sanchis, Jose M. Benedi

Speech Synthesis and Phonetics

Linguistics and Automatic Processing of Speech

To build a bridge it is helpful to know something about the physics of stress and vibration, about materials science, and so on. To cure sickness, it is helpful to know something about the nature and causes of disease, i.e., microbes, metabolism, etc. This is not to say that it is impossible to build bridges and cure sickness without such knowledge. Indeed, some bridges were built and some sicknesses cured centuries before anyone knew anything about the physical and physiological principles involved. However, it must be admitted that the bridges were modest, most sickness was not alleviated and successes in both areas owed more to trial and error or an intuitive understanding of the relevant principles than to any sort of systematic, scientific, knowledge. From the time that knowledge in these areas was put on a firm scientific basis, taking into account the factors underlying the behavior of materials and microbes, much more impressive bridges and cures were possible.
John J. Ohala

Synthesis of Speech by Computers and Chips

Speech of unlimited vocabulary can be synthesized by systems which can produce the basic set of about 40 phonetic sounds of a language. However, the quality of the speech output is highly dependent on the method used in the system. Good synthesized speech requires good phonetic rules to give an accurate transcription of the input texts. An accurate unrestricted text-to-speech algorithm will be described. An evaluation of speech synthesizers and the experimental results will be discussed. Further addition of prosodic features will make the synthesized sound more natural. Some new systems developed in recent years and their characteristics will be presented.
Ching Y. Suen, Stephen B. Stein

Prosodic Knowledge in the Rule-Based Synthex Expert System for Speech Synthesis

Speech synthesis is the transformation of a written text into an acoustic signal.
A. Aggoun, C. Sorin, F. Emerard, M. Stella

Syntex — Unrestricted Conversion of Text to Speech for German

This paper is intended to give an overview of the SYNTEX system, a text-to-speech software package for the German language designed to control phoneme synthesizers. Descriptions of the algorithms used for word structure analysis, letter-to-sound conversion, computing of word accent, sentence parsing, and generating an intonation contour are given. The software runs on a small microprocessor system much faster than real-time.
Wolfgang Kulas, Hans-Wilhelm Rühl

Concatenation Rules for Demisyllable Speech Synthesis

A system for speech synthesis by rule is described which uses demisyllables as phonetic units. The problem of concatenation is discussed in detail; the pertinent stage converts a string of phonetic symbols into a stream of speech parameter frames. For German about 1650 demisyllables are required to permit synthesizing a very large vocabulary. Synthesis is controlled by 18 rules which are used for splitting up the phonetic string into demisyllables, for selecting the demisyllables in such a way that the size of the inventory is minimized, and, last but not least, for concatenation. The quality and intelligibility of the synthetic signal are very good; in a subjective test the median word intelligibility dropped from 96.6% for an LPC vocoder to 92.1% for the demisyllable synthesis, and the difference in quality between the demisyllable synthesis and ordinary vocoded speech was judged very small.
Helmut Dettweiler, Wolfgang Hess

On the Use of Phonetic Knowledge for Automatic Speech Recognition

A distributed rule-based system for automatic speech recognition is described.
Acoustic property extraction and feature hypothesization are performed by the application of sequences of operators. These sequences, called plans, are executed by cooperative expert programs.
Experimental results on the automatic segmentation and recognition of phrases made of connected letters and digits are described and discussed.
Renato De Mori, Pietro Laface

Demisyllables as Processing Units for Automatic Speech Recognition and Lexical Access

This paper describes a number of experimental investigations into syllable-based acoustic-phonetic analysis of German words; these methods can be used as a basic processing stage in a system for automatic speech recognition as well as for speech understanding. In this connection the importance of the syllable in speech processing by man and machine will first be discussed. Then several methods and experiments are presented involving segmentation into syllables and recognition of vowels and consonant clusters, as well as two methods for lexical access and lexical search using these units. The search in the lexicon is necessary in order to find the word in a word-list corresponding to the units recognized, or alternatively to determine the most similar word. The most salient feature of this system is that so-called demisyllables are used as the processing units.
G. Ruske

Detection and Recognition of Nasal Consonants in Continuous Speech — Preliminary Results

The problem of the detection, recognition or perception of nasal consonants in continuous speech signals is not frequently treated in the literature. Among the most significant works on this subject should be mentioned the classical ones such as [5,9,4] or, more recently, [11,3,2]. The principal reason that there are relatively few papers devoted to this subject is, perhaps, its complexity. In fact, from the acoustical point of view, the spectral structure of nasal consonants is rather complicated because their spectral envelope is formed not only by the resonances of the pharyngeal and nasal cavities but also by the anti-resonances of the oral cavity, which is closed during the articulation of nasals.
R. Gubrynowicz, L. Le Guennec, G. Mercier