
About this book

It is well known that phonemes have different acoustic realizations depending on the context. Thus, for example, the phoneme /t/ is typically realized with a heavily aspirated strong burst at the beginning of a syllable as in the word Tom, but without a burst at the end of a syllable in a word like cat. Variation such as this is often considered to be problematic for speech recognition: (1) "In most systems for sentence recognition, such modifications must be viewed as a kind of 'noise' that makes it more difficult to hypothesize lexical candidates given an input phonetic transcription. To see that this must be the case, we note that each phonological rule [in a certain example] results in irreversible ambiguity: the phonological rule does not have a unique inverse that could be used to recover the underlying phonemic representation for a lexical item. For example, . . . schwa vowels could be the first vowel in a word like 'about' or the surface realization of almost any English vowel appearing in a sufficiently destressed word. The tongue flap [ɾ] could have come from a /t/ or a /d/." [65, pp. 548-549] This view of allophonic variation is representative of much of the speech recognition literature, especially during the late 1970s. One can find similar statements by Cole and Jakimik [22] and by Jelinek [50].

Table of contents

Frontmatter

1. Introduction

Abstract
The main goal of speech recognition/understanding research is to develop techniques and systems for speech input to machines. Ideally, we would like to construct an intelligent machine that could listen to a speech utterance and respond appropriately. This task is called speech understanding. Speech recognition is a different problem. A recognizer is an automatic dictation machine; it transforms an acoustic speech signal (utterance) into a text file (e.g., a sequence of ASCII characters).
Kenneth W. Church

2. Representation of Segments

Abstract
It is commonly agreed that there are two types of cues: those that vary a great deal with context (e.g., aspiration, retroflex, glottalization) and those that are relatively invariant to context (e.g., voicing, place, manner). The parsing and matching design proposed above is intended to take advantage of both types of cues in a uniform and modular fashion. The parser uses variant cues in order to locate suprasegmental constituents. Having found these constituents, the matcher canonicalizes them into sequences of invariant phonemic-like feature bundles and then looks them up in the lexicon to find word candidates.
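To make the division of labor concrete, here is a minimal Python sketch of such a parse-then-match pipeline. All of the names (canonicalize, match, LEXICON) and the segment labels are illustrative assumptions, not the book's implementation; in particular, the lexicon lookup is simplified to one syllable per word.

```python
# Hypothetical sketch of the parse-then-match design described above.
# Segment labels like 't-asp' (aspirated /t/) are toy notation.
LEXICON = {('T', 'AA', 'M'): 'Tom', ('K', 'AE', 'T'): 'cat'}

def canonicalize(syllable):
    """Strip context-dependent detail (aspiration, release, etc.)
    down to invariant phoneme-like labels."""
    return tuple(seg.split('-')[0].upper() for seg in syllable)

def match(syllables):
    """Look up each canonicalized constituent in the lexicon.
    (Simplification: assumes one syllable per word.)"""
    for syl in syllables:
        yield LEXICON.get(canonicalize(syl), '<unknown>')

# An aspirated /t/ and an unreleased /m/ canonicalize to plain T and M:
print(list(match([['t-asp', 'AA', 'm-unrel']])))   # ['Tom']
```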
Kenneth W. Church

3. Allophonic Rules

Abstract
As mentioned in the introduction, speech researchers [6, 21, 92] and linguists [15] have developed systems of allophonic rules for capturing the relevant generalizations. A typical rule of flapping might look something like:
$$ \text{t} \to \text{ɾ} \;/\; \text{V}\ \_\_\ \text{V} \tag{64} $$
which says that an intervocalic /t/ can be flapped.
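As an illustration of how such a rewrite rule applies, here is a small Python sketch that flaps intervocalic /t/ in a phone string; the vowel inventory and transcription are toy assumptions, not the book's notation.

```python
# Sketch: apply rule (64), t -> ɾ / V _ V, to a list of phones.
VOWELS = {'a', 'e', 'i', 'o', 'u', 'ʌ', 'ə', 'ɚ'}   # toy inventory

def flap(phones):
    """Rewrite 't' as the flap 'ɾ' wherever it is flanked by vowels."""
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if phones[i] == 't' and phones[i - 1] in VOWELS and phones[i + 1] in VOWELS:
            out[i] = 'ɾ'
    return out

print(flap(['b', 'ʌ', 't', 'ə', 'r']))   # ['b', 'ʌ', 'ɾ', 'ə', 'r']  ("butter")
```

Running the rule forward is easy; as the quotation in the introduction points out, it has no unique inverse, since a surface flap could equally well derive from /d/.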
Kenneth W. Church

4. An Alternative: Phrase-Structure Rules

Abstract
As mentioned above, we have attempted to reformulate allophonic rules in terms of rewrite rules. How can we reformulate these rules as phrase-structure rules? Basically, we will introduce additional constituent categories to encode contextual conditions. For example, one might want to say that a /t/ obligatorily aspirates before a vowel as in a word like Tom.
Kenneth W. Church

5. Parser Implementation

Abstract
In the previous chapter, we proposed phrase-structure rules as an alternative to rewrite rules. Recall that aspiration could be restricted to syllable initial position (a reasonable first approximation) with a set of rules of the form:
  • (109a) utterance → syllable*
  • (109b) syllable → onset rhyme
  • (109c) onset → aspirated-t | aspirated-k | aspirated-p | ...
  • (109d) rhyme → peak coda
  • (109e) coda → unreleased-t | unreleased-k | unreleased-p | ...
This sort of context-free phrase-structure grammar can be processed with well-known parsers like Earley’s Algorithm [30]. Thus, if we completed this grammar in a pure context-free formalism, we could employ a straightforward Earley parser to find syllables, onsets, rhymes, and so forth.
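For concreteness, here is a minimal sketch of rules (109a-e) as a context-free grammar parsed with NLTK's Earley chart parser. The terminal labels ('t-asp', 'k-unrel', etc.) are hypothetical segment names, and the Kleene star in (109a) is encoded by right recursion, since NLTK's CFG format has no star operator.

```python
# Toy fragment of grammar (109a-e), parsed with an Earley chart parser.
import nltk
from nltk.parse import EarleyChartParser

grammar = nltk.CFG.fromstring("""
    Utterance -> Syllable | Syllable Utterance
    Syllable  -> Onset Rhyme
    Onset     -> 't-asp' | 'k-asp' | 'p-asp'
    Rhyme     -> Peak Coda
    Peak      -> 'a' | 'i' | 'o'
    Coda      -> 't-unrel' | 'k-unrel' | 'p-unrel'
""")

parser = EarleyChartParser(grammar)

# A two-syllable toy input: aspirated stops open syllables,
# unreleased stops close them.
segments = ['t-asp', 'a', 'k-unrel', 'k-asp', 'a', 't-unrel']
for tree in parser.parse(segments):
    print(tree)
```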
Kenneth W. Church

6. Phonotactic Constraints

Abstract
Many speech recognition devices operate with an inadequate model of phonotactic constraints. In particular, most recognition devices incorporate a model of word structure of the form:
  • (158a) word → syllable*
  • (158b) syllable → initial-cluster vowel final-cluster
This model assumes a list of permissible initial- and final-clusters. How do we estimate the set of permissible clusters? It is widely assumed that the set of syllable initial-clusters is identical to the set of word initial-clusters, and that the set of syllable final-clusters is identical to the set of word final-clusters.
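The following Python sketch shows that estimation idea: collect the consonant clusters that precede the first vowel of each word in a pronouncing dictionary. The three-word dictionary and the vowel set are toy stand-ins; a real system would read a full pronouncing dictionary such as cmudict.

```python
# Sketch: estimate permissible initial clusters from word-initial clusters.
VOWELS = {'AA', 'AE', 'AH', 'EH', 'ER', 'IH', 'IY', 'OW', 'UW'}

pronunciations = {        # toy stand-in for a pronouncing dictionary
    'street': ['S', 'T', 'R', 'IY', 'T'],
    'splash': ['S', 'P', 'L', 'AE', 'SH'],
    'cat':    ['K', 'AE', 'T'],
}

def initial_cluster(phones):
    """Consonants preceding the first vowel."""
    cluster = []
    for p in phones:
        if p in VOWELS:
            break
        cluster.append(p)
    return tuple(cluster)

initial_clusters = {initial_cluster(p) for p in pronunciations.values()}
print(sorted(initial_clusters))   # [('K',), ('S', 'P', 'L'), ('S', 'T', 'R')]
```

The same function run over word-final phone sequences (reversed) would estimate the permissible final-clusters.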
Kenneth W. Church

7. When Phonotactic Constraints are Not Enough

Abstract
Phonotactic constraints are not sufficiently strong to produce a unique syllabification in all cases. For example, in words like western, winter, water, veto, divinity, and so forth, phonotactic constraints do not tell us which way the /t/ will attach. Most theories of syllable structure invoke “tie-breaking principles” for words such as these. Two very common “tie-breaking principles” are the Maximize Onset Principle and Stress Resyllabification. By the first principle, there is a preference toward assigning consonants to the onset of the following syllable rather than the coda of the previous syllable. Thus, for example, the /t/ in retire will be assigned to the onset of the second syllable (i.e., rĕ-tire) and not to the coda of the first syllable (i.e., *rĕt-ire). By the second principle, there is a preference toward assigning consonants to a stressed syllable over an unstressed one. Thus, for example, the /k/ in record will be assigned to the first syllable when the first syllable is stressed (i.e., réc-ord), and to the second syllable when the second is stressed (i.e., rĕ-córd).
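Here is a small Python sketch of the Maximize Onset Principle on its own: between two vowels, push the syllable boundary as far left as the onset inventory allows. The LEGAL_ONSETS set and the ARPAbet-style phones are illustrative assumptions, and Stress Resyllabification is left out.

```python
# Sketch of the Maximize Onset Principle for intervocalic consonants.
VOWELS = {'IH', 'ER', 'EH', 'IY', 'AH'}
LEGAL_ONSETS = {(), ('T',), ('R',), ('W',), ('N',), ('T', 'R'), ('S', 'T')}

def syllabify(phones):
    """Place each boundary so the following syllable gets the longest
    legal onset. (The empty onset is always legal, so a boundary is
    always found.)"""
    vowel_ix = [i for i, p in enumerate(phones) if p in VOWELS]
    bounds = []
    for prev_v, next_v in zip(vowel_ix, vowel_ix[1:]):
        for b in range(prev_v + 1, next_v + 1):   # largest onset first
            if tuple(phones[b:next_v]) in LEGAL_ONSETS:
                bounds.append(b)
                break
    cuts = [0] + bounds + [len(phones)]
    return [phones[i:j] for i, j in zip(cuts, cuts[1:])]

# 'winter': N-T is not a legal onset, so only the /t/ moves rightward.
print(syllabify(['W', 'IH', 'N', 'T', 'ER']))   # [['W','IH','N'], ['T','ER']]
```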
Kenneth W. Church

8. Robustness Issues

Abstract
In a practical speech recognition system, we could not expect the front-end acoustic processor to be as good as a linguist. It is unlikely that the acoustic processor will be able to provide a segmentation lattice of the same caliber as those obtained from our linguistic consultant, Meg Withgott. More realistic segmentation lattices are likely to contain a large number of errors and alternative choices. How can we modify our system to deal with these realities?
Kenneth W. Church

9. Conclusion

Abstract
This thesis has argued that allophonic constraints provide an important source of information that can be useful for speech recognition. Allophonic cues reveal important properties of the suprasegmental context precisely because they are variant. In this respect, allophonic cues are more useful than invariant cues; invariant cues cannot tell us anything about syllable structure and foot structure, since, by definition, they are insensitive to context.
Kenneth W. Church

Backmatter
