1989 | Book

Electronic Dictionaries and Automata in Computational Linguistics

LITP Spring School on Theoretical Computer Science Saint-Pierre d'Oléron, France, May 25–29, 1987 Proceedings

Edited by: Maurice Gross, Dominique Perrin

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

This volume contains the proceedings of the 15th Spring School of the LITP (Laboratoire d'Informatique Théorique et de Programmation, Université Paris VI-VII, CNRS) held from May 25 to 29, 1987 in Saint-Pierre d'Oléron. The meeting was organized by M. Borillo, M. Gross, M. Nivat and D. Perrin. The purpose of this yearly meeting is to present the state of the art in a specific topic which has gained considerable maturity. The proceedings of the last three Spring Schools have already been published in this series and deal with "Automata on Infinite Words" (LNCS 192), "Combinators and Functional Programming Languages" (LNCS 242) and "Automata Networks" (LNCS 316).

The contributions gathered for the 1987 conference present a unique combination of automata theory on the one hand and natural language processing on the other hand. Both fields have strong historical links as exemplified by the works of Chomsky and Harris in Linguistics, the work of Backus and others in Computer Science and the work of Schützenberger in Algebra.

The methods described and discussed in the field of string processing and automata cover the traditional algorithms for string matching, data compression, sequence comparison and lexical analysis. The papers that deal more directly with natural language processing treat automated text generation, lexical analysis and formal representation.

Table of contents

Frontmatter
Data compression with substitution
Abstract
All the data compression methods described in this paper are based on substitutions acting on characters or factors occurring in the source texts. The average expected compression ratio is often close to 2. Most methods behave badly when errors appear in encoded texts: a single lost bit can make decompression almost impossible.
To increase the compression ratio, other methods can be used. Arithmetic coding is one such example and leads to higher efficiency.
Another way to increase the compression ratio is to give up the "lossless information" condition. These compaction methods must use semantic rules to recover the original information, so they cannot be used to create archives or to communicate. A compaction example is found in [McI 82] for the "spell" program available under the Unix operating system.
Maxime Crochemore
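
A rough illustration of factor substitution (my own sketch in Python, not one of the methods surveyed in the paper): frequent two-character factors of the source text are replaced by single unused symbols, and the substitution table is kept alongside the compressed text so that decoding can invert it exactly. All names and parameters here are hypothetical.

from collections import Counter

def compress(text, max_rules=10):
    # Repeatedly substitute the most frequent digram with a fresh unused symbol.
    rules = []                 # list of (code_symbol, digram) pairs
    next_code = 0xE000         # Unicode private-use area, assumed absent from the source
    for _ in range(max_rules):
        digrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not digrams:
            break
        digram, count = digrams.most_common(1)[0]
        if count < 2:
            break
        code = chr(next_code)
        next_code += 1
        rules.append((code, digram))
        text = text.replace(digram, code)
    return text, rules

def decompress(text, rules):
    # Undo the substitutions in reverse order to recover the source text exactly.
    for code, digram in reversed(rules):
        text = text.replace(code, digram)
    return text

source = "abracadabra abracadabra abracadabra"
packed, rules = compress(source)
assert decompress(packed, rules) == source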
Some pronominalization issues in generation of texts in Romance languages
Laurence Danlos
The use of finite automata in the lexical representation of natural language
Maurice Gross
Estimation of the entropy by the Lempel-Ziv method
G. Hansel
Applications of phonetic description
Abstract
Computer applications with vocal input-output in natural languages will involve the use of accurate phonetic data, and in particular of data about phonetic variations. These variations can be described, analyzed and given a formal representation with simple tools from the theory of computation: strings of symbols and transductions. However, gathering these basic phonetic data also implies a systematic study of the lexicon of the languages. At that cost, some of the specific difficulties of speech, as opposed to written texts, can be overcome.
Eric Laporte
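
A minimal sketch of the kind of formal device mentioned in the abstract above: phonetic variation expressed as a transduction that optionally rewrites symbols in a phonemic string. The rule set and the schwa symbol '@' are hypothetical examples, not Laporte's data.

def variants(phonemes, optional_rules):
    # Apply each optional rewrite rule at every position where it matches,
    # collecting both the rewritten and the unrewritten forms.
    forms = {phonemes}
    for src, dst in optional_rules:
        expanded = set()
        for form in forms:
            expanded.add(form)          # variant where the rule is not applied
            i = form.find(src)
            while i != -1:
                expanded.add(form[:i] + dst + form[i + len(src):])
                i = form.find(src, i + 1)
        forms = expanded
    return forms

# "petit" transcribed as p@tit, with optional schwa deletion: {'p@tit', 'ptit'}
print(variants("p@tit", [("@", "")]))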
Sequence comparison: Some theory and some practice
Abstract
A brief survey of the theory and practice of sequence comparison is given, focusing on diff, the UNIX file-difference utility.
Imre Simon
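
Two small self-contained illustrations of what the survey covers, written here only as an example and not taken from the paper: the classic dynamic program for the longest common subsequence, which underlies the theory of sequence comparison, and a line-based comparison using Python's standard difflib module (a different implementation from the UNIX diff tool, but built on the same idea of reporting a small set of insertions and deletions).

import difflib

def lcs_length(a, b):
    # Classic dynamic program: length of the longest common subsequence of a and b.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

old = ["the quick brown fox", "jumps over", "the lazy dog"]
new = ["the quick brown fox", "leaps over", "the lazy dog"]

print(lcs_length(old, new))        # 2 lines in common
for line in difflib.unified_diff(old, new, fromfile="a", tofile="b", lineterm=""):
    print(line)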
The lexical analysis of French
Abstract
The automatic linguistic analysis of texts requires basic information about the simple and compound words of the text. Lexical analysis is the preliminary step before syntactic analysis. We have shown that important linguistic problems appear during this basic step. Some of them cannot yet be solved (recognition of proper names, compound verbs, and so on); others, if solved during lexical analysis, facilitate the syntactic analysis by reducing the degree of ambiguity of the text.
The lexical parser is based on a program (automaton) which could be used in more general cases in order to disambiguate some strings. For example, we have described the context of the string j'; in the same way, it would be possible to describe the context of the word je, which appears only in a limited number of schemas.
We have given an enumeration of problems that have to be solved in order to recognize words. Each of these problems is well known. What we have attempted here is to formulate them in such a form that they can be represented by finite automata and treated by the corresponding algorithms. It should be clear that the number of automata to be built, their size and the formulation of their interaction are by no means trivial: a complex program is required simply to recognize the words of a text.
Max Silberztein
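
A deliberately crude sketch of the idea described above, with a made-up transition table rather than Silberztein's actual automata: a finite automaton over token classes that accepts the elided pronoun j' only when the next word begins with a vowel or a mute h.

VOWELS = set("aàâeéèêëiîïoôuûy" + "h")   # 'h' stands in for mute h; a simplification

def token_class(token):
    if token == "j'":
        return "J_ELIDED"
    return "V_INITIAL" if token and token[0].lower() in VOWELS else "OTHER"

# States: 0 = start, 1 = just read j', 2 = j' correctly followed by a vowel-initial word
TRANSITIONS = {
    (0, "J_ELIDED"): 1, (0, "V_INITIAL"): 0, (0, "OTHER"): 0,
    (1, "V_INITIAL"): 2,                      # the only legal continuation after j'
    (2, "J_ELIDED"): 1, (2, "V_INITIAL"): 2, (2, "OTHER"): 2,
}

def valid_elision(tokens):
    # Run the automaton; reject if j' is followed by a consonant-initial word
    # or ends the token sequence.
    state = 0
    for token in tokens:
        key = (state, token_class(token))
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return state != 1

print(valid_elision(["j'", "aime", "paris"]))   # True
print(valid_elision(["j'", "parle"]))           # False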
Metadata
Title
Electronic Dictionaries and Automata in Computational Linguistics
Edited by
Maurice Gross
Dominique Perrin
Copyright year
1989
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-48140-9
Print ISBN
978-3-540-51465-7
DOI
https://doi.org/10.1007/3-540-51465-1