About this Book

This book explains advanced theoretical and application-related issues in grammatical inference, a research area inside the inductive inference paradigm for machine learning. The first three chapters of the book deal with issues regarding theoretical learning frameworks; the next four chapters focus on the main classes of formal languages according to Chomsky's hierarchy, in particular regular and context-free languages; and the final chapter addresses the processing of biosequences.

The topics chosen are of foundational interest with relatively mature and established results, algorithms and conclusions. The book will be of value to researchers and graduate students in areas such as theoretical computer science, machine learning, computational linguistics, bioinformatics, and cognitive psychology who are engaged in the study of learning, especially of the structure underlying the concept to be learned. Some knowledge of mathematics and theoretical computer science, including formal language theory, automata theory, formal grammars, and algorithmics, is a prerequisite for reading this book.

Table of Contents

Frontmatter

Chapter 1. Gold-Style Learning Theory

A Selection of Highlights Since Gold

This chapter is a tutorial addressed to, among other audiences, the grammatical inference community. It is on the computational learning theory initiated by E. Mark Gold's seminal 1967 paper. One of Gold's important motivations was to present a formal model of child language learning. This chapter introduces Gold's model and surveys selected highlights of the theory that have appeared since 1967. Since both Gold and the present author had or have cognitive science motivations, many of the results discussed herein also have implications for, and were motivated by, questions regarding the human cognitive ability to learn. For this chapter some prior knowledge of computability theory would be helpful to the reader, as would some prior acquaintance with Gold's identification/learning in the limit concept.

The first section concentrates on results which are independent of computational complexity considerations. Covered are: Gold's model of child language learning (including some reasonable, post-Gold criteria of successful learning and some critical discussion); some severe constraints on successful learning; some characterization results and important examples of successful learning; discussion and results on potential child insensitivities to data presentation order; and empirical U-shaped learning with some corresponding formal results interpreted for their cognitive science significance. The second section concentrates on relevant complexity issues: large database size and corresponding memory-limited learners; unavoidable cases of complexity and information deficiencies of the programs learned; and the complexity of learner updates. In this section a few seemingly difficult open mathematical questions are indicated, some of which are important for cognitive science.

John Case
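
To make the identification-in-the-limit criterion concrete, here is a minimal Python sketch (illustrative, not taken from the chapter) for the class of all finite languages: the learner conjectures exactly the set of strings seen so far, so once every element of the finite target has occurred in the presentation, its conjecture is correct and never changes again.

```python
# Minimal sketch of Gold-style identification in the limit (illustrative,
# not from the book). The class learned is all *finite* languages: the
# learner conjectures exactly the strings observed so far, so after every
# element of a finite target has appeared in the presentation, the
# conjecture is correct and never changes again.

def learner(presentation):
    """Yield a conjecture (a frozenset) after each datum of the text."""
    seen = set()
    for w in presentation:
        seen.add(w)
        yield frozenset(seen)

target = {"a", "ab", "ba"}                 # a finite target language
text = ["a", "ab", "a", "ba", "ab", "a"]   # a presentation enumerating it

conjectures = list(learner(text))
# After the datum "ba" (index 3), every conjecture equals the target:
assert all(c == frozenset(target) for c in conjectures[3:])
print(conjectures[-1])  # frozenset({'a', 'ab', 'ba'})
```

Gold's classic negative result marks the limits of this kind of success: no learner identifies in the limit from text any class containing all finite languages together with even one infinite language.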

Chapter 2. Efficiency in the Identification in the Limit Learning Paradigm

The most widely used learning paradigm in Grammatical Inference was introduced in 1967 and is known as identification in the limit. An important issue that has been raised with respect to the original definition is the absence of efficiency bounds. Nearly fifty years after its introduction, it remains an open problem how to best incorporate a notion of efficiency and tractability into this framework. This chapter surveys the different refinements that have been developed and studied, and the challenges they face. Main results for each formalization, along with comparisons, are provided.

Rémi Eyraud, Jeffrey Heinz, Ryo Yoshinaka

Chapter 3. Learning Grammars and Automata with Queries

When learning languages or grammars, an attractive alternative to using a large corpus is to learn by interacting with the environment. This allows us to deal with situations where data is scarce or expensive but testing or experimenting is possible. The situation, which arises in a number of fields, is formalised in a setting called active learning or query learning. By better controlling the information to which one has access, this setting provides us with a better understanding of the hardness of learning tasks. It also allows us to solve practical learning problems, for which new algorithms are needed.

Colin de la Higuera
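
As a toy illustration of the query-learning setting (not from the chapter), the following Python sketch narrows down a finite pool of candidate languages by asking membership queries only on strings where the surviving candidates disagree. Angluin-style learners such as L* work in the same spirit but construct a DFA using membership and equivalence queries.

```python
# Toy sketch of active (query) learning (illustrative only): the learner
# eliminates candidates from a finite pool by asking membership queries,
# choosing at each step a string on which the surviving candidates
# disagree, so every answer rules out at least one candidate.

from itertools import product

def strings(alphabet, max_len):
    for n in range(max_len + 1):
        for t in product(alphabet, repeat=n):
            yield "".join(t)

# Candidate languages over {a, b}, each given as a predicate on strings.
candidates = {
    "even length":   lambda w: len(w) % 2 == 0,
    "starts with a": lambda w: w.startswith("a"),
    "contains bb":   lambda w: "bb" in w,
}

def membership_oracle(w):            # the hidden target language
    return "bb" in w

pool = dict(candidates)
for w in strings("ab", 3):
    answers = {name: pred(w) for name, pred in pool.items()}
    if len(set(answers.values())) > 1:       # candidates disagree here
        truth = membership_oracle(w)         # ask a membership query
        pool = {n: p for n, p in pool.items() if answers[n] == truth}
    if len(pool) == 1:
        break

print(next(iter(pool)))  # contains bb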

Chapter 4. On the Inference of Finite State Automata from Positive and Negative Data

The works by Trakhtenbrot–Barzdin and Gold can be considered the first works on the identification of Finite Automata from given data. The main drawback of their approaches is that they may produce hypotheses that are inconsistent with the provided data. This drawback was overcome by the RPNI and Lang algorithms. Since then, other works have introduced algorithms that make more efficient use of the training data, which in turn has led to lower error rates. Recently, some works have tackled the identification of NFAs instead of the traditional DFA model. In this line of research, the inference of Residual Finite State Automata (RFSA) provides a canonical non-deterministic model, and other works consider the inference of teams of NFAs as a suitable method for the grammatical inference of finite automata. We review the main approaches to the inference of finite automata from positive and negative data of the target language, describing the above-mentioned formalisms and induction techniques.

Damián López, Pedro García
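
To make the starting point of these algorithms concrete, here is a simplified Python sketch (illustrative only) of the prefix tree acceptor (PTA) built from the positive sample. RPNI then greedily merges PTA states in a fixed order, rejecting any merge that would make the automaton accept a negative string; that merging machinery is omitted here.

```python
# Sketch of the prefix tree acceptor (PTA) that state-merging algorithms
# such as RPNI start from (illustrative, simplified). Each state is a
# prefix of the positive sample; the resulting automaton accepts exactly
# the positive strings and is therefore consistent with both samples.

def build_pta(positive):
    """Return (transitions, finals); states are prefixes of the sample."""
    trans, finals = {}, set()
    for w in positive:
        for i, ch in enumerate(w):
            trans[(w[:i], ch)] = w[:i + 1]
        finals.add(w)
    return trans, finals

def accepts(trans, finals, w):
    state = ""                        # the empty prefix is the start state
    for ch in w:
        if (state, ch) not in trans:
            return False
        state = trans[(state, ch)]
    return state in finals

positive = ["a", "ab", "abb"]
negative = ["b", "ba"]

trans, finals = build_pta(positive)
assert all(accepts(trans, finals, w) for w in positive)
assert not any(accepts(trans, finals, w) for w in negative)
```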

Chapter 5. Learning Probability Distributions Generated by Finite-State Machines

We review methods for the inference of probability distributions generated by probabilistic automata and related models for sequence generation. We focus on methods that can be proved to learn in the identification-in-the-limit and PAC formal models. The methods we review are state-merging and state-splitting methods for probabilistic deterministic automata and the recently developed spectral method for nondeterministic probabilistic automata. In both cases, we derive them from a high-level algorithm described in terms of the Hankel matrix of the distribution to be learned, given as an oracle, and then describe how to adapt that algorithm to account for the error introduced by a finite sample.

Jorge Castro, Ricard Gavaldà
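
The Hankel matrix at the heart of this approach is easy to illustrate. In the Python sketch below (an illustration, not the chapter's algorithm), the distribution Pr(a^n) = (1/2)^(n+1) over a one-letter alphabet is generated by a one-state probabilistic automaton, and the corresponding Hankel block H[u, v] = Pr(uv) accordingly has rank 1; spectral methods recover automaton parameters from an SVD of an empirical estimate of such a block.

```python
# Illustrative Hankel matrix computation (not the book's code). For the
# distribution Pr(a^n) = (1/2)^(n+1) over the one-letter alphabet {a},
# the Hankel block H[u, v] = Pr(uv) has rank 1, matching the single
# state of the probabilistic automaton generating the distribution.

import numpy as np

prefixes = suffixes = ["", "a", "aa"]

def prob(w):                               # Pr(a^n) = (1/2)^(n+1)
    return 0.5 ** (len(w) + 1)

H = np.array([[prob(u + v) for v in suffixes] for u in prefixes])
print(H)
print("rank:", np.linalg.matrix_rank(H))   # rank: 1
```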

Chapter 6. Distributional Learning of Context-Free and Multiple Context-Free Grammars

This chapter reviews recent progress in distributional learning in grammatical inference as applied to learning context-free and multiple context-free grammars. We discuss the basic principles of distributional learning and present two classes of representations, primal and dual, where primal approaches use nonterminals based on strings or sets of strings and dual approaches use nonterminals based on contexts or sets of contexts. We then present learning algorithms based on these two types of representations under a variety of learning paradigms, and discuss the natural extension to mildly context-sensitive formalisms, using multiple context-free grammars as a representative formalism.

Alexander Clark, Ryo Yoshinaka
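
The observations driving both the primal and the dual approach are simple to compute. The Python sketch below (illustrative only) collects, for each substring of a finite sample, the set of contexts (l, r) such that l + substring + r is in the sample; substrings with similar context sets are candidates for being generated by the same nonterminal.

```python
# Minimal sketch of the distributional observations behind primal and
# dual learners (illustrative): for each substring of a finite sample,
# collect the contexts (l, r) with l + substring + r in the sample.

from collections import defaultdict

sample = ["ab", "aabb", "aaabbb"]   # a sample of the language a^n b^n

contexts = defaultdict(set)
for w in sample:
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            contexts[w[i:j]].add((w[:i], w[j:]))

print(contexts["ab"])    # {('', ''), ('a', 'b'), ('aa', 'bb')}, in some order
print(contexts["aabb"])  # {('', ''), ('a', 'b')}, in some order
```

Note that "ab" and "aabb" share the contexts ('', '') and ('a', 'b'), exactly the kind of distributional evidence a learner uses to group strings under one nonterminal.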

Chapter 7. Learning Tree Languages

Tree languages have proved to be a versatile and rewarding extension of the classical notion of string languages. Many nice applications have been established over the years, in areas such as Natural Language Processing, Information Extraction, and Computational Biology. Although some properties of string languages transfer easily to the tree case, in particular for regular languages, several computational aspects turn out to be harder. It is therefore both of theoretical and of practical interest to investigate how far and in what ways Grammatical Inference algorithms developed for the string case are applicable to trees. This chapter surveys known results in this direction. We begin by recalling the basics of tree language theory. Then, the most popular learning scenarios and algorithms are presented. Several applications of Grammatical Inference of tree languages are reviewed in some detail. We conclude by suggesting a number of directions for future research.

Johanna Björklund, Henning Fernau
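
For readers new to the formalism, here is a minimal Python sketch (not from the chapter) of a deterministic bottom-up tree automaton: trees are nested tuples, states are computed at the leaves and propagated upward, and acceptance depends on the state reached at the root. This automaton has states {0, 1} and accepts exactly the Boolean expression trees over and/or/0/1 that evaluate to 1.

```python
# Sketch of a deterministic bottom-up tree automaton run (illustrative):
# trees are nested tuples (symbol, child, ...). States are 0 and 1;
# transitions evaluate "and" as min and "or" as max, and the final
# (accepting) state is 1.

def run(tree):
    symbol, children = tree[0], tree[1:]
    states = [run(c) for c in children]   # bottom-up: children first
    if symbol == "and":
        return min(states)
    if symbol == "or":
        return max(states)
    return {"0": 0, "1": 1}[symbol]       # leaf transitions

t = ("or", ("and", ("1",), ("0",)), ("1",))
print("accepted" if run(t) == 1 else "rejected")   # accepted
```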

Chapter 8. Learning the Language of Biological Sequences

The application to biological sequences is an appealing challenge for Grammatical Inference. While some first successes have already been recorded, such as the inference of profile Hidden Markov Models and stochastic Context-Free Grammars, which are now part of the classical Bioinformatics toolbox, the field remains a rich and open source of problems and inspiration for our research, with the possibility of applying our ideas to fundamental real-world applications. In this chapter, we survey the main specificities of biological sequences and how they are handled in Pattern/Motif Discovery, in order to introduce the important concepts and techniques used, and we present the latest successful Grammatical Inference approaches in this field.

François Coste