
2010 | Book

Grammatical Inference: Theoretical Results and Applications

10th International Colloquium, ICGI 2010, Valencia, Spain, September 13-16, 2010. Proceedings

Edited by: José M. Sempere, Pedro García

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Invited Talks

Grammatical Inference and Games: Extended Abstract

This paper discusses the potential synergy between research in grammatical inference and research in artificial intelligence applied to games. There are two aspects to this: the potential as a rich source of challenging and engaging test problems, and the potential for real applications.

Simon M. Lucas
Molecules, Languages and Automata

Molecular biology is full of linguistic metaphors, from the language of DNA to the genome as “book of life.” Certainly the organization of genes and other functional modules along the DNA sequence invites a syntactic view, which can be seen in certain tools used in bioinformatics such as hidden Markov models. It has also been shown that folding of RNA structures is neatly expressed by grammars that require expressive power beyond context-free, an approach that has even been extended to the much more complex structures of proteins. Processive enzymes and other “molecular machines” can also be cast in terms of automata. This paper briefly reviews linguistic approaches to molecular biology, and provides perspectives on potential future applications of grammars and automata in this field.

David B. Searls

Regular Papers

Inferring Regular Trace Languages from Positive and Negative Samples

In this work, we give an algorithm that infers Regular Trace Languages. Trace languages can be seen as regular languages that are closed under a partial commutation relation called the independence relation. This algorithm is similar to the RPNI algorithm, but it is based on Asynchronous Cellular Automata. For this purpose, we define Asynchronous Cellular Moore Machines and implement the merge operation as the calculation of an equivalence relation. After presenting the algorithm we provide a proof of its convergence (which is more complicated than the proof of convergence of the RPNI because there are no Minimal Automata for Asynchronous Automata), and we discuss the complexity of the algorithm.

Antonio Cano Gómez
Distributional Learning of Some Context-Free Languages with a Minimally Adequate Teacher

Angluin showed that the class of regular languages can be learned from a Minimally Adequate Teacher (MAT) providing membership and equivalence queries. Clark and Eyraud (2007) showed that some context-free grammars can be identified in the limit from positive data alone by identifying the congruence classes of the language. In this paper we consider the learnability of context-free languages using a MAT. We show that there is a natural class of context-free languages, including the class of regular languages, that can be polynomially learned from a MAT, using an algorithm that is an extension of Angluin's L* algorithm.

Alexander Clark
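
The MAT setting referenced above can be illustrated with a toy fragment of the observation-table bookkeeping at the heart of Angluin's L*. The sketch below is illustrative only (the target language and all names are chosen for the example, not taken from the paper): the teacher is simulated by a membership oracle, and the table is extended until it is closed, at which point its distinct rows correspond to states of the hypothesis automaton.

```python
# A toy sketch of L*-style observation-table bookkeeping (illustrative;
# the paper extends this machinery to a class of context-free languages).
# The simulated teacher answers membership queries for the target
# language: strings over {a, b} with an even number of a's.

def member(w):
    return w.count("a") % 2 == 0

def row(prefix, suffixes):
    # The row of the observation table for a given prefix.
    return tuple(member(prefix + s) for s in suffixes)

def is_closed(prefixes, suffixes, alphabet):
    # Closed: every one-letter extension of a prefix has the same
    # row as some existing prefix.
    rows = {row(p, suffixes) for p in prefixes}
    for p in prefixes:
        for a in alphabet:
            if row(p + a, suffixes) not in rows:
                return False, p + a
    return True, None

prefixes, suffixes, alphabet = {""}, [""], "ab"
closed, missing = is_closed(prefixes, suffixes, alphabet)
while not closed:
    prefixes.add(missing)  # promote the missing row to the table
    closed, missing = is_closed(prefixes, suffixes, alphabet)

# Distinct rows correspond to states of the hypothesis DFA.
states = {row(p, suffixes) for p in prefixes}
print(len(states))  # 2: even vs. odd number of a's
```

In the full algorithm, equivalence queries then either confirm the hypothesis or return a counterexample that refines the table.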
Learning Context Free Grammars with the Syntactic Concept Lattice

The Syntactic Concept Lattice is a residuated lattice based on the distributional structure of a language; the natural representation based on this is a context-sensitive formalism. Here we examine the possibility of basing a context-free grammar (CFG) on the structure of this lattice, in particular by choosing non-terminals to correspond to concepts in this lattice. We present a learning algorithm for context-free grammars which uses positive data and membership queries, and prove its correctness under the identification in the limit paradigm. Since the lattice itself may be infinite, we consider only a polynomially bounded subset of the set of concepts, in order to get an efficient algorithm. We compare this on the one hand to learning algorithms for context-free grammars where the non-terminals correspond to congruence classes, and on the other hand to the use of context-sensitive techniques such as Binary Feature Grammars and Distributional Lattice Grammars. The class of CFGs that can be learned in this way includes inherently ambiguous and thus non-deterministic languages; this approach therefore breaks through an important barrier in CFG inference.

Alexander Clark
Learning Automata Teams

We prove in this work that, under certain conditions, an algorithm that arbitrarily merges states in the prefix tree acceptor of the sample in a consistent way converges in the limit to the minimal DFA for the target language. This fact is used to learn automata teams, which use the different automata output by this algorithm to classify the test set. Experimental results show that the use of automata teams improves on the best known results for this type of algorithm. We also prove that the well-known Blue-Fringe EDSM algorithm, which represents the state of the art in state-merging algorithms, needs only a polynomial characteristic set to converge.

Pedro García, Manuel Vázquez de Parga, Damián López, José Ruiz
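
The prefix tree acceptor that such state-merging learners start from is easy to sketch. The following is a rough illustration, not the authors' implementation: each labelled sample word is threaded through a trie whose final state records accept/reject, and merging then collapses states as long as no accept/reject conflict arises.

```python
# A minimal prefix tree acceptor (PTA) built from labelled samples,
# the common starting point of state-merging DFA learners (illustrative
# sketch; the paper's merging strategy is not reproduced here).

class PTA:
    def __init__(self):
        self.delta = {}          # (state, symbol) -> state
        self.label = {0: None}   # None = unknown, True = accept, False = reject
        self.next_state = 1

    def add(self, word, positive):
        q = 0
        for a in word:
            if (q, a) not in self.delta:
                self.delta[(q, a)] = self.next_state
                self.label[self.next_state] = None
                self.next_state += 1
            q = self.delta[(q, a)]
        self.label[q] = positive

pta = PTA()
for w in ["a", "abab", "b"]:   # positive sample
    pta.add(w, True)
for w in ["ab", "ba"]:         # negative sample
    pta.add(w, False)

print(pta.next_state)  # 7 states in this PTA
```

A merge of two states is "consistent" exactly when it never forces an accepting state and a rejecting state into the same equivalence class.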
Exact DFA Identification Using SAT Solvers

We present an exact algorithm for identification of deterministic finite automata (DFA) which is based on satisfiability (SAT) solvers. Despite the size of the low level SAT representation, our approach is competitive with alternative techniques. Our contributions are fourfold: First, we propose a compact translation of DFA identification into SAT. Second, we reduce the SAT search space by adding lower bound information using a fast max-clique approximation algorithm. Third, we include many redundant clauses to provide the SAT solver with some additional knowledge about the problem. Fourth, we show how to use the flexibility of our translation in order to apply it to very hard problems. Experiments on a well-known suite of random DFA identification problems show that SAT solvers can efficiently tackle all instances. Moreover, our algorithm outperforms state-of-the-art techniques on several hard problems.

Marijn J. H. Heule, Sicco Verwer
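
The flavour of translating DFA identification into SAT can be sketched as a graph-colouring encoding: states of the prefix tree are "coloured" with one of n DFA states, and conflicting tree states must receive different colours. The code below is a simplification for illustration only; the paper's compact translation, determinism constraints, and redundant clauses are not reproduced.

```python
# A simplified graph-colouring core of a SAT encoding for DFA
# identification (illustrative; not the paper's full translation).
# Variable var(v, i) means "prefix-tree state v gets colour i".
from itertools import combinations

def encode(num_tree_states, num_colors, conflicts):
    def var(v, i):
        return v * num_colors + i + 1  # DIMACS-style positive integers
    clauses = []
    for v in range(num_tree_states):
        # Each tree state takes at least one colour...
        clauses.append([var(v, i) for i in range(num_colors)])
        # ...and at most one colour.
        for i, j in combinations(range(num_colors), 2):
            clauses.append([-var(v, i), -var(v, j)])
    # Conflicting tree states (one accepting, one rejecting)
    # must not share a colour.
    for v, w in conflicts:
        for i in range(num_colors):
            clauses.append([-var(v, i), -var(w, i)])
    return clauses

# 5 tree states, 2 colours, states 1 and 2 may not share a colour.
cnf = encode(5, 2, [(1, 2)])
print(len(cnf))  # 12 clauses
```

The resulting clause list can be written in DIMACS CNF format and handed to any off-the-shelf SAT solver; the max-clique lower bound mentioned in the abstract fixes the smallest number of colours worth trying.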
Learning Deterministic Finite Automata from Interleaved Strings

Workflows are an important knowledge representation used to understand and automate processes in diverse task domains. Past work has explored the problem of learning workflows from traces of processing. In this paper, we are concerned with learning workflows from interleaved traces captured during the concurrent processing of multiple task instances. We first present an abstraction of the problem of recovering workflows from interleaved example traces in terms of grammar induction. We then describe a two-stage approach to reasoning about the problem, highlighting some negative results that demonstrate the need to work with a restricted class of languages. Finally, we give an example of a restricted language class called terminated languages for which an accepting deterministic finite automaton (DFA) can be recovered in the limit from interleaved strings, and make remarks about the applicability of the two-stage approach to terminated languages.

Joshua Jones, Tim Oates
Learning Regular Expressions from Representative Examples and Membership Queries

A learning algorithm is developed for a class of regular expressions equivalent to the class of all unionless unambiguous regular expressions of loop depth 2. The learner uses one representative example of the target language (in which every occurrence of every loop in the target expression is unfolded at least twice) and a number of membership queries. The algorithm works in time polynomial in the length of the input example.

Efim Kinber
Splitting of Learnable Classes

A class $\mathcal{L}$ is called mitotic if it admits a splitting $\mathcal{L}_0, \mathcal{L}_1$ such that $\mathcal{L}, \mathcal{L}_0, \mathcal{L}_1$ are all equivalent with respect to a certain reducibility. Such a splitting might be called a symmetric splitting. In this paper we investigate the possibility of constructing a class which has a splitting and where any splitting of the class is a symmetric splitting. We call such a class a symmetric class. In particular, we construct an incomplete symmetric BC-learnable class with respect to strong reducibility. We also introduce the notion of very strong reducibility and construct a complete symmetric BC-learnable class with respect to very strong reducibility. However, for EX-learnability, it is shown that there does not exist a symmetric class with respect to any weak, strong or very strong reducibility.

Hongyang Li, Frank Stephan
PAC-Learning Unambiguous k,l-NTS≤ Languages

In this paper we present two hierarchies of context-free languages: the k,l-NTS languages and the k,l-NTS≤ languages. k,l-NTS languages generalize the concept of Non-Terminally Separated (NTS) languages by adding a fixed-size context to the constituents, in the same way as k,l-substitutable languages generalize substitutable languages (Yoshinaka, 2008). k,l-NTS≤ languages are k,l-NTS languages that also consider the edges of sentences as possible contexts. We then prove that Unambiguous k,l-NTS≤ (k,l-UNTS≤) languages can be converted to plain old UNTS languages over a richer alphabet. Using this and the result of polynomial PAC-learnability with positive data of UNTS grammars proved by Clark (2006), we prove that k,l-UNTS≤ languages are also PAC-learnable under the same conditions.

Franco M. Luque, Gabriel Infante-Lopez
Bounding the Maximal Parsing Performance of Non-Terminally Separated Grammars

Unambiguous Non-Terminally Separated (UNTS) grammars have good learnability properties but are too restrictive to be used for natural language parsing. We present a generalization of UNTS grammars called Unambiguous Weakly NTS (UWNTS) grammars that preserves the learnability properties. Then, we study the problem of using them to parse natural language, evaluating against a gold treebank. If the target language is not UWNTS, there will be an upper bound on the parsing performance. In this paper we develop methods to find upper bounds for the unlabeled F1 performance that any UWNTS grammar can achieve over a given treebank. We define a new metric, show that its optimization is NP-hard but solvable with specialized software, and show a translation of the result into a bound for the F1. We do experiments with the WSJ10 corpus, finding an F1 bound of 76.1% for UWNTS grammars over the POS tag alphabet.

Franco M. Luque, Gabriel Infante-Lopez
CGE: A Sequential Learning Algorithm for Mealy Automata

We introduce a new algorithm for sequential learning of Mealy automata by congruence generator extension (CGE). Our approach makes use of techniques from term rewriting theory and universal algebra to compactly represent and manipulate automata using finite congruence generator sets represented as string rewriting systems (SRS). We prove that the CGE algorithm correctly learns in the limit.

Karl Meinke
Using Grammar Induction to Model Adaptive Behavior of Networks of Collaborative Agents

We introduce a formal paradigm to study the global adaptive behavior of organizations of collaborative agents with local learning capabilities. Our model is based on an extension of the classical language learning setting in which a teacher provides examples to a student that must guess a correct grammar. In our model the teacher is transformed into a workload dispatcher and the student is replaced by an organization of worker-agents. The jobs that the dispatcher creates consist of sequences of tasks that can be modeled as sentences of a language. The agents in the organization have language learning capabilities that can be used to learn local work-distribution strategies. In this context one can study the conditions under which the organization can adapt itself to structural pressure from an environment. We show that local learning capabilities contribute to global performance improvements. We have implemented our theoretical framework in a workbench that can be used to run simulations, and we discuss some results of these simulations. We believe that this approach provides a viable framework to study processes of self-organization and optimization in collaborative agent networks.

Wico Mulder, Pieter Adriaans
Transducer Inference by Assembling Specific Languages

Grammatical Inference has recently been applied successfully to bioinformatics tasks such as protein domain prediction. In this work we present a new approach to inferring regular languages. Although used here in a biological task, our results may be useful not only in bioinformatics but also in many other applied tasks. To test the algorithm we consider the transmembrane domain prediction task. A preprocessing of the training sequence set allows us to use this heuristic to obtain a transducer. The transducer obtained is then used to label problem sequences. The experimentation carried out shows that this approach is suitable for the task.

Piedachu Peris, Damián López
Sequences Classification by Least General Generalisations

In this paper, we present a general framework for supervised classification. This framework provides methods like boosting and only needs the definition of a generalisation operator called lgg (least general generalisation). For sequence classification tasks, lgg is a learner that uses only positive examples. We show that grammatical inference has already defined such learners for automata classes like reversible automata or k-TSS automata. Then we propose a generalisation algorithm for the class of balls of words. Finally, we show through experiments that our method efficiently solves sequence classification tasks.

Frédéric Tantini, Alain Terlutte, Fabien Torre
A Likelihood-Ratio Test for Identifying Probabilistic Deterministic Real-Time Automata from Positive Data

We adapt an algorithm (RTI) for identifying (learning) a deterministic real-time automaton (DRTA) to the setting of positive timed strings (or time-stamped event sequences). A DRTA can be seen as a deterministic finite state automaton (DFA) with time constraints. Because DRTAs model time using numbers, they can be exponentially more compact than equivalent DFA models that model time using states.

We use a new likelihood-ratio statistical test for checking consistency in the RTI algorithm. The result is the RTI+ algorithm, which stands for real-time identification from positive data. RTI+ is an efficient algorithm for identifying DRTAs from positive data. We show using artificial data that RTI+ is capable of identifying sufficiently large DRTAs to identify real-world real-time systems.

Sicco Verwer, Mathijs de Weerdt, Cees Witteveen
A Local Search Algorithm for Grammatical Inference

In this paper, a heuristic algorithm for the inference of an arbitrary context-free grammar is presented. The input data consist of a finite set of representative words chosen from a (possibly infinite) context-free language and a finite set of counterexamples, i.e., words which do not belong to the language. The time complexity of the algorithm is polynomially bounded. Experiments have been performed on a dozen or so languages investigated by other researchers, and the results are reported.

Wojciech Wieczorek
Polynomial-Time Identification of Multiple Context-Free Languages from Positive Data and Membership Queries

This paper presents an efficient algorithm that identifies a rich subclass of multiple context-free languages in the limit from positive data and membership queries by observing where each tuple of strings may occur in sentences of the language of the learning target. Our technique is based on Clark et al.’s work (ICGI 2008) on learning of a subclass of context-free languages. Our algorithm learns those context-free languages as well as many non-context-free languages.

Ryo Yoshinaka
Grammatical Inference as Class Discrimination

Grammatical inference is typically defined as the task of finding a compact representation of a language given a subset of sample sequences from that language. Many different aspects, paradigms and settings can be investigated, leading to different proofs of language learnability or practical systems. The general problem can be seen as a one class classification or discrimination task. In this paper, we take a slightly different view on the task of grammatical inference. Instead of learning a full description of the language, we aim to learn a representation of the boundary of the language. Effectively, when this boundary is known, we can use it to decide whether a sequence is a member of the language or not. An extension of this approach allows us to decide on membership of sequences over a collection of (mutually exclusive) languages. We will also propose a systematic approach that learns language boundaries based on subsequences from the sample sequences and show its effectiveness on a practical problem of music classification. It turns out that this approach is indeed viable.

Menno van Zaanen, Tanja Gaustad

Short Papers

MDL in the Limit

We show that, within the Gold paradigm for language learning, an informer for a superfinite set can cause an optimal MDL learner to make an infinite number of mind changes. In this setting an optimal learner can make an infinite number of wrong choices without approximating the right solution. This result helps us understand the relation between MDL and identification in the limit: MDL is an optimal model selection paradigm, while identification in the limit defines recursion-theoretic conditions for the convergence of a learner.

Pieter Adriaans, Wico Mulder
Grammatical Inference Algorithms in MATLAB

Although MATLAB has become one of the mainstream languages for the machine learning community, there is still skepticism in the Grammatical Inference (GI) community regarding the suitability of MATLAB for implementing and running GI algorithms. In this paper we present implementation results for several GI algorithms, e.g., RPNI (Regular Positive and Negative Inference), EDSM (Evidence Driven State Merging), and k-testable machines. We show experimentally, based on our MATLAB implementation, that state-merging algorithms can successfully be implemented and manipulated using MATLAB in a similar fashion to other machine learning tools. Moreover, we also show that MATLAB provides a range of toolboxes that can be leveraged to gain parallelism, speedups, etc.

Hasan Ibne Akram, Colin de la Higuera, Huang Xiao, Claudia Eckert
A Non-deterministic Grammar Inference Algorithm Applied to the Cleavage Site Prediction Problem in Bioinformatics

We report results on applying the OIL (Order Independent Language) grammar inference algorithm to predict cleavage sites in polyproteins from translations of the Potyvirus genome. This non-deterministic algorithm is used to generate a group of models which vote to predict the occurrence of the pattern. We built nine models, one for each cleavage site in this kind of virus genome, and report sensitivity, specificity and accuracy for each model. Our results show that this technique is useful for predicting cleavage sites in the given task, with accuracy rates higher than 95%.

Gloria Inés Alvarez, Jorge Hernán Victoria, Enrique Bravo, Pedro García
Learning PDFA with Asynchronous Transitions

In this paper we extend the PAC learning algorithm due to Clark and Thollard for learning distributions generated by PDFA to automata whose transitions may take varying time lengths, governed by exponential distributions.

Borja Balle, Jorge Castro, Ricard Gavaldà
Grammar Inference Technology Applications in Software Engineering

While Grammar Inference (GI) has been successfully applied to many diverse domains such as speech recognition and robotics, its application to software engineering has been limited, despite wide use of context-free grammars in software systems. This paper reports current developments and future directions in the applicability of GI to software engineering, where GI is seen to offer innovative solutions to the problems of inference of domain-specific language (DSL) specifications from example DSL programs and recovery of metamodels from instance models.

Barrett R. Bryant, Marjan Mernik, Dejan Hrnčič, Faizan Javed, Qichao Liu, Alan Sprague
Hölder Norms and a Hierarchy Theorem for Parameterized Classes of CCG

We develop a framework based on Hölder norms that allows us to easily transfer learnability results. This idea is concretized by applying it to Classical Categorial Grammars (CCG).

Christophe Costa Florêncio, Henning Fernau
Learning of Church-Rosser Tree Rewriting Systems

Tree rewriting systems are sets of tree rewriting rules used to compute by repeatedly replacing equal trees in a given formula until the simplest possible form (normal form) is obtained. The Church-Rosser property is certainly one of the most fundamental properties of tree rewriting systems. In such a system the simplest form of a given tree is unique, since the final result does not depend on the order in which the rewriting rules are applied. Church-Rosser systems offer both flexible computing and effective reasoning with equations, and have been intensively researched and widely applied to automated theorem proving, program verification, etc. [3,5].

M. Jayasrirani, D. G. Thomas, Atulya K. Nagar, T. Robinson
Generalizing over Several Learning Settings

We recapitulate inference from membership and equivalence queries, positive and negative samples. Regular languages cannot be learned from one of those information sources only [1,2,3]. Combinations of two sources allowing regular (polynomial) inference are MQs and EQs [4], MQs and positive data [5,6], positive and negative data [7,8]. We sketch a meta-algorithm fully presented in [9] that generalizes over as many combinations of those sources as possible. This includes a survey of pairings for which there are no well-studied algorithms.

Anna Kasprzik
Rademacher Complexity and Grammar Induction Algorithms: What It May (Not) Tell Us

This paper revisits a problem of the evaluation of computational grammatical inference (GI) systems and discusses what role complexity measures can play for the assessment of GI. We provide a motivation for using the Rademacher complexity and give an example showing how this complexity measure can be used in practice.

Sophia Katrenko, Menno van Zaanen
Extracting Shallow Paraphrasing Schemata from Modern Greek Text Using Statistical Significance Testing and Supervised Learning

Paraphrasing normally involves sophisticated linguistic resources for pre-processing. In the present work Modern Greek paraphrases are automatically generated using statistical significance testing in a novel manner for the extraction of applicable reordering schemata of syntactic constituents. Next, supervised filtering helps remove erroneously generated paraphrases, taking into account the context surrounding the reordering position. The proposed process is knowledge-poor, and thus portable to languages with similar syntax, robust and domain-independent. The intended use of the extracted paraphrases is hiding secret information underneath a cover text.

Katia Lida Kermanidis
Learning Subclasses of Parallel Communicating Grammar Systems

Pattern language learning algorithms within the inductive inference model and the query learning setting have been of great interest. In this paper we give an algorithm to learn a parallel communicating grammar system in which the master component is a regular grammar and the other components are pure pattern grammars.

Sindhu J. Kumaar, P. J. Abisha, D. G. Thomas
Enhanced Suffix Arrays as Language Models: Virtual k-Testable Languages

In this article, we propose the use of suffix arrays to efficiently implement n-gram language models with a practically unlimited size n. This approach, which is used with synchronous back-off, allows us to distinguish between alternative sequences using large contexts. We also show that we can build this kind of model with additional information for each symbol, such as part-of-speech tags and dependency information.

The approach can also be viewed as a collection of virtual k-testable automata. Once built, we can directly access the results of any k-testable automaton generated from the input training data. Synchronous back-off automatically identifies the k-testable automaton with the largest feasible k. We have used this approach in several classification tasks.

Herman Stehouwer, Menno van Zaanen
Learning Fuzzy Context-Free Grammar—A Preliminary Report

This paper takes up the task of learning a fuzzy context-free grammar from data. The induction process is divided into two phases: first, a generic grammar is derived from the positive sentences; next, membership grades are assigned to the productions, taking into account the occurrences of the productions in a learning set. The problem of predicting the location of promoters in Escherichia coli is examined. The language of a bacterial sequence can be described using a formal system such as a context-free grammar, and the problem of promoter region recognition can be replaced by grammar induction. The induced fuzzy grammar was compared to other machine learning methods.

Olgierd Unold
Polynomial Time Identification of Strict Prefix Deterministic Finite State Transducers

This paper is concerned with a subclass of finite state transducers, called strict prefix deterministic finite state transducers (SPDFSTs for short), and studies the problem of identifying the subclass in the limit from positive data. After providing some properties of languages accepted by SPDFSTs, we show that the class of SPDFSTs is polynomial time identifiable in the limit from positive data in the sense of Yokomori.

Mitsuo Wakatsuki, Etsuji Tomita
Backmatter
Metadata
Title
Grammatical Inference: Theoretical Results and Applications
Edited by
José M. Sempere
Pedro García
Copyright year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-15488-1
Print ISBN
978-3-642-15487-4
DOI
https://doi.org/10.1007/978-3-642-15488-1