2000 | Book

Algorithmic Learning Theory

11th International Conference, ALT 2000, Sydney, Australia, December 11–13, 2000, Proceedings

Edited by: Hiroki Arimura, Sanjay Jain, Arun Sharma

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

Table of Contents

Frontmatter

INVITED LECTURES

Extracting Information from the Web for Concept Learning and Collaborative Filtering
(Extended Abstract)
Abstract
Previous work on extracting information from the web generally makes few assumptions about how the extracted information will be used. As a consequence, the goal of web-based extraction systems is usually taken to be the creation of high-quality, noise-free data with clear semantics. This is a difficult problem which cannot be completely automated. Here we consider instead the problem of extracting web data for certain machine learning systems: specifically, collaborative filtering (CF) and concept learning (CL) systems. CF and CL systems are highly tolerant of noisy input, and hence much simpler extraction systems can be used in this context. For CL, we will describe a simple method that uses a given set of web pages to construct new features, which reduce the error rate of learned classifiers in a wide variety of situations. For CF, we will describe a simple method that automatically collects useful information from the web without any human intervention. The collected information, represented as “pseudo-users”, can be used to “jumpstart” a CF system when the user base is small (or even absent).
William W. Cohen
The Divide-and-Conquer Manifesto
Abstract
Existing machine learning theory and algorithms have focused on learning an unknown function from training examples, where the unknown function maps from a feature vector to one of a small number of classes. Emerging applications in science and industry require learning much more complex functions that map from complex input spaces (e.g., 2-dimensional maps, time series, and strings) to complex output spaces (e.g., other 2-dimensional maps, time series, and strings). Despite the lack of theory covering such cases, many practical systems have been built that work well in particular applications. These systems all employ some form of divide-and-conquer, where the inputs and outputs are divided into smaller pieces (e.g., “windows”), classified, and then the results are merged to produce an overall solution. This paper defines the problem of divide-and-conquer learning and identifies the key research questions that need to be studied in order to develop practical, general-purpose learning algorithms for divide-and-conquer problems and an associated theory.
Thomas G. Dietterich
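A minimal sketch of the divide-and-conquer pattern described above, in Python; the window width, the padding scheme, and the majority-vote classifier are illustrative choices of mine, not from the paper.

```python
from typing import Callable, List, Sequence

def windows(seq: Sequence, k: int) -> List[list]:
    """Split a sequence into overlapping windows of width 2k+1, padding the
    ends with None so that every position gets a window."""
    pad = [None] * k
    padded = pad + list(seq) + pad
    return [padded[i:i + 2 * k + 1] for i in range(len(seq))]

def divide_classify_merge(seq: Sequence, classify: Callable, k: int = 2) -> list:
    """Divide the input into windows, classify each window independently,
    then merge the per-window predictions back into one output sequence."""
    return [classify(w) for w in windows(seq, k)]

# Toy example: label each position by the majority symbol in its window.
def majority(window: list):
    vals = [v for v in window if v is not None]
    return max(set(vals), key=vals.count)

print(divide_classify_merge("aababbb", majority, k=1))
```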
Sequential Sampling Techniques for Algorithmic Learning Theory
Abstract
A sequential sampling algorithm, or adaptive sampling algorithm, is a sampling algorithm that obtains instances sequentially, one by one, and determines from the instances seen so far whether it has already collected enough of them to achieve a given task. In this paper, we present two typical sequential sampling algorithms. Using simple estimation problems as our examples, we explain when and how to use such sampling algorithms when designing adaptive learning algorithms.
Osamu Watanabe
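A generic illustration of the idea, assuming a Hoeffding-style stopping rule for estimating the mean of a bounded variable; this is a sketch of the general technique, not one of the paper's two specific algorithms.

```python
import math
import random

def adaptive_mean_estimate(draw, eps=0.05, delta=0.05, max_n=10**6):
    """Sample one instance at a time and stop as soon as a Hoeffding-style
    bound certifies the running mean is within eps of the true mean with
    probability at least 1 - delta (union bound over all stopping times)."""
    total, n = 0.0, 0
    while n < max_n:
        total += draw()
        n += 1
        # Confidence-interval half-width after n samples; the n*(n+1) factor
        # pays for testing the bound at every step.
        width = math.sqrt(math.log(2 * n * (n + 1) / delta) / (2 * n))
        if width <= eps:
            return total / n, n
    return total / n, n

if __name__ == "__main__":
    est, used = adaptive_mean_estimate(lambda: random.random() < 0.3)
    print(f"estimate={est:.3f} after {used} samples")
```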

REGULAR CONTRIBUTIONS

Statistical Learning

Towards an Algorithmic Statistics
(Extended Abstract)
Abstract
While Kolmogorov complexity is the accepted absolute measure of the information content of an individual finite object, a similarly absolute notion is needed for the relation between an individual data sample and an individual model summarizing the information in the data, for example, a finite set from which the data sample typically came. The statistical theory based on such relations between individual objects can be called algorithmic statistics, in contrast to ordinary statistical theory, which deals with relations between probabilistic ensembles. We develop a new algorithmic theory of the typical statistic, the sufficient statistic, and the minimal sufficient statistic.
Peter Gács, John Tromp, Paul Vitányi
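For orientation, the central two-part relation behind these notions, in standard Kolmogorov-complexity notation; this is a summary of the framework, not text from the paper.

```latex
% A finite set S with x \in S serves as a model for the data x.  Describing x
% by first describing S and then the index of x inside S gives the two-part bound
\[
  K(x) \;\le\; K(S) + \log_2 |S| + O(1).
\]
% S is an (algorithmic) sufficient statistic for x when this two-part
% description is essentially optimal,
\[
  K(S) + \log_2 |S| \;=\; K(x) + O(1),
\]
% and a minimal sufficient statistic is a sufficient statistic of least K(S).
```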
Minimum Message Length Grouping of Ordered Data
Abstract
Explicit segmentation is the partitioning of data into homogeneous regions by specifying cut-points. W. D. Fisher (1958) gave an early example of explicit segmentation based on the minimisation of squared error. Fisher called this the grouping problem and came up with a polynomial-time Dynamic Programming Algorithm (DPA). Oliver, Baxter and colleagues (1996, 1997, 1998) have applied the information-theoretic Minimum Message Length (MML) principle to explicit segmentation. They have derived formulas for specifying cut-points imprecisely and have empirically shown their criterion to be superior to other segmentation methods (AIC, MDL and BIC). We use a simple MML criterion and Fisher's DPA to perform numerical Bayesian (summing and) integration (using message lengths) over the cut-point location parameters. This gives an estimate of the number of segments, which we then use to estimate the cut-point positions and segment parameters by minimising the MML criterion. The resulting method is shown to achieve lower Kullback-Leibler distances on generated data.
Leigh J. Fitzgibbon, Lloyd Allison, David L. Dowe
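A compact sketch of Fisher-style dynamic programming for the grouping problem, using within-segment squared error as the cost; an MML treatment as in the paper would replace `sse` with a message-length criterion.

```python
def segment(xs, k):
    """Partition xs into k contiguous groups minimising total within-group
    squared error, via Fisher's O(k * n^2) dynamic programme."""
    n = len(xs)
    prefix, prefix_sq = [0.0], [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def sse(i, j):  # squared error of the segment xs[i:j]
        s, sq, m = prefix[j] - prefix[i], prefix_sq[j] - prefix_sq[i], j - i
        return sq - s * s / m

    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                c = cost[g - 1][i] + sse(i, j)
                if c < cost[g][j]:
                    cost[g][j], back[g][j] = c, i
    # Walk the back-pointers to recover the cut-points.
    cuts, j = [], n
    for g in range(k, 0, -1):
        j = back[g][j]
        cuts.append(j)
    return sorted(cuts)[1:], cost[k][n]

print(segment([1, 1, 1, 9, 9, 9, 5, 5], 3))  # cut-points land at the jumps: ([3, 6], 0.0)
```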
Learning From Positive and Unlabeled Examples
Abstract
In many machine learning settings, examples of one class (called the positive class) are easily available, and unlabeled data are abundant. We investigate in this paper the design of learning algorithms from positive and unlabeled data only. Many machine learning and data mining algorithms use examples to estimate probabilities. Therefore, we design an algorithm which is based on positive statistical queries (estimates of probabilities over the set of positive instances) and instance statistical queries (estimates of probabilities over the instance space). Our algorithm guesses the weight of the target concept (the ratio of positive instances in the instance space) with the help of a hypothesis testing algorithm. It is proved that any class learnable in the Statistical Query model [Kea93] such that a lower bound on the weight of any target concept f can be estimated in polynomial time is learnable from positive statistical queries and instance statistical queries only. Then, we design a decision tree induction algorithm, POSC4.5, based on C4.5 [Qui93], using only positive and unlabeled examples. We also give experimental results for this algorithm.
Fabien Letouzey, François Denis, Rémi Gilleron
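The decomposition underlying such algorithms can be sketched as follows: given a guess w of the target weight, any statistical query over labeled data splits into a positive query and an instance query. The code is my illustrative reconstruction, not the paper's.

```python
def sq_from_pos_and_unlabeled(chi, positives, unlabeled, w):
    """Estimate E_D[chi(x, f(x))] using only positive examples, unlabeled
    examples, and a guess w of the target weight P(f(x) = 1), via
      P[chi(x, f(x))] = w*P_pos[chi(x,1)] + P_inst[chi(x,0)] - w*P_pos[chi(x,0)],
    since P[chi(x,0), f(x)=0] = P[chi(x,0)] - w*P_pos[chi(x,0)]."""
    p_pos_1 = sum(chi(x, 1) for x in positives) / len(positives)
    p_pos_0 = sum(chi(x, 0) for x in positives) / len(positives)
    p_inst_0 = sum(chi(x, 0) for x in unlabeled) / len(unlabeled)
    return w * p_pos_1 + p_inst_0 - w * p_pos_0

# Hypothetical query: "the example is positive and its first bit is set".
positives = [(1, 0), (1, 1)]
unlabeled = [(1, 0), (0, 1), (0, 0), (1, 1)]
chi = lambda x, label: label == 1 and x[0] == 1
print(sq_from_pos_and_unlabeled(chi, positives, unlabeled, w=0.4))  # 0.4
```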

Inductive Inference

Learning Erasing Pattern Languages with Queries
Abstract
A pattern is a finite string of constant and variable symbols. The non-erasing language generated by a pattern is the set of all strings of constant symbols that can be obtained by substituting non-empty strings for variables. In order to build the erasing language generated by a pattern, it is also admissible to substitute the empty string.
The present paper deals with the problem of learning erasing pattern languages within Angluin's model of learning with queries. Moreover, the learnability of erasing pattern languages with queries is studied when additional information is available. The results obtained are compared with previously known results for the case in which non-erasing pattern languages have to be learned.
Jochen Nessel, Steffen Lange
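To make the definitions concrete, here is a naive membership test for the erasing language of a pattern; brute force over substitutions, with the variable/constant naming convention (uppercase vs. lowercase) an assumption of this sketch.

```python
def in_erasing_language(pattern, word):
    """Does some substitution of (possibly empty) constant strings for the
    variables of `pattern` yield `word`?  Variables are uppercase, constants
    lowercase; exponential-time backtracking over all cut positions."""
    def match(p, w, subst):
        if not p:
            return not w
        head, rest = p[0], p[1:]
        if head.islower():                    # a constant symbol must match literally
            return bool(w) and w[0] == head and match(rest, w[1:], subst)
        if head in subst:                     # a bound variable must repeat its value
            s = subst[head]
            return w.startswith(s) and match(rest, w[len(s):], subst)
        for cut in range(len(w) + 1):         # bind the variable to any prefix of w,
            subst[head] = w[:cut]             # including the empty string (erasing case)
            if match(rest, w[cut:], subst):
                return True
        del subst[head]
        return False
    return match(pattern, word, {})

print(in_erasing_language("XaX", "a"))    # True: X -> "" (an erasing substitution)
print(in_erasing_language("XaX", "bab"))  # True: X -> "b"
print(in_erasing_language("XaX", "ab"))   # False: both occurrences of X must agree
```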
Learning Recursive Concepts with Anomalies
Abstract
This paper provides a systematic study of inductive inference of indexable concept classes in learning scenarios in which the learner is successful if its final hypothesis describes a finite variant of the target concept - henceforth called learning with anomalies. As usual, we distinguish between learning from only positive data and learning from positive and negative data.
We investigate the following learning models: finite identification, conservative inference, set-driven learning, and behaviorally correct learning. In general, we focus our attention on the case that the number of allowed anomalies is finite but not a priori bounded. However, we also present a few sample results that affect the special case of learning with an a priori bounded number of anomalies. We provide characterizations of the corresponding models of learning with anomalies in terms of finite tell-tale sets. The varieties in the degree of recursiveness of the relevant tell-tale sets observed are already sufficient to quantify the differences in the corresponding models of learning with anomalies.
In addition, we study variants of incremental learning and derive a complete picture concerning the relation of all models of learning with and without anomalies mentioned above.
Gunter Grieser, Steffen Lange, Thomas Zeugmann
Identification of Function Distinguishable Languages
Abstract
We show how appropriately chosen functions, which we call distinguishing functions, can be used to make deterministic finite automata backward deterministic. These ideas can be exploited to design regular language classes identifiable in the limit from positive samples. Special cases of this approach are the k-reversible and terminal distinguishable languages as discussed in [1],[8],[10],[17],[18].
Henning Fernau
A Probabilistic Identification Result
Abstract
The approach used to assess a learning algorithm should reflect the type of environment we place the algorithm within. Often learners are given examples that both contain noise and are governed by a particular distribution. Hence, probabilistic identification in the limit is an appropriate tool for assessing such learners. In this paper we introduce an exact notion of probabilistic identification in the limit based on Laird's thesis. The strategy presented incorporates a variety of learning situations, including noise-free positive examples, noisy independently generated examples, and noise-free positive and negative examples. This yields a useful technique for assessing the effectiveness of a learner when training data is governed by a distribution and is possibly noisy. An attempt has been made to give a preliminary theoretical evaluation of the Q-heuristic. To this end, we have shown that a learner using the Q-heuristic stochastically learns in the limit any finite class of concepts, even when noise is present in the training examples. This result is encouraging because, with enough data, there is the expectation that the learner will induce a correct hypothesis. The proof of this result is extended to show that a restricted infinite class of concepts can also be stochastically learnt in the limit. The restriction requires the hypothesis space to be g-sparse.
Eric McCreath

ILP

A New Framework for Discovering Knowledge from Two-Dimensional Structured Data Using Layout Formal Graph System
Abstract
We present a new framework for discovering knowledge from two-dimensional structured data by using Inductive Logic Programming. Two-dimensional graph structured data such as image or map data are widely used for representing relations and distances between various objects. First, we define a layout term graph suited for representing two-dimensional graph structured data. A layout term graph is a pattern consisting of variables and two-dimensional graph structures. Moreover, we propose the Layout Formal Graph System (LFGS) as a new logic programming system having a layout term graph as a term. LFGS directly deals with graphs having positional relations just like first order terms. Second, we show that LFGS is more powerful than Layout Graph Grammar, which is a generating system consisting of a context-free graph grammar and positional relations. This indicates that LFGS has the richness and advantage of representing knowledge about two-dimensional structured data.
Finally, we design a knowledge discovery system, which uses LFGS as a knowledge representation language and refutably inductive inference as a learning method. In order to give a theoretical foundation of our knowledge discovery system, we give the set of weakly reducing LFGS programs which is a sufficiently large hypothesis space of LFGS programs and show that the hypothesis space is refutably inferable from complete data.
Tomoyuki Uchida, Yuko Itokawa, Takayoshi Shoudai, Tetsuhiro Miyahara, Yasuaki Nakamura
Hypotheses Finding via Residue Hypotheses with the Resolution Principle
Abstract
For given logical formulae B and E such that \( B \not\models E \), hypothesis finding means the generation of a formula H such that \( B \wedge H \models E \). Hypothesis finding constitutes a basic technique for fields of inference, like inductive inference and knowledge discovery. It can also be considered a special case of abduction. In this paper we define a hypothesis finding method which combines residue hypotheses and anti-subsumption. Residue hypotheses were originally proposed in the terminology of the Connection Method, while in this paper we define them in the terminology of resolution. We show that hypothesis finding methods previously proposed on the basis of resolution are embedded into our new method. We also point out that computing residue hypotheses becomes a lot more efficient under the restrictions the previous methods impose on hypotheses, but that these methods miss some hypotheses which our method can find. Finally, we show that our method constitutes an extension of Plotkin's relative subsumption.
Akihiro Yamamoto, Bertram Fronhöfer
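A toy instance of the task, with my own example formulae rather than the authors':

```latex
% Background knowledge B = \mathit{penguin}(a) does not entail the evidence
% E = \mathit{bird}(a), but the hypothesis
%   H = \forall x\,(\mathit{penguin}(x) \rightarrow \mathit{bird}(x))
% repairs the gap:
\[
  B \not\models E, \qquad B \wedge H \models E .
\]
```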
Conceptual Classifications Guided by a Concept Hierarchy
Abstract
Given a concept hierarchy and a set of instances of multiple concepts, we consider the revision problem in which the primary concepts subsuming the instances are judged inadequate by a user. The basic strategy for resolving this conflict is to utilize the information the hierarchy involves in order to classify the instance set and to form a set of several intermediate concepts. We refer to a strategy of this kind as hierarchy-guided classification. For this purpose, we introduce a condition, the Similarity Independence Condition, which requires that the similarities between the hierarchy and the instances remain invariant even when we generalize those instances to some intermediate concept. Based on this condition, we present an algorithm for classifying instances and for modifying the concept hierarchy.
Yuhsuke Itoh, Makoto Haraguchi
Learning Taxonomic Relation by Case-based Reasoning
Abstract
In this paper, we propose a method for learning a minimal casebase that represents a taxonomic relation in a tree-structured concept hierarchy. We first propose case-based taxonomic reasoning and show an upper bound on the number of positive and negative cases needed to represent a relation. Then, we give a method for learning a minimal casebase with sampling and membership queries. We analyze this learning method in terms of sample complexity and query complexity in the framework of PAC learning.
Ken Satoh

Complexity

Average-Case Analysis of Classification Algorithms for Boolean Functions and Decision Trees
Abstract
We conduct an average-case analysis of the generalization error rate of classification algorithms with finite model classes. Unlike worst-case approaches, we do not rely on bounds that hold for all possible learning problems. Instead, we study the behavior of a learning algorithm for a given problem, taking properties of the problem and the learner into account. The solution depends only on known quantities (e.g., the sample size) and on the histogram of error rates in the model class, which we determine for the case where the sought target is a randomly drawn Boolean function. We then discuss how the error histogram can be estimated from a given sample, and thus show how the analysis can be applied approximately in the more realistic scenario where the target is unknown. Experiments show that our analysis can predict the behavior of decision tree algorithms fairly accurately even if the error histogram is estimated from a sample.
Tobias Scheffer
Self-duality of Bounded Monotone Boolean Functions and Related Problems
Abstract
In this paper we show the equivalence between the problem of determining self-duality of a Boolean function in DNF and a special type of satisfiability problem called NAESPI. Eiter and Gottlob [8] use a result from [2] to show that self-duality of monotone Boolean functions whose clause sizes are bounded by some constant can be determined in polynomial time. We show that the self-duality of instances in the class studied by Eiter and Gottlob can be determined in time linear in the number of clauses in the input, thereby strengthening their result. Domingo [7] recently showed that self-duality of Boolean functions where each clause is bounded by \( \sqrt{\log n} \) can be solved in polynomial time. Our linear-time algorithm for clauses of bounded size in fact solves the \( \sqrt{\log n} \)-bounded self-duality problem in \( O\left( n^2 \sqrt{\log n} \right) \) time, a better bound than the \( O(n^3) \) algorithm of Domingo [7]. Another class of self-dual functions arising naturally in application domains has the property that every pair of terms in f intersects in at most a constant number of variables. The equivalent subclass of NAESPI is c-bounded NAESPI, and we show that it can be solved in polynomial time when c is a constant. We also give an alternative characterization of the almost self-dual functions proposed by Bioch and Ibaraki [5] in terms of NAESPI instances which admit solutions of a 'particular' type.
Daya Ram Gaur, Ramesh Krishnamurti
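For reference, self-duality means f(x) ≠ f(¬x) for every point x. A brute-force check, exponential in the number of variables, which is exactly the cost the paper's algorithms avoid for bounded clause sizes:

```python
from itertools import product

def eval_dnf(terms, point):
    """Evaluate a monotone DNF; each term is a frozenset of variable indices
    that must all be 1 at `point` (a tuple of 0/1 values)."""
    return any(all(point[v] for v in t) for t in terms)

def is_self_dual(terms, n):
    """Brute-force check of f(x) != f(complement of x) over all 2^n points."""
    for point in product((0, 1), repeat=n):
        comp = tuple(1 - b for b in point)
        if eval_dnf(terms, point) == eval_dnf(terms, comp):
            return False
    return True

# Majority of three variables is the classic self-dual monotone function.
maj3 = [frozenset(t) for t in ({0, 1}, {0, 2}, {1, 2})]
print(is_self_dual(maj3, 3))   # True
```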
Sharper Bounds for the Hardness of Prototype and Feature Selection
Abstract
As pointed out by Blum [Blu94], “nearly all results in Machine Learning [...] deal with problems of separating relevant from irrelevant information in some way”. This paper is concerned with structural complexity issues regarding the selection of relevant Prototypes or Features. We give the first results proving that both problems can be much harder than expected in the literature for various notions of relevance. In particular, the worst-case bounds achievable by any efficient algorithm are proven to be very large, most of the time not so far from trivial bounds. We think these results give a theoretical justification for the numerous heuristic approaches found in the literature to cope with these problems.
Richard Nock, Marc Sebban
On the Hardness of Learning Acyclic Conjunctive Queries
Abstract
A conjunctive query problem in relational database theory is the problem of determining whether or not a tuple belongs to the answer of a conjunctive query over a database. Here, a tuple and a conjunctive query are regarded as a ground atom and a nonrecursive function-free definite clause, respectively. While the conjunctive query problem is NP-complete in general, it becomes efficiently solvable if the conjunctive query is acyclic. Concerned with this problem, we investigate the learnability of acyclic conjunctive queries from an instance with a j-database, which is a finite set of ground unit clauses containing at most j-ary predicate symbols. We deal with two kinds of instances: a simple instance, which is a set of ground atoms, and an extended instance, which is a set of pairs of a ground atom and a description. We show that, for each j ≥ 3, there exists a j-database such that acyclic conjunctive queries are not polynomially predictable from an extended instance under the cryptographic assumptions. We also show that, for each n > 0 and each polynomial p, there exists a p(n)-database of size \( O(2^{p(n)}) \) such that predicting Boolean formulae of size p(n) over n variables reduces to predicting acyclic conjunctive queries from a simple instance. This result implies that, if we can ignore the size of the database, then acyclic conjunctive queries are not polynomially predictable from a simple instance under the cryptographic assumptions. Finally, we show that, if either j = 1, or j = 2 and the number of elements of the database is at most l (≥ 0), then acyclic conjunctive queries are PAC-learnable from a simple instance with j-databases.
Kouichi Hirata

Neural Network and Other Paradigms

Dynamic Hand Gesture Recognition Based On Randomized Self-Organizing Map Algorithm
Abstract
Gesture recognition is an appealing tool for natural interfaces with computers, especially for physically impaired persons. In this paper, we propose to use the Self-Organizing Map (SOM) to recognize the posture images of hand gestures, since the competition algorithm of SOM alleviates many of the difficulties associated with gesture recognition. However, the recognition time for one image in the SOM network must be reduced to the range of normal video camera rates; this permits the network to accept dynamic input images and to perform on-line recognition of hand gestures. To achieve this, the Randomized Self-Organizing Map (RSOM) algorithm is proposed as a new recognition algorithm for SOM. With the RSOM algorithm, the recognition time for one image is reduced to 12.4% of that of the normal SOM competition algorithm, with 100% accuracy, allowing the network to recognize images within the range of normal video rates. Experimental results on recognizing six dynamic hand gestures using the RSOM algorithm are presented.
Tarek El. Tobely, Yuichiro Yoshiki, Ryuichi Tsuda, Naoyuki Tsuruta, Makoto Amamiya
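The speed-up targets the SOM competition step. A generic illustration of restricting the winner search to a random subset of nodes; this simplification is mine and omits the refinements of the actual RSOM algorithm.

```python
import random
import numpy as np

def best_matching_unit(weights, x, sample_frac=None, rng=random):
    """SOM competition step: return the index of the node whose weight vector
    is closest to the input x.  With sample_frac set, search only a random
    subset of the nodes, trading exactness for recognition speed."""
    n = len(weights)
    idx = range(n)
    if sample_frac is not None:
        idx = rng.sample(range(n), max(1, int(sample_frac * n)))
    return min(idx, key=lambda i: np.linalg.norm(weights[i] - x))

weights = np.random.rand(400, 64)   # stand-in for a trained 20x20 map of codebooks
x = np.random.rand(64)
print(best_matching_unit(weights, x))          # full competition
print(best_matching_unit(weights, x, 0.125))   # randomized: ~12% of the nodes
```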
On Approximate Learning by Multi-layered Feedforward Circuits
Abstract
We consider the problem of efficient approximate learning by multi-layered feedforward circuits subject to two objective functions.
First, we consider the objective of maximizing the ratio of correctly classified points to the training set size (e.g., see [3],[5]). We show that, for single hidden layer threshold circuits with n hidden nodes and varying input dimension, approximation of this ratio within a relative error \( c/n^3 \), for some positive constant c, is NP-hard even if the number of examples is limited with respect to n. For architectures with two hidden nodes (e.g., as in [6]), approximating the objective within some fixed factor is NP-hard even if any sigmoid-like activation function in the hidden layer and ε-separation of the output [19] are considered, or if the semilinear activation function is substituted for the threshold function.
Next, we consider the objective of minimizing the failure ratio [2]. We show that it is NP-hard to approximate the failure ratio within any constant larger than 1 for a multilayered threshold circuit, provided the input biases are zero. Furthermore, even weak approximation of this objective is almost NP-hard.
Bhaskar DasGupta, Barbara Hammer
The Last-Step Minimax Algorithm
Abstract
We consider on-line density estimation with a parameterized density from an exponential family. In each trial t the learner predicts a parameter \( \theta_t \). Then it receives an instance \( x_t \) chosen by the adversary and incurs loss \( -\ln p(x_t \mid \theta_t) \), the negative log-likelihood of \( x_t \) w.r.t. the predicted density of the learner. The performance of the learner is measured by the regret, defined as the total loss of the learner minus the total loss of the best parameter chosen off-line. We develop an algorithm, called the Last-step Minimax Algorithm, that predicts with the minimax optimal parameter under the assumption that the current trial is the last one. For one-dimensional exponential families, we give an explicit form of the prediction of the Last-step Minimax Algorithm and show that its regret is \( O(\ln T) \), where T is the number of trials. In particular, for Bernoulli density estimation the Last-step Minimax Algorithm is slightly better than the standard Krichevsky-Trofimov probability estimator.
Eiji Takimoto, Manfred K. Warmuth
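For the Bernoulli case mentioned at the end, the Krichevsky-Trofimov baseline is easy to state. A sketch of online Bernoulli density estimation under log-loss with the KT estimator; the Last-step Minimax predictor itself differs slightly and is derived in the paper.

```python
import math

def kt_online_logloss(bits):
    """Predict each bit with the Krichevsky-Trofimov estimator
    p_t = (k + 1/2) / (t + 1), where k is the number of 1s seen so far,
    and accumulate the log-loss -ln p(x_t)."""
    loss, k = 0.0, 0
    for t, x in enumerate(bits):
        p_one = (k + 0.5) / (t + 1)
        loss += -math.log(p_one if x == 1 else 1.0 - p_one)
        k += x
    return loss

def best_offline_loss(bits):
    """Loss of the best fixed Bernoulli parameter chosen in hindsight."""
    n, k = len(bits), sum(bits)
    if k in (0, n):
        return 0.0
    q = k / n
    return -(k * math.log(q) + (n - k) * math.log(1 - q))

bits = [1, 0, 1, 1, 0, 1, 1, 1]
print(kt_online_logloss(bits) - best_offline_loss(bits))  # the regret, O(ln T)
```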
Rough Sets and Ordinal Classification
Abstract
The classical theory of Rough Sets describes objects by discrete attributes and does not take into account the ordering of attribute values. This paper proposes a modification of the Rough Set approach applicable to monotone datasets. We introduce the concepts of a monotone discernibility matrix and a monotone (object) reduct. Furthermore, we use the theory of monotone discrete functions developed earlier by the first author to represent and compute decision rules. In particular, we use monotone extensions, decision lists and dualization to compute classification rules that cover the whole input space. The theory is applied to the bankruptcy problem.
Jan C. Bioch, Viara Popova
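For background, the classical (unordered) discernibility matrix that the monotone variant modifies; this sketch deliberately ignores the ordering of attribute values, which is precisely the paper's point of departure.

```python
def discernibility_matrix(objects, decisions):
    """Classical Rough Set construct: for each pair of objects with different
    decisions, record the set of attributes on which they differ.  The
    monotone variant in the paper instead records attributes on which one
    object dominates the other."""
    n = len(objects)
    matrix = {}
    for i in range(n):
        for j in range(i + 1, n):
            if decisions[i] != decisions[j]:
                matrix[(i, j)] = {a for a, (u, v)
                                  in enumerate(zip(objects[i], objects[j]))
                                  if u != v}
    return matrix

objs = [(1, 0, 2), (1, 1, 0), (0, 0, 2)]
print(discernibility_matrix(objs, ["good", "bad", "good"]))
# {(0, 1): {1, 2}, (1, 2): {0, 1, 2}}
```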

Support Vector Machines

A Note on the Generalization Performance of Kernel Classifiers with Margin
Abstract
We present distribution-independent bounds on the generalization misclassification performance of a family of kernel classifiers with margin. Support Vector Machine (SVM) classifiers stem from this class of machines. The bounds are derived through computations of the \( V_\gamma \) dimension of a family of loss functions to which the SVM loss function belongs. Bounds that use functions of margin distributions (i.e., functions of the slack variables of SVM) are derived.
Theodoros Evgeniou, Massimiliano Pontil
On the Noise Model of Support Vector Machines Regression
Abstract
Support Vector Machines Regression (SVMR) is a learning technique where the goodness of fit is measured not by the usual quadratic loss function (the mean square error), but by a different loss function called the ε-insensitive loss function (ILF), which is similar to loss functions used in the field of robust statistics. The quadratic loss function is well justified under the assumption of Gaussian additive noise. However, the noise model underlying the choice of the ILF is not clear. In this paper the use of the ILF is justified under the assumption that the noise is additive and Gaussian, where the variance and mean of the Gaussian are themselves random variables. The probability distributions of the variance and mean are stated explicitly. While this work is presented in the framework of SVMR, it can be extended to justify non-quadratic loss functions in any Maximum Likelihood or Maximum A Posteriori approach. It applies not only to the ILF, but to a much broader class of loss functions.
Massimiliano Pontil, Sayan Mukherjee, Federico Girosi
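For reference, the two losses being contrasted, using their standard definitions:

```python
def eps_insensitive(y, f_x, eps=0.1):
    """SVR's epsilon-insensitive loss: residuals inside the eps-tube cost
    nothing; outside it, the cost grows linearly (cf. robust statistics)."""
    return max(0.0, abs(y - f_x) - eps)

def quadratic(y, f_x):
    """The usual squared-error loss, optimal for fixed-variance Gaussian noise."""
    return (y - f_x) ** 2

for r in (0.05, 0.2, 1.0):
    print(r, eps_insensitive(0.0, r), quadratic(0.0, r))
```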
Computationally Efficient Transductive Machines
Abstract
In this paper we propose a new algorithm for providing confidence and credibility values for predictions on multi-class pattern recognition problems, implemented using Support Vector machines. Previous algorithms proposed to achieve this are very processing-intensive and are only practical for small data sets. We present here a method which overcomes these limitations and can deal with larger data sets (such as the US Postal Service database). The measures of confidence and credibility given by the algorithm are shown empirically to reflect the quality of the predictions obtained by the algorithm, and are comparable to those given by the less computationally efficient method. In addition, the overall performance of the algorithm is shown to be comparable to that of other techniques (such as standard Support Vector machines), which simply give flat predictions and do not provide the extra confidence/credibility measures.
Craig Saunders, Alex Gammerman, Volodya Vovk
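A minimal sketch of the confidence/credibility machinery used in this line of work, assuming nonconformity ("strangeness") scores are already available; computing those scores efficiently is the paper's contribution and is not shown here.

```python
def p_value(strangeness, new_strangeness):
    """Fraction of examples at least as 'strange' as the new one: the core
    quantity behind the confidence and credibility values discussed above."""
    all_vals = strangeness + [new_strangeness]
    return sum(s >= new_strangeness for s in all_vals) / len(all_vals)

def confidence_and_credibility(p_values_by_label):
    """Credibility: the p-value of the most likely label.  Confidence: one
    minus the second-largest p-value (how firmly alternatives are rejected)."""
    ranked = sorted(p_values_by_label.values(), reverse=True)
    return 1.0 - ranked[1], ranked[0]

pvals = {"0": 0.62, "1": 0.04, "2": 0.01}   # hypothetical per-label p-values
print(confidence_and_credibility(pvals))     # (0.96, 0.62)
```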
Backmatter
Metadata
Title
Algorithmic Learning Theory
Edited by
Hiroki Arimura
Sanjay Jain
Arun Sharma
Copyright Year
2000
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-40992-2
Print ISBN
978-3-540-41237-3
DOI
https://doi.org/10.1007/3-540-40992-0