
1995 | Book

Computational Learning Theory

Second European Conference, EuroCOLT '95, Barcelona, Spain, March 13–15, 1995. Proceedings

Edited by: Paul Vitányi

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

This volume presents the proceedings of the Second European Conference on Computational Learning Theory (EuroCOLT '95), held in Barcelona, Spain in March 1995.
The book contains full versions of the 28 papers accepted for presentation at the conference as well as three invited papers. All relevant topics in fundamental studies of computational aspects of artificial and natural learning systems and machine learning are covered; in particular artificial and biological neural networks, genetic and evolutionary algorithms, robotics, pattern recognition, inductive logic programming, decision theory, Bayesian/MDL estimation, statistical physics, and cryptography are addressed.

Table of contents

Frontmatter
The discovery of algorithmic probability: A guide for the programming of true creativity
Ray J. Solomonoff
A decision-theoretic generalization of on-line learning and an application to boosting

We consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update rule of Littlestone and Warmuth [10] can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games, and prediction of points in ℝ^n. We also show how the weight-update rule can be used to derive a new boosting algorithm which does not require prior knowledge about the performance of the weak learning algorithm.

Yoav Freund, Robert E. Schapire
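
The multiplicative weight-update rule referred to in the abstract above can be sketched in a few lines of Python. This is a minimal Hedge-style illustration rather than the authors' exact algorithm; the function name, the assumption that losses lie in [0, 1], and the fixed update factor beta are choices made for the example.

def hedge(losses, beta=0.9):
    """Multiplicative weight update over N options.

    losses: list of trials, each a list of per-option losses in [0, 1].
    beta:   update factor in (0, 1); smaller values react more aggressively.
    Returns the final normalized weight vector (a distribution over options).
    """
    n = len(losses[0])
    weights = [1.0] * n
    for trial in losses:
        # Shrink each option's weight multiplicatively in its observed loss.
        weights = [w * (beta ** l) for w, l in zip(weights, trial)]
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize to a distribution
    return weights

# Example: option 1 consistently incurs less loss, so it ends up with most weight.
print(hedge([[1.0, 0.2], [0.8, 0.1], [0.9, 0.0]]))
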
Online learning versus offline learning

We present an off-line variant of the mistake-bound model of learning. Just as in the well-studied on-line model, a learner in the off-line model has to learn an unknown concept from a sequence of elements of the instance space on which he makes “guess and test” trials. In both models, the aim of the learner is to make as few mistakes as possible. The difference between the models is that, while in the on-line model only the set of possible elements is known, in the off-line model the sequence of elements (i.e., the identity of the elements as well as the order in which they are to be presented) is known to the learner in advance. We give a combinatorial characterization of the number of mistakes in the off-line model. We apply this characterization to solve several natural questions that arise for the new model. First, we compare the mistake bounds of an off-line learner to those of a learner learning the same concept classes in the on-line scenario. We show that the number of mistakes in on-line learning is at most a log n factor more than in off-line learning, where n is the length of the sequence. In addition, we show that if there is an off-line algorithm that does not make more than a constant number of mistakes for each sequence, then there is an on-line algorithm that also does not make more than a constant number of mistakes. The second issue we address is the effect of the ordering of the elements on the number of mistakes of an off-line learner. It turns out that there are sequences on which an off-line learner can guarantee at most one mistake, yet a permutation of the same sequence forces him to err on many elements. We prove, however, that the gap between the off-line mistake bounds on permutations of the same sequence of n elements cannot be larger than a multiplicative factor of log n, and we present examples that attain such a gap.

Shai Ben-David, Eyal Kushilevitz, Yishay Mansour
Learning distributions by their density levels — A paradigm for learning without a teacher

Can we learn from unlabeled examples? We consider here the unsupervised learning scenario in which the examples provided are not labeled (and are not necessarily all positive or all negative). The only information about their membership is indirectly disclosed to the student through the sampling distribution. We view this problem as a restricted instance of the fundamental issue of inferring information about a probability distribution from the random samples it generates. We propose a framework, density-level-learning, for acquiring some partial information about a distribution, and develop a model of unsupervised concept learning based on this framework. We investigate the basic features of these types of learning and provide lower and upper bounds on the sample complexity of these tasks. Our main result is that the learnability of a class in this setting is equivalent to the finiteness of its VC-dimension. One direction of the proof involves a reduction of density-level-learnability to PAC learnability, while the sufficiency condition is proved through the introduction of a generic learning algorithm.

Shai Ben-David, Michael Lindenbaum
Tight worst-case loss bounds for predicting with expert advice

We consider on-line algorithms for predicting binary outcomes, when the algorithm has available the predictions made by N experts. For a sequence of trials, we compute total losses for both the algorithm and the experts under a loss function. At the end of the trial sequence, we compare the total loss of the algorithm to the total loss of the best expert, i.e., the expert with the least loss on the particular trial sequence. Vovk has introduced a simple algorithm for this prediction problem and proved that, for a large class of loss functions with binary outcomes, the total loss of the algorithm exceeds the total loss of the best expert by at most c ln N, where c is a constant determined by the loss function. This upper bound does not depend on any assumptions about how the experts' predictions or the outcomes are generated, and the trial sequence can be arbitrarily long. We give a straightforward alternative method for finding the correct value of c and show by a lower bound that, for this value of c, the upper bound is asymptotically tight. The lower bound is based on a probabilistic adversary argument. The class of loss functions for which the c ln N upper bound holds includes the square loss, the logarithmic loss, and the Hellinger loss. We also consider another class of loss functions, including the absolute loss, for which we have an Ω(√(ℓ log N)) lower bound, where ℓ is the number of trials.

David Haussler, Jyrki Kivinen, Manfred K. Warmuth
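
To give a concrete picture of the prediction-with-expert-advice setting analyzed above, here is a hedged Python sketch of an exponentially weighted forecaster. It is a generic weighted-average variant under squared loss, not Vovk's aggregating algorithm as studied in the paper, and the learning rate eta is an assumed parameter; tuning it to the loss function is exactly where the constant c arises.

import math

def predict_with_experts(expert_predictions, outcomes, eta=2.0):
    """Exponentially weighted average of N experts' forecasts of binary outcomes.

    expert_predictions: list of trials, each a list of N predictions in [0, 1].
    outcomes:           list of binary outcomes (0 or 1), one per trial.
    eta:                learning rate; its best value depends on the loss function.
    Returns the algorithm's predictions, one per trial.
    """
    n = len(expert_predictions[0])
    weights = [1.0] * n
    algo_predictions = []
    for preds, y in zip(expert_predictions, outcomes):
        total = sum(weights)
        # Predict with the weighted average of the experts' predictions.
        algo_predictions.append(sum(w * p for w, p in zip(weights, preds)) / total)
        # Down-weight each expert exponentially in its squared loss on this trial.
        weights = [w * math.exp(-eta * (p - y) ** 2) for w, p in zip(weights, preds)]
    return algo_predictions
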
On-line maximum likelihood prediction with respect to general loss functions

This paper introduces a new family of deterministic and stochastic on-line prediction algorithms which perform well with respect to general loss functions, and analyzes their behavior in terms of expected loss bounds. The algorithms use parametric probabilistic models regardless of the kind of loss function used. The key idea of the algorithms is to iteratively estimate the probabilistic model using the maximum likelihood method, and then to construct the optimal prediction function which minimizes the average of the loss taken with respect to the estimated probabilistic model. A future outcome is predicted using this optimal prediction function. We analyze the algorithms for the cases where the target distribution is 1) k-dimensional parametric and k is known, 2) k-dimensional parametric but k is unknown, and 3) non-parametric. For all these cases, we derive upper bounds on the expected instantaneous or cumulative losses of the algorithms with respect to a large family of loss functions satisfying the constraint introduced by Merhav and Feder. These loss bounds show new universal relations among the expected prediction accuracy, the indexes of the loss function, the complexity of the target rule, and the number of training examples.

Kenji Yamanishi
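
The two-step scheme described above, maximum-likelihood estimation followed by minimization of the expected loss under the estimated model, can be illustrated with a toy Bernoulli model in Python. This is only a sketch under that modeling assumption, not the paper's general algorithm; the grid search over candidate predictions is likewise a simplification.

def online_ml_predict(past_outcomes, loss, grid_size=101):
    """One prediction step, specialized (for illustration only) to a Bernoulli model.

    past_outcomes: list of observed binary outcomes (0 or 1).
    loss:          loss(prediction, outcome) -> float, an arbitrary loss function.
    Returns a prediction minimizing expected loss under the ML estimate.
    """
    # Maximum-likelihood estimate of P(outcome = 1) from the data seen so far.
    p_hat = sum(past_outcomes) / len(past_outcomes) if past_outcomes else 0.5
    # Choose the prediction minimizing the loss averaged over the estimated
    # distribution (searched on a grid for simplicity).
    candidates = [i / (grid_size - 1) for i in range(grid_size)]
    return min(candidates,
               key=lambda z: p_hat * loss(z, 1) + (1 - p_hat) * loss(z, 0))

# Under squared loss the optimal prediction is the ML estimate itself;
# under absolute loss it is 0 or 1, whichever outcome is more likely.
print(online_ml_predict([1, 1, 0, 1], lambda z, y: (z - y) ** 2))
print(online_ml_predict([1, 1, 0, 1], lambda z, y: abs(z - y)))
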
The power of procrastination in inductive inference: How it depends on used ordinal notations

We consider inductive inference with procrastination. Usually it is defined using constructive ordinals. For constructive ordinals there exist many different systems of notations. In this paper we study how the power of inductive inference depends on the system of notations used. We prove that for constructive ordinals α smaller than ω^2, each set of total recursive functions which is EX_α-identifiable in one system of notations is EX_α-identifiable in an arbitrary system of notations. For EX_{ω^2}-identification this property does not hold. We also consider the question whether, among all systems of notations, there exist a strongest and a weakest system. We prove that there exists a system of notations S such that every set of functions which is EX_α-identifiable in some system of notations is EX_α-identifiable in S, too. If ω^2 ≤ α < ω^2·2, there exists a system A such that each set of functions which is EX_α-identifiable in A is EX_α-identifiable in an arbitrary system of notations. Further, we consider possible tradeoffs between using larger ordinals and using more complicated systems of notations. We prove that, in general, there are no such tradeoffs.

Andris Ambainis
Learnability of Kolmogorov-easy circuit expressions via queries

Circuit expressions were introduced to provide a natural link between Computational Learning and certain aspects of Structural Complexity. Upper and lower bounds on the learnability of circuit expressions are known. We study here the case in which the circuit expressions are of low (time-bounded) Kolmogorov complexity. We show that these are polynomial-time learnable from membership queries in the presence of an NP oracle. We also exactly characterize the sets that have such circuit expressions, and precisely identify the subclass whose circuit expressions can be learned from membership queries alone. The extension of the results to various Kolmogorov complexity bounds is discussed.

José L. Balcázar, Harry Buhrman, Montserrat Hermo
Trading monotonicity demands versus mind changes

The present paper deals with the learnability of indexed families ℒ of uniformly recursive languages from positive data. We consider the influence of three monotonicity demands on the efficiency of the learning process. The efficiency of learning is measured in terms of the number of mind changes a learning algorithm is allowed to perform. The three notions of monotonicity reflect different formalizations of the requirement that the learner has to produce better and better generalizations when fed more and more data on the target concept. We distinguish between exact learnability (ℒ has to be inferred with respect to ℒ), class preserving learning (ℒ has to be inferred with respect to some suitably chosen enumeration of all the languages from ℒ), and class comprising inference (ℒ has to be learned with respect to some suitably chosen enumeration of uniformly recursive languages containing at least all the languages from ℒ). In particular, we prove that a relaxation of the relevant monotonicity requirement may result in an arbitrarily large speed-up.

Steffen Lange, Thomas Zeugmann
Learning recursive functions from approximations
Extended abstract

Investigated is algorithmic learning, in the limit, of correct programs for recursive functions f from both input/output examples of f and several interesting varieties of approximate additional (algorithmic) information about f. Specifically considered, as such approximate additional information about f, are Rose's frequency computations for f and several natural generalizations from the literature, each generalization involving programs for restricted trees of recursive functions which have f as a branch. Considered as the types of trees are those with bounded variation, finite width, and finite rank. For the case of learning final correct programs for recursive functions, EX-learning, where the additional information involves frequency computations, an insightful and interestingly complex combinatorial characterization of learning power is presented as a function of the frequency parameters. Pitt has previously characterized the learning power of probabilistic machines in terms of the size of equally powerful teams of (deterministic) machines. For EX-learning (as well as for BC-learning, where a final sequence of correct programs is learned), for the cases of providing the types of additional information considered in this paper, the optimal machine team size is determined such that the entire class of recursive functions is learnable.

John Case, Susanne Kaufmann, Efim Kinber, Martin Kummer
On the intrinsic complexity of learning

A new view of learning is presented. The basis of this view is a natural notion of reduction. We prove completeness and relative difficulty results. An infinite hierarchy of concepts that are intrinsically more and more difficult to learn is presented. Our results indicate that the complexity notion captured by our new notion of reduction differs dramatically from that of the traditional studies of the complexity of the algorithms performing learning tasks.

Rūsiņš Freivalds, Efim Kinber, Carl H. Smith
The structure of intrinsic complexity of learning

Recently, a new approach to the study of “intrinsic” complexity of learning has originated in the work of Freivalds, and has been investigated for identification in the limit of functions by Freivalds, Kinber, and Smith, and for identification in the limit of languages by Jain and Sharma. Instead of concentrating on the complexity of the learning algorithm, this approach uses the notion of reducibility to investigate the complexity of the concept classes being learned. Three representative classes have been presented that classify learning problems of increasing difficulty:
(a) classes that can be learned by machines that confirm their success (singleton languages);
(b) classes that can be learned by machines that cannot confirm their success but can provide an upper bound on the number of mind changes after inspecting an element of the language (pattern languages);
(c) classes that can be learned by machines that can neither confirm success nor provide an upper bound on the number of mind changes (finite languages).
The present paper shows that there is an infinite hierarchy of language classes that represent learning problems of increasing difficulty. Language classes constituting this hierarchy are languages with bounded cardinality, and it can be shown that collections of languages that can be identified using a bounded number of mind changes are reducible to the classes in this hierarchy. It is also shown that language classes in this hierarchy are incomparable, under the reductions introduced, to the collection of pattern languages. The structure of intrinsic complexity is shown to be rich, as any finite, acyclic, directed graph can be embedded in this structure. However, it is also established that this structure is not dense.

Sanjay Jain, Arun Sharma
Kolmogorov numberings and minimal identification

Identification of programs for computable functions from their graphs by algorithmic devices is a well studied problem in learning theory. Freivalds and Chen consider identification of ‘minimal’ and ‘nearly minimal’ programs for functions from their graphs. To address certain problems in minimal identification for Gödel numberings, Freivalds later considered minimal identification in Kolmogorov Numberings. Kolmogorov numberings are in some sense optimal numberings and have some nice properties. We prove certain hierarchy results for minimal identification in every Kolmogorov numbering. In addition we also compare minimal identification in Gödel numbering versus minimal identification in Kolmogorov numberings.

Rusins Freivalds, Sanjay Jain
Stochastic complexity in learning

This is an expository paper on the latest results in the theory of stochastic complexity and the associated MDL principle, with special interest in modeling problems arising in machine learning. As an illustration we discuss the problem of designing MDL decision trees, which are meant to improve on the earlier designs in two ways. First, by use of the sharper formula for the stochastic complexity at the nodes, the previously observed tendency to produce trees that are too small appears to be overcome. Secondly, a dynamic-programming-based pruning algorithm is described for finding the optimal trees, which generalizes an algorithm described in Nohre (1994).

J. Rissanen
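
The dynamic-programming pruning mentioned in the abstract above amounts to a bottom-up comparison of code lengths. The following Python sketch is only illustrative, not Rissanen's algorithm: the leaf_cost and split_cost callbacks are assumed stand-ins for the stochastic-complexity terms computed at each node, and the nested-dict tree representation is likewise an assumption made for the example.

def mdl_prune(node, leaf_cost, split_cost):
    """Bottom-up dynamic-programming pruning.

    node:       {'leaf': data} for leaves, or {'data': data, 'children': [subtrees]}.
    leaf_cost:  data -> code length of encoding the data at a single leaf.
    split_cost: data -> code length of describing the split at this node.
    Returns (pruned_tree, total_code_length).
    """
    if 'children' not in node:
        return node, leaf_cost(node['leaf'])
    pruned_children, subtree_cost = [], split_cost(node['data'])
    for child in node['children']:
        pruned, cost = mdl_prune(child, leaf_cost, split_cost)
        pruned_children.append(pruned)
        subtree_cost += cost
    collapsed_cost = leaf_cost(node['data'])
    # Keep the split only if it yields a shorter total description length.
    if subtree_cost < collapsed_cost:
        return {'data': node['data'], 'children': pruned_children}, subtree_cost
    return {'leaf': node['data']}, collapsed_cost
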
Function learning from interpolation (extended abstract)

In the problem of learning a real-valued function from examples, in an extension of the ‘PAC’ model, a learner sees a sequence of values of an unknown function at a number of randomly chosen points. On the basis of these examples, the learner chooses a function—called a hypothesis—from some class H of hypotheses, with the aim that the learner's hypothesis is close to the target function on future random examples. In this paper we require that, for most training samples, with high probability the absolute difference between the values of the learner's hypothesis and the target function on a random point is small. A natural learning algorithm to consider is one that chooses a function in H that is close to the target function on the training examples. This, together with the success criterion described above, leads to the definition of a statistical property which we would wish a class of functions to possess. We derive a characterization of function classes that have this property, in terms of their ‘fat-shattering function’, a notion that has proven useful in other problems in computational learning theory, such as the learnability of probabilistic concepts and the learnability of functions in the presence of random noise. This work has applications to the learning of functions in the presence of malicious noise.

Martin Anthony, Peter Bartlett
Approximation and learning of convex superpositions

We present a fairly general method for constructing classes of functions of finite scale-sensitive dimension (the scale-sensitive dimension is a generalization of the Vapnik-Chervonenkis dimension to real-valued functions). The construction is as follows: start from a class F of functions of finite VC dimension, take the convex hull coF of F, and then take the closure ℒ of coF in an appropriate sense. As an example, we study in more detail the case where F is the class of threshold functions. It is shown that ℒ includes two important classes of functions: neural networks with one hidden layer and bounded output weights, and the so-called Γ class of Barron, which was shown to satisfy a number of interesting approximation and closure properties. We also give an integral representation in the form of a “continuous neural network” which generalizes Barron's. It is shown that the existence of an integral representation is equivalent to both L_2 and L_∞ approximability.

Leonid Gurvits, Pascal Koiran
Minimum description length estimators under the optimal coding scheme

Following Rissanen, we consider the statistical model {P_θ | θ ∈ Θ} as a code-book, θ indexing the codes. To obtain a single code, we first encode some θ and then encode our data x with the code corresponding to this θ. Rissanen's minimum description length principle recommends using the value of θ minimizing the total code length as an estimate of θ given the data x. For some standard statistical models we find easily computable estimators which respect this principle when θ is encoded with the asymptotically optimal coding scheme due to Levin and Chaitin.

V. G. Vovk
MDL learning of unions of simple pattern languages from positive examples

The following learning task is considered: given a set S of strings consisting of basic symbols and a set C of patterns consisting of basic symbols and variables, compute a concise subset C′ ⊆ C such that each string in S is obtained from some pattern in C′ by substituting basic symbols for the variables. We apply Rissanen's MDL principle to the selection of the patterns included in the result. This leads to a length-minimization problem that we approximately solve in polynomial time (in the length of S and C) with a logarithmic performance guarantee. Our algorithm is based on a greedy solution of a variant of the set covering problem.

Pekka Kilpeläinen, Heikki Mannila, Esko Ukkonen
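
The greedy set-cover step the abstract above refers to can be sketched as follows. This is a simplified Python illustration: the MDL code-length objective is reduced to plain cover counts, and the matches(pattern, string) predicate is an assumed interface, not the authors' implementation.

def greedy_pattern_cover(strings, patterns, matches):
    """Greedy set-cover heuristic for selecting patterns.

    strings:  iterable of example strings S.
    patterns: list of candidate patterns C.
    matches:  matches(pattern, string) -> bool.
    Returns a small subset of patterns covering every coverable string.
    """
    uncovered = set(strings)
    chosen = []
    while uncovered:
        # Pick the pattern covering the most still-uncovered strings.
        best = max(patterns, key=lambda p: sum(matches(p, s) for s in uncovered))
        covered_now = {s for s in uncovered if matches(best, s)}
        if not covered_now:
            break  # no remaining pattern matches anything left
        chosen.append(best)
        uncovered -= covered_now
    return chosen
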
A note on the use of probabilities by mechanical learners

We raise the following problem: in a probabilistic context, is it always fruitful for a machine to compute probabilities? The question is made precise in a paradigm of the limit-identification kind, where a learner must discover almost surely whether an infinite sequence of heads and tails belongs to an effective subset S of the Cantor space. In this context, a successful strategy for an ineffective learner is to compute, at each stage, the conditional probability that he is faced with an element of S, given the data received so far. We show that an effective learner should not proceed this way in all circumstances. Indeed, even if he gets an optimal description of a set S, and even if some machine can always compute the conditional probability for S given any data, an effective learner optimizes his inductive competence only if he does not compute the relevant probabilities. We conclude that the advice “compute probabilities whenever you can” should sometimes be received with caution in the context of machine learning.

Eric Martin, Daniel Osherson
Characterizing rational versus exponential learning curves

We consider the standard problem of learning a concept from random examples. Here a learning curve can be defined to be the expected error of a learner's hypotheses as a function of training sample size. Haussler, Littlestone and Warmuth have shown that, in the distribution-free setting, the smallest expected error a learner can achieve in the worst case over a concept class C converges rationally to zero error (i.e., Θ(1/t) for training sample size t). However, recently Cohn and Tesauro have demonstrated how exponential convergence can often be observed in experimental settings (i.e., average error decreasing as e^{-Θ(t)}). By addressing a simple non-uniformity in the original analysis, this paper shows how the dichotomy between rational and exponential worst-case learning curves can be recovered in the distribution-free theory. These results support the experimental findings of Cohn and Tesauro: for finite concept classes, any consistent learner achieves exponential convergence, even in the worst case; but for continuous concept classes, no learner can exhibit sub-rational convergence for every target concept and domain distribution. A precise boundary between rational and exponential convergence is drawn for simple concept chains. Here we show that somewhere dense chains always force rational convergence in the worst case, but exponential convergence can always be achieved for nowhere dense chains.

Dale Schuurmans
Is pocket algorithm optimal?

The pocket algorithm is considered able to provide, for any classification problem, the weight vector which satisfies the maximum number of input-output relations contained in the training set. A proper convergence theorem ensures the achievement of an optimal configuration with probability one when the number of iterations grows indefinitely. In the present paper a new formulation of this theorem is given; a rigorous proof corrects some formal and substantial errors which invalidate previous theoretical results. In particular, it is shown that the optimality of the asymptotic solution is ensured only if the number of permanences for the pocket vector lies in a proper interval of the real axis whose bounds depend on the number of iterations.

Marco Muselli
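
For reference, here is a compact Python sketch of the pocket algorithm discussed above, in its common textbook form: run perceptron updates, and keep in the "pocket" the weight vector with the longest run of consecutive correct classifications seen so far. It assumes labels in {-1, +1} and a homogeneous separator (no bias term), and it does not reproduce the convergence subtleties analyzed in the paper.

import random

def pocket_perceptron(samples, n_iters=10000, seed=0):
    """Pocket algorithm sketch.

    samples: list of (x, y) with x a tuple of floats and y in {-1, +1}.
    Returns the pocketed weight vector.
    """
    rng = random.Random(seed)
    dim = len(samples[0][0])
    w = [0.0] * dim
    pocket_w, run, best_run = list(w), 0, 0
    for _ in range(n_iters):
        x, y = rng.choice(samples)
        if y * sum(wi * xi for wi, xi in zip(w, x)) > 0:
            run += 1
            if run > best_run:  # new longest correct run: copy w into the pocket
                best_run, pocket_w = run, list(w)
        else:
            w = [wi + y * xi for wi, xi in zip(w, x)]  # standard perceptron update
            run = 0
    return pocket_w
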
Some theorems concerning the free energy of (un)constrained stochastic Hopfield neural networks

General stochastic binary Hopfield models are viewed from the angle of statistical mechanics. Both the general unconstrained binary stochastic Hopfield model and a certain constrained one are analyzed yielding explicit expressions of the free energy. Moreover, conditions are given for which some of these free energy expressions are Lyapunov functions of the corresponding differential equations. In mean field approximation, either stochastic model appears to coincide with a specific continuous model. Physically, the models are related to spin and to Potts glass models. By means of an alternative derivation, an expression of a ‘complementary’ free energy is presented. Some surveying computational results are reported and an alternative use of the discussed models in resolving constrained optimization problems is discussed.

Jan van den Berg, Jan C. Bioch
A space-bounded learning algorithm for axis-parallel rectangles

We consider the on-line learnability, from equivalence queries only, of axis-parallel rectangles over the discrete grid {1, ..., n}^d (BOX_n^d). Further, we impose the restriction of “k-space-bounded learning”, i.e. the information the learner can store about the history of the learning protocol is restricted to the previous hypothesis and at most k of the examples seen. Our result improves on the best known algorithm for learning BOX_n^d, due to Chen and Maass [9]. Their algorithm has learning complexity O(d^2 log n), requires space Θ(d^2 log n), and takes time Ω(log(d^2 log n)) for each learning step. We present an on-line learning algorithm for BOX_n^d with the same learning complexity and time complexity O(d^3 log n) which is 2d-space-bounded.

Foued Ameur
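
To illustrate the equivalence-query protocol for axis-parallel boxes, here is a deliberately simple Python sketch of a learner that stores only its current hypothesis (no examples). It is not the paper's algorithm and needs many more queries than the O(d^2 log n) bound discussed above; the equivalence_query interface, including its treatment of the empty initial hypothesis, is an assumption of this example.

def learn_box(equivalence_query):
    """Keep the smallest box around all counterexamples seen so far.

    equivalence_query(lo, hi) returns None if the hypothesis box
    {x : lo[i] <= x[i] <= hi[i]} equals the target, else a counterexample
    point; lo = hi = None encodes the empty (all-negative) hypothesis.
    Since the hypothesis is always contained in the target box, every
    counterexample is a positive point, and the box only grows.
    """
    lo, hi = None, None
    while True:
        x = equivalence_query(lo, hi)
        if x is None:
            return lo, hi
        if lo is None:
            lo, hi = list(x), list(x)
        else:
            lo = [min(l, xi) for l, xi in zip(lo, x)]
            hi = [max(h, xi) for h, xi in zip(hi, x)]
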
Learning decision lists and trees with equivalence-queries

This paper is concerned with the model of learning with equivalence-queries which was introduced by Angluin in [2]. We show that decision lists and decision trees of bounded rank are polynomially learnable in this model. If there are N base functions, then N^2 queries are sufficient for learning lists. For learning trees of rank r, (1+o(1))N^{2r} queries are sufficient. We also investigate the problem of learning a shortest representation of a target decision list. Let k-DL denote the class of decision lists with boolean terms of maximal length k as base functions. We show that shortest representations for lists from 1-DL can be learned efficiently. The corresponding questions for k≥2 are open, although we are able to show some related (but weaker) results. For instance, we present an algorithm which efficiently learns shortest representations of boolean 2-CNF or 2-DNF formulas.

Hans Ulrich Simon
Bounding VC-dimension for neural networks: Progress and prospects

Techniques from differential topology are used to give polynomial bounds for the VC-dimension of sigmoidal neural networks. The bounds are quadratic in ω, the dimension of the space of weights. Similar results are obtained for a wide class of Pfaffian activation functions. The obstruction (in differential topology) to improving the bound to an optimal bound O(ω log ω) is discussed, and attention is paid to the role of other parameters involved in the network architecture.

Marek Karpinski, Angus Macintyre
Average case analysis of a learning algorithm for μ-DNF expressions

In this paper, we present an average case model for analyzing learning algorithms. We show how the average behavior of a learning algorithm can be understood in terms of a single hypothesis that we refer to as the average hypothesis. As a case study, we apply the average case model to a simplified version of Pagallo and Haussler's algorithm for PAC learning μ-DNF expressions on the uniform distribution [15]. The average case analysis reveals that, as the training sample size m increases, the average hypothesis evolves from an almost random DNF expression to a well structured μ-DNF expression that represents exactly the target function. The learning curves exhibit a strong threshold behavior and, in some cases, have a terraced structure. That is, as m increases, the average accuracy stays relatively constant for short/long periods, interspersed with periods in which it rises quickly. This nontrivial behavior cannot be deduced from a simple PAC analysis. The average sample complexity of the algorithm is O(n^2), a large improvement over the PAC analysis result of O(n^6) reported in [15]. The results of the numerical simulations are in very good agreement with the theoretical predictions.

Mostefa Golea
Learning by extended statistical queries and its relation to PAC learning

PAC learning from examples is factored so that (i) membership queries are used to evaluate empirically “statistical queries” — certain expectations of functionals involving the unknown target; and (ii) approximate values of these statistical queries are used to compute an output — an approximation of the target. Kearns' original formulation of statistical queries [we use the abbreviation SQ] is extended here to include as SQ functionals of arbitrary range and of order higher than one — second order being the most useful addition. This enables us to capture more ground for casting efficient PAC learning algorithms in SQ form: the important Kushilevitz-Mansour Fourier-based algorithm has an SQ rendition, as well as its derivatives, e.g. Jackson's recent DNF learning. Efficient evaluation of extended SQ by membership queries, if possible at all, becomes quite intricate. We show, however, that it is usually robust under classification noise.

Eli Shamir, Clara Shwartzman
Typed pattern languages and their learnability

In this paper, we extend patterns, introduced by Angluin [Ang80b], to typed patterns by introducing types for variables. A type is a recursive language, and a variable of that type is substituted only with an element of the recursive language. This extension enhances the expressive power of patterns while preserving their good properties. First, we give a general learnability result for typed pattern languages. We show that if a class of types has finite elasticity, then the typed pattern languages are identifiable in the limit from positive data. Next, we give a useful tool for showing the conservative learnability of typed pattern languages. That is, if an indexed family ℒ of recursive languages has recursive finite thickness and the equivalence problem for ℒ is decidable, then ℒ is conservatively learnable from positive data. Using this tool, we consider the following classes of types: (1) the class of all strings over subsets of the alphabet, (2) the class of all untyped pattern languages, and (3) a class of k-bounded regular languages. We show that each of these typed pattern languages is conservatively learnable from positive data.

Takeshi Koshiba
Learning behaviors of automata from shortest counterexamples

Generalizing a result of Ibarra and Jiang on the learning of regular languages from counterexamples, we show a polynomial time algorithm that exactly identifies an unknown Q-automaton from equivalence queries, given an oracle that provides counterexamples which are minimal w.r.t. the alphabetic ordering.

F. Bergadano, S. Varricchio
Learning of regular expressions by pattern matching

We consider the problem of restoring regular expressions from good examples. We describe a natural learning algorithm for obtaining a “plausible” regular expression from one example. The algorithm is based on finding the longest substring which can be matched by some part of the expression obtained so far. We believe that the algorithm to a certain extent mimics how humans guess regular expressions from the same sort of examples. We show that for regular expressions of bounded length, successful learning takes time linear in the length of the example, provided that the example is “good”. Under certain natural restrictions the run-time of the learning algorithm is polynomial also in unsuccessful cases. Finally, we discuss computer experiments on learning regular expressions via the described algorithm, showing that the proposed learning method is quite practical.

Alvis Brāzma
The query complexity of learning some subclasses of context-free grammars

At the last COLT conference, A. Burago showed that the class of structurally reversible grammars is learnable in polynomial time using membership and equivalence queries. This paper shows that this class of grammars is not learnable using either membership queries alone or equivalence queries alone. However, it turns out that structurally reversible grammars are a superclass of very simple grammars, which are learnable using only equivalence queries. Furthermore, we prove that the number of alternations between equivalence and membership queries cannot be constant. We also prove a lower bound on the number of alternations between membership and equivalence queries which is an improvement over the same lower bound for deterministic finite automata. Finally, we discuss a possible trade-off between membership and equivalence queries in Burago's algorithm that might allow us to reduce the number of equivalence queries.

Carlos Domingo, Víctor Lavín
Backmatter
Metadata
Title
Computational Learning Theory
Edited by
Paul Vitányi
Copyright year
1995
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-49195-8
Print ISBN
978-3-540-59119-1
DOI
https://doi.org/10.1007/3-540-59119-2