Skip to main content

Über dieses Buch

This volume contains the papers presented at the 21st International Conf- ence on Algorithmic Learning Theory (ALT 2010), which was held in Canberra, Australia, October 6–8, 2010. The conference was co-located with the 13th - ternational Conference on Discovery Science (DS 2010) and with the Machine Learning Summer School, which was held just before ALT 2010. The tech- cal program of ALT 2010, contained 26 papers selected from 44 submissions and ?ve invited talks. The invited talks were presented in joint sessions of both conferences. ALT 2010 was dedicated to the theoretical foundations of machine learning and took place on the campus of the Australian National University, Canberra, Australia. ALT provides a forum for high-quality talks with a strong theore- cal background and scienti?c interchange in areas such as inductive inference, universal prediction, teaching models, grammatical inference, formal languages, inductive logic programming, query learning, complexity of learning, on-line learning and relative loss bounds, semi-supervised and unsupervised learning, clustering,activelearning,statisticallearning,supportvectormachines,Vapnik- Chervonenkisdimension,probablyapproximatelycorrectlearning,Bayesianand causal networks, boosting and bagging, information-based methods, minimum descriptionlength,Kolmogorovcomplexity,kernels,graphlearning,decisiontree methods, Markov decision processes, reinforcement learning, and real-world - plications of algorithmic learning theory. DS 2010 was the 13th International Conference on Discovery Science and focused on the development and analysis of methods for intelligent data an- ysis, knowledge discovery and machine learning, as well as their application to scienti?c knowledge discovery. As is the tradition, it was co-located and held in parallel with Algorithmic Learning Theory.



Editors’ Introduction

Editors’ Introduction

The conference “Algorithmic Learning Theory 2010” is dedicated to studies of learning from a mathematical and algorithmic perspective. Researchers consider various abstract models of the problem of learning and investigate how the learning goal in such a setting can be formulated and achieved. These models describe ways to define

the goal of learning,

how the learner retrieves information about its environment,

how to form of the learner’s models of the world (in some cases).

Retrieving information in some models is passive where the learner just views a stream of data. In other models, the learner is more active, asking questions or learning from its actions. Besides explicit formulation of hypotheses in an abstract language with respect to some indexing system, there are also more implicit methods like making predictions according to the current hypothesis on some arguments which then are evaluated with respect to their correctness, and wrong predictions (coming from wrong hypotheses) incur some loss on the learner. In the following, a more detailed introduction is given to the five invited talks and then to the regular contributions.

Marcus Hutter, Frank Stephan, Vladimir Vovk, Thomas Zeugmann

Invited Papers

Towards General Algorithms for Grammatical Inference

Many algorithms for grammatical inference can be viewed as instances of a more general algorithm which maintains a set of primitive elements, which distributionally define sets of strings, and a set of features or tests that constrain various inference rules. Using this general framework, which we cast as a process of logical inference, we re-analyse Angluin’s famous


algorithm and several recent algorithms for the inference of context-free grammars and multiple context-free grammars. Finally, to illustrate the advantages of this approach, we extend it to the inference of functional transductions from positive data only, and we present a new algorithm for the inference of finite state transducers.

Alexander Clark

The Blessing and the Curse of the Multiplicative Updates

Multiplicative updates multiply the parameters by nonnegative factors. These updates are motivated by a Maximum Entropy Principle and they are prevalent in evolutionary processes where the parameters are for example concentrations of species and the factors are survival rates. The simplest such update is Bayes rule and we give an in vitro selection algorithm for RNA strands that implements this rule in the test tube where each RNA strand represents a different model. In one liter of the RNA “soup” there are approximately 10


different strands and therefore this is a rather high-dimensional implementation of Bayes rule. We investigate multiplicative updates for the purpose of learning online while processing a stream of examples. The “blessing” of these updates is that they learn very fast because the good parameters grow exponentially. However their “curse” is that they learn too fast and wipe out parameters too quickly. We describe a number of methods developed in the realm of online learning that ameliorate the curse of these updates. The methods make the algorithm robust against data that changes over time and prevent the currently good parameters from taking over. We also discuss how the curse is circumvented by nature. Some of nature’s methods parallel the ones developed in Machine Learning, but nature also has some additional tricks.

Manfred K. Warmuth

Discovery of Abstract Concepts by a Robot

This paper reviews experiments with an approach to discovery through robot’s experimentation in its environment. In addition to discovering laws that enable predictions, we are particularly interested in the mechanisms that enable the discovery of abstract concepts that are not explicitly observable in the measured data, such as the notions of a tool or stability. The approach is based on the use of Inductive Logic Programming. Examples of actually discovered abstract concepts in the experiments include the concepts of a movable object, an obstacle and a tool.

Ivan Bratko

Contrast Pattern Mining and Its Application for Building Robust Classifiers

The ability to distinguish, differentiate and contrast between different data sets is a key objective in data mining. Such ability can assist domain experts to understand their data and can help in building classification models. This presentation will introduce the techniques for contrasting data sets. It will also focus on some important real world applications that illustrate how contrast patterns can be applied effectively for building robust classifiers.

Kotagiri Ramamohanarao

Optimal Online Prediction in Adversarial Environments

In many prediction problems, including those that arise in computer security and computational finance, the process generating the data is best modelled as an adversary with whom the predictor competes. Even decision problems that are not inherently adversarial can be usefully modeled in this way, since the assumptions are sufficiently weak that effective prediction strategies for adversarial settings are very widely applicable.

Peter L. Bartlett

Regular Contributions

Statistical Learning

An Algorithm for Iterative Selection of Blocks of Features

We focus on the problem of linear regression estimation in high dimension, when the parameter


is ”sparse” (most of its coordinates are 0) and ”blocky” (






 + 1

are likely to be equal). Recently, some authors defined estimators taking into account this information, such as the Fused-LASSO [19] or the S-LASSO [10] among others. However, there are no theoretical results about the obtained estimators in the general design matrix case. Here, we propose an alternative point of view, based on the Iterative Feature Selection method [1]. We propose an iterative algorithm that takes into account the fact that


is sparse


blocky, with no prior knowledge on the position of the blocks. Moreover, we give a theoretical result that ensures that every step of our algorithm actually improves the statistical performance of the obtained estimator. We provide some simulations, where our method outperforms LASSO-type methods in the cases where the parameter is sparse and blocky. Moreover, we give an application to real data (CGH arrays), that shows that our estimator can be used on large datasets.

Pierre Alquier

Bayesian Active Learning Using Arbitrary Binary Valued Queries

We explore a general Bayesian active learning setting, in which the learner can ask arbitrary yes/no questions. We derive upper and lower bounds on the expected number of queries required to achieve a specified expected risk.

Liu Yang, Steve Hanneke, Jaime Carbonell

Approximation Stability and Boosting

Stability has been explored to study the performance of learning algorithms in recent years and it has been shown that stability is sufficient for generalization and is sufficient and necessary for consistency of ERM in the general learning setting. Previous studies showed that AdaBoost has almost-everywhere uniform stability if the base learner has



stability. The



stability, however, is too restrictive and we show that AdaBoost becomes constant learner if the base learner is not real-valued learner. Considering that AdaBoost is mostly successful as a classification algorithm, stability analysis for AdaBoost when the base learner is not real-valued learner is an important yet unsolved problem. In this paper, we introduce the

approximation stability

and prove that approximation stability is sufficient for generalization, and sufficient and necessary for learnability of AERM in the general learning setting. We prove that AdaBoost has approximation stability and thus has good generalization, and an exponential bound for AdaBoost is provided.

Wei Gao, Zhi-Hua Zhou

Grammatical Inference and Graph Learning

A Spectral Approach for Probabilistic Grammatical Inference on Trees

We focus on the estimation of a probability distribution over a set of trees. We consider here the class of distributions computed by weighted automata - a strict generalization of probabilistic tree automata. This class of distributions (called rational distributions, or rational stochastic tree languages - RSTL) has an algebraic characterization: All the residuals (conditional) of such distributions lie in a finite-dimensional vector subspace. We propose a methodology based on Principal Components Analysis to identify this vector subspace. We provide an algorithm that computes an estimate of the target residuals vector subspace and builds a model which computes an estimate of the target distribution.

Raphaël Bailly, Amaury Habrard, François Denis

PageRank Optimization in Polynomial Time by Stochastic Shortest Path Reformulation

The importance of a node in a directed graph can be measured by its PageRank. The PageRank of a node is used in a number of application contexts – including ranking websites – and can be interpreted as the average portion of time spent at the node by an infinite random walk. We consider the problem of maximizing the PageRank of a node by selecting some of the edges from a set of edges that are under our control. By applying results from Markov decision theory, we show that an optimal solution to this problem can be found in polynomial time. It also indicates that the provided reformulation is well-suited for reinforcement learning algorithms. Finally, we show that, under the slight modification for which we are given mutually exclusive pairs of edges, the problem of PageRank optimization becomes NP-hard.

Balázs Csanád Csáji, Raphaël M. Jungers, Vincent D. Blondel

Inferring Social Networks from Outbreaks

We consider the problem of inferring the most likely social network given connectivity constraints imposed by observations of outbreaks within the network. Given a set of vertices (or agents)


and constraints (or observations)





we seek to find a minimum log-likelihood cost (or maximum likelihood) set of edges (or connections)


such that each



induces a connected subgraph of (




). For the offline version of the problem, we prove an Ω(log(


)) hardness of approximation result for uniform cost networks and give an algorithm that almost matches this bound, even for arbitrary costs. Then we consider the online problem, where the constraints are satisfied as they arrive. We give an






))-competitive algorithm for the arbitrary cost online problem, which has an Ω(


)-competitive lower bound. We look at the uniform cost case as well and give an









))-competitive algorithm against an oblivious adversary, as well as an


-competitive lower bound against an adaptive adversary. We examine cases when the underlying network graph is known to be a star or a path, and prove matching upper and lower bounds of Θ(log(


)) on the competitive ratio for them.

Dana Angluin, James Aspnes, Lev Reyzin

Probably Approximately Correct Learning

Distribution-Dependent PAC-Bayes Priors

We develop the idea that the PAC-Bayes prior can be informed by the data-generating distribution. We prove sharp bounds for an existing framework, and develop insights into function class complexity in this model and suggest means of controlling it with new algorithms. In particular we consider controlling capacity with respect to the


geometry of the data-generating distribution. We finally extend this localization to more practical learning methods.

Guy Lever, François Laviolette, John Shawe-Taylor

PAC Learnability of a Concept Class under Non-atomic Measures: A Problem by Vidyasagar

In response to a 1997 problem of M. Vidyasagar, we state a necessary and sufficient condition for distribution-free PAC learnability of a concept class

under the family of all non-atomic (diffuse) measures on the domain Ω. Clearly, finiteness of the classical Vapnik–Chervonenkis dimension of

is a sufficient, but no longer necessary, condition. Besides, learnability of

under non-atomic measures does not imply the uniform Glivenko–Cantelli property with regard to non-atomic measures. Our learnability criterion is stated in terms of a combinatorial parameter

which we call the VC dimension of

modulo countable sets. The new parameter is obtained by “thickening up” single points in the definition of VC dimension to uncountable “clusters”. Equivalently,

if and only if every countable subclass of

has VC dimension ≤ 


outside a countable subset of Ω. The new parameter can be also expressed as the classical VC dimension of

calculated on a suitable subset of a compactification of Ω. We do not make any measurability assumptions on

, assuming instead the validity of Martin’s Axiom (MA).

Vladimir Pestov

A PAC-Bayes Bound for Tailored Density Estimation

In this paper we construct a general method for reporting on the accuracy of density estimation. Using variational methods from statistical learning theory we derive a PAC, algorithm-dependent bound on the distance between the data generating distribution and a learned approximation. The distance measure takes the role of a loss function that can be tailored to the learning problem, enabling us to control discrepancies on tasks relevant to subsequent inference. We apply the bound to an efficient mixture learning algorithm. Using the method of localisation we encode properties of both the algorithm and the data generating distribution, producing a tight, empirical, algorithm-dependent upper risk bound on the performance of the learner. We discuss other uses of the bound for arbitrary distributions and model averaging.

Matthew Higgs, John Shawe-Taylor

Compressed Learning with Regular Concept

We revisit compressed learning in the PAC learning framework. Specifically, we derive error bounds for learning halfspace concepts with compressed data. We propose the


assumption over a pair of concept and data distribution to greatly generalize former assumptions. For a


concept we define a

robust factor

to characterize the margin distribution and show that such a factor tightly controls the generalization error of a learned classifier. Moreover, we extend our analysis to the more general linearly non-separable case. Empirical results on both toy and real world data validate our analysis.

Jiawei Lv, Jianwen Zhang, Fei Wang, Zheng Wang, Changshui Zhang

Query Learning and Algorithmic Teaching

A Lower Bound for Learning Distributions Generated by Probabilistic Automata

Known algorithms for learning PDFA can only be shown to run in time polynomial in the so-called distinguishability


of the target machine, besides the number of states and the usual accuracy and confidence parameters. We show that the dependence on


is necessary for every algorithm whose structure resembles existing ones. As a technical tool, a new variant of Statistical Queries termed L


-queries is defined. We show how these queries can be simulated from samples and observe that known PAC algorithms for learning PDFA can be rewritten to access its target using L


-queries and standard Statistical Queries. Finally, we show a lower bound: every algorithm to learn PDFA using queries with a resonable tolerance needs a number of queries larger than (1/




for every


 < 1.

Borja Balle, Jorge Castro, Ricard Gavaldà

Lower Bounds on Learning Random Structures with Statistical Queries

We show that random DNF formulas, random log-depth decision trees and random deterministic finite acceptors cannot be weakly learned with a polynomial number of statistical queries with respect to an arbitrary distribution on examples.

Dana Angluin, David Eisenstat, Leonid (Aryeh) Kontorovich, Lev Reyzin

Recursive Teaching Dimension, Learning Complexity, and Maximum Classes

This paper is concerned with the combinatorial structure of concept classes that can be learned from a small number of examples. We show that the recently introduced notion of recursive teaching dimension (RTD, reflecting the complexity of teaching a concept class) is a relevant parameter in this context. Comparing the RTD to self-directed learning, we establish new lower bounds on the query complexity for a variety of query learning models and thus

connect teaching to query learning


For many general cases, the RTD is upper-bounded by the VC-dimension, e.g., classes of VC-dimension 1, (nested differences of) intersection-closed classes, “standard” boolean function classes, and finite maximum classes. The RTD thus is the first model to

connect teaching to the VC-dimension


The combinatorial structure defined by the RTD has a remarkable resemblance to the structure exploited by sample compression schemes and hence

connects teaching to sample compression

. Sequences of teaching sets defining the RTD coincide with unlabeled compression schemes both (i) resulting from Rubinstein and Rubinstein’s corner-peeling and (ii) resulting from Kuzmin and Warmuth’s Tail Matching algorithm.

Thorsten Doliwa, Hans Ulrich Simon, Sandra Zilles

On-line Learning

Toward a Classification of Finite Partial-Monitoring Games

In a finite partial-monitoring game against Nature, the Learner repeatedly chooses one of finitely many actions, the Nature responds with one of finitely many outcomes, the Learner suffers a loss and receives feedback signal, both of which are fixed functions of the action and the outcome. The goal of the Learner is to minimize its total cumulative loss. We make progress towards classification of these games based on their minimax expected regret. Namely, we classify almost all games with two outcomes: We show that their minimax expected regret is either zero,


, Θ(



), or Θ(


) and we give a simple and efficiently computable classification of these four classes of games. Our hope is that the result can serve as a stepping stone toward classifying all finite partial-monitoring games.

Gábor Bartók, Dávid Pál, Csaba Szepesvári

Switching Investments

We present a simple online two-way trading algorithm that exploits fluctuations in the unit price of an asset. Rather than analysing worst-case performance under some assumptions, we prove a novel, unconditional performance bound that is parameterised either by the actual dynamics of the price of the asset, or by a simplifying model thereof. The algorithm processes


prices in





) time and




) space, but if the employed prior density is exponential, the time requirement reduces to




). The result translates to the prediction with expert advice framework, and has applications in data compression and hypothesis testing.

Wouter M. Koolen, Steven de Rooij

Prediction with Expert Advice under Discounted Loss

We study prediction with expert advice in the setting where the losses are accumulated with some discounting and the impact of old losses can gradually vanish. We generalize the Aggregating Algorithm and the Aggregating Algorithm for Regression, propose a new variant of exponentially weighted average algorithm, and prove bounds on the cumulative discounted loss.

Alexey Chernov, Fedor Zhdanov

A Regularization Approach to Metrical Task Systems

We address the problem of constructing randomized online algorithms for the Metrical Task Systems (MTS) problem on a metric


against an oblivious adversary. Restricting our attention to the class of “work-based” algorithms, we provide a framework for designing algorithms that uses the technique of


. For the case when


is a uniform metric, we exhibit two algorithms that arise from this framework, and we prove a bound on the competitive ratio of each. We show that the second of these algorithms is ln






) competitive, which is the current state-of-the art for the uniform MTS problem.

Jacob Abernethy, Peter L. Bartlett, Niv Buchbinder, Isabelle Stanton

Inductive Inference

Solutions to Open Questions for Non-U-Shaped Learning with Memory Limitations

A U-shape occurs when a learner first learns, then unlearns, and, finally, relearns, some target concept. Within the framework of Inductive Inference, previous results have shown, for example, that U-shapes are unnecessary for explanatory learning, but are necessary for behaviorally correct learning.

This paper solves the following two problems regarding non-U-shaped learning posed in the prior literature.

First, it is shown that there are classes learnable with three memory states that are not learnable non-U-shapedly with any finite number of memory states. This result is surprising, as for learning with one or two memory states, U-shapes are known to be unnecessary.

Secondly, it is shown that there is a class learnable memorylessly with a single feedback query such that this class is not learnable non-U-shapedly memorylessly with any finite number of feedback queries. This result is complemented by the result that any class of


languages learnable memorylessly with finitely many feedback queries is so learnable without U-shapes – in fact, all classes of infinite languages learnable with


memory are learnable memorylessly with finitely many feedback queries and without U-shapes. On the other hand, we show that there is a class of infinite languages learnable memorylessly with a single feedback query, which is


learnable with


U-shapes by any particular bounded number of feedback queries.

John Case, Timo Kötzing

Learning without Coding

Iterative learning is a model of language learning from positive data, due to Wiehagen. When compared to a learner in Gold’s original model of language learning from positive data, an iterative learner can be thought of as


. However, an iterative learner can memorize


input elements by


them into the syntax of its hypotheses. A main concern of this paper is: to what extent are such coding tricks



One means of preventing


such coding tricks is to require that the hypothesis space used be free of redundancy, i.e., that it be 1-1. By extending a result of Lange & Zeugmann, we show that many interesting and non-trivial classes of languages can be iteratively identified in this manner. On the other hand, we show that there exists a class of languages that can


be iteratively identified using any 1-1 effective numbering as the hypothesis space.

We also consider an iterative-like learning model in which the computational component of the learner is modeled as an

enumeration operator

, as opposed to a partial computable function. In this new model, there are no hypotheses, and, thus, no syntax in which the learner can encode what elements it has or has not yet seen. We show that there exists a class of languages that


be identified under this new model, but that can


be iteratively identified. On the other hand, we show that there exists a class of languages that can


be identified under this new model, but that


be iteratively identified using a Friedberg numbering as the hypothesis space.

Samuel E. Moelius, Sandra Zilles

Learning Figures with the Hausdorff Metric by Fractals


is a fundamental process for machine learning from analog data such as continuous signals. For example, the discrete Fourier analysis is one of the most essential signal processing methods for learning or recognition from continuous signals. However, only the direction of the time axis is discretized in the method, meaning that each datum is not purely discretized. To give a completely computational theoretical basis for machine learning from analog data, we construct a learning framework based on the Gold-style learning model. Using a modern mathematical computability theory in the field of Computable Analysis, we show that scalable sampling of analog data can be formulated as


Gold-style learning. On the other hand, recursive algorithms are a key expression for models or rules explaining analog data. For example, FFT (Fast Fourier Transformation) is a fundamental recursive algorithm for discrete Fourier analysis. In this paper we adopt


, since they are general geometric concepts of recursive algorithms, and set learning objects as nonempty compact sets in the Euclidean space, called


, in order to introduce fractals into Gold-style learning model, where the Hausdorff metric can be used to measure generalization errors. We analyze learnable classes of figures from informants (positive and negative examples) and from texts (positive examples), and reveal the hierarchy of learnabilities under various learning criteria. Furthermore, we measure the number of positive examples, one of complexities of learning, by using the Hausdorff dimension, which is the central concept of Fractal Geometry, and the VC dimension, which is used to measure the complexity of classes of hypotheses in the Valiant-style learning model. This work provides theoretical support for machine learning from analog data.

Mahito Sugiyama, Eiju Hirowatari, Hideki Tsuiki, Akihiro Yamamoto

Inductive Inference of Languages from Samplings

We introduce, discuss, and study a model for inductive inference from samplings, formalizing an idea of learning different “projections” of languages. One set of our results addresses the problem of finding a uniform learner for all samplings of a language from a certain set when learners for particular samplings are available. Another set of results deals with extending learnability from a large natural set of samplings to larger sets. A number of open problems is formulated.

Sanjay Jain, Efim Kinber

Reinforcement Learning

Optimality Issues of Universal Greedy Agents with Static Priors

Finding the universal artificial intelligent agent is the old dream of AI scientists. Solomonoff Induction was one big step towards this, giving a universal solution to the general problem of Sequence Prediction, by defining a universal prior distribution. Hutter defined AIXI, which extends the latter to the Reinforcement Learning framework, where almost all if not all AI problems can be formulated. However, new difficulties arise, because the agent is now active, whereas it is only passive in the Sequence Prediction case. This makes proving AIXI’s optimality difficult. In fact, we prove that the current definition of AIXI can sometimes be only suboptimal in a certain sense, and we generalize this result to infinite horizon agents and to any static prior distribution.

Laurent Orseau

Consistency of Feature Markov Processes

We are studying long term sequence prediction (forecasting). We approach this by investigating criteria for choosing a compact useful state representation. The state is supposed to summarize useful information from the history. We want a method that is asymptotically consistent in the sense it will provably eventually only choose between alternatives that satisfy an optimality property related to the used criterion. We extend our work to the case where there is side information that one can take advantage of and, furthermore, we briefly discuss the active setting where an agent takes actions to achieve desirable outcomes.

Peter Sunehag, Marcus Hutter

Algorithms for Adversarial Bandit Problems with Multiple Plays

Adversarial bandit problems studied by Auer et al. [4] are multi-armed bandit problems in which no stochastic assumption is made on the nature of the process generating the rewards for actions. In this paper, we extend their theories to the case where


( ≥ 1) distinct actions are selected at each time step. As algorithms to solve our problem, we analyze an extension of


[4] and an application of a bandit online linear optimization algorithm [1] in addition to two existing algorithms (




[5] in terms of time and space efficiency and the regret for the best fixed action set. The extension of


, called


, performs best with respect to all the measures: it runs in






 + 1)) time and




) space, and suffers at most


regret, where


is the number of possible actions and


is the number of iterations. The upper bound of the regret we proved for


is an extension of that proved by Auer et al. for



Taishi Uchiya, Atsuyoshi Nakamura, Mineichi Kudo

On-line Learning and Kernel Methods

Online Multiple Kernel Learning: Algorithms and Mistake Bounds

Online learning


kernel learning

are two active research topics in machine learning. Although each of them has been studied extensively, there is a limited effort in addressing the intersecting research. In this paper, we introduce a new research problem, termed

Online Multiple Kernel Learning

(OMKL), that aims to learn a kernel based prediction function from a pool of predefined kernels in an online learning fashion. OMKL is generally more challenging than typical online learning because both the kernel classifiers and their linear combination weights must be learned simultaneously. In this work, we consider two setups for OMKL, i.e. combining binary predictions or real-valued outputs from multiple kernel classifiers, and we propose both deterministic and stochastic approaches in the two setups for OMKL. The deterministic approach updates all kernel classifiers for every misclassified example, while the stochastic approach randomly chooses a classifier(s) for updating according to some sampling strategies. Mistake bounds are derived for all the proposed OMKL algorithms.

Rong Jin, Steven C. H. Hoi, Tianbao Yang

An Identity for Kernel Ridge Regression

This paper provides a probabilistic derivation of an identity connecting the square loss of ridge regression in on-line mode with the loss of a retrospectively best regressor. Some corollaries of the identity providing upper bounds for the cumulative loss of on-line ridge regression are also discussed.

Fedor Zhdanov, Yuri Kalnishkan


Weitere Informationen

Premium Partner