
2007 | Book

Algorithmic Learning Theory

18th International Conference, ALT 2007, Sendai, Japan, October 1-4, 2007. Proceedings

Editors: Marcus Hutter, Rocco A. Servedio, Eiji Takimoto

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

This volume contains the papers presented at the 18th International Conference on Algorithmic Learning Theory (ALT 2007), which was held in Sendai (Japan) during October 1–4, 2007. The main objective of the conference was to provide an interdisciplinary forum for high-quality talks with a strong theoretical background and scientific interchange in areas such as query models, on-line learning, inductive inference, algorithmic forecasting, boosting, support vector machines, kernel methods, complexity and learning, reinforcement learning, unsupervised learning and grammatical inference. The conference was co-located with the Tenth International Conference on Discovery Science (DS 2007). This volume includes 25 technical contributions that were selected from 50 submissions by the Program Committee. It also contains descriptions of the five invited talks of ALT and DS; longer versions of the DS papers are available in the proceedings of DS 2007. These invited talks were presented to the audience of both conferences in joint sessions.

Table of Contents

Frontmatter

Editors’ Introduction


Philosophers have pondered the phenomenon of learning for millennia; scientists and psychologists have studied learning for more than a century. But the analysis of learning as a computational and algorithmic phenomenon is much more recent, going back only a few decades. Learning theory is now an active research area that incorporates ideas, problems, and techniques from a wide range of disciplines including statistics, artificial intelligence, information theory, pattern recognition, and theoretical computer science. Learning theory has many robust connections with more applied research in machine learning and has made significant contributions to the development of applied systems and to fields such as electronic commerce and computational biology.

Marcus Hutter, Rocco A. Servedio, Eiji Takimoto

Invited Papers

A Theory of Similarity Functions for Learning and Clustering

Kernel methods have proven to be powerful tools in machine learning. They perform well in many applications, and there is also a well-developed theory of sufficient conditions for a kernel to be useful for a given learning problem. However, while a kernel can be thought of as just a pairwise similarity function that satisfies additional mathematical properties, this theory requires viewing kernels as implicit (and often difficult to characterize) maps into high-dimensional spaces. In this talk I will describe work on developing a theory that applies to more general similarity functions (not just legal kernels) and furthermore describes the usefulness of a given similarity function in terms of more intuitive, direct properties, without need to refer to any implicit spaces.

An interesting feature of the proposed framework is that it can also be applied to learning from purely unlabeled data, i.e., clustering. In particular, one can ask how much stronger the properties of a similarity function should be (in terms of its relation to the unknown desired clustering) so that it can be used to cluster well: to learn well without any label information at all. We find that if we are willing to relax the objective a bit (for example, allow the algorithm to produce a hierarchical clustering that we will call successful if some pruning is close to the correct answer), then this question leads to a number of interesting graph-theoretic and game-theoretic properties that are sufficient to cluster well. This work can be viewed as an approach to defining a PAC model for clustering.

This talk is based on work joint with Maria-Florina Balcan and Santosh Vempala.

Avrim Blum
Machine Learning in Ecosystem Informatics

The emerging field of Ecosystem Informatics applies methods from computer science and mathematics to address fundamental and applied problems in the ecosystem sciences. The ecosystem sciences are in the midst of a revolution driven by a combination of emerging technologies for improved sensing and the critical need for better science to help manage global climate change. This paper describes several initiatives at Oregon State University in ecosystem informatics.

Thomas G. Dietterich
Challenge for Info-plosion

Information created by people has increased rapidly since the year 2000, and now we are in a time which we could call the “information-explosion era.” The project “Cyber Infrastructure for the Information-explosion Era” is a six-year project from 2005 to 2010 supported by Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan. The project aims to establish the following fundamental technologies in this information-explosion era: novel technologies for efficient and trustable information retrieval from explosively growing and heterogeneous information resources; stable, secure, and scalable information systems for managing rapid information growth; and information utilization by harmonized human-system interaction. It also aims to design a social system that cooperates with these technologies. Moreover, it maintains the synergy of cutting-edge technologies in informatics.

Masaru Kitsuregawa
A Hilbert Space Embedding for Distributions

We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a reproducing kernel Hilbert space. Applications of this technique can be found in two-sample tests, which are used for determining whether two sets of observations arise from the same distribution, covariate shift correction, local learning, measures of independence, and density estimation.
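The two-sample test mentioned above can be illustrated with a short sketch of the kernel mean embedding idea: map each sample to its empirical mean in a reproducing kernel Hilbert space and compare the embeddings. The RBF kernel choice, the biased estimator, and all parameter values here are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2_biased(X, Y, gamma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy:
    the squared RKHS distance between the empirical mean embeddings."""
    return (rbf_kernel(X, X, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
# samples from the same distribution give a small MMD estimate,
# samples from shifted distributions give a much larger one
same = mmd2_biased(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
diff = mmd2_biased(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
```

Because the biased estimator is a squared norm, it is always nonnegative; a permutation test over the pooled sample would turn it into the two-sample test the abstract describes.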

Alex Smola, Arthur Gretton, Le Song, Bernhard Schölkopf
Simple Algorithmic Principles of Discovery, Subjective Beauty, Selective Attention, Curiosity and Creativity

I postulate that human or other intelligent agents function or should function as follows. They store all sensory observations as they come—the data is ‘holy.’ At any time, given some agent’s current coding capabilities, part of the data is compressible by a short and hopefully fast program / description / explanation / world model. In the agent’s subjective eyes, such data is more regular and more beautiful than other data [2,3]. It is well-known that knowledge of regularity and repeatability may improve the agent’s ability to plan actions leading to external rewards. In the absence of such rewards, however, known beauty is boring. Then interestingness becomes the first derivative of subjective beauty: as the learning agent improves its compression algorithm, formerly apparently random data parts become subjectively more regular and beautiful. Such progress in data compression is measured and maximized by the curiosity drive [1,4,5]: create action sequences that extend the observation history and yield previously unknown / unpredictable but quickly learnable algorithmic regularity. We discuss how all of the above can be naturally implemented on computers, through an extension of passive unsupervised learning to the case of active data selection: we reward a general reinforcement learner (with access to the adaptive compressor) for actions that improve the subjective compressibility of the growing data. An unusually large data compression breakthrough deserves the name discovery. The creativity of artists, dancers, musicians, and pure mathematicians can be viewed as a by-product of this principle. Good observer-dependent art deepens the observer’s insights about this world or possible worlds, unveiling previously unknown regularities in compressible data, connecting previously disconnected patterns in an initially surprising way that makes the combination of these patterns subjectively more compressible, and eventually becomes known and less interesting. Several qualitative examples support this hypothesis.

Jürgen Schmidhuber


Inductive Inference

Feasible Iteration of Feasible Learning Functionals

For learning functions in the limit, an algorithmic learner obtains successively more data about a function and calculates trials, each resulting in the output of a corresponding program, where, hopefully, these programs eventually converge to a correct program for the function. The authors desired to provide a feasible version of this learning in the limit — a version where each trial was conducted feasibly and there was some feasible limit on the number of trials allowed. Employed were basic feasible functionals which query an input function as to its values and which provide each trial. An additional tally argument $0^i$ was provided to the functionals for their execution of the i-th trial. In this way more time resource was available for each successive trial. The mechanism employed to feasibly limit the number of trials was to feasibly count them down from some feasible notation for a constructive ordinal. Since all processes were feasible, their termination was feasibly detectable, and, so, it was possible to wait for the trials to terminate and suppress all the output programs but the last. Hence, although there is still an iteration of trials, the learning was a special case of what has long been known as total Fin-learning, i.e., learning in the limit, where, on each function, the learner always outputs exactly one conjectured program. Our general main results provide for strict learning hierarchies where the trial count down involves all and only notations for infinite limit ordinals. For our hierarchies featuring finitely many limit ordinal jumps, we have upper and lower total run time bounds of our feasible Fin-learners in terms of finite stacks of exponentials. We provide, though, an example of how to regain feasibility by a suitable parameterized complexity analysis.

John Case, Timo Kötzing, Todd Paddock
Parallelism Increases Iterative Learning Power

Iterative learning ($\textbf{It}$-learning) is a Gold-style learning model in which each of a learner’s output conjectures may depend only upon the learner’s current conjecture and the current input element. Two extensions of the $\textbf{It}$-learning model are considered, each of which involves parallelism. The first is to run, in parallel, distinct instantiations of a single learner on each input element. The second is to run, in parallel, n individual learners incorporating the first extension, and to allow the n learners to communicate their results. In most contexts, parallelism is only a means of improving efficiency. However, as shown herein, learners incorporating the first extension are more powerful than $\textbf{It}$-learners, and collective learners resulting from the second extension increase in learning power as n increases. Attention is paid to how one would actually implement a learner incorporating each extension. Parallelism is the underlying mechanism employed.

John Case, Samuel E. Moelius III
Prescribed Learning of R.E. Classes

This work extends studies of Angluin, Lange and Zeugmann on the dependence of learning on the hypotheses space chosen for the class. In subsequent investigations, uniformly recursively enumerable hypotheses spaces have been considered. In the present work, the following four types of learning are distinguished: class-comprising (where the learner can choose a uniformly recursively enumerable superclass as hypotheses space), class-preserving (where the learner has to choose a uniformly recursively enumerable hypotheses space of the same class), prescribed (where there must be a learner for every uniformly recursively enumerable hypotheses space of the same class) and uniform (like prescribed, but the learner has to be synthesized effectively from an index of the hypothesis space). While for explanatory learning, these four types of learnability coincide, some or all are different for other learning criteria. For example, for conservative learning, all four types are different. Several results are obtained for vacillatory and behaviourally correct learning; three of the four types can be separated, however the relation between prescribed and uniform learning remains open. It is also shown that every (not necessarily uniformly recursively enumerable) behaviourally correct learnable class has a prudent learner, that is, a learner using a hypotheses space such that it learns every set in the hypotheses space. Moreover the prudent learner can be effectively built from any learner for the class.

Sanjay Jain, Frank Stephan, Nan Ye
Learning in Friedberg Numberings

In this paper we consider learnability in some special numberings, such as Friedberg numberings, which contain all the recursively enumerable languages, but have simpler grammar equivalence problem compared to acceptable numberings. We show that every explanatorily learnable class can be learnt in some Friedberg numbering. However, such a result does not hold for behaviourally correct learning or finite learning. One can also show that some Friedberg numberings are so restrictive that all classes which can be explanatorily learnt in such Friedberg numberings have only finitely many infinite languages. We also study similar questions for several properties of learners such as consistency, conservativeness, prudence, iterativeness and non U-shaped learning. Besides Friedberg numberings, we also consider the above problems for programming systems with K-recursive grammar equivalence problem.

Sanjay Jain, Frank Stephan

Complexity Aspects of Learning

Separating Models of Learning with Faulty Teachers

We study the power of two models of faulty teachers in Angluin’s exact learning model. The first model we consider is learning from equivalence and incomplete membership query oracles introduced by Angluin and Slonim [1]. In this model, the answers to a random subset of the learner’s membership queries may be missing. The second model we consider is random persistent classification noise in membership queries introduced by Goldman et al. [2]. In this model, the answers to a random subset of the learner’s membership queries are flipped.

We show that the incomplete membership query oracle is strictly stronger than the membership query oracle with persistent noise under the assumption that the problem of PAC learning parities with noise is intractable.

We also show that under the standard cryptographic assumptions the incomplete membership query oracle is strictly weaker than the perfect membership query oracle. This strengthens the result of Simon [3] and resolves an open question of Bshouty and Eiron [4].

Our techniques are based on ideas from coding theory and cryptography.

Vitaly Feldman, Shrenik Shah, Neal Wadhwa
Vapnik-Chervonenkis Dimension of Parallel Arithmetic Computations

We provide upper bounds for the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers whose membership tests are programs described by bounded-depth arithmetic networks. Our upper bounds are of the kind $O(k^2d^2)$, where d is the depth of the network (representing the parallel running time) and k is the number of parameters needed to codify the concept. This bound becomes $O(k^2d)$ when membership tests are described by Boolean-arithmetic circuits. As a consequence we conclude that families of concept classes having parallel polynomial time algorithms expressing their membership tests have polynomial VC dimension.

César L. Alonso, José Luis Montaña
Parameterized Learnability of k-Juntas and Related Problems

We study the parameterized complexity of learning k-juntas and some variations of juntas. We show the hardness of learning k-juntas and subclasses of k-juntas in the PAC model by reductions from a W[2]-complete problem. On the other hand, as a consequence of a more general result we show that k-juntas are exactly learnable with improper equivalence queries and access to a W[P] oracle.

Vikraman Arvind, Johannes Köbler, Wolfgang Lindner
On Universal Transfer Learning

In transfer learning the aim is to solve new learning tasks using fewer examples by using information gained from solving related tasks. Existing transfer learning methods have been used successfully in practice and PAC analysis of these methods have been developed. But the key notion of relatedness between tasks has not yet been defined clearly, which makes it difficult to understand, let alone answer, questions that naturally arise in the context of transfer, such as, how much information to transfer, whether to transfer information, and how to transfer information across tasks. In this paper we look at transfer learning from the perspective of Algorithmic Information Theory, and formally solve these problems in the same sense Solomonoff Induction solves the problem of inductive inference. We define universal measures of relatedness between tasks, and use these measures to develop universally optimal Bayesian transfer learning methods.

M. M. Hassan Mahmud

Online Learning

Tuning Bandit Algorithms in Stochastic Environments

Algorithms based on upper-confidence bounds for balancing exploration and exploitation are gaining popularity since they are easy to implement, efficient and effective. In this paper we consider a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental works, such algorithms were found to outperform the competing algorithms. The purpose of this paper is to provide a theoretical explanation of these findings and to provide theoretical guidelines for tuning the parameters of these algorithms. For this we analyze the expected regret and, for the first time, the concentration of the regret. The analysis of the expected regret shows that variance estimates can be especially advantageous when the payoffs of suboptimal arms have low variance. The risk analysis, rather unexpectedly, reveals that except for some very special bandit problems, the regret of upper-confidence-bound algorithms with standard bias sequences concentrates only at a polynomial rate. Hence, although these algorithms achieve logarithmic expected regret rates, they seem less attractive when the risk of suffering much worse than logarithmic regret is also taken into account.
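A variance-aware upper-confidence-bound policy of the kind analyzed here can be sketched as follows. This is a minimal illustration in the spirit of the approach, not the paper's algorithm: the bonus constants, the logarithmic bias sequence, and the Bernoulli reward setup are assumptions for the example.

```python
import math
import random

def ucb_v(means, horizon=5000, seed=0):
    """Variance-aware UCB sketch: the exploration bonus uses the empirical
    variance of each arm plus a range-dependent correction term.
    `means` are Bernoulli arm parameters; rewards lie in [0, 1]."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    sq_sums = [0.0] * k
    regret = 0.0
    best = max(means)
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                      # pull each arm once first
        else:
            def index(i):
                n = counts[i]
                mu = sums[i] / n
                var = max(sq_sums[i] / n - mu * mu, 0.0)
                e = math.log(t)              # illustrative bias sequence
                return mu + math.sqrt(2 * var * e / n) + 3 * e / n
            arm = max(range(k), key=index)
        r = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += r
        sq_sums[arm] += r * r
        regret += best - means[arm]
    return regret

# suboptimal arms with low empirical variance are abandoned quickly
total_regret = ucb_v([0.9, 0.5, 0.1], horizon=5000)
```

Over 5000 rounds the cumulative regret stays far below the linear worst case, consistent with the logarithmic expected-regret rates the abstract discusses.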

Jean-Yves Audibert, Rémi Munos, Csaba Szepesvári
Following the Perturbed Leader to Gamble at Multi-armed Bandits

Following the perturbed leader (fpl) is a powerful technique for solving online decision problems. Kalai and Vempala [1] rediscovered this algorithm recently. A traditional model for online decision problems is the multi-armed bandit, in which a gambler has to choose at each round one of the k levers to pull with the intention of minimizing the cumulated cost. There are four versions of the nonstochastic optimization setting, of which the most demanding is a game played against an adaptive adversary in the bandit setting. An adaptive adversary may alter its strategy of assigning costs to decisions depending on the decisions chosen by the gambler in the past. In the bandit setting the gambler only gets to know the cost of the choice he made, rather than the costs of all available alternatives. In this work we show that the very straightforward and easy to implement algorithm Adaptive Bandit fpl can attain a regret of $O(\sqrt{T \ln T})$ against an adaptive adversary. This regret holds with respect to the best lever in hindsight and matches the previous best regret bounds of $O(\sqrt{T \ln T})$.
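The basic fpl technique is easy to state in code. The sketch below is the simpler full-information version (all costs are revealed each round), not the adaptive bandit variant this paper studies; the exponential perturbations and the learning rate `eta` are illustrative assumptions in the style of Kalai and Vempala.

```python
import random

def fpl(costs, eta=0.1, seed=0):
    """Follow the Perturbed Leader (full-information sketch): each round,
    play the expert whose cumulative cost, minus an independent exponential
    perturbation of mean 1/eta, is smallest.
    `costs` is a list of rounds, each a list of per-expert costs in [0, 1]."""
    rng = random.Random(seed)
    k = len(costs[0])
    cum = [0.0] * k                  # cumulative cost of each expert so far
    total = 0.0                      # cumulative cost of the algorithm
    for round_costs in costs:
        perturbed = [cum[i] - rng.expovariate(eta) for i in range(k)]
        choice = min(range(k), key=lambda i: perturbed[i])
        total += round_costs[choice]
        for i in range(k):           # full information: observe all costs
            cum[i] += round_costs[i]
    best = min(sum(c[i] for c in costs) for i in range(k))
    return total, best

# expert 0 is cheap on 900 of 1000 rounds; fpl should settle on it
rounds = [[0.1, 0.9] if t % 10 else [0.9, 0.1] for t in range(1000)]
total, best = fpl(rounds)
```

The algorithm's cumulative cost stays within a sublinear additive term of the best fixed expert in hindsight, which is the shape of guarantee the abstract's $O(\sqrt{T \ln T})$ bound makes precise for the harder bandit setting.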

Jussi Kujala, Tapio Elomaa
Online Regression Competitive with Changing Predictors

This paper deals with the problem of making predictions in the online mode of learning where the dependence of the outcome $y_t$ on the signal $x_t$ can change with time. The Aggregating Algorithm (AA) is a technique that optimally merges experts from a pool, so that the resulting strategy suffers a cumulative loss that is almost as good as that of the best expert in the pool. We apply the AA to the case where the experts are all the linear predictors that can change with time. KAARCh is the kernel version of the resulting algorithm. In the kernel case, the experts are all the decision rules in some reproducing kernel Hilbert space that can change over time. We show that KAARCh suffers a cumulative square loss that is almost as good as that of any expert that does not change very rapidly.

Steven Busuttil, Yuri Kalnishkan

Unsupervised Learning

Cluster Identification in Nearest-Neighbor Graphs

Assume we are given a sample of points from some underlying distribution which contains several distinct clusters. Our goal is to construct a neighborhood graph on the sample points such that clusters are “identified”: that is, the subgraph induced by points from the same cluster is connected, while subgraphs corresponding to different clusters are not connected to each other. We derive bounds on the probability that cluster identification is successful, and use them to predict “optimal” values of k for the mutual and symmetric k-nearest-neighbor graphs. We point out different properties of the mutual and symmetric nearest-neighbor graphs related to the cluster identification problem.
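The mutual k-nearest-neighbor graph at the center of this analysis is simple to construct; a small sketch (the two-cluster Gaussian data and the choice k = 5 are assumptions for illustration) shows the cluster-identification property: for well-separated clusters, no mutual-kNN edge crosses the gap.

```python
import numpy as np

def mutual_knn_graph(X, k):
    """Boolean adjacency of the mutual k-NN graph: i ~ j iff each point is
    among the other's k nearest neighbours (the symmetric variant uses OR)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours of each point
    is_nn = np.zeros_like(d, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    is_nn[rows, nn.ravel()] = True
    return is_nn & is_nn.T                      # mutual: both directions required

# two well-separated clusters of 30 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(5, 0.2, (30, 2))])
A = mutual_knn_graph(X, k=5)
cross_edges = A[:30, 30:].sum()                 # edges between the two clusters
```

Here `cross_edges` is 0: each cluster's subgraph is internally connected while the clusters stay disconnected, which is exactly the event whose probability the paper bounds.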

Markus Maier, Matthias Hein, Ulrike von Luxburg
Multiple Pass Streaming Algorithms for Learning Mixtures of Distributions in ${\mathbb R}^d$

We present a multiple pass streaming algorithm for learning the density function of a mixture of k uniform distributions over rectangles (cells) in ${\mathbb R}^d$, for any d > 0. Our learning model is: samples drawn according to the mixture are placed in arbitrary order in a data stream that may only be accessed sequentially by an algorithm with a very limited random access memory space. Our algorithm makes 2ℓ + 1 passes, for any ℓ > 0, and requires memory at most $\tilde O(\epsilon^{-2/\ell}k^2d^4+(2k)^d)$. This exhibits a strong memory-space tradeoff: a few more passes significantly lowers its memory requirements, thus trading one of the two most important resources in streaming computation for the other. Chang and Kannan [1] first considered this problem for d = 1, 2.

Our learning algorithm is especially appropriate for situations where massive data sets of samples are available, but practical computation with such large inputs requires very restricted models of computation.

Kevin L. Chang

Language Learning

Learning Efficiency of Very Simple Grammars from Positive Data

The class of very simple grammars is known to be polynomial-time identifiable in the limit from positive data. This paper gives even more general discussion on the efficiency of identification of very simple grammars from positive data, which includes both positive and negative results. In particular, we present an alternative efficient inconsistent learning algorithm for very simple grammars.

Ryo Yoshinaka
Learning Rational Stochastic Tree Languages

We consider the problem of learning stochastic tree languages, i.e. probability distributions over a set of trees $T({\cal F})$, from a sample of trees independently drawn according to an unknown target P. We consider the case where the target is a rational stochastic tree language, i.e. it can be computed by a rational tree series or, equivalently, by a multiplicity tree automaton. In this paper, we provide two contributions. First, we show that rational tree series admit a canonical representation with parameters that can be efficiently estimated from samples. Then, we give an inference algorithm that identifies the class of rational stochastic tree languages in the limit with probability one.

François Denis, Amaury Habrard

Query Learning

One-Shot Learners Using Negative Counterexamples and Nearest Positive Examples

As some cognitive research suggests, in the process of learning languages, in addition to overt explicit negative evidence, a child often receives covert explicit evidence in the form of corrected or rephrased sentences. In this paper, we suggest one approach to formalization of overt and covert evidence within the framework of one-shot learners via subset and membership queries to a teacher (oracle). We compare and explore general capabilities of our models, as well as complexity advantages of learnability models of one type over models of other types, where complexity is measured in terms of the number of queries. In particular, we establish that “correcting” positive examples sometimes give more power to a learner than just negative (counter)examples and access to full positive data.

Sanjay Jain, Efim Kinber
Polynomial Time Algorithms for Learning k-Reversible Languages and Pattern Languages with Correction Queries

We investigate two of the language classes intensively studied by the algorithmic learning theory community in the context of learning with correction queries. More precisely, we show that any pattern language can be inferred in polynomial time in the length of the pattern by asking just a linear number of correction queries, and that k-reversible languages are efficiently learnable within this setting. Note that although the class of all pattern languages is learnable with membership queries, this cannot be done in polynomial time. Moreover, the class of k-reversible languages is not learnable at all using membership queries only.

Cristina Tîrnăucă, Timo Knuutila
Learning and Verifying Graphs Using Queries with a Focus on Edge Counting

We consider the problem of learning and verifying hidden graphs and their properties given query access to the graphs. We analyze various queries (edge detection, edge counting, shortest path), but we focus mainly on edge counting queries. We give an algorithm for learning graph partitions using $O(n \log n)$ edge counting queries. We introduce a problem that has not been considered: verifying graphs with edge counting queries, and give a randomized algorithm with error ε for graph verification using $O(\log(1/\epsilon))$ edge counting queries. We examine the current state of the art and add some original results for edge detection and shortest path queries to give a more complete picture of the relative power of these queries to learn various graph classes. Finally, we relate our work to Freivalds’ ‘fingerprinting technique’ – a probabilistic method for verifying that two matrices are equal by multiplying them by random vectors.
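Freivalds' fingerprinting technique, which the abstract relates to graph verification, can be sketched directly. The matrices and trial count below are illustrative assumptions; the key point is that each random-vector trial costs only matrix-vector products.

```python
import random

def freivalds(A, B, C, trials=20, seed=0):
    """Freivalds' fingerprinting: test whether A @ B == C by comparing
    A(Bx) with Cx for random 0/1 vectors x.  A wrong C is caught in one
    trial with probability >= 1/2, so the error after `trials` rounds
    is at most 2**-trials, using only O(n^2) work per trial."""
    rng = random.Random(seed)
    n = len(A)
    def matvec(M, x):
        return [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    for _ in range(trials):
        x = [rng.randrange(2) for _ in range(n)]
        if matvec(A, matvec(B, x)) != matvec(C, x):
            return False                       # witness found: products differ
    return True                                # consistent on all fingerprints

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
good = [[19, 22], [43, 50]]                    # the true product of A and B
bad = [[19, 22], [43, 51]]                     # one wrong entry
```

A correct product always passes, while `bad` is rejected with probability at least $1 - 2^{-20}$; the graph-verification algorithm above plays the same game with edge counting queries in place of matrix-vector products.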

Lev Reyzin, Nikhil Srivastava
Exact Learning of Finite Unions of Graph Patterns from Queries

A linear graph pattern is a labeled graph such that its vertices have constant labels and its edges have either constant or mutually distinct variable labels. An edge having a variable label is called a variable and can be replaced with an arbitrary labeled graph. Let ${\cal GP}({\mathcal C})$ be the set of all linear graph patterns having a structural feature ${\mathcal C}$ like “having a tree structure”, “having a two-terminal series parallel graph structure” and so on. The graph language $GL_{\mathcal C}(g)$ of a linear graph pattern g in ${\cal GP}({\mathcal C})$ is the set of all labeled graphs obtained from g by substituting arbitrary labeled graphs having the structural feature ${\mathcal C}$ for all variables in g. In this paper, for any set ${\cal T_*}$ of m linear graph patterns in ${\cal GP}({\mathcal C})$, we present a query learning algorithm for finding a set S of linear graph patterns in ${\cal GP}({\mathcal C})$ with $\bigcup_{g\in{\cal T_*}}GL_{\mathcal C}(g)=\bigcup_{f\in S}GL_{\mathcal C}(f)$ in polynomial time using at most m + 1 equivalence queries and $O(m(n + n^2))$ restricted subset queries, where n is the maximum number of edges of counterexamples, if the number of labels of edges is infinite. Next we show that finite sets of graph languages generated by linear graph patterns having tree structures or two-terminal series parallel graph structures are not learnable in polynomial time using restricted equivalence, membership and subset queries.

Rika Okada, Satoshi Matsumoto, Tomoyuki Uchida, Yusuke Suzuki, Takayoshi Shoudai

Kernel-Based Learning

Polynomial Summaries of Positive Semidefinite Kernels

Although polynomials have proven to be useful tools to tailor generic kernels to the context of specific applications, little was known about generic rules for tuning parameters (i.e. coefficients) to engineer new positive semidefinite kernels. This not only may hinder intensive exploitation of the flexibility of the kernel method, but also may cause misuse of indefinite kernels. Our main theorem presents a sufficient condition on polynomials such that applying the polynomials to known positive semidefinite kernels results in positive semidefinite kernels. The condition is very simple and therefore has a wide range of applications. In addition, in the case of degree 1, it is a necessary condition as well. We also prove the effectiveness of our theorem by showing three corollaries to it: the first one is a generalization of the polynomial kernels, while the second one presents a way to extend the principal-angle kernels, the trace kernels, and the determinant kernels. The third corollary shows corrected sufficient conditions for the codon-improved kernels and the weighted-degree kernels with shifts to be positive semidefinite.

Kilho Shin, Tetsuji Kuboyama
Learning Kernel Perceptrons on Noisy Data Using Random Projections

In this paper, we address the issue of learning nonlinearly separable concepts with a kernel classifier in the situation where the data at hand are altered by a uniform classification noise. Our proposed approach relies on the combination of the technique of random or deterministic projections with a classification noise tolerant perceptron learning algorithm that assumes distributions defined over finite-dimensional spaces. Provided a sufficient separation margin characterizes the problem, this strategy makes it possible to envision the learning from a noisy distribution in any separable Hilbert space, regardless of its dimension; learning with any appropriate Mercer kernel is therefore possible. We prove that the required sample complexity and running time of our algorithm is polynomial in the classical PAC learning parameters. Numerical simulations on toy datasets and on data from the UCI repository support the validity of our approach.

Guillaume Stempfel, Liva Ralaivola
Continuity of Performance Metrics for Thin Feature Maps

We study the class of hypotheses composed of linear functionals superimposed with smooth feature maps. We show that for a “typical” smooth feature map, the pointwise convergence of hypotheses implies the convergence of some standard metrics, such as error rate or area under the ROC curve, with probability 1 in the selection of the test sample from a (Lebesgue measurable) probability density. The proofs use transversality theory. The crux is to show that for every “typical”, sufficiently smooth feature map into a finite dimensional vector space, the counter-image of every affine hyperplane has Lebesgue measure 0.

The results extend to every real analytic, in particular polynomial, feature map if its domain is connected and the limit hypothesis is non-constant. In the process we give an elementary proof of the fundamental lemma that the locus of zeros of a real analytic function on a connected domain either fills the whole space or forms a subset of measure 0.

Adam Kowalczyk

Other Directions

Multiclass Boosting Algorithms for Shrinkage Estimators of Class Probability

Our purpose is to estimate conditional probabilities of output labels in multiclass classification problems. AdaBoost provides highly accurate classifiers and has the potential to estimate conditional probabilities. However, the conditional probability estimated by AdaBoost tends to overfit to training samples. We propose loss functions for boosting that provide a shrinkage estimator. The effect of regularization is realized by shrinkage of probabilities toward the uniform distribution. Numerical experiments indicate that boosting algorithms based on the proposed loss functions show significantly better results than existing boosting algorithms for estimation of conditional probabilities.

Takafumi Kanamori
Pseudometrics for State Aggregation in Average Reward Markov Decision Processes

We consider how state similarity in average reward Markov decision processes (MDPs) may be described by pseudometrics. Introducing the notion of

adequate

pseudometrics which are well adapted to the structure of the MDP, we show how these may be used for state aggregation. Upper bounds on the loss that may be caused by working on the aggregated instead of the original MDP are given and compared to the bounds that have been achieved for discounted reward MDPs.

Ronald Ortner
On Calibration Error of Randomized Forecasting Algorithms

Recently, it was shown that calibration with an error less than δ > 0 is almost surely guaranteed with a randomized forecasting algorithm, where forecasts are chosen using randomized rounding up to δ of deterministic forecasts. We show that this error cannot be improved for a large majority of sequences generated by a probabilistic algorithm: we prove that, combining outcomes of coin-tossing and a transducer algorithm, it is possible to effectively generate with probability close to one a sequence “resistant” to any randomized rounding forecasting with an error much smaller than δ.

Vladimir V. V’yugin
Backmatter
Metadata
Title
Algorithmic Learning Theory
Editors
Marcus Hutter
Rocco A. Servedio
Eiji Takimoto
Copyright Year
2007
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-75225-7
Print ISBN
978-3-540-75224-0
DOI
https://doi.org/10.1007/978-3-540-75225-7
