
2003 | Book

# Learning Theory and Kernel Machines

## 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings

Editors: Bernhard Schölkopf, Manfred K. Warmuth

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


#### Invited Talk

##### A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria

A general class of no-regret learning algorithms, called Φ-no-regret learning algorithms, is defined, which spans the spectrum from no-internal-regret learning to no-external-regret learning, and beyond. Φ describes the set of strategies to which the play of a learning algorithm is compared: a learning algorithm satisfies Φ-no-regret iff no regret is experienced for playing as the algorithm prescribes, rather than playing according to any of the transformations of the algorithm’s play prescribed by elements of Φ. Analogously, a class of game-theoretic equilibria, called Φ-equilibria, is defined, and it is shown that the empirical distribution of play of Φ-no-regret algorithms converges to the set of Φ-equilibria. Perhaps surprisingly, the strongest no-regret algorithms in this class are no-internal-regret algorithms. Thus, the tightest game-theoretic solution concept to which Φ-no-regret algorithms (provably) converge is correlated equilibrium. In particular, Nash equilibrium is not a necessary outcome of learning via any Φ-no-regret learning algorithm.

Amy Greenwald, Amir Jafari
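
The external-regret special case of this framework is easy to make concrete. The sketch below is an illustration, not the paper's construction: it measures the average external regret of a play sequence, i.e. Φ-regret where Φ contains only the constant transformations.

```python
import numpy as np

def external_regret(losses, play):
    """Average external regret of a play sequence.

    losses: (T, n) array, losses[t, a] = loss of action a at round t.
    play:   length-T sequence of actions actually played.
    External regret compares realized loss to the best single action in
    hindsight -- the Phi-regret special case where Phi contains only the
    constant transformations phi_a(x) = a.
    """
    T = len(play)
    realized = sum(losses[t, play[t]] for t in range(T))
    best_fixed = losses.sum(axis=0).min()
    return (realized - best_fixed) / T

# toy example: two actions, the learner alternates
losses = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
play = [0, 1, 0, 1]
r = external_regret(losses, play)
```

No-internal-regret strengthens this by also comparing against swap transformations ("every time you played a, play b instead"), which is what drives the convergence to correlated equilibrium described above.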

#### Contributed Talks

##### Preference Elicitation and Query Learning

In this paper we initiate an exploration of relationships between “preference elicitation”, a learning-style problem that arises in combinatorial auctions, and the problem of learning via queries studied in computational learning theory. Preference elicitation is the process of asking questions about the preferences of bidders so as to best divide some set of goods. As a learning problem, it can be thought of as a setting in which there are multiple target concepts that can each be queried separately, but where the goal is not so much to learn each concept as it is to produce an “optimal example”. In this work, we establish a number of similarities and differences between preference elicitation and query learning, giving both separation results and connections between these problems.

Avrim Blum, Jeffrey C. Jackson, Tuomas Sandholm, Martin Zinkevich
##### Efficient Algorithms for Online Decision Problems

In an online decision problem, one makes a sequence of decisions without knowledge of the future. Tools from learning such as Weighted Majority and its many variants [4, 13, 18] demonstrate that online algorithms can perform nearly as well as the best single decision chosen in hindsight, even when there are exponentially many possible decisions. However, the naive application of these algorithms is inefficient for such large problems. For some problems with nice structure, specialized efficient solutions have been developed [3, 6, 10, 16, 17]. We show that a very simple idea, used in Hannan’s seminal 1957 paper [9], gives efficient solutions to all of these problems. Essentially, in each period, one chooses the decision that worked best in the past. To guarantee low regret, it is necessary to add randomness. Surprisingly, this simple approach gives additive ε regret per period, efficiently. We present a simple general analysis and several extensions, including a (1+ε)-competitive algorithm as well as a lazy one that rarely switches between decisions.
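
The "follow the perturbed leader" idea described above can be sketched in a few lines. This is an illustrative rendering, not the paper's exact construction (the noise distribution and its scaling are assumptions here): each period, play the decision whose perturbed cumulative past loss is smallest.

```python
import numpy as np

def follow_perturbed_leader(losses, eps=0.5, seed=0):
    """Hannan-style 'follow the perturbed leader' over n discrete decisions.

    losses: (T, n) array of per-period losses. Each period we pick the
    decision whose perturbed cumulative past loss is smallest; the random
    perturbation (exponential noise of scale 1/eps, an illustrative choice)
    is what guarantees low expected regret.
    """
    rng = np.random.default_rng(seed)
    T, n = losses.shape
    cum = np.zeros(n)       # cumulative loss of each decision so far
    total = 0.0
    for t in range(T):
        noise = rng.exponential(scale=1.0 / eps, size=n)
        choice = int(np.argmin(cum - noise))   # perturbed leader
        total += losses[t, choice]
        cum += losses[t]
    return total
```

The efficiency point of the abstract is that when decisions have structure (paths, trees, orderings), "best decision in the past" is a single offline optimization per period, so no explicit enumeration of the exponentially many decisions is needed.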

#### Kernel Machines

##### Positive Definite Rational Kernels

Kernel methods are widely used in statistical learning techniques. We recently introduced a general kernel framework based on weighted transducers or rational relations, rational kernels, to extend kernel methods to the analysis of variable-length sequences or more generally weighted automata. These kernels are efficient to compute and have been successfully used in applications such as spoken-dialog classification. However, not all rational kernels are positive definite and symmetric (PDS), a property sufficient for guaranteeing the convergence of discriminant classification algorithms such as Support Vector Machines. We present several theoretical results related to PDS rational kernels. We show in particular that under some conditions these kernels are closed under sum, product, or Kleene-closure and give a general method for constructing a PDS rational kernel from an arbitrary transducer defined on some non-idempotent semirings. We also show that some commonly used string kernels or similarity measures such as the edit-distance, the convolution kernels of Haussler, and some string kernels used in the context of computational biology are specific instances of rational kernels. Our results include the proof that the edit-distance over a non-trivial alphabet is not negative definite, which, to the best of our knowledge, had not been stated or proved before.

Corinna Cortes, Patrick Haffner, Mehryar Mohri
##### Bhattacharyya and Expected Likelihood Kernels

We introduce a new class of kernels between distributions. These induce a kernel on the input space between data points by associating to each datum a generative model fit to the data point individually. The kernel is then computed by integrating the product of the two generative models corresponding to two data points. This kernel permits discriminative estimation via, for instance, support vector machines, while exploiting the properties, assumptions, and invariances inherent in the choice of generative model. It satisfies Mercer’s condition and can be computed in closed form for a large class of models, including exponential family models, mixtures, hidden Markov models and Bayesian networks. For other models the kernel can be approximated by sampling methods. Experiments are shown for multinomial models in text classification and for hidden Markov models for protein sequence classification.

Tony Jebara, Risi Kondor
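
For multinomial models the Bhattacharyya kernel has a particularly simple closed form, K(p, q) = Σ_i √(p_i q_i), an inner product of square-root embeddings, so Mercer's condition is immediate. A minimal sketch (the add-alpha smoothing used to fit the generative model is an illustrative assumption):

```python
import numpy as np

def bhattacharyya_kernel(p, q):
    """Bhattacharyya kernel between two multinomial distributions.

    K(p, q) = sum_i sqrt(p_i * q_i) = <sqrt(p), sqrt(q)>, an inner product
    of square-root embeddings, so the kernel satisfies Mercer's condition.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(p * q).sum())

def fit_multinomial(counts, alpha=1.0):
    """Fit a smoothed multinomial (add-alpha) to one datum's counts."""
    c = np.asarray(counts, float) + alpha
    return c / c.sum()

# e.g. word-count vectors for two documents in text classification
p = fit_multinomial([3, 1, 0])
q = fit_multinomial([0, 1, 3])
```

K(p, p) = 1 for any distribution, and the kernel value between two data points is obtained by first fitting a model to each point individually, exactly as the abstract describes.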
##### Maximal Margin Classification for Metric Spaces

In this article we construct a maximal margin classification algorithm for arbitrary metric spaces. First, we show that the Support Vector Machine (SVM) is a maximal margin algorithm for the class of metric spaces where the negative squared distance is conditionally positive definite (CPD). This means that the metric space can be isometrically embedded into a Hilbert space, where one performs linear maximal margin separation. We will show that the solution only depends on the metric, but not on the kernel. Following the framework we develop for the SVM, we construct an algorithm for maximal margin classification in arbitrary metric spaces. The main difference compared with SVM is that we no longer embed isometrically into a Hilbert space, but a Banach space. We further give an estimate of the capacity of the function class involved in this algorithm via Rademacher averages. We recover an algorithm of Graepel et al. [6].

Matthias Hein, Olivier Bousquet
##### Maximum Margin Algorithms with Boolean Kernels

Recent work has introduced Boolean kernels with which one can learn over a feature space containing all conjunctions of length up to k (for any 1≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as support vector machines can learn Disjunctive Normal Form expressions in the PAC learning model using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that such maximum margin algorithms do not PAC learn t(n)-term DNF for any t(n) = ω(1), even when used with such an SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length $\tilde{\omega}(\sqrt{n})$ then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels.

Roni Khardon, Rocco A. Servedio
##### Knowledge-Based Nonlinear Kernel Classifiers

Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a nonlinear kernel support vector machine (SVM) classifier. The resulting formulation leads to a linear program that can be solved efficiently. This extends, in a rather unobvious fashion, previous work [3] that incorporated similar prior knowledge into a linear SVM classifier. Numerical tests on standard-type test problems, such as exclusive-or prior knowledge sets and a checkerboard with 16 points and prior knowledge instead of the usual 1000 points, show the effectiveness of the proposed approach in generating sharp nonlinear classifiers based mostly or totally on prior knowledge.

Glenn M. Fung, Olvi L. Mangasarian, Jude W. Shavlik
##### Fast Kernels for Inexact String Matching

We introduce several new families of string kernels designed in particular for use with support vector machines (SVMs) for classification of protein sequence data. These kernels – restricted gappy kernels, substitution kernels, and wildcard kernels – are based on feature spaces indexed by k-length subsequences from the string alphabet Σ (or the alphabet augmented by a wildcard character), and hence they are related to the recently presented (k,m)-mismatch kernel and string kernels used in text classification. However, for all kernels we define here, the kernel value K(x,y) can be computed in $O(c_K(|x| + |y|))$ time, where the constant $c_K$ depends on the parameters of the kernel but is independent of the size |Σ| of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant $c_K = k^{m+1} |\Sigma|^m$ of the mismatch kernel. We compute the kernels efficiently using a recursive function based on a trie data structure and relate our new kernels to the recently described transducer formalism. Finally, we report protein classification experiments on a benchmark SCOP dataset, where we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models.

Christina Leslie, Rui Kuang
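
The mismatch-free k-spectrum kernel that these families relax can be computed in linear time with hashed k-mer counts; a minimal baseline sketch (this is the simpler exact-match kernel, not the paper's gappy, substitution, or wildcard kernels themselves):

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """Exact-match k-spectrum kernel: K(x, y) counts pairs of matching
    k-length substrings of x and y. This is the mismatch-free baseline
    that gappy / substitution / wildcard kernels generalize; with hashed
    k-mer counts it runs in O(|x| + |y|) time for fixed k.
    """
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    # inner product in the feature space indexed by all k-mers
    return sum(cx[s] * cy[s] for s in cx)
```

The trie-based recursion in the paper plays the same role as the hash map here while also handling inexact matches within the same linear-time budget.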
##### On Graph Kernels: Hardness Results and Efficient Alternatives

As most ‘real-world’ data is structured, research in kernel methods has begun investigating kernels for various kinds of structured data. One of the most widely used tools for modeling structured data is the graph. An interesting and important challenge is thus to investigate kernels on instances that are represented by graphs. So far, only very specific graphs such as trees and strings have been considered. This paper investigates kernels on labeled directed graphs with general structure. It is shown that computing a strictly positive definite graph kernel is at least as hard as solving the graph isomorphism problem. It is also shown that computing an inner product in a feature space indexed by all possible graphs, where each feature counts the number of subgraphs isomorphic to that graph, is NP-hard. On the other hand, inner products in an alternative feature space, based on walks in the graph, can be computed in polynomial time. Such kernels are defined in this paper.

Thomas Gärtner, Peter Flach, Stefan Wrobel
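
A walk-based kernel of the kind defined in the paper can be sketched, for unlabeled graphs, via the direct product graph: pairs of matched walks in the two graphs correspond to walks in the product. The depth truncation and geometric downweighting below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def walk_kernel(A1, A2, depth=4, lam=0.5):
    """Walk-count graph kernel sketch for unlabeled graphs.

    The adjacency matrix of the direct product graph is kron(A1, A2):
    its entry ((i,k),(j,l)) is 1 iff edge (i,j) exists in graph 1 AND
    edge (k,l) exists in graph 2, so its n-th power counts pairs of
    matched length-n walks; lam**n downweights longer walks so the
    truncated sum behaves like a convergent series.
    """
    Ax = np.kron(A1, A2)            # adjacency of the product graph
    n = Ax.shape[0]
    K, P = 0.0, np.eye(n)
    for step in range(1, depth + 1):
        P = P @ Ax                  # P sums matched walks of this length
        K += (lam ** step) * P.sum()
    return K
```

Computing matrix powers of the product adjacency is polynomial in the graph sizes, which is the contrast the abstract draws with the NP-hard subgraph feature space.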
##### Kernels and Regularization on Graphs

We introduce a family of kernels on graphs based on the notion of regularization operators. This generalizes in a natural way the notion of regularization and Green's functions, as commonly used for real valued functions, to graphs. It turns out that diffusion kernels can be found as a special case of our reasoning. We show that the class of positive, monotonically decreasing functions on the unit interval leads to kernels and corresponding regularization operators.

Alexander J. Smola, Risi Kondor
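
The diffusion-kernel special case is easy to write down: K = exp(−βL) with L the graph Laplacian, computed spectrally. A minimal sketch (the unnormalized Laplacian and the choice of β are assumptions for illustration; the regularization-operator view replaces the exponential with other decreasing spectral functions):

```python
import numpy as np

def diffusion_kernel(A, beta=0.5):
    """Diffusion kernel on a graph: K = exp(-beta * L), L = D - A.

    L is the (unnormalized) graph Laplacian; since L is symmetric PSD we
    take the matrix exponential through its eigendecomposition. Other
    kernels in the regularization-operator family apply a different
    monotonically decreasing function r(lambda) to the spectrum.
    """
    A = np.asarray(A, float)
    L = np.diag(A.sum(axis=1)) - A
    w, V = np.linalg.eigh(L)              # spectral decomposition of L
    return (V * np.exp(-beta * w)) @ V.T  # V diag(e^{-beta w}) V^T
```

Because L annihilates the constant vector, each row of K sums to one, matching the heat-diffusion interpretation.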
##### Data-Dependent Bounds for Multi-category Classification Based on Convex Losses

Algorithms for solving multi-category classification problems using output coding have become very popular in recent years. Following initial attempts with discrete coding matrices, recent work has attempted to alleviate some of their shortcomings by considering real-valued ‘coding’ matrices. We consider an approach to multi-category classification, based on minimizing a convex upper bound on the 0-1 loss. We show that this approach is closely related to output coding, and derive data-dependent bounds on the performance. These bounds can be optimized in order to obtain effective coding matrices, which guarantee small generalization error. Moreover, our results apply directly to kernel based approaches.

Ilya Desyatnikov, Ron Meir
##### Tutorial: Learning Topics in Game-Theoretic Decision Making

The tutorial will cover some topics of recent interest in AI and economics concerning decision making in a computational game-theory framework. It will highlight areas in which computational learning theory has played a role and could play a greater role in the future. Covered areas include recent representational and algorithmic advances, stochastic games and reinforcement learning, no-regret algorithms, and the role of various equilibrium concepts.

Michael L. Littman

#### Statistical Learning Theory

##### Optimal Rates of Aggregation

We study the problem of aggregation of M arbitrary estimators of a regression function with respect to the mean squared risk. Three main types of aggregation are considered: model selection, convex and linear aggregation. We define the notion of optimal rate of aggregation in an abstract context and prove lower bounds valid for any method of aggregation. We then construct procedures that attain these bounds, thus establishing optimal rates of linear, convex and model selection type aggregation.

Alexandre B. Tsybakov
##### Distance-Based Classification with Lipschitz Functions

The goal of this article is to develop a framework for large margin classification in metric spaces. We want to find a generalization of linear decision functions for metric spaces and define a corresponding notion of margin such that the decision function separates the training points with a large margin. It will turn out that using Lipschitz functions as decision functions, the inverse of the Lipschitz constant can be interpreted as the size of a margin. In order to construct a clean mathematical setup we isometrically embed the given metric space into a Banach space and the space of Lipschitz functions into its dual space. Our approach leads to a general large margin algorithm for classification in metric spaces. To analyze this algorithm, we first prove a representer theorem. It states that there exists a solution which can be expressed as linear combination of distances to sets of training points. Then we analyze the Rademacher complexity of some Lipschitz function classes. The generality of the Lipschitz approach can be seen from the fact that several well-known algorithms are special cases of the Lipschitz algorithm, among them the support vector machine, the linear programming machine, and the 1-nearest neighbor classifier.

Ulrike von Luxburg, Olivier Bousquet
##### Random Subclass Bounds

It has been recently shown that sharp generalization bounds can be obtained when the function class from which the algorithm chooses its hypotheses is “small” in the sense that the Rademacher averages of this function class are small [8,9]. Seemingly based on different arguments, generalization bounds were obtained in the compression scheme [7], luckiness [13], and algorithmic luckiness [6] frameworks in which the “size” of the function class is not specified a priori. We show that the bounds obtained in all these frameworks follow from the same general principle, namely that coordinate projections of this function subclass evaluated on random samples are “small” with high probability.

Shahar Mendelson, Petra Philips
##### PAC-MDL Bounds

We point out that a number of standard sample complexity bounds (VC-dimension, PAC-Bayes, and others) are all related to the number of bits required to communicate the labels given the unlabeled data for a natural communication game. Motivated by this observation, we give a general sample complexity bound based on this game that allows us to unify these different bounds in one common framework.

Avrim Blum, John Langford

#### Online Learning

##### Universal Well-Calibrated Algorithm for On-Line Classification

We study the problem of on-line classification in which the prediction algorithm is given a “confidence level” 1-δ and is required to output as its prediction a range of labels (intuitively, those labels deemed compatible with the available data at the level δ) rather than just one label; as usual, the examples are assumed to be generated independently from the same probability distribution P. The prediction algorithm is said to be “well-calibrated” for P and δ if the long-run relative frequency of errors does not exceed δ almost surely w.r.t. P. For well-calibrated algorithms we take the number of “uncertain” predictions (i.e., those containing more than one label) as the principal measure of predictive performance. The main result of this paper is the construction of a prediction algorithm which, for any (unknown) P and any δ: (a) makes errors independently and with probability δ at every trial (in particular, is well-calibrated for P and δ); (b) makes in the long run no more uncertain predictions than any other prediction algorithm that is well-calibrated for P and δ; (c) processes example n in time O(log n).

##### Learning Probabilistic Linear-Threshold Classifiers via Selective Sampling

In this paper we investigate selective sampling, a learning model where the learner observes a sequence of i.i.d. unlabeled instances each time deciding whether to query the label of the current instance. We assume that labels are binary and stochastically related to instances via a linear probabilistic function whose coefficients are arbitrary and unknown. We then introduce a new selective sampling rule and show that its expected regret (with respect to the classifier knowing the underlying linear function and observing the label realization after each prediction) grows not much faster than the number of sampled labels. Furthermore, under additional assumptions on the true margin distribution, we prove that the number of sampled labels grows only logarithmically in the number of observed instances. Experiments carried out on a text categorization problem show that: (1) our selective sampling algorithm performs better than the Perceptron algorithm even when the latter is given the true label after each classification; (2) when allowed to observe the true label after each classification, the performance of our algorithm remains the same. Finally, we note that by expressing our selective sampling rule in dual variables we can learn nonlinear probabilistic functions via the kernel machinery.

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile
##### Learning Algorithms for Enclosing Points in Bregmanian Spheres

We discuss the problem of finding a generalized sphere that encloses points originating from a single source. The points contained in such a sphere are within a maximal divergence from a center point. The divergences we study are known as the Bregman divergences which include as a special case both the Euclidean distance and the relative entropy. We cast the learning task as an optimization problem and show that it results in a simple dual form which has interesting algebraic properties. We then discuss a general algorithmic framework to solve the optimization problem. Our training algorithm employs an auxiliary function that bounds the dual’s objective function and can be used with a broad class of Bregman functions. As a specific application of the algorithm we give a detailed derivation for the relative entropy. We analyze the generalization ability of the algorithm by adopting margin-style proof techniques. We also describe and analyze two schemes of online algorithms for the case when the radius of the sphere is set in advance.

Koby Crammer, Yoram Singer
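
The divergences in question are generated by a convex function F via B_F(x‖y) = F(x) − F(y) − ⟨∇F(y), x − y⟩; choosing F = ‖·‖² recovers the squared Euclidean distance, and negative entropy recovers the relative entropy on the simplex. A minimal sketch:

```python
import numpy as np

def bregman(F, gradF, x, y):
    """Bregman divergence generated by a convex function F:
    B_F(x || y) = F(x) - F(y) - <gradF(y), x - y>."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return F(x) - F(y) - gradF(y) @ (x - y)

# F(v) = ||v||^2 recovers the squared Euclidean distance ...
sq = lambda v: v @ v
sq_grad = lambda v: 2 * v

# ... and negative entropy recovers the relative entropy (KL divergence)
# between probability vectors
negent = lambda v: np.sum(v * np.log(v))
negent_grad = lambda v: np.log(v) + 1
```

A Bregmanian sphere of radius r around a center c is then simply the set {x : B_F(x‖c) ≤ r}, which the paper's optimization problem fits to data.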
##### Internal Regret in On-Line Portfolio Selection

This paper extends the game-theoretic notion of internal regret to the case of on-line portfolio selection problems. New sequential investment strategies are designed to minimize the cumulative internal regret for all possible market behaviors. Some of the introduced strategies, apart from achieving a small internal regret, achieve an accumulated wealth almost as large as that of the best constantly rebalanced portfolio. It is argued that the low-internal-regret property is related to stability, and experiments on real stock exchange data demonstrate that the new strategies usually achieve better returns than some known algorithms.

Gilles Stoltz, Gábor Lugosi

#### Other Approaches

##### Lower Bounds on the Sample Complexity of Exploration in the Multi-armed Bandit Problem

We consider the multi-armed bandit problem under the PAC (“probably approximately correct”) model. It was shown by Even-Dar et al. [5] that given n arms, it suffices to play the arms a total of $O\big(({n}/{\epsilon^2})\log ({1}/{\delta})\big)$ times to find an ε-optimal arm with probability of at least 1-δ. Our contribution is a matching lower bound that holds for any sampling policy. We also generalize the lower bound to a Bayesian setting, and to the case where the statistics of the arms are known but the identities of the arms are not.

Shie Mannor, John N. Tsitsiklis
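
The upper-bound side of this story is easy to illustrate: pulling every arm O((1/ε²) log(n/δ)) times and returning the empirical best is ε-optimal with probability at least 1 − δ by Hoeffding plus a union bound. This naive scheme carries an extra log n factor that Even-Dar et al. remove; the paper's contribution is the matching lower bound. A sketch:

```python
import math, random

def naive_pac_arm(pull, n, eps=0.1, delta=0.05, seed=0):
    """Naive PAC bandit sketch: pull each of the n arms
    m = ceil((2/eps^2) ln(2n/delta)) times and return the arm with the
    highest empirical mean. Hoeffding's inequality plus a union bound
    over arms guarantees the answer is eps-optimal with probability at
    least 1 - delta (with an extra log n relative to the optimal rate).
    """
    random.seed(seed)
    m = math.ceil((2 / eps ** 2) * math.log(2 * n / delta))
    means = [sum(pull(a) for _ in range(m)) / m for a in range(n)]
    return max(range(n), key=lambda a: means[a])
```

Here `pull(a)` is assumed to return an independent reward in [0, 1] from arm a each time it is called.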
##### Smooth ε-Insensitive Regression by Loss Symmetrization

We describe a framework for solving regression problems by reduction to classification. Our reduction is based on symmetrization of margin-based loss functions commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. Our construction yields a smooth version of the ε-insensitive hinge loss that is used in support vector regression. A byproduct of this construction is a new simple form of regularization for boosting-based classification and regression algorithms. We present two parametric families of batch learning algorithms for minimizing these losses. The first family employs a log-additive update and is based on recent boosting algorithms while the second family uses a new form of additive update. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the ε-insensitive logistic loss. Our regression framework also has implications for classification algorithms, namely, a new additive batch algorithm for the log-loss and exp-loss used in boosting.

Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
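
One natural rendering of the symmetrization idea (the constants below are illustrative, not the paper's exact construction) is to sum the logistic loss applied at both edges of the ε-tube. This gives a smooth convex surrogate for the ε-insensitive hinge loss max(|d| − ε, 0), where d = y − f(x):

```python
import numpy as np

def smooth_eps_insensitive(d, eps=0.5):
    """Symmetrized logistic loss sketch for regression residual d = y - f(x):
    log(1 + e^{d - eps}) + log(1 + e^{-d - eps}).
    The loss is smooth, convex, symmetric in d, nearly zero inside the
    eps-tube, and approaches |d| - eps for large |d|, i.e. a smooth
    version of the eps-insensitive hinge loss.
    """
    d = np.asarray(d, float)
    # logaddexp(0, z) = log(1 + e^z), computed stably
    return np.logaddexp(0.0, d - eps) + np.logaddexp(0.0, -d - eps)
```

Replacing the two logistic terms with exponential ones would give the analogous symmetrization of the exponential loss.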
##### On Finding Large Conjunctive Clusters

We propose a new formulation of the clustering problem that differs from previous work in several aspects. First, the goal is to explicitly output a collection of simple and meaningful conjunctive descriptions of the clusters. Second, the clusters might overlap, i.e., a point can belong to multiple clusters. Third, the clusters might not cover all points, i.e., not every point is clustered. Finally, we allow a point to be assigned to a conjunctive cluster description even if it does not completely satisfy all of the attributes, but rather only satisfies most. A convenient way to view our clustering problem is that of finding a collection of large bicliques in a bipartite graph. Identifying one largest conjunctive cluster is equivalent to finding a maximum edge biclique. Since this problem is NP-hard [28] and there is evidence that it is difficult to approximate [12], we solve a relaxed version where the objective is to find a large subgraph that is close to being a biclique. We give a randomized algorithm that finds a relaxed biclique with almost as many edges as the maximum biclique. We then extend this algorithm to identify a good collection of large relaxed bicliques. A key property of these algorithms is that their running time is independent of the number of data points and linear in the number of attributes.

Nina Mishra, Dana Ron, Ram Swaminathan
##### Learning Arithmetic Circuits via Partial Derivatives

We present a polynomial time algorithm for learning several models of algebraic computation. We show that any arithmetic circuit whose partial derivatives induce a “low”-dimensional vector space is exactly learnable from membership and equivalence queries. As a consequence we obtain the first polynomial time algorithm for learning depth three multilinear arithmetic circuits. In addition, we give the first polynomial time algorithms for learning restricted algebraic branching programs and noncommutative arithmetic formulae. Previously only restricted versions of depth-2 arithmetic circuits were known to be learnable in polynomial time. Our learning algorithms can be viewed as solving a generalization of the well known polynomial interpolation problem where the unknown polynomial has a succinct representation. We can learn representations of polynomials encoding exponentially many monomials. Our techniques combine a careful algebraic analysis of arithmetic circuits’ partial derivatives with the “multiplicity automata” techniques due to Beimel et al. [BBB+00].

##### Comparing Clusterings by the Variation of Information

This paper proposes an information theoretic criterion for comparing two partitions, or clusterings, of the same data set. The criterion, called variation of information (VI), measures the amount of information lost and gained in changing from clustering ${\cal C}$ to clustering ${\cal C}'$. The criterion makes no assumptions about how the clusterings were generated and applies to both soft and hard clusterings. The basic properties of VI are presented and discussed from the point of view of comparing clusterings. In particular, the VI is positive, symmetric and obeys the triangle inequality. Thus, surprisingly enough, it is a true metric on the space of clusterings.

Marina Meilă
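
The criterion is direct to compute from contingency counts: VI(C, C′) = H(C) + H(C′) − 2I(C; C′). A minimal sketch for hard clusterings given as label sequences over the same points:

```python
import math
from collections import Counter

def variation_of_information(c1, c2):
    """Variation of information between two hard clusterings, given as
    equal-length label sequences: VI = H(C) + H(C') - 2 I(C; C').
    VI is nonnegative, symmetric, and obeys the triangle inequality,
    making it a true metric on the space of clusterings.
    """
    n = len(c1)
    p1, p2 = Counter(c1), Counter(c2)
    joint = Counter(zip(c1, c2))
    h1 = -sum(c / n * math.log(c / n) for c in p1.values())
    h2 = -sum(c / n * math.log(c / n) for c in p2.values())
    mi = sum(c / n * math.log((c / n) / ((p1[a] / n) * (p2[b] / n)))
             for (a, b), c in joint.items())
    return h1 + h2 - 2 * mi
```

VI is zero exactly when the two clusterings are identical up to relabeling, and the soft-clustering case replaces the counts with cluster membership probabilities.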
##### Multiplicative Updates for Large Margin Classifiers

Various problems in nonnegative quadratic programming arise in the training of large margin classifiers. We derive multiplicative updates for these problems that converge monotonically to the desired solutions for hard and soft margin classifiers. The updates differ strikingly in form from other multiplicative updates used in machine learning. In this paper, we provide complete proofs of convergence for these updates and extend previous work to incorporate sum and box constraints in addition to nonnegativity.

Fei Sha, Lawrence K. Saul, Daniel D. Lee
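
For the nonnegativity-only case, the flavor of such updates can be sketched as follows. This is an illustrative update in the style the abstract describes, for min ½vᵀAv + bᵀv over v ≥ 0, without the sum and box constraints the paper adds:

```python
import numpy as np

def multiplicative_nqp(A, b, v0, iters=300):
    """Multiplicative-update sketch for nonnegative quadratic programming:
    minimize F(v) = 0.5 v'Av + b'v subject to v >= 0.

    Split A = A+ - A- into elementwise positive and negative parts; each
    coordinate is then rescaled by a nonnegative factor that equals 1
    exactly at points satisfying the KKT stationarity condition, so the
    iterates stay nonnegative and fixed points are stationary points.
    """
    A = np.asarray(A, float)
    b = np.asarray(b, float)
    Ap, Am = np.maximum(A, 0.0), np.maximum(-A, 0.0)
    v = np.asarray(v0, float).copy()
    for _ in range(iters):
        a, c = Ap @ v, Am @ v
        v *= (-b + np.sqrt(b * b + 4 * a * c)) / (2 * a)
    return v
```

Unlike gradient steps, no learning rate or projection is needed: nonnegativity is preserved automatically by the multiplicative form.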
##### Simplified PAC-Bayesian Margin Bounds

The theoretical understanding of support vector machines is largely based on margin bounds for linear classifiers with unit-norm weight vectors and unit-norm feature vectors. Unit-norm margin bounds have been proved previously using fat-shattering arguments and Rademacher complexity. Recently Langford and Shawe-Taylor proved a dimension-independent unit-norm margin bound using a relatively simple PAC-Bayesian argument. Unfortunately, the Langford-Shawe-Taylor bound is stated in a variational form making direct comparison to fat-shattering bounds difficult. This paper provides an explicit solution to the variational problem implicit in the Langford-Shawe-Taylor bound and shows that the PAC-Bayesian margin bounds are significantly tighter. Because a PAC-Bayesian bound is derived from a particular prior distribution over hypotheses, a PAC-Bayesian margin bound also seems to provide insight into the nature of the learning bias underlying the bound.

David McAllester
##### Sparse Kernel Partial Least Squares Regression

Partial Least Squares Regression (PLS) and its kernel version (KPLS) have become competitive regression approaches. KPLS performs as well as or better than support vector regression (SVR) for moderately-sized problems with the advantages of simple implementation, less training cost, and easier tuning of parameters. Unlike SVR, KPLS requires manipulation of the full kernel matrix and the resulting regression function requires the full training data. In this paper we rigorously derive a sparse KPLS algorithm. The underlying KPLS algorithm is modified to maintain sparsity in all steps of the algorithm. The resulting ν-KPLS algorithm explicitly models centering and bias rather than using kernel centering. An ε-insensitive loss function is used to produce sparse solutions in the dual space. The final regression function for the ν-KPLS algorithm only requires a relatively small set of support vectors.

Michinari Momma, Kristin P. Bennett
##### Sparse Probability Regression by Label Partitioning

A large-margin learning machine for sparse probability regression is presented. Unlike support vector machines and other forms of kernel machines, nonlinear features are obtained by transforming labels into higher-dimensional label space rather than transforming data vectors into feature space. Linear multi-class logistic regression with partitioned classes of labels yields a nonlinear classifier in the original labels. With a linear kernel in data space, storage and run-time requirements are reduced from the number of support vectors to the number of partitioned labels. Using the partitioning property of KL-divergence in label space, an iterative alignment procedure produces sparse training coefficients. Experiments show that label partitioning is effective in modeling non-linear decision boundaries with the same, and in some cases superior, generalization performance compared to Support Vector Machines, with significantly reduced memory and run-time requirements.

##### Learning with Rigorous Support Vector Machines

We examine the so-called rigorous support vector machine (RSVM) approach proposed by Vapnik (1998). The formulation of RSVM is derived by explicitly implementing the structural risk minimization principle with a parameter H used to directly control the VC dimension of the set of separating hyperplanes. By optimizing the dual problem, RSVM finds the optimal separating hyperplane from a set of functions with VC dimension close to $H^2+1$. RSVM produces classifiers equivalent to those obtained by classic SVMs for appropriate parameter choices, but the use of the parameter H facilitates model selection, thus minimizing VC bounds on the generalization risk more effectively. In our empirical studies, good models are achieved for an appropriate $H^2 \in [5\%\,\ell, 30\%\,\ell]$, where ℓ is the size of the training data.

##### Robust Regression by Boosting the Median

Most boosting regression algorithms use the weighted average of base regressors as their final regressor. In this paper we analyze the choice of the weighted median. We propose a general boosting algorithm based on this approach. We prove boosting-type convergence of the algorithm and give clear conditions for the convergence of the robust training error. The algorithm recovers $\textsc{AdaBoost}$ and $\textsc{AdaBoost}_\varrho$ as special cases. For boosting confidence-rated predictions, it leads to a new approach that outputs a different decision and interprets robustness in a different manner than the approach based on the weighted average. In the general, non-binary case we suggest practical strategies based on the analysis of the algorithm and experiments.

Balázs Kégl
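
The aggregation step itself, the weighted median of base-regressor outputs, is simple to state: the smallest output value at which the cumulative weight reaches half the total. A minimal sketch:

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median of base-regressor outputs: the smallest value v
    such that the total weight of outputs <= v reaches half the overall
    weight. Unlike the weighted average, a single wild base prediction
    cannot drag the aggregate arbitrarily far, which is the robustness
    the paper analyzes.
    """
    order = np.argsort(values)
    v = np.asarray(values, float)[order]
    w = np.asarray(weights, float)[order]
    cum = np.cumsum(w)
    return float(v[np.searchsorted(cum, 0.5 * cum[-1])])
```

For example, the weighted median of outputs (1, 2, 100) with equal weights is 2, while the weighted average would be pulled to about 34 by the outlier.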
##### Boosting with Diverse Base Classifiers

We establish a new bound on the generalization error rate of the Boost-by-Majority algorithm. The bound holds when the algorithm is applied to a collection of base classifiers that contains a “diverse” subset of “good” classifiers, in a precisely defined sense. We describe cross-validation experiments that suggest that Boost-by-Majority can be the basis of a practically useful learning method, often improving on the generalization of AdaBoost on large datasets.

Sanjoy Dasgupta, Philip M. Long
##### Reducing Kernel Matrix Diagonal Dominance Using Semi-definite Programming

Kernel-based learning methods revolve around the notion of a kernel or Gram matrix between data points. These square, symmetric, positive semi-definite matrices can informally be regarded as encoding pairwise similarity between all of the objects in a data-set. In this paper we propose an algorithm for manipulating the diagonal entries of a kernel matrix using semi-definite programming. Kernel matrix diagonal dominance reduction attempts to deal with the problem of learning with almost orthogonal features, a phenomenon commonplace in kernel matrices derived from string kernels or Gaussian kernels with small width parameter. We show how this task can be formulated as a semi-definite programming optimization problem that can be solved with readily available optimizers. On the theoretical side, we give an analysis using Rademacher-based bounds that provides an alternative motivation for the 1-norm SVM from the perspective of kernel diagonal reduction. We assess the performance of the algorithm on standard data sets, with encouraging results in terms of approximation and prediction.

Jaz Kandola, Thore Graepel, John Shawe-Taylor

#### Poster Session 2

##### Using a Linear Fit to Determine Monotonicity Directions

Let f be a function on $\mathbb{R}^d$ that is monotonic in every variable. There are $2^d$ possible assignments to the directions of monotonicity (two per variable). We provide sufficient conditions under which the optimal linear model obtained from a least squares regression on f will identify the monotonicity directions correctly. We show that when the input dimensions are independent, the linear fit correctly identifies the monotonicity directions. We provide an example to illustrate that in the general case, when the input dimensions are not independent, the linear fit may not identify the directions correctly. However, when the inputs are jointly Gaussian, as is often assumed in practice, the linear fit will correctly identify the monotonicity directions, even if the input dimensions are dependent. Gaussian densities are a special case of a more general class of densities (Mahalanobis densities) for which the result holds. Our results hold when f is a classification or regression function. If a finite data set is sampled from the function, we show that if the exact linear regression would have yielded the correct monotonicity directions, then the sample regression will also do so asymptotically (in a probabilistic sense). This result holds even if the data are noisy.

Malik Magdon-Ismail, Joseph Sill
##### Generalization Bounds for Voting Classifiers Based on Sparsity and Clustering

We prove new margin type bounds on the generalization error of voting classifiers that take into account the sparsity of weights and certain measures of clustering of weak classifiers in the convex combination. We also present experimental results to illustrate the behavior of the parameters of interest for several data sets.

Vladimir Koltchinskii, Dmitry Panchenko, Savina Andonova
##### Sequence Prediction Based on Monotone Complexity

This paper studies sequence prediction based on the monotone Kolmogorov complexity $Km = -\log m$, i.e. based on universal deterministic/one-part MDL. m is extremely close to Solomonoff’s prior M, the latter being an excellent predictor in deterministic as well as probabilistic environments, where performance is measured in terms of convergence of posteriors or losses. Despite this closeness to M, it is difficult to assess the prediction quality of m, since little is known about the closeness of their posteriors, which are the important quantities for prediction. We show that for deterministic computable environments, the “posterior” and losses of m converge, but rapid convergence could only be shown on-sequence; the off-sequence behavior is unclear. In probabilistic environments, neither the posterior nor the losses converge, in general.

Marcus Hutter
##### How Many Strings Are Easy to Predict?

It is well known in the theory of Kolmogorov complexity that most strings cannot be compressed; more precisely, only exponentially few ($\Theta(2^{n-m})$) strings of length n can be compressed by m bits. This paper extends the ‘incompressibility’ property of Kolmogorov complexity to the ‘unpredictability’ property of predictive complexity. The ‘unpredictability’ property states that predictive complexity (defined as the loss suffered by a universal prediction algorithm working infinitely long) of most strings is close to a trivial upper bound (the loss suffered by a trivial minimax constant prediction strategy). We show that only exponentially few strings can be successfully predicted and find the base of the exponent.

Yuri Kalnishkan, Volodya Vovk, Michael V. Vyugin
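The counting argument behind the incompressibility side of this statement is elementary and can be made concrete. The sketch below is a toy illustration of that counting bound only, not of the paper's predictive-complexity machinery:

```python
def max_compressible(n, m):
    # A string of length n is "compressed by m bits" if it has a
    # description of length at most n - m, i.e. strictly shorter
    # than n - m + 1. There are 2^i binary descriptions of each
    # length i, so at most 2^0 + 2^1 + ... + 2^(n-m) - ish strings
    # can be so described; summing lengths 0 .. n-m-1 already gives
    # the standard 2^(n-m) - 1 upper bound used here.
    return 2 ** (n - m) - 1

def fraction_compressible(n, m):
    # Fraction of the 2^n strings of length n that could possibly
    # be compressed by at least m bits.
    return max_compressible(n, m) / 2 ** n

# Roughly a 2**-10 fraction of all length-100 strings can be
# compressed by 10 or more bits.
print(fraction_compressible(100, 10))
```

So "most strings cannot be compressed" is simply pigeonhole counting: there are too few short descriptions to go around.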
##### Polynomial Certificates for Propositional Classes

This paper studies the query complexity of learning classes of expressions in propositional logic from equivalence and membership queries. We give new constructions of polynomial size certificates of non-membership for monotone, unate and Horn CNF functions. Our constructions yield quantitatively different bounds from previous constructions of certificates for these classes. We prove lower bounds on certificate size which show that for some parameter settings the certificates we construct for these classes are exactly optimal. Finally, we also prove that a natural generalization of these classes, the class of renamable Horn CNF functions, does not have polynomial size certificates of non-membership, thus answering an open question of Feigelson.

Marta Arias, Roni Khardon, Rocco A. Servedio
##### On-Line Learning with Imperfect Monitoring

We study on-line play of repeated matrix games in which the observations of past actions of the other player and the obtained reward are partial and stochastic. We define the Partial Observation Bayes Envelope (POBE) as the best reward against the worst-case stationary strategy of the opponent that agrees with past observations. Our goal is to have the (unobserved) average reward above the POBE. For the case where the observations (but not necessarily the rewards) depend on the opponent play alone, an algorithm for attaining the POBE is derived. This algorithm is based on an application of approachability theory combined with a worst-case view over the unobserved rewards. We also suggest a simplified solution concept for general signaling structure. This concept may fall short of the POBE.

Shie Mannor, Nahum Shimkin

##### Exploiting Task Relatedness for Multiple Task Learning
The approach of learning multiple “related” tasks simultaneously has proven quite successful in practice; however, theoretical justification for this success has remained elusive. The starting point for previous work on multiple task learning has been that the tasks to be learned jointly are somehow “algorithmically related”, in the sense that the results of applying a specific learning algorithm to these tasks are assumed to be similar. We offer an alternative approach, defining relatedness of tasks on the basis of similarity between the example-generating distributions that underlie these tasks. We provide a formal framework for this notion of task relatedness, which captures a sub-domain of the wide scope of issues in which one may apply a multiple task learning approach. Our notion of task similarity is relevant to a variety of real-life multitask learning scenarios and allows the formal derivation of generalization bounds that are strictly stronger than the previously known bounds for both the learning-to-learn and the multitask learning scenarios. We give precise conditions under which our bounds guarantee generalization on the basis of smaller sample sizes than the standard single-task approach.

Shai Ben-David, Reba Schuller
##### Approximate Equivalence of Markov Decision Processes

We consider the problem of finding the minimal ε-equivalent MDP for an MDP given in its tabular form. We show that the problem is NP-hard and then give a bicriteria approximation algorithm for it. We suggest that the right measure for finding a minimal ε-equivalent model is $L_1$ rather than $L_\infty$, by giving both an example that demonstrates the drawback of using $L_\infty$ and performance guarantees for using $L_1$. In addition, we give a polynomial algorithm that decides whether two MDPs are equivalent.

Eyal Even-Dar, Yishay Mansour
##### An Information Theoretic Tradeoff between Complexity and Accuracy

A fundamental question in learning theory is the quantification of the basic tradeoff between the complexity of a model and its predictive accuracy. One valid way of quantifying this tradeoff, known as the “Information Bottleneck”, is to measure both the complexity of the model and its prediction accuracy by using Shannon’s mutual information. In this paper we show that the Information Bottleneck framework answers a well defined and known coding problem and at the same time provides a general relationship between complexity and prediction accuracy, measured by mutual information. We study the nature of this complexity-accuracy tradeoff and discuss some of its theoretical properties. Furthermore, we present relations to classical information theoretic problems, such as rate-distortion theory, the cost-capacity tradeoff and source coding with side information.

Ran Gilad-Bachrach, Amir Navot, Naftali Tishby
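For reference, the complexity-accuracy tradeoff discussed above is usually written as a variational problem over stochastic encodings of X into a bottleneck variable T. This is the standard formulation from the Information Bottleneck literature, not spelled out in the abstract itself:

```latex
% Information Bottleneck: compress X into T while preserving
% information about the relevant variable Y; the Lagrange
% multiplier \beta trades complexity I(X;T) against accuracy I(T;Y).
\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y)
```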
##### Learning Random Log-Depth Decision Trees under the Uniform Distribution

We consider three natural models of random log-depth decision trees. We give an efficient algorithm that for each of these models learns—as a decision tree—all but an inverse polynomial fraction of such trees using only uniformly distributed random examples.

Jeffrey C. Jackson, Rocco A. Servedio
##### Projective DNF Formulae and Their Revision

Valiant argued that biology imposes various constraints on learnability, and, motivated by these constraints, introduced his model of projection learning [14]. Projection learning aims to learn a target concept over some large domain, in this paper {0,1}n, by learning some of its projections to a class of smaller domains, and combining these projections. Valiant proved a general mistake bound for the resulting algorithm under certain conditions. The basic assumption underlying projection learning is that there is a family of simple projections that cover all positive instances of the target, where simple means belonging to some efficiently learnable class. The projections describing the target in this way can also be thought of as a set of experts, each specialized to classify a subset of the instances, such that whenever two experts overlap they always agree in their classification.

Robert H. Sloan, Balázs Szörényi, György Turán
##### Learning with Equivalence Constraints and the Relation to Multiclass Learning

We study the problem of learning partitions using equivalence constraints as input. This is a binary classification problem in the product space of pairs of datapoints. The training data includes pairs of datapoints which are labeled as coming from the same class or not. This kind of data appears naturally in applications where explicit labeling of datapoints is hard to get, but relations between datapoints can be more easily obtained, using, for example, Markovian dependency (as in video clips). Our problem is an unlabeled partition problem, and is therefore tightly related to multiclass classification. We show that the solutions of the two problems are related, in the sense that a good solution to the binary classification problem entails the existence of a good solution to the multiclass problem, and vice versa. We also show that bounds on the sample complexity of the two problems are similar, by showing that their relevant ‘dimensions’ (VC dimension for the binary problem, Natarajan dimension for the multiclass problem) bound each other. Finally, we show the feasibility of solving multiclass learning efficiently by using a solution of the equivalent binary classification problem. In this way advanced techniques developed for binary classification, such as SVM and boosting, can be used directly to enhance multiclass learning.

Aharon Bar-Hillel, Daphna Weinshall

#### Invited Talks

##### Learning from Uncertain Data

The application of statistical methods to natural language processing has been remarkably successful over the past two decades. But, to deal with recent problems arising in this field, machine learning techniques must be generalized to deal with uncertain data, or datasets whose elements are distributions over sequences, such as weighted automata. This paper reviews a number of recent results related to this question. We discuss how to efficiently compute basic statistics from a weighted automaton, such as the expected count of an arbitrary sequence and higher moments of that distribution, by using weighted transducers. Both the corresponding transducers and related algorithms are described. We show how general classification techniques such as Support Vector Machines can be extended to deal with distributions by using general kernels between weighted automata. We describe several examples of positive definite kernels between weighted automata, such as kernels based on counts of common n-gram sequences, counts of common factors or suffixes, or other more complex kernels, and describe a general algorithm for computing them efficiently. We also demonstrate how machine learning techniques such as clustering based on the edit-distance can be extended to deal with unweighted and weighted automata representing distributions.

Mehryar Mohri
##### Learning and Parsing Stochastic Unification-Based Grammars

Stochastic Unification-Based Grammars combine knowledge-rich and data-rich approaches to natural language processing. This provides a rich structure to the learning and parsing (decoding) tasks that can be described with undirected graphical models. While most work to date has treated parsing as a straightforward multi-class classification problem, we are beginning to see how this structure can be exploited in learning and parsing. Exploiting this structure is likely to become more important as the research focus moves from parsing to more realistic tasks such as machine translation and summarization.

Mark Johnson

#### Inductive Inference Learning

##### Generality’s Price
Inescapable Deficiencies in Machine-Learned Programs

This paper investigates some delicate tradeoffs between the generality of an algorithmic learning device and the quality of the programs it learns successfully. There are results to the effect that, thanks to small increases in generality of a learning device, the computational complexity of some successfully learned programs is provably unalterably suboptimal. There are also results in which the complexity of successfully learned programs is asymptotically optimal and the learning device is general, but, still thanks to the generality, some of those optimal, learned programs are provably unalterably information deficient — in some cases, deficient as to safe, algorithmic extractability/provability of the fact that they are even approximately optimal. The paper is on the borderline between learning theory, complexity theory and logic.

John Case, Keh-Jiann Chen, Sanjay Jain, Wolfgang Merkle, James S. Royer
##### On Learning to Coordinate
Random Bits Help, Insightful Normal Forms, and Competency Isomorphisms

A mere bounded number of random bits, judiciously employed by a probabilistically correct algorithmic coordinator, is shown to increase the power of learning to coordinate compared to deterministic algorithmic coordinators. Furthermore, these probabilistic algorithmic coordinators are provably not characterized in power by teams of deterministic ones. An insightful normal form characterization, based on an enumeration technique, of the classes that are learnable by total computable coordinators is given. These normal forms are for insight only, since it is shown that the complexity of the normal form of a total computable coordinator can be infeasible compared to the original coordinator. Montagna and Osherson showed that the competence class of a total coordinator cannot be strictly improved by another total coordinator. It is shown in the present paper that the competencies of any two total coordinators are the same modulo isomorphism. Furthermore, a completely effective, index set version of this competency isomorphism result is given, where all the coordinators are total computable.

John Case, Sanjay Jain, Franco Montagna, Giulia Simi, Andrea Sorbi
##### Learning All Subfunctions of a Function

Sublearning, a model for learning of subconcepts of a concept, is presented. Sublearning a class of total recursive functions informally means to learn all functions from that class together with all of their subfunctions. While in language learning it is known to be impossible to learn any infinite language together with all of its sublanguages, the situation changes for sublearning of functions. Several types of sublearning are defined and compared to each other as well as to other learning types. For example, in some cases, sublearning coincides with robust learning. Furthermore, whereas in usual function learning there are classes that cannot be learned consistently, all sublearnable classes of some natural types can be learned consistently. Moreover, the power of sublearning is characterized in several terms, thereby establishing a close connection to measurable classes and variants of this notion. As a consequence, there are rich classes which do not need any self-referential coding for sublearning them.

Sanjay Jain, Efim Kinber, Rolf Wiehagen

#### Open Problems

##### When Is Small Beautiful?

The basic bound on the generalisation error of a PAC learner makes the assumption that a consistent hypothesis exists. This makes it appropriate to apply the method only in the case where we have a guarantee that a consistent hypothesis can be found, something that is rarely possible in real applications. The same problem arises if we decide not to use a hypothesis unless its error is below a prespecified number.

##### Learning a Function of r Relevant Variables

This problem has been around for a while but is one of my favorites. I will state it here in three forms, discuss a number of known results (some easy and some more intricate), and finally end with small financial incentives for various kinds of partial progress. This problem appears in various guises in [2, 3, 10]. To begin we need the following standard definition: a boolean function f over $\{0,1\}^n$ has (at most) r relevant variables if there exist r indices $i_1, \ldots, i_r$ such that $f(x) = g(x_{i_1}, \ldots, x_{i_r})$ for some boolean function g over $\{0,1\}^r$. In other words, the value of f is determined by only a subset of r of its n input variables. For instance, the function $f(x) = x_1\bar{x}_2 \vee x_2\bar{x}_5 \vee x_5\bar{x}_1$ has three relevant variables. The “class of boolean functions with r relevant variables” is the set of all such functions, over all possible g and sets $\{i_1, \ldots, i_r\}$.

Avrim Blum
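The example function above can be checked mechanically: a variable is relevant iff flipping it changes the function's value on some input. A brute-force sketch (0-indexed variable positions here, versus 1-indexed in the text; feasible only for small n, which is exactly why the learning problem is hard for large n):

```python
from itertools import product

def f(x):
    # The example from the problem statement, on n = 5 variables:
    # f(x) = x1 * ~x2  v  x2 * ~x5  v  x5 * ~x1
    x1, x2, x5 = x[0], x[1], x[4]
    return (x1 and not x2) or (x2 and not x5) or (x5 and not x1)

def relevant_variables(f, n):
    # A variable i is relevant iff f disagrees on some pair of
    # inputs differing only in coordinate i.
    rel = set()
    for x in product([0, 1], repeat=n):
        for i in range(n):
            y = list(x)
            y[i] ^= 1
            if f(x) != f(tuple(y)):
                rel.add(i)
    return sorted(rel)

print(relevant_variables(f, 5))  # → [0, 1, 4], i.e. x1, x2, x5
```

This exhaustive check takes time roughly n·2^n; the open problem asks for learning with far fewer examples and far less time when only r of the n variables matter.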
##### Subspace Detection: A Robust Statistics Formulation

If data in $\mathbb{R}^d$ actually lie in a linear subspace, then principal component analysis (PCA) will find this subspace. If the data are corrupted by benign (e.g., independent Gaussian) noise, then approximation bounds can quite easily be shown for the solution returned by PCA. What if the noise is malicious?

Sanjoy Dasgupta
##### How Fast Is k-Means?

The k-means algorithm is probably the most widely used clustering heuristic, and has the reputation of being fast. How fast is it exactly? Almost no non-trivial time bounds are known for it.

Sanjoy Dasgupta
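To make the question concrete, "the k-means algorithm" here is the Lloyd iteration: alternately assign points to the nearest center and recompute each center as its cluster's mean, until nothing moves. A minimal sketch, instrumented to count the iterations whose worst-case number is the open question (2-D points, random initialization; all names are illustrative):

```python
import random

def kmeans(points, k, seed=0):
    # Lloyd's heuristic on a list of 2-D tuples; returns the final
    # centers and the number of iterations until convergence.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    steps = 0
    while True:
        steps += 1
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each center to its cluster mean
        # (an empty cluster keeps its old center).
        new = [(sum(p[0] for p in c) / len(c),
                sum(p[1] for p in c) / len(c)) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:
            return centers, steps
        centers = new
```

Each iteration costs O(nk) distance computations; the open question is how many iterations the loop can take in the worst case.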
##### Universal Coding of Zipf Distributions

Background. One of the best known results in information theory says that a data sequence $x_1, x_2, \ldots, x_n$ produced by independent random draws from a fixed distribution P over a discrete domain can be compressed into a binary sequence, or code, whose expected length is at most $nH(P)+1$ bits, where $H(P) = -\sum_i P_i \log P_i$ is the entropy of P. It is also known that this compression is near optimal, as $nH(P)$ is the smallest achievable expected number of code bits.

Yoav Freund, Alon Orlitsky, Prasad Santhanam, Junan Zhang
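The entropy bound above is easy to evaluate numerically for a concrete distribution. A small sketch for a truncated Zipf distribution (the truncation to n ranks and the exponent s are illustrative choices; the abstract itself does not fix a parameterization):

```python
import math

def zipf(n, s=1.0):
    # Truncated Zipf distribution over ranks 1..n: P_i ∝ i^(-s).
    w = [i ** -s for i in range(1, n + 1)]
    z = sum(w)
    return [x / z for x in w]

def entropy_bits(p):
    # H(P) = -sum_i P_i log2 P_i: the per-symbol lower bound on
    # expected code length for i.i.d. draws from P.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = zipf(1000)
# A sequence of n i.i.d. draws from p can be coded in about
# n * entropy_bits(p) + 1 bits on average.
print(entropy_bits(p))
```

The universal-coding question is what happens when P is known only to be Zipf-like, rather than known exactly.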
##### An Open Problem Regarding the Convergence of Universal A Priori Probability

Is the textbook result that Solomonoff’s universal posterior converges to the true posterior for all Martin-Löf random sequences true?

Marcus Hutter
##### Entropy Bounds for Restricted Convex Hulls

An unsolved problem of bounding the entropy of a “restricted” convex hull of a set in a Hilbert space is discussed. The problem is related to bounding the generalization error of convex combinations of base classifiers.

##### Compressing to VC Dimension Many Points

Any set of labeled examples consistent with some hidden orthogonal rectangle can be “compressed” to at most four examples: a topmost, a leftmost, a rightmost and a bottommost positive example. These four examples represent an orthogonal rectangle (the smallest such rectangle that contains them) that is consistent with all examples.

Manfred K. Warmuth
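The four-example compression scheme described above is easy to sketch for axis-aligned rectangles in the plane (assuming at least one positive example; the function names are illustrative, not from the text):

```python
def compress(examples):
    # examples: list of ((x, y), label) pairs consistent with some
    # hidden axis-aligned rectangle. Keep the four extreme positives.
    pos = [p for p, lab in examples if lab]
    left   = min(pos, key=lambda p: p[0])
    right  = max(pos, key=lambda p: p[0])
    bottom = min(pos, key=lambda p: p[1])
    top    = max(pos, key=lambda p: p[1])
    return [left, right, bottom, top]

def decompress(points):
    # The smallest axis-aligned rectangle containing the kept
    # points, returned as a membership predicate.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    lo, hi = (min(xs), min(ys)), (max(xs), max(ys))
    return lambda p: lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]
```

The decompressed rectangle contains every positive example by construction, and excludes every negative example because any negative inside it would contradict consistency with the hidden rectangle. Four is also the VC dimension of axis-aligned rectangles, which is the connection the title points at.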
##### Tutorial: Machine Learning Methods in Natural Language Processing

Statistical or machine learning approaches have become quite prominent in the Natural Language Processing literature. Common techniques include generative models such as Hidden Markov Models or Probabilistic Context-Free Grammars, and more general noisy-channel models such as the statistical approach to machine translation pioneered by researchers at IBM in the early 90s. Recent work has considered discriminative methods such as (conditional) Markov random fields, or large-margin methods. This tutorial will describe several of these techniques. The methods will be motivated through a number of natural language problems: from part-of-speech tagging and parsing, to machine translation, dialogue systems and information extraction problems. I will also concentrate on links to the COLT and kernel methods literature: for example covering kernels over the discrete structures found in NLP, online algorithms for NLP problems, and the issues in extending generalization bounds from classification problems to NLP problems such as parsing.

Michael Collins