
About this Book

This book brings together historical notes, reviews of research developments, fresh ideas on how to make VC (Vapnik–Chervonenkis) guarantees tighter, and new technical contributions in the areas of machine learning, statistical inference, classification, algorithmic statistics, and pattern recognition.

The contributors are leading scientists in domains such as statistics, mathematics, and theoretical computer science, and the book will be of interest to researchers and graduate students in these domains.

Table of Contents

Frontmatter

History of VC Theory

Frontmatter

Chapter 1. Chervonenkis’s Recollections

Abstract
These recollections about the origins of VC theory were written by Alexey Chervonenkis in 2004 for several colleagues and not intended for publication. They are now published for the first time. (Eds.) Translated by Vladimir Vovk.
Alexey Chervonenkis

Chapter 2. A Paper that Created Three New Fields: Teoriya Veroyatnosteĭ i Ee Primeneniya 16(2), 1971, pp. 264–279

Abstract
This is an introduction, with remarks, to the paper by Vapnik and Chervonenkis in which they gave proofs for the strikingly innovative statements they had announced in Doklady Akademii Nauk SSSR three years earlier.
R. M. Dudley

Chapter 3. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities

Abstract
This chapter reproduces the English translation by B. Seckler of the paper by Vapnik and Chervonenkis in which they gave proofs for the innovative results they had obtained in draft form in July 1966 and announced in 1968 in their note in Soviet Mathematics Doklady. The paper was first published in Russian as Вапник В. Н. and Червоненкис А. Я. О равномерной сходимости частот появления событий к их вероятностям. Теория вероятностей и ее применения 16(2), 264–279 (1971).
V. N. Vapnik, A. Ya. Chervonenkis

Chapter 4. Sketched History: VC Combinatorics, 1826 up to 1975

Abstract
The earliest published instance found of combinatorial quantities later occurring in the work of Vapnik and Chervonenkis is in an 1826 paper of Jakob Steiner. The next, still more pertinent, occurrence is in work of Ludwig Schläfli done around 1850 but not published until 1901, after his death. The nineteenth-century work concerned subsets of Euclidean spaces cut by intersections of finitely many half-spaces. There is then another long gap until a 1965 paper of T. M. Cover, who cited Schläfli, preceding by a few years the landmark announcement by Vapnik and Chervonenkis in 1968 and their longer paper of 1971. Further history is given about Steiner, Schläfli, and some of their contemporary mathematicians, and about the initial reception of the Vapnik–Chervonenkis work.
R. M. Dudley

Chapter 5. Institute of Control Sciences Through the Lens of VC Dimension

Abstract
This chapter describes the history of the Institute of Control Sciences: the arrival of V.N. Vapnik and A.Ya. Chervonenkis at the Institute, the main directions of their work set against the background of everyday life at the Institute, and the beginning of the Institute’s studies in the field of pattern recognition. It also provides a brief summary of VC theory in machine learning. The story is accompanied by the author’s personal observations.
Vasily N. Novoseltsev

Reviews of Measures of Complexity

Frontmatter

Chapter 6. VC Dimension, Fat-Shattering Dimension, Rademacher Averages, and Their Applications

Abstract
We consider several complexity measures from statistical learning theory that capture the difficulty of learning under the i.i.d. assumption. Among these measures are the growth function and VC dimension, covering numbers and the fat-shattering dimension, and Rademacher complexity. We present relationships among these complexity measures, their connections to learning, and tools for bounding them. For each complexity measure, a uniform upper bound on the generalization error in classification problems is given.
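The growth function and shattering mentioned in this abstract can be checked by brute force on small examples. The sketch below is a hypothetical toy setup (not code from the chapter), using the class of one-dimensional threshold classifiers x ≥ t: a point set is shattered if every one of its 2^n labelings is realizable, and thresholds famously shatter singletons but no pair.

```python
from itertools import product  # imported for completeness; labelings come from the class

def realizable_labelings(points, thresholds):
    """All distinct labelings of `points` produced by classifiers x >= t."""
    return {tuple(int(x >= t) for x in points) for t in thresholds}

def is_shattered(points, thresholds):
    """A set is shattered if all 2^n labelings are realizable on it."""
    return len(realizable_labelings(points, thresholds)) == 2 ** len(points)

pts = [0.5, 1.5, 2.5]
# candidate thresholds: below, between, and above all points
ths = [0.0, 1.0, 2.0, 3.0]

print(is_shattered([1.5], ths))             # True: singletons are shattered
print(is_shattered([0.5, 1.5], ths))        # False: labeling (1, 0) is unrealizable
print(len(realizable_labelings(pts, ths)))  # growth function value n + 1 = 4
```

Since no pair is shattered, this class has VC dimension 1, matching the linear (rather than exponential) growth function seen in the last line.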
Vladimir V. V’yugin

Chapter 7. Around Kolmogorov Complexity: Basic Notions and Results

Abstract
Algorithmic information theory studies description complexity and randomness and is now a well-known field of theoretical computer science and mathematical logic. There are several textbooks and monographs devoted to this theory (Calude, Information and Randomness: An Algorithmic Perspective, 2002; Downey and Hirschfeldt, Algorithmic Randomness and Complexity, 2010; Li and Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, 2008; Nies, Computability and Randomness, 2009; Vereshchagin et al., Kolmogorov Complexity and Algorithmic Randomness, in Russian, 2013) where one can find a detailed exposition of many difficult results as well as historical references. However, it seems that a short survey of its basic notions and main results relating these notions to each other is missing. This chapter attempts to fill this gap and covers the basic notions of algorithmic information theory: Kolmogorov complexity (plain, conditional, prefix), Solomonoff universal a priori probability, notions of randomness (Martin-Löf randomness, Mises–Church randomness), and effective Hausdorff dimension. We prove their basic properties (symmetry of information, connection between a priori probability and prefix complexity, criterion of randomness in terms of complexity, complexity characterization for effective dimension) and show some applications (incompressibility method in computational complexity theory, incompleteness theorems). The chapter is based on the lecture notes of a course at Uppsala University given by the author (Shen, Algorithmic information theory and Kolmogorov complexity. Technical Report, 2000).
Alexander Shen

Chapter 8. Predictive Complexity for Games with Finite Outcome Spaces

Abstract
Predictive complexity is a generalization of Kolmogorov complexity motivated by an on-line prediction scenario. It quantifies the “unpredictability” of a sequence in a particular prediction environment. This chapter surveys key results on predictive complexity for games with finitely many outcomes. The issues of existence, non-existence, uniqueness, and linear inequalities are covered.
Yuri Kalnishkan

Making VC Bounds More Accurate

Frontmatter

Chapter 9. Making Vapnik–Chervonenkis Bounds Accurate

Abstract
This chapter shows how returning to the combinatorial nature of the Vapnik–Chervonenkis bounds provides simple ways to increase their accuracy, take into account properties of the data and of the learning algorithm, and provide empirically accurate estimates of the deviation between training error and test error.
Léon Bottou

Chapter 10. Comment: Transductive PAC-Bayes Bounds Seen as a Generalization of Vapnik–Chervonenkis Bounds

Abstract
We would like here to complement the analysis of Vapnik–Chervonenkis bounds made by L. Bottou in Chap. 9 and by K.V. Vorontsov (Comput Math Math Phys 44(11):1997–2009, 2004; Pattern Recognit Image Anal 18(2):243–259, 2008; Pattern Recognit Image Anal 20(3):269–285, 2010), pointing out the connections with transductive PAC-Bayes bounds (Catoni, PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, 2007; McAllester, Proceedings of the 11th Annual Conference on Computational Learning Theory, 1998; McAllester, Proceedings of the 12th Annual Conference on Computational Learning Theory, 1999) as another way to reach the same kind of conclusions.
Olivier Catoni

Chapter 11. Comment: The Two Styles of VC Bounds

Abstract
First of all, I would like to congratulate Léon Bottou on an excellent paper containing a lucid discussion of the sources of looseness in the Vapnik–Chervonenkis bounds on the generalization error derived in Vapnik and Chervonenkis (Proc USSR Acad Sci 181(4):781–783, 1968) and Vapnik and Chervonenkis (Theory Probab Appl 16(2):264–280, 1971). I will comment only on the paper in this volume (Chap. 9), although most of the comments will also apply to the other papers co-authored by Léon and by Konstantin Vorontsov that are mentioned in Léon’s bibliography.
Vladimir Vovk

Chapter 12. Rejoinder: Making VC Bounds Accurate

Abstract
I am very grateful to my colleagues Olivier Catoni and Vladimir Vovk because their insightful comments add considerable value to my article.
Léon Bottou

Advances in VC Theory

Frontmatter

Chapter 13. Measures of Complexity in the Theory of Machine Learning

Abstract
This text, prepared by the editors, closely follows the abstract (Chervonenkis, Measures of complexity, 2013) and slides of Alexey’s talk at the symposium “Measures of Complexity” (given on Wednesday, October 2, 2013, 2:30–3:20 p.m., in Paphos, Cyprus). (Eds.)
Alexey Chervonenkis

Chapter 14. Classes of Functions Related to VC Properties

Abstract
The notion of Vapnik–Chervonenkis (VC) class of sets can be extended to classes of functions in a few ways. Under further hypotheses, central limit theorems for empirical measures can be proved uniformly over such classes. Specific such classes on Euclidean spaces can be used to show the existence of location vector and scatter matrix functionals, replacing mean vectors and covariance matrices, but on classes of probability measures P that are weakly dense, weakly open, and so contain arbitrarily heavy-tailed distributions.
R. M. Dudley

Chapter 15. On Martingale Extensions of Vapnik–Chervonenkis Theory with Applications to Online Learning

Abstract
We review recent advances on uniform martingale laws of large numbers and the associated sequential complexity measures. These results may be considered as forming a non-i.i.d. generalization of Vapnik–Chervonenkis theory. We discuss applications to online learning, provide a recipe for designing online learning algorithms, and illustrate the techniques on the problem of online node classification. We outline connections to statistical learning theory and discuss inductive principles of stochastic approximation and empirical risk minimization.
Alexander Rakhlin, Karthik Sridharan

Chapter 16. Measuring the Capacity of Sets of Functions in the Analysis of ERM

Abstract
Empirical risk minimization (ERM) is a fundamental learning principle that serves as the underlying idea for various learning algorithms. Moreover, ERM appears in many hyper-parameter selection strategies. Not surprisingly, the statistical analysis of ERM has attracted a lot of attention during the last four decades. In particular, it is well known that as soon as ERM uses an infinite set of hypotheses, the problem of measuring the size, or capacity, of this set is central to the statistical analysis. We provide a brief, incomplete, and subjective survey of different techniques for this problem, and illustrate how the concentration inequalities used in the analysis of ERM determine suitable capacity measures.
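The ERM principle itself admits a one-screen illustration. This is a toy sketch with a hypothetical finite class of threshold classifiers x ≥ t (none of it taken from the chapter): ERM simply returns a hypothesis minimizing the empirical (training) error.

```python
def empirical_risk(h, sample):
    """Fraction of sample points that hypothesis h misclassifies."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def erm(hypotheses, sample):
    """Return a hypothesis with minimal empirical risk (ties broken arbitrarily)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))

# hypothetical finite class of threshold classifiers x >= t
thresholds = [0.0, 1.0, 2.0, 3.0]
H = [lambda x, t=t: int(x >= t) for t in thresholds]
data = [(0.5, 0), (1.5, 1), (2.5, 1)]

best = erm(H, data)
print(empirical_risk(best, data))  # 0.0: the threshold t = 1.0 fits the sample
```

With an infinite hypothesis class, `min` over all hypotheses is no longer a finite search, and bounding the gap between empirical and true risk is exactly where the capacity measures surveyed in the chapter enter.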
Ingo Steinwart

Chapter 17. Algorithmic Statistics Revisited

Abstract
The mission of statistics is to provide adequate statistical hypotheses (models) for observed data. But what is an “adequate” model? To answer this question, one needs to use the notions of algorithmic information theory. It turns out that for every data string x one can naturally define a “stochasticity profile,” a curve that represents a trade-off between the complexity of a model and its adequacy. This curve has four different equivalent definitions in terms of (1) randomness deficiency, (2) minimal description length, (3) position in the lists of simple strings, and (4) Kolmogorov complexity with decompression time bounded by the busy beaver function. We present a survey of the corresponding definitions and results relating them to each other.
Nikolay Vereshchagin, Alexander Shen

Chapter 18. Justifying Information-Geometric Causal Inference

Abstract
Information-Geometric Causal Inference (IGCI) is a new approach to distinguish between cause and effect for two variables. It is based on an independence assumption between input distribution and causal mechanism that can be phrased in terms of orthogonality in information space. We describe two intuitive reinterpretations of this approach that make IGCI more accessible to a broader audience. Moreover, we show that the described independence is related to the hypothesis that unsupervised learning and semi-supervised learning only work for predicting the cause from the effect and not vice versa.
Dominik Janzing, Bastian Steudel, Naji Shajarisales, Bernhard Schölkopf

Chapter 19. Interpretation of Black-Box Predictive Models

Abstract
Many machine learning applications involve predictive data-analytic modeling using black-box techniques. A common problem in such studies is understanding/interpretation of estimated nonlinear high-dimensional models. Whereas human users naturally favor simple interpretable models, such models may not be practically feasible with modern adaptive methods such as Support Vector Machines (SVMs), Multilayer Perceptron Networks (MLPs), AdaBoost, etc. This chapter provides a brief survey of the current techniques for visualization and interpretation of SVM-based classification models, and then highlights potential problems with such methods. We argue that, under the VC-theoretical framework, model interpretation cannot be achieved via technical analysis of predictive data-analytic models. That is, any meaningful interpretation should incorporate application domain knowledge outside data analysis. We also describe a simple graphical technique for visualization of SVM classification models.
Vladimir Cherkassky, Sauptik Dhar

Chapter 20. PAC-Bayes Bounds for Supervised Classification

Abstract
We present in this contribution a synthesis of Seeger’s approach (PAC-Bayesian generalization error bounds for Gaussian process classification, 2002) and our own (Catoni, PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, 2007) to PAC-Bayes inequalities for 0–1 loss functions. We apply it to supervised classification, and more specifically to the proof of new margin bounds for support vector machines, in the spirit of the bounds established by Langford and Shawe-Taylor (Advances in Neural Information Processing Systems, 2002) and McAllester (Learning Theory and Kernel Machines, COLT 2003).
Olivier Catoni

Chapter 21. Bounding Embeddings of VC Classes into Maximum Classes

Abstract
One of the earliest conjectures in computational learning theory—the Sample Compression conjecture—asserts that concept classes (equivalently set systems) admit compression schemes of size linear in their VC dimension.
J. Hyam Rubinstein, Benjamin I. P. Rubinstein, Peter L. Bartlett

Chapter 22. Strongly Consistent Detection for Nonparametric Hypotheses

Abstract
Consider two robust detection problems formulated by nonparametric hypotheses such that both hypotheses are composite and indistinguishable. Strongly consistent testing rules are shown.
László Györfi, Harro Walk

Chapter 23. On the Version Space Compression Set Size and Its Applications

Abstract
The version space compression set size n is the size of the smallest subset of a training set that induces the same (agnostic) version space with respect to a hypothesis class.
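The definition above can be checked by brute force on a toy example. The sketch below is a hypothetical setup (not from the chapter), using the realizable rather than agnostic setting and a finite class of threshold classifiers x ≥ t: the version space is the set of hypotheses consistent with the sample, and the compression set size is the size of the smallest subsample inducing the same version space.

```python
from itertools import combinations

def version_space(sample, thresholds):
    """Thresholds t whose classifier x >= t labels every sample point correctly."""
    return frozenset(t for t in thresholds
                     if all(int(x >= t) == y for x, y in sample))

def compression_set_size(sample, thresholds):
    """Size of the smallest subsample inducing the same version space."""
    target = version_space(sample, thresholds)
    for k in range(len(sample) + 1):
        for sub in combinations(sample, k):
            if version_space(sub, thresholds) == target:
                return k

data = [(0.5, 0), (1.0, 0), (2.0, 1), (3.0, 1)]
ths = [0.0, 0.75, 1.5, 2.5, 3.5]
# the version space is {1.5}; the boundary pair (1.0, 0), (2.0, 1) induces it
print(compression_set_size(data, ths))  # 2
```

For thresholds the compression set is just the two boundary points, matching the intuition that the compression set size tracks the VC dimension rather than the sample size.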
Ran El-Yaniv, Yair Wiener

Chapter 24. Lower Bounds for Sparse Coding

Abstract
We give lower bounds on the reconstruction error for PCA, k-means clustering, and various sparse coding methods. It is shown that the two objectives of good data approximation and sparsity of the solution are incompatible if the data distribution is evasive in the sense that most of its mass lies away from any low dimensional subspace. We give closure properties and examples of evasive distributions and quantify the extent to which distributions of bounded support and bounded density are evasive.
Andreas Maurer, Massimiliano Pontil, Luca Baldassarre

Chapter 25. Robust Algorithms via PAC-Bayes and Laplace Distributions

Abstract
Laplace random variables are commonly used to model extreme noise in many fields, while systems trained to deal with such noise are often characterized by robustness properties. We introduce new learning algorithms that minimize objectives derived directly from PAC-Bayes generalization bounds, incorporating Laplace distributions. The resulting algorithms are regulated by the Huber loss function, which is considered relatively robust to large noise. We analyze the convexity properties of the objective, and propose a few bounds which are fully convex, two of which are jointly convex in the mean and standard deviation under certain conditions. We derive new algorithms analogous to recent boosting algorithms, providing novel relations between boosting and PAC-Bayes analysis. Experiments show that our algorithms outperform AdaBoost (Freund and Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, 1995), L1-LogBoost (Duchi and Singer, Boosting with structural sparsity, 2009), and RobustBoost (Freund, A more robust boosting algorithm, 2009) across a wide range of noise levels.
Asaf Noy, Koby Crammer

Backmatter
