Abstract
We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
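To make the setting concrete, the sketch below implements a simple randomized, exponentially weighted forecaster for binary prediction with expert advice: each expert that errs has its weight shrunk multiplicatively, and the forecaster predicts 1 with probability equal to the weighted fraction of experts predicting 1. This is an illustrative sketch only, not the paper's exact algorithm; the fixed learning rate `eta`, the function name, and the toy data are assumptions, and the paper's algorithms tune this trade-off as a function of the best expert's loss to obtain the square-root bound.

```python
import math

def exp_weighted_forecaster(expert_preds, outcomes, eta=0.5):
    """Randomized exponentially weighted forecaster (illustrative sketch).

    expert_preds -- list of rounds; each round is a list of the experts' {0,1} predictions
    outcomes     -- list of the true bits, one per round
    eta          -- learning rate; the fixed value 0.5 is an assumption made here,
                    not the tuning used in the paper
    Returns the forecaster's expected number of mistakes over the sequence.
    """
    n = len(expert_preds[0])
    weights = [1.0] * n
    expected_mistakes = 0.0
    for preds, y in zip(expert_preds, outcomes):
        total = sum(weights)
        # probability of predicting 1 = weighted fraction of experts predicting 1
        p_one = sum(w for w, p in zip(weights, preds) if p == 1) / total
        # expected loss this round is the probability mass on the wrong prediction
        expected_mistakes += (1.0 - p_one) if y == 1 else p_one
        # multiplicatively shrink the weight of every expert that erred
        weights = [w * (math.exp(-eta) if p != y else 1.0)
                   for w, p in zip(weights, preds)]
    return expected_mistakes

# Toy usage: two experts, one always correct and one always wrong.
rounds = [[1, 0], [1, 0], [1, 0]]
bits = [1, 1, 1]
print(exp_weighted_forecaster(rounds, bits))
```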