
How to use expert advice

Published: 01 May 1997

Abstract

We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
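As a reader's note, the shape of the bound stated above can be written schematically as follows. The constants (which the paper pins down and matches between upper and lower bounds) are deliberately omitted here; only the asymptotic form is shown, using L* for the number of mistakes of the best expert and N for the number of experts.

```latex
% Schematic worst-case regret bound; constants omitted.
% L^* = mistakes of the best expert, N = number of experts.
\[
  \mathbb{E}[\text{mistakes of algorithm}]
  \;-\;
  \mathbb{E}[\text{mistakes of best expert}]
  \;=\;
  O\!\left(\sqrt{L^{*}\ln N} \;+\; \ln N\right).
\]
```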



        Reviews

        Giuseppina Carla Gini

While the title of this paper suggests the use of traditional expert systems technology, the research reported here concerns how to make predictions when a team of experts is available and their predictions are known. In particular, the authors define an algorithm to solve the following prediction problem: we want to predict whether an event will occur, and we may use the predictions of a finite set of experts; each expert sees a sequence of measurements and gives an opinion on the probability of the event as a real number between 0 and 1. The problem is to build an algorithm that uses the experts' advice to give the best possible estimate. This requires minimizing the loss, defined as the absolute value of the difference between the predicted number and the real outcome.

The authors define a family of algorithms, some of them taken from the literature, and study their performance. In particular, the optimal algorithms are discussed, and their lower and upper bounds are demonstrated. Since the optimal algorithms have complexity exponential in the number of experts, the authors provide other algorithms to generate the prediction, and demonstrate lower and upper bounds for these as well. The algorithm P is based on computing and updating a nonnegative weight for each expert that reflects the expert's prediction ability. A parameter controls when to drop poorly performing experts; choosing this parameter requires a priori knowledge. An iterative variation, P*, can make predictions without a priori knowledge of the best expert's loss.

In the last part of the paper, the authors discuss an application of the algorithm to pattern recognition, where the problem is to predict the label (0 or 1) of examples drawn at random from an arbitrary distribution. The authors compare their solution with other, similar results; the comparison indicates that the new approach is better.
The proofs of the main theorems are difficult and require a few lemmas. This paper is important for everyone working in the field. It covers almost every aspect of a loss function that minimizes the number of mistakes and can use any kind of expert output.
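The weight-based scheme the review describes can be sketched in a few lines. The code below is a generic exponential-weights forecaster in the spirit of algorithm P, not a reproduction of it: the learning rate `eta` and the weighted-average prediction rule are illustrative choices, and the paper's tuned parameters and expert-dropping rule are not reproduced here.

```python
import math

def exponential_weights(expert_preds, outcomes, eta=0.5):
    """Sketch of a weighted-average forecaster over a panel of experts.

    expert_preds: list of T rounds, each a list of N predictions in [0, 1].
    outcomes: list of T binary outcomes (0 or 1).
    eta: illustrative learning rate (not the paper's tuned value).
    Returns the forecaster's total absolute loss over the T rounds.
    """
    n = len(expert_preds[0])
    weights = [1.0] * n  # one nonnegative weight per expert, as in the review
    total_loss = 0.0
    for preds, y in zip(expert_preds, outcomes):
        w_sum = sum(weights)
        # Predict the weight-averaged probability of the event.
        p = sum(w * x for w, x in zip(weights, preds)) / w_sum
        # Absolute loss: |prediction - outcome|, as defined in the review.
        total_loss += abs(p - y)
        # Multiplicatively shrink each expert's weight by its own loss,
        # so poorly performing experts lose influence over time.
        weights = [w * math.exp(-eta * abs(x - y))
                   for w, x in zip(weights, preds)]
    return total_loss
```

With one expert that is always right and one that is always wrong, the forecaster starts at an uninformed 0.5 prediction and converges toward the good expert, so its total loss stays well below that of a single bad expert.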


• Published in

  Journal of the ACM, Volume 44, Issue 3 (May 1997), 163 pages
  ISSN: 0004-5411
  EISSN: 1557-735X
  DOI: 10.1145/258128
  Editor: F. T. Leighton

          Copyright © 1997 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States
