Abstract
We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
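To make the setting concrete, the sketch below implements a simple randomized, exponentially weighted forecaster for binary prediction with expert advice: each expert that errs has its weight shrunk multiplicatively, and the forecaster predicts 1 with probability equal to the weighted fraction of experts predicting 1. This is an illustrative sketch only, not the paper's exact algorithm; the fixed learning rate `eta`, the function name, and the toy data are assumptions, and the paper's algorithms tune this trade-off as a function of the best expert's loss to obtain the square-root bound.

```python
import math

def exp_weighted_forecaster(expert_preds, outcomes, eta=0.5):
    """Randomized exponentially weighted forecaster (illustrative sketch).

    expert_preds -- list of rounds; each round is a list of the experts' {0,1} predictions
    outcomes     -- list of the true bits, one per round
    eta          -- learning rate; the fixed value 0.5 is an assumption made here,
                    not the tuning used in the paper
    Returns the forecaster's expected number of mistakes over the sequence.
    """
    n = len(expert_preds[0])
    weights = [1.0] * n
    expected_mistakes = 0.0
    for preds, y in zip(expert_preds, outcomes):
        total = sum(weights)
        # probability of predicting 1 = weighted fraction of experts predicting 1
        p_one = sum(w for w, p in zip(weights, preds) if p == 1) / total
        # expected loss this round is the probability mass on the wrong prediction
        expected_mistakes += (1.0 - p_one) if y == 1 else p_one
        # multiplicatively shrink the weight of every expert that erred
        weights = [w * (math.exp(-eta) if p != y else 1.0)
                   for w, p in zip(weights, preds)]
    return expected_mistakes

# Toy usage: two experts, one always correct and one always wrong.
rounds = [[1, 0], [1, 0], [1, 0]]
bits = [1, 1, 1]
print(exp_weighted_forecaster(rounds, bits))
```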