Abstract
We study task sequences that allow for speeding up the learner's average reward intake through appropriate shifts of inductive bias (changes of the learner's policy). To evaluate the long-term effects of bias shifts that set the stage for later bias shifts, we use the “success-story algorithm” (SSA). SSA is occasionally called at times that may depend on the policy itself. It uses backtracking to undo those bias shifts that have not been empirically observed to trigger long-term reward accelerations (measured up until the current SSA call). Bias shifts that survive SSA represent a lifelong success history. Until the next SSA call, they are considered useful and form the basis for additional bias shifts. SSA allows for plugging in a wide variety of learning algorithms. We plug in (1) a novel, adaptive extension of Levin search and (2) a method for embedding the learner's policy modification strategy within the policy itself (incremental self-improvement). Our inductive transfer case studies involve complex, partially observable environments where traditional reinforcement learning fails.
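To make the backtracking step concrete, the following Python sketch shows one SSA call. It is a minimal illustration under our own assumptions, not the authors' implementation: each stack entry is a hypothetical tuple (t_i, r_i, undo) holding the time of a bias shift, the cumulative reward at that time, and a closure that reverts that shift's policy modifications.

```python
# Minimal sketch of one SSA call. Assumptions (hypothetical, for
# illustration only): stack entries are tuples (t_i, r_i, undo) with
# t_i < now and now > 0; `undo` reverts the corresponding bias shift's
# policy modifications.

def ssc_holds(stack, now, total_reward):
    """Success-story criterion: reward intake per time since each
    surviving bias shift must strictly increase from the bottom of the
    stack to the top, starting from the lifelong average."""
    rates = [total_reward / now]  # lifelong average reward as baseline
    rates += [(total_reward - r_i) / (now - t_i) for t_i, r_i, _ in stack]
    return all(a < b for a, b in zip(rates, rates[1:]))

def ssa_call(stack, now, total_reward):
    """Backtrack: undo the most recent bias shifts until the remaining
    stack satisfies the success-story criterion."""
    while stack and not ssc_holds(stack, now, total_reward):
        _, _, undo = stack.pop()
        undo()  # revert this shift's policy modifications
```

Because shifts are undone from the top of the stack downward until the criterion holds for everything that remains, each surviving entry marks the start of a period whose reward per time exceeded that of all earlier surviving periods, i.e., the lifelong success history mentioned above.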
Cite this article
Schmidhuber, J., Zhao, J. & Wiering, M. Shifting Inductive Bias with Success-Story Algorithm, Adaptive Levin Search, and Incremental Self-Improvement. Machine Learning 28, 105–130 (1997). https://doi.org/10.1023/A:1007383707642