Abstract
We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state spaces. First, our algorithm converges to a unique solution of an approximate Bellman equation regardless of its initialization. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful for studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias, and that this bias is typically larger in reinforcement learning than in a comparable regression problem.
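To make the idea concrete, the sketch below shows one way a kernel-smoothed Bellman operator can be iterated to a fixed point on sampled transitions, in the spirit of the approach the abstract describes. It is a minimal illustration under stated assumptions, not the paper's implementation: the Gaussian kernel, the bandwidth h, the discount gamma, and the names gaussian_weights and kernel_based_rl are all choices made for this example. Because the normalized kernel weights form a stochastic matrix, the smoothed backup is a contraction, so value iteration converges to a unique fixed point regardless of initialization, which is the stability property highlighted above.

```python
import numpy as np

def gaussian_weights(X, queries, h):
    """Normalized Gaussian kernel weights from each query point to the rows of X."""
    d2 = ((queries[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * h ** 2))
    return K / (K.sum(axis=1, keepdims=True) + 1e-12)

def kernel_based_rl(transitions, gamma=0.95, h=0.2, n_iters=500, tol=1e-9):
    """Approximate value iteration with a kernel-smoothed Bellman operator.

    transitions[a] = (X, R, Xn): the states at which action a was observed,
    the rewards received, and the successor states.  The value function is
    represented by its values at the pooled successor states.
    """
    actions = list(transitions)
    offsets, blocks, start = {}, [], 0
    for a in actions:
        Xn = transitions[a][2]
        offsets[a] = slice(start, start + len(Xn))
        blocks.append(Xn)
        start += len(Xn)
    S = np.vstack(blocks)                      # support points for V
    # Kernel weights from every support point to each action's sample states.
    W = {a: gaussian_weights(transitions[a][0], S, h) for a in actions}

    V = np.zeros(len(S))                       # any initialization works
    for _ in range(n_iters):
        # Smoothed backup: Q_a(s) = sum_i w_i(s) * (r_i + gamma * V(x_i')).
        Q = np.stack([W[a] @ (transitions[a][1] + gamma * V[offsets[a]])
                      for a in actions])
        V_new = Q.max(axis=0)                  # greedy over actions
        if np.max(np.abs(V_new - V)) < tol:    # contraction => unique fixed point
            break
        V = V_new
    return V, S

# Toy usage: two actions drift a 1-D state left or right; reward peaks at 0.5.
rng = np.random.default_rng(0)
transitions = {}
for a, drift in [(0, -0.1), (1, 0.1)]:
    X = rng.uniform(0.0, 1.0, size=(30, 1))
    R = -np.abs(X[:, 0] - 0.5)
    Xn = np.clip(X + drift, 0.0, 1.0)
    transitions[a] = (X, R, Xn)
V, S = kernel_based_rl(transitions)
```

A greedy policy at a new query state can be recovered with the same kernel weights, evaluated at that state instead of the support points; as the number of sampled transitions grows and the bandwidth shrinks appropriately, this is the sense in which such an estimate can be consistent.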
Cite this article
Ormoneit, D., & Sen, Ś. Kernel-Based Reinforcement Learning. Machine Learning, 49, 161–178 (2002). https://doi.org/10.1023/A:1017928328829