Abstract
We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state spaces. First, our algorithm converges to a unique solution of an approximate Bellman equation regardless of its initialization. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful for studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias, and that this bias is typically larger in reinforcement learning than in a comparable regression problem.
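To make the idea concrete, the sketch below shows one way a kernel-smoothed Bellman operator can be iterated to a fixed point on sampled transitions, in the spirit of the approach the abstract describes. It is a minimal illustration under stated assumptions, not the paper's implementation: the Gaussian kernel, the bandwidth h, the discount gamma, and the names gaussian_weights and kernel_based_rl are all choices made for this example. Because the normalized kernel weights form a stochastic matrix, the smoothed backup is a contraction, so value iteration converges to a unique fixed point regardless of initialization, which is the stability property highlighted above.

```python
import numpy as np

def gaussian_weights(X, queries, h):
    """Normalized Gaussian kernel weights from each query point to the rows of X."""
    d2 = ((queries[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * h ** 2))
    return K / (K.sum(axis=1, keepdims=True) + 1e-12)

def kernel_based_rl(transitions, gamma=0.95, h=0.2, n_iters=500, tol=1e-9):
    """Approximate value iteration with a kernel-smoothed Bellman operator.

    transitions[a] = (X, R, Xn): the states at which action a was observed,
    the rewards received, and the successor states.  The value function is
    represented by its values at the pooled successor states.
    """
    actions = list(transitions)
    offsets, blocks, start = {}, [], 0
    for a in actions:
        Xn = transitions[a][2]
        offsets[a] = slice(start, start + len(Xn))
        blocks.append(Xn)
        start += len(Xn)
    S = np.vstack(blocks)                      # support points for V
    # Kernel weights from every support point to each action's sample states.
    W = {a: gaussian_weights(transitions[a][0], S, h) for a in actions}

    V = np.zeros(len(S))                       # any initialization works
    for _ in range(n_iters):
        # Smoothed backup: Q_a(s) = sum_i w_i(s) * (r_i + gamma * V(x_i')).
        Q = np.stack([W[a] @ (transitions[a][1] + gamma * V[offsets[a]])
                      for a in actions])
        V_new = Q.max(axis=0)                  # greedy over actions
        if np.max(np.abs(V_new - V)) < tol:    # contraction => unique fixed point
            break
        V = V_new
    return V, S

# Toy usage: two actions drift a 1-D state left or right; reward peaks at 0.5.
rng = np.random.default_rng(0)
transitions = {}
for a, drift in [(0, -0.1), (1, 0.1)]:
    X = rng.uniform(0.0, 1.0, size=(30, 1))
    R = -np.abs(X[:, 0] - 0.5)
    Xn = np.clip(X + drift, 0.0, 1.0)
    transitions[a] = (X, R, Xn)
V, S = kernel_based_rl(transitions)
```

A greedy policy at a new query state can be recovered with the same kernel weights, evaluated at that state instead of the support points; as the number of sampled transitions grows and the bandwidth shrinks appropriately, this is the sense in which such an estimate can be consistent.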
Cite this article
Ormoneit, D., & Sen, Ś. Kernel-Based Reinforcement Learning. Machine Learning, 49, 161–178 (2002). https://doi.org/10.1023/A:1017928328829