Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

Singh, Satinder; Jaakkola, Tommi; Littman, Michael L.; Szepesvári, Csaba

doi:10.1023/A:1007678930559

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

Published: March 2000

Volume 38, pages 287–308, (2000)
Cite this article

Download PDF

Machine Learning Aims and scope Submit manuscript

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

Download PDF

Satinder Singh¹,
Tommi Jaakkola²,
Michael L. Littman³ &
…
Csaba Szepesvári⁴

11k Accesses
340 Citations
Explore all metrics

Abstract

An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.

Article PDF

Generalized exploration in policy search

Article 13 July 2017

Reinforcement Learning with Guarantees that Hold for Ever

An Incremental Fast Policy Search Using a Single Sample Path

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Barto, A. G., Bradtke S. J., & Singh S. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1), 81–138.
Google Scholar
Bellman, R. (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Google Scholar
Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. (Vol. 1 and 2). Belmont, Massachusetts: Athena Scientific.
Google Scholar
Boyan, J. A. & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems (Vol. 7, pp. 369–376). Cambridge, MA: The MIT Press.
Google Scholar
Breiman, L. (1992).Probability. Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics.
Google Scholar
Crites, R. H. & Barto, A. G. (1996). Improving elevator performance using reinforcement learning. Advances in neural information processing systems (Vol. 8). MIT Press.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8(3), 341–362.
Google Scholar
Dayan, P. & Sejnowski, T. J. (1994). TD.(λ)converges with probability 1. Machine Learning, 14 (3).
Dayan, P. & Sejnowski, T. J. (1996). Exploration bonuses and dual control. Machine Learning, 25, 5–22.
Google Scholar
Gullapalli, V. & Barto, A. G. (1994). Convergence of indirect adaptive asynchronous value iteration algorithms. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems (Vol. 6, pp. 695–702). San Mateo, CA: Morgan Kaufmann.
Google Scholar
Jaakkola, T., Jordan, M. I., & Singh, S. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1185–1201.
Google Scholar
John, G. H. (1994). When the best move isn't optimal: Q-learning with exploration. In Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, WA, p. 1464.
John, G. H. (1995). When the best move isn't optimal: Q-learning with exploration. Unpublished manuscript, available at URL ftp://starry.stanford.edu/pub/gjohn/papers/rein-nips.ps.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.
Google Scholar
Kumar, P. R. & Varaiya, P. P. (1986). Stochastic systems: estimation, identification, and adaptive control. Englewood Cliffs, NJ: Prentice Hall.
Google Scholar
Littman, M. L. & Szepesvári, C. (1996). A generalized reinforcement-learning model: Convergence and applications. In L. Saitta (Ed.), Proceedings of the Thirteenth International Conference on Machine Learning (pp. 310–318).
Littman, M. L. (1996). Algorithms for sequential decision making. Ph.D. Thesis, Department of Computer Science, Brown University, February. Also Technical Report CS–96–09.
Puterman, M. L. (1994). Markov decision processes—discrete stochastic dynamic programming. New York, NY: John Wiley & Sons, Inc.
Google Scholar
Rummery, G. A. (1994). Problem solving with reinforcement learning. Ph.D. Thesis, Cambridge University Engineering Department.
Rummery, G. A. & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
Singh, S. & Bertsekas, D. P. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. Advances in neural information processing systems (Vol. 9, pp. 974–980). MIT Press.
Google Scholar
Singh, S. & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1–3):123–158.
Google Scholar
Singh, S. & Yee, R. C. (1994). An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16, 227–233.
Google Scholar
Sutton, R. S. & Barto, A. G. (1998). An introduction to reinforcement learning. The MIT Press.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(1): 9–44.
Google Scholar
Sutton, R. S. (1996). Generalization in reinforcement learning: successful examples using sparse coarse coding. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems (Vol. 8). Cambridge, MA: The MIT Press.
Google Scholar
Szepesvári, C. & Littman, M. L. (1996). Generalized Markov decision processes: dynamic-programming and reinforcement-learning algorithms. Technical Report CS–96–11, Brown University, Providence, RI.
Tesauro, G. J. (1995). Temporal difference learning and TD-gammon. Communications of the ACM, 38(3), 58–68.
Google Scholar
Thrun, S. B. (1992). The role of exploration in learning control. In D. A. White & D. A. Sofge (Eds.), Handbook of intelligent control: neural, fuzzy, and adaptive approaches. New York, NY: Van Nostrand Reinhold.
Google Scholar
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202,September 1994.
Google Scholar
Tsitsiklis, J. N. & Van Roy, B. (1996). An analysis of temporal-difference learning with function approximation. Technical Report LIDS-P-2322, Massachusetts Institute of Technology, March. Available through http://web.mit.edu/bvr/www/td.ps. To appear in IEEE Transactions on Automatic Control.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. Thesis, King's College, Cambridge, UK.
Google Scholar
Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.
Google Scholar
Williams, R. J. & Baird, L. C., III (1993). Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS–93–14, Northeastern University, College of Computer Science, Boston, MA.
Google Scholar
Zhang, W. & Dietterich, T. G. (1995). High-performance job-shop scheduling with a time delay TD(λ) network. Advances in neural information processing systems (Vol. 8, pp. 1024–1030). MIT Press.
Google Scholar

Download references

Author information

Authors and Affiliations

AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ, 07932, USA
Satinder Singh
Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
Tommi Jaakkola
Department of Computer Science, Duke University, Durham, NC, 27708-0129, USA
Michael L. Littman
Mindmaker Ltd., Konkoly Thege M. u. 29-33, Budapest, 1121, Hungary
Csaba Szepesvári

Authors

Satinder Singh
View author publications
You can also search for this author in PubMed Google Scholar
Tommi Jaakkola
View author publications
You can also search for this author in PubMed Google Scholar
Michael L. Littman
View author publications
You can also search for this author in PubMed Google Scholar
Csaba Szepesvári
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

About this article

Cite this article

Singh, S., Jaakkola, T., Littman, M.L. et al. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Machine Learning 38, 287–308 (2000). https://doi.org/10.1023/A:1007678930559

Download citation

Issue Date: March 2000
DOI: https://doi.org/10.1023/A:1007678930559

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

Abstract

Article PDF

Similar content being viewed by others

Generalized exploration in policy search

Reinforcement Learning with Guarantees that Hold for Ever

An Incremental Fast Policy Search Using a Single Sample Path

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Navigation

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

Abstract

Article PDF

Similar content being viewed by others

Generalized exploration in policy search

Reinforcement Learning with Guarantees that Hold for Ever

An Incremental Fast Policy Search Using a Single Sample Path

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation