Abstract
This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming-based reinforcement learning method, with the TD(λ) return estimation process typically used in actor-critic learning, another well-known dynamic-programming-based reinforcement learning method. The parameter λ distributes credit over sequences of actions, which speeds learning and also helps alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The algorithm's behavior is demonstrated through computer simulations.
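To make the role of λ concrete, here is a minimal tabular sketch of Q(λ)-style learning with eligibility traces. The environment, hyperparameters, and all names (ChainEnv, ALPHA, GAMMA, LAM, EPSILON) are hypothetical illustration choices, not from the paper, and for brevity the trace handling on exploratory actions follows Watkins's variant of Q(λ) rather than the paper's exact formulation.

```python
import numpy as np

# Hypothetical toy environment and hyperparameters, chosen only to
# illustrate the update; none of these values come from the paper.
N_STATES, N_ACTIONS = 16, 2
ALPHA, GAMMA, LAM, EPSILON = 0.1, 0.95, 0.9, 0.1

class ChainEnv:
    """Deterministic chain: action 0 moves right, action 1 moves left.
    Reaching the rightmost state ends the episode with reward 1."""
    def __init__(self, n=N_STATES):
        self.n, self.s = n, 0

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = min(self.s + 1, self.n - 1) if a == 0 else max(self.s - 1, 0)
        done = self.s == self.n - 1
        return self.s, (1.0 if done else 0.0), done

def epsilon_greedy(Q, s, rng):
    # Explore with probability EPSILON, otherwise act greedily on Q.
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))

def run_episode(env, Q, rng):
    E = np.zeros_like(Q)        # eligibility traces, one per (state, action)
    s = env.reset()
    a = epsilon_greedy(Q, s, rng)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = epsilon_greedy(Q, s2, rng)
        greedy = int(np.argmax(Q[s2]))
        # One-step TD error toward the greedy successor value.
        delta = r + (0.0 if done else GAMMA * Q[s2, greedy]) - Q[s, a]
        E[s, a] += 1.0          # bump the trace for the visited pair
        Q += ALPHA * delta * E  # lambda spreads the error backward in time
        if a2 == greedy:
            E *= GAMMA * LAM    # decay all traces
        else:
            E[:] = 0.0          # Watkins-style trace cut after exploration
        s, a = s2, a2

rng = np.random.default_rng(0)
env, Q = ChainEnv(), np.zeros((N_STATES, N_ACTIONS))
for _ in range(200):
    run_episode(env, Q, rng)
```

With λ = 0 this reduces to ordinary one-step Q-learning; larger λ propagates each TD error further back along the trajectory, which is the credit-distribution effect described in the abstract.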
Cite this article
Peng, J., Williams, R.J. Incremental Multi-Step Q-Learning. Machine Learning 22, 283–290 (1996). https://doi.org/10.1023/A:1018076709321