
Basis Function Adaptation in Temporal Difference Reinforcement Learning

Published in: Annals of Operations Research

Abstract

Reinforcement Learning (RL) is an approach for solving complex multi-stage decision problems that fall under the general framework of Markov Decision Problems (MDPs), with possibly unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact representation and enables generalization. Linear approximation architectures (where the adjustable parameters are the weights of pre-fixed basis functions) have recently gained prominence due to efficient algorithms and convergence guarantees. Nonetheless, an appropriate choice of basis functions is important for the success of the algorithm. In the present paper we examine methods for adapting the basis functions during the learning process, in the context of evaluating the value function under a fixed control policy. Using the Bellman approximation error as an optimization criterion, we optimize the weights of the basis functions while simultaneously adapting their (non-linear) parameters. We present two algorithms for this problem. The first uses a gradient-based approach and the second applies the Cross Entropy method. The performance of the proposed algorithms is evaluated and compared in simulations.
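
As a concrete illustration of the two-level optimization described above, the Python sketch below fits the linear weights by least squares against an empirical Bellman-error criterion for each candidate setting of the non-linear basis parameters (Gaussian RBF centers and widths), and uses the Cross Entropy method to search over those parameters. This is a minimal sketch on an assumed toy chain problem, not the paper's algorithm; all names, the problem setup, and the parameter choices are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Toy Markov reward process standing in for fixed-policy evaluation
# (illustrative assumption, not the paper's benchmark): a 50-state chain that
# advances with probability 0.9, with a reward of 1 at the final state.
N_STATES, GAMMA = 50, 0.95

def sample_transitions(n):
    s = rng.integers(0, N_STATES, size=n)
    advance = rng.random(n) < 0.9
    s_next = np.minimum(s + advance, N_STATES - 1)
    r = (s_next == N_STATES - 1).astype(float)
    return s, r, s_next

def rbf_features(states, centers, widths):
    # Gaussian radial basis functions with adjustable centers and widths.
    d = states[:, None] - centers[None, :]
    return np.exp(-0.5 * (d / widths[None, :]) ** 2)

def bellman_error(theta, s, r, s_next):
    # For fixed non-linear parameters theta = (centers, widths), fit the
    # linear weights w by least squares and return the empirical squared
    # Bellman residual  r + gamma * Phi' w - Phi w.
    k = theta.shape[0] // 2
    centers, widths = theta[:k], np.abs(theta[k:]) + 1e-3
    phi = rbf_features(s, centers, widths)
    phi_next = rbf_features(s_next, centers, widths)
    A = phi - GAMMA * phi_next
    w, *_ = np.linalg.lstsq(A, r, rcond=None)
    resid = r - A @ w
    return np.mean(resid ** 2)

# Cross Entropy search over the basis parameters: sample candidates from a
# Gaussian, keep the lowest-error elite fraction, refit mean and std to it.
K = 8                                   # number of basis functions
mu = np.concatenate([np.linspace(0.0, N_STATES, K), np.full(K, 5.0)])
sigma = np.full(2 * K, 10.0)
s, r, s_next = sample_transitions(5000)

for it in range(30):
    pop = mu + sigma * rng.standard_normal((100, 2 * K))
    scores = np.array([bellman_error(th, s, r, s_next) for th in pop])
    elite = pop[np.argsort(scores)[:10]]
    mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    print(f"iteration {it:2d}: best Bellman error {scores.min():.6f}")

A gradient-based variant in the spirit of the paper's first algorithm would instead update the centers and widths by differentiating the same Bellman-error criterion, rather than resampling them with the Cross Entropy method.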




Author information


Correspondence to Shie Mannor.

Additional information

This research was partially supported by the Fund for Promotion of Research at the Technion. The work of S.M. was partially supported by the National Science Foundation under grant ECS-0312921.


About this article

Cite this article

Menache, I., Mannor, S. & Shimkin, N. Basis Function Adaptation in Temporal Difference Reinforcement Learning. Ann Oper Res 134, 215–238 (2005). https://doi.org/10.1007/s10479-005-5732-z
