Published in: Soft Computing 6/2011

01.06.2011 | Focus

Continuous-action reinforcement learning with fast policy search and adaptive basis function selection

Authors: Xin Xu, Chunming Liu, Dewen Hu


Abstract

As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the artificial intelligence and machine learning communities. However, the generalization ability of RL is still an open problem, and it is difficult for existing RL algorithms to solve Markov decision problems (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, called Continuous-action Approximate Policy Iteration (CAPI), is proposed for RL in MDPs with both continuous state and action spaces. In CAPI, based on the value functions estimated by temporal-difference learning, a fast policy search technique is presented to search for optimal actions in continuous spaces; it is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximations of value functions can be obtained efficiently both for linear function approximators and for kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only converges to a near-optimal policy in a few iterations but also achieves performance comparable to or better than Sarsa-learning and previous approximate policy iteration methods such as LSPI and KLSPI.
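The abstract does not give the algorithmic details of CAPI, but the ingredients it names (TD-based value-function estimation inside an approximate policy iteration loop, a greedy search over a continuous action interval, and a sparse feature representation) can be illustrated with a short sketch. The Python snippet below is a generic illustration under assumed design choices (Gaussian RBF features over the joint state-action vector, an LSTD-style policy-evaluation step, and a simple candidate-grid action search); it is not the authors' CAPI implementation, and the names `rbf_features`, `greedy_action`, `lstd_q`, and `capi_like_iteration` are hypothetical.

```python
# Hedged sketch of approximate policy iteration with a linear Q-function and a
# simple continuous-action search. This illustrates the ideas named in the
# abstract, NOT the paper's CAPI algorithm: the feature map, the action search,
# and the evaluation step are all assumptions made for this example.

import numpy as np

def rbf_features(state, action, centers, width=0.5):
    """Gaussian RBF features over the joint (state, action) vector (assumed design)."""
    x = np.concatenate([np.atleast_1d(state), np.atleast_1d(action)])
    return np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * width ** 2))

def greedy_action(w, state, centers, a_low, a_high, n_candidates=21):
    """Coarse grid search over the continuous action interval; a stand-in for
    the paper's fast policy search."""
    candidates = np.linspace(a_low, a_high, n_candidates)
    q_values = [w @ rbf_features(state, a, centers) for a in candidates]
    return candidates[int(np.argmax(q_values))]

def lstd_q(samples, w_old, centers, a_low, a_high, gamma=0.95, reg=1e-3):
    """One policy-evaluation step in LSTD-Q style, using the greedy policy
    induced by the previous weight vector."""
    k = centers.shape[0]
    A, b = reg * np.eye(k), np.zeros(k)
    for s, a, r, s_next in samples:
        phi = rbf_features(s, a, centers)
        a_next = greedy_action(w_old, s_next, centers, a_low, a_high)
        phi_next = rbf_features(s_next, a_next, centers)
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)

def capi_like_iteration(samples, centers, a_low=-1.0, a_high=1.0, n_iter=10):
    """Approximate policy iteration loop over a fixed batch of (s, a, r, s')
    transitions (sample collection from the environment is omitted)."""
    w = np.zeros(centers.shape[0])
    for _ in range(n_iter):
        w = lstd_q(samples, w, centers, a_low, a_high)
    return w
```

In this sketch the continuous action is handled by evaluating the approximate Q-function at a small set of candidate actions and taking the maximizer; the paper's fast policy search and adaptive basis function selection would replace the fixed grid and the fixed RBF centers, respectively.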


Metadata
Title
Continuous-action reinforcement learning with fast policy search and adaptive basis function selection
Authors
Xin Xu
Chunming Liu
Dewen Hu
Publication date
01.06.2011
Publisher
Springer-Verlag
Published in
Soft Computing / Issue 6/2011
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-010-0581-3
