
2015 | OriginalPaper | Chapter

2. The Explore–Exploit Dilemma in Nonstationary Decision Making under Uncertainty

Authors : Allan Axelrod, Girish Chowdhary

Published in: Handling Uncertainty and Networked Structure in Robot Control

Publisher: Springer International Publishing


Abstract

Autonomous systems are often assumed to operate in environments that can be described by a stationary (time-invariant) model. However, real-world environments are often nonstationary (time-varying): the underlying phenomena change over time, so stationary approximations of the environment may quickly lose relevance. Here, two approaches are presented and applied in the context of reinforcement learning in nonstationary environments. In Sect. 2.2, the first approach leverages reinforcement learning in the presence of a changing reward model; in particular, a functional termed the Fog-of-War is used to drive exploration, which results in the timely discovery of new models in nonstationary environments. In Sect. 2.3, the Fog-of-War functional is adapted in real time to reflect the heterogeneous information content of a real-world environment; this is critically important for applying the approach of Sect. 2.2 in real-world settings.
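The core idea described in the abstract (an exploration signal that regrows over time, so that a learner revisits regions whose model may have gone stale) can be illustrated with a toy sketch. The class below is a hypothetical two-armed-bandit stand-in, not the chapter's actual Fog-of-War functional: the name `StalenessBandit` and the linear uncertainty-growth rate are assumptions made purely for illustration.

```python
class StalenessBandit:
    """Toy bandit agent whose per-arm exploration bonus grows with the
    time elapsed since that arm was last sampled -- a loose stand-in for
    a Fog-of-War-style regrowing-uncertainty signal."""

    def __init__(self, n_arms, growth=0.1):
        self.n_arms = n_arms
        self.growth = growth            # how fast uncertainty regrows per step
        self.estimates = [0.0] * n_arms # running mean reward per arm
        self.counts = [0] * n_arms      # pulls per arm
        self.last_pull = [0] * n_arms   # timestep of the last pull per arm
        self.t = 0                      # global timestep

    def staleness(self, arm):
        # Uncertainty bonus grows linearly with time since the last sample,
        # so long-unvisited arms eventually demand re-exploration.
        return self.growth * (self.t - self.last_pull[arm])

    def select(self):
        # Greedy over estimated value plus the staleness bonus.
        scores = [self.estimates[a] + self.staleness(a)
                  for a in range(self.n_arms)]
        return max(range(self.n_arms), key=lambda a: scores[a])

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        self.last_pull[arm] = self.t
        # Incremental mean update of the reward estimate.
        self.estimates[arm] += (reward - self.estimates[arm]) / self.counts[arm]
```

In a stationary setting this bonus would be wasteful, but when the reward model can change, it forces periodic re-checks of neglected arms: after enough pulls of a seemingly good arm, the untouched arm's staleness bonus exceeds the incumbent's estimated value and `select` switches to it.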


DOI: https://doi.org/10.1007/978-3-319-26327-4_2