Published in: Autonomous Agents and Multi-Agent Systems 5/2017

Published: 13.10.2016

An exploration strategy for non-stationary opponents

Authors: Pablo Hernandez-Leal, Yusen Zhan, Matthew E. Taylor, L. Enrique Sucar, Enrique Munoz de Cote

Abstract

The success or failure of any learning algorithm is partially due to the exploration strategy it employs. However, most exploration strategies assume that the environment is stationary and non-strategic. In this work we shed light on how to design exploration strategies in non-stationary and adversarial environments. Our proposed adversarial drift exploration (DE) efficiently explores the state space while keeping track of regions of the environment that have changed. This exploration is general enough to be applied in single-agent non-stationary environments as well as in multiagent settings where the opponent changes its strategy over time. We use a two-agent strategic interaction setting to test this new type of exploration, where the opponent switches between different behavioral patterns to emulate a non-deterministic, stochastic, and adversarial environment. The agent's objective is to learn a model of the opponent's strategy in order to act optimally. Our contribution is twofold. First, we present DE as a strategy for switch detection. Second, we propose a new algorithm, R-max#, for learning and planning against non-stationary opponents. To handle such opponents, R-max# reasons and acts in terms of two objectives: (1) to maximize utilities in the short term while learning, and (2) to eventually explore in order to detect opponent behavioral changes. We provide theoretical results showing that R-max# is guaranteed to detect the opponent's switch and learn a new model with finite sample complexity. R-max# makes efficient use of exploration experiences, which results in rapid adaptation and efficient DE to deal with the non-stationary nature of the opponent. We show experimentally that using DE outperforms state-of-the-art algorithms that were explicitly designed for modeling opponents (in terms of average rewards) in two complementary domains.
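The abstract describes drift exploration as keeping the agent revisiting parts of the state space whose model may have silently changed. The sketch below is not the paper's pseudocode; the staleness threshold `tau`, the known-ness threshold `m`, and all names are illustrative assumptions. It shows one minimal way such a mechanism could be wired into an R-MAX-style optimistic model, where stale state-action pairs become optimistic again and a planner using these estimates is driven back to them.

```python
# Minimal sketch (assumed, not the authors' implementation) of the drift-exploration
# idea: an R-MAX-style model in which state-action pairs that have not been visited
# for `tau` steps are treated as unknown again, forcing re-exploration so that an
# opponent's strategy switch can be detected.
from collections import defaultdict

class DriftExplorationModel:
    def __init__(self, m=5, tau=50, r_max=1.0):
        self.m = m            # visits needed before a pair counts as "known"
        self.tau = tau        # staleness threshold (steps) before forcing re-exploration
        self.r_max = r_max    # optimistic reward assigned to unknown or stale pairs
        self.counts = defaultdict(int)      # (state, action) -> visit count
        self.last_visit = defaultdict(int)  # (state, action) -> last time step visited
        self.reward_sum = defaultdict(float)

    def update(self, state, action, reward, t):
        """Record an observed transition at time step t."""
        self.counts[(state, action)] += 1
        self.last_visit[(state, action)] = t
        self.reward_sum[(state, action)] += reward

    def expected_reward(self, state, action, t):
        """Optimistic estimate: unknown or stale pairs look maximally rewarding,
        so planning with these values steers the agent back to them."""
        key = (state, action)
        known = self.counts[key] >= self.m
        stale = t - self.last_visit[key] > self.tau
        if not known or stale:
            return self.r_max
        return self.reward_sum[key] / self.counts[key]
```

Under these assumptions, an estimate that was "known" decays back to the optimistic value after `tau` steps without a visit, which is the re-exploration behavior the abstract attributes to R-max#.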

Footnotes
1
Godfather [33] offers the opponent a situation where it can obtain a high reward. If the opponent does not accept the offer, Godfather forces the opponent to obtain a low reward.
 
2
A related behavior called observationally equivalent models has been reported by Doshi et al. [17].
 
3
To ensure DE, a constant \(\epsilon\)-greedy exploration with \(\epsilon = 0.2\) was used and the learning rate (\(\alpha\)) was not decayed.
 
4
The optimal policies are always-cooperate, Pavlov, and always-defect against the opponents TFT, Pavlov, and Bully, respectively.
 
References
1. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3), 235–256.
3. Babes, M., Munoz de Cote, E., & Littman, M. L. (2008). Social reward shaping in the prisoner's dilemma. In Proceedings of the 7th International Conference on Autonomous Agents and Multiagent Systems (pp. 1389–1392). Estoril: International Foundation for Autonomous Agents and Multiagent Systems.
4. Banerjee, B., & Peng, J. (2005). Efficient learning of multi-step best response. In Proceedings of the 4th International Conference on Autonomous Agents and Multiagent Systems (pp. 60–66). Utrecht: ACM.
5. Bard, N., Johanson, M., Burch, N., & Bowling, M. (2013). Online implicit agent modelling. In Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems (pp. 255–262). Saint Paul, MN: International Foundation for Autonomous Agents and Multiagent Systems.
6. Bowling, M., & Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2), 215–250.
7. Brafman, R. I., & Tennenholtz, M. (2003). R-MAX: a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3, 213–231.
8. Busoniu, L., Babuska, R., & De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, 38(2), 156–172.
9. Carmel, D., & Markovitch, S. (1999). Exploration strategies for model-based learning in multi-agent systems. Autonomous Agents and Multi-Agent Systems, 2(2), 141–172.
10. Cawley, G. C., & Talbot, N. L. C. (2003). Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition, 36(11), 2585–2592.
11. Chakraborty, D., Agmon, N., & Stone, P. (2013). Targeted opponent modeling of memory-bounded agents. In Proceedings of the Adaptive Learning Agents Workshop (ALA). Saint Paul, MN, USA.
12. Chakraborty, D., & Stone, P. (2008). Online multiagent learning against memory bounded adversaries. In Machine Learning and Knowledge Discovery in Databases (pp. 211–226). Berlin: Springer.
13. Chakraborty, D., & Stone, P. (2013). Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-Agent Systems, 28(2), 182–213.
14. Choi, S. P. M., Yeung, D. Y., & Zhang, N. L. (1999). An environment model for nonstationary reinforcement learning. In Advances in Neural Information Processing Systems (pp. 987–993). Denver, CO, USA.
15. Da Silva, B. C., Basso, E. W., Bazzan, A. L., & Engel, P. M. (2006). Dealing with non-stationary environments using context detection. In Proceedings of the 23rd International Conference on Machine Learning (pp. 217–224). Pittsburgh, PA, USA.
16. Dietterich, T. G. (2000). Ensemble methods in machine learning. In Multiple Classifier Systems (pp. 1–15). Berlin: Springer.
17. Doshi, P., & Gmytrasiewicz, P. J. (2006). On the difficulty of achieving equilibrium in interactive POMDPs. In Twenty-First National Conference on Artificial Intelligence (pp. 1131–1136). Boston, MA, USA.
18. Elidrisi, M., Johnson, N., & Gini, M. (2012). Fast learning against adaptive adversarial opponents. In Proceedings of the Adaptive Learning Agents Workshop (ALA). Valencia, Spain.
19. Elidrisi, M., Johnson, N., Gini, M., & Crandall, J. W. (2014). Fast adaptive learning in repeated stochastic games by game abstraction. In Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems (pp. 1141–1148). Paris, France.
20. Fulda, N., & Ventura, D. (2007). Predicting and preventing coordination problems in cooperative Q-learning systems. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (pp. 780–785). Hyderabad, India.
21. Garivier, A., & Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In Algorithmic Learning Theory (pp. 174–188). Berlin: Springer.
22. Geibel, P. (2001). Reinforcement learning with bounded risk. In Proceedings of the Eighteenth International Conference on Machine Learning (pp. 162–169). Williamstown, MA: Morgan Kaufmann Publishers Inc.
23. Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, 41(2), 148–177.
24. Hans, A., Schneegaß, D., Schäfer, A. M., & Udluft, S. (2008). Safe exploration for reinforcement learning. In European Symposium on Artificial Neural Networks (pp. 143–148). Bruges, Belgium.
25. Hernandez-Leal, P., Munoz de Cote, E., & Sucar, L. E. (2013). Modeling non-stationary opponents. In Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems (pp. 1135–1136). Saint Paul, MN: International Foundation for Autonomous Agents and Multiagent Systems.
26. Hernandez-Leal, P., Munoz de Cote, E., & Sucar, L. E. (2014). A framework for learning and planning against switching strategies in repeated games. Connection Science, 26(2), 103–122.
27. Hernandez-Leal, P., Munoz de Cote, E., & Sucar, L. E. (2014). Exploration strategies to detect strategy switches. In Proceedings of the Adaptive Learning Agents Workshop (ALA). Paris, France.
28. Hernandez-Leal, P., Taylor, M. E., Rosman, B., Sucar, L. E., & Munoz de Cote, E. (2016). Identifying and tracking switching, non-stationary opponents: A Bayesian approach. In Multiagent Interaction without Prior Coordination Workshop at AAAI. Phoenix, AZ, USA.
29. HolmesParker, C., Taylor, M. E., Agogino, A., & Tumer, K. (2014). CLEANing the reward: Counterfactual actions to remove exploratory action noise in multiagent learning. In Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems (pp. 1353–1354). Paris: International Foundation for Autonomous Agents and Multiagent Systems.
30. Kakade, S. M. (2003). On the sample complexity of reinforcement learning. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London.
31. Lazaric, A., Munoz de Cote, E., & Gatti, N. (2007). Reinforcement learning in extensive form games with incomplete information: The bargaining case study. In Proceedings of the 6th International Conference on Autonomous Agents and Multiagent Systems. Honolulu, HI: ACM.
32. Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning (pp. 157–163). New Brunswick, NJ.
33. Littman, M. L., & Stone, P. (2001). Implicit negotiation in repeated games. In ATAL '01: Revised Papers from the 8th International Workshop on Intelligent Agents VIII.
34. Lopes, M., Lang, T., Toussaint, M., & Oudeyer, P. Y. (2012). Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems (pp. 206–214). Lake Tahoe, NV.
35. MacAlpine, P., Urieli, D., Barrett, S., Kalyanakrishnan, S., Barrera, F., Lopez-Mobilia, A., Ştiurcă, N., Vu, V., & Stone, P. (2012). UT Austin Villa 2011: A champion agent in the RoboCup 3D soccer simulation competition. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (pp. 129–136). Valencia: International Foundation for Autonomous Agents and Multiagent Systems.
36. Marinescu, A., Dusparic, I., Taylor, A., Cahill, V., & Clarke, S. (2015). P-MARL: Prediction-based multi-agent reinforcement learning for non-stationary environments. In Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems.
37. Mohan, Y., & Ponnambalam, S. G. (2011). Exploration strategies for learning in multi-agent foraging. In Swarm, Evolutionary, and Memetic Computing 2011 (pp. 17–26). Springer.
38. Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28, 1–16.
39. Mota, P., Melo, F., & Coheur, L. (2015). Modeling students self-studies behaviors. In Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems (pp. 1521–1528). Istanbul, Turkey.
40. Munoz de Cote, E., Chapman, A. C., Sykulski, A. M., & Jennings, N. R. (2010). Automated planning in repeated adversarial games. In Uncertainty in Artificial Intelligence (pp. 376–383). Catalina Island, CA.
41. Munoz de Cote, E., & Jennings, N. R. (2010). Planning against fictitious players in repeated normal form games. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (pp. 1073–1080). Toronto: International Foundation for Autonomous Agents and Multiagent Systems.
42. Ng, A. Y., Harada, D., & Russell, S. J. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (pp. 278–287). Bled, Slovenia.
43. Puterman, M. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.
44. Rejeb, L., Guessoum, Z., & M'Hallah, R. (2005). An adaptive approach for the exploration–exploitation dilemma for learning agents. In Proceedings of the 4th International Central and Eastern European Conference on Multi-Agent Systems and Applications (pp. 316–325). Budapest: Springer.
45. Stahl, I. (1972). Bargaining theory. Stockholm: Stockholm School of Economics.
46. Stone, P., & Veloso, M. (2000). Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3), 345–383.
47. Suematsu, N., & Hayashi, A. (2002). A multiagent reinforcement learning algorithm using extended optimal response. In Proceedings of the 1st International Conference on Autonomous Agents and Multiagent Systems (pp. 370–377). Bologna: ACM.
48. Taylor, M. E., & Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10, 1633–1685.
49. Vrancx, P., Gurzi, P., Rodriguez, A., Steenhaut, K., & Nowe, A. (2015). A reinforcement learning approach for interdomain routing with link prices. ACM Transactions on Autonomous and Adaptive Systems, 10(1), 1–26.
50. Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.
51. Weinberg, M., & Rosenschein, J. S. (2004). Best-response multiagent learning in non-stationary environments. In Proceedings of the 3rd International Conference on Autonomous Agents and Multiagent Systems (pp. 506–513). New York: IEEE Computer Society.
52. Zinkevich, M. A., Bowling, M., & Wunder, M. (2011). The lemonade stand game competition: Solving unsolvable games. SIGecom Exchanges, 10(1), 35–38.
Metadata
Title
An exploration strategy for non-stationary opponents
Authors
Pablo Hernandez-Leal
Yusen Zhan
Matthew E. Taylor
L. Enrique Sucar
Enrique Munoz de Cote
Publication date
13.10.2016
Publisher
Springer US
Published in
Autonomous Agents and Multi-Agent Systems / Issue 5/2017
Print ISSN: 1387-2532
Electronic ISSN: 1573-7454
DOI
https://doi.org/10.1007/s10458-016-9347-3
