Learning potential functions and their representations for multi-task reinforcement learning

Authors: Matthijs Snel, Shimon Whiteson

Published in: Autonomous Agents and Multi-Agent Systems, Issue 4/2014 (published 01-07-2014)

Abstract

In multi-task learning, there are roughly two approaches to discovering representations. The first is to discover task relevant representations, i.e., those that compactly represent solutions to particular tasks. The second is to discover domain relevant representations, i.e., those that compactly represent knowledge that remains invariant across many tasks. In this article, we propose a new approach to multi-task learning that captures domain-relevant knowledge by learning potential-based shaping functions, which augment a task’s reward function with artificial rewards. We address two key issues that arise when deriving potential functions. The first is what kind of target function the potential function should approximate; we propose three such targets and show empirically that which one is best depends critically on the domain and learning parameters. The second issue is the representation for the potential function. This article introduces the notion of \(k\)-relevance, the expected relevance of a representation on a sample sequence of \(k\) tasks, and argues that this is a unifying definition of relevance of which both task and domain relevance are special cases. We prove formally that, under certain assumptions, \(k\)-relevance converges monotonically to a fixed point as \(k\) increases, and use this property to derive Feature Selection Through Extrapolation of k-relevance (FS-TEK), a novel feature-selection algorithm. We demonstrate empirically the benefit of FS-TEK on artificial domains.
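
As a brief reminder of the shaping mechanism referred to above (the specific target functions proposed in the article are defined in its body, which is not reproduced here), the artificial rewards take the standard potential-based form of Ng et al. [53], together with the state-action ("look-ahead advice") variant of Wiewiora et al. [88] discussed in footnote 1:

\[
F(s, a, s') = \gamma \Phi(s') - \Phi(s),
\qquad
F(s, a, s', a') = \gamma \Phi(s', a') - \Phi(s, a),
\]

where \(\Phi\) is the (learned) potential function, \(\gamma\) the discount factor, and the agent learns from the augmented reward \(R(s, a, s') + F(\cdot)\).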

Footnotes
1. The authors termed these potential-based advice; the formula introduced here corresponds specifically to look-ahead advice. We use the term “shaping” for both methods, and let function arguments resolve any ambiguity.
 
2. Relevance is not a measure in the strict mathematical sense: because of dependence between feature sets, \(\rho(\mathsf{F} \cup \mathsf{G}) \ne \rho(\mathsf{F}) + \rho(\mathsf{G})\) for some disjoint feature sets \(\mathsf{F}\) and \(\mathsf{G}\) and relevance \(\rho\).
 
3. We employ a standard real-valued GA with population size 100, no crossover, and mutation with probability \(p=0.5\); mutation adds a random value \(\delta \in [-0.05, 0.05]\). Policies are constructed by a softmax distribution over the chromosome values. (A minimal sketch of this setup follows.)
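
The sketch below illustrates the mutation operator and softmax policy construction described in footnote 3. It is only a sketch under assumptions not stated in the footnote: the mutation probability is treated as per-gene, the chromosome is laid out with one value per action for a given state, and fitness evaluation and selection are omitted.

import numpy as np

def mutate(chromosome, p=0.5, delta=0.05, rng=None):
    """Real-valued mutation, no crossover: each gene is perturbed with
    probability p by a value drawn uniformly from [-delta, delta]."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(chromosome.shape) < p
    noise = rng.uniform(-delta, delta, size=chromosome.shape)
    return chromosome + mask * noise

def softmax_policy(values):
    """Action probabilities as a softmax over the chromosome values
    associated with one state (one value per action)."""
    z = values - np.max(values)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Population of 100 real-valued chromosomes; here 4 actions for one state.
rng = np.random.default_rng(0)
population = rng.normal(size=(100, 4))
offspring = np.array([mutate(c, rng=rng) for c in population])
print(softmax_policy(offspring[0]))   # 4 action probabilities summing to 1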
 
4. Note that the addition of this sensor is not the same as the manual separation of state features for the value and potential function as done in [34, 63]; see related work (Sect. 6). In the experiments reported in this section, both functions use the exact same set of features.
 
5. In the policy improvement step, the policy is made only \(\varepsilon\)-greedy w.r.t. the value function.
 
Literature
1. Albus, J. S. (1971). A theory of cerebellar function. Mathematical Biosciences, 10, 25–61.
2. Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73(3), 243–272.
3. Asmuth, J., Littman, M., & Zinkov, R. (2008). Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (pp. 604–609). Cambridge: The AAAI Press.
4. Babes, M., de Cote, E. M., & Littman, M. L. (2008). Social reward shaping in the prisoner’s dilemma. In 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008) (pp. 1389–1392).
5. Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research (JAIR), 12, 149–198.
6. Bertsekas, D. P. (1995). Dynamic programming and optimal control. Belmont: Athena.
7. Boutilier, C., Dearden, R., & Goldszmidt, M. (2000). Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1–2), 49–107.
9. Caruana, R. (2005). Inductive transfer retrospective and review. In NIPS 2005 Workshop on Inductive Transfer: 10 Years Later.
10. Devlin, S., Grześ, M., & Kudenko, D. (2011). Multi-agent reward shaping for RoboCup KeepAway. In AAMAS (pp. 1227–1228).
11. Devlin, S., & Kudenko, D. (2011). Theoretical considerations of potential-based reward shaping for multi-agent systems. In AAMAS ’11 (pp. 225–232).
12. Devlin, S., & Kudenko, D. (2012). Dynamic potential-based reward shaping. In AAMAS (pp. 433–440).
13. Diuk, C., Li, L., & Leffler, B. R. (2009). The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In ICML (p. 32).
14. Dorigo, M., & Colombetti, M. (1994). Robot shaping: Developing autonomous agents through learning. Artificial Intelligence, 71(2), 321–370.
15. Elfwing, S., Uchibe, E., Doya, K., & Christensen, H. (2008). Co-evolution of shaping rewards and meta-parameters in reinforcement learning. Adaptive Behavior, 16(6), 400–412.
16. Elfwing, S., Uchibe, E., Doya, K., & Christensen, H. I. (2011). Darwinian embodied evolution of the learning ability for survival. Adaptive Behavior, 19(2), 101–120.
17. Erez, T., & Smart, W. (2008). What does shaping mean for computational reinforcement learning? In 7th IEEE International Conference on Development and Learning (ICDL 2008) (pp. 215–219).
18. Ferguson, K., & Mahadevan, S. (2006). Proto-transfer learning in Markov decision processes using spectral methods. In ICML Workshop on Structural Knowledge Transfer for Machine Learning.
19. Ferrante, E., Lazaric, A., & Restelli, M. (2008). Transfer of task representation in reinforcement learning using policy-based proto-value functions. In AAMAS (pp. 1329–1332).
20. Foster, D. J., & Dayan, P. (2002). Structure in the space of value functions. Machine Learning, 49(2–3), 325–346.
21. Frommberger, L. (2011). Task space tile coding: In-task and cross-task generalization in reinforcement learning. In Proceedings of the 9th European Workshop on Reinforcement Learning (EWRL9).
22. Frommberger, L., & Wolter, D. (2010). Structural knowledge transfer by spatial abstraction for reinforcement learning agents. Adaptive Behavior, 18(6), 507–525.
23. Geramifard, A., Doshi, F., Redding, J., Roy, N., & How, J. P. (2011). Online discovery of feature dependencies. In ICML (pp. 881–888).
24. Grześ, M., & Kudenko, D. (2009). Learning shaping rewards in model-based reinforcement learning. In Proceedings of the AAMAS 2009 Workshop on Adaptive Learning Agents.
25. Grześ, M., & Kudenko, D. (2009). Theoretical and empirical analysis of reward shaping in reinforcement learning. In ICMLA (pp. 337–344).
26. Grześ, M., & Kudenko, D. (2010). Online learning of shaping rewards in reinforcement learning. Neural Networks, 23(4), 541–550.
27. Gullapalli, V., & Barto, A. G. (1992). Shaping as a method for accelerating reinforcement learning. In Proceedings of the IEEE International Symposium on Intelligent Control (pp. 554–559).
28. Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
29. Hachiya, H., & Sugiyama, M. (2010). Feature selection for reinforcement learning: Evaluating implicit state-reward dependency via conditional mutual information. In ECML/PKDD (pp. 474–489).
30. Jong, N. K., & Stone, P. (2005). State abstraction discovery from irrelevant state variables. In IJCAI-05.
31. Kakade, S. M. (2003). On the sample complexity of reinforcement learning. Ph.D. thesis, University College London, London.
32. Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In ICML (pp. 284–292).
33. Kolter, J. Z., & Ng, A. Y. (2009). Regularization and feature selection in least-squares temporal difference learning. In ICML (p. 66).
34. Konidaris, G., & Barto, A. (2006). Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning (pp. 489–496).
35. Konidaris, G., Scheidwasser, I., & Barto, A. G. (2012). Transfer in reinforcement learning via shared features. Journal of Machine Learning Research, 13, 1333–1371.
36. Koren, Y., & Borenstein, J. (1991). Potential field methods and their inherent limitations for mobile robot navigation. In Proceedings of the IEEE Conference on Robotics and Automation (pp. 1398–1404).
37. Kroon, M., & Whiteson, S. (2009). Automatic feature selection for model-based reinforcement learning in factored MDPs. In ICMLA 2009: Proceedings of the Eighth International Conference on Machine Learning and Applications (pp. 324–330).
38. Laud, A., & DeJong, G. (2002). Reinforcement learning and shaping: Encouraging intended behaviors. In Proceedings of the 19th International Conference on Machine Learning (pp. 355–362).
39. Laud, A., & DeJong, G. (2003). The influence of reward on the speed of reinforcement learning: An analysis of shaping. In ICML (pp. 440–447).
40. Lazaric, A. (2008). Knowledge transfer in reinforcement learning. Ph.D. thesis, Politecnico di Milano, Milan.
41. Lazaric, A., & Ghavamzadeh, M. (2010). Bayesian multi-task reinforcement learning. In ICML (pp. 599–606).
42. Lazaric, A., Restelli, M., & Bonarini, A. (2008). Transfer of samples in batch reinforcement learning. In ICML (pp. 544–551).
43. Li, L., Walsh, T. J., & Littman, M. L. (2006). Towards a unified theory of state abstraction for MDPs. In Artificial Intelligence and Mathematics.
44. Lu, X., Schwartz, H. M., & Givigi, S. N. (2011). Policy invariance under reward transformations for general-sum stochastic games. Journal of Artificial Intelligence Research (JAIR), 41, 397–406.
45. Maclin, R., & Shavlik, J. W. (1996). Creating advice-taking reinforcement learners. Machine Learning, 22(1–3), 251–281.
46. Mahadevan, S. (2010). Representation discovery in sequential decision making. In AAAI.
47. Manoonpong, P., Wörgötter, F., & Morimoto, J. (2010). Extraction of reward-related feature space using correlation-based and reward-based learning methods. In ICONIP (Vol. 1, pp. 414–421).
48. Marquardt, D. (1963). An algorithm for least-squares estimation of nonlinear parameters. SIAM Journal of Applied Mathematics, 11, 431–441.
49. Marthi, B. (2007). Automatic shaping and decomposition of reward functions. In Proceedings of the 24th International Conference on Machine Learning (pp. 601–608).
50. Matarić, M. J. (1994). Reward functions for accelerated learning. In Proceedings of the 11th International Conference on Machine Learning.
51. Mehta, N., Natarajan, S., Tadepalli, P., & Fern, A. (2008). Transfer in variable-reward hierarchical reinforcement learning. Machine Learning, 73(3), 289–312.
52. Midtgaard, M., Vinther, L., Christiansen, J. R., Christensen, A. M., & Zeng, Y. (2010). Time-based reward shaping in real-time strategy games. In Proceedings of the 6th International Conference on Agents and Data Mining Interaction (ADMI’10) (pp. 115–125). Berlin, Heidelberg: Springer-Verlag.
53. Ng, A., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning.
54. Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., & Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In ICML (pp. 752–759).
55. Petrik, M., Taylor, G., Parr, R., & Zilberstein, S. (2010). Feature selection using regularization in approximate linear programs for Markov decision processes. In ICML (pp. 871–878).
56. Proper, S., & Tumer, K. (2012). Modeling difference rewards for multiagent learning (extended abstract). In AAMAS, Valencia, Spain.
57. Randløv, J., & Alstrøm, P. (1998). Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the 15th International Conference on Machine Learning.
58. Rummery, G., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG-RT 116, Engineering Department, Cambridge University, Cambridge.
59. Saksida, L. M., Raymond, S. M., & Touretzky, D. S. (1997). Shaping robot behavior using principles from instrumental conditioning. Robotics and Autonomous Systems, 22(3–4), 231–249.
60. van Seijen, H., Whiteson, S., & Kester, L. (2010). Switching between representations in reinforcement learning. In Interactive Collaborative Information Systems (pp. 65–84).
61. Selfridge, O., Sutton, R. S., & Barto, A. G. (1985). Training and tracking in robotics. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence.
62. Sherstov, A. A., & Stone, P. (2005). Improving action selection in MDP’s via knowledge transfer. In Proceedings of the Twentieth National Conference on Artificial Intelligence.
63. Singh, S., Lewis, R., & Barto, A. (2009). Where do rewards come from? In Proceedings of the 31st Annual Conference of the Cognitive Science Society (pp. 2601–2606).
64. Singh, S., & Sutton, R. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1), 123–158.
65. Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3), 323–339.
66. Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In ICML (pp. 284–292).
67. Skinner, B. F. (1938). The behavior of organisms: An experimental analysis. New York: Appleton-Century-Crofts.
68. Snel, M., & Whiteson, S. (2010). Multi-task evolutionary shaping without pre-specified representations. In Genetic and Evolutionary Computation Conference (GECCO’10).
69. Snel, M., & Whiteson, S. (2011). Multi-task reinforcement learning: Shaping and feature selection. In Proceedings of the European Workshop on Reinforcement Learning (EWRL).
70. Sorg, J., & Singh, S. (2009). Transfer via soft homomorphisms. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009) (pp. 741–748).
71. Strehl, A. L., Diuk, C., & Littman, M. L. (2007). Efficient structure learning in factored-state MDPs. In AAAI (pp. 645–650).
72. Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
73. Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: The MIT Press.
74. Tanaka, F., & Yamamura, M. (2003). Multitask reinforcement learning on the distribution of MDPs. In Proceedings of the 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA 2003) (pp. 1108–1113).
75. Taylor, J., Precup, D., & Panangaden, P. (2009). Bounding performance loss in approximate MDP homomorphisms. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems (Vol. 21, pp. 1649–1656).
76. Taylor, M., & Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1), 1633–1685.
77. Taylor, M., Stone, P., & Liu, Y. (2007). Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research, 8(1), 2125–2167.
78. Taylor, M. E., Whiteson, S., & Stone, P. (2007). Transfer via inter-task mappings in policy search reinforcement learning. In AAMAS (p. 37).
79. Thrun, S. (1995). Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems (pp. 640–646).
80. Torrey, L., Shavlik, J. W., Walker, T., & Maclin, R. (2010). Transfer learning via advice taking. In Advances in Machine Learning I (pp. 147–170). New York: Springer.
81. Torrey, L., Walker, T., Shavlik, J. W., & Maclin, R. (2005). Using advice to transfer knowledge acquired in one reinforcement learning task to another. In Proceedings of the Sixteenth European Conference on Machine Learning (ECML 2005) (pp. 412–424).
82. Vlassis, N., Littman, M. L., & Barber, D. (2011). On the computational complexity of stochastic controller optimization in POMDPs. CoRR, abs/1107.3090.
83. Walsh, T. J., Li, L., & Littman, M. L. (2006). Transferring state abstractions between MDPs. In ICML-06 Workshop on Structural Knowledge Transfer for Machine Learning.
84. Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.
85. Whitehead, S. D. (1991). A complexity analysis of cooperative mechanisms in reinforcement learning. In Proceedings of AAAI-91 (pp. 607–613).
86. Whiteson, S., Tanner, B., Taylor, M. E., & Stone, P. (2011). Protecting against evaluation overfitting in empirical reinforcement learning. In ADPRL 2011: Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (pp. 120–127).
87. Wiewiora, E. (2003). Potential-based shaping and Q-value initialization are equivalent. Journal of Artificial Intelligence Research, 19, 205–208.
88. Wiewiora, E., Cottrell, G., & Elkan, C. (2003). Principled methods for advising reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning (pp. 792–799).
89. Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007). Multi-task reinforcement learning: A hierarchical Bayesian approach. In ICML (pp. 1015–1022).
Metadata
Title: Learning potential functions and their representations for multi-task reinforcement learning
Authors: Matthijs Snel, Shimon Whiteson
Publication date: 01-07-2014
Publisher: Springer US
Published in: Autonomous Agents and Multi-Agent Systems, Issue 4/2014
Print ISSN: 1387-2532
Electronic ISSN: 1573-7454
DOI: https://doi.org/10.1007/s10458-013-9235-z
