After reviewing the main ingredients of the UCRL algorithm and its analysis for online reinforcement learning (exploration vs. exploitation, optimism in the face of uncertainty, consistency with observations and upper confidence bounds, regret analysis), I show how these techniques can also be used to derive PAC-MDP bounds that match the best currently available bounds for both the discounted and the undiscounted setting. As is typical in reinforcement learning, the analysis for the undiscounted setting is significantly more involved.
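To make the optimism principle concrete, here is a minimal sketch (not the UCRL algorithm itself, which builds confidence regions over transition probabilities as well): a Hoeffding-style upper confidence bound on a bounded reward, and action selection that picks whichever action looks best under that bound. All names and the confidence parameter `delta` are illustrative assumptions.

```python
import math

def ucb_reward(mean_estimate, n_samples, t, delta=0.05):
    """Hoeffding-style upper confidence bound on a reward in [0, 1].
    With high probability the true mean lies below the returned value."""
    if n_samples == 0:
        return 1.0  # no observations yet: be maximally optimistic
    bonus = math.sqrt(math.log(2 * t / delta) / (2 * n_samples))
    return min(1.0, mean_estimate + bonus)

def optimistic_action(means, counts, t):
    """Optimism in the face of uncertainty: among all actions,
    choose the one with the largest upper confidence bound."""
    return max(range(len(means)),
               key=lambda a: ucb_reward(means[a], counts[a], t))
```

Note how a rarely tried action receives a large exploration bonus, so the optimistic choice automatically balances exploration against exploitation.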
In the second part of my talk I consider a model for autonomous exploration, in which an agent learns about its environment and how to navigate in it. While evaluating autonomous exploration is typically difficult, rigorous performance bounds can be derived in the presented setting. To this end, I present an algorithm that explores optimistically by repeatedly choosing the apparently closest unknown state, as indicated by an optimistic policy, for further exploration.
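The core step, finding an apparently closest unknown state, can be sketched as a breadth-first search over the part of the environment the agent already knows. This is only an illustrative simplification: it assumes unit step costs and a hypothetical learned model exposed through the helper callables `known_successors` and `is_known`, whereas the actual algorithm works with optimistic policies under uncertain dynamics.

```python
from collections import deque

def closest_unknown_state(start, known_successors, is_known):
    """Breadth-first search through known states, returning an
    (apparently) closest state not yet marked as known.
    `known_successors(s)` lists states reachable from s in one step;
    both helpers are hypothetical stand-ins for the learned model."""
    frontier = deque([start])
    seen = {start}
    while frontier:
        s = frontier.popleft()
        for nxt in known_successors(s):
            if nxt in seen:
                continue
            if not is_known(nxt):
                return nxt  # first unknown state reached is a closest one
            seen.add(nxt)
            frontier.append(nxt)
    return None  # every reachable state is already known
```

The exploration loop would then repeatedly navigate to the returned state, gather samples there until it counts as known, and search again.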
This is joint work with Shiau Hong Lim. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 231495 (CompLACS).