Most reinforcement learning algorithms assume that the system to be controlled can be accurately approximated given the measurements and the available resources. However, this assumption is overly optimistic for many problems of practical interest: real-world problems are messy. For example, the number of unobserved variables influencing the dynamics can be very large, and the governing dynamics can be highly complicated. How, then, can one ask for near-optimal performance without requiring an enormous amount of data? In this talk we explore an alternative to the standard criterion of near-optimality, based on the concept of regret, borrowed from the online learning literature. Under this alternative criterion, the performance of a learning algorithm is measured by comparing the total reward it collects with the total reward that could have been collected by the best policy from a fixed policy class, the best policy being determined in hindsight. How can we design algorithms that keep the regret small? Do we need to change existing algorithm designs? Following the initial steps made by Even-Dar et al. and Yu et al., I will discuss some of our new results that shed light on these questions.
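To make the regret criterion concrete, here is a minimal sketch of how it could be computed after the fact. It rests on a strong simplifying assumption that is not part of the talk: the per-step rewards each candidate policy would have collected are taken as given, which ignores how a policy's rewards depend on the state trajectory it induces. The function name `regret` and the toy numbers are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def regret(learner_rewards: Sequence[float],
           policy_rewards: Sequence[Sequence[float]]) -> float:
    """Regret of the learner against the best fixed policy in hindsight.

    learner_rewards: rewards the learning algorithm actually collected.
    policy_rewards:  one reward sequence per policy in the fixed policy
                     class (assumed known here, purely for illustration).
    """
    learner_total = sum(learner_rewards)
    # The comparator is chosen only after all rewards have been seen.
    best_in_hindsight = max(sum(rewards) for rewards in policy_rewards)
    return best_in_hindsight - learner_total


# Toy data: four time steps, a policy class with two policies.
learner = [0.0, 1.0, 0.0, 1.0]
policies = [
    [1.0, 0.0, 1.0, 0.0],  # total reward 2.0
    [1.0, 1.0, 0.0, 1.0],  # total reward 3.0
]
print(regret(learner, policies))  # 3.0 - 2.0 = 1.0
```

Keeping the regret small means that this difference should grow slowly relative to the number of time steps, so that the learner's average reward approaches that of the best policy in the class.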
The talk is based on joint work with Gergely Neu, Andras Gyorgy and Andras Antos.