About this Book

This book constitutes revised and selected papers of the 9th European Workshop on Reinforcement Learning, EWRL 2011, which took place in Athens, Greece, in September 2011. The papers presented were carefully reviewed and selected from 40 submissions. They are organized in topical sections on online reinforcement learning; learning and exploring MDPs; function approximation methods for reinforcement learning; macro-actions in reinforcement learning; policy search and bounds; multi-task and transfer reinforcement learning; multi-agent reinforcement learning; apprenticeship and inverse reinforcement learning; and real-world reinforcement learning.

Table of Contents

Frontmatter

Invited Talk Abstracts

Invited Talk: UCRL and Autonomous Exploration

After reviewing the main ingredients of the UCRL algorithm and its analysis for online reinforcement learning — exploration vs. exploitation, optimism in the face of uncertainty, consistency with observations and upper confidence bounds, regret analysis — I show how these techniques can also be used to derive PAC-MDP bounds which match the best currently available bounds for the discounted and the undiscounted setting. As is typical for reinforcement learning, the analysis for the undiscounted setting is significantly more involved.

In the second part of my talk I consider a model for autonomous exploration, where an agent learns about its environment and how to navigate in it. Whereas evaluating autonomous exploration is typically difficult, in the presented setting rigorous performance bounds can be derived. For that we present an algorithm that optimistically explores, by repeatedly choosing the apparently closest unknown state — as indicated by an optimistic policy — for further exploration.

Acknowledgements.

This is joint work with Shiau Hong Lim. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 231495 (CompLACS).

Peter Auer

Invited Talk: Increasing Representational Power and Scaling Inference in Reinforcement Learning

As robots are starting to perform everyday manipulation tasks, such as cleaning up, setting a table or preparing simple meals, they must become much more knowledgeable than they are today. Natural environments are composed of objects, and the possibilities for manipulating them are highly structured due to the general laws governing our relational world. All of this needs to be taken into account if we want to realize thinking robots that efficiently learn how to accomplish tasks in our relational world.

Triggered by this grand vision, this talk discusses the very promising perspective on the application of Statistical Relational AI techniques to reinforcement learning. Specifically, it reviews existing symbolic dynamic programming and relational RL approaches that exploit the symbolic structure in the solution of relational and first-order logical Markov decision processes. These approaches illustrate that Statistical Relational AI may give new tools for solving the “scaling challenge”. It is sometimes mentioned that scaling RL to real-world scenarios is a core challenge for robotics and AI in general. While this is true in a trivial sense, it might be beside the point. Reasoning and learning on appropriate (e.g. relational) representations leads to another view on the “scaling problem”: often we are facing problems with symmetries not reflected in the structure used by our standard solvers. As additional evidence for this, the talk concludes by presenting our ongoing work on the first lifted linear programming solvers for MDPs. Given an MDP, our approach first constructs a lifted program in which each variable represents a set of original variables that are indistinguishable given the objective function and constraints. It then runs any standard LP solver on this program to solve the original program optimally.

Acknowledgements.

This talk is based on joint work with Babak Ahmadi, Kurt Driessens, Saket Joshi, Roni Khardon, Tobias Lang, Martin Mladenov, Sriraam Natarajan, Scott Sanner, Jude Shavlik, Prasad Tadepalli, and Marc Toussaint.

Kristian Kersting

Invited Talk: PRISM – Practical RL: Representation, Interaction, Synthesis, and Mortality

When scaling up RL to large continuous domains with imperfect representations and hierarchical structure, we often try applying algorithms that are proven to converge in small finite domains, and then just hope for the best. This talk will advocate instead designing algorithms that adhere to the constraints, and indeed take advantage of the opportunities, that might come with the problem at hand. Drawing on several different research threads within the Learning Agents Research Group at UT Austin, I will discuss four types of issues that arise from these constraints and opportunities: 1) Representation – choosing the algorithm for the problem’s representation and adapting the representation to fit the algorithm; 2) Interaction – with other agents and with human trainers; 3) Synthesis – of different algorithms for the same problem and of different concepts in the same algorithm; and 4) Mortality – the opportunity to improve learning based on past experience and the constraint that one can’t explore exhaustively.

Peter Stone

Invited Talk: Towards Robust Reinforcement Learning Algorithms

Most reinforcement learning algorithms assume that the system to be controlled can be accurately approximated given the measurements and the available resources. However, this assumption is overly optimistic for too many problems of practical interest: Real-world problems are messy. For example, the number of unobserved variables influencing the dynamics can be very large and the governing dynamics can be highly complicated. How, then, can one ask for near-optimal performance without requiring an enormous amount of data? In this talk we explore an alternative to this standard criterion, based on the concept of regret, borrowed from the online learning literature. Under this alternative criterion, the performance of a learning algorithm is measured by how much total reward is collected by the algorithm as compared to the total reward that could have been collected by the best policy from a fixed policy class, the best policy being determined in hindsight. How can we design algorithms that keep the regret small? Do we need to change existing algorithm designs? In this talk, following the initial steps made by Even-Dar et al. and Yu et al., I will discuss some of our new results that shed some light on these questions.

Acknowledgements.

The talk is based on joint work with Gergely Neu, Andras Gyorgy and Andras Antos.

Csaba Szepesvári

Online Reinforcement Learning

Automatic Discovery of Ranking Formulas for Playing with Multi-armed Bandits

We propose an approach for automatically discovering formulas for ranking arms while playing with multi-armed bandits.

The approach works by defining a grammar made of basic elements such as addition, subtraction, the max operator, the average value of the rewards collected by an arm, their standard deviation, etc., and by exploiting this grammar to generate and test a large number of formulas. The systematic search for good candidate formulas is carried out by a purpose-built optimization algorithm that navigates this large set of candidate formulas towards those that perform well on some multi-armed bandit problems.

We have applied this approach to a set of bandit problems made of Bernoulli, Gaussian and truncated Gaussian distributions and have identified a few simple ranking formulas that provide interesting results on every problem of this set. In particular, they clearly outperform several reference policies previously introduced in the literature. We argue that these newly found formulas as well as the procedure for generating them may suggest new directions for studying bandit problems.

Francis Maes, Louis Wehenkel, Damien Ernst
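
To make the idea of grammar-generated ranking formulas concrete, here is a minimal Python sketch (an editorial illustration, not the authors' grammar or their purpose-built optimizer): a few hand-written candidate index formulas, built from per-arm statistics (empirical mean, standard deviation, play count, time step), are compared by their average regret on a handful of Bernoulli bandit problems.

import math, random

def run_policy(score, probs, horizon=1000, seed=0):
    # Play one Bernoulli bandit problem with the given ranking formula.
    rng = random.Random(seed)
    n = [0] * len(probs)          # play counts per arm
    s = [0.0] * len(probs)        # sum of rewards per arm
    total = 0.0
    for t in range(1, horizon + 1):
        # Rank arms by the candidate formula; unplayed arms come first.
        def index(k):
            if n[k] == 0:
                return float('inf')
            mean = s[k] / n[k]
            std = math.sqrt(max(mean * (1 - mean), 1e-12))  # plug-in Bernoulli std
            return score(mean, std, n[k], t)
        k = max(range(len(probs)), key=index)
        r = 1.0 if rng.random() < probs[k] else 0.0
        n[k] += 1; s[k] += r; total += r
    # Regret proxy: expected reward of always playing the best arm minus collected reward.
    return max(probs) * horizon - total

# A few hand-written candidate formulas built from mean, std, count and time.
candidates = {
    "mean only": lambda m, sd, nk, t: m,
    "UCB1-like": lambda m, sd, nk, t: m + math.sqrt(2 * math.log(t) / nk),
    "mean + std/sqrt(n)": lambda m, sd, nk, t: m + sd / math.sqrt(nk),
}

problems = [[0.9, 0.6], [0.55, 0.45, 0.35], [0.2, 0.1, 0.1, 0.1]]
for name, f in candidates.items():
    regret = sum(run_policy(f, p, seed=i) for i, p in enumerate(problems)) / len(problems)
    print(f"{name:20s} average regret: {regret:.1f}")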

Goal-Directed Online Learning of Predictive Models

We present an algorithmic approach for integrated learning and planning in predictive representations. The approach extends earlier work on predictive state representations to the case of online exploration, by allowing exploration of the domain to proceed in a goal-directed fashion and thus be more efficient. Our algorithm interleaves online learning of the models with estimation of the value function. The framework is applicable to a variety of important learning problems, including scenarios such as apprenticeship learning, model customization, and decision-making in non-stationary domains.

Sylvie C. W. Ong, Yuri Grinberg, Joelle Pineau

Gradient Based Algorithms with Loss Functions and Kernels for Improved On-Policy Control

We introduce and empirically evaluate two novel online gradient-based reinforcement learning algorithms with function approximation – one model based, and the other model free. These algorithms allow for non-squared loss functions, which is novel in reinforcement learning and appears to offer empirical advantages. We further extend a previous gradient-based algorithm to the case of full control by using generalized policy iteration. Theoretical properties of these algorithms are studied in a companion paper.

Matthew Robards, Peter Sunehag
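
As a concrete illustration of what a non-squared loss can look like in a gradient-based TD update, here is a short sketch under the assumption of a Huber loss on the TD error; the actual losses and algorithms studied in the paper and its companion are not reproduced here.

import numpy as np

def huber_grad(delta, kappa=1.0):
    # Gradient of the Huber loss with respect to the TD error delta.
    return delta if abs(delta) <= kappa else kappa * np.sign(delta)

def td_step_with_loss(theta, phi_s, phi_next, r, gamma=0.99, alpha=0.05, kappa=1.0):
    # One semi-gradient TD(0) step where the usual squared loss on the TD error
    # is replaced by a Huber loss (a hypothetical choice, for illustration only).
    delta = r + gamma * phi_next @ theta - phi_s @ theta
    # Semi-gradient: the bootstrap target is treated as a constant.
    return theta + alpha * huber_grad(delta, kappa) * phi_s

# Tiny usage example with random features.
rng = np.random.default_rng(0)
theta = np.zeros(4)
for _ in range(100):
    phi_s, phi_next = rng.normal(size=4), rng.normal(size=4)
    theta = td_step_with_loss(theta, phi_s, phi_next, r=rng.normal())
print(theta)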

Learning and Exploring MDPs

Active Learning of MDP Models

We consider the active learning problem of inferring the transition model of a Markov Decision Process by acting and observing transitions. This is particularly useful when no reward function is a priori defined. Our proposal is to cast the active learning task as a utility maximization problem using Bayesian reinforcement learning with belief-dependent rewards. After presenting three possible performance criteria, we derive from them the belief-dependent rewards to be used in the decision-making process. As computing the optimal Bayesian value function is intractable for large horizons, we use a simple algorithm to approximately solve this optimization problem. Despite the sub-optimality of this technique, we show experimentally that our proposal is efficient in a number of domains.

Mauricio Araya-López, Olivier Buffet, Vincent Thomas, François Charpillet
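
A minimal sketch of the underlying idea of belief-dependent rewards for model learning, assuming a Dirichlet belief over each row of the transition model and a simple myopic information-gain criterion; the three performance criteria and the algorithm used in the paper are not reproduced here.

import numpy as np

def dirichlet_entropy_of_mean(alpha):
    # Entropy of the mean categorical distribution of a Dirichlet belief.
    p = alpha / alpha.sum()
    return -np.sum(p * np.log(p + 1e-12))

def myopic_info_gain(alpha):
    # Expected entropy reduction of the predictive distribution after one more
    # observation from this (state, action) pair -- one simple belief-dependent
    # reward chosen for illustration, not the criteria used in the paper.
    p = alpha / alpha.sum()
    gain = 0.0
    for s_next, p_next in enumerate(p):
        alpha_new = alpha.copy(); alpha_new[s_next] += 1
        gain += p_next * (dirichlet_entropy_of_mean(alpha) -
                          dirichlet_entropy_of_mean(alpha_new))
    return gain

# Dirichlet counts alpha[s, a, s'] for a toy 3-state, 2-action MDP.
n_states, n_actions = 3, 2
alpha = np.ones((n_states, n_actions, n_states))
rng = np.random.default_rng(0)
true_P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

s = 0
for t in range(200):
    # Greedily pick the action whose belief we expect to sharpen the most.
    a = max(range(n_actions), key=lambda act: myopic_info_gain(alpha[s, act]))
    s_next = rng.choice(n_states, p=true_P[s, a])
    alpha[s, a, s_next] += 1
    s = s_next
print(alpha.sum(axis=2))   # visit counts per (state, action)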

Handling Ambiguous Effects in Action Learning

We study the problem of learning stochastic actions in propositional, factored environments, and more precisely the problem of identifying STRIPS-like effects from transitions in which they are ambiguous. We give an unbiased, maximum likelihood approach, and show that maximally likely actions can be computed efficiently from observations. We also discuss how this study can be used to extend an RL approach for actions with independent effects to one for actions with correlated effects.

Boris Lesner, Bruno Zanuttini

Feature Reinforcement Learning in Practice

Following a recent surge in using history-based methods for resolving perceptual aliasing in reinforcement learning, we introduce an algorithm based on the feature reinforcement learning framework called ΦMDP [13]. To create a practical algorithm we devise a stochastic search procedure for a class of context trees based on parallel tempering and a specialized proposal distribution. We provide the first empirical evaluation for ΦMDP. Our proposed algorithm achieves superior performance to the classical U-tree algorithm [20] and the recent active-LZ algorithm [6], and is competitive with MC-AIXI-CTW [29] that maintains a Bayesian mixture over all context trees up to a chosen depth. We are encouraged by our ability to compete with this sophisticated method using an algorithm that simply picks one single model, and uses Q-learning on the corresponding MDP. Our ΦMDP algorithm is simpler and consumes less time and memory. These results show promise for our future work on attacking more complex and larger problems.

Phuong Nguyen, Peter Sunehag, Marcus Hutter

Function Approximation Methods for Reinforcement Learning

Reinforcement Learning with a Bilinear Q Function

Many reinforcement learning methods are based on a function Q(s, a) whose value is the discounted total reward expected after performing the action a in the state s. This paper explores the implications of representing the Q function as Q(s, a) = s^T W a, where W is a matrix that is learned. In this representation, both s and a are real-valued vectors that may have high dimension. We show that action selection can be done using standard linear programming, and that W can be learned using standard linear regression in the algorithm known as fitted Q iteration. Experimentally, the resulting method learns to solve the mountain car task in a sample-efficient way. The same method is also applicable to an inventory management task where the state space and the action space are continuous and high-dimensional.

Charles Elkan
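
A minimal sketch of fitted Q iteration with the bilinear form Q(s, a) = s^T W a: W is re-fit by ordinary least squares on vec(s a^T) features at each iteration. For brevity, the max over actions is taken over a finite set of candidate action vectors here, rather than via the linear program the paper uses for action selection.

import numpy as np

def bilinear_fqi(transitions, actions, dim_s, dim_a, gamma=0.95, iters=20):
    # Fitted Q iteration with Q(s, a) = s^T W a.
    # transitions: list of (s, a, r, s_next) with s, a as numpy vectors.
    # actions: finite list of candidate action vectors used for the max
    # (a simplification of the paper's LP-based action selection).
    W = np.zeros((dim_s, dim_a))
    # Design matrix: vec(outer(s, a)), so that Q(s, a) = w . vec(s a^T).
    X = np.array([np.outer(s, a).ravel() for s, a, _, _ in transitions])
    R = np.array([r for _, _, r, _ in transitions])
    S_next = np.array([s2 for _, _, _, s2 in transitions])
    A_cand = np.array(actions)
    for _ in range(iters):
        # Bootstrap targets: r + gamma * max_a' s'^T W a'.
        q_next = S_next @ W @ A_cand.T            # shape (n_samples, n_actions)
        y = R + gamma * q_next.max(axis=1)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
        W = w.reshape(dim_s, dim_a)
    return W

# Tiny synthetic usage example.
rng = np.random.default_rng(0)
acts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
data = [(rng.normal(size=3), acts[rng.integers(2)], rng.normal(), rng.normal(size=3))
        for _ in range(500)]
print(bilinear_fqi(data, acts, dim_s=3, dim_a=2))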

ℓ1-Penalized Projected Bellman Residual

We consider the task of feature selection for value function approximation in reinforcement learning. A promising approach consists in combining the Least-Squares Temporal Difference (LSTD) algorithm with ℓ1-regularization, which has proven to be effective in the supervised learning community. This has been done recently with the LARS-TD algorithm, which replaces the projection operator of LSTD with an ℓ1-penalized projection and solves the corresponding fixed-point problem. However, this approach is not guaranteed to be correct in the general off-policy setting. We take a different route by adding an ℓ1-penalty term to the projected Bellman residual, which requires weaker assumptions while offering comparable performance. However, this comes at the cost of a higher computational complexity if only a part of the regularization path is computed. Nevertheless, our approach reduces to a supervised learning problem, which opens the way to easy extensions to other penalties.

Matthieu Geist, Bruno Scherrer
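
To make the "supervised learning problem" reading concrete: with sampled features Φ, next-state features Φ', rewards R and the orthogonal projection Π onto span(Φ), the objective ||Π(R + γΦ'θ) − Φθ||² + λ||θ||₁ is a Lasso with design matrix Φ − γΠΦ' and target ΠR, since ΠΦθ = Φθ. The sketch below sets this up with an off-the-shelf Lasso solver; it is an illustrative reconstruction, not the authors' implementation.

import numpy as np
from sklearn.linear_model import Lasso

def l1_projected_bellman_residual(Phi, Phi_next, R, gamma=0.95, lam=0.01):
    # Sketch of an l1-penalized projected Bellman residual fit:
    # minimize ||Pi(R + gamma*Phi_next@theta) - Phi@theta||^2 + lam*||theta||_1,
    # rewritten as a Lasso with design Phi - gamma*Pi@Phi_next and target Pi@R.
    Pi = Phi @ np.linalg.pinv(Phi.T @ Phi) @ Phi.T   # projection onto span(Phi)
    X = Phi - gamma * (Pi @ Phi_next)
    y = Pi @ R
    # sklearn's Lasso minimizes (1/(2n))||y - Xw||^2 + alpha*||w||_1;
    # alpha plays the role of the l1 weight here.
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    model.fit(X, y)
    return model.coef_

# Tiny synthetic usage example (random features, sparse true weights).
rng = np.random.default_rng(0)
n, k = 300, 10
Phi = rng.normal(size=(n, k))
Phi_next = rng.normal(size=(n, k))
theta_true = np.zeros(k); theta_true[:3] = [1.0, -0.5, 0.25]
R = Phi @ theta_true - 0.95 * Phi_next @ theta_true + 0.01 * rng.normal(size=n)
print(l1_projected_bellman_residual(Phi, Phi_next, R))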

Regularized Least Squares Temporal Difference Learning with Nested ℓ2 and ℓ1 Penalization

The construction of a suitable set of features to approximate value functions is a central problem in reinforcement learning (RL). A popular approach to this problem is to use high-dimensional feature spaces together with least-squares temporal difference learning (LSTD). Although this combination allows for very accurate approximations, it often exhibits poor prediction performance because of overfitting when the number of samples is small compared to the number of features in the approximation space. In the linear regression setting, regularization is commonly used to overcome this problem. In this paper, we review some regularized approaches to policy evaluation and we introduce a novel scheme (L21) which uses ℓ2 regularization in the projection operator and an ℓ1 penalty in the fixed-point step. We show that such a formulation reduces to a standard Lasso problem. As a result, any off-the-shelf solver can be used to compute its solution and standardization techniques can be applied to the data. We report experimental results showing that L21 is effective in avoiding overfitting and that it compares favorably to existing ℓ1-regularized methods.

Matthew W. Hoffman, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos

Recursive Least-Squares Learning with Eligibility Traces

In the framework of Markov Decision Processes, we consider the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We describe a systematic approach for adapting on-policy least-squares learning algorithms of the literature (LSTD [5], LSPE [15], FPKF [7] and GPTD [8]/KTD [10]) to off-policy learning with eligibility traces. This leads to two known algorithms, LSTD(λ)/LSPE(λ) [21], and suggests new extensions of FPKF and GPTD/KTD. We describe their recursive implementation, discuss their convergence properties, and illustrate their behavior experimentally. Overall, our study suggests that the state-of-the-art LSTD(λ) [21] remains the best least-squares algorithm.

Bruno Scherrer, Matthieu Geist
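
For reference, a minimal recursive (Sherman-Morrison) implementation of on-policy LSTD(λ) is sketched below; the off-policy, importance-weighted variants and the FPKF/GPTD/KTD extensions studied in the paper are not reproduced.

import numpy as np

class RecursiveLSTDLambda:
    # Recursive (Sherman-Morrison) on-policy LSTD(lambda) sketch.
    def __init__(self, dim, lam=0.9, gamma=0.99, eps=1e3):
        self.lam, self.gamma = lam, gamma
        self.theta = np.zeros(dim)
        self.z = np.zeros(dim)          # eligibility trace
        self.C = eps * np.eye(dim)      # running inverse of the LSTD A matrix

    def update(self, phi, r, phi_next, done=False):
        self.z = self.gamma * self.lam * self.z + phi
        d = phi - (0.0 if done else self.gamma) * phi_next
        Cz = self.C @ self.z
        K = Cz / (1.0 + d @ Cz)                 # gain vector
        self.theta += K * (r - d @ self.theta)  # value-function weights
        self.C -= np.outer(K, d @ self.C)       # rank-one inverse update
        if done:
            self.z[:] = 0.0
        return self.theta

# Usage example: evaluate a random-walk policy on a 5-state chain with
# one-hot features (reward 1 on stepping off the right end).
rng = np.random.default_rng(0)
n = 5
phi = np.eye(n)
learner = RecursiveLSTDLambda(dim=n)
for episode in range(200):
    s = 2
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        done = s_next < 0 or s_next >= n
        r = 1.0 if s_next >= n else 0.0
        learner.update(phi[s], r, phi[min(max(s_next, 0), n - 1)], done)
        if done:
            break
        s = s_next
print(learner.theta)   # approximate state values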

Value Function Approximation through Sparse Bayesian Modeling

In this study we present a sparse Bayesian framework for value function approximation. The proposed method is based on the on-line construction of a dictionary of states which are collected during the exploration of the environment by the agent. A linear regression model is established for the observed partial discounted return of such dictionary states, where we employ the Relevance Vector Machine (RVM) and exploit its enhanced modeling capability due to the embedded sparsity properties. In order to speed-up the optimization procedure and allow dealing with large-scale problems, an incremental strategy is adopted. A number of experiments have been conducted on both simulated and real environments, where we obtained promising results in comparison with another Bayesian approach that uses Gaussian processes.

Nikolaos Tziortziotis, Konstantinos Blekas

Macro-actions in Reinforcement Learning

Automatic Construction of Temporally Extended Actions for MDPs Using Bisimulation Metrics

Temporally extended actions are usually effective in speeding up reinforcement learning. In this paper we present a mechanism for automatically constructing such actions, expressed as options [24], in a finite Markov Decision Process (MDP). To do this, we compute a bisimulation metric [7] between the states in a small MDP and the states in a large MDP, which we want to solve. The shape of this metric is then used to completely define a set of options for the large MDP. We demonstrate empirically that our approach is able to improve the speed of reinforcement learning, and is generally not sensitive to parameter tuning.

Pablo Samuel Castro, Doina Precup

Unified Inter and Intra Options Learning Using Policy Gradient Methods

Temporally extended actions (or macro-actions) have proven useful for speeding up planning and learning, adding robustness, and building prior knowledge into AI systems. The options framework, as introduced in Sutton, Precup and Singh (1999), provides a natural way to incorporate macro-actions into reinforcement learning. In the subgoals approach, learning is divided into two phases, first learning each option with a prescribed subgoal, and then learning to compose the learned options together. In this paper we offer a unified framework for concurrent inter- and intra-options learning. To that end, we propose a modular parameterization of intra-option policies together with option termination conditions and the option selection policy (inter options), and show that these three decision components may be viewed as a unified policy over an augmented state-action space, to which standard policy gradient algorithms may be applied. We identify the basis functions that apply to each of these decision components, and show that they possess a useful orthogonality property that allows the natural gradient to be computed independently for each component. We further outline the extension of the suggested framework to several levels of options hierarchy, and conclude with a brief illustrative example.

Kfir Y. Levy, Nahum Shimkin

Options with Exceptions

An option is a policy fragment that represents a solution to a frequent subproblem encountered in a domain. Options may be treated as temporally extended actions, thus allowing us to reuse that solution in solving larger problems. Often, it is hard to find subproblems that are exactly the same. These differences, however small, need to be accounted for in the reused policy. In this paper, the notion of options with exceptions is introduced to address such scenarios. This is inspired by the Ripple Down Rules approach used in data mining and knowledge representation communities. The goal is to develop an option representation so that small changes in the subproblem solutions can be accommodated without losing the original solution. We empirically validate the proposed framework on a simulated game domain.

Munu Sairamesh, Balaraman Ravindran

Policy Search and Bounds

Robust Bayesian Reinforcement Learning through Tight Lower Bounds

In the Bayesian approach to sequential decision making, exact calculation of the (subjective) utility is intractable. This extends to most special cases of interest, such as reinforcement learning problems. While utility bounds are known to exist for this problem, so far none of them were particularly tight. In this paper, we show how to efficiently calculate a lower bound, which corresponds to the utility of a near-optimal memoryless policy for the decision problem, which is generally different from both the Bayes-optimal policy and the policy which is optimal for the expected MDP under the current belief. We then show how these can be applied to obtain robust exploration policies in a Bayesian reinforcement learning setting.

Christos Dimitrakakis

Optimized Look-ahead Tree Search Policies

We consider in this paper look-ahead tree techniques for the discrete-time control of a deterministic dynamical system so as to maximize a sum of discounted rewards over an infinite time horizon. Given the current system state x_t at time t, these techniques explore the look-ahead tree representing possible evolutions of the system states and rewards conditioned on subsequent actions u_t, u_{t+1}, .... When the computing budget is exhausted, they output the action u_t that led to the best found sequence of discounted rewards. In this context, we are interested in computing good strategies for exploring the look-ahead tree. We propose a generic approach that looks for such strategies by solving an optimization problem whose objective is to compute a (budget compliant) tree-exploration strategy yielding a control policy maximizing the average return over a postulated set of initial states.

This generic approach is fully specified for the case where the space of candidate tree-exploration strategies consists of “best-first” strategies parameterized by a linear combination of look-ahead path features (some of which have been advocated in the literature before), and where the optimization problem is solved by using an EDA algorithm based on Gaussian distributions. Numerical experiments carried out on a model of the treatment of HIV infection show that the optimized tree-exploration strategy is orders of magnitude better than the previously advocated ones.

Francis Maes, Louis Wehenkel, Damien Ernst
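
A minimal sketch of a best-first look-ahead tree exploration strategy scored by a linear combination of path features (here just the discounted return accumulated along the path and the path depth, with hand-picked weights; the paper instead optimizes such weights with a Gaussian EDA, which is not reproduced).

import heapq, itertools

def best_first_lookahead(x0, actions, step, reward, weights, budget=200, gamma=0.95):
    # Expand a look-ahead tree of a deterministic system with a best-first
    # strategy and return the first action of the best discounted-reward
    # sequence found. step(x, u) -> next state, reward(x, u) -> immediate reward.
    counter = itertools.count()           # tie-breaker for the heap
    # Node: (-score, tiebreak, state, depth, discounted return so far, first action)
    frontier = [(-0.0, next(counter), x0, 0, 0.0, None)]
    best_return, best_first_action = float('-inf'), None
    for _ in range(budget):
        if not frontier:
            break
        _, _, x, depth, ret, first = heapq.heappop(frontier)
        if first is not None and ret > best_return:
            best_return, best_first_action = ret, first
        for u in actions:
            child_ret = ret + (gamma ** depth) * reward(x, u)
            # Path features: discounted return so far and (negated) depth.
            features = (child_ret, -(depth + 1))
            score = sum(w * f for w, f in zip(weights, features))
            heapq.heappush(frontier, (-score, next(counter), step(x, u), depth + 1,
                                      child_ret, first if first is not None else u))
    return best_first_action

# Usage example: 1-D double integrator, reward for staying near the origin.
step = lambda x, u: (x[0] + 0.1 * x[1], x[1] + 0.1 * u)
reward = lambda x, u: -(x[0] ** 2 + 0.1 * x[1] ** 2)
print(best_first_lookahead((1.0, 0.0), actions=(-1.0, 0.0, 1.0),
                           step=step, reward=reward, weights=(1.0, 0.05)))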

A Framework for Computing Bounds for the Return of a Policy

We present a framework for computing bounds for the return of a policy in finite-horizon, continuous-state Markov Decision Processes with bounded state transitions. The state transition bounds can be based on either prior knowledge alone, or on a combination of prior knowledge and data. Our framework uses a piecewise-constant representation of the return bounds and a backwards iteration process. We instantiate this framework for a previously investigated type of prior knowledge – namely, Lipschitz continuity of the transition function. In this context, we show that the existing bounds of Fonteneau et al. (2009, 2010) can be expressed as a particular instantiation of our framework, by bounding the immediate rewards using Lipschitz continuity and choosing a particular form for the regions in the piecewise-constant representation. We also show how different instantiations of our framework can improve upon their bounds.

Cosmin Păduraru, Doina Precup, Joelle Pineau

Multi-Task and Transfer Reinforcement Learning

Transferring Evolved Reservoir Features in Reinforcement Learning Tasks

The major goal of transfer learning is to transfer knowledge acquired on a source task in order to facilitate learning on another, different, but usually related, target task. In this paper, we use neuroevolution to evolve echo state networks on the source task and transfer the best-performing reservoirs to be used as the initial population on the target task. The idea is that any non-linear, temporal features represented by the neurons of the reservoir and evolved on the source task, along with the reservoir properties, will be a good starting point for a stochastic search on the target task. In a step towards full autonomy, and by taking advantage of the random and fully connected nature of echo state networks, we examine a transfer method that renders any inter-task mappings of states and actions unnecessary. We tested our approach and that of inter-task mappings in two RL testbeds: the mountain car and the server job scheduling domains. Under various setups, the results we obtained in both cases are promising.

Kyriakos C. Chatzidimitriou, Ioannis Partalas, Pericles A. Mitkas, Ioannis Vlahavas

Transfer Learning via Multiple Inter-task Mappings

In this paper we investigate using multiple mappings for transfer learning in reinforcement learning tasks. We propose two different transfer learning algorithms that are able to manipulate multiple inter-task mappings for both model-learning and model-free reinforcement learning algorithms. Both algorithms incorporate mechanisms to select the appropriate mappings, helping to avoid the phenomenon of negative transfer. The proposed algorithms are evaluated in the Mountain Car and Keepaway domains. Experimental results show that the use of multiple inter-task mappings can significantly boost the performance of transfer learning methodologies, relative to using a single mapping or learning without transfer.

Anestis Fachantidis, Ioannis Partalas, Matthew E. Taylor, Ioannis Vlahavas

Multi-Task Reinforcement Learning: Shaping and Feature Selection

Shaping functions can be used in multi-task reinforcement learning (RL) to incorporate knowledge from previously experienced source tasks to speed up learning on a new target task. Earlier work has not clearly motivated choices for the shaping function. This paper discusses and empirically compares several alternatives, and demonstrates that the most intuitive one may not always be the best option. In addition, we extend previous work on identifying good representations for the value and shaping functions, and show that selecting the right representation results in improved generalization over tasks.

Matthijs Snel, Shimon Whiteson
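
One standard candidate shaping function of the kind such comparisons start from is the potential-based form F(s, a, s') = γΦ(s') − Φ(s) of Ng et al., with the potential Φ taken, for example, from a source-task value estimate. The sketch below shows it inside tabular Q-learning; the specific shaping functions and representations compared in the paper are not reproduced.

import numpy as np

def shaped_q_learning(env_step, reset, n_states, n_actions, potential,
                      episodes=200, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    # Tabular Q-learning with potential-based shaping
    # F(s, a, s') = gamma * potential(s') - potential(s), where 'potential'
    # could, e.g., be a value function learned on a source task (an assumption
    # made here for illustration).
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = reset(), False
        while not done:
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env_step(s, a)
            shaping = gamma * potential(s_next) - potential(s)
            target = r + shaping + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

# Usage example: 10-state corridor with the goal at the right end; the potential
# is a made-up "source-task" estimate that increases toward the goal.
n = 10
def reset(): return 0
def env_step(s, a):
    s_next = min(s + 1, n - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n - 1
    return s_next, (1.0 if done else 0.0), done
Q = shaped_q_learning(env_step, reset, n, 2, potential=lambda s: s / (n - 1))
print(Q.argmax(axis=1))   # learned greedy actions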

Multi-Agent Reinforcement Learning

Transfer Learning in Multi-Agent Reinforcement Learning Domains

In the context of reinforcement learning, transfer learning refers to the concept of reusing knowledge acquired in past tasks to speed up the learning procedure in new tasks. Transfer learning methods have been successfully applied in single-agent reinforcement learning algorithms, but no prior work has focused on applying them in a multi-agent environment. We propose a novel method for transfer learning in multi-agent reinforcement learning domains. We proceed to test the proposed approach in a multi-agent domain under various configurations. The results demonstrate that the method can reduce the learning time and increase the asymptotic performance of the learning algorithm.

Georgios Boutsioukis, Ioannis Partalas, Ioannis Vlahavas

An Extension of a Hierarchical Reinforcement Learning Algorithm for Multiagent Settings

This paper compares and investigates single-agent reinforcement learning (RL) algorithms on the simple and an extended taxi problem domain, and multiagent RL algorithms on a multiagent extension of the simple taxi problem domain we created. In particular, we extend the Policy Hill Climbing (PHC) and the Win or Learn Fast-PHC (WoLF-PHC) algorithms by combining them with the MAXQ hierarchical decomposition and investigate their efficiency. The results are very promising for the multiagent domain as they indicate that these two newly-created algorithms are the most efficient ones among the algorithms we compared.

Ioannis Lambrou, Vassilis Vassiliades, Chris Christodoulou

Apprenticeship and Inverse Reinforcement Learning

Bayesian Multitask Inverse Reinforcement Learning

We generalise the problem of inverse reinforcement learning to multiple tasks, from multiple demonstrations. Each demonstration may represent one expert trying to solve a different task, or different experts trying to solve the same task. Our main contribution is to formalise the problem as statistical preference elicitation, via a number of structured priors, whose form captures our biases about the relatedness of different tasks or expert policies. In doing so, we introduce a prior on policy optimality, which is more natural to specify. We show that our framework allows us not only to learn efficiently from multiple experts but also to effectively differentiate between the goals of each. Possible applications include analysing the intrinsic motivations of subjects in behavioural experiments and learning from multiple teachers.

Christos Dimitrakakis, Constantin A. Rothkopf

Batch, Off-Policy and Model-Free Apprenticeship Learning

This paper addresses the problem of apprenticeship learning, that is, learning control policies from demonstration by an expert. An efficient framework for it is inverse reinforcement learning (IRL). Based on the assumption that the expert maximizes a utility function, IRL aims at learning the underlying reward from example trajectories. Many IRL algorithms assume that the reward function is linearly parameterized and rely on the computation of some associated feature expectations, which is done through Monte Carlo simulation. However, this assumes access to full trajectories for the expert policy as well as at least a generative model for intermediate policies. In this paper, we introduce a temporal difference method, namely LSTD-μ, to compute these feature expectations. This allows extending apprenticeship learning to a batch and off-policy setting.

Edouard Klein, Matthieu Geist, Olivier Pietquin
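
The key observation behind an LSTD-style estimator of feature expectations is that μ^π(s) = E[Σ_t γ^t φ(s_t) | s_0 = s] satisfies, component-wise, the same linear Bellman equation as a value function whose reward is the corresponding feature. A batch LSTD solve with one right-hand side per feature therefore estimates all components at once, as in the sketch below (an illustration of the idea, not the paper's LSTD-μ implementation).

import numpy as np

def lstd_feature_expectations(Phi, Phi_next, gamma=0.95, ridge=1e-6):
    # Estimate feature expectations mu(s) = E[sum_t gamma^t phi(s_t)] from batch
    # transitions by solving one LSTD problem per feature dimension: the j-th
    # feature plays the role of the reward in the j-th problem.
    # Phi, Phi_next: (n_samples, k) feature matrices for s_t and s_{t+1}.
    k = Phi.shape[1]
    A = Phi.T @ (Phi - gamma * Phi_next) + ridge * np.eye(k)
    B = Phi.T @ Phi                      # one right-hand side per feature
    W = np.linalg.solve(A, B)            # column j: weights for feature j
    return Phi @ W                       # estimated mu(s) for each sample state

# Usage example with random features standing in for expert-policy data.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 6))
Phi_next = 0.9 * Phi + 0.1 * rng.normal(size=(500, 6))   # fake on-trajectory drift
mu = lstd_feature_expectations(Phi, Phi_next)
print(mu.shape, mu.mean(axis=0))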

Real-World Reinforcement Learning

Introduction of Fixed Mode States into Online Profit Sharing and Its Application to Waist Trajectory Generation of Biped Robot

In reinforcement learning of long-term tasks, learning efficiency may deteriorate when an agent’s probabilistic actions cause too many mistakes before task learning reaches its goal. We propose a new type of state, the fixed mode state, to which a normal state shifts once it has already received sufficient reward; in a fixed mode state the agent chooses an action based on a greedy strategy, eliminating the randomness of action selection and increasing efficiency. We start by proposing the combination of an algorithm with penalty avoiding rational policy making and online profit sharing with fixed mode states. We then discuss the target system and learning-controller design. In simulation, the learning task involves stabilizing biped walking by using the learning controller to modify a robot’s waist trajectory. We then discuss the simulation results and the effectiveness of our proposal.

Seiya Kuroda, Kazuteru Miyazaki, Hiroaki Kobayashi

MapReduce for Parallel Reinforcement Learning

We investigate the parallelization of reinforcement learning algorithms using MapReduce, a popular parallel computing framework. We present parallel versions of several dynamic programming algorithms, including policy evaluation, policy iteration, and off-policy updates. Furthermore, we design parallel reinforcement learning algorithms to deal with large scale problems using linear function approximation, including model-based projection, least squares policy iteration, temporal difference learning and recent gradient temporal difference learning algorithms. We give time and space complexity analysis of the proposed algorithms. This study demonstrates how parallelization opens new avenues for solving large scale reinforcement learning problems.

Yuxi Li, Dale Schuurmans
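
A minimal sketch of the map/reduce pattern applied to policy evaluation: each sweep partitions the state set, a map task performs the Bellman backup for its block against the current value vector, and the reduce step reassembles the new vector. Python's process pool stands in for a MapReduce framework here; the specific algorithms and complexity analysis of the paper are not reproduced.

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def backup_partition(args):
    # Map task: one Bellman backup V(s) = R(s) + gamma * P[s] @ V
    # for a contiguous block of states.
    P_block, R_block, V, gamma = args
    return R_block + gamma * P_block @ V

def parallel_policy_evaluation(P, R, gamma=0.95, sweeps=100, n_workers=4):
    # Iterative policy evaluation where each sweep is a map/reduce round.
    # P: (n, n) transition matrix of the fixed policy, R: (n,) rewards.
    n = len(R)
    V = np.zeros(n)
    blocks = np.array_split(np.arange(n), n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(sweeps):
            tasks = [(P[b], R[b], V, gamma) for b in blocks]
            parts = list(pool.map(backup_partition, tasks))   # map
            V = np.concatenate(parts)                         # reduce
    return V

if __name__ == "__main__":
    # Usage example: random 20-state Markov chain under a fixed policy.
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(20), size=20)
    R = rng.normal(size=20)
    print(parallel_policy_evaluation(P, R))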

Compound Reinforcement Learning: Theory and an Application to Finance

This paper describes compound reinforcement learning (RL), an extension of RL based on the compound return. Compound RL maximizes the logarithm of the expected double-exponentially discounted compound return in return-based Markov decision processes (MDPs). The contributions of this paper are (1) a theoretical description of compound RL as an extended RL framework for maximizing the compound return in a return-based MDP and (2) experimental results on an illustrative example and an application to finance.

Tohgoroh Matsui, Takashi Goto, Kiyoshi Izumi, Yu Chen
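
One natural reading of the objective (an assumption made for illustration, under the premise that each reward r_t is a per-step rate of return): the double-exponentially discounted compound return is Π_t (1 + r_t)^(γ^t), and taking its logarithm turns the product into the discounted sum Σ_t γ^t log(1 + r_t), e.g.:

import math

def log_compound_return(rates, gamma=0.99):
    # Logarithm of a double-exponentially discounted compound return,
    # prod_t (1 + r_t)^(gamma^t), which the log turns into the discounted sum
    # sum_t gamma^t * log(1 + r_t). This reading of the abstract is an
    # assumption made for illustration, not necessarily the paper's exact definition.
    return sum((gamma ** t) * math.log1p(r) for t, r in enumerate(rates))

# Example: a sequence of per-step returns of +1%, -0.5%, +1%, +2%.
print(log_compound_return([0.01, -0.005, 0.01, 0.02]))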

Proposal and Evaluation of the Active Course Classification Support System with Exploitation-Oriented Learning

The National Institution for Academic Degrees and University Evaluation (NIAD-UE) is an exclusive institution in Japan which can award academic degrees, based on the accumulation of academic credits, to non-university students. An applicant who wishes to be awarded a degree from NIAD-UE must classify his or her obtained credits according to pre-determined criteria for each disciplinary field. The criteria are constructed from several items. A sub-committee of experts in the relevant field judges whether the applicant's course credit classification is appropriate, paying attention to the first item in the criteria. Recently, the Active Course Classification Support (ACCS) system has been proposed to support the sub-committee in validating the applicant's course classification. When a classification item number is entered, ACCS suggests an appropriate course that belongs to the set for that classification item number. However, some difficulties in deciding appropriate item numbers still remain. This study aims to improve, by using machine learning, the method of determining the item numbers that should be judged by the sub-committee. We use Exploitation-oriented Learning as the learning method for improving ACCS, and we present a numerical example to show the effectiveness of our proposed method.

Kazuteru Miyazaki, Masaaki Ida

Backmatter
