main-content

## Über dieses Buch

This book presents state-of-the-art solution methods and applications of stochastic optimal control. It is a collection of extended papers discussed at the traditional Liverpool workshop on controlled stochastic processes with participants from both the east and the west. New problems are formulated, and progresses of ongoing research are reported.

Topics covered in this book include theoretical results and numerical methods for Markov and semi-Markov decision processes, optimal stopping of Markov processes, stochastic games, problems with partial information, optimal filtering, robust control, Q-learning, and self-organizing algorithms. Real-life case studies and applications, e.g., queueing systems, forest management, control of water resources, marketing science, and healthcare, are presented.

Scientific researchers and postgraduate students interested in stochastic optimal control,- as well as practitioners will find this book appealing and a valuable reference.

## Inhaltsverzeichnis

### Average Cost Markov Decision Processes with Semi-Uniform Feller Transition Probabilities

Abstract
This paper studies average-cost Markov decision processes with semi-uniform Feller transition probabilities. This class of MDPs was recently introduced by the authors to study MDPs with incomplete information. This paper studies the validity of optimality inequalities, the existence of optimal policies, and the approximations of optimal policies by policies optimizing total discounted costs.
Eugene A. Feinberg, Pavlo O. Kasyanov, Michael Z. Zgurovsky

### First Passage Exponential Optimality Problem for Semi-Markov Decision Processes

Abstract
This paper deals with the exponential utility maximization problem for semi-Markov decision process with Borel state and action spaces, and nonnegative reward rates. The criterion to be optimized is the expected exponential utility of the total rewards before the system state enters the target set. Under the regular and compactness-continuity conditions, we establish the corresponding optimality equation, and prove the existence of an exponential utility optimal stationary policy by an invariant embedding technique. Moreover, we provide an iterative algorithm for calculating the value function as well as the optimal policies. Finally, we illustrate the computational aspects of an optimal policy with an example.
Haifeng Huo, Xian Wen

### Controlled Random Walk: Conjecture and Counter-Example

Abstract
In this paper we investigate the following conjecture about the random walk on the positive integer lattice, starting from a large point $$i>0$$ and up to the absorption at negative points: on the first steps, one has to maximize the expected reward coming from passing through one point on the lattice. Under appropriate conditions, this conjecture is true. The counter-example shows that sometimes it is not valid.
Alexey B. Piunovskiy

### Optimal Stopping Problems for a Family of Continuous-Time Markov Processes

Abstract
In this chapter we study the well-known optimal stopping problem applied to a general family of continuous-time Markov processes. The approach to follow is merely analytic and it is based on the characterization of stopping problems through the study of a certain variational inequality; namely one solution of this inequality will coincide with the optimal value of the stopping problem. In addition, by means of this characterization, it is possible to find the so-named continuation region, and as a byproduct obtaining the optimal stopping time. The most of the material is based on the semigroup theory, infinitesimal generators and resolvents. The chapter is a complete version of the former presentation without detailed proofs in [25].
Héctor Jasso-Fuentes, Jose-Luis Menaldi, Fidel Vásquez-Rojas

### Control of Continuous-Time Markov Jump Linear Systems with Partial Information

Abstract
In this paper we study the $$H_2$$ state-feedback control of continuous-time Markov jump linear systems considering that the mode of operation cannot be directly measured. Instead we assume that there is a detector that provides the only information concerning the main jump process, so that the jump processes are modelled by a continuous-time exponential hidden Markov model. We present a new convex design condition for calculating a state-feedback controller depending only on the detector which guarantees stability in the mean-square sense of the closed-loop system, as well as a suitable bound on its $$H_2$$ norm. We present an illustrative example in the context of systems subject to faults and compare our results with the current literature.
André Marcorin de Oliveira, Oswaldo Luiz do Valle Costa

### Q-Learning for Distributionally Robust Markov Decision Processes

Abstract
In this paper, we consider distributionally robust Markov Decision Processes with Borel state and action spaces and infinite time horizon. The problem is formulated as a Stackelberg game where nature as a second player chooses the least favorable disturbance density in each scenario. Under suitable assumptions, we prove that the value function is the unique fixed point of an operator and that minimizers respectively, maximizers lead to optimal policies for the decision maker and nature. Based on this result, we introduce a Q-learning approach to solve the problem via simulation-based techniques. We prove the convergence of the Q-learning algorithm and study its performance using a distributionally robust irrigation problem.
Nicole Bäuerle, Alexander Glauner

### State Estimation in Partially Observed Stochastic Networks with Queueing Applications

Abstract
The problem of filter-based state estimation for a partially observed stochastic network is considered in this paper, using the measure change approach. The network is assumed to have two types of nodes: observed and hidden. Their dynamics are defined by a set of counting processes with state-dependent intensities. The goal is to derive the nonlinear optimal filter and to propose a numerical scheme for its practical implementation. Network models that allow the optimal filter to be finite-dimensional are also considered. The theoretical results are applied to a retrial queuing system to track changes in two hidden stations: one accumulates blocked customers and the other contains unsatisfied customers.
Konstantin V. Semenikhin

### Estimation of Equilibria in an Advertising Game with Unknown Distribution of the Response to Advertising Efforts

Abstract
We study a class of discrete-time advertising game with random responses to the advertising efforts made by a duopoly. The firms are assumed to observe the values of the random responses but they do not know their distributions. With the recorded values, firms estimate distributions and play estimated equilibrium strategies. Under suitable assumptions, we prove that the estimated equilibrium strategies converge to equilibria of the advertising game with the true distributions. Our results are numerically illustrated for specific cases.
Alan D. Robles-Aguilar, David González-Sánchez, J. Adolfo Minjárez-Sosa

### Robustness to Approximations and Model Learning in MDPs and POMDPs

Abstract
In stochastic control applications, typically only an ideal model (controlled transition kernel) is assumed and the control design is based on the given model, raising the problem of performance loss due to the mismatch between the assumed model and the actual model. In some further setups, an exact model may be known, but this model may entail computationally challenging optimality analysis leading to the solution of some approximate model being implemented. With such a motivation, we study continuity properties of discrete-time stochastic control problems with respect to system models and robustness of optimal control policies designed for incorrect models applied to the true system. We study both fully observed and partially observed setups under an infinite horizon discounted expected cost criterion. We show that continuity can be established under total variation convergence of the transition kernels under mild assumptions and with further restrictions on the dynamics and observation model under weak and setwise convergence of the transition kernels. Using these, we establish convergence results and error bounds due to mismatch that occurs by the application of a control policy which is designed for an incorrectly estimated system model to the actual system, thus establishing results on robustness. These entail implications on empirical learning in (data-driven) stochastic control since often system models are learned through empirical training data where typically the weak convergence criterion applies but stronger convergence criteria do not. We finally view and establish approximation as a particular instance of robustness.
Ali Devran Kara, Serdar Yüksel

### Full Gradient DQN Reinforcement Learning: A Provably Convergent Scheme

Abstract
We analyze the DQN reinforcement learning algorithm as a stochastic approximation scheme using the o.d.e. (for ‘ordinary differential equation’) approach and point out certain theoretical issues. We then propose a modified scheme called Full Gradient DQN (FG-DQN, for short) that has a sound theoretical basis and compare it with the original scheme on sample problems. We observe a better performance for FG-DQN.
Konstantin E. Avrachenkov, Vivek S. Borkar, Hars P. Dolhare, Kishor Patil

### On Finite Approximations to Markov Decision Processes with Recursive and Nonlinear Discounting

Abstract
In this paper, finite approximation schemes are justified for Markov decision processes in Borel spaces with recursive and nonlinear discounting. Explicit error bounds are obtained in terms of the system primitives. This allows one to solve the original problem approximately up to any given accuracy, by solving a sequence of problems in finite spaces.
Fan Deng, Xin Guo, Yi Zhang

### Locks, Bombs and Testing: The Case of Independent Locks

Abstract
We present a Defense/Attack resource allocation model, where Defender has some number of “locks” to protect n vulnerable boxes (sites), and Attacker is trying to destroy these boxes, having m “bombs,” which can be placed into boxes. Similar models were studied in game theory - (Colonel) Blotto games, but our model has a feature absent in previous literature. Attackers test the vulnerability of all sites before allocating their resources, and these tests are not perfect, i.e., a test can give plus for a box without a lock and minus for a box with a lock. We describe the optimal strategies for a version of this Locks-Bombs-Testing (LBT) model when locks appear independently in each box with the same probability.
Li Liu, Isaac M. Sonin

### A Survey of Stability Results for Redundancy Systems

Abstract
Redundancy mechanisms consist in sending several copies of a same job to a subset of servers. It constitutes one of the most promising ways to exploit diversity in multi-servers applications. However, its pros and cons are still not sufficiently understood in the context of realistic models with generic statistical properties of service-times distributions and correlation structures of copies. We aim at giving a survey of recent results concerning the stability - arguably the first benchmark of performance - of systems with cancel-on-completion redundancy. We also point out open questions and conjectures.
Elene Anton, Urtzi Ayesta, Matthieu Jonckheere, Ina Maria Verloop

### IBM Crew Pairing and Rostering Optimization (C-PRO) Technology with MDP for Optimization Flow Orchestration

Abstract
We created the IBM Crew Pairing and Rostering Optimization (C-PRO) solution for air crew scheduling. It was deployed at El Al in 2013 and at Aeroflot in 2020. The core of the system is an optimization flow, which models the problem using mixed integer linear programming (MILP) with millions of integer variables. The solution is derived iteratively using heuristics. Most recently, we applied Markov Decision Process (MDP) in place of the heuristics orchestrator and realized a 30% improvement in performance.

### A Regulatory Principle for Robust Reciprocal-Time Decay of the Adaptive Immune Response

Abstract
Follicular dendritic cells (FDC) play a crucial role in the regulation of immunity. They are believed to be responsible for long-term persistence of humoral antibody following vaccination or infection, due to their role in antibody response induction and their ability to retain antigen for long periods. In this paper, a regulatory control model is described which links persistence of humoral immunity with cellular processes associated with FDCs. The model predicts universal and stable reciprocal-time (= 1/t) decay of humoral antibody levels, which has been widely reported over a range of ages, observation times and vaccine types.
Anthony Almudevar

### Swarm Intelligence and Swarm Robotics in the Path Planning Problem

Abstract
In this chapter, we introduce the basic characteristics of swarm intelligence, the path planning problem for robots, and how to apply the self-organizing migrating algorithm, a representative of swarm intelligence to solve that real-world problem. We set up simulations in the Matlab environment with four common possible scenarios to demonstrate the effectiveness of the solution.
Quoc Bao Diep, Thanh Cong Truong, Ivan Zelinka

### Utilizing Differential Evolution into Optimizing Targeted Cancer Treatments

Abstract
Working towards the development of an evolvable cancer treatment simulator, the investigation of including evolutionary optimization methods was considered. Namely, Differential Evolution (DE) is studied here, motivated by the high efficiency of variations of this technique in real-valued problems. A basic DE algorithm, namely “DE/rand/1” was used to optimize in silico the design of a targeted drug delivery system (DDS) for tumor treatment on PhysiCell simulator. The suggested approach proved to be more efficient than a standard Genetic Algorithm (GA), which was not able to escape local minima after a predefined number of generations. The key attribute of DE that enables it to outperform standard GAs, is the fact that it keeps the diversity of the population high, throughout all the generations. This work will be incorporated with ongoing research in a more wide applicability platform that will design, develop and evaluate targeted DDSs aiming cancer tumours.
Michail-Antisthenis Tsompanas, Larry Bull, Andrew Adamatzky, Igor Balaz

### On an Approach to Evaluation of Health Care Programme by Markov Decision Model

Abstract
In this paper, we consider the evaluation of periodic screening programme for woman breast cancer and formulate the model as a partially observable Markov decision process (POMDP). We convert a POMDP with finite state, observation state and action spaces to an equivalent completely observable MDP with continuous state and finite action spaces. By this approach, we have an optimal policy from dynamic programming (DP) equation in an equivalent MDP, but we focus on considering the evaluation in scenarios of periodic screening for participants with silent condition of breast cancer and seeking an answer which programme is better than others for themselves. The aim of this paper is, by using the data sets based on cancer registration and estimated parameters of survival rates and other ratios related to screening and diagnoses in Japan, to evaluate some scenarios of breast cancer screening programme in POMDP.
Masayuki Horiguchi

### Backmatter

Weitere Informationen