Open Access 02-03-2021
Subgame Maxmin Strategies in Zero-Sum Stochastic Games with Tolerance Levels
Published in: Dynamic Games and Applications | Issue 4/2021
Abstract
We study subgame \(\phi \)-maxmin strategies in two-player zero-sum stochastic games with a countable state space, finite action spaces, and a bounded and universally measurable payoff function. Here, \(\phi \) denotes the tolerance function that assigns a nonnegative tolerated error level to every subgame. Subgame \(\phi \)-maxmin strategies are strategies of the maximizing player that guarantee the lower value in every subgame within the subgame-dependent tolerance level as given by \(\phi \). First, we provide necessary and sufficient conditions for a strategy to be a subgame \(\phi \)-maxmin strategy. As a special case, we obtain a characterization of subgame maxmin strategies, i.e., strategies that exactly guarantee the lower value at every subgame. Second, we present sufficient conditions for the existence of a subgame \(\phi \)-maxmin strategy. Finally, we show the possibly surprising result that each game admits a strictly positive tolerance function \(\phi ^*\) with the following property: if a player has a subgame \(\phi ^*\)-maxmin strategy, then he has a subgame maxmin strategy too. As a consequence, the existence of a subgame \(\phi \)-maxmin strategy for every positive tolerance function \(\phi \) is equivalent to the existence of a subgame maxmin strategy.
1 Introduction
We consider two-player zero-sum stochastic games with a countable state space, finite action spaces, and a bounded and universally measurable payoff function. Such a game models a repeated interaction between two players with opposite objectives. The environment in which the interaction takes place is fully characterized by the state variable. The transition from one state to the next is influenced by both players as well as by an element of chance. Throughout the paper, we take the perspective of the maximizing player. We are interested in strategies of the maximizing player that guarantee the lower value at every subgame and call such strategies subgame maxmin strategies. Under the assumptions made in this paper, the value may not exist, which explains why we consider the lower value instead.
As the name subgame maxmin strategy suggests, this concept is closely related to the concept of a subgame perfect equilibrium as defined in Selten ([19]). In two-player zero-sum games where the value exists, for conditions see Maitra and Sudderth ([13]) and Martin ([14]), the notions of a subgame maxmin strategy and a subgame minmax strategy coincide with the notion of a subgame optimal strategy. Moreover, in such games a strategy profile is a subgame perfect equilibrium if and only if it consists of a pair of subgame optimal strategies.
As illustrated by the Big Match, a game introduced in Gillette ([9]) and analyzed in Blackwell and Ferguson ([2]), even if the value exists, it is not guaranteed that optimal strategies exist, so a fortiori, subgame optimal strategies and subgame perfect equilibria may not exist. A large part of the literature therefore focuses on so-called subgame perfect \(\epsilon \)-equilibria as defined in Radner ([17]). This equilibrium concept is more permissive than a subgame perfect equilibrium and consists of a strategy pair such that every player obtains the value at each history up to a fixed error term of \(\epsilon /2\).
Instead of having a fixed error term at each subgame, we allow the error term to vary across different subgames. This error term is expressed as a function \(\phi \) of the histories and is called the tolerance function. The central topic of this paper is the concept of a subgame \(\phi \)-maxmin strategy. This is a strategy of the maximizing player that guarantees the lower value at every subgame within the allowed tolerance level. Intuitively, a subgame \(\phi \)-maxmin strategy performs sufficiently well across all subgames. This type of strategy is related to the concept of a \(\phi \)-tolerance equilibrium as defined in Flesch and Predtetchinski ([6]). Indeed, if the value exists, then a strategy profile in which both players use a subgame \((\phi /2)\)-optimal strategy is a \(\phi \)-tolerance equilibrium.
One motivation for letting the tolerance level vary across subgames is given by Mailath, Postlewaite, and Samuelson ([11]) when introducing the concept of a contemporaneous perfect \(\epsilon \)-equilibrium. The authors focus on games in which the payoff function of the players is given by the discounted sum of periodic rewards. Due to this discounting, there exists a period after which the maximal total discounted reward a player can receive is smaller than \(\epsilon \). If the allowed tolerance level \(\epsilon \) is fixed across all subgames, any strategy is an \(\epsilon \)-maxmin strategy of a subgame in such a period. The concept of a subgame \(\epsilon \)-maxmin strategy therefore does not impose any restrictions on the actions chosen in the very distant future. It would be more natural to discount not only the rewards but also the allowed tolerance level.
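To make this concrete, here is a small computation of ours (not from the paper): with per-period rewards in \([0,1]\) and discount factor \(\delta \), the total discounted reward obtainable from period \(T\) onward is at most \(\delta ^T/(1-\delta )\), so a constant tolerance \(\epsilon \) stops restricting behavior once this tail bound falls below \(\epsilon \).

```python
# Sketch (ours, not from the paper): once the tail bound delta**T / (1 - delta)
# on the remaining discounted reward drops below a fixed tolerance eps, *every*
# continuation strategy is eps-maxmin in those subgames, so a constant
# tolerance imposes no restriction there.

def tail_bound(delta: float, T: int) -> float:
    """Upper bound on the total discounted reward from period T onward."""
    return delta ** T / (1 - delta)

def first_vacuous_period(delta: float, eps: float) -> int:
    """Smallest T from which a constant tolerance eps restricts nothing."""
    T = 0
    while tail_bound(delta, T) >= eps:
        T += 1
    return T

print(first_vacuous_period(0.9, 0.01))  # with delta = 0.9 and eps = 0.01
```

Discounting the tolerance level as well, e.g., \(\phi (h)=\delta ^{\Vert h\Vert }\epsilon \), keeps the constraint meaningful at every history.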
Additional motivation for letting the tolerance level vary across subgames stems from the fact that the notion of what is considered sufficiently good might be relative. For instance, Tversky and Kahneman ([21]) observe that people evaluate decisions with respect to a reference level. They find that significantly more people were willing to exert extra effort to save $5 on a $15 purchase than to save $5 on a $125 purchase. To apply this to the context of zero-sum games, consider the following game, to which we will refer as the high stakes–low stakes game. In the first stage of the game, a chance move determines whether the player will engage in the high stakes or the low stakes variant of this game. The high stakes and the low stakes games are identical in terms of possible strategies. The only difference is that the payoffs in the high stakes game are a thousandfold the payoffs in the low stakes game. Furthermore, assume that in the high stakes subgame, the payoff of player 1 ranges between 0 and 2000 and the value is 1000, while in the low stakes subgame the payoff of player 1 ranges between 0 and 2 and the value is 1. Players that evaluate decisions with respect to a reference level may label a strategy which guarantees a payoff of 999 in the high stakes game as sufficiently good. However, in the low stakes game, 0 corresponds to the minimum payoff. Allowing the tolerance level to vary across subgames can therefore be used to accommodate such behavioral effects in the model of zero-sum stochastic games.
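A relative tolerance of the kind just described can be written as \(\phi (h)=\alpha \cdot v(h)\). The toy check below is our illustration (the numbers are those of the high stakes–low stakes game above): such a \(\phi \) accepts a guarantee of 999 in the high stakes subgame while rejecting a guarantee of 0 in the low stakes one, whereas any constant tolerance that accepts the former also accepts the latter.

```python
# Toy check (ours): a tolerance proportional to the subgame's value,
# phi(h) = alpha * v(h), applied to the high stakes-low stakes game above.
alpha = 0.01  # tolerate a 1% shortfall relative to the subgame's value

def acceptable(value: float, guaranteed: float) -> bool:
    """Is the guarantee within the relative tolerance phi = alpha * value?"""
    return guaranteed >= value - alpha * value

# Guaranteeing 999 out of a value of 1000 passes in the high stakes subgame...
assert acceptable(value=1000.0, guaranteed=999.0)
# ...but guaranteeing 0 out of a value of 1 is rejected in the low stakes one.
assert not acceptable(value=1.0, guaranteed=0.0)
# A *constant* tolerance accepting the first pair needs epsilon >= 1, and then
# it also accepts the second pair, since 0 >= 1 - epsilon.
print("relative tolerance separates the two subgames")
```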
Another case where history-dependent tolerance levels are natural is the following. A player may, in commonly occurring situations, use a familiar strategy irrespective of the scale of the payoffs. To see this, imagine a player who is highly experienced in playing a certain zero-sum game. He has a trusted strategy which guarantees him the value of this game within some error. Now, consider the high stakes–low stakes game again. The player might well use the trusted strategy in both the low stakes and the high stakes subgame. The error related to this strategy will therefore be relative to the value of the respective subgame.
Finally, in the class of stochastic games identified in Flesch, Thuijsman, and Vrieze ([8]), the only way to obtain \(\epsilon \)-optimality is to use strategies that are called improving. Improving strategies are closely related to subgame \(\phi \)-maxmin strategies in which the tolerance level in some subgames is smaller than the tolerance level at the root.
With respect to the concept of subgame \(\phi \)-maxmin strategies, this paper attempts to provide answers to the following two fundamental questions:
1.
For positive tolerance functions \( \phi ,\) when does a subgame \(\phi \)-maxmin strategy exist?
2.
How is the existence of a subgame \(\phi \)-maxmin strategy related to the existence of a subgame maxmin strategy?
To answer these questions, we will first derive necessary and sufficient conditions for a strategy to be a subgame \(\phi \)-maxmin strategy. This is done in Sect. 4. As a special case of these necessary and sufficient conditions, we obtain a characterization of subgame maxmin strategies. For the special class of positive and negative stochastic games, a related characterization of subgame maxmin strategies was obtained by Flesch, Predtetchinski, and Sudderth ([7]).
The necessary and sufficient conditions for strategies to be subgame \(\phi \)-maxmin can be split into a local condition and an equalizing condition. Informally, the local condition states that the lower value one expects to get in the next subgame should always be at least the lower value of the current subgame. The equalizing condition requires that, for every strategy of the other player, a subgame \( \phi \)-maxmin strategy almost surely results in a play with an eventually good enough payoff, where eventually good enough means being at least the lower value in very deep subgames up to the allowed tolerance level.
In Sect. 5, we consider the question of existence of subgame \(\phi \)-maxmin strategies for positive tolerance functions \(\phi \). We prove that for a positive tolerance function \(\phi \), a subgame \(\phi \)-maxmin strategy exists if every play is either a point of upper semicontinuity of the payoff function or such that the sequence of tolerance levels occurring along the play has a positive lower bound. The novel aspect of the theorem is the way in which it encompasses a number of important special cases, some of which have been studied extensively in the literature. One special case of interest is when the payoff function is upper semicontinuous everywhere. In this case, the existence of subgame maxmin strategies follows from a result of Laraki, Maitra, and Sudderth ([10]). Another is the case when along each play the tolerance levels remain bounded away from zero. This latter case subsumes the familiar class of all positive constant tolerance functions. When the tolerance level is equal to a constant \(\epsilon \) at each history, subgame \(\phi \)-maxmin strategies are exactly the so-called subgame \(\epsilon \)-optimal strategies. Existence of subgame \(\epsilon \)-optimal strategies for each \(\epsilon > 0\) is established in Mashiah–Yaakovi ([15]).
Our technique relies on switching between strategies: one starts by following a \(\tfrac{\phi (x_{0})}{2}\)-optimal strategy until a history (say h) is reached at which it ceases to be \(\phi (h)\)-optimal. At that moment, one discards the original strategy and switches to a \(\tfrac{\phi (h)}{2}\)-optimal strategy, and so on. This technique is well attested in the literature and is used in particular in Mashiah–Yaakovi ([15]). We refine the method to accommodate both types of plays: those along which the tolerance levels remain bounded away from zero, and those along which they converge to zero (the latter are, by assumption, plays at which the payoff function is upper semicontinuous).
In Sect. 6, we study the relation between the existence of subgame \(\phi \)-maxmin strategies and subgame maxmin strategies. Our main result states that the existence of a particular subgame \(\phi ^*\)-maxmin strategy, with \(\phi ^*>0\), is equivalent to the existence of a subgame maxmin strategy. Consequently, the existence of subgame \(\phi \)-maxmin strategies for all positive tolerance functions \(\phi \) is equivalent to the existence of a subgame maxmin strategy. In the special case of an upper semicontinuous payoff function, the results of Sects. 5 and 6 imply the existence of a subgame maxmin strategy. As noted above, this result is not new, and is a consequence of Laraki, Maitra, and Sudderth ([10]).
The connection between the existence of subgame \(\phi \)-maxmin strategies for every positive tolerance function \(\phi \) and the existence of subgame maxmin strategies is not only useful to further understand the results obtained by Laraki, Maitra, and Sudderth ([10]), but also highlights an important and surprising difference between subgame \(\phi \)-maxmin strategies and the closely related concept of subgame \(\epsilon \)-maxmin strategies. Indeed, the existence of a subgame \(\epsilon \)-maxmin strategy for every \(\epsilon >0\) does not imply the existence of a subgame maxmin strategy.
The rest of the paper is structured as follows. In Sect. 2, we formulate the model setup, and in Sect. 3, we formally define the main concepts. Then, in Sect. 4, we discuss the necessary and sufficient conditions for a strategy to be a subgame \(\phi \)-maxmin strategy and give a characterization of subgame maxmin strategies. We continue in Sect. 5 by providing sufficient conditions that guarantee the existence of a subgame \(\phi \)-maxmin strategy. Then, in Sect. 6, we explain why the existence of subgame \(\phi \)-maxmin strategies for every positive tolerance function \(\phi \) is equivalent to the existence of a subgame maxmin strategy. Finally, in Sect. 7, we discuss the importance of the assumptions we made and whether they might be relaxed.
2 Two-Player Zero-Sum Stochastic Games
We consider a two-player zero-sum stochastic game with finitely many actions and countably many states. The payoff function is required to be bounded and universally measurable. The model encompasses all two-player zero-sum games with perfect information and stochastic moves.
Actions, states, and histories The action sets of players 1 and 2 are given by the finite sets \({\mathcal {A}} \) and \( {\mathcal {B}}, \) respectively. The state space is given by the countable set \( {\mathcal {X}}. \) Let \(x_{0}\) denote the initial state and define the set \({\mathcal {Z}} = {\mathcal {A}}\times {\mathcal {B}}\times {\mathcal {X}}\). The game consists of an infinite sequence of one-shot games. At the initial state \(x_{0},\) the one-shot game \(G(x_{0})\) is played as follows: Players 1 and 2 simultaneously select an action from their respective action sets, denoted by \( a_{1} \) and \( b_{1}, \) respectively. Then, the next state \(x_1\) is selected according to the transition probability \(q(\cdot \vert x_{0},a_1,b_1)\). At the new state \(x_1,\) this process repeats itself and both players play the one-shot game \(G(x_1)\) by selecting actions \(a_2\) and \(b_2\) from their respective action sets. The next state \(x_2\) is selected according to the transition probability \(q(\cdot \vert x_{0},a_1,b_1,x_{1},a_{2},b_{2})\). This goes on indefinitely and creates a play \(p=(x_{0},a_1,b_1,x_1,a_2,b_2,\ldots )\). Note that the transition probability can depend on the entire history.
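The play-generation process just described can be sketched in a few lines. The sketch below is ours, not the paper's: the function names `simulate` and `sample` and the concrete two-state transition rule are our own illustrative choices.

```python
import random

# Minimal sketch (ours) of the stage process of Sect. 2.  A behavioral
# strategy maps a history to a distribution over actions; `q` maps a history
# plus the chosen action pair to a distribution over next states, so
# transitions may depend on the entire history, as in the text.
def simulate(x0, sigma, tau, q, periods):
    """Generate the prefix (x0, a1, b1, x1, ..., a_t, b_t, x_t) of a play."""
    history = (x0,)
    for _ in range(periods):
        a = sample(sigma(history))   # player 1's mixed action
        b = sample(tau(history))     # player 2's mixed action
        x = sample(q(history, a, b))  # chance move to the next state
        history = history + (a, b, x)
    return history

def sample(dist):
    """Draw from a dict {outcome: probability}."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs)[0]

# A two-state example: the action pair ('c', 'c') keeps the state, any other
# pair flips it; player 2 always plays 'c'.
q = lambda h, a, b: {h[-1]: 1.0} if (a, b) == ('c', 'c') else {1 - h[-1]: 1.0}
sigma = lambda h: {'c': 0.5, 'q': 0.5}
tau = lambda h: {'c': 1.0}
print(simulate(0, sigma, tau, q, 5))
```

Note that `q` receives the entire history, reflecting that transitions need not be Markovian in the current state.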
For every \(t\in {\mathbb {N}}=\lbrace 0,1,2,\ldots \rbrace \), let \({\mathcal {H}}^t=\lbrace x_{0}\rbrace \times {\mathcal {Z}}^{t}\) denote the set of all histories that are generated after t one-shot games. The set \({\mathcal {H}}^0\) consists of the single element \(x_{0}\). For \(t \ge 1,\) elements of \({\mathcal {H}}^t\) are of the form \((x_{0},a_1,b_1,\ldots ,a_t,b_t,x_t)\). Let \({\mathcal {H}} = \cup _{t\in {\mathbb {N}}}{\mathcal {H}}^t\) denote the set of all histories. For all \(h \in {\mathcal {H}},\) let \(\Vert h\Vert = \Vert (x_{0},a_1,b_1,\ldots ,a_t,b_t,x_t)\Vert = t\) denote the number of one-shot games that occurred during the history h. We refer to \(\Vert h\Vert \) as the length of the history h. For all \(t\le \Vert h\Vert , \) the restriction of the history h to the first t one-shot games is denoted by \(h_{\vert t}=(x_{0},a_1,b_1,\ldots ,a_t,b_t,x_t)\). We write \(h' \preceq h\) if there exists \(t\le \Vert h\Vert \) such that \(h_{\vert t}=h'\), so if the history h extends the history \(h'\).
The space of plays Define \({\mathcal {P}} = \lbrace x_{0}\rbrace \times {\mathcal {Z}}^{{\mathbb {N}}}\) to be the set of plays. Elements of \({\mathcal {P}}\) are of the form \(p = (x_{0},a_1,b_1,x_1,a_2,b_2,\ldots )\). For \(t \in {\mathbb {N}},\) let \(p_{\vert t}\) denote the prefix of p of length t: that is \(p_{\vert 0} = x_{0}\) and \(p_{\vert t} = (x_{0},a_1,b_1,\ldots ,a_t,b_t,x_t)\) for \(t \ge 1\). We write \(h \prec p\) if a history h is a prefix of p. For every \(h \in {\mathcal {H}},\) let \({\mathcal {P}}(h) = \lbrace p \in {\mathcal {P}}\vert h\prec p\rbrace \) denote the cylinder set consisting of all plays which extend history h.
We endow \({\mathcal {Z}}\) with the discrete topology and \({\mathcal {P}}\) with the product topology. The collection of all cylinder sets is a basis for the product topology on \({\mathcal {P}}\).
For \(t \in {\mathbb {N}}\), let \({\mathcal {F}}^t\) be the sigma-algebra generated by the collection of cylinder sets \(\{{\mathcal {P}}(h) \mid h \in {\mathcal {H}}^{t}\}\). Each set in \({\mathcal {F}}^t\) can be written as the union of sets in \(\{{\mathcal {P}}(h) \mid h \in {\mathcal {H}}^{t}\}\). Let \({\mathcal {F}}^\infty \) be the sigma-algebra generated by \(\displaystyle \cup _{t\in {\mathbb {N}}} {\mathcal {F}}^t\). This is exactly the Borel sigma-algebra generated by the product topology on \({\mathcal {P}}\). The sigma-algebra of universally measurable subsets of \({\mathcal {P}}\) is denoted by \({\mathcal {F}}\). Elements of \({\mathcal {F}}\) are sets that belong to the completion of each Borel probability measure on \({\mathcal {P}}\). For details of the definition of the sigma-algebra \({\mathcal {F}}, \) the reader is referred to Appendix A. It holds that \({\mathcal {F}}^{t} \subseteq {\mathcal {F}}^{t+1} \subseteq \cdots \subseteq {\mathcal {F}}^\infty \subseteq {\mathcal {F}}\). The set of plays \({\mathcal {P}}\) together with the universally measurable sigma-algebra \({\mathcal {F}}\) determines a measurable space \(({\mathcal {P}},{\mathcal {F}})\). A stochastic variable is a universally measurable function from \({\mathcal {P}}\) to \({\mathbb {R}}\).
Strategies Let \(\Delta ({\mathcal {A}})\) denote the set of probability measures over the action set of player 1 and \(\Delta ({\mathcal {B}})\) the set of probability measures over the action set of player 2. A behavioral strategy for player 1 is a function \(\sigma : {\mathcal {H}} \rightarrow \Delta ({\mathcal {A}})\). Hence, at each history player 1 chooses a mixed action. A pure strategy for player 1 is a function that assigns to every history an action, with a minor abuse of notation, \(\sigma :{\mathcal {H}}\rightarrow {\mathcal {A}}\). Similarly, one can define a behavioral and a pure strategy \(\tau \) for player 2. Let \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\) denote the sets of behavioral strategies of players 1 and 2, respectively.
It follows from the Ionescu Tulcea extension theorem that every history \(h\in {\mathcal {H}}\) and strategy profile \((\sigma ,\tau ) \in {{{\mathcal {S}}}}_{1} \times {{{\mathcal {S}}}}_{2}\) determine a probability measure \({\mathbb {P}}_{h,\sigma ,\tau }\) on the measurable space \(({\mathcal {P}}(h),{\mathcal {F}}^\infty _{{\mathcal {P}}(h)})\), where \({\mathcal {F}}^\infty _{{\mathcal {P}}(h)}\) denotes the Borel sigma-algebra over the set of plays extending the history h. The measure \({\mathbb {P}}_{h,\sigma ,\tau }\) can be extended to the measurable space \(({\mathcal {P}},{\mathcal {F}}^\infty )\) in the obvious way. Taking the completion of the probability space \(({\mathcal {P}},{\mathcal {F}}^\infty ,{\mathbb {P}}_{h,\sigma ,\tau })\) yields the probability space \(({\mathcal {P}}, {\mathcal {F}}, {\mathbb {P}}_{h,\sigma ,\tau }^{\mathrm{c}})\). With a minor abuse of notation, we will omit the superscript \(\mathrm{c}\) and write \({\mathbb {P}}_{h,\sigma ,\tau }\) in the remainder of this paper.
Payoff function We assume that the payoff function \(u:{\mathcal {P}}\rightarrow {\mathbb {R}}\) of player 1 is bounded, where we define \( M = \sup _{p\in {\mathcal {P}}} \vert u(p)\vert , \) and universally measurable. We also assume the game to be zero-sum. The payoff function of player 2 is therefore given by \(-u\). We denote the game as described above by \(\Gamma _{x_{0}}(u)\). Throughout the paper, we take the point of view of player 1. This is without loss of generality, since the situation of player 2 in the game \(\Gamma _{x_{0}}(u)\) is identical to that of player 1 in the game \(\Gamma _{x_{0}}(-u)\).
The expected payoff of player 1 corresponding to strategy profile \((\sigma ,\tau ) \in {{{\mathcal {S}}}}_{1} \times {{{\mathcal {S}}}}_{2} \) at history \( h \in {{{\mathcal {H}}}} \) is given by \({\mathbb {E}}_{h,\sigma ,\tau }\left[ u\right] , \) where the expectation is taken with respect to the probability measure \({\mathbb {P}}_{h,\sigma ,\tau }\). The expected payoff of player 1 at the history \(x_0\) is denoted by \({\mathbb {E}}_{\sigma ,\tau }\left[ u\right] \).
Above, we imposed the condition that the payoff function u is universally measurable. As Borel measurable functions are also universally measurable, this condition is weaker than requiring that u be Borel measurable. The main difference between these two conditions is the following. When the payoff function u is Borel measurable, the game is known to admit a value (cf. Sect. 3 for the formal definition). However, it is not provable from the standard ZFC axioms of set theory^{1} that our games with a universally measurable payoff function also admit a value. Therefore, instead of the value, in the remainder of the paper we will use the lower value (cf. Sect. 3 for the formal definition). We would like to emphasize that, except for the difference between the concepts of the value and the lower value, assuming Borel measurable payoffs would not lead to any simplification in the proofs. A reader unfamiliar with universally measurable functions can thus read the entire paper imagining Borel measurable payoffs.
3 Subgame \(\phi \)-Maxmin Strategies
We start this section by defining subgame \(\phi \)-maxmin strategies. Then, we discuss an illustrative example. Finally, we define and examine guarantee levels of strategies.
3.1 Definition of Subgame \(\phi \)-Maxmin Strategies
For every game \(\Gamma _{x_{0}}(u), \) for every history \(h\in {\mathcal {H}}, \) we define the lower value \({\underline{v}}(h)\) and the upper value \({\overline{v}}(h)\) by
$$\begin{aligned} {\underline{v}}(h)&=\sup \limits _{\sigma \in {\mathcal {S}}_1}\inf \limits _{\tau \in {\mathcal {S}}_2} {\mathbb {E}}_{h,\sigma ,\tau }\left[ u\right] , \end{aligned}$$
(3.1)
$$\begin{aligned} {\overline{v}}(h)&=\inf \limits _{\tau \in {\mathcal {S}}_2}\sup \limits _{\sigma \in {\mathcal {S}}_1} {\mathbb {E}}_{h,\sigma ,\tau }\left[ u\right] . \end{aligned}$$
(3.2)
Because the payoff function u is assumed to be bounded, we have that \({\underline{v}}(h), {\overline{v}}(h)\in {\mathbb {R}}\). Therefore, the lower and upper value exist in every subgame of \(\Gamma _{x_{0}}(u). \) Furthermore, we have that \({\underline{v}}(h)\le {\overline{v}}(h)\). Whenever \({\underline{v}}(h)= {\overline{v}}(h)\), we say that the subgame at history h has a value and we denote it by v(h). The lower value, the upper value, and the value at the initial state \(x_{0}\) are denoted by \({\underline{v}}\), \({\overline{v}},\) and v, respectively. If u is Borel measurable, then the value exists by Maitra and Sudderth ([13]) and Martin ([14]). Since we do not assume u to be Borel measurable, we present our results in terms of the lower value.
Even if the value exists, player 1 may not have a strategy that guarantees the value in each subgame, and the literature has therefore studied subgame \(\epsilon \)-optimal strategies. These are strategies which guarantee the value in each subgame up to an allowed error term of \(\epsilon \). If the payoff function is bounded and Borel measurable, it has been shown by Mashiah–Yaakovi ([15]) that for each \(\epsilon >0\) player 1 has a subgame \(\epsilon \)-optimal strategy. The concept of a subgame \(\epsilon \)-optimal strategy has a constant error term \(\epsilon \) across all subgames. However, as argued in the introduction, there are instances in which it is more natural to consider a variable error term. Therefore, instead of considering a constant error term, we follow Flesch and Predtetchinski ([6]), who allow the error term to vary across histories in their investigation of the \(\phi \)-tolerance equilibrium. This leads us to the study of subgame \(\phi \)-maxmin strategies, where \(\phi :{\mathcal {H}}\rightarrow [0,\infty )\) is a tolerance function assigning an allowed tolerance level to each history.
Definition 3.1
Let \(\phi :{\mathcal {H}}\rightarrow {[}0,\infty )\) be a tolerance function. A strategy \(\sigma \in {{{\mathcal {S}}}}_{1} \) is a subgame \(\phi \)-maxmin strategy in the game \( \Gamma _{x_{0}}(u) \) if for every history \(h\in {\mathcal {H}}\) it holds that
$$\begin{aligned} \forall \tau \in {\mathcal {S}}_2,~{\mathbb {E}}_{h,\sigma ,\tau }\left[ u\right] \ge {\underline{v}}(h)-\phi (h). \end{aligned}$$
(3.3)
In case \( \phi \) is identically equal to zero, we omit it from the notation, and simply refer to a subgame maxmin strategy.
A subgame \(\phi \)-maxmin strategy guarantees at each history h of the game the lower value up to the tolerance level \(\phi (h)\). If the tolerance function is such that for some \( \epsilon \ge 0, \) for every \(h\in {\mathcal {H}}\), \(\phi (h)=\epsilon \), then we refer to the strategy as a subgame \( \epsilon \)-maxmin strategy. If the value exists, then the notion of a subgame \(\epsilon \)-maxmin strategy coincides with the notion of a subgame \(\epsilon \)-optimal strategy.
The following example illustrates that even for a strictly positive tolerance function \(\phi \), player 1 may not have a subgame \(\phi \)-maxmin strategy. Interestingly, however, player 1 has a subgame \(\epsilon \)-maxmin strategy for every \(\epsilon >0, \) which in fact is pure. We also give an example of a tolerance function \( \phi \) for which player 1 has a subgame \(\phi \)-maxmin strategy, but no pure such strategy.
Example 3.2
The decision problem depicted in Fig. 1 corresponds to a two-player zero-sum stochastic game in which the state space is trivial and the second player is a dummy player. Whenever the state space or the action sets are degenerate, the corresponding states and actions are omitted from the notation in examples. The set of actions of player 1 is \( {{{\mathcal {A}}}} = \{c,q\}\), where c stands for continue and q for quit. The game stops as soon as player 1 chooses to quit. If player 1 decides to quit at period t, then his payoff is \( t/(t+1)\). If player 1 never quits, his payoff is 0.
In this game, player 1 has a subgame \(\epsilon \)-maxmin strategy for every \(\epsilon >0\): for example, the strategy which quits whenever quitting leads to a payoff of at least \(1-\epsilon \).
We now turn to the existence of a subgame \(\phi \)-maxmin strategy. Clearly, there exists no subgame maxmin strategy. As we will see later, Theorem 6.1 then implies that there is some strictly positive tolerance function \(\phi \) for which no subgame \(\phi \)-maxmin strategy exists. Intuitively, such a tolerance function forces player 1 to continue with such a large probability that the total probability of never quitting becomes nearly one.
Interestingly, there is a tolerance function \(\phi \) for which player 1 has a subgame \(\phi \)-maxmin strategy, but for which he has no pure strategy that is subgame \(\phi \)-maxmin. To see this, define the tolerance function \(\phi \) by \(\phi (c^t)=1-\frac{1}{2}\cdot \frac{t}{t+1}-\frac{1}{2}\cdot \frac{t+1}{t+2}\) for each \(t\in {\mathbb {N}}\). Now consider the strategy \(\sigma \) which always chooses action c with probability 1/2 and action q with probability 1/2. In the subgame at any history \(c^t\), it quits immediately with probability 1/2 and quits at a later period with probability 1/2. In the former case, it gives payoff \(\frac{t}{t+1}\), whereas in the latter case, it gives an expected payoff of at least \(\frac{t+1}{t+2}\). Hence, the strategy \(\sigma \) is a subgame \(\phi \)-maxmin strategy. Yet, there is no pure strategy that is subgame \(\phi \)-maxmin. Indeed, the tolerance function does not allow a pure strategy to quit at any period, whereas a pure strategy that always continues is not subgame \( \phi \)-maxmin either. As we show in the working paper version of our paper, Flesch, Herings, Maes, and Predtetchinski ([5]), a pure subgame \(\phi \)-maxmin strategy exists if and only if the following two conditions hold: (1) for every \(t\in {\mathbb {N}}, \) \( \phi (c^t)>0, \) and (2) for infinitely many \( t \in {\mathbb {N}}, \) \( \phi ^{\prime }(c^{t}) = \displaystyle \min _{n \le t}\phi (c^n)\ge \frac{1}{t+1}\). \(\square \)
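The claim that \(\sigma \) is subgame \(\phi \)-maxmin in this example can also be checked numerically (this verification is ours, not part of the paper). Under \(\sigma \), the expected payoff in the subgame at \(c^t\) is \(\sum _{k\ge 0}(1/2)^{k+1}\,\frac{t+k}{t+k+1}\), and the lower value at \(c^t\) is 1:

```python
# Numerical check (ours) of Example 3.2: under the strategy sigma that quits
# with probability 1/2 in every period, the expected payoff at history c^t is
#   E_t = sum_{k >= 0} (1/2)**(k+1) * (t+k)/(t+k+1),
# and the claim is E_t >= 1 - phi(c^t), with
#   phi(c^t) = 1 - (1/2)*t/(t+1) - (1/2)*(t+1)/(t+2).
def phi(t: int) -> float:
    return 1 - 0.5 * t / (t + 1) - 0.5 * (t + 1) / (t + 2)

def expected_payoff(t: int, horizon: int = 200) -> float:
    # Truncating at `horizon` loses at most (1/2)**horizon, negligible here.
    return sum(0.5 ** (k + 1) * (t + k) / (t + k + 1) for k in range(horizon))

for t in range(50):
    assert expected_payoff(t) >= 1 - phi(t)   # the lower value at c^t is 1
print("phi-maxmin check passed")
```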
3.2 Guarantee Levels
To identify subgame \(\phi \)-maxmin strategies, it is useful to define the function \({\underline{u}}: {\mathcal {S}}_1\times {\mathcal {H}}\rightarrow {\mathbb {R}}\) by
$$\begin{aligned} {\underline{u}}(\sigma ,h)=\inf _{\tau \in {\mathcal {S}}_2} {\mathbb {E}}_{h,\sigma ,\tau }\left[ u\right] . \end{aligned}$$
(3.4)
The payoff \({\underline{u}}(\sigma ,h)\) corresponds to the guarantee level that player 1 is expected to receive at history h when playing the strategy \(\sigma \). A strategy \(\sigma \in {{{\mathcal {S}}}}_{1} \) is called a \(\phi (h)\)-maxmin strategy for the subgame at history h if \({\underline{u}}(\sigma ,h) \ge {\underline{v}}(h) - \phi (h)\).
For every strategy profile \((\sigma ,\tau ) \in {\mathcal {S}}_{1} \times {\mathcal {S}}_{2}\), for every \(t \in {\mathbb {N}}\), define the stochastic variables \(U_{\sigma ,\tau }^{t}\), \({\underline{U}}_{\sigma }^{t}\), and \({\underline{V}}^{t}\) by letting \(U_{\sigma ,\tau }^{t}(p) = {\mathbb {E}}_{p_{\vert t},\sigma ,\tau }[u]\), \({\underline{U}}_{\sigma }^{t}(p) = {\underline{u}}(\sigma ,p_{\vert t})\), and \({\underline{V}}^{t}(p) = {\underline{v}}(p_{\vert t})\), respectively, for each play \(p \in {\mathcal {P}}\). All three stochastic variables are \({\mathcal {F}}^{t}\)-measurable. We have \({\underline{U}}_{\sigma }^{t} \le U_{\sigma ,\tau }^{t}\) and \({\underline{U}}_{\sigma }^{t} \le {\underline{V}}^{t}\) everywhere on \({\mathcal {P}}\).
The next lemma states the submartingale property of guarantee levels. It says that the guarantee level that player 1 can expect to receive weakly increases over time.
Lemma 3.3
(Submartingale property of guarantee levels) Let a strategy profile \((\sigma , \tau ) \in {{{\mathcal {S}}}}_{1} \times {{{\mathcal {S}}}}_{2}, \) \(t \in {\mathbb {N}}, \) and a history \(h \in {\mathcal {H}}^{t}\) of length t be given. The process \(({\underline{U}}_{\sigma }^{t + n})_{n \in {\mathbb {N}}}\) is a \({\mathbb {P}}_{h,\sigma ,\tau }\)-submartingale. In particular, it holds that \({\underline{u}}(\sigma ,h) \le {\mathbb {E}}_{h,\sigma ,\tau }[{\underline{U}}_{\sigma }^{t+1}]\).
Proof
We first prove the second claim. Take \(\delta > 0. \) Let \(\tau ' \in {\mathcal {S}}_2\) be such that \( \tau ^{\prime }(h) = \tau (h) \) and for each \((a,b,x) \in {\mathcal {Z}} \) it holds that
$$\begin{aligned}{\mathbb {E}}_{(h,a,b,x),\sigma ,\tau '}[u] \le {\underline{u}}(\sigma ,(h,a,b,x)) + \delta .\end{aligned}$$
We have that
$$\begin{aligned} {\underline{u}}(\sigma ,h)&\le {\mathbb {E}}_{h,\sigma ,\tau '}[u]\\&= \sum _{(a,b,x) \in {\mathcal {Z}}}\sigma (h)(a) \cdot \tau (h)(b) \cdot q(x\vert h,a,b) \cdot {\mathbb {E}}_{(h,a,b,x),\sigma ,\tau '}[u]\\&\le \sum _{(a,b,x) \in {\mathcal {Z}}} \sigma (h)(a) \cdot \tau (h)(b) \cdot q(x\vert h,a,b) \cdot ({\underline{u}}(\sigma ,(h,a,b,x)) + \delta )\\&= {\mathbb {E}}_{h,\sigma ,\tau }[{\underline{U}}_{\sigma }^{t+1}] + \delta . \end{aligned}$$
It follows that \( {\underline{u}}(\sigma ,h) \le {\mathbb {E}}_{h,\sigma ,\tau }[{\underline{U}}_{\sigma }^{t+1}] \) since \( \delta > 0 \) is arbitrary.
To prove that the process \(({\underline{U}}_{\sigma }^{t + n})_{n \in {\mathbb {N}}}\) is a \({\mathbb {P}}_{h,\sigma ,\tau }\)-submartingale, note that the conditional expectation \({\mathbb {E}}_{h,\sigma ,\tau }({\underline{U}}_{\sigma }^{t+n+1} \mid {\mathcal {F}}^{t+n})\), when evaluated at a play \(p \in {\mathcal {P}}(h)\), equals \({\mathbb {E}}_{p_{\vert t+n},\sigma ,\tau }({\underline{U}}_{\sigma }^{t+n+1})\), which, by the claim already proven, is at least \({\underline{u}}(\sigma ,p_{\vert t+n}) = {\underline{U}}_{\sigma }^{t+n}(p)\). \(\square \)
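As a sanity check of ours (not the paper's), the one-step inequality of Lemma 3.3 can be verified numerically in the quit game of Example 3.2, where player 2 is a dummy and the infimum over \(\tau \) is vacuous. Under the strategy that quits with probability 1/2 in every period, \({\underline{u}}(\sigma ,c^t) = \frac{1}{2}\cdot \frac{t}{t+1} + \frac{1}{2}\cdot {\underline{u}}(\sigma ,c^{t+1})\), so the submartingale inequality holds, here with equality:

```python
# Sketch (ours): the one-step submartingale inequality of Lemma 3.3, checked
# in the quit game of Example 3.2 under the 1/2-1/2 quitting strategy.
def guarantee(t: int, horizon: int = 400) -> float:
    """Expected payoff from the subgame at c^t under the 1/2-1/2 strategy."""
    # Quitting at period t + k (probability (1/2)**(k+1)) pays (t+k)/(t+k+1);
    # truncation at `horizon` loses at most (1/2)**horizon.
    return sum(0.5 ** (k + 1) * (t + k) / (t + k + 1) for k in range(horizon))

for t in range(30):
    # One step ahead: quit now (prob 1/2, payoff t/(t+1)) or reach c^{t+1}.
    one_step = 0.5 * t / (t + 1) + 0.5 * guarantee(t + 1)
    assert guarantee(t) <= one_step + 1e-9   # submartingale inequality
print("submartingale check passed")
```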
4 Conditions for Strategies to be Subgame \(\phi \)-Maxmin
4.1 n-Day Maxmin Strategies and Equalizing Strategies
The goal of this section is to provide necessary and sufficient conditions for a strategy to be subgame \(\phi \)-maxmin and to provide a characterization of subgame maxmin strategies.
Definition 4.1
A strategy \(\sigma \in {\mathcal {S}}_{1}\) is an n-day \(\phi \)-maxmin strategy in the game \(\Gamma _{x_{0}}(u)\) if for every \(t \in {\mathbb {N}}\), for every history \(h \in {\mathcal {H}}^{t}\) of length t, and for every strategy \(\tau \in {\mathcal {S}}_2\),
$$\begin{aligned} {\mathbb {E}}_{h,\sigma ,\tau }[{\underline{V}}^{t+n}]\ge {\underline{v}}(h)-\phi (h). \end{aligned}$$
(4.1)
Definition 4.2
A strategy \(\sigma \in {\mathcal {S}}_{1}\) is \(\phi \)-equalizing in the game \( \Gamma _{x_{0}}(u) \) if for every \(t \in {\mathbb {N}}\), for every history \(h \in {\mathcal {H}}^{t}\) of length t, and for every strategy \(\tau \in {\mathcal {S}}_2, \)
$$\begin{aligned} u\ge \limsup _{t \rightarrow \infty } {\underline{V}}^{t}-\phi (h), \quad {\mathbb {P}}_{h,\sigma ,\tau }\text {-almost surely.} \end{aligned}$$
(4.2)
When \(\phi = 0\), we use the terms n-day maxmin and equalizing to mean n-day 0-maxmin and 0-equalizing, respectively.
The first definition is very intuitive. It says that player 1 should play in such a way that, on average, the lower value does not decrease over time. The notion of 1-day maxmin strategies is particularly well known in dynamic programming and stochastic games; see Puterman ([16]). A simple characterization of 1-day maxmin strategies is provided in the following theorem.
Theorem 4.3
Consider a strategy \(\sigma \in {\mathcal {S}}_{1}\) in the game \(\Gamma _{x_{0}}(u)\). The following three conditions are equivalent:
1.
For each \(n \in {\mathbb {N}}\), \(\sigma \) is an nday maxmin strategy.
2.
\(\sigma \) is a 1-day maxmin strategy.
3.
For each history \(h \in {\mathcal {H}}^{t}\) of length t and each strategy \(\tau \in {\mathcal {S}}_{2},\) the process \(({\underline{V}}^{t + n})_{n \in {\mathbb {N}}}\) is a \({\mathbb {P}}_{h,\sigma ,\tau }\)-submartingale.
Proof
That condition 1 implies condition 2 is obvious. That condition 2 implies condition 3 follows by the law of iterated expectations. Finally, that condition 3 implies condition 1 follows from the properties of a submartingale. \(\square \)
Note that the equivalence in Theorem 4.3 is no longer true for nonzero tolerance functions. For example, if \(\phi >0\), then a 1-day \(\phi \)-maxmin strategy is generally not 2-day \(\phi \)-maxmin.
A strategy is \(\phi \)equalizing if, roughly speaking, it almost surely results in a play with an eventually good enough payoff, where eventually good enough means being about as large as the lower value in very deep subgames.
The following example illustrates both the notion of a 1-day maxmin strategy and that of an equalizing strategy.
Example 4.4
Consider the centipede game shown in Fig. 2. At every history, the active player can choose to continue (c) or to quit (q). As soon as a player decides to quit, the game ends and the payoff is as given in Fig. 2. If the game continues indefinitely, then the payoff is 0. It is easily verified that for every \( t \in {\mathbb {N}}, \) \( {\underline{v}}(c^{2t}) = {\underline{v}}(c^{2t+1}) = (t+1)/(t+2). \) A pure strategy \(\sigma \in {\mathcal {S}}_1\) is then easily seen to be a 1-day maxmin strategy if and only if \(\sigma (c^{2t})=c\) for every \(t\in {\mathbb {N}}\).
On the other hand, a pure strategy \(\sigma \in {\mathcal {S}}_{1}\) is equalizing if and only if for infinitely many \( t \in {\mathbb {N}} \) it holds that \(\sigma (c^{2t})=q.\) Indeed, if \( \sigma \) chooses q for infinitely many \( t \in {\mathbb {N}}, \) then after each history \( h \in \mathcal{H} \) the generated play ends with q, irrespective of \( \tau \in {{{\mathcal {S}}}}_{2}, \) which shows that \( \sigma \) is equalizing. If, on the other hand, \( \sigma \) chooses q for only finitely many \( t \in {\mathbb {N}}, \) then there is a history \( h \in {{{\mathcal {H}}}} \) after which player 1 always continues. If \( \tau \in {{{\mathcal {S}}}}_{2} \) is such that player 2 always chooses continue, then the play \( c^{\infty } \) with \( u(c^{\infty }) = 0 \) results, and 0 is strictly less than \( \limsup _{t\rightarrow \infty } {\underline{v}}(c^t)=1, \) so such a \( \sigma \) is not equalizing.
It follows that player 1 does not have a pure strategy which is both 1-day maxmin and equalizing. \(\square \)
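The tension in Example 4.4 can be checked numerically. The quit payoffs used below are hypothetical (Fig. 2 is not reproduced here) but are chosen to be consistent with the values \( {\underline{v}}(c^{2t}) = (t+1)/(t+2) \) stated in the example; the sketch verifies that the always-continue strategy never lets the lower value drop, yet falls short of the limit of the values.

```python
from fractions import Fraction

# Hypothetical quit payoffs (Fig. 2 is not reproduced here), chosen to be
# consistent with Example 4.4: player 1 quitting at c^{2t} pays t/(t+1),
# and the infinite play c^infinity pays 0.

def value(t):
    """Lower value at history c^t, as stated in Example 4.4."""
    return Fraction(t // 2 + 1, t // 2 + 2)

# "Always continue" is 1-day maxmin: the lower value never drops along the play.
assert all(value(t + 1) >= value(t) for t in range(100))

# Quitting at c^{2t} is not 1-day maxmin under the hypothetical payoffs:
# the quit payoff t/(t+1) lies below the current value (t+1)/(t+2).
assert all(Fraction(t, t + 1) < value(2 * t) for t in range(100))

# But always continuing is not equalizing: if player 2 also always
# continues, the realized payoff is 0, while value(t) tends to 1.
payoff_always_continue = Fraction(0)
assert payoff_always_continue < 1
```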
The following theorem states sufficient conditions under which a strategy \( \sigma \) of player 1 is a subgame \(\phi \)-maxmin strategy.
Theorem 4.5
(Sufficient condition) Let \(\phi :{\mathcal {H}}\rightarrow [0,\infty )\) be a tolerance function. The strategy \(\sigma \in {{{\mathcal {S}}}}_{1} \) is a subgame \(\phi \)-maxmin strategy in the game \(\Gamma _{x_{0}}(u)\) if there exist tolerance functions \(\phi _1:{\mathcal {H}}\rightarrow [0,\infty )\) and \(\phi _2:{\mathcal {H}}\rightarrow [0,\infty )\) such that \( \phi _1+\phi _2 \le \phi \) and
1.
for every \(n\in {\mathbb {N}}\), \(\sigma \) is n-day \(\phi _1\)-maxmin,
2.
\(\sigma \) is \(\phi _2\)-equalizing.
Proof
Let \(\phi _1, \) \(\phi _2, \) and \( \sigma \) be such that the conditions in the theorem are satisfied. We show that \(\sigma \) is a subgame \(\phi \)-maxmin strategy.
Fix a history \(h\in {\mathcal {H}}^{t}\) and a strategy \(\tau \in {\mathcal {S}}_2\). Then, we have that
$$\begin{aligned} {\underline{v}}(h)-\phi (h)&\le {\underline{v}}(h)-\phi _1(h)-\phi _2(h)\\&\le \limsup _{n\rightarrow \infty }{\mathbb {E}}_{h,\sigma ,\tau }[{\underline{V}}^{t+n}] - \phi _2(h)\\&\le {\mathbb {E}}_{h,\sigma ,\tau }[\limsup _{n\rightarrow \infty } {\underline{V}}^{t+n}] - \phi _2(h)\\&\le {\mathbb {E}}_{h,\sigma ,\tau }[u], \end{aligned}$$
where the second inequality holds since \(\sigma \) is an n-day \(\phi _1\)-maxmin strategy, the third inequality is by Fatou's lemma, and the last inequality holds since \(\sigma \) is \(\phi _2\)-equalizing. \(\square \)
According to Theorem 4.5, to conclude that \( \sigma \) is a subgame \( \phi \)-maxmin strategy, it is enough to find tolerance functions \( \phi _1 \) and \( \phi _2 \) such that at each history their sum does not exceed \(\phi \) and the strategy \( \sigma \) is both n-day \(\phi _1\)-maxmin and \(\phi _2\)-equalizing.
Particularly natural are situations where the tolerance level does not increase as time progresses. More formally, the tolerance function \(\phi \) is said to be nonincreasing if \(\phi (h) \ge \phi (h')\) whenever \(h \prec h'\). The following result states necessary conditions for a strategy to be subgame \( \phi \)-maxmin.
Theorem 4.6
(Necessary condition) Let \(\sigma \in {{{\mathcal {S}}}}_{1} \) be a subgame \(\phi \)-maxmin strategy in the game \(\Gamma _{x_{0}}(u).\) Then, it holds that:
1.
For every \(n\in {\mathbb {N}}\), \(\sigma \) is an n-day \(\phi \)-maxmin strategy.
2.
If the tolerance function \(\phi \) is nonincreasing, then \(\sigma \) is \(\phi \)-equalizing.
Proof
Let \(\sigma \in {{{\mathcal {S}}}}_{1} \) be a subgame \(\phi \)-maxmin strategy in the game \(\Gamma _{x_{0}}(u). \) Take a history \(h \in {\mathcal {H}}^{t}\) of length t and a strategy \( \tau \in \mathcal{S}_{2}. \)
We prove condition 1. We have
$$\begin{aligned}{\mathbb {E}}_{h,\sigma ,\tau }[{\underline{V}}^{t+n}] \ge {\mathbb {E}}_{h,\sigma ,\tau }[{\underline{U}}_{\sigma }^{t+n}] \ge {\mathbb {E}}_{h,\sigma ,\tau }[{\underline{U}}_{\sigma }^{t}] = {\underline{u}}(\sigma ,h) \ge {\underline{v}}(h)-\phi (h),\end{aligned}$$
where the first inequality holds since for each play \(p \in {\mathcal {P}}(h)\) we have \({\underline{V}}^{t+n}(p) = {\underline{v}}(p_{\vert t + n}) \ge {\underline{u}}(\sigma ,p_{\vert t + n}) = {\underline{U}}_{\sigma }^{t+n}(p)\). The second inequality holds by Lemma 3.3. The next equality holds since t is the length of the history h, and the final inequality holds since \(\sigma \) is a subgame \(\phi \)-maxmin strategy.
We prove condition 2. We have, \({\mathbb {P}}_{h,\sigma ,\tau }\)-almost surely,
$$\begin{aligned}u(p) = \lim _{t \rightarrow \infty }{\mathbb {E}}_{p_{\vert t},\sigma ,\tau }[u] \ge \limsup _{t \rightarrow \infty }({\underline{v}}(p_{\vert t})-\phi (p_{\vert t})) \ge \limsup _{t \rightarrow \infty }{\underline{V}}^{t}(p)-\phi (h),\end{aligned}$$
where the equality is by Levy's zero-one law (Lemma A.1 in Appendix A), the first inequality follows since \( \sigma \) is a subgame \( \phi \)-maxmin strategy, and the second inequality holds since \(\phi \) is nonincreasing. \(\square \)
Notice that in the case of a nonzero tolerance function, the necessary and sufficient conditions do not coincide, and we do not obtain a characterization of subgame \( \phi \)-maxmin strategies. We now turn to the case where the tolerance function \(\phi \) is identically equal to 0.
Corollary 4.7
A strategy \(\sigma \in {{{\mathcal {S}}}}_{1} \) is a subgame maxmin strategy in the game \(\Gamma _{x_{0}}(u)\) if and only if it is 1-day maxmin and equalizing.
When we compare the sufficient conditions of Theorem 4.5 to those of Corollary 4.7, we notice that in the case of a nonzero tolerance function we require the strategy to be n-day \(\phi \)-maxmin, whereas in the case of a zero tolerance function it suffices that the strategy is 1-day maxmin. The reason for this difference is that we must prevent strategies from accumulating the allowed tolerance levels, which would make them too permissive over time. The following example illustrates this issue.
Example 4.8
Consider the decision problem depicted in Fig. 3. Each period the decision maker can choose to continue (c) or to quit (q). Notice that \( v(c^t)= 1/(t+1) \) and hence \(\displaystyle \lim _{t\rightarrow \infty } v(c^t)=0\). Any strategy of player 1 is therefore equalizing. Furthermore, it is clear that in this decision problem the subgame maxmin strategy is unique and requires the decision maker to quit at every period. Now suppose the decision maker has the following tolerance function:
$$\begin{aligned} \phi (c^t)=v(c^t)-v(c^{t+1})={\frac{1}{t+1}}-{\frac{1}{t+2}}, \quad t \in {\mathbb {N}}. \end{aligned}$$
Consider the strategy \(\sigma \) where the decision maker always chooses to continue. From the definition of the tolerance function, it follows that the strategy \(\sigma \) is a 1-day \(\phi \)-maxmin strategy. Indeed, it holds that \(v(c^{t+1})\ge v(c^t)-\phi (c^t)\). It is also easily seen that the strategy \( \sigma \) is equalizing. Nevertheless, it is clear that \( \sigma \) is not a subgame \(\phi \)-maxmin strategy.
The underlying problem with the strategy \( \sigma \) is that every time the decision maker chooses to continue, this causes an acceptable loss in the value, but over time these losses add up to an unacceptable loss. The requirement that for every \(n\in {\mathbb {N}}\) a strategy is n-day \(\phi \)-maxmin guarantees that the accumulated losses over any finite period of time remain acceptable.
If we require a strategy to be subgame maxmin, then we do not tolerate any losses. Therefore, the accumulation problem mentioned above never occurs, and it is sufficient to only require that the strategy is 1-day maxmin. \(\Diamond \)
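The telescoping losses described above can be made concrete with a short computation based on the data of Example 4.8; all quantities below are taken directly from the example.

```python
from fractions import Fraction

def v(t):
    """Value at history c^t in the decision problem of Fig. 3."""
    return Fraction(1, t + 1)

def phi(t):
    """The tolerance phi(c^t) = v(c^t) - v(c^{t+1}) of Example 4.8."""
    return v(t) - v(t + 1)

# Each single step of "continue" loses exactly the tolerated amount,
# so always-continue is 1-day phi-maxmin.
assert all(v(t + 1) == v(t) - phi(t) for t in range(50))

# But the losses telescope: after n steps the accumulated loss is
# v(c^0) - v(c^n) = 1 - 1/(n+1), which approaches 1, not 0.
n = 1000
accumulated = sum(phi(t) for t in range(n))
assert accumulated == v(0) - v(n) == 1 - Fraction(1, n + 1)
```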
Example 4.9 presents a game in which player 1 has a maxmin strategy in every subgame, but has no subgame maxmin strategy. From Corollary 4.7, it follows that any subgame maxmin strategy needs to be both 1-day maxmin and equalizing. In the example, all strategies are 1-day maxmin but none of them is equalizing, and therefore subgame maxmin strategies do not exist.
Example 4.9
Consider the following perfect information game. Both players have two actions, left (L) and right (R), so \({\mathcal {A}}={\mathcal {B}}=\left\{ L, R\right\} \). The players take turns playing an action, which generates a play \(p\in \left\{ L, R\right\} ^{\mathbb {N}}\). Let \(r_1(p)\) and \(r_2(p)\) denote the number of times that player 1 and player 2, respectively, play action R during the play p and define \(r(p)=\min \lbrace r_1(p),r_2(p)\rbrace . \) Player 1 obtains a payoff of 0 if both players choose R infinitely often. When at least one of them chooses R only a finite number of times, player 1 receives a payoff of \( r(p)/(r(p)+1). \) The payoff function u is therefore obtained by defining, for \( p \in \left\{ L, R\right\} ^{\mathbb {N}}, \)
$$\begin{aligned} u(p)= \left\{ \begin{array}{ll} \frac{r(p)}{r(p)+1}, &{} \text{ if } r_1(p) \ne \infty \text{ or } r_2(p) \ne \infty ,\\ 0, &{} \text{ if } r_1(p)=r_2(p)=\infty . \end{array} \right. \end{aligned}$$
(4.3)
At each history \(h\in {\mathcal {H}}\), the value of the game exists and is given by \(v(h)= r_2(h)/(r_2(h)+1)\), where \(r_2(h)\) denotes the number of times player 2 has chosen R in the history h. Indeed, player 1 can guarantee this payoff by choosing the action R \(\max \{r_2(h)-r_1(h),0\} \) times after history h. Player 2 can guarantee to lose at most this amount by playing only left after history h. Hence at every history \(h\in {\mathcal {H}}\), player 1 has a maxmin strategy.
For every history \(h\in {\mathcal {H}}\) where player 1 takes an action and for every action \(a\in {\mathcal {A}} \), we have that \(r_2(ha)=r_2(h)\) and \(v(ha)=v(h)\). Therefore, all strategies of player 1 are 1-day maxmin.
On the other hand, no equalizing strategy exists for player 1. To see this, take any strategy \(\sigma \in {{{\mathcal {S}}}}_{1} \) for player 1 and consider the strategy \(\tau \in {{{\mathcal {S}}}}_{2} \) in which player 2 always chooses R. Let p be the play generated by the strategy profile \((\sigma ,\tau )\). Then, we have that \(u(p)<1\) and \(\displaystyle \lim _{t\rightarrow \infty }v(p_{\vert t})=1\). It follows that \( \sigma \) is not equalizing. Using Corollary 4.7, we conclude that player 1 does not have a subgame maxmin strategy. \(\square \)
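A small numerical sketch of Example 4.9, using None as a stand-in for "R is chosen infinitely often":

```python
from fractions import Fraction

def v(r2_in_h):
    """Value at a history in which player 2 has chosen R r2_in_h times."""
    return Fraction(r2_in_h, r2_in_h + 1)

def payoff(r1, r2):
    """The payoff u(p) of (4.3), with None standing for infinity."""
    if r1 is None and r2 is None:
        return Fraction(0)
    r = min(x for x in (r1, r2) if x is not None)  # r(p) = min(r1, r2)
    return Fraction(r, r + 1)

# Against the strategy tau in which player 2 always chooses R (so that
# r2 = infinity), every strategy of player 1 yields a payoff below 1 ...
for r1 in list(range(100)) + [None]:
    assert payoff(r1, None) < 1

# ... while the values v(p|t) along the play tend to 1, so no strategy
# of player 1 is equalizing.
assert v(10**6) == 1 - Fraction(1, 10**6 + 1)
```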
4.2 The Case of an Upper Semicontinuous Payoff Function
The remainder of this section is devoted to the case where the payoff function is upper semicontinuous. We argue that in this case, any strategy of player 1 is equalizing. Because all upper semicontinuous functions are Borel measurable and because we assumed finite action sets and a countable state space, the value exists, see Maitra and Sudderth ([13]) and Martin ([14]), and the lower value equals the value.
The function u is upper semicontinuous at a play \(p \in {\mathcal {P}}\) if for every sequence \(\{p_{t}\}_{t \in {\mathbb {N}}}\) of plays converging to p it holds that
$$\begin{aligned}\limsup _{t \rightarrow \infty }u(p_{t}) \le u(p).\end{aligned}$$
Lemma 4.10
Let the payoff function u be upper semicontinuous at the play p. Then, we have that
$$\begin{aligned} u(p)\ge \limsup _{t\rightarrow \infty } {\underline{v}}(p_{\vert t}). \end{aligned}$$
(4.4)
Proof
Fix \(\epsilon >0\). For every \(t \in {\mathbb {N}}, \) define \( h_{t} = p_{\vert t} \) and let \( p_{t} \in {\mathcal {P}}(h_{t})\) be such that \(u(p_t) \ge {\underline{v}}(h_{t})-\epsilon \). Such a play \( p_{t} \) exists as player 1 can guarantee a payoff of at least \({\underline{v}}(h_{t})-\epsilon \) at history \(h_{t}\). Since the sequence \( \{p_{t}\}_{t \in {\mathbb {N}}}\) converges to p, we have
$$\begin{aligned} u(p)\ge \limsup _{t \rightarrow \infty } u(p_t)\ge \limsup _{t \rightarrow \infty } {\underline{v}}(h_{t})-\epsilon . \end{aligned}$$
Since this holds for every \(\epsilon >0,\) the lemma follows. \(\square \)
Corollary 4.11
Let \( \Gamma _{x_{0}}(u) \) be such that u is upper semicontinuous. Then, each strategy of player 1 is equalizing. The strategy \(\sigma \in {\mathcal {S}}_1\) is a subgame \(\phi \)-maxmin strategy if and only if for every \( n \in {\mathbb {N}} \) it is an n-day \(\phi \)-maxmin strategy. The strategy \(\sigma \in {{{\mathcal {S}}}}_{1}\) is a subgame maxmin strategy if and only if it is a 1-day maxmin strategy.
Example 4.12
(Staying in the set game) Let some subset \({\mathcal {X}}^{*}\) of \({\mathcal {X}}\) be given. For a play \(p = (x_{0},a_1,b_1,x_1,a_2,b_2,\dots )\), we define u(p) to be 1 if \(x_t \in {\mathcal {X}}^{*}\) for every \(t \in {\mathbb {N}}\) and to be 0 otherwise. Maitra and Sudderth ([12]) refer to such a payoff function as “staying in a set” and in the computer science literature it goes under the name of “safety objective,” see Bruyère ([4]). Since u is upper semicontinuous, any strategy \( \sigma \in {{{\mathcal {S}}}}_{1} \) is equalizing. \(\square \)
Example 4.13
Consider again the centipede game shown in Fig. 2, but with one slight modification. If the game continues indefinitely, then player 1 receives a payoff of 2 instead of 0. The payoff function is now upper semicontinuous. As argued in Example 4.4, the strategy \( \sigma \) in which player 1 continues at each history is 1-day maxmin. By Corollary 4.11, we conclude that the strategy \(\sigma \) is a subgame maxmin strategy. \(\square \)
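As a rough numerical check of Example 4.13, one can compute the worst-case payoff of the always-continue strategy at every subgame. The quit payoffs below are hypothetical (Fig. 2 is not reproduced here) but are consistent with the values stated in Example 4.4.

```python
from fractions import Fraction

# Hypothetical quit payoffs for Fig. 2: player 2 quitting at c^{2t+1}
# pays (t+1)/(t+2); in this modification the infinite play pays 2.

def v(s):
    """Lower value at history c^s, as in Example 4.4."""
    return Fraction(s // 2 + 1, s // 2 + 2)

def guarantee_always_continue(s, horizon=200):
    """Worst-case payoff of 'always continue' at subgame c^s: player 2
    either quits at some odd period >= s, or never quits (payoff 2)."""
    quit_payoffs = [Fraction(t + 1, t + 2)
                    for t in range(horizon) if 2 * t + 1 >= s]
    return min(quit_payoffs + [Fraction(2)])

# Always-continue guarantees the value at every subgame, illustrating
# that it is a subgame maxmin strategy in the modified game.
assert all(guarantee_always_continue(s) == v(s) for s in range(50))
```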
Note that the limit average payoff is generally not upper semicontinuous. Hence, in games where the payoff function is given by the limit average payoff, there may exist strategies of the maximizing player which are not equalizing.
5 Existence of Subgame \(\phi \)-Maxmin Strategies
The goal of this section is to give sufficient conditions for the existence of a subgame \(\phi \)-maxmin strategy when the tolerance function \( \phi \) is positive. We will prove the following theorem.
Theorem
Let a game \( \Gamma _{x_{0}}(u) \) and a tolerance function \(\phi > 0\) be given such that for every \(p\in {\mathcal {P}}\) at least one of the following two conditions holds:
1.
(point of upper semicontinuity) The function u is upper semicontinuous at p.
2.
(positive limit inferior) \(\liminf _{t\rightarrow \infty } \phi (p_{\vert t}) > 0\).
Then, there exists a subgame \(\phi \)-maxmin strategy in the game \(\Gamma _{x_{0}}(u).\)
Two special cases of the above theorem are worth noting. One is when the payoff function is everywhere upper semicontinuous. In this special case, the existence of a subgame \(\phi \)-maxmin strategy (in fact, of a subgame maxmin strategy) follows by a result of Laraki, Maitra, and Sudderth ([10]). We note that the latter result is not entirely subsumed by the theorem above, since the authors allow for a general Borel state space. Another special case of interest is when along each play the tolerance remains bounded away from zero (i.e., when each play satisfies condition 2 of the theorem). This special case, to the best of our knowledge, has not been explicitly considered in the literature. However, it does encompass the familiar situation of a constant tolerance function \(\phi (h) \equiv \epsilon > 0\). For such tolerance functions, subgame \(\phi \)-maxmin strategies are exactly subgame \(\epsilon \)-optimal strategies, and existence follows by Proposition 11 in Mashiah-Yaakovi ([15]). Note that the case when all plays satisfy condition 2 of the theorem is more general than the case of a constant tolerance function, since no uniform lower bound on the tolerance is imposed. A novel feature of the above theorem is the way it combines these various special cases. We note that the theorem remains original even if we restrict our attention to Borel measurable payoff functions.
The main technique used in the proof of this theorem is that of switching strategies. The idea of switching between progressively “more optimal” strategies is certainly not new. A similar construction is used, for example, in Rosenberg, Solan, and Vieille ([18]), Solan and Vieille ([20]), and Mashiah-Yaakovi ([15]). However, the method needs to be fine-tuned in order for it to accommodate both the plays where the tolerance remains bounded away from zero (condition 2) and those where the payoff function is upper semicontinuous (condition 1). It is not straightforward to use this proof technique for a general Borel state space, as the selection of switching strategies would need to satisfy measurability conditions. It is therefore an interesting open question whether the result can be extended to that case.
The switching strategy \( \sigma ^{\phi } \) is formally defined in Sect. 5.1. To analyze its properties, we define in Sect. 5.2, for every \( k \in {\mathbb {N}}, \) a strategy \( \sigma ^{k} \) that coincides with \( \sigma ^{\phi } \) as long as at most k switches have been made and stops switching thereafter. We also present three technical lemmas on finite switching strategies in this subsection, the most important one showing that the guaranteed expected payoff of the strategy \( \sigma ^{k} \) is nondecreasing in k. Sect. 5.3 uses the lemmas of Sect. 5.2 to establish the desired existence result, Theorem 5.7.
5.1 Definition of the Switching Strategy \(\sigma ^\phi \)
Fix a tolerance function \(\phi >0\). For every history \(h\in {\mathcal {H}}, \) player 1 has a \( (\phi (h)/2)\)-maxmin strategy for the subgame at history h, denoted by \(\sigma ^{h}\). The function \( \psi : {{{\mathcal {H}}}} \rightarrow {{{\mathcal {H}}}} \) maps histories into histories and is such that player 1 uses the strategy \( \sigma ^{\psi (h)} \) at the subgame at history \( h \in {{{\mathcal {H}}}}. \) The function \( \psi \) describes when player 1 switches strategies and is recursively defined by setting \( \psi (x_{0})=x_{0}\) and, for every \(h\in {\mathcal {H}}\) and every \(z\in {\mathcal {Z}},\)
$$\begin{aligned} \psi (hz)= \left\{ \begin{array}{ll} \psi (h), &{} \text{ if } {\underline{u}}(\sigma ^{\psi (h)},hz) \ge {\underline{v}}(hz)-\phi (hz),\\ hz, &{} \text{ otherwise }. \end{array} \right. \end{aligned}$$
The condition \({\underline{u}}(\sigma ^{\psi (h)},hz) \ge {\underline{v}}(hz)-\phi (hz)\) verifies whether the strategy to which player 1 switched last, \(\sigma ^{\psi (h)}\), is a \(\phi (hz)\)-maxmin strategy for the subgame at history hz. If this is the case, then there is no need to switch and \( \psi (hz) = \psi (h). \) Otherwise, player 1 switches to \( \sigma ^{hz}, \) which is achieved by setting \( \psi (hz) = hz. \) Formally, we define the switching strategy \(\sigma ^\phi :{\mathcal {H}}\rightarrow \Delta ({\mathcal {A}})\) by
$$\begin{aligned} \sigma ^\phi (h)=\sigma ^{\psi (h)}(h), \quad h \in {{{\mathcal {H}}}}. \end{aligned}$$
(5.1)
The following example illustrates the switching strategy \( \sigma ^{\phi } \) and shows that it may fail to be subgame \(\phi \)-maxmin.
Example 5.1
Consider again the centipede game depicted in Fig. 2. We recall that, for every \( t \in {\mathbb {N}}, \) \( {\underline{v}}(c^{2t}) = {\underline{v}}(c^{2t+1}) = (t+1)/(t+2). \) Take a tolerance function \( \phi \) with the property that, for every \( t \in {\mathbb {N}}, \) \(\phi (c^{2t})< 1/((t+1)(t+2))\).
Let \( h \in {{{\mathcal {H}}}} \) be the active history and let \( k \in {\mathbb {N}} \) be such that \( h = c^{2k} \) or \( h = c^{2k+1}. \) The strategy \( \sigma ^{h} \) in which player 1 chooses continue at periods \(0,2,\ldots ,2k\) and chooses quit at every later period, i.e.,
$$\begin{aligned} \sigma ^{h}(c^{2t})= \left\{ \begin{array}{ll} c, &{} \text{ if } 2t\le 2k, \\ q, &{} \text{ otherwise }, \end{array} \right. \end{aligned}$$
is a maxmin strategy at the subgame h.
We now consider the switching strategy \( \sigma ^{\phi }. \) We show by induction that for every \( h \in {{{\mathcal {H}}}} \) and every \( z \in {{{\mathcal {Z}}}}, \) it holds that \( \psi (hz) = hz \) if hz is an active history of player 1 and \( \psi (hz) = h \) if hz is an active history of player 2. The statement trivially holds for the initial history. Let h be an active history of player 2 and let \( t \in {\mathbb {N}} \) be such that \( h = c^{2t+1}. \) Since \( \sigma ^{h} = \sigma ^{c^{2t}} \) is a maxmin strategy at the subgame h, it holds that \( \psi (c^{2t+1}) = c^{2t}. \) Let h be an active history of player 1 and let \( t \in {\mathbb {N}} \setminus \{0\} \) be such that \( h = c^{2t}. \) We have that
$$\begin{aligned} {\underline{u}}(\sigma ^{\psi (c^{2t-1})},c^{2t}) = {\underline{u}}(\sigma ^{c^{2t-2}},c^{2t}) = \tfrac{t}{t+1} = {\underline{v}}(c^{2t}) - \tfrac{1}{(t+1)(t+2)} < {\underline{v}}(c^{2t}) - \phi (c^{2t}), \end{aligned}$$
so \( \psi (c^{2t}) = c^{2t}. \) Since the tolerance function \(\phi \) is so stringent, it forces player 1 to switch at each of his active histories. For every \( t \in {\mathbb {N}}, \) it holds that \(\sigma ^{\phi }(c^{2t})=\sigma ^{c^{2t}}(c^{2t})=c, \) so under \( \sigma ^{\phi } \) player 1 chooses c at each of his active histories. The strategy \( \sigma ^{\phi } \) is not a subgame \(\phi \)-maxmin strategy as it fails to be \(\phi \)-equalizing, see Example 4.4. \(\square \)
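The switching dynamics of Example 5.1 can be traced in a few lines of code. The guarantee levels below follow the example; the quit payoff \( t/(t+1) \) at \( c^{2t} \) is a hypothetical reading of Fig. 2 that is consistent with the displayed computation of \( {\underline{u}}(\sigma ^{c^{2t-2}},c^{2t}) \).

```python
from fractions import Fraction

def v(s):
    """Lower value at history c^s, as in Example 5.1."""
    return Fraction(s // 2 + 1, s // 2 + 2)

def phi(s):
    """A tolerance with phi(c^{2t}) < 1/((t+1)(t+2)), as in Example 5.1."""
    t = s // 2
    return Fraction(1, 2 * (t + 1) * (t + 2))

def guarantee(switch_at, s):
    """Guarantee of sigma^{c^{switch_at}} at subgame c^s (s >= switch_at):
    the strategy is maxmin until its quit date and quits afterwards,
    with hypothetical quit payoff t/(t+1) at c^{2t}."""
    if s // 2 <= switch_at // 2:
        return v(switch_at)
    return Fraction(s // 2, s // 2 + 1)

psi = 0            # psi(x_0) = x_0, encoded by the length of the history
switches = []
for s in range(1, 21):                   # walk down the play c, c, c, ...
    if guarantee(psi, s) < v(s) - phi(s):
        psi = s                          # switch to the subgame's maxmin strategy
        switches.append(s)

# As Example 5.1 shows, a switch occurs at every active history of
# player 1 after time 0, i.e., at every even period.
assert switches == list(range(2, 21, 2))
```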
5.2 Definition and Analysis of Finite Switching Strategies
Given the switching strategy \( \sigma ^{\phi }, \) for every \(k\in {\mathbb {N}} \), we define the strategy \(\sigma ^k:{\mathcal {H}}\rightarrow \Delta ({\mathcal {A}})\) such that it coincides with \(\sigma ^{\phi }\) as long as at most k switches have been made and stops switching thereafter. Formally, we recursively define the function \(\kappa : {{{\mathcal {H}}}} \rightarrow {\mathbb {N}} \), which counts the number of switches along a history h, by setting \(\kappa (x_{0}) = 0\) and, for all histories \(h,hz\in {\mathcal {H}}\),
$$\begin{aligned} \kappa (hz)= \left\{ \begin{array}{ll} \kappa (h), &{} \text{ if } {\underline{u}}(\sigma ^{\psi (h)},hz) \ge {\underline{v}}(hz)-\phi (hz),\\ \kappa (h)+1, &{} \text{ otherwise }. \end{array} \right. \end{aligned}$$
For every \(k\in {\mathbb {N}}, \) we define the stopping time \(T_k:{\mathcal {P}}\rightarrow {\mathbb {N}}\cup \lbrace \infty \rbrace \) by
$$\begin{aligned} T_k(p) = \inf \lbrace t \in {\mathbb {N}}\vert \kappa (p_{\vert t})=k \rbrace , \quad p \in {{{\mathcal {P}}}}. \end{aligned}$$
(5.2)
The stopping time \( T_{k} \) indicates the time at which switch k occurs. Since, for \(t<\infty ,\) the event \(T_k(p)\le t\) only depends on the history up to period t, the set \(\lbrace p\in {\mathcal {P}}\vert T_k(p)\le t \rbrace \) is \({\mathcal {F}}^{t}\)-measurable and \(T_k\) is indeed a stopping time.
We now formally define \(\sigma ^k:{\mathcal {H}}\rightarrow \Delta ({\mathcal {A}}). \) Take any \(p\in {\mathcal {P}}(h)\) and let
$$\begin{aligned} \sigma ^k(h)=\left\{ \begin{array}{ll} \sigma ^{\phi }(h), &{} \text{ if } \kappa (h)\le k, \\ \sigma ^{h_{\vert T_k(p)}}(h), &{}\text{ otherwise. } \end{array} \right. \end{aligned}$$
(5.3)
If \( \kappa (h) > k, \) then the time at which switch k occurs is the same for every \(p\in {\mathcal {P}}(h), \) so \( \sigma ^{k} \) is well-defined.
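As a sketch of how \( \kappa \) and \( T_{k} \) interact, the snippet below computes them from a given set of switch times, using the play of Example 5.1 (a switch at every even period) as input; None plays the role of \( \infty \).

```python
# kappa counts the switches along a history; T_k is the first time kappa
# reaches k, with None playing the role of infinity. As input we use the
# play of Example 5.1, where a switch occurs at every even period >= 2.

def kappa(s, switch_times):
    """Number of switches up to and including period s."""
    return sum(1 for r in switch_times if r <= s)

def T(k, switch_times, horizon):
    """First period at which the k-th switch has occurred (None if it
    does not occur before the horizon)."""
    return next((s for s in range(horizon + 1)
                 if kappa(s, switch_times) == k), None)

switch_times = list(range(2, 101, 2))

assert kappa(9, switch_times) == 4       # switches at periods 2, 4, 6, 8
assert T(3, switch_times, 100) == 6      # the third switch occurs at period 6
assert T(60, switch_times, 100) is None  # fewer than 60 switches occur
```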
For every \(k\in {\mathbb {N}}, \) let \({\mathcal {R}}_k\subseteq {\mathcal {P}}\) be the set of plays along which at least k switches occur,
$$\begin{aligned} {\mathcal {R}}_k=\lbrace p\in {\mathcal {P}}\vert T_k(p)<\infty \rbrace , \end{aligned}$$
(5.4)
so \({\mathcal {R}}_1\supseteq {\mathcal {R}}_2 \supseteq {\mathcal {R}}_3 \supseteq \cdots \). Furthermore, let \({\mathcal {R}}_{\infty } \subseteq {\mathcal {P}} \) denote the set of plays along which infinitely many switches occur,
$$\begin{aligned} {\mathcal {R}}_\infty = \displaystyle \cap _{k=1}^{\infty } {\mathcal {R}}_k. \end{aligned}$$
(5.5)
The remainder of the section is devoted to the proof of three technical lemmas concerning the finite switching strategies. These lemmas are needed in the analysis of \( \sigma ^{\phi }. \)
Consider the strategies \(\sigma ^{k}, \sigma ^{k+1},\dots \). All these strategies agree with \(\sigma ^{\phi }\) for as long as \(\sigma ^{\phi }\) does not require more than k switches. Consequently, the measures that these strategies induce on \({\mathcal {P}}\) assign the same probability to any event that is “determined” before switch \( k+1 \) occurs, i.e., to any event in the sigma-algebra \({\mathcal {F}}^{T_{k+1}}\).
Lemma 5.2
Let a strategy \(\tau \in {{{\mathcal {S}}}}_{2}, \) a history \(h \in {\mathcal {H}}, \) and some \(k \in {\mathbb {N}}\) be given. For \(\sigma = \sigma ^{k}, \sigma ^{k+1},\dots ,\sigma ^{\phi }, \) the probability measures \({\mathbb {P}}_{h,\sigma ,\tau }\) all coincide on the sigma-algebra \({\mathcal {F}}^{T_{k+1}}\). Furthermore, these probability measures all agree on each universally measurable subset of \({\mathcal {P}} \setminus {\mathcal {R}}_{k + 1}\).
Proof
A set A of the universally measurable sigma-algebra \( {\mathcal {F}}\) is called agreeable if for \(\sigma = \sigma ^{k}, \sigma ^{k+1},\dots ,\sigma ^{\phi }\) the measures \({\mathbb {P}}_{h,\sigma ,\tau }\) all assign the same probability to A.
We argue first that each cylinder set in \({\mathcal {F}}^{T_{k+1}}\) is agreeable. A cylinder set \({\mathcal {P}}(h)\) is a member of \({\mathcal {F}}^{T_{k+1}}\) if and only if \(\kappa (h) \le k+1\). Let a cylinder set \({\mathcal {P}}(h')\) in \({\mathcal {F}}^{T_{k+1}}\) be given. Since \(\kappa (h') \le k + 1\), we know that \(\kappa (h'') \le k\) for each history \(h''\) preceding \(h'\). It follows that \(\sigma ^{k}(h'') = \sigma ^{k + 1}(h'') = \cdots = \sigma ^{\phi }(h'')\). Since this applies to each history \(h''\) that precedes \(h^{\prime }\), the set \({\mathcal {P}}(h')\) is agreeable.
Now take any \(E \in {\mathcal {F}}^{T_{k+1}}\). For \(t \in {\mathbb {N}},\) let \(E_{t} = E \cap \{p \in {{{\mathcal {P}}}} \mid T_{k+1}(p) = t\}\) and let \( E_{\infty } = E \cap \{p \in {{{\mathcal {P}}}} \mid T_{k+1}(p) = \infty \}\). To show that E is agreeable, it suffices to show that the sets \(E_{t}\) and \(E_{\infty }\) are.
Let some \( t \in {\mathbb {N}} \) be given. We know that \(E_{t}\) is a member of \({\mathcal {F}}^{t}\). Consequently, \(E_{t}\) can be written as a disjoint union of cylinder sets in \({\mathcal {F}}^{t}, \) say \(E_{t} = \cup \{C_{n} \mid n \in {\mathbb {N}}\}\), with each \(C_{n}\) a member of \({\mathcal {F}}^{t}\). Since each \(C_{n}\) is a subset of the set \(\{p \in {{{\mathcal {P}}}} \mid T_{k+1}(p)=t\}\), it is a member of \({\mathcal {F}}^{T_{k+1}}, \) so is agreeable by the result of the second paragraph in the proof. It now follows that \(E_{t}\) is agreeable.
To show that \(E_{\infty }\) is agreeable, we make use of the fact that \( E_{\infty } \) is a Borel set and of the regularity of the measures \( {\mathbb {P}}_{h,\sigma ,\tau } \) on the Borel sigma-algebra. Let O be any open subset of \({\mathcal {P}}\) containing \(E_{\infty }\). The set O can be written as a disjoint union of cylinder sets, say \(O = \cup \{{\mathcal {P}}(h_{n}) \mid n \in {\mathbb {N}}\}\). Without loss of generality, for every \(n \in {\mathbb {N}}\) the set \({\mathcal {P}}(h_{n})\) has a point in common with \(E_{\infty }\). Thus in particular, there is \(p \in {\mathcal {P}}(h_{n})\) with \(T_{k+1}(p) = \infty \). This implies that \(\kappa (h_{n}) \le k\) and hence that \({\mathcal {P}}(h_{n})\) is a member of \({\mathcal {F}}^{T_{k+1}}\). We conclude that each \({\mathcal {P}}(h_{n})\) is agreeable by the result of the second paragraph in the proof. It follows that O is an agreeable set. By regularity, the probability of \(E_{\infty }\) under each of the measures is the infimum of the probabilities of the open sets containing \(E_{\infty }\), so \(E_{\infty }\) is agreeable as well.
To prove the second claim, we notice that all Borel subsets of \({\mathcal {P}} \setminus {\mathcal {R}}_{k + 1} = \{p \in {{{\mathcal {P}}}} \mid T_{k+1}(p) = \infty \}\) are members of \({\mathcal {F}}^{T_{k+1}}\), so are agreeable. The result for universally measurable subsets of \(\{p \in {{{\mathcal {P}}}} \mid T_{k+1}(p) = \infty \}\) follows since each such set differs from a Borel set by a negligible set. \(\square \)
Let T be a stopping time. We define the stochastic variable \({\underline{U}}_{\sigma }^{T}\) by letting it agree with \({\underline{U}}_{\sigma }^{t}\) on the set \(\{p \in {\mathcal {P}}:T(p) = t\}\) for each \(t \in {\mathbb {N}}\) and by letting it agree with \({\underline{U}}_{\sigma }^{\infty }\) on the set \(\{p \in {\mathcal {P}}:T(p) = \infty \}, \) where \( {\underline{U}}_{\sigma }^{\infty } \) is some suitably chosen \({\mathcal {F}}^{\infty }\)-measurable stochastic variable. The next lemma relates the guaranteed expected payoffs of the strategies \( \sigma ^{k} \) and \( \sigma ^{k+1} \) at the moment of switch \(k+1\). We write \(\Phi ^t\) to denote the \({\mathcal {F}}^{t}\)-measurable stochastic variable given by \(\Phi ^t(p)=\phi (p_{\vert t})\).
Lemma 5.3
Let \(k \in {\mathbb {N}}\) be given, and let \({\underline{U}}_{\sigma ^{k}}^{\infty }\) and \({\underline{U}}_{\sigma ^{k+1}}^{\infty }\) be \({\mathcal {F}}^{\infty }\)-measurable stochastic variables such that \({\underline{U}}_{\sigma ^{k}}^{\infty } = {\underline{U}}_{\sigma ^{k+1}}^{\infty }\). Then, it holds that
$$\begin{aligned} {\underline{U}}^{T_{k+1}}_{\sigma ^{k}} \le {\underline{U}}^{T_{k+1}}_{\sigma ^{k+1}} - \tfrac{1}{2} \Phi ^{T_{k+1}} \cdot I(T_{k+1} < \infty ). \end{aligned}$$
(5.6)
Proof
Let some \(p\in {\mathcal {P}}\) be given. We distinguish the following two cases.
Case 1: \(T_{k+1}(p)<\infty \). In this case, at least \(k+1\) switches occur along the play p. For \(h = p_{\vert T_{k+1}(p)}\), we have the inequalities
$$\begin{aligned}{\underline{u}}(\sigma ^k, h) < {\underline{v}}(h) - \phi (h) = {\underline{v}}(h) - \tfrac{1}{2}\phi (h) - \tfrac{1}{2}\phi (h) \le {\underline{u}}(\sigma ^{k+1}, h) - \tfrac{1}{2}\phi (h),\end{aligned}$$
where the first inequality holds since \(\sigma ^k\) is not a \(\phi (h)\)-maxmin strategy for the subgame at history h and the second inequality holds because the strategy \(\sigma ^{k+1}\) is a \( (\phi (h)/2)\)-maxmin strategy for the subgame at history h. Since \(I(T_{k+1}(p) < \infty ) = 1\), (5.6) holds.
Case 2: \(T_{k+1}(p)=\infty \). In this case, we have
$$\begin{aligned}{\underline{U}}^{T_{k+1}}_{\sigma ^{k}}(p) = {\underline{U}}_{\sigma ^{k}}^{\infty }(p) = {\underline{U}}_{\sigma ^{k+1}}^{\infty }(p) = {\underline{U}}^{T_{k+1}}_{\sigma ^{k+1}}(p).\end{aligned}$$
Thus, (5.6) holds as an equality. \(\square \)
The following lemma states the intuitive property that, for histories with fewer than \( k+1 \) switches or histories at which switch \( k+1 \) occurs, strategy \( \sigma ^{k+1} \) guarantees player 1 at least the same payoff as strategy \( \sigma ^{k}. \)
Lemma 5.4
Let \( t \in {\mathbb {N}} \) and a history \( h \in {{{\mathcal {H}}}}^{t} \) of length t be given. Let \( k \in {\mathbb {N}} \) be such that \( T_{k+1}(p) \ge t \) for every \( p \in {{{\mathcal {P}}}}(h). \) Then, it holds that \({\underline{u}}(\sigma ^{k},h) \le {\underline{u}}(\sigma ^{k + 1},h)\).
Proof
Fix strategy \( \tau \in {\mathcal {S}}_2 \) of player 2.
We first define \({\underline{U}}_{\sigma ^{k}}^{\infty }\). Consider the probability measure \({\mathbb {Q}}\) on the measurable space \(({\mathcal {P}},{\mathcal {F}})\) given by \({\mathbb {Q}}(A) = \sum _{k \in {\mathbb {N}}} 2^{-k-1} {\mathbb {P}}_{h,\sigma ^{k},\tau }(A)\) for each \(A \in {\mathcal {F}}\). Let \({\bar{u}}\) be an \({\mathcal {F}}^{\infty }\)-measurable stochastic variable with the property that \({\bar{u}} = u\), \({\mathbb {Q}}\)-almost surely. Since \({\mathbb {P}}_{h,\sigma ^{k},\tau }\) is absolutely continuous with respect to \({\mathbb {Q}}\), it holds that \( {\bar{u}} = u\), \({\mathbb {P}}_{h,\sigma ^{k},\tau }\)-almost surely, for every \(k \in {\mathbb {N}}\). We define \( {\underline{U}}_{\sigma ^{k}}^{\infty }\) to be equal to \({\bar{u}}\), for every \(k \in {\mathbb {N}}\).
We now obtain the following inequalities. First, by Lemma 3.3 and the optional sampling theorem with unbounded stopping times as presented in Theorem 8.16 of Yeh ([22], p. 139), we have
$$\begin{aligned}{\underline{u}}(\sigma ^{k},h) \le {\mathbb {E}}_{h,\sigma ^k,\tau }[{\underline{U}}^{T_{k+1}}_{\sigma ^k}].\end{aligned}$$
Secondly, from the fact that \({\underline{U}}^{T_{k+1}}_{\sigma ^k}\) is an \({\mathcal {F}}^{T_{k+1}}\)-measurable stochastic variable, we obtain by Lemma 5.2 that
$$\begin{aligned}{\mathbb {E}}_{h,\sigma ^k,\tau }[{\underline{U}}^{T_{k+1}}_{\sigma ^k}] = {\mathbb {E}}_{h,\sigma ^{k+1},\tau }[{\underline{U}}^{T_{k+1}}_{\sigma ^k}].\end{aligned}$$
Thirdly, we know that
$$\begin{aligned} {\underline{U}}^{T_{k+1}}_{\sigma ^k} \le {\underline{U}}^{T_{k+1}}_{\sigma ^{k+1}} \le {\mathbb {E}}_{h,\sigma ^{k+1},\tau }[u \vert {\mathcal {F}}^{T_{k+1}}], \quad {\mathbb {P}}_{h,\sigma ^{k+1},\tau }\text {-almost surely}, \end{aligned}$$
where the first of these inequalities follows from Lemma 5.3 and the second one follows by Lemma 3.3 and the optional sampling theorem with unbounded stopping times as presented in Theorem 8.16 of Yeh ([22], p. 139). Taking the expectation of the last array of inequalities with respect to \({\mathbb {P}}_{h,\sigma ^{k+1},\tau }\) and making use of the law of iterated expectation yields
$$\begin{aligned} {\mathbb {E}}_{h,\sigma ^{k+1},\tau }[{\underline{U}}^{T_{k+1}}_{\sigma ^k}] \le {\mathbb {E}}_{h,\sigma ^{k+1},\tau }[u]. \end{aligned}$$
Combining these facts yields \({\underline{u}}(\sigma ^{k},h) \le {\mathbb {E}}_{h,\sigma ^{k+1},\tau }[u]\). Taking the infimum over all strategies \(\tau \) of player 2 completes the proof. \(\square \)
5.3 Properties of the Switching Strategy \(\sigma ^\phi \)
We first show that the switching strategy \( \sigma ^{\phi } \) is n-day \(\phi \)-maxmin for every \(n \in {\mathbb {N}}\).
Lemma 5.5
Let a game \( \Gamma _{x_{0}}(u) \) and a tolerance function \( \phi > 0 \) be given. For every \( n \in {\mathbb {N}}, \) the switching strategy \(\sigma ^\phi \) is n-day \(\phi \)-maxmin.
Proof
Let a history \( h \in {{{\mathcal {H}}}} \) of length t be given. Let \( k = \kappa (h)\) denote the number of switches that have occurred along the history h. For every \( \tau \in {{{\mathcal {S}}}}_{2}, \) we obtain the chain of inequalities
$$\begin{aligned} {\underline{v}}(h) - \phi (h) \le {\underline{u}}(\sigma ^{k},h) \le {\underline{u}}(\sigma ^{t+n},h) \le {\mathbb {E}}_{h,\sigma ^{t+n},\tau }[{\underline{U}}_{\sigma ^{t+n}}^{t+n}] \le&{\mathbb {E}}_{h,\sigma ^{t+n},\tau }[{\underline{V}}^{t+n}]\\ =&{\mathbb {E}}_{h,\sigma ^{\phi },\tau }[{\underline{V}}^{t+n}], \end{aligned}$$
where the first inequality holds since \(\sigma ^{k}\) is a \(\phi (h)\)-maxmin strategy for the subgame at history h, the second inequality holds by Lemma 5.4 since \( k \le t \le t+n\), the third inequality holds by Lemma 3.3, and the fourth one follows since \({\underline{U}}_{\sigma ^{t+n}}^{t+n} \le {\underline{V}}^{t+n}. \) The final equality follows from the fact that the stochastic variable \({\underline{V}}^{t+n}\) is \({\mathcal {F}}^{t+n}\)-measurable and the fact that the strategy \(\sigma ^{t+n}\) coincides with \(\sigma ^{\phi }\) at least until time \(t+n\). \(\square \)
The next lemma says that, along almost every play \(p \in {\mathcal {P}}\), either only finitely many switches occur or the tolerance level goes to zero. This lemma is crucial to show the equalizing property of switching strategies.
Lemma 5.6
For every history \(h\in {\mathcal {H}},\) for every strategy \( \tau \in {{{\mathcal {S}}}}_{2} \) of player 2, it holds that
$$\begin{aligned} \lim _{k \rightarrow \infty } \Phi ^{T_k} \cdot I(T_k<\infty ) = 0, \quad {\mathbb {P}}_{h,\sigma ^{\phi },\tau }\text {-almost surely.} \end{aligned}$$
Proof
Although it is possible that player 1 switches infinitely many times along a play p, and therefore incurs infinitely many increases in his guarantee level along this play, we first show that the total increase in his guarantee level is bounded, since the payoff function itself is bounded. More formally, let \( t \in {\mathbb {N}}, \) a history \( h \in {\mathcal {H}}^{t} \) of length t, and a strategy \( \tau \in {\mathcal {S}}_2 \) of player 2 be given. Let \( \kappa (h)\) denote the number of switches that have occurred along the history h. We show that
$$\begin{aligned} \sum _{k=\kappa (h)+1}^{\infty } {\mathbb {E}}_{h,\sigma ^{\phi },\tau }\left[ \tfrac{1}{2}\Phi ^{T_k} \cdot I\left( T_k<\infty \right) \right] \le 2M, \end{aligned}$$
(5.7)
where we recall that \( M = \sup _{p\in {\mathcal {P}}}\vert u(p)\vert . \)
As in the proof of Lemma 5.4, let \({\bar{u}}\) be any \({\mathcal {F}}^{\infty }\)-measurable stochastic variable with the property that, for every \( k \in {\mathbb {N}}, \) \({\bar{u}} = u,\) \( {\mathbb {P}}_{h,\sigma ^{k},\tau }\)-almost surely. For every \(k \in {\mathbb {N}}\), we define \({\underline{U}}_{\sigma ^{k}}^{\infty } = {\bar{u}}\).
Let some \(k > \kappa (h) \) be given. For every \( p \in {{{\mathcal {P}}}}(h), \) it holds that \( t \le T_{k}(p) \le T_{k+1}(p), \) so Lemma 3.3 and the optional sampling theorem with unbounded stopping times as presented in Theorem 8.16 of Yeh ([22], p. 139) implies
$$\begin{aligned} {\underline{U}}^{T_k}_{\sigma ^k} \le {\mathbb {E}}_{h,\sigma ^k,\tau }[{\underline{U}}^{T_{k+1}}_{\sigma ^k} \vert {\mathcal {F}}^{T_k}], \quad {\mathbb {P}}_{h,\sigma ^k,\tau }\text {-almost surely}. \end{aligned}$$
(5.8)
We now have
$$\begin{aligned} {\mathbb {E}}_{h,\sigma ^{\phi },\tau }[{\underline{U}}^{T_k}_{\sigma ^k}] = {\mathbb {E}}_{h,\sigma ^{k},\tau }[{\underline{U}}^{T_k}_{\sigma ^k}] \le {\mathbb {E}}_{h,\sigma ^{k},\tau }[{\underline{U}}^{T_{k+1}}_{\sigma ^k}] = {\mathbb {E}}_{h,\sigma ^{\phi },\tau }[{\underline{U}}^{T_{k+1}}_{\sigma ^k}], \end{aligned}$$
where the two equalities follow from Lemma 5.2 and the fact that both \({\underline{U}}^{T_k}_{\sigma ^k}\) and \({\underline{U}}^{T_{k+1}}_{\sigma ^k}\) are \({\mathcal {F}}^{T_{k+1}}\)-measurable stochastic variables, and the inequality follows by taking the expectation on both sides of inequality (5.8) and applying the law of iterated expectation.
Using Lemma 5.3, we can conclude that
$$\begin{aligned} {\mathbb {E}}_{h,\sigma ^{\phi },\tau }\left[ \tfrac{1}{2}\Phi ^{T_{k+1}} \cdot I(T_{k+1}<\infty ) \right]&\le {\mathbb {E}}_{h,\sigma ^\phi ,\tau } \left[ {\underline{U}}^{T_{k+1}}_{\sigma ^{k+1}} \right] - {\mathbb {E}}_{h,\sigma ^\phi ,\tau } \left[ {\underline{U}}^{T_{k+1}}_{\sigma ^{k}} \right] \\&\le {\mathbb {E}}_{h,\sigma ^{\phi },\tau }\left[ {\underline{U}}^{T_{k+1}}_{\sigma ^{k+1}}\right] - {\mathbb {E}}_{h,\sigma ^{\phi },\tau }\left[ {\underline{U}}^{T_k}_{\sigma ^k}\right] . \end{aligned}$$
Summing the preceding inequality over \( k = \kappa (h)+1, \dots , K, \) we obtain
$$\begin{aligned} \sum _{k=\kappa (h)+1}^{K} {\mathbb {E}}_{h,\sigma ^{\phi },\tau }\left[ \tfrac{1}{2}\Phi ^{T_k} \cdot I(T_{k}<\infty ) \right] \le {\mathbb {E}}_{h,\sigma ^{\phi },\tau }\left[ {\underline{U}}^{T_{K}}_{\sigma ^{K}} \right] -{\mathbb {E}}_{h,\sigma ^{\phi },\tau }\left[ {\underline{U}}^{T_{\kappa (h)+1}}_{\sigma ^{\kappa (h)+1}} \right] \le 2M. \end{aligned}$$
The result follows by taking the limit as \(K \rightarrow \infty \).
Let us write \(X_{k} = \Phi ^{T_k} \cdot I(T_k<\infty )\) and \( k^{\prime } = \kappa (h) + 1\). Since \(X_{k} \ge 0\), the monotone convergence theorem implies that \({\mathbb {E}}_{h,\sigma ^{\phi },\tau } [\sum _{k = k^{\prime }}^{\infty }X_{k}] = \sum _{k = k^{\prime }}^{\infty }{\mathbb {E}}_{h,\sigma ^{\phi },\tau } [X_{k}]\). The latter expression is finite by (5.7). Thus \(\sum _{k = k^{\prime }}^{\infty }X_{k}\) has a finite mean with respect to the probability measure \({\mathbb {P}}_{h,\sigma ^{\phi },\tau }\). Hence \(\sum _{k = k^{\prime }}^{\infty }X_{k} < \infty \) holds \({\mathbb {P}}_{h,\sigma ^{\phi },\tau }\)almost surely. This implies that \(X_{k} \rightarrow 0\) holds \({\mathbb {P}}_{h,\sigma ^{\phi },\tau }\)almost surely. \(\square \)
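The final step is a standard fact and can be illustrated numerically. The simulation below is ours and not part of the paper; the variables \(X_{k}\) are made up, taking the value k with probability \(2^{-k}\) and 0 otherwise, so that \(\sum _{k} {\mathbb {E}}[X_{k}] = \sum _{k} k\, 2^{-k} = 2\) is finite, and along (almost) every sampled path the terms vanish.

```python
import random

# Illustration (not from the paper): nonnegative X_k with a finite expected
# sum have an almost surely finite sum, hence X_k -> 0 almost surely.
# Here X_k = k with probability 2**-k and X_k = 0 otherwise.

random.seed(0)  # deterministic illustration

def sample_path(n_terms=200):
    return [k if random.random() < 2.0 ** (-k) else 0.0
            for k in range(1, n_terms + 1)]

paths = [sample_path() for _ in range(1000)]
mean_sum = sum(sum(path) for path in paths) / len(paths)      # close to 2
tails_vanish = all(max(path[100:]) == 0.0 for path in paths)  # X_k -> 0
```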
We are now in a position to state the main result of this section, which gives sufficient conditions for the existence of subgame \( \phi \)-maxmin strategies.
Theorem 5.7
Let a game \( \Gamma _{x_{0}}(u) \) and a tolerance function \(\phi > 0\) be given such that for every \(p\in {\mathcal {P}}\) at least one of the following two conditions holds:
1.
(point of upper semicontinuity) The function u is upper semicontinuous at p.
2.
(positive limit inferior) \(\liminf _{t\rightarrow \infty } \phi (p_{\vert t}) > 0\).
Then, there exists a subgame \(\phi \)-maxmin strategy in the game \(\Gamma _{x_{0}}(u).\)
Proof
We define the tolerance function \(\phi ^{\prime }\) by
$$\begin{aligned} \phi ^{\prime }(h) = \tfrac{1}{2}\min \lbrace \phi (h')\vert h'\preceq h\rbrace , \quad h \in {\mathcal {H}}. \end{aligned}$$
Thus, \(\phi ^{\prime }\) is a nonincreasing tolerance function with \(\phi ^{\prime } \le \tfrac{1}{2} \phi \).
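The prefix-minimum construction of \(\phi ^{\prime }\) can be sketched in code as follows. This is an illustration of ours, evaluated along a single play with hypothetical tolerance levels; note that the output is nonincreasing and bounded by half the input, and that a positive limit inferior of the input is inherited.

```python
# Evaluate phi' along one play (a sketch; the phi values are hypothetical):
# phi'(p|t) = (1/2) * min_{s <= t} phi(p|s), half the running prefix minimum.

def phi_prime_along_play(phi_values):
    out, running = [], float("inf")
    for value in phi_values:
        running = min(running, value)  # minimum over all prefixes seen so far
        out.append(0.5 * running)
    return out
```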
We show that \(\sigma ^{\phi ^{\prime }} \) is \(\phi ^{\prime }\)-equalizing. Since \(2\phi ^{\prime } \le \phi \) and \(\sigma ^{\phi ^{\prime }}\) is an n-day \(\phi ^{\prime }\)-maxmin strategy for every \(n \in {\mathbb {N}}\) by Lemma 5.5, it then follows from Theorem 4.5 that \(\sigma ^{\phi ^{\prime }}\) is a subgame \(\phi \)-maxmin strategy.
Let \({\mathcal {U}}\) denote the set of plays \(p \in {\mathcal {P}}\) at which u is upper semicontinuous and let \({\mathcal {I}}\) denote the set of plays \(p \in {\mathcal {P}}\) such that \(\liminf _{t\rightarrow \infty } \phi (p_{\vert t}) > 0\). By the assumption of the theorem it holds that \({\mathcal {P}} = {\mathcal {U}} \cup {\mathcal {I}}\). By the definition of \(\phi ^{\prime }\), we have \(\liminf _{t\rightarrow \infty } \phi ^{\prime }(p_{\vert t}) > 0\) for each \(p \in {\mathcal {I}}\).
Let some \(t \in {\mathbb {N}}\), a history \(h \in {\mathcal {H}}^{t}\), and a strategy \(\tau \in {\mathcal {S}}_{2}\) be given.
Step 1 For every \( p \in {{{\mathcal {U}}}}, \) \( u(p) \ge \limsup _{n\rightarrow \infty } {\underline{V}}^n(p) - \phi ^{\prime }(h). \)
This follows directly from Lemma 4.10.
For every \( k \in {\mathbb {N}}, \) we define \( {\mathcal {J}}_{k} = \{p \in {\mathcal {I}} \cap ({\mathcal {R}}_{k} \setminus {\mathcal {R}}_{k + 1}) \mid \lim _{n \rightarrow \infty } U_{\sigma ^{k},\tau }^{n}(p) = u(p)\} \) and \( {\mathcal {J}} = \bigcup _{k \in {\mathbb {N}}}{\mathcal {J}}_{k}, \) where we remind the reader that \( {{{\mathcal {R}}}}_{k} \) contains the plays along which at least k switches occur.
Step 2 For every \( p \in {{{\mathcal {J}}}}, \) \( u(p) \ge \limsup _{n\rightarrow \infty } {\underline{V}}^n(p) - \phi ^{\prime }(h). \)
Let \(k \in {\mathbb {N}}\) and \(p \in {\mathcal {J}}_{k}\) be given. Exactly k switches occur along the play p and the last switch occurs at time \(T_{k}(p)\). By our construction of \(\sigma ^{\phi ^{\prime }}\), this means that for each time \(n > T_{k}(p)\), the strategy \(\sigma ^k\) is a \(\phi ^{\prime }(p_{\vert n})\)-maxmin strategy for the subgame at history \(p_{\vert n}\), so \({\underline{u}}(\sigma ^{k},p_{\vert n}) \ge {\underline{v}}(p_{\vert n})-\phi ^{\prime }(p_{\vert n})\). Since \(U_{\sigma ^{k},\tau }^{n}(p) \ge {\underline{u}}(\sigma ^{k},p_{\vert n})\) and since for \(n \ge t\) we have \(\phi ^{\prime }(p_{\vert n}) \le \phi ^{\prime }(h)\), we conclude that for \(n \ge t\)
$$\begin{aligned} U_{\sigma ^{k},\tau }^{n}(p)\ge {\underline{V}}^{n}(p) -\phi ^{\prime }(h). \end{aligned}$$
Taking the limit as n goes to infinity, and making use of the fact that \(p \in {\mathcal {J}}_{k}\), we obtain
$$\begin{aligned} u(p) = \lim _{n\rightarrow \infty } U_{\sigma ^{k},\tau }^n(p) \ge \limsup _{n\rightarrow \infty } {\underline{V}}^{n}(p) - \phi ^{\prime }(h). \end{aligned}$$
Step 3: \({\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {I}} \cap {\mathcal {R}}_\infty ) = 0\).
Recall that by Lemma 5.6, \(\Phi ^{T_k} \cdot I(T_k<\infty )\) converges to 0 as k goes to infinity, \({\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }\)almost surely. Also recall that \({\mathcal {R}}_{\infty }\) is the set of plays where infinitely many switches occur. Thus \(I(T_k < \infty )\) is identically equal to 1 on \({\mathcal {R}}_{\infty }\). Furthermore, \(\liminf _{k \rightarrow \infty }\Phi ^{T_{k}} > 0\) everywhere on \({\mathcal {I}}\). We conclude that \(\Phi ^{T_{k}} \cdot I(T_k < \infty )\) does not converge to zero on \({\mathcal {I}} \cap {\mathcal {R}}_\infty \) and the result follows.
Step 4: \({\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {U}} \cup {\mathcal {J}}) = 1\).
For every \( k \in {\mathbb {N}}, \) it holds by Lévy's zero-one law (Lemma A.1 in Appendix A) that
$$\begin{aligned} {\mathbb {P}}_{h,\sigma ^{k},\tau }({\mathcal {J}}_{k}) = {\mathbb {P}}_{h,\sigma ^{k},\tau }({\mathcal {I}} \cap ({{{\mathcal {R}}}}_{k} \setminus {\mathcal {R}}_{k + 1})). \end{aligned}$$
Using Lemma 5.2 twice yields
$$\begin{aligned} {\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {J}}_{k}) = {\mathbb {P}}_{h,\sigma ^{k},\tau }({\mathcal {J}}_{k}) = {\mathbb {P}}_{h,\sigma ^{k},\tau }({\mathcal {I}} \cap ({{{\mathcal {R}}}}_{k} \setminus {\mathcal {R}}_{k + 1})) = {\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {I}} \cap ({\mathcal {R}}_{k} \setminus {\mathcal {R}}_{k + 1})). \end{aligned}$$
We now have
$$\begin{aligned} {\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {J}}) = {\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {I}} \setminus {\mathcal {R}}_{\infty }) = {\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {I}}), \end{aligned}$$
where the last equality follows from Step 3. Finally, we obtain
$$\begin{aligned} {\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {U}} \cup {\mathcal {J}}) \ge {\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {U}} \setminus {\mathcal {I}}) + {\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {J}}) = {\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {U}} \setminus {\mathcal {I}}) + {\mathbb {P}}_{h,\sigma ^{\phi ^{\prime }},\tau }({\mathcal {I}}) = 1, \end{aligned}$$
where the last equality follows from the fact that the sets \({\mathcal {U}}\) and \({\mathcal {I}}\) cover \({\mathcal {P}}(h)\). \(\square \)
Theorem 5.7 generalizes Proposition 11 in Mashiah–Yaakovi ([15]), where the existence of a subgame \(\epsilon \)-optimal strategy in a two-player zero-sum stochastic game with Borel measurable payoff functions is shown. The tolerance function \( \phi \) there is given by \(\phi (h)=\epsilon \) for every \(h\in {\mathcal {H}}, \) so for every play the tolerance function has a positive limit inferior. Theorem 5.7 yields the existence of a subgame \(\epsilon \)-maxmin strategy. Because the Borel measurability of the payoff function guarantees the existence of the value (Maitra and Sudderth, [13], and Martin, [14]), this is equivalent to the existence of a subgame \(\epsilon \)-optimal strategy.
6 Subgame Maxmin Strategies
The goal of this section is to explore the relationship between the concept of a subgame maxmin strategy and that of a subgame \(\phi \)-maxmin strategy for \( \phi > 0. \) The main result of this section is the following theorem.
Theorem 6.1
For every game \( \Gamma _{x_{0}}(u) \), there exists a tolerance function \(\phi ^* > 0 \) such that the following statements are equivalent:
1.
The game \( \Gamma _{x_{0}}(u) \) has a subgame maxmin strategy.
2.
The game \( \Gamma _{x_{0}}(u) \) has a subgame \(\phi ^*\)-maxmin strategy.
Because every subgame maxmin strategy is a subgame \(\phi \)-maxmin strategy, statement 1 clearly implies statement 2. To prove the converse, we define a tolerance function \(\phi ^*>0\); the precise definition is given in Subsection 6.3. Intuitively, we fix a positive sequence \((\delta _t)_{t\in {\mathbb {N}}}\) with the property that \(\sum _{t=0}^\infty \delta _t<\infty \). Then, we choose \(\phi ^*\) such that (1) \(\phi ^*(p_{\vert t})<\delta _t\) for every play \(p\in {\mathcal {P}}\) and every \(t\in {\mathbb {N}}\) and (2) every one-day \(\phi ^*\)-maxmin action is \(\delta _t\)-close to the region of optimal mixed actions in the one-day game. The first condition implies that the tolerance function \(\phi ^*\) decreases rapidly to zero, i.e., \(\sum _{t=0}^\infty \phi ^*(p_{\vert t})<\infty \). By assumption, player 1 has a subgame \(\phi ^*\)-maxmin strategy. We obtain a subgame maxmin strategy as follows: at every history \(h_t\) we pick a mixed action which guarantees the lower value \({\underline{v}}(h_t)\) in the one-day game and is \(\delta _t\)-close to the mixed action chosen by the subgame \(\phi ^{*}\)-maxmin strategy. This construction yields a strategy which is clearly one-day maxmin. The condition that the tolerance function \(\phi ^{*}\) decreases rapidly to 0 is then used to show that the expected payoffs from a history \(h_t\) with length t obtained by the subgame \(\phi ^{*}\)-maxmin strategy and by the constructed strategy are within a distance of \(M\cdot \sum _{n=t}^\infty \delta _n\), where \(M=\sup _{p\in {{{\mathcal {P}}}}}\vert u(p)\vert \). From this, the equalizing property of the constructed strategy follows. Then, Corollary 4.7 from Sect. 4 lets us conclude that the constructed strategy is indeed a subgame maxmin strategy.
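The key estimate in this proof sketch, namely that stagewise \(\delta _t\)-closeness of the chosen mixed actions forces the expected payoffs to stay within \(M\cdot \sum _t \delta _t\) of each other, can be checked numerically. The code below is not the paper's construction: for simplicity it uses independent per-stage distributions rather than history-dependent ones, and the distributions and the payoff function are made-up toy data.

```python
import itertools

def expected_payoff(stage_dists, u):
    """Exact expectation of u over plays, a play being a tuple of outcomes."""
    total = 0.0
    for play in itertools.product(*(range(len(d)) for d in stage_dists)):
        prob = 1.0
        for t, outcome in enumerate(play):
            prob *= stage_dists[t][outcome]
        total += prob * u(play)
    return total

# Hypothetical stagewise distributions of two strategies (three stages,
# two outcomes per stage) and a payoff function bounded by M = 1.
p = [[0.5, 0.5], [0.9, 0.1], [0.3, 0.7]]
q = [[0.6, 0.4], [0.9, 0.1], [0.25, 0.75]]
u = lambda play: 1.0 if sum(play) % 2 == 0 else -1.0
M = 1.0

deltas = [sum(abs(a - b) for a, b in zip(dp, dq)) for dp, dq in zip(p, q)]
gap = abs(expected_payoff(p, u) - expected_payoff(q, u))
bound = M * sum(deltas)  # the inequality gap <= bound always holds
```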
Theorem 6.1 has the following corollary.
Corollary 6.2
The following statements are equivalent:
1.
The game \( \Gamma _{x_{0}}(u) \) has a subgame maxmin strategy.
2.
For every \( \phi > 0, \) the game \( \Gamma _{x_{0}}(u) \) has a subgame \(\phi \)-maxmin strategy.
We would like to make two remarks. First, as the payoff function u need not be continuous, one cannot simply use a continuity argument to prove that statement 2 of Corollary 6.2 implies statement 1. Second, note that the existence of a subgame \(\epsilon \)-maxmin strategy for every \(\epsilon >0\) is a weaker requirement than statements 1 and 2 of Corollary 6.2. Indeed, as Example 3.2 already illustrated, there exist games which admit a subgame \(\epsilon \)-maxmin strategy for every \(\epsilon >0\) but do not admit a subgame maxmin strategy.
The special case where the payoff function is upper semicontinuous deserves some additional attention. From Maitra and Sudderth ([13]) and Martin ([14]), it follows that every two-player zero-sum stochastic game with a countable state space, finite action sets, and a Borel measurable payoff function admits a value. Because every upper semicontinuous payoff function is Borel measurable, the existence of the value is guaranteed. From Theorem 5.7, we obtain the existence of a subgame \(\phi \)-optimal strategy for every \(\phi >0\). Combining this with Corollary 6.2 shows that player 1 has a subgame optimal strategy. In this way, we obtain a special case of the result by Laraki, Maitra, and Sudderth ([10]), where the authors allow the state space to be a Borel subset of a Polish space and transition probabilities to be determined by a Borel transition function.
When we are interested in pure strategies that guarantee the maxmin levels, we can strengthen the result of Theorem 6.1. In the context of simultaneous move games, pure strategies are of course rather restrictive. Still, there are important classes of games, such as perfect information games, in which they are natural and play a prominent role.
Theorem 6.3
For every game \(\Gamma _{x_{0}}(u)\), there exists a tolerance function \( \phi ^* > 0 \) such that the following statements are equivalent:
1.
The pure strategy \(\sigma \in {{{\mathcal {S}}}}_{1} \) is a subgame maxmin strategy.
2.
The pure strategy \(\sigma \in {{{\mathcal {S}}}}_{1} \) is a subgame \(\phi ^*\)-maxmin strategy.
The tolerance function \( \phi ^*\) in Theorem 6.3 will be defined based on the data of the game and the lower values in the subgames.
This section is structured as follows. We start by proving Theorem 6.3, which is easier than Theorem 6.1 and helps us explain the main ideas. Then, we turn to the proof of Theorem 6.1.
6.1 The proof of Theorem 6.3
In this subsection, we prove Theorem 6.3. Since the statement of this theorem is about pure strategies, the proof is less technical and the intuition is more transparent.
We only need to prove that there is \( \phi ^* > 0 \) such that statement 2 of Theorem 6.3 implies statement 1 of Theorem 6.3. From now on, for any \(t\in {\mathbb {N}}\), history \(h\in {\mathcal {H}}^t\) with final state x, and mixed actions \(m_1\in \Delta ({\mathcal {A}})\) and \(m_2\in \Delta ({\mathcal {B}})\), we denote the expectation of the lower value at the next stage by
$$\begin{aligned} {\mathbb {E}}_{h,m_1,m_2}\left[ {\underline{V}}^{t+1}\right] \;=\;\sum _{a\in {\mathcal {A}}}\sum _{b\in {\mathcal {B}}}\sum _{x'\in {\mathcal {X}}}m_1(a)\cdot m_2(b)\cdot q(x'\vert x,a,b)\cdot {\underline{v}}(h,a,b,x'). \end{aligned}$$
(6.1)
The proof consists of two steps.
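Written out with the mixed-action weights, the expectation in (6.1) is a triple sum that translates directly into code. The following sketch is illustrative only; the transition kernel q and the next-stage lower values v are hypothetical toy data for one fixed history, keyed by pure action pairs.

```python
def expected_next_lower_value(q, v, m1, m2):
    """E_{h,m1,m2}[V^{t+1}]: average the next-stage lower values v[(a, b, x)]
    over actions (a, b) drawn from (m1, m2) and the next state x drawn from
    the transition kernel q[(a, b)] (all data for one fixed history h)."""
    return sum(
        m1[a] * m2[b] * q[(a, b)][x] * v[(a, b, x)]
        for a in m1
        for b in m2
        for x in q[(a, b)]
    )

# Hypothetical data: two actions for player 1, one for player 2, two states.
q = {(0, 0): {0: 1.0}, (1, 0): {0: 0.25, 1: 0.75}}
v = {(0, 0, 0): 4.0, (1, 0, 0): 0.0, (1, 0, 1): 8.0}
value = expected_next_lower_value(q, v, {0: 0.5, 1: 0.5}, {0: 1.0})
```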
Step 1. Definition of \(\phi ^* > 0. \)
For every \(t\in {\mathbb {N}}, \) for every history \(h\in {\mathcal {H}}^t\), there exists a number \(d(h) >0\) such that for every action \(a\in {\mathcal {A}}\) of player 1 one of the following two conditions holds:

- Action a guarantees that the lower value does not drop in expectation: for every \(m_2\in \Delta ({\mathcal {B}})\), \({\mathbb {E}}_{h,a,m_2}\left[ {\underline{V}}^{t+1}\right] \ge {\underline{v}}(h)\).

- There exists a mixed action \(m_2\in \Delta ({\mathcal {B}})\) for player 2 such that, if player 1 uses action a, the lower value drops in expectation by more than d(h): \({\mathbb {E}}_{h,a,m_2}\left[ {\underline{V}}^{t+1}\right] < {\underline{v}}(h)-d(h)\).

This statement is true because the action set \({\mathcal {A}}\) of player 1 is finite. We define \( \phi ^* > 0 \) as follows. For every \(t\in {\mathbb {N}}, \) for every history \(h\in {\mathcal {H}}^t\), let \(\phi ^*(h)=\min \{d(h),2^{-t}\}\). The term \(2^{-t}\) is included so that the tolerance levels tend to 0 along each play: for every \(p\in {\mathcal {P}}\), it holds that \(\displaystyle \lim _{t\rightarrow \infty }\phi ^*(p_{\vert t})=0\).
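Since \({\mathbb {E}}_{h,a,m_2}[{\underline{V}}^{t+1}]\) is linear in \(m_2\), its infimum over mixed actions is attained at a pure action of player 2, so a valid d(h) can be read off from a finite table of one-day expectations. The sketch below is ours, not the paper's; the payoff table and the lower value are hypothetical.

```python
def gap_level(one_day_payoff, lower_value, default=1.0):
    """Return a valid d(h) > 0 for Step 1.

    one_day_payoff[a][b] = E_{h,a,b}[V^{t+1}] for pure actions a, b.  Because
    the expectation is linear in m2, an action a violates condition 1 exactly
    when min_b one_day_payoff[a][b] < lower_value; taking half of the smallest
    such shortfall makes the drop in condition 2 strictly larger than d(h).
    """
    shortfalls = []
    for row in one_day_payoff:
        guarantee = min(row)  # worst case over player 2's pure actions
        if guarantee < lower_value:
            shortfalls.append(lower_value - guarantee)
    return min(shortfalls) / 2.0 if shortfalls else default

# Hypothetical table: three actions for player 1, two for player 2, v(h) = 2.
d = gap_level([[2.0, 3.0], [1.0, 4.0], [2.0, 2.5]], lower_value=2.0)
```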
Step 2. If the pure strategy \( \sigma \in \mathcal{S}_{1} \) is a subgame \( \phi ^{*}\)-maxmin strategy, then \( \sigma \) is a subgame maxmin strategy.
Let the pure strategy \( \sigma \in {{{\mathcal {S}}}}_{1} \) be subgame \(\phi ^*\)-maxmin. We verify that \(\sigma \) satisfies the conditions of Corollary 4.7.
First, we show that \(\sigma \) is 1-day maxmin. Fix \( t \in {\mathbb {N}} \) and \(h\in {\mathcal {H}}^t. \) Let \(a=\sigma (h)\) denote the action that \(\sigma \) recommends at h. Then, for every mixed action \(m_2\in \Delta ({\mathcal {B}})\) and strategy \(\tau \in {\mathcal {S}}_2\) with \(m_2=\tau (h)\), we have
$$\begin{aligned} {\mathbb {E}}_{h,a,m_2}\left[ {\underline{V}}^{t+1}\right] ={\mathbb {E}}_{h,\sigma ,\tau }\left[ {\underline{V}}^{t+1}\right] \ge {\underline{v}}(h)-\phi ^*(h)\ge {\underline{v}}(h)-d(h), \end{aligned}$$
where the first inequality follows from the fact that \(\sigma \) is subgame \(\phi ^*\)-maxmin and from condition 1 of Theorem 4.6, and the second inequality follows from the definition of \(\phi ^*(h). \) Therefore, by the choice of d(h) in Step 1, for every \(m_2\in \Delta ({\mathcal {B}})\), it holds that
$$\begin{aligned} {\mathbb {E}}_{h,a,m_2}\left[ {\underline{V}}^{t+1}\right] \ge {\underline{v}}(h). \end{aligned}$$
Hence, for every strategy \(\tau \in {\mathcal {S}}_2\), it holds that
$$\begin{aligned} {\mathbb {E}}_{h,\sigma ,\tau }\left[ {\underline{V}}^{t+1}\right] ={\mathbb {E}}_{h,a,\tau (h)}\left[ {\underline{V}}^{t+1}\right] \ge {\underline{v}}(h). \end{aligned}$$
Thus, \(\sigma \) is 1-day maxmin.
We show next that \(\sigma \) is equalizing. Take some \( \tau \in \mathcal{S}_{2}. \) Because \(\sigma \) is subgame \(\phi ^*\)-maxmin, for every \(p\in {\mathcal {P}}(h), \) for every \(n\ge t, \) we have
$$\begin{aligned} U_{\sigma ,\tau }^n(p)={\mathbb {E}}_{p_{\vert n},\sigma ,\tau }\left[ u\right] \ge {\underline{v}}(p_{\vert n})-\phi ^*(p_{\vert n})={\underline{V}}^n(p)-\Phi ^{*n}(p), \end{aligned}$$
where \(\Phi ^{*n}(p)=\phi ^*(p_{\vert n})\). We conclude that
$$\begin{aligned} \lim _{n\rightarrow \infty }U_{\sigma ,\tau }^n \ge \limsup _{n\rightarrow \infty }\left( {\underline{V}}^n-\Phi ^{*n} \right) =\limsup _{n\rightarrow \infty } {\underline{V}}^n-\lim _{n\rightarrow \infty }\Phi ^{*n} = \limsup _{n\rightarrow \infty } {\underline{V}}^n, \end{aligned}$$
where the last equality follows from the fact that for all \(p\in {\mathcal {P}}\) we have \(\displaystyle \lim _{n\rightarrow \infty }\phi ^*(p_{\vert n})=0. \) Hence
$$\begin{aligned} u = \lim _{n\rightarrow \infty }U_{\sigma ,\tau }^n \ge \limsup _{n\rightarrow \infty } {\underline{V}}^n,\quad {\mathbb {P}}_{h,\sigma ,\tau }\text {-almost surely,} \end{aligned}$$
where the equality follows from Lemma A.1 in Appendix A. We have shown that \(\sigma \) is equalizing.
6.2 The OneShot Game \(\Upsilon _h\)
To prove Theorem 6.1, we first analyze a one-shot zero-sum game in this subsection. For each history, the one-shot game is such that the payoff is given by the lower value at the next stage. In the next subsection, we use these one-shot games to construct the tolerance function \(\phi ^* > 0\).
For some \( t \in {\mathbb {N}}, \) let \( h \in {\mathcal {H}}^t \) be a history in the game \(\Gamma _{x_{0}}(u). \) The one-shot zero-sum game \(\Upsilon _h\) is played as follows. Player 1 chooses an action \(a\in {\mathcal {A}}\) and player 2 simultaneously chooses an action \(b\in {\mathcal {B}}\). Then, player 1 receives from player 2 the amount \({\mathbb {E}}_{h,a,b}\left[ {\underline{V}}^{t+1}\right] \). As the action sets \({\mathcal {A}}\) and \({\mathcal {B}}\) are finite, the game \(\Upsilon _h\) has a value, which we denote by w(h). Furthermore, both players have optimal mixed actions in the game \(\Upsilon _h\).
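Because \({\mathcal {A}}\) and \({\mathcal {B}}\) are finite, the value of such a matrix game can be computed exactly by linear programming. The sketch below instead uses fictitious play, a simple iterative scheme that brackets the value from both sides; it is a numerical illustration of ours, and the payoff matrix (standing in for the amounts \({\mathbb {E}}_{h,a,b}[{\underline{V}}^{t+1}]\)) is made-up toy data with a saddle point of value 2.

```python
def fictitious_play(payoff, iterations=1000):
    """Bracket the value of the zero-sum game with matrix payoff[a][b]
    (the amount player 2 pays player 1) by iterated best responses."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    counts1, counts2 = [0] * n_rows, [0] * n_cols
    counts1[0] = counts2[0] = 1  # arbitrary initial plays

    for _ in range(iterations):
        # Player 1 best-responds to player 2's empirical mixed action.
        br1 = max(range(n_rows),
                  key=lambda a: sum(payoff[a][b] * counts2[b] for b in range(n_cols)))
        # Player 2 best-responds (minimizing) to player 1's empirical mixed action.
        br2 = min(range(n_cols),
                  key=lambda b: sum(payoff[a][b] * counts1[a] for a in range(n_rows)))
        counts1[br1] += 1
        counts2[br2] += 1

    m1 = [c / sum(counts1) for c in counts1]
    m2 = [c / sum(counts2) for c in counts2]
    # Rigorous bracket: playing m1 guarantees player 1 at least `lower`, and
    # playing m2 concedes at most `upper`, so lower <= value <= upper.
    lower = min(sum(payoff[a][b] * m1[a] for a in range(n_rows)) for b in range(n_cols))
    upper = max(sum(payoff[a][b] * m2[b] for b in range(n_cols)) for a in range(n_rows))
    return lower, upper

# Toy matrix with a saddle point at (row 1, column 1), so the value is 2.
lower, upper = fictitious_play([[3.0, 1.0], [2.0, 2.0]])
```

Fictitious play is only a stand-in here; any linear-programming solver for matrix games computes w(h) and the optimal mixed actions exactly.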
The following lemma states that the value w(h) of the one-shot game \(\Upsilon _h\) equals the lower value \({\underline{v}}(h)\) at the history h in the original game \(\Gamma _{x_{0}}(u)\). In case the value exists, this is a classic result of dynamic programming.
Lemma 6.4
For every history \(h\in {\mathcal {H}}, \) we have \(w(h)={\underline{v}}(h)\).
Proof
The proof is by contradiction. Fix \( t \in {\mathbb {N}} \) and a history \(h\in {\mathcal {H}}^t. \) Now suppose that \(w(h)\ne {\underline{v}}(h)\). Then, we have either \(w(h)>{\underline{v}}(h)\) or \(w(h)<{\underline{v}}(h)\).
Case 1: \(w(h)>{\underline{v}}(h)\).
Let \(\delta =w(h)-{\underline{v}}(h)\). We derive a contradiction by showing that, in the subgame of \(\Gamma _{x_{0}}(u)\) at history h, player 1 can guarantee an expected payoff of at least \({\underline{v}}(h)+\delta /2\).
Let \(m_1\in \Delta ({\mathcal {A}})\) be an optimal mixed action for player 1 in the one-shot game \(\Upsilon _h\). Let \( \sigma \in {\mathcal {S}}_1 \) be such that \(\sigma (h)=m_1\) and such that it induces a \((\delta /2)\)-maxmin strategy for the subgame at each history in period \(t+1\), i.e., for every \(h'\in {\mathcal {H}}^{t+1},\) for every \(\tau \in {\mathcal {S}}_2,\)
$$\begin{aligned} {\mathbb {E}}_{h',\sigma ,\tau }\left[ u\right] \ge {\underline{v}}(h')-\tfrac{\delta }{2}. \end{aligned}$$
(6.2)
Then, for every \(\tau \in {\mathcal {S}}_2,\) it holds that
$$\begin{aligned} {\mathbb {E}}_{h,\sigma ,\tau }\left[ u\right]&= {\mathbb {E}}_{h,\sigma ,\tau }\left[ U^{t}_{\sigma ,\tau }\right] ={\mathbb {E}}_{h,\sigma ,\tau }\left[ U^{t+1}_{\sigma ,\tau }\right] \\&\ge {\mathbb {E}}_{h,\sigma ,\tau }\left[ {\underline{V}}^{t+1}\right] -\tfrac{\delta }{2} = {\mathbb {E}}_{h,m_1,\tau (h)}\left[ {\underline{V}}^{t+1}\right] -\tfrac{\delta }{2} \ge w(h) -\tfrac{\delta }{2} ={\underline{v}}(h)+\tfrac{\delta }{2}, \end{aligned}$$
where the second equality follows from the fact that \((U^n_{\sigma ,\tau })_{n\ge t}\) is a \({\mathbb {P}}_{h,\sigma ,\tau }\)-martingale, the first inequality follows from (6.2), and the second inequality follows from the choice of \(m_1\).
Case 2: \(w(h)<{\underline{v}}(h)\).
Let \(\delta ={\underline{v}}(h)-w(h)\). We derive a contradiction by showing that, for every strategy of player 1, there is a strategy for player 2 such that the expected payoff is at most \({\underline{v}}(h)-\delta /2\) in the subgame of \(\Gamma _{x_{0}}(u)\) at history h.
Fix \(\sigma \in {\mathcal {S}}_1\) and let \(m_1=\sigma (h)\). Let \(m_2\in \Delta ({\mathcal {B}})\) be an optimal mixed action for player 2 in the one-shot game \(\Upsilon _h\). Let \(\tau \in {\mathcal {S}}_2\) be such that \(\tau (h)=m_2\) and the expected payoff under \((\sigma ,\tau )\) in the subgame at each history \(h'\) at period \(t+1\) is at most \({\underline{v}}(h') + \delta /2\), i.e., for every \(h'\in {\mathcal {H}}^{t+1},\)
$$\begin{aligned} {\mathbb {E}}_{h',\sigma ,\tau }\left[ u\right] \le {\underline{v}}(h')+\tfrac{\delta }{2}. \end{aligned}$$
(6.3)
We have that
$$\begin{aligned} {\underline{v}}(h)-\tfrac{\delta }{2}&= w(h)+\tfrac{\delta }{2} \ge {\mathbb {E}}_{h,m_1,m_2}\left[ {\underline{V}}^{t+1}\right] +\tfrac{\delta }{2} \\&\ge {\mathbb {E}}_{h,m_1,m_2}\left[ U^{t+1}_{\sigma ,\tau }\right] ={\mathbb {E}}_{h,\sigma ,\tau }\left[ U^{t+1}_{\sigma ,\tau }\right] ={\mathbb {E}}_{h,\sigma ,\tau }\left[ U^{t}_{\sigma ,\tau }\right] ={\mathbb {E}}_{h,\sigma ,\tau }\left[ u\right] , \end{aligned}$$
where the first inequality follows from the choice of \(m_2\), the second inequality follows from (6.3), and the penultimate equality follows from the fact that \((U^n_{\sigma ,\tau })_{n\ge t}\) is a \({\mathbb {P}}_{h,\sigma ,\tau }\)-martingale. Because \( \sigma \) was chosen arbitrarily, we have derived a contradiction with the definition of the lower value \({\underline{v}}(h)\). \(\square \)
The total variation distance between two mixed actions \(m_1,n_1\in \Delta ({\mathcal {A}})\) of player 1 is defined as
$$\begin{aligned} \Vert m_1-n_1\Vert _{\mathrm{TV}}=\sum _{a\in {\mathcal {A}}}\vert m_1(a)-n_1(a)\vert . \end{aligned}$$
The total variation distance between two probability measures \( {\mathbb {P}} \) and \( {\mathbb {P}}^{\prime } \) on \( ({{{\mathcal {P}}}}, {{{\mathcal {F}}}}) \) is defined as
$$\begin{aligned} \Vert {\mathbb {P}} - {\mathbb {P}}^{\prime }\Vert _{{\mathrm{TV}}} = \sup \Big \{\sum _{i=1}^{n} \vert {\mathbb {P}}(F_{i}) - {\mathbb {P}}^{\prime }(F_{i})\vert : F_{1}, \ldots , F_{n} \in {{{\mathcal {F}}}} \text{ and } \{F_{1}, \ldots , F_{n}\} \text{ is a partition of } {{{\mathcal {P}}}}\Big \}. \end{aligned}$$
Let \( t \in {\mathbb {N}} \) and a history \(h\in {\mathcal {H}}^t\) be given. Let \(O_h \subseteq \Delta ({{{\mathcal {A}}}}) \) denote the set of optimal mixed actions of player 1 in the one-shot game \(\Upsilon _h\). By Lemma 6.4, it holds that
$$\begin{aligned} O_h = \left\{ m_1\in \Delta ({\mathcal {A}}) \mid \text{ for every } m_2\in \Delta ({\mathcal {B}}),~{\mathbb {E}}_{h,m_1,m_2}\left[ {\underline{V}}^{t+1}\right] \ge {\underline{v}}(h) \right\} . \end{aligned}$$
Note that \(O_h\) is a compact subset of \(\Delta ({\mathcal {A}})\). For every \(m_1\in \Delta ({\mathcal {A}}), \) the distance of \(m_1\) to \(O_h\) is defined by
$$\begin{aligned} \Vert m_1-O_h\Vert _{\mathrm{TV}}=\displaystyle \min _{n_1\in O_h}\Vert m_1-n_1\Vert _{\mathrm{TV}}. \end{aligned}$$
Due to the compactness of \(O_h, \) the minimum is attained. For \(\delta >0\), let \(D_h^\delta \) be the set of mixed actions of player 1 which have a distance of at least \(\delta \) to the set \(O_h, \) so
$$\begin{aligned} D_h^\delta \;=\;\left\{ m_1\in \Delta ({\mathcal {A}}) \mid \Vert m_1-O_h\Vert _{\mathrm{TV}}\ge \delta \right\} . \end{aligned}$$
(6.4)
The mixed actions in \(D_h^\delta \) are not optimal in the one-shot game \(\Upsilon _h\). The following lemma says that the loss in utility caused by these mixed actions has a positive lower bound.
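As a concrete illustration, both total variation distances above are elementary to compute when the action set is finite. The following Python sketch (not from the paper; the action set and mixed actions are hypothetical) computes \(\Vert m_1-n_1\Vert _{\mathrm{TV}}\) and the distance from a mixed action to a finite collection of mixed actions, a toy stand-in for the compact set \(O_h\):

```python
# Total variation distance between mixed actions over a finite action set,
# and the distance from a mixed action to a finite collection of mixed
# actions (a stand-in for the compact set O_h of optimal mixed actions).

def tv_distance(m1, n1):
    """||m1 - n1||_TV = sum over actions a of |m1(a) - n1(a)|."""
    return sum(abs(m1[a] - n1[a]) for a in m1)

def distance_to_set(m1, collection):
    """min over n1 in the finite collection of ||m1 - n1||_TV."""
    return min(tv_distance(m1, n1) for n1 in collection)

# Hypothetical mixed actions on a two-element action set {0, 1}:
m = {0: 0.5, 1: 0.5}
n = {0: 0.8, 1: 0.2}
assert abs(tv_distance(m, n) - 0.6) < 1e-9   # |0.5-0.8| + |0.5-0.2|
assert distance_to_set(m, [m, n]) == 0.0     # m itself is in the collection
```

In the paper \(O_h\) is compact rather than finite, but the minimum in (6.4)'s definition is attained either way.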
Lemma 6.5
Let \(h\in {\mathcal {H}}\) and \(\delta >0\) be given. If \( D_h^\delta \) is nonempty, then there is \( \epsilon >0\) such that for every \(m_1\in D_h^\delta \) there exists \(b\in {\mathcal {B}}\) such that
$$\begin{aligned}{\mathbb {E}}_{h,m_1,b}\left[ {\underline{V}}^{t+1}\right] \le {\underline{v}}(h)-\epsilon .\end{aligned}$$
Proof
Assume \( D^{\delta }_{h} \) is nonempty. Consider the function \( e_h^\delta : D_h^\delta \rightarrow {\mathbb {R}}\) defined by
$$\begin{aligned} e_h^\delta (m_1)\;=\;{\underline{v}}(h)-\min _{b\in {\mathcal {B}}}{\mathbb {E}}_{h,m_1,b}\left[ {\underline{V}}^{t+1}\right] , \quad m_1\in D_h^\delta . \end{aligned}$$
(6.5)
Since \({\mathcal {B}}\) is finite, the minimum exists. For every \(m_1\in D_h^\delta \), we have \(m_1\notin O_h\), and therefore there exists \(m_2\in \Delta ({\mathcal {B}})\) such that \({\mathbb {E}}_{h,m_1,m_2}\left[ {\underline{V}}^{t+1}\right] <{\underline{v}}(h)\); since this expectation is an average of the expectations under the pure actions \(b\in {\mathcal {B}}\), some pure action satisfies the same strict inequality, so \(e_h^\delta (m_1)>0\). The function \( e^{\delta }_{h} \) is therefore a positive and continuous function on a compact set, so it has a positive minimum. \(\square \)
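To make the role of \(e_h^\delta \) concrete, consider a toy one-shot game in the spirit of matching pennies. The following Python sketch (illustrative numbers only, not from the paper) evaluates \(e_h^\delta (m_1)\) for mixed actions \(m_1=(q,1-q)\) of player 1, with continuation values chosen so that the unique optimal mixed action is \(q=0.5\) and the one-shot value is \(0.5\):

```python
# Toy one-shot game (hypothetical numbers): player 1 mixes m1 = (q, 1-q)
# over two actions; player 2 has two pure actions b0 and b1.

def e_delta(q, v_lower=0.5):
    """e_h^delta(m1) = v(h) - min_b E_{h,m1,b}[V^{t+1}] for m1 = (q, 1-q)."""
    exp_b0 = q * 1.0 + (1 - q) * 0.0   # continuation value against b0
    exp_b1 = q * 0.0 + (1 - q) * 1.0   # continuation value against b1
    return v_lower - min(exp_b0, exp_b1)

# The optimal mixed action q = 0.5 loses nothing; mixed actions bounded
# away from it lose a positive amount, as in Lemma 6.5:
assert e_delta(0.5) == 0.0
assert abs(e_delta(0.8) - 0.3) < 1e-9
```

On any set of mixed actions bounded away from \(q=0.5\), this loss has a positive minimum, which is the \(\epsilon \) of Lemma 6.5 in this toy game.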
6.3 Definition of the Tolerance Function \(\phi ^*\)
In this subsection, we define a positive tolerance function \(\phi ^*\). Fix a positive and decreasing sequence \((\delta _t)_{t\in {\mathbb {N}}}\) such that \(\sum _{t=0}^{\infty } \delta _t<\infty \), for example \(\delta _t=2^{-t}\) for every \(t\in {\mathbb {N}}\). Note that \(\lim _{t\rightarrow \infty }\delta _t=0\).
For every \( t \in {\mathbb {N}} \) and every history \(h\in {\mathcal {H}}^t, \) we define the constant c(h) as follows: if the set \(D_h^{\delta _t}\) is nonempty, then c(h) is equal to the positive number \( \epsilon \) of Lemma 6.5, and \( c(h) = \delta _t \) otherwise. We define
$$\begin{aligned} \phi ^*(h)=\tfrac{\min \left\{ c(h),\delta _{t}\right\} }{2}. \end{aligned}$$
(6.6)
Notice that \(0<\phi ^*(h)<\delta _t\).
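The construction (6.6) can be sketched numerically. Assuming the example sequence \(\delta _t=2^{-t}\) and treating \(c(h)\) as a given positive input (in the paper it is the \(\epsilon \) of Lemma 6.5, or \(\delta _t\) when \(D_h^{\delta _t}\) is empty), a minimal Python sketch is:

```python
# Sketch of the tolerance function phi* from (6.6), with the example
# sequence delta_t = 2^{-t} and c(h) supplied as an input.

def delta(t):
    return 2.0 ** (-t)   # positive, decreasing, summable

def phi_star(c_h, t):
    return min(c_h, delta(t)) / 2.0   # (6.6): phi*(h) = min{c(h), delta_t}/2

# The defining property 0 < phi*(h) < delta_t holds for any c(h) > 0:
for t in range(5):
    for c_h in (0.01, 0.5, 3.0):
        assert 0 < phi_star(c_h, t) < delta(t)
```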
We summarize the properties of the tolerance function \(\phi ^*\):
1.
For every history \(h\in {\mathcal {H}}\), we have \(\phi ^*(h)>0\).
2.
For every play \(p\in {\mathcal {P}}\), we have \(\sum _{t=0}^{\infty } \phi ^*(p_{\vert t})\le \sum _{t=0}^{\infty } \delta _t<\infty \).
3.
For every play \(p\in {\mathcal {P}}\), \(\displaystyle \lim _{t\rightarrow \infty }\phi ^*(p_{\vert t})=0\).
4.
If the set \(D_h^{\delta _t}\) is nonempty, then by the choice of c(h), for every \(m_1\in D_h^{\delta _t}\) there exists \(b\in {\mathcal {B}}\) such that
$$\begin{aligned} {\mathbb {E}}_{h,m_1,b}\left[ {\underline{V}}^{t+1}\right] \;\le \;{\underline{v}}(h)-c(h) \;<\;{\underline{v}}(h)-\phi ^*(h). \end{aligned}$$
The importance of the condition \(\sum _{t=0}^{\infty } \delta _t<\infty \) is underlined by the following lemma. Recall that \( M=\sup _{p\in {\mathcal {P}}} \vert u(p)\vert . \)
Lemma 6.6
Let the strategies \( \sigma , \sigma ' \in {{{\mathcal {S}}}}_{1} \) be such that, for every \(t\in {\mathbb {N}} \) and every history \(h\in {\mathcal {H}}^t,\) \( \Vert \sigma (h)-\sigma '(h)\Vert _{\mathrm{TV}}\le \delta _{t}. \) Then, for every strategy \(\tau \in \mathcal{S}_{2}, \) for every \(t\in {\mathbb {N}}, \) and for every history \(h\in {\mathcal {H}}^t,\)
$$\begin{aligned}\left| {\mathbb {E}}_{h,\sigma ,\tau }[u]-{\mathbb {E}}_{h,\sigma ',\tau }[u]\right| \le M\cdot \sum _{n=t}^{\infty }\delta _n.\end{aligned}$$
Proof
Let \(\tau \in {{{\mathcal {S}}}}_{2} \) be given. It follows from a more general result in Abate, Redig, and Tkachev ([1], Theorem 1) that, for every \(t\in {\mathbb {N}}\) and every history \(h\in {\mathcal {H}}^t,\)
$$\begin{aligned} \Vert {\mathbb {P}}_{h,\sigma ,\tau }-{\mathbb {P}}_{h,\sigma ',\tau }\Vert _{\mathrm{TV}}\le \sum _{n=t}^{\infty }\delta _n. \end{aligned}$$
(6.7)
For completeness, we provide a direct proof of this inequality in Lemma B.1 in Appendix B. Since \(\vert u\vert \le M\), the claim of Lemma 6.6 follows directly. \(\square \)
Lemma 6.6 says the following. Consider two arbitrary strategies \(\sigma , \sigma ' \in {{{\mathcal {S}}}}_{1} \) such that the total variation distance between their mixed actions at every history in period t is at most \(\delta _t\). Now consider \( t \in {\mathbb {N}}, \) a history \( h \in {\mathcal {H}}^t, \) and a strategy \( \tau \in {{{\mathcal {S}}}}_{2} \) of player 2. Then, the expected payoffs under \((\sigma ,\tau )\) and \((\sigma ',\tau )\) in the subgame at h differ by at most the constant M times \(\sum _{n=t}^{\infty }\delta _n\). Note that this bound does not depend on the strategy \(\tau \), and it depends on the history h only through its period t. Moreover, these bounds tend to 0 as t goes to infinity.
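For the example sequence \(\delta _n=2^{-n}\), the bound of Lemma 6.6 can be evaluated in closed form: the tail sum \(\sum _{n\ge t} 2^{-n}\) equals \(2^{1-t}\). A short Python sketch (with a hypothetical payoff bound \(M=1\), not a quantity from the paper):

```python
# Bound of Lemma 6.6 for delta_n = 2^{-n} and a hypothetical payoff
# bound M = 1: the tail sum over n >= t is 2^{1-t}, so the bound on the
# payoff gap, M * 2^{1-t}, vanishes as t grows.

M = 1.0

def payoff_gap_bound(t, terms=200):
    # Truncated tail sum M * sum_{n=t}^{t+terms-1} 2^{-n}.
    return M * sum(2.0 ** (-n) for n in range(t, t + terms))

assert abs(payoff_gap_bound(0) - 2.0) < 1e-9        # sum_{n>=0} 2^{-n} = 2
assert abs(payoff_gap_bound(10) - 2.0 ** (-9)) < 1e-12
```

This matches the remark above: the bound depends on h only through t, and it tends to 0 as t goes to infinity.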
The importance of property 4 of the tolerance function is shown by the following lemma. It says that if \( \sigma \in {{{\mathcal {S}}}}_{1} \) is a subgame \(\phi ^*\)-maxmin strategy, then for every \( h \in {\mathcal {H}} \) the mixed action \( \sigma (h) \) is close to the set of optimal mixed actions \( O_{h} \) in the one-shot game \(\Upsilon _h\).
Lemma 6.7
Let \(\sigma \in {{{\mathcal {S}}}}_{1} \) be a subgame \(\phi ^*\)-maxmin strategy. Then, for every \( t \in {\mathbb {N}}, \) for every history \( h \in {\mathcal {H}}^t, \) we have \(\sigma (h)\notin D_h^{\delta _t}, \) so \( \Vert \sigma (h)-O_h\Vert _{\mathrm{TV}} < \delta _t. \)
Proof
Because \(\sigma \) is a subgame \(\phi ^*\)-maxmin strategy, it follows from condition 1 of Theorem 4.6 that, for every mixed action \(m_2\in \Delta ({\mathcal {B}})\) of player 2,
$$\begin{aligned}{\mathbb {E}}_{h,\sigma (h),m_2}\left[ {\underline{V}}^{t+1}\right] \ge {\underline{v}}(h)-\phi ^*(h).\end{aligned}$$
If \(D_h^{\delta _t}\) is empty, then there is nothing to prove. If \(D_h^{\delta _t}\) is nonempty, then property 4 of \(\phi ^*\) shows that \(\sigma (h)\notin D_h^{\delta _t}\). \(\square \)
6.4 The Proof of Theorem 6.1
Proof
Let \( \Gamma _{x_{0}}(u) \) be a game and take the tolerance function \(\phi ^*\) as defined in Subsection 6.3. We only need to show that statement 2 implies statement 1. Let \( \sigma \in \mathcal{S}_{1} \) be a subgame \(\phi ^*\)-maxmin strategy of \( \Gamma _{x_{0}}(u). \)
With the help of \( \sigma , \) we define a strategy \( \sigma ^{*} \in {{{\mathcal {S}}}}_{1} \) in Step 1 of the proof. In Steps 2 and 3, we then show that \( \sigma ^{*} \) is a subgame maxmin strategy of \( \Gamma _{x_{0}}(u) \) by verifying that \( \sigma ^{*} \) satisfies the conditions of Corollary 4.7.
Step 1: Definition of \( \sigma ^{*} \in {{{\mathcal {S}}}}_{1}. \)
Take \( t \in {\mathbb {N}} \) and a history \(h\in {\mathcal {H}}^t. \) In view of Lemma 6.7, it holds that \(\Vert \sigma (h)-O_h\Vert _{\mathrm{TV}}<\delta _{t}\). Therefore, there exists \( m^{*}(h) \in O_h \) such that
$$\begin{aligned} \Vert \sigma (h)-m^{*}(h)\Vert _{\mathrm{TV}} <\delta _t. \end{aligned}$$
(6.8)
Now define \(\sigma ^{*}(h)=m^{*}(h)\).
Step 2: \(\sigma ^{*}\) is 1-day maxmin.
Consider some \( \tau \in {{{\mathcal {S}}}}_{2}. \) For every \( t \in {\mathbb {N}} \) and every \( h\in {\mathcal {H}}^t, \) since \(\sigma ^*(h)\in O_h\) we have that
$$\begin{aligned} {\mathbb {E}}_{h,\sigma ^{*},\tau }\left[ {\underline{V}}^{t+1}\right] \ge {\underline{v}}(h). \end{aligned}$$
Step 3: \(\sigma ^{*}\) is equalizing.
Consider some \( \tau \in {{{\mathcal {S}}}}_{2}. \) In view of (6.8), we can apply Lemma 6.6 to conclude that, for every \( t \in {\mathbb {N}} \) and every \( h \in {\mathcal {H}}^t, \)
$$\begin{aligned} \left| {\mathbb {E}}_{h,\sigma ,\tau }[u]-{\mathbb {E}}_{h,\sigma ^*,\tau }[u]\right| \le M \cdot \sum _{n=t}^{\infty } \delta _n. \end{aligned}$$
Hence, for every \( t \in {\mathbb {N}}, \) for every history \(h\in {\mathcal {H}}^t\), and for every \(n\ge t,\)
$$\begin{aligned} \left| U^n_{\sigma ,\tau }-U^n_{\sigma ^*,\tau }\right| \le M\cdot \sum _{i=n}^{\infty }\delta _i, \quad {\mathbb {P}}_{h,\sigma ^{*},\tau }\text {-almost surely.} \end{aligned}$$
(6.9)
Because \(\sigma \) is subgame \(\phi ^*\)-maxmin, for every history \(h\in {\mathcal {H}}, \) we have
$$\begin{aligned} {\mathbb {E}}_{h,\sigma ,\tau }[u]\ge {\underline{v}}(h)-\phi ^*(h). \end{aligned}$$
(6.10)
Thus, for every history \(h\in {\mathcal {H}}\) it holds that
$$\begin{aligned} u=\lim _{t\rightarrow \infty }U_{\sigma ^{*},\tau }^t=\lim _{t\rightarrow \infty }U_{\sigma ,\tau }^t\ge \limsup _{t\rightarrow \infty }\left( {\underline{V}}^t - \Phi ^{*t}\right) =\limsup _{t\rightarrow \infty } {\underline{V}}^t, \quad {\mathbb {P}}_{h,\sigma ^{*},\tau }\text {-almost surely}, \end{aligned}$$
where the first equality is due to Lemma A.1 in Appendix A, the second equality follows from (6.9), the inequality is by (6.10), and the last equality is a consequence of property 3 of \(\phi ^*\) from Subsection 6.3. \(\square \)
Note that it is not necessary that the initial \(\delta _t\) are small, or that the tolerance level \(\phi ^*(h)\) is small at initial histories. The conditions imposed on the sequence \((\delta _t)_{t\in {\mathbb {N}}}\) are there to ensure that the equalizing condition for subgame maxmin strategies is satisfied. The equalizing condition essentially only concerns the payoff in deep subgames. The requirement that the sequence \((\delta _t)_{t\in {\mathbb {N}}}\) decreases to zero very fast guarantees that \(\lim _{t\rightarrow \infty }\Vert {\mathbb {P}}_{p_{\vert t}, \sigma ,\tau }-{\mathbb {P}}_{p_{\vert t}, \sigma ^{*},\tau }\Vert _{\mathrm{TV}}=0\). This allows us to use the \(\phi ^*\)-equalizing property of the subgame \(\phi ^*\)-maxmin strategy to show that the strategy \(\sigma ^*\) satisfies the equalizing condition.
7 Discussion
Countable action sets. We have assumed throughout the paper that the action sets are finite. If one player has countably many actions, several of our arguments and results are no longer valid. For example, consider the following game. In the initial state, player 1’s action set is \({\mathbb {N}}\), player 2 has only one action, and if player 1 chooses action \(n\in {\mathbb {N}}\), then regardless of how the play continues, the payoff is \(1-\frac{1}{n+1}\). This is essentially a one-shot game. Clearly, for every tolerance function \(\phi >0\), player 1 has a pure subgame \(\phi \)-maxmin strategy, i.e., he can play an action n with n sufficiently large. Yet, player 1 has no subgame maxmin strategy. So the statements of Theorems 6.1 and 6.3 and Corollary 6.2 are not valid for this game.
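The example can be made explicit: given a tolerance \(\phi >0\) at the initial history, any action \(n\) with \(n+1\ge 1/\phi \) guarantees the lower value \(1\) within \(\phi \), while no single action attains \(1\). A small Python sketch of this computation (illustrative, following the example above):

```python
import math

# One-shot game from the example: player 1 picks n in N, the payoff is
# 1 - 1/(n+1), and the lower value is 1, which no single action attains.

def payoff(n):
    return 1.0 - 1.0 / (n + 1)

def phi_maxmin_action(phi):
    # Smallest n with 1 - 1/(n+1) >= 1 - phi, i.e. n >= 1/phi - 1.
    return max(0, math.ceil(1.0 / phi) - 1)

# For every tolerance phi > 0 there is a pure phi-maxmin action, yet
# every action falls strictly short of the lower value 1:
n = phi_maxmin_action(0.01)
assert payoff(n) >= 1 - 0.01
assert all(payoff(k) < 1 for k in range(1000))
```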
One-player games. Suppose that player 2 has a trivial action set, i.e., \({\mathcal {B}}\) is a singleton. Even though this is essentially a game with only one player, as shown in Example 3.2, it may happen that player 1 has a subgame \(\phi \)-maxmin strategy for a certain tolerance function \(\phi \), but no pure strategy that is subgame \(\phi \)-maxmin. Nevertheless, if \({\mathcal {B}}\) is a singleton and the conditions of Theorem 5.7 are satisfied, then player 1 has a pure subgame \(\phi \)-maxmin strategy. Indeed, player 1 then has a pure \(\delta \)-maxmin strategy for each \(\delta >0\), so the switching strategy used in the proof of Theorem 5.7 is also pure.
Discounted games. Our model encompasses discounted stochastic games with finite action and countable state spaces, provided that the immediate payoffs are bounded. Since these games admit a value and the discounted payoffs are continuous, it follows by Theorem 5.7 that player 1 has a subgame \(\phi \)-optimal strategy for every tolerance function \(\phi >0\). Therefore, Corollary 6.2 implies the well-known fact that player 1 has a subgame 0-optimal strategy. Our proof does not directly imply that this strategy can be chosen to be stationary.
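As a sanity check on the continuity invoked above, a normalized discounted payoff with immediate payoffs bounded by 1 changes by at most \(\beta ^T\) when truncated after \(T\) stages. A minimal Python sketch (hypothetical rewards and discount factor, not from the paper):

```python
# Normalized discounted payoff (1 - beta) * sum_t beta^t * r_t.
# With |r_t| <= 1, truncation after T stages changes the payoff by at
# most the tail (1 - beta) * sum_{t >= T} beta^t = beta^T.

def discounted_payoff(rewards, beta):
    return sum((beta ** t) * (1 - beta) * r for t, r in enumerate(rewards))

# Constant reward 1 over T = 10 stages with beta = 0.5 gives 1 - beta^10:
p = discounted_payoff([1.0] * 10, 0.5)
assert abs(p - (1 - 0.5 ** 10)) < 1e-12
```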
Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.