
Optimal Stopping for Non-Markovian Asset Price Processes

  • Open Access
  • 2026
  • OriginalPaper
  • Chapter
Abstract

This chapter delves into the intricacies of optimal stopping problems, particularly in the context of non-Markovian asset price processes. It begins by defining optimal stopping and its significance in pricing American and Bermudan options. The text explores the mathematical modeling of stopping times, emphasizing the importance of filtrations and the flow of information. It discusses two systematic approaches for solving optimal stopping problems: primal and dual methods. The primal approach is based on the dynamic programming principle, while the dual approach involves a minimization problem over martingales. The chapter also introduces the concept of the Snell envelope, a super-martingale that dominates the reward process, and discusses its computational advantages. A notable feature of this text is its focus on non-Markovian processes, which are often overlooked in favor of simpler Markovian models. It introduces the signature method as a means to encode the history of the process, making it possible to solve optimal stopping problems in a more general form. The chapter concludes with a practical example of how to compute the value of the optimal stopping problem and an optimal stopping time using the signature method. This comprehensive exploration provides valuable insights for professionals working on complex financial models and seeking to understand the nuances of optimal stopping in non-Markovian settings.

1 Optimal Stopping

Optimal stopping problems form an important class of problems within the wider realm of stochastic optimal control. A prime example in finance is the pricing of American options. Unlike European options, American options may be exercised (by the owner) at any time t before a terminal time T, in which case the counterparty has to pay the owner a payoff \(\phi (S_t)\) given in terms of the current price \(S_t\) of the underlying. For example, in the case of an American put option, we have \(\phi (S_t) = (K - S_t)^+ = \max (K-S_t, 0)\), where K is a fixed threshold value, the strike price.
When we mathematically model an American option, we obviously need to allow the owner to stop at a random time \(t = t(\omega )\), to accommodate reasonable strategies such as “stop at the first time when S is lower than a given level L”. At the same time, our model must not allow arbitrary random exercise times, to reflect that the owner’s decision to exercise at time t has to be taken at (or before) time t—i.e., the owner cannot “look into the future”. This, in particular, excludes trivial “solutions”, such as exercising at time \( \operatorname *{\mbox{arg max}}_{t \in [0,T]} \phi (S_t)\)—the pathwise optimal time.
“Not looking into the future” is a statement about the flow of information, which is modeled by filtrations in probability theory. Given a filtration, exercise times conforming to the above-mentioned restrictions are then modeled as stopping times. In what follows, we will in parallel consider stopping times taking values in a set \(\mathcal {T}\) of possible times, where we restrict ourselves to either of the following important cases:
1.
Continuous time, i.e., \(\mathcal {T} = [0,T]\) for some \(T>0\);
 
2.
Discrete time, i.e., \(\mathcal {T} = \{0=t_0 < t_1 < \cdots < t_N = T\}\) for some \(T>0\), \(N \ge 1\).
 
Definition 1.1
Given a probability space \((\Omega , \mathcal {F}, \mathbb {P})\) and a filtration \((\mathcal {F}_t)_{t \in \mathcal {T}}\), a stopping time is a random variable \(\tau : \Omega \to \mathcal {T} \cup \{\infty \}\) such that for all \(t \in \mathcal {T}: \ \{\tau \le t\} \in \mathcal {F}_t\). We denote the set of all stopping times by \(\mathcal {S}\).
We can now define the optimal stopping problem, where we, for simplicity, assume that interest rates are \(r=0\), or, equivalently, all cash-flows are already properly discounted.
Definition 1.2
In the context of Definition 1.1, let \((Y_t)_{t \in \mathcal {T}}\) be an adapted (i.e., for all \(t \in \mathcal {T}\), \(Y_t\) is \(\mathcal {F}_t\)-measurable) reward process taking values in \(\mathbb {R}_{\ge 0}\), which is bounded in the sense that \(\mathbb {E}\left [ \sup _{t \in \mathcal {T}} Y_t \right ] < \infty \). Then the optimal stopping problem is the problem of computing
$$\displaystyle \begin{aligned} \sup_{\tau \in \mathcal{S}} \mathbb{E}\left[ Y_{\tau \wedge T} \right]. \end{aligned}$$
Example 1.3
An especially important example of the optimal stopping problem is given when the reward process satisfies \(Y_t = \phi (S_t)\) in terms of an underlying asset price process S and a payoff function \(\phi \). The value of the optimal stopping problem (under a risk-neutral probability measure) then corresponds to the value of an American option if \(\mathcal {T} = [0,T]\) and of a Bermudan option if \(\mathcal {T} = \{0=t_0 < t_1 < \cdots < t_N = T\}\). The financial interpretation of \(\mathcal {T}\) is as the set of all possible exercise times.
For concreteness’ sake, we will usually consider the filtration to be generated by a (continuous) stochastic process \(X = (X_t)_{t \in \mathcal {T}}\) taking values in \(\mathbb {R}^d\). A natural class of stopping times are first hitting times of a process w.r.t. a set.
Example 1.4
Given a (for simplicity) continuous, adapted process \((S_t)_{t \in \mathcal {T}}\) taking values in \(\mathbb {R}\) and a measurable subset \(A \subset \mathbb {R}\), the first hitting time
$$\displaystyle \begin{aligned} \tau_A := \inf\{ t \in \mathcal{T} | S_t \in A\} \end{aligned}$$
is a stopping time.
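On a discretely sampled path, such a hitting time is a one-liner. The following is a minimal sketch of our own (the function name and the toy numbers are illustrative, not taken from the chapter's notebook), for the set \(A = (-\infty, L]\):

```python
import numpy as np

def first_hitting_time(path, times, level):
    """First grid time at which the path is at or below `level`
    (i.e., the hitting time of A = (-inf, level]); returns np.inf if the
    level is never hit, matching the convention tau: Omega -> T ∪ {∞}
    of Definition 1.1."""
    hits = np.flatnonzero(np.asarray(path) <= level)
    return times[hits[0]] if hits.size else np.inf

# Toy example: a deterministic "price" path sampled on a grid.
times = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
path = np.array([100.0, 98.0, 95.0, 97.0, 93.0])
tau = first_hitting_time(path, times, level=95.0)  # path first reaches 95 at t = 0.5
```

Note that the decision to stop at time t only uses the path up to time t, which is exactly the adaptedness requirement of Definition 1.1.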
As we shall see later, for optimal stopping we can essentially restrict ourselves to stopping times that are hitting times.
How can we compute the value \(\sup _{\tau \in \mathcal {S}} \mathbb {E}\left [ Y_{\tau \wedge T} \right ]\) of the optimal stopping problem as well as an optimal stopping time \(\tau ^\ast \)? Several possibilities come to mind. We could, for instance, hope that optimal stopping times are, indeed, hitting times of the underlying driving process X w.r.t. sets A, parameterize a sufficiently large class of such sets \(A_\theta \), and optimize over \(\mathbb {E}[Y_{\tau _{A_\theta } \wedge T}]\). This approach forms the basis of the first signature-based method suggested in Sect. 3.
There are two important systematic approaches for solving optimal stopping problems, sometimes called primal and dual methods.
On the primal side, numerical methods are usually based on the dynamic programming principle, which we shall formulate in discrete time for simplicity. Let us first introduce the value process (also known as the Snell envelope)
$$\displaystyle \begin{aligned} {} V_t := \operatorname*{\mbox{ess sup}}_{\tau \in \mathcal{S}_t} \mathbb{E}\left[ Y_{\tau \wedge T} \mid \mathcal{F}_t \right], \ \mathcal{S}_t := \{\tau \in \mathcal{S} | \tau \ge t\},\ t \in \mathcal{T}, \end{aligned} $$
(1)
which serves as the backbone for most of the theory of the optimal stopping problem. In particular, V  is a super-martingale, and is, in fact, the smallest super-martingale dominating the reward process Y . In discrete time, it is easy to see that it satisfies a semi-explicit backward recursive formula. Indeed, we have
$$\displaystyle \begin{aligned} {} V_{t_n} = \begin{cases} Y_T, & n = N,\\ \max\left( Y_{t_n},\, \mathbb{E}[V_{t_{n+1}} \mid \mathcal{F}_{t_n}] \right), & 0 \le n < N, \end{cases} \end{aligned} $$
(2)
see, for instance, [15, Section 1.1]. Note that (2) implies an explicit formula for an optimal stopping time (as a first hitting time):
$$\displaystyle \begin{aligned} {} \tau^\ast := \inf\{t \in \mathcal{T} | V_t = Y_t\}. \end{aligned} $$
(3)
Note that we can solve (2) provided we can compute the conditional expectation, i.e., solve a regression problem. We will come back to this approach in the context of signatures in Sect. 3.
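In discrete time, recursion (2) combined with a regression estimate of the continuation value is the classical least-squares Monte Carlo (Longstaff–Schwartz) scheme. The following sketch is our own illustration, not the chapter's notebook: the martingale GBM model, the strike, and the degree-2 polynomial regression basis are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (our choice): Bermudan put, zero rates, driftless (martingale) GBM.
K, sigma, T, N, n_paths = 100.0, 0.2, 1.0, 10, 50_000
dt = T / N
Z = rng.standard_normal((n_paths, N))
S = 100.0 * np.exp(np.cumsum(-0.5 * sigma**2 * dt + sigma * np.sqrt(dt) * Z, axis=1))
S = np.hstack([np.full((n_paths, 1), 100.0), S])
Y = np.maximum(K - S, 0.0)                    # reward process Y_t = (K - S_t)^+

# Backward recursion (2): V_T = Y_T, V_{t_n} = max(Y_{t_n}, E[V_{t_{n+1}} | F_{t_n}]).
# The conditional expectation is estimated by a degree-2 polynomial regression
# on S_{t_n}, which is legitimate here because this toy model is Markovian.
V = Y[:, -1].copy()
for n in range(N - 1, 0, -1):
    itm = Y[:, n] > 0.0                        # regress on in-the-money paths only
    coeffs = np.polyfit(S[itm, n], V[itm], deg=2)
    cont = np.polyval(coeffs, S[:, n])         # estimated continuation value
    stop = itm & (Y[:, n] >= cont)
    V = np.where(stop, Y[:, n], V)

value_estimate = V.mean()                      # approximation of V_0
```

Since the exercise rule is estimated on the same paths used for the final average, the estimate carries a small in-sample bias; Remark 1.5 below explains how independent resimulation turns such an estimate into a genuine lower bound.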
From a computational point of view, it is very convenient to have primal and dual formulations (without duality gap) of the same problem, since this enables us to sandwich the unknown value \(V_0\).
For the optimal stopping problem, it turns out that the dual problem presented in [16] is a minimization problem over martingales. More precisely, we have
$$\displaystyle \begin{aligned} {} V_0 = \inf_{M \in \mathcal{M}_0} \mathbb{E}\left[ \sup_{t \in \mathcal{T}} \left( Y_t - M_t \right) \right], \end{aligned} $$
(4)
where \(\mathcal {M}_0\) contains all (say, square integrable) martingales starting in \(M_0 = 0\).
Remark 1.5
Suppose that \(\tau \) is an approximation of (3), i.e., a genuine, but sub-optimal stopping time. If we now compute \(\mathbb {E}\left [ Y_{\tau \wedge T} \right ]\), using samples independent of those used to construct \(\tau \) in the first place, we obtain a lower bound for the (unknown) true value \(V_0\). Conversely, suppose that M is an approximation of the optimal martingale in (4), i.e., a genuine, but sub-optimal martingale. Plugging M into the representation (4), again using independent samples to preserve the martingale structure, we obtain an upper bound for \(V_0\), i.e.,
$$\displaystyle \begin{aligned} \mathbb{E}\left[ Y_{\tau \wedge T} \right] \le V_0 \le \mathbb{E}\left[ \sup_{t \in \mathcal{T}} \left( Y_t - M_t \right) \right]. \end{aligned}$$
Hence, the error in either approximation can be bounded by the difference between our (computed) lower and upper bounds.
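The sandwich can be demonstrated with any genuine stopping time and any genuine martingale, however crude. In the sketch below (our own illustration; the model, the barrier level, and the choice of martingale are arbitrary), the lower bound uses the hitting time of a barrier, and the upper bound uses the martingale \(M_t = S_0 - S_t\), which starts at 0 and is a martingale because S is one under the pricing measure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Bermudan put (our choices): zero rates, driftless (martingale) GBM paths.
K, sigma, T, N, n_paths = 100.0, 0.2, 1.0, 10, 100_000
dt = T / N
Z = rng.standard_normal((n_paths, N))
S = 100.0 * np.exp(np.cumsum(-0.5 * sigma**2 * dt + sigma * np.sqrt(dt) * Z, axis=1))
S = np.hstack([np.full((n_paths, 1), 100.0), S])
Y = np.maximum(K - S, 0.0)

# Lower bound: E[Y_{tau ∧ T}] for the genuine (but sub-optimal) stopping time
# tau = first grid time with S <= 90, and tau = T if the barrier is never hit.
hit = S <= 90.0
first = np.where(hit.any(axis=1), hit.argmax(axis=1), N)
lower = Y[np.arange(n_paths), first].mean()

# Upper bound: E[sup_t (Y_t - M_t)] for the martingale M_t = S_0 - S_t, M_0 = 0.
M = 100.0 - S
upper = (Y - M).max(axis=1).mean()
```

Crude choices of \(\tau \) and M produce a wide sandwich; the point of the signature-based methods later in the chapter is to optimize over rich parametric families on both sides.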
Finally, let us comment on the role of the Markov property for both the dynamic programming problem and the dual problem. As already pointed out above, the main computational challenge in the dynamic programming principle is the computation of the continuation value, i.e., the conditional expectation \(\mathbb {E}[ V_{t_{n+1}} \mid \mathcal {F}_{t_n}]\). In the Markovian case, this can be expressed as \(\mathbb {E}[ V_{t_{n+1}} \mid X_{t_n}]\), and we can make an ansatz
$$\displaystyle \begin{aligned} \mathbb{E}[ V_{t_{n+1}} \mid X_{t_n}] = \sum_i c_i \varphi_i(X_{t_n}), \end{aligned}$$
reducing the problem to a standard least-squares problem. If the Markov property does not hold, we can still re-express
$$\displaystyle \begin{aligned} \mathbb{E}[ V_{t_{n+1}} \mid \mathcal{F}_{t_n}] = \mathbb{E}[ V_{t_{n+1}} \mid X_{t_0}, X_{t_1}, \ldots, X_{t_n}], \end{aligned}$$
but this corresponds to an (infeasible) regression problem in \(\mathbb {R}^{d(n+1)}\).
In the dual approach, we can often prove the existence of an optimal martingale expressible in terms of some function of X under the Markov property—e.g., in the Brownian setting, some optimal martingale is given by
$$\displaystyle \begin{aligned} M_t = \int_0^t f(s, X_s) \mathrm{d} W_s. \end{aligned}$$
Again, we are left with a (low-dimensional) functional approximation problem.
Code Notebook
The complete and documented code for this chapter can be found in a Jupyter Notebook in the following GitHub repository:

2 The Space of Stopped Rough Paths

The optimal stopping problem consists of finding a rule that tells us at every time instance whether it is favourable to stop a stochastic process now or to let it run further to maximize a certain gain. The information available to us is the whole past of the process up to the present time. If the process satisfies the Markov property, the problem simplifies considerably, since the past then does not matter. Consequently, most of the work that was done in the field of optimal stopping concentrates on the Markovian case, cf. e.g. the classical text [18]. However, one can argue that in many situations, assuming the Markov property leads to an oversimplification of the problem. In the general path-dependent case, the history of the process can be encoded using the signature, and it should not be too surprising that the optimal stopping problem can be reformulated in terms of the signature, too. This will lead to a computationally convenient way to solve the optimal stopping problem in a very general form.
In this chapter, we will consider exclusively the continuous-time case \(\mathcal {T} = [0,T]\). We will always assume that \(\mathcal {F} = (\mathcal {F}_t)_{t \in \mathcal {T}}\) is a filtration generated by a continuous stochastic process \(X = (X_t)_{t \in \mathcal {T}}\), i.e. \(\mathcal {F}_t = \sigma (X_s \, |\, s \in \mathcal {T}, s \leq t)\). Let \(\tau \) be an \(\mathcal {F}\)-stopping time. Note that for any continuous \(\mathbb {R}\)-valued process Y , we have the identity
$$\displaystyle \begin{aligned} Y_{\tau \wedge T} = Y_0 + \int_0^T \mathbb{1}_{[0,\tau \wedge T)}(t) \, \mathrm{d}Y_t. \end{aligned} $$
For any \(\mathcal {F}\)-adapted stochastic process U and any \(z > 0\), the random variable
$$\displaystyle \begin{aligned} \tilde{\tau} := \inf \left\{ t \geq 0 \, |\, \int_0^t U_s^2 \, \mathrm{d}s \geq z \right\} \wedge T \end{aligned} $$
defines a stopping time, and thus
$$\displaystyle \begin{aligned} Y_{\tilde{\tau} \wedge T} = Y_0 + \int_0^T \mathbb{1}_{[0,z)}\left(\int_0^t U^2_s \, \mathrm{d} s \right) \, \mathrm{d}Y_t. \end{aligned} $$
The idea is now to approximate
$$\displaystyle \begin{aligned} {} \sup_{\tau \in \mathcal{S}} \mathbb{E}[Y_{\tau \wedge T}] \end{aligned} $$
(5)
by
$$\displaystyle \begin{aligned} {} \sup_{U} \mathbb{E} \left[Y_0 + \int_0^T \mathbb{1}_{[0,z)}\left(\int_0^t U^2_s \, \mathrm{d} s \right) \, \mathrm{d}Y_t \right] \end{aligned} $$
(6)
where the supremum runs over a sufficiently rich class of \(\mathcal {F}\)-adapted stochastic processes U. To see that (5) and (6) can indeed be equal, we can informally define for any given stopping time \(\tau \) an adapted process U by setting
$$\displaystyle \begin{aligned} U_t := \begin{cases} 0 &\text{for } t \leq \tau \wedge T \\ \infty &\text{for } t > \tau \wedge T \end{cases} \end{aligned} $$
that clearly satisfies
$$\displaystyle \begin{aligned} Y_{\tau \wedge T} = Y_0 + \int_0^T \mathbb{1}_{[0,z)}\left(\int_0^t U^2_s \, \mathrm{d} s \right) \, \mathrm{d}Y_t. \end{aligned} $$
Using the signature, we will be able to parametrize a subclass of adapted processes that will still be large enough to attain the supremum in (5).
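The "randomized clock" \(\tilde {\tau }\) above is easy to discretize: on a grid, we accumulate a left-point Riemann sum for \(\int _0^t U_s^2\,\mathrm {d}s\) and stop at the first grid time where it reaches z. A minimal sketch of our own (function name and toy numbers are illustrative):

```python
import numpy as np

def stop_by_clock(Y, U, dt, z):
    """Grid version of tau~ = inf{t : ∫_0^t U_s^2 ds >= z} ∧ T: return the
    value of Y at the first grid time where the "clock"
    A_t = ∫_0^t U_s^2 ds (left-point Riemann sum) reaches level z."""
    A = np.concatenate([[0.0], np.cumsum(U[:-1] ** 2) * dt])
    crossed = np.flatnonzero(A >= z)
    k = crossed[0] if crossed.size else len(Y) - 1   # never crossed: stop at T
    return Y[k]

# With U ≡ 1, dt = 0.1 and z = 0.5, the clock reaches z at the fifth grid point.
Y = np.arange(21.0)
tau_value = stop_by_clock(Y, np.ones(21), dt=0.1, z=0.5)   # Y[5] = 5.0
```

With \(U \equiv 0\) the clock never reaches z and we collect \(Y_T\); taking U very large from some time \(\tau \) onwards makes the clock cross immediately afterwards, which is precisely the informal construction in the text.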
One key point for using the signature method in optimal stochastic control theory is to model the notion of adaptedness via a suitable factorization over a space of path segments or stopped paths. To illustrate the idea, recall that if U is a random variable that is \(\sigma (Z)\)-measurable for another random variable Z, there is a measurable map f such that \(U = f(Z)\). Therefore, if \((U_t)_{t \in \mathcal {T}}\) is \(\mathcal {F}\)-adapted, we know (all technicalities aside) that for every \(t \in \mathcal {T}\), there is a measurable map \(f_t\) such that \(U_t = f_t(X|_{[0,t]})\). Here, the maps \(f_t\) are defined on suitable path spaces where the paths are defined on the interval \([0,t]\). We would now like to remove the dependency of the map f on the time point t. To do this, we need to find a suitable space that contains paths that are defined on all different time intervals \([0,t]\), \(t \in \mathcal {T}\), i.e. on path segments. From now on, we assume that X is a continuous stochastic process taking values in \(\mathbb {R}^d\). The following definition first appeared in Dupire’s work on functional Itô calculus [8].
Definition 2.1
Set \(\mathcal {C}_t := \mathcal {C}([0,t],\mathbb {R}^d)\) and define the set
$$\displaystyle \begin{aligned} \mathscr{C}_T := \bigcup_{t \in [0,T]} \mathcal{C}_t. \end{aligned} $$
For \(X, Y \in \mathscr {C}_T\), \(X \colon [0,t] \to \mathbb {R}^d\) and \(Y \colon [0,s] \to \mathbb {R}^d\), we define
$$\displaystyle \begin{aligned} d_{\mathscr{C}_T}(X,Y) := \sup_{u \in [0,T]} |X_{u \wedge t} - Y_{u \wedge s}| + |t-s|. \end{aligned} $$
It is not hard to show that \((\mathscr {C}_T,d_{\mathscr {C}_T})\) is a metric space. With a bit more work, one can show that \(\mathscr {C}_T\) is complete and separable, i.e. a Polish space. The following proposition shows that the space \(\mathscr {C}_T\) indeed serves as a suitable space for the desired factorization.
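The distance \(d_{\mathscr {C}_T}\) is straightforward to evaluate numerically for sampled paths. The sketch below is our own (the function name, the linear interpolation, and the restriction to one-dimensional paths are our simplifications):

```python
import numpy as np

def d_stopped(X, tX, Y, tY, T, n_eval=2001):
    """Distance between path segments X: [0,t] -> R and Y: [0,s] -> R, each
    given by samples on its own time grid (tX ends at t, tY ends at s):
    sup_u |X_{u∧t} - Y_{u∧s}| + |t - s|, with the supremum approximated
    on a uniform evaluation grid over [0, T]."""
    t, s = tX[-1], tY[-1]
    u = np.linspace(0.0, T, n_eval)
    Xu = np.interp(np.minimum(u, t), tX, X)   # X_{u∧t} via linear interpolation
    Yu = np.interp(np.minimum(u, s), tY, Y)   # Y_{u∧s}
    return np.abs(Xu - Yu).max() + abs(t - s)

# Example: the identity path on [0,1] versus its restriction to [0, 0.5].
tX = np.linspace(0.0, 1.0, 101); X = tX.copy()
tY = np.linspace(0.0, 0.5, 51);  Y = tY.copy()
dist = d_stopped(X, tX, Y, tY, T=1.0)   # sup term 0.5 plus time term 0.5
```

Note that two segments can be close only if both their values and their time horizons are close, which is exactly what the extra \(|t-s|\) term enforces.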
Proposition 2.2
Let \(U = (U_t)_{t \in [0,T]}\) be an \(\mathbb {R}\)-valued, \(\mathcal {F}\)-adapted and left-continuous stochastic process. Then there exists a Borel-measurable map \(\theta \colon \mathscr {C}_T \to \mathbb {R}\) such that
$$\displaystyle \begin{aligned} U_t(\omega) = \theta(X(\omega)|_{[0,t]}) \end{aligned} $$
for every \(t \in [0,T]\) and \(\omega \in \Omega \).
The proof of this proposition is deferred to the appendix. Let us note that the left-continuity assumption can actually be dropped when U is progressively measurable. A proof of this more general result can be found in the forthcoming work [1].
Proposition 2.2 is, however, not yet directly useful for us, since not every continuous path has a well-defined signature. Instead, we first consider, in the next section, the smaller space of bounded variation paths, for which the signature always exists.

2.1 Signature Factorization for Bounded Variation Paths

In the last section, we saw that for every left-continuous \(\mathcal {F}\)-adapted process U, there exists a measurable map \(\theta \colon \mathscr {C}_T \to \mathbb {R}\) such that \(U_t = \theta (X|_{[0,t]})\). In this section, we will assume that the paths of X are of bounded variation such that for every \(0 \leq s \leq t \leq T\), the signature
$$\displaystyle \begin{aligned} \mathbb{X}^{<\infty}_{s,t} = \left(1, X_t - X_s, \int_{s < t_1 < t_2 < t} \mathrm{d} X_{t_1} \otimes \mathrm{d} X_{t_2}, \ldots \right) \end{aligned} $$
exists as an element in the extended tensor algebra
$$\displaystyle \begin{aligned} T((\mathbb{R}^d)) := \prod_{m = 0}^{\infty} (\mathbb{R}^d)^{\otimes m}, \end{aligned} $$
cf. Sect. 2.3.3 of chapter “A Primer on the Signature Method in Machine Learning” and Sects. 2 and 3 of chapter “The Signature Kernel”. The projection of \(T((\mathbb {R}^d))\) on the n-th tensor power \((\mathbb {R}^d)^{\otimes n}\) is denoted by \(\pi _n \colon T((\mathbb {R}^d)) \to (\mathbb {R}^d)^{\otimes n}\). \(T((\mathbb {R}^d))\) is the algebraic dual of the tensor algebra
$$\displaystyle \begin{aligned} T((\mathbb{R}^d)^*) := \bigoplus_{m = 0}^{\infty} ((\mathbb{R}^d)^*)^{\otimes m} \end{aligned} $$
where the pairing is denoted by \(\langle \cdot , \cdot \rangle \colon T((\mathbb {R}^d)^*) \times T((\mathbb {R}^d)) \to \mathbb {R}\), cf. Sect. 4 of chapter “The Signature Kernel”. The shuffle product \(\shuffle \) that was introduced in Definition 2.12 of chapter “A Primer on the Signature Method in Machine Learning” extends to a product on \(T((\mathbb {R}^d)^*)\), and the shuffle identity from Theorem 2.14 of chapter “A Primer on the Signature Method in Machine Learning” can be reformulated as
$$\displaystyle \begin{aligned} \langle l_1, \mathbb{X}^{<\infty}_{0,T} \rangle \, \langle l_2, \mathbb{X}^{<\infty}_{0,T} \rangle = \langle l_1 \shuffle l_2, \mathbb{X}^{<\infty}_{0,T} \rangle \end{aligned} $$
for every \(l_1, l_2 \in T((\mathbb {R}^d)^*)\) (see also Eq. (8) of chapter “The Signature Kernel”).
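The shuffle identity is easy to verify numerically at low truncation levels. For a piecewise-linear path, the signature up to level 2 can be computed exactly via Chen's identity, and for the one-letter words \(e_1, e_2\) the identity reads \(\langle e_1, \mathbb {X}\rangle \langle e_2, \mathbb {X}\rangle = S^{12} + S^{21}\). The code below is our own illustration (`sig_level12` is not a library function):

```python
import numpy as np

def sig_level12(path):
    """Level-1 and level-2 signature of a piecewise-linear path in R^d,
    assembled from the exact signatures of the linear pieces."""
    d = path.shape[1]
    S1 = np.zeros(d)
    S2 = np.zeros((d, d))
    for k in range(len(path) - 1):
        inc = path[k + 1] - path[k]
        # Chen's identity: concatenating a linear piece with increment `inc`
        # adds S1 ⊗ inc plus the piece's own level 2, which is inc ⊗ inc / 2.
        S2 = S2 + np.outer(S1, inc) + 0.5 * np.outer(inc, inc)
        S1 = S1 + inc
    return S1, S2

rng = np.random.default_rng(2)
path = rng.standard_normal((20, 2)).cumsum(axis=0)  # random piecewise-linear path
S1, S2 = sig_level12(path)

# Shuffle identity for one-letter words: <e_1,X><e_2,X> = S^{12} + S^{21}.
lhs = S1[0] * S1[1]
rhs = S2[0, 1] + S2[1, 0]
```

The same mechanism (products of linear functionals are again linear functionals of the signature) is what makes the set \(\mathcal {W}\) in the proof of Proposition 2.4 an algebra.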
Our goal will be to identify a class of functionals \(\Theta \) such that for any given map \(\theta \), we can find a \(\vartheta \in \Theta \) such that
$$\displaystyle \begin{aligned} \theta(X|_{[0, \cdot]}) \approx \vartheta(\mathbb{X}^{<\infty}_{0,\cdot}) \end{aligned} $$
where \(\mathbb {X}^{<\infty }\) denotes the signature of X. It is clear that this can only work if X can be recovered from \(\mathbb {X}^{<\infty }\). We therefore consider the time-augmented process \(\widehat {X}_t := (t,X_t)\), which is a stochastic process with values in \(\mathbb {R}^{1+d}\). Its signature is denoted by \(\widehat {\mathbb {X}}^{<\infty }\). From Proposition 2.34 of chapter “A Primer on the Signature Method in Machine Learning”, we know that X can indeed be recovered from \(\widehat {\mathbb {X}}^{< \infty }\).
Inspired by Dupire, we make the following definition:
Definition 2.3
Let \(\widehat {\mathcal {X}}_t\) denote the set of paths \(X \colon [0,t] \to \mathbb {R}^{1 + d}\) having bounded variation and for which the first component is time, i.e. \(X_s = (s, X^1_s,\ldots ,X^d_s)\) for every \(s \in [0,t]\). Set
$$\displaystyle \begin{aligned} \widehat{\mathscr{X}}_T := \bigcup_{t \in [0,T]} \widehat{\mathcal{X}}_t. \end{aligned} $$
We equip \(\widehat {\mathscr {X}}_T\) with the following metric: For \(X, Y \in \widehat {\mathscr {X}}_T\) with \(X \colon [0,t] \to \mathbb {R}^{1 + d}\), \(Y \colon [0,s] \to \mathbb {R}^{1 + d}\), we set
$$\displaystyle \begin{aligned} d_{\widehat{\mathscr{X}}_T}(X,Y) := |X_0 - Y_0| + \int_0^{T} |\dot{X}_{u \wedge t} - \dot{Y}_{u \wedge s}|\, \mathrm{d} u + |t-s|. \end{aligned} $$
We can now prove that continuous functionals defined on \(\widehat {\mathscr {X}}_T\) can be arbitrarily well approximated by linear functionals of the signature on compact sets. The following proposition is a generalization of Proposition 2.35 in chapter “A Primer on the Signature Method in Machine Learning”.
Proposition 2.4
Let \(\theta \colon \widehat {\mathscr {X}}_T \to \mathbb {R}\) be continuous and let \(K \subset \widehat {\mathscr {X}}_T\) be a compact set. Then for every \(\varepsilon > 0\), there exists an element \(l \in T((\mathbb {R}^{1+d})^*)\) such that
$$\displaystyle \begin{aligned} \sup_{X|_{[0,t]} \in K} |\theta(X|_{[0,t]}) - \langle l , \mathbb{X}^{<\infty}_{0,t} \rangle | \leq \varepsilon. \end{aligned} $$
Proof
One first checks that for every \(l \in T((\mathbb {R}^{1 + d})^*)\), the map
$$\displaystyle \begin{aligned} X|_{[0,t]} \mapsto \langle l, \mathbb{X}_{0,t}^{<\infty} \rangle \end{aligned} $$
defined from \( \widehat {\mathscr {X}}_T\) to \(\mathbb {R}\) is continuous. Set
$$\displaystyle \begin{aligned} \mathcal{W} := \{\theta \in \mathcal{C}( \widehat{\mathscr{X}}_T, \mathbb{R}) \, |\, \exists l \in T((\mathbb{R}^{1 + d})^*) \text{ s.t. } \theta(X|_{[0,t]}) {=} \langle l , \mathbb{X}^{< \infty}_{0,t} \rangle\ \forall X|_{[0,t]} \in \widehat{\mathscr{X}}_T \}. \end{aligned} $$
We will prove that \(\mathcal {W}\) satisfies the conditions of the Stone-Weierstrass theorem, see Theorem 2.33 in chapter “A Primer on the Signature Method in Machine Learning”. The first two properties of that theorem are immediate. Let \(\theta _1,\theta _2 \in \mathcal {W}\) with corresponding \(l_1, l_2 \in T((\mathbb {R}^{1 + d})^*)\). From the shuffle product identity, it holds that for every \(X|_{[0,t]} \in \widehat {\mathscr {X}}_T\),
$$\displaystyle \begin{aligned} \theta_1(X|_{[0,t]})\, \theta_2(X|_{[0,t]}) = \langle l_1, \mathbb{X}^{<\infty}_{0,t} \rangle \, \langle l_2, \mathbb{X}^{<\infty}_{0,t} \rangle = \langle l_1 \shuffle l_2, \mathbb{X}^{<\infty}_{0,t} \rangle = \theta(X|_{[0,t]}), \end{aligned} $$
where \(\theta \in \mathcal {W}\) is given by
$$\displaystyle \begin{aligned} \theta(X|_{[0,t]}) = \langle l_1 \shuffle l_2, \mathbb{X}^{<\infty}_{0,t} \rangle \end{aligned} $$
for all \(X|_{[0,t]} \in \widehat {\mathscr {X}}_T\). Thus \(\mathcal {W}\) is indeed an algebra. To prove the point-separating property, let X and Y  be two distinct paths. By Proposition 2.34 in chapter “A Primer on the Signature Method in Machine Learning”, their signatures \(\mathbb {X}^{<\infty }_{0,T}\) and \(\mathbb {Y}^{<\infty }_{0,T}\) are distinct, too. Therefore, there is an element \(l \in T((\mathbb {R}^{1 + d})^*)\) such that
$$\displaystyle \begin{aligned} \langle l , \mathbb{X}^{< \infty}_{0,T} \rangle \neq \langle l , \mathbb{Y}^{< \infty}_{0,T} \rangle \end{aligned} $$
which proves the last condition of the Stone-Weierstrass theorem. The claim of our proposition thus follows. □
Remark 2.5
It is possible to prove that \((\widehat {\mathscr {X}}_T, d_{\widehat {\mathscr {X}}_T})\) is a complete metric space. However, since the space of bounded variation paths is not separable, \(\widehat {\mathscr {X}}_T\) is not separable either. As a remedy, one can instead consider the spaces \(\widehat {\mathcal {X}}^0_t\) that are defined as the closure of smooth paths \(X \colon [0,t] \to \mathbb {R}^{1+d}\) of the form \(X_s = (s,X^1_s,\ldots ,X^d_s)\), \(s \in [0,t]\), with respect to the bounded variation metric. It is then possible to prove that the corresponding space
$$\displaystyle \begin{aligned} \widehat{\mathscr{X}}_T^0 = \bigcup_{t \in [0,T]} \widehat{\mathcal{X}}^0_t, \end{aligned} $$
is Polish.
Recall that our goal was to approximate \(\mathcal {F}\)-adapted processes U by functionals of the signature of the process X that generates \(\mathcal {F}\). Looking back at the results of Propositions 2.2 and 2.4, it remains to prove that measurable functions on a metric space can be approximated by continuous functions. The following proposition is well-known. For completeness, we decided to provide a proof in the appendix.
Proposition 2.6
Let \((\mathcal {M},d)\) be a metric space and let \(\mu \) be a probability measure defined on the Borel sets of \(\mathcal {M}\). If \(f \colon \mathcal {M} \to \mathbb {R}\) is measurable, we can find a sequence of continuous maps \((f_n)\) such that
$$\displaystyle \begin{aligned} f_n \to f \end{aligned} $$
\(\mu \)-almost surely as \(n \to \infty \).
Finally, we can state the following:
Proposition 2.7
Let \(X \colon [0,T] \to \mathbb {R}^d\) be a stochastic process. Assume that the process \(\widehat {X}_t = (t,X_t)\) has sample paths in \(\widehat {\mathcal {X}}^0_T\) almost surely. Let U be a left-continuous process adapted to the filtration generated by X. Then there is a sequence of elements \(l_n \in T((\mathbb {R}^{1 + d})^*)\) such that
$$\displaystyle \begin{aligned} \langle l_n, \widehat{\mathbb{X}}^{<\infty}_{0,t}(\omega) \rangle \to U_t(\omega) \end{aligned} $$
\((\mathbb {P} \otimes \operatorname {Leb})\)-almost surely as \(n \to \infty \).
Proof
Let \(\mu \) denote the law of \(\widehat {X}\), which is a probability measure on \(\widehat {\mathcal {X}}^0_T\). Let \(\nu \) be the probability measure on the Borel sets of \(\widehat {\mathscr {X}}^0_T\) that is defined as the image measure of \(\mu \otimes \operatorname {Leb}\) under the map
$$\displaystyle \begin{aligned} (\widehat{X},t) \mapsto \widehat{X}|_{[0,t]}. \end{aligned} $$
Similarly to Proposition 2.2, we can find a measurable map \(\theta \colon \widehat {\mathscr {X}}^0_T \to \mathbb {R}\) such that \(U_t(\omega ) = \theta (\widehat {X}(\omega )|_{[0,t]})\). From Proposition 2.6, we can find a sequence of continuous functions \(\theta _n \colon \widehat {\mathscr {X}}^0_T \to \mathbb {R}\) such that \(\theta _n \to \theta \) \(\nu \)-almost surely as \(n \to \infty \). Next, let \(\tilde {\theta } \colon \widehat {\mathscr {X}}^0_T \to \mathbb {R}\) be any continuous function. We claim that for arbitrary \(\varepsilon > 0\) and \(\delta > 0\), we can find an element \(l \in T((\mathbb {R}^{1+d})^*)\) such that
$$\displaystyle \begin{aligned} {} \nu(|\tilde{\theta}(\widehat{X}|_{[0,t]}) - \langle l, \widehat{\mathbb{X}}^{< \infty}_{0,t} \rangle| \geq \varepsilon ) < \delta. \end{aligned} $$
(7)
Since \(\widehat {\mathscr {X}}^0_T\) is Polish, we can find a compact set \(K \subset \widehat {\mathscr {X}}^0_T\) such that \(\nu (K^c) < \delta \). On the set K, we can use Proposition 2.4 to see that there is an element \(l \in T((\mathbb {R}^{1+d})^*)\) such that
$$\displaystyle \begin{aligned} \sup_{X|_{[0,t]} \in K} |\tilde{\theta}(\widehat{X}|_{[0,t]}) - \langle l, \widehat{\mathbb{X}}^{< \infty}_{0,t} \rangle| < \varepsilon. \end{aligned} $$
and (7) follows. From these results, the assertion of the proposition can be deduced. □

2.2 Signature Factorization for Rough Paths

The most important stochastic processes, like the Brownian motion, do not have sample paths of bounded variation. In this section, we want to explain how to define the signature for those paths.
One way to measure roughness of a path \(X \colon [0,T] \to \mathbb {R}^d\) is to consider its p-variation for some \(p \in [1,\infty )\), i.e.
$$\displaystyle \begin{aligned} \|X\|_{p} := \sup_{\mathcal{D} \subset [0,T]} \left( \sum_{[t_i,t_{i+1}] \in \mathcal{D}} |X_{t_{i+1}} - X_{t_i}|^p \right)^{\frac{1}{p}} \end{aligned} $$
where the supremum ranges over all finite partitions of the interval \([0,T]\). It is not hard to see that \(\|\cdot \|_{p}\) is a seminorm on the space of paths and that for \(p = 1\) it coincides with the bounded variation seminorm. It is also not hard to check that \(\alpha \)-Hölder paths have finite \(\frac {1}{\alpha }\)-variation (however, the converse is not true: the square root function \(t \mapsto \sqrt {t}\) is of bounded variation but not Lipschitz continuous).
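For a path observed at finitely many times (equivalently, its piecewise-linear interpolation evaluated at the sample points), the supremum in the definition can be computed by a simple \(O(n^2)\) dynamic program over subsequences of the sample times. A minimal sketch of our own (the function name is illustrative; for \(p = 1\) the result reduces to the usual total variation):

```python
def p_variation(x, p):
    """p-variation of the finite sequence x: the p-th root of the supremum,
    over all partitions 0 = i_0 < i_1 < ... < i_m = n-1 of the index set,
    of sum_k |x[i_{k+1}] - x[i_k]|**p, computed by dynamic programming."""
    n = len(x)
    best = [0.0] * n   # best[j]: optimal sum for x[0..j] over chains ending at j
    for j in range(1, n):
        best[j] = max(best[i] + abs(x[j] - x[i]) ** p for i in range(j))
    return best[-1] ** (1.0 / p)

# The zig-zag 0, 1, 0, 1 has 1-variation 3 (its total variation),
# while its 2-variation is sqrt(1 + 1 + 1) = sqrt(3).
v1 = p_variation([0.0, 1.0, 0.0, 1.0], 1)
v2 = p_variation([0.0, 1.0, 0.0, 1.0], 2)
```

For \(p = 1\), refining a partition never decreases the sum (triangle inequality), so the finest partition is optimal; for \(p > 1\), refining can decrease individual terms, which is why the optimization over subsequences is genuinely needed.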
Let us assume that a path X is not of bounded variation. Examples from stochastic analysis teach us that, in general, there is no canonical choice of what the iterated integrals of X should be. In fact, if X denotes the Brownian motion, we can make sense of iterated integrals as Itō- or Stratonovich integrals, both satisfying basic properties of an integral, but being different objects. One of Lyons’ key insights was that if finitely many iterated integrals up to a certain order (depending on the regularity parameter p) are given, the remaining iterated integrals, and hence the signature, are uniquely determined. This leads us to the notion of a rough path that we are going to define now.
Set \(\Delta _T := \{ (s, t) \in [0,T]^2 \, |\, s \leq t \}\). For \(p \in [1,\infty )\), choose \(N \in \mathbb {N}\) such that \(p - 1 < N \leq p\). The truncated tensor algebra is defined as
$$\displaystyle \begin{aligned} T^N(\mathbb{R}^d) := \bigoplus_{m=0}^N (\mathbb{R}^d)^{\otimes m}. \end{aligned} $$
For paths \(\mathbf {X}, \mathbf {Y} \colon \Delta _T \to T^N(\mathbb {R}^d)\), we define
$$\displaystyle \begin{aligned} \| \mathbf{X} \|_{p} := \max_{k = 1, \ldots, N} \sup_{\mathcal{D} \subset [0,T]} \left( \sum_{[t_i,t_{i+1}] \in \mathcal{D}} |\pi_k({\mathbf{X}}_{t_i,t_{i+1}})|^{\frac{p}{k}} \right)^{\frac{k}{p}} \end{aligned} $$
and
$$\displaystyle \begin{aligned} d_{p}(\mathbf{X} , \mathbf{Y}) := \max_{k = 1, \ldots, N} \sup_{\mathcal{D} \subset [0,T]} \left( \sum_{[t_i,t_{i+1}] \in \mathcal{D}} |\pi_k({\mathbf{X}}_{t_i,t_{i+1}}) - \pi_k({\mathbf{Y}}_{t_i,t_{i+1}})|^{\frac{p}{k}} \right)^{\frac{k}{p}}. \end{aligned} $$
Definition 2.8
Let \(p \in [1,\infty )\) and \(p - 1 < N \leq p\).
(i)
For a continuous path \(X \colon [0,T] \to \mathbb {R}^d\) of bounded variation, we define its canonical lift to ap-rough path\(\mathbf {X} \colon \Delta _T \to T^N(\mathbb {R}^d)\) by setting
$$\displaystyle \begin{aligned} {\mathbf{X}}_{s,t} &:= \left(1, X_t - X_s, \int_{s < t_1 < t_2 < t} \mathrm{d} X_{t_1} \otimes \mathrm{d} X_{t_2}, \ldots, \right.\\ & \left.\qquad \; \int_{s < t_1 < \ldots < t_N < t} \mathrm{d} X_{t_1} \otimes \cdots \otimes \mathrm{d} X_{t_N} \right). \end{aligned} $$
 
(ii)
A continuous path \(\mathbf {X} \colon \Delta _T \to T^N(\mathbb {R}^d)\) is called a p-rough path if there exists a sequence of smooth paths \(X(n) \colon [0,T] \to \mathbb {R}^d\) such that their canonical lifts \(\mathbf {X}(n)\) satisfy
$$\displaystyle \begin{aligned} d_{p}(\mathbf{X} , \mathbf{X}(n)) \to 0 \end{aligned} $$
as \(n \to \infty \). We denote the space of all p-rough paths by \(\Omega ^{p}_T\) and equip it with the metric \(d_{p}\).
 
(iii)
A continuous path \(\mathbf {X} \colon \Delta _T \to T^N(\mathbb {R}^{1 + d})\) is called a time-augmented p-rough path if there exists a sequence of smooth paths \(X(n) \colon [0,T] \to \mathbb {R}^{1 + d} \) of the form \(X_t(n) = (t, X^1_t(n), \ldots , X^d_t(n))\) such that their canonical lifts \(\mathbf {X}(n)\) satisfy
$$\displaystyle \begin{aligned} d_{p}(\mathbf{X} , \mathbf{X}(n)) \to 0 \end{aligned} $$
as \(n \to \infty \). The space of all time-augmented p-rough paths is equipped with the metric \(d_{p}\), too, and will be denoted by \(\widehat {\Omega }^{p}_T\).
 
For every p-rough path \(\mathbf {X}\), one can uniquely define its time-augmented rough path \(\widehat {\mathbf {X}}\) (this follows from the more general result [11, Theorem 9.26]). Moreover, if \(X(n) \colon [0,T] \to \mathbb {R}^d\) is a sequence of smooth paths such that their canonical lifts \(\mathbf {X}(n)\) satisfy \(d_p(\mathbf {X}, \mathbf {X}(n)) \to 0\) as \(n \to \infty \), the canonical lifts \(\widehat {\mathbf {X}}(n)\) of the time-augmented smooth paths \(\widehat {X}_t(n) = (t,X_t(n))\) satisfy \(d_p(\widehat {\mathbf {X}}, \widehat {\mathbf {X}}(n)) \to 0\) as \(n \to \infty \) [11, Theorem 9.30].
Example 2.9
For \(p = 1\), one has \(N = 1\) and a 1-rough path \(\mathbf {X} \colon \Delta _T \to T^1(\mathbb {R}^d)\) is of the form \({\mathbf {X}}_{s,t} = (1,X_{s,t})\). Let \(X(n)\) be a sequence of smooth paths such that \(d_1(\mathbf {X}(n),\mathbf {X}) \to 0\) as \(n \to \infty \). From the identity \(X_t(n) - X_s(n) = X_t(n) - X_u(n) + X_u(n) - X_s(n)\), it follows in the limit that \(X_{s,t} = X_{s,u} + X_{u,t}\). Setting \(X_t := X_{0,t}\), it follows that \(X_t - X_s = X_{s,t}\). Therefore, the space \(\Omega ^{1}_T\) can be identified with the closure of the space of smooth paths under the bounded variation metric. Moreover, \(\widehat {\Omega }^{1}_T\) can be identified with the space \(\widehat {\mathcal {X}}^0_T\).
For us, the most important point about rough paths is that their signature can be uniquely defined.
Theorem 2.10 (Lyons’ Extension Theorem)
Let \(\mathbf {X} \in \Omega ^{p}_T\). Then there is a unique map \(\mathbb {X}^{<\infty } \colon \Delta _T \to T((\mathbb {R}^d))\) with the following properties:
(i)
\(\mathbb {X}^{<\infty }\) extends \(\mathbf {X}\), i.e. \(\pi _n(\mathbb {X}^{<\infty }_{s,t}) = \pi _n({\mathbf {X}}_{s,t})\) for every \((s,t) \in \Delta _T\) and every \(n = 0,\ldots ,N\).
 
(ii)
\(\mathbb {X}^{<\infty }\) satisfies Chen’s identity, i.e. for every \(s \leq u \leq t\), one has
$$\displaystyle \begin{aligned} \mathbb{X}^{<\infty}_{s,t} = \mathbb{X}^{<\infty}_{s,u} \otimes \mathbb{X}^{<\infty}_{u,t}. \end{aligned} $$
 
(iii)
\(\mathbb {X}^{<\infty }\) satisfies the shuffle identity, i.e. for every \(s \leq t\), one has
$$\displaystyle \begin{aligned} \langle l_1, \mathbb{X}^{<\infty}_{s,t} \rangle \, \langle l_2, \mathbb{X}^{<\infty}_{s,t} \rangle = \langle l_1 \mathbin{\sqcup\!\sqcup} l_2, \mathbb{X}^{<\infty}_{s,t} \rangle \end{aligned} $$
for every \(l_1, l_2 \in T((\mathbb {R}^d)^*)\).
 
(iv)
For every \(M \in \mathbb {N}\),
$$\displaystyle \begin{aligned} \max_{k = 1, \ldots, M} \sup_{\mathcal{D} \subset [0,T]} \left( \sum_{[t_i,t_{i+1}] \in \mathcal{D}} |\pi_k(\mathbb{X}^{<\infty}_{t_i,t_{i+1}})|^{kp} \right)^{\frac{1}{kp}} < \infty. \end{aligned} $$
 
Proof
[14, Theorem 3.7]. □
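To see Chen's identity and the shuffle identity in action, here is a minimal numerical sketch (our own illustration, using only NumPy): it computes the level-2 signature of a piecewise-linear path by concatenating the exact signatures of the linear segments, then checks both identities.

```python
import numpy as np

def sig_level2(path):
    """Level-1 and level-2 signature of a piecewise-linear path.

    path: array of shape (n_points, d). Accumulates the exact signature
    (dx, dx ⊗ dx / 2) of each linear segment via Chen's identity."""
    d = path.shape[1]
    s1 = np.zeros(d)
    s2 = np.zeros((d, d))
    for k in range(len(path) - 1):
        dx = path[k + 1] - path[k]
        # Chen: second level of concatenation picks up the cross term s1 ⊗ dx
        s2 = s2 + np.outer(s1, dx) + np.outer(dx, dx) / 2.0
        s1 = s1 + dx
    return s1, s2

rng = np.random.default_rng(0)
path = rng.standard_normal((20, 2)).cumsum(axis=0)

# Chen's identity: signature over [0,T] = signature over [0,u] ⊗ signature over [u,T]
a1, a2 = sig_level2(path[:11])
b1, b2 = sig_level2(path[10:])
f1, f2 = sig_level2(path)
assert np.allclose(f1, a1 + b1)
assert np.allclose(f2, a2 + b2 + np.outer(a1, b1))

# shuffle identity between levels 1 and 2: <e_i,S><e_j,S> = S2_ij + S2_ji
assert np.allclose(np.outer(f1, f1), f2 + f2.T)
```

Higher truncation levels follow the same Chen-type recursion; level 2 already exhibits the pattern.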
The signature \(\mathbb {X}^{<\infty }_{0,T}\) of a rough path \(\mathbf {X}\) determines the path under suitable conditions. In fact, one can prove the following:
Theorem 2.11
Suppose that \(\mathbf {X}, \mathbf {Y} \in \widehat {\Omega }^{p}_T\) and that \(\mathbb {X}^{<\infty }_{0,T} = \mathbb {Y}^{<\infty }_{0,T}\). Then \(\mathbf {X} = \mathbf {Y}\).
Proof
Follows from the results obtained in [5]. □
Analogous to the bounded variation case, we can now define the space of stopped rough paths:
Definition 2.12
The space of stopped p-rough paths is defined as
$$\displaystyle \begin{aligned} \Lambda_T^p := \bigcup_{t \in [0,T]} \widehat{\Omega}^p_t. \end{aligned} $$
It is equipped with the following metric: If \(\mathbf {X} \in \widehat {\Omega }^p_t\) and \(\mathbf {Y} \in \widehat {\Omega }^p_s\), we define \(\tilde {\mathbf {X}} \in \widehat {\Omega }^p_T\) and \(\tilde {\mathbf {Y}} \in \widehat {\Omega }^p_T\) by \(\tilde {\mathbf {X}}_{u,v} := {\mathbf {X}}_{u \wedge t, v\wedge t}\) and \(\tilde {\mathbf {Y}}_{u,v} := {\mathbf {Y}}_{u \wedge s, v\wedge s}\), respectively. We then set
$$\displaystyle \begin{aligned} d_{\Lambda_T^p}(\mathbf{X},\mathbf{Y}) := d_{p} (\tilde{\mathbf{X}},\tilde{\mathbf{Y}}) + |t-s|. \end{aligned} $$
Using exactly the same arguments as in Proposition 2.4, we can prove the following:
Proposition 2.13
Let \(\theta \colon \Lambda _T^p \to \mathbb {R}\) be continuous and \(K \subset \Lambda _T^p\) be a compact set. Then for every \(\varepsilon > 0\) there exists an element \(l \in T((\mathbb {R}^{1+d})^*)\) such that
$$\displaystyle \begin{aligned} \sup_{\mathbf{X}|_{[0,t]} \in K} |\theta(\mathbf{X}|_{[0,t]}) - \langle l , \mathbb{X}^{<\infty}_{0,t} \rangle | \leq \varepsilon. \end{aligned} $$
This allows us to finally prove the following result:
Proposition 2.14
Let \(\mathbf {X}\) be a stochastic process with sample paths in a rough path space \(\Omega _T^p\). Consider the filtration \(\mathcal {F} = (\mathcal {F}_t)\) generated by \(\mathbf {X}\), i.e.
$$\displaystyle \begin{aligned} \mathcal{F}_t = \sigma({\mathbf{X}}_{u,v} \, |\, (u,v) \in \Delta_t). \end{aligned} $$
Let \(\widehat {\mathbf {X}}\) denote the time-augmented process having sample paths in \(\widehat {\Omega }_T^p\) and let \(\widehat {\mathbb {X}}^{<\infty }\) denote its signature. Let U be a left-continuous process adapted to the filtration \((\mathcal {F}_t)\). Then there is a sequence of elements \(l_n \in T((\mathbb {R}^{1 + d})^*)\) such that
$$\displaystyle \begin{aligned} \langle l_n, \widehat{\mathbb{X}}^{<\infty}_{0,t}(\omega) \rangle \to U_t(\omega) \end{aligned} $$
\((\mathbb {P} \otimes \operatorname {Leb})\)-almost surely as \(n \to \infty \).
Example 2.15
Let \(B \colon [0,T] \to \mathbb {R}^d\) be a Brownian motion. It is known that the sample paths of B are \(\alpha \)-Hölder continuous for every \(\alpha < \frac {1}{2}\). Therefore, the trajectories of B have finite p-variation for every \(p > 2\). It can be shown that B can be lifted to a process \(\mathbf {B}\) having trajectories in the rough path space \(\Omega ^p_T\) for every \(p > 2\) [10, Chapter 3]. In fact, \(\pi _2(\mathbf {B})\) is given by the iterated Stratonovich integrals of B:
$$\displaystyle \begin{aligned} \pi_2({\mathbf{B}}_{s,t}) = \int_s^t (B_u - B_s) \otimes \circ B_u. \end{aligned} $$
The time-augmented process \(\widehat {\mathbf {B}}\) is given on the second level by
$$\displaystyle \begin{aligned} \pi_2(\widehat{\mathbf{B}}_{s,t}) = \begin{pmatrix} \int_s^t (u - s)\, \mathrm{d} u & \int_s^t (u - s)\, \mathrm{d} B_u \\ \int_s^t (B_u - B_s)\, \mathrm{d} u & \int_s^t (B_u - B_s) \otimes \circ B_u \end{pmatrix}. \end{aligned} $$
The signature \(\widehat {\mathbb {B}}^{<\infty }\) consists of all iterated Stratonovich integrals of the process \(\widehat {B}_t = (t,B_t)\). It is clear that the filtration generated by B coincides with the one generated by \(\mathbf {B}\), i.e.
$$\displaystyle \begin{aligned} \mathcal{F}_t = \sigma({\mathbf{B}}_{u,v} \, |\, (u,v) \in \Delta_t) = \sigma(B_u \, |\, u \in [0,t]). \end{aligned} $$
From Proposition 2.14, it then follows that every left-continuous process that is adapted to the Brownian filtration can be approximated by linear functionals of the signature of the time-augmented Brownian motion, i.e. by linear combinations of iterated Stratonovich integrals of the process \(\widehat {B}_t = (t,B_t)\).
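The level-2 entries displayed above can be checked on a sampled Brownian path. The following toy sketch uses the piecewise-linear lift of a simulated path as a stand-in for the Stratonovich lift; the pure-time entries and the shuffle relation between the cross terms hold exactly for this approximation, which is what we test.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 1.0, 1000
t = np.linspace(0.0, T, n + 1)
B = np.concatenate([[0.0], rng.standard_normal(n) * np.sqrt(T / n)]).cumsum()
path = np.column_stack([t, B])          # time-augmented path (t, B_t)

# level-2 signature of the piecewise-linear interpolation via Chen's identity
s1, s2 = np.zeros(2), np.zeros((2, 2))
for k in range(n):
    dx = path[k + 1] - path[k]
    s2 += np.outer(s1, dx) + np.outer(dx, dx) / 2.0
    s1 += dx

assert np.isclose(s1[0], T)             # level 1 in the time direction
assert np.isclose(s2[0, 0], T**2 / 2)   # ∫_0^T (u - 0) du
# integration by parts / shuffle: ∫ u dB_u + ∫ B_u du = T * B_T
assert np.isclose(s2[0, 1] + s2[1, 0], T * B[-1])
```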

3 Approximation Methods

We now take a more detailed look at three different approximation methods for the optimal stopping problem.

3.1 Parametrization of Stopping Times

In the first approach, we parameterize stopping times. As motivated in Sect. 1, the general idea is to encode stopping times as first hitting times of sets (think of optimal exercise regions). Under Markovian assumptions, it would make sense to consider hitting times of the process X itself, i.e., first hitting times w.r.t. subsets of \([0,T] \times \mathbb {R}^d\). In order to generalize to the case with memory, we will instead consider hitting times w.r.t. subsets of \(T((\mathbb {R}^{1+d}))\), i.e., consider
$$\displaystyle \begin{aligned} {} \tau_A := \inf \{t \in [0,T] | \widehat{\mathbb{X}}^{<\infty}_{0,t} \in A\}, \quad A \subset T((\mathbb{R}^{1+d})), \end{aligned} $$
(8)
recalling that \(\widehat {\mathbb {X}}^{<\infty }_{0,t}\) denotes the signature of the time-augmented path \((t,X_t)\). Given that \(T((\mathbb {R}^{1+d}))\) is infinite-dimensional, this leaves us with the question of finding a tractable family of subsets \(A \subset T((\mathbb {R}^{1+d}))\) rich enough to obtain good approximations to optimal stopping times.
In [2], the parameterization finally chosen was
$$\displaystyle \begin{aligned} {} \tau_{\ell} := \inf \{t \in [0,T] | \int_0^t \left\langle \ell\,,\widehat{\mathbb{X}}^{<\infty}_{0,s}\right\rangle ^2 \mathrm{d} s \ge 1\}, \quad \ell \in T((\mathbb{R}^{1+d})^\ast). \end{aligned} $$
(9)
The integral in (9) is increasing in t, which simplifies computing the minimizer in t. Obviously, we have
$$\displaystyle \begin{aligned} \left\langle \ell\,, \widehat{\mathbb{X}}^{<\infty}_{0,s}\right\rangle^2 = \left\langle \ell \mathbin{\sqcup\!\sqcup} \ell\,, \widehat{\mathbb{X}}^{<\infty}_{0,s}\right\rangle. \end{aligned} $$
Additionally, taking into account that running time t was added as the first component of the process \(\widehat {X}_t = (t, X_t)\) as well as the distributive property of the tensor product, we obtain
$$\displaystyle \begin{aligned} \int_0^t \left\langle \ell\,, \widehat{\mathbb{X}}^{<\infty}_{0,s}\right\rangle^2 \mathrm{d}s = \int_0^t \left\langle \ell \mathbin{\sqcup\!\sqcup} \ell\,, \widehat{\mathbb{X}}^{<\infty}_{0,s}\right\rangle \mathrm{d}s = \left\langle (\ell \mathbin{\sqcup\!\sqcup} \ell)\mathbf{1}\,, \widehat{\mathbb{X}}^{<\infty}_{0,t}\right\rangle. \end{aligned} $$
Hence, \(\tau _\ell \) is the first hitting time of the signature process at an affine hyperplane in \(T((\mathbb {R}^{1+d}))\). Given the shuffle identity, we expect that such hitting times are a very expressive class of stopping times. Note that hitting times are, generally, discontinuous functions of the path, so that we would need to appeal to a version of Prop. 2.14 to get a proper universal approximation theorem.
However, from a numerical perspective, a promising alternative might be to first regularize the function instead. It turns out that such regularization can be achieved by randomization, see [2]. Indeed, let us adjust (9) by stopping when we first hit a random level Z, i.e.,
$$\displaystyle \begin{aligned} {} \tau^r_{\ell} := \inf \{t \in [0,T] | \int_0^t \left\langle \ell\,,\widehat{\mathbb{X}}^{<\infty}_{0,s}\right\rangle ^2 \mathrm{d} s \ge Z\}, \quad \ell \in T((\mathbb{R}^{1+d})^\ast), \end{aligned} $$
(10)
where, for simplicity, we fix \(Z \sim \mathrm {Exp}(1)\), independent from X. After just having introduced the auxiliary random variable Z, we immediately integrate against it, giving
$$\displaystyle \begin{aligned} {} \mathbb{E}\left[ Y_{\tau^r_{\ell} \wedge T} \mid X_{\cdot} \right] = Y_0 + \int_0^T \exp\left( - \int_0^t \left\langle \ell\,,\widehat{\mathbb{X}}^{<\infty}_{0,s}\right\rangle^2 \mathrm{d}s \right) \mathrm{d}Y_t, \end{aligned} $$
(11)
see [2, Prop. 4.3] for details. Note that the right hand side in (11) is, in fact, smooth in \(\ell \), so that higher order numerical optimization methods can be applied for computing
$$\displaystyle \begin{aligned} {} V_0 = \sup_{\ell \in T((\mathbb{R}^{1+d})^\ast)} \mathbb{E}\left[ Y_0 + \int_0^T \exp\left( - \int_0^t \left\langle \ell\,,\widehat{\mathbb{X}}^{<\infty}_{0,s}\right\rangle^2 \mathrm{d}s \right) \mathrm{d}Y_t \right]. \end{aligned} $$
(12)
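The randomization identity underlying this smoothed objective can be verified numerically. In the following toy sketch, a deterministic function g stands in for \(s \mapsto \langle \ell, \widehat{\mathbb{X}}^{<\infty}_{0,s}\rangle\) and a smooth deterministic path stands in for Y; both choices are our own, purely for illustration. We compare a Monte Carlo average over the exponential level Z against the closed-form weighting by the survival probability \(\exp(-\int_0^t g(s)^2\,\mathrm{d}s)\).

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0, 2000
t = np.linspace(0.0, T, n + 1)
g = 1.0 + np.sin(3 * t)                 # stand-in for s -> <l, signature>
Y = t**2 + np.cos(5 * t)                # deterministic "reward path"

# cumulative intensity A_t = ∫_0^t g(s)^2 ds (left Riemann sums)
dt = T / n
A = np.concatenate([[0.0], np.cumsum(g[:-1]**2 * dt)])

# left side: Monte Carlo over Z ~ Exp(1) of Y at tau = inf{t : A_t >= Z}
M = 200_000
Z = rng.exponential(size=M)
idx = np.searchsorted(A, Z)             # first grid index with A >= Z
idx = np.minimum(idx, n)                # tau ∧ T
lhs = Y[idx].mean()

# right side: Y_0 + ∫_0^T exp(-A_t) dY_t
rhs = Y[0] + np.sum(np.exp(-A[:-1]) * np.diff(Y))
assert abs(lhs - rhs) < 1e-2
```

On the discrete grid the two sides agree exactly in expectation, so the only discrepancy is Monte Carlo noise.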
Of course, in addition to linear signature stopping rules such as (12), we could also use non-linear ones, i.e.,
$$\displaystyle \begin{aligned} {} V_0 = \sup_{\theta \in \Theta} \mathbb{E}\left[ Y_0 + \int_0^T \exp\left( - \int_0^t f_\theta(\widehat{\mathbb{X}}^{<\infty}_{0,s}) \mathrm{d}s \right) \mathrm{d}Y_t \right] \end{aligned} $$
(13)
for some suitable rich class of functionals \(f_\theta \) of the signature parameterized by \(\theta \in \Theta \). In fact, (13) was proved in [2, Prop. 7.4] for a class of neural network functions in the signature—\(\theta \) corresponding to the network weights.
Either way, several additional approximations are necessary in order to turn (12) or (13), respectively, into a fully implementable numerical method. Concentrating on (13) for concreteness, we obtain the following scheme:
1.
For a given truncation level \(I \in \mathbb {N}\) of the signature and a time grid \(0 = t_0 < \cdots < t_N = T\), simulate samples \(\widehat {\mathbb {X}}^{\le I, (m)}_{0,t_n}\), \(Y_{t_n}^{(m)}\), \(n=1, \ldots , N\), \(m=1, \ldots , M\), possibly based on a discretization of the iterated integrals forming the signature.
 
2.
Using a suitable optimization algorithm, solve the optimization problem
$$\displaystyle \begin{aligned} \widehat{\theta} \in \operatorname*{\mbox{arg max}}_{\theta \in \Theta} \frac{1}{M} \sum_{m=1}^{M} \left( Y_0^{(m)} + \sum_{n=0}^{N-1} \exp\left( - \sum_{j=0}^{n-1} f_{\theta}\big(\widehat{\mathbb{X}}^{\le I, (m)}_{0,t_j}\big) (t_{j+1} - t_j) \right) \left( Y^{(m)}_{t_{n+1}} - Y^{(m)}_{t_n} \right) \right). \end{aligned} $$
 
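As a toy instance of the two-step scheme above, the following sketch optimizes a randomized stopping rule for a put-type reward on Brownian motion, using only level-1 signature features \([1, t, B_t]\), intensity \(f = \langle \ell, \cdot\rangle^2\), and a crude random search in place of the higher-order optimizers that the smoothness of the objective permits. The reward, the feature level, and the search are all illustrative choices, not taken from the cited references.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, M = 1.0, 50, 4000
dt = T / N
t = np.linspace(0.0, T, N + 1)

# simulate Brownian paths and a toy reward Y_t = (1.1 - exp(B_t))^+
dB = rng.standard_normal((M, N)) * np.sqrt(dt)
B = np.concatenate([np.zeros((M, 1)), dB.cumsum(axis=1)], axis=1)
Y = np.maximum(1.1 - np.exp(B), 0.0)

# level <= 1 signature features of the time-augmented path: [1, t, B_t]
feat = np.stack([np.ones_like(B), np.broadcast_to(t, B.shape), B], axis=-1)

def objective(l):
    """Discretized smoothed objective with intensity f = <l, features>^2."""
    f = (feat[:, :-1, :] @ l) ** 2                  # (M, N)
    surv = np.exp(-np.cumsum(f * dt, axis=1))       # exp(-sum_{j<=n} f_j dt)
    surv = np.concatenate([np.ones((M, 1)), surv[:, :-1]], axis=1)
    return float(np.mean(Y[:, 0] + np.sum(surv * np.diff(Y, axis=1), axis=1)))

never_stop = objective(np.zeros(3))                 # l = 0 reproduces E[Y_T]
best_v = never_stop
for _ in range(300):
    best_v = max(best_v, objective(rng.standard_normal(3) * 2.0))

assert best_v >= never_stop
# along each path the smoothed value is a convex combination of Y_{t_0..t_N}
assert best_v <= float(np.mean(Y.max(axis=1))) + 1e-9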
Remark 3.1
The stopping time \(\widehat {\tau }\) corresponding to the estimated parameter \(\widehat {\theta }\) (or, alternatively, \(\widehat {\ell }\) in the linear case) is in general sub-optimal, and we have
$$\displaystyle \begin{aligned} \mathbb{E}[Y_{\widehat{\tau}}] \le V_0 = \mathbb{E}[Y_{\tau^\ast \wedge T}]. \end{aligned}$$
Hence, a low-biased estimator for the optimal value can be found by generating new, independent samples \(\widehat {\mathbb {X}}^{\le I, (m)}_{0,t_n}\), \(Y_{t_n}^{(m)}\), \(n=1, \ldots , N\), \(m=M+1, \ldots , 2M\), say, and computing
$$\displaystyle \begin{aligned} \widehat{V}_0 = \frac{1}{M} \sum_{m=M+1}^{2M} Y^{(m)}_{\widehat{\tau}^{(m)}}, \end{aligned}$$
noting that \(\widehat {\tau }^{(m)}\) depends on \(\widehat {\mathbb {X}}^{\le I, (m)}_{0,t_n}\), \(Y_{t_n}^{(m)}\), \(n=1, \ldots , N\), \(m=M+1, \ldots , 2M\). The sub-optimality of the stopping time \(\widehat {\tau }\) implies that \(\widehat {V}_0\) is biased low, i.e., that \(\mathbb {E}\left [ \widehat {V}_0 \right ] \le V_0\). If we manage to also construct a high-biased estimator (see the next subsection), then this gives a powerful method to estimate errors: If the gap between low- and high-biased estimators is small, the accuracy is high.

3.2 The Signature Longstaff–Schwartz Algorithm

The celebrated Longstaff–Schwartz algorithm [13] is based on the dynamic programming principle (2). We assume that we are given a Bermudan option, which allows exercising at times \(0 = t_0 < t_1 < \cdots < t_N = T\). Recall that the optimal value function satisfies the recursive formula
$$\displaystyle \begin{aligned} V_{t_n} = \max\left( Y_{t_n}, \, C_{t_n} \right), \quad 0 \le n < N, \quad C_{t_n} := \mathbb{E}\left[ V_{t_{n+1}} \mid \mathcal{F}_{t_n} \right] \end{aligned}$$
with the terminal value \(V_{T} = Y_T\). In particular, any approximation \(\widehat {C}_{t_n} \approx C_{t_n}\), \(n=0, \ldots , N-1\), of the continuation value \(C_{t_n}\) defines an approximation \(\widehat {V}_{t_n}\) of the value \(V_{t_n}\) by plugging it into the recursion. However, in practice, \(\widehat {V}\) can produce unreasonable values, especially for extreme paths. A more stable approximation can be found by recursively constructing stopping times \(\tau _n\) taking values in \(\{t_n, \ldots , t_N\}\)—which should be thought of as (approximate) solutions to the optimal stopping problems started at time \(t=t_n\). For this, we start with \(\tau _N := t_N\). Having constructed stopping times \(\tau _{n+1}, \ldots , \tau _N\), we then define
$$\displaystyle \begin{aligned} {} \tau_n := \begin{cases} t_n, & Y_{t_n} \ge \mathbb{E}\left[ Y_{\tau_{n+1}} \mid \mathcal{F}_{t_n} \right],\\ \tau_{n+1}, & \text{else}. \end{cases} \end{aligned} $$
(14)
Note that, by construction, \(\tau _n\) takes values in \(\{t_n, \ldots , t_N\}\). Moreover, by the dynamic programming principle, we have \(\mathbb {E}\left [Y_{\tau _n} \mid \mathcal {F}_{t_n} \right ] = V_{t_n}\) a.s., i.e., (14) is an equivalent formulation of (2). Given any algorithm computing approximations \(\widehat {\mathbb {E}}_{t_n}\) of the conditional expectation operator \(\mathbb {E}[\cdot \mid \mathcal {F}_{t_n}]\), we obtain approximations of the optimal stopping times \(\tau _n\) by setting \(\widehat {\tau }_N := t_N\) and
$$\displaystyle \begin{aligned} {} \widehat{\tau}_n := \begin{cases} t_n, & Y_{t_n} \ge \widehat{\mathbb{E}}_{t_n}\left[ Y_{\widehat{\tau}_{n+1}} \right],\\ \widehat{\tau}_{n+1}, & \text{else}. \end{cases} \end{aligned} $$
(15)
If the approximate conditional expectation \(\widehat {\mathbb {E}}_{t_n}\) is always \(\mathcal {F}_{t_n}\)-measurable, then \(\widehat {\tau }_n\) is a proper stopping time, \(n=0, \ldots , N\).
Now we are only left with the task of finding approximations to the conditional expectation operator. In the Markovian situation, Longstaff and Schwartz [13] suggest using least squares Monte Carlo. By the Markov property, we know that the conditional expectation satisfies \(\mathbb {E}\left [Z \mid \mathcal {F}_{t_n} \right ] = f_{t_n}(X_{t_n})\) for some measurable function \(f_{t_n}\). In fact, by the very definition of the conditional expectation (for r.v.s with finite second moment), it solves the minimization problem
$$\displaystyle \begin{aligned} f_{t_n} \in \operatorname*{\mbox{arg min}}_{g \text{ measurable}} \mathbb{E}\left[ \left( Z - g(X_{t_n}) \right)^2 \right]. \end{aligned}$$
This is of the form of a regression problem. Of course, the space of measurable functions is too big to handle properly, so the standard approach is to work in a finite-dimensional subspace instead. One fixes K basis functions \(\phi _1, \ldots , \phi _K\) and makes the ansatz that \(\mathbb {E}\left [Z \mid \mathcal {F}_{t_n} \right ]\) can be written as a linear combination of \(\phi _1(X_{t_n}), \ldots , \phi _K(X_{t_n})\). Then, the coefficients of the linear combination can be found by solving the least squares problem
$$\displaystyle \begin{aligned} \mathbb{E}\left[ \left( Z - \sum_{k=1}^{K} c_k \phi_k(X_{t_n}) \right)^2 \right] \xrightarrow{c_1, \ldots, c_K} \min! \end{aligned} $$
(16)
Finally, given M independent samples \(Z^{(m)}\), \(X_{t_n}^{(m)}\), \(m=1, \ldots , M\), from (the joint distribution of) Z and \(X_{t_n}\), we define
$$\displaystyle \begin{aligned} {} \widehat{\mathbb{E}}^M_{t_n}[Z] &:= \sum_{k=1}^K \widehat{c}_k \phi_k(X_{t_n}), \quad (\widehat{c}_1, \ldots, \widehat{c}_K)\\ &= \operatorname*{\mbox{arg min}}_{c_1, \ldots, c_K \in \mathbb{R}} \frac{1}{M} \sum_{m=1}^M \left( Z^{(m)} - \sum_{k=1}^K c_k \phi_k(X^{(m)}_{t_n}) \right)^2. \end{aligned} $$
(17)
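For reference, the recursion (15) combined with the regression (17) can be sketched as a compact NumPy implementation of the classical Markovian Longstaff–Schwartz algorithm for a Bermudan put under geometric Brownian motion. The parameter values and the cubic polynomial basis are illustrative choices of ours, not taken from [13].

```python
import numpy as np

rng = np.random.default_rng(4)
S0, K, r, sigma, T, N, M = 1.0, 1.1, 0.05, 0.3, 1.0, 10, 20_000
dt = T / N
disc = np.exp(-r * dt)

# GBM paths of the underlying at the exercise dates t_0, ..., t_N
z = rng.standard_normal((M, N))
S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt
                          + sigma * np.sqrt(dt) * z, axis=1))
S = np.concatenate([np.full((M, 1), S0), S], axis=1)
payoff = np.maximum(K - S, 0.0)

# backward induction: regress discounted Y_{tau_{n+1}} on a basis in S_{t_n}
cash = payoff[:, -1]                      # cash flow along tau_N = T
for n in range(N - 1, 0, -1):
    cash = cash * disc                    # discount to time t_n
    itm = payoff[:, n] > 0                # regress on in-the-money paths only
    basis = np.vander(S[itm, n], 4)       # polynomial basis 1, x, x^2, x^3
    coef, *_ = np.linalg.lstsq(basis, cash[itm], rcond=None)
    cont = basis @ coef                   # estimated continuation value
    exercise = payoff[itm, n] >= cont
    idx = np.where(itm)[0][exercise]
    cash[idx] = payoff[idx, n]
price = float(max(payoff[0, 0], np.mean(cash) * disc))

assert 0.0 < price < K
euro = float(np.mean(payoff[:, -1]) * np.exp(-r * T))
assert price >= euro - 0.01               # Bermudan should not be worth less
```

Restricting the regression to in-the-money paths is the stabilization suggested in [13]; the exercise decision only matters where the payoff is positive.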
Remark 3.2
We have not yet touched upon the issue of finding good basis functions \(\phi _1, \ldots , \phi _K\). A popular choice is to take polynomials up to some (mixed) degree together with some polynomials of the payoff function itself; we refer to [12] for more details.
If X is not a Markov process, then there is, in general, no function \(f_{t_n}\) s.t. \(\mathbb {E}\left [Z \mid \mathcal {F}_{t_n} \right ] = f_{t_n}(X_{t_n})\). Instead, the conditional expectation is a function of the whole path, \(\mathbb {E}\left [Z \mid \mathcal {F}_{t_n} \right ] = f_{t_n}(X|_{[0,t_n]})\). Unfortunately, the space of functions on paths \(X|_{[0,t_n]}\) is much larger than the space of functions on \(X_{t_n}\), so that the task of finding efficient basis functions \(\phi _1, \ldots , \phi _K\) on path-space might seem hopeless at first glance. However, as we know now, linear functions of the (truncated, normalized) signature do an excellent job for this problem.
This motivates the following approach put forth in [3]: Use (15) together with
$$\displaystyle \begin{aligned} {} \widehat{\mathbb{E}}^{I,M}_{t_n}[Z] := \langle \ell^{I,M}, \, \widehat{\mathbb{X}}^{\le I}_{0,t_n} \rangle, \quad \ell^{I,M} := \operatorname*{\mbox{arg min}}_{\ell} \frac{1}{M} \sum_{m=1}^M \left( Z^{(m)} - \langle \ell, \, \widehat{\mathbb{X}}^{\le I, (m)}_{0,t_n} \rangle \right)^2, \end{aligned} $$
(18)
where \(I \in \mathbb {N}\) is a truncation value, and \(\widehat {\mathbb {X}}^{\le I, (m)}_{0,t_n}\) denotes the truncated signature at level I associated with the sample \(X^{(m)}\), \(m=1, \ldots , M\).
Under some technical assumptions, [3, Prop. 3.3] shows convergence of the value associated with the stopping time \(\widehat {\tau }_0\) constructed as above as \(I,M \to \infty \).
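The signature variant (15) together with (18) can be sketched for a genuinely path-dependent reward, here the running-maximum drawdown of Brownian motion, which is not Markovian in B alone. Truncation level \(I = 2\), the absence of discounting, and all parameters are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(6)
T, N, M = 1.0, 10, 20_000
dt = T / N

dB = rng.standard_normal((M, N)) * np.sqrt(dt)
B = np.concatenate([np.zeros((M, 1)), dB.cumsum(axis=1)], axis=1)
# path-dependent reward: drawdown from the running maximum
Y = np.maximum(np.maximum.accumulate(B, axis=1) - B, 0.0)

# truncated signature (levels 0..2) of the time-augmented path (t, B_t),
# accumulated with Chen's identity: entries [1, t, B, and the 4 level-2 terms]
sig = np.zeros((M, N + 1, 7))
sig[:, 0, 0] = 1.0
s1 = np.zeros((M, 2))
s2 = np.zeros((M, 2, 2))
for n in range(N):
    dx = np.stack([np.full(M, dt), dB[:, n]], axis=-1)
    s2 += np.einsum('mi,mj->mij', s1, dx) + np.einsum('mi,mj->mij', dx, dx) / 2
    s1 += dx
    sig[:, n + 1, 0] = 1.0
    sig[:, n + 1, 1:3] = s1
    sig[:, n + 1, 3:] = s2.reshape(M, 4)

# Longstaff-Schwartz recursion with signature features as regression basis
cash = Y[:, -1]
for n in range(N - 1, 0, -1):
    coef, *_ = np.linalg.lstsq(sig[:, n, :], cash, rcond=None)
    cont = sig[:, n, :] @ coef            # estimated continuation value
    stop = Y[:, n] >= cont
    cash = np.where(stop, Y[:, n], cash)
value = float(np.mean(cash))              # Y_0 = 0, so no time-0 comparison

assert 0.0 < value < 2.0
```

Note that the least-squares solver tolerates the rank deficiency caused by the deterministic feature columns (constants and powers of t are identical across paths at a fixed exercise date).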
Note that the Longstaff–Schwartz algorithm produces a low-biased estimator in the sense of Remark 3.1.

3.3 Dual Martingale Methods

The optimal stopping problem has a dual problem, which comes in the form of a minimization problem over martingales, see [16]. Indeed, it turns out that
$$\displaystyle \begin{aligned} {} V_0 = \inf_{M \in \mathcal{M}_0} \mathbb{E} \left[ \sup_{t \in \mathcal{T}} \left( Y_t - M_t \right) \right], \end{aligned} $$
(19)
where \(\mathcal {M}_0\) denotes the space of martingales M with \(M_0 = 0\) and \(\mathbb {E}[ \sup _{t \in \mathcal {T}} M_t] < \infty \), and Y  needs to satisfy a uniform integrability condition. In fact, the proof of this remarkable result is both simple and enlightening, so that we will present it here.
Sketch of Proof of (19)
For concreteness, we give the proof in continuous time, \(\mathcal {T} = [0,T]\), and assume that Y  has continuous paths. In that case, the super-martingale V  has a Doob–Meyer decomposition
$$\displaystyle \begin{aligned} V_t = V_0 + M^\ast_t - A^\ast_t \end{aligned}$$
into a martingale \(M^\ast \in \mathcal {M}_0\) and a predictable increasing process \(A^\ast \), normalized such that \(M^\ast _0 = A^\ast _0 = 0\). For the proof, note that for any \(M \in \mathcal {M}_0\), we have
$$\displaystyle \begin{aligned} V_0 = \sup_{\tau \in \mathcal{S}_0} \mathbb{E}[Y_{\tau \wedge T}] = \sup_{\tau \in \mathcal{S}_0} \mathbb{E}[Y_{\tau \wedge T} - M_{\tau \wedge T}] \le \mathbb{E}\left[ \sup_{0 \le t \le T} \left(Y_t - M_t\right) \right]. \end{aligned}$$
For the converse inequality, note that \(Y \le V = V_0 + M^\ast - A^\ast \). Hence,
$$\displaystyle \begin{aligned}\displaystyle \inf_{M \in \mathcal{M}_0} \mathbb{E}\left[ \sup_{0 \le t \le T} \left( Y_t - M_t \right) \right] \le \mathbb{E}\left[ \sup_{0 \le t \le T} \left( Y_t - M^\ast_t \right) \right] \\\displaystyle \le \mathbb{E}\left[ \sup_{0 \le t \le T} \left( V_t - M^\ast_t \right) \right] = \mathbb{E}\left[ \sup_{0 \le t \le T} \left( V_0 - A^\ast_t \right) \right] = V_0, \end{aligned} $$
using that \(A^\ast \) is increasing. □
Note that the above proof never relied on the Markov property, and, indeed, (19) does not require it. However, similarly to Sect. 3.2, the Markov property can strongly facilitate the actual computation. Indeed, similar to the Longstaff–Schwartz algorithm, an obvious numerical approach for computing the value of the optimal stopping problem based on the dual formulation is to introduce a parametric family of martingales (say, \(M^\theta \), indexed by a finite-dimensional parameter space \(\Theta \)) and then solve the finite-dimensional minimization problem
$$\displaystyle \begin{aligned} {} V_0 \approx \inf_{\theta \in \Theta} \mathbb{E} \left[ \sup_{t \in \mathcal{T}} \left( Y_t - M^\theta_t \right) \right]. \end{aligned} $$
(20)
If the underlying state process X is a Markov process, we can actually describe the Doob–Meyer decomposition of the value process V . First, note that in the Markovian case, there is a function v s.t. \(V_t = v(t, X_t)\). Hence, we can directly obtain an optimal martingale \(M^\ast \) by applying Itô’s formula to the process \(V_t = v(t,X_t)\). Indeed, suppose for concreteness’ sake that X is a diffusion process (departing from our usual assumption of bounded variation), i.e.,
$$\displaystyle \begin{aligned} {} \mathrm{d} X_t = \mu(t, X_t) \mathrm{d}t + \sigma(t,X_t) \mathrm{d}W_t \end{aligned} $$
(21)
for a Brownian motion W. Then
$$\displaystyle \begin{aligned} M^\ast_t = \int_0^t \sigma(s,X_s) \partial _x v(s, X_s) \mathrm{d}W_s. \end{aligned}$$
Remark 3.3
The above equation has to be interpreted in the proper way. If both X and W are one-dimensional processes, then no ambiguity exists. If, say, both X and W are d-dimensional, then \(\mu \) and \(\sigma \) take values in \(\mathbb {R}^d\) and \(\mathbb {R}^{d \times d}\), respectively, \(\partial _x v(s, X_s)\) is a gradient, while \(M^\ast \) is still a scalar-valued martingale, i.e., the equation needs to be interpreted as
$$\displaystyle \begin{aligned} M^\ast_t = \int_0^t (\sigma(s,X_s) \partial _x v(s, X_s))^\top \mathrm{d}W_s. \end{aligned}$$
When we are trying to solve the optimal stopping problem, we obviously do not know v yet: it is, after all, the goal of our computation. But the above formula gives us an important piece of structural information: at least one optimal martingale is of the form
$$\displaystyle \begin{aligned} M^\ast_t = \int_0^t f(s, X_s) \mathrm{d}W_s \end{aligned}$$
for some function f. Hence, a prime candidate for a parametric family of martingales can be constructed by, once again, choosing basis functions \(\phi _k(t,x)\), and setting
$$\displaystyle \begin{aligned} M^\theta_t = \int_0^t \sum_{k=1}^K \theta_k \phi_k(s, X_s) \mathrm{d}W_s, \end{aligned}$$
i.e., we make a similar functional approximation ansatz as for the primal problem in Sect. 3.2 in the Markovian case.
In the non-Markovian case, assuming that our filtration \((\mathcal {F}_t)_{t \in [0,T]}\) is the filtration of a Brownian motion W, the martingale representation theorem implies that any martingale (including any optimal one) is of the form
$$\displaystyle \begin{aligned} M_t = M_0 + \int_0^t f_s \mathrm{d}W_s \end{aligned}$$
for some adapted process f. Hence, when looking for a good parametric family of martingales, we need to solve a functional approximation problem for adapted processes, i.e., functionals \(f_t = f(t, X|_{[0,t]})\). As in Sect. 3.2, a natural parameterization is given by (linear) functionals of the signature, i.e., we consider
$$\displaystyle \begin{aligned} {} M^\ell_t := \int_0^t \langle \ell,\, \widehat{\mathbb{X}}^{\le I}_{0,s} \rangle \mathrm{d}W_s = \langle \ell,\, \int_0^t \widehat{\mathbb{X}}^{\le I}_{0,s} \mathrm{d}W_s\rangle, \quad \ell \in \Theta := \prod_{i=0}^I (\mathbb{R}^{1+d})^{\otimes i}, \end{aligned} $$
(22)
for some given truncation level I.
In order to obtain an implementable method for (20) in conjunction with (22), we have to take care of more approximation steps:
1.
Replace the expectation \(\mathbb {E}[\cdot ]\) by a finite sample mean based on M samples \(Y_t^{(m)}\), \(\widehat {\mathbb {X}}^{\le I, (m)}_{0,t}\), \(W^{(m)}_t\);
 
2.
Discretize the problem based on N time steps for the \(\sup \) inside the expectation and J time steps for the computation of the signature as well as the stochastic integral in (22).
 
Let \(L^{I,J,(m)}\) denote the discretization of \(\int _0^t \widehat {\mathbb {X}}^{\le I, (m)}_{0,s} \mathrm {d}W^{(m)}_s\) along a uniform time-grid of \([0,T]\) of mesh \(T/J\). Then it can be shown (see [3, Prop. 3.8]) that under weak assumptions on the underlying process, we have \(V^{N,J,I,M} \to V_0\) as \(N,J,I, M \to \infty \), where
$$\displaystyle \begin{aligned} {} V^{N,J,I,M}_0 := \inf_{\ell \in \Theta} \frac{1}{M} \sum_{m=1}^M \max_{0 \le n \le N} \left( Y^{(m)}_{nT/N} - \langle \ell,\, L^{I,J,(m)}_{0,nT/N} \rangle \right). \end{aligned} $$
(23)
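The discretized dual objective can be sketched as follows. We use level-1 signature features as integrands against the driving Brownian increments, and a crude random search over \(\ell\); the reward and all parameters are illustrative choices of ours. The two final checks are deterministic consequences of \(M_0 = 0\): the pathwise maximum dominates the n = 0 term, and including \(\ell = 0\) in the search caps the estimate by the plain expected pathwise maximum.

```python
import numpy as np

rng = np.random.default_rng(5)
T, N, M = 1.0, 50, 4000
dt = T / N
t = np.linspace(0.0, T, N + 1)

dW = rng.standard_normal((M, N)) * np.sqrt(dt)
B = np.concatenate([np.zeros((M, 1)), dW.cumsum(axis=1)], axis=1)
Y = np.maximum(1.1 - np.exp(B), 0.0)    # toy reward process

# level <= 1 signature features of (t, B): [1, t, B_t], used as integrands
feat = np.stack([np.ones_like(B), np.broadcast_to(t, B.shape), B], axis=-1)

def dual_value(l):
    """Empirical dual objective for the martingale M_t = ∫ <l, features> dW."""
    integrand = feat[:, :-1, :] @ l                     # predictable, (M, N)
    mart = np.concatenate([np.zeros((M, 1)),
                           np.cumsum(integrand * dW, axis=1)], axis=1)
    return float(np.mean(np.max(Y - mart, axis=1)))

best = dual_value(np.zeros(3))          # l = 0 gives plain E[max_n Y_{t_n}]
for _ in range(300):
    best = min(best, dual_value(rng.standard_normal(3)))

# duality sanity checks on the same samples
assert best >= np.mean(Y[:, 0]) - 1e-12
assert best <= dual_value(np.zeros(3)) + 1e-12
```

Evaluating the integrand at the left endpoint of each increment keeps the discretized integral a martingale, which is what makes the resulting estimate high-biased.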

4 Martingale Duality in Details

Having provided a cursory look at three approximation methods for general optimal stopping problems based on the path-signature, we will now give an in-depth exposition for one of these methods, namely the dual martingale method from [3].
As explained above, the strategy is to approximate generic adapted processes by linear functionals of the signature. The universality results for the standard signature are typically formulated in terms of uniform convergence on compact subsets of (rough) path-space. However, as explained in chapter “The Signature Kernel” of this book, there is a robust signature obtained by normalizing the standard signature, which is, in fact, uniformly bounded w.r.t. the underlying paths and hence allows global universality—in the set of bounded continuous functions on (rough) path space—when considering the correct topology. Using this machinery, we can provide full convergence results (Theorem 4.1 below), and are not limited to local results—i.e., convergence “on compacts” or “with high probability”. In what follows, we consider a tensor normalization \(\lambda \) and the corresponding robust signature \(\lambda (\widehat {\mathbb {X}}^{<\infty }_{s,t})\) in the sense of chapter “The Signature Kernel”.
We start by presenting a key technical lemma required in the analysis, which is a consequence of Theorem 8.4 in chapter “The Signature Kernel”. Denote the linear functionals of the robust signature by
$$\displaystyle \begin{aligned} {} L_{\text{sig}}(\Lambda_T^p) := \left\{\Lambda_T^p \ni \mathbf{X}|_{[0,t]} \mapsto \left\langle \ell\,,\lambda(\widehat{\mathbb{X}}^{<\infty}_{0,t})\right\rangle \,\left|\, \ell \in \bigoplus_{i=0}^I (\mathbb{R}^{1+d})^{\otimes i}, \ I \ge 1 \right.\right\}, \end{aligned} $$
(24)
where \(\lambda \) denotes an appropriate signature normalization, see chapter “The Signature Kernel”, and we recall the definition of the space of stopped p-rough paths, see Definition 2.12.
The following result generalizes the universal approximation property of the signature in \(L^p\) w.r.t. some measure \(\mu \), extending Proposition 2.14. Note that measures \(\mu \) on \(\widehat {\Omega }^p_T\) can be extended to measures on the space of stopped rough paths \(\Lambda ^p_T\) by restriction. We will denote its extension by \(\mu \), as well. Conversely, every measure \(\mu \) on \(\Lambda ^p_T\) also induces a measure on \(\widehat {\Omega }^p_T\), which we will again denote by \(\mu \).
Theorem 4.1
Let \(\mu \) denote a finite measure on \(\Lambda _T^p\) which is supported on a compactly embedded subspace \(V \subset \widehat {\Omega }_T^p\). Then for any \(f \in L^p(\Lambda _T^p, \mu )\), \(1 \le p < \infty \), there is a sequence \(f_n \in L_{\mathit{\text{sig}}}(\Lambda _T^p)\) s.t.
$$\displaystyle \begin{aligned} \left\lVert f - f_n\right\rVert _{L^p} \xrightarrow{n \to \infty} 0. \end{aligned}$$
Remark 4.2
Many measures \(\mu \) encountered in practice are, in fact, naturally supported on compactly embedded subspaces of \(\widehat {\Omega }^p_T\). For instance, take \(\mu \) to be the (rough) Wiener measure, i.e., the distribution of Brownian motion on rough path space. By Example 2.15, we can choose any \(p>2\) and obtain a probability measure \(\mu \) on \(\widehat {\Omega }_T^p\). However, we could also have chosen \(2 < q < p\), and it turns out that \(\mu \) is supported by \(\widehat {\Omega }^q_T \subset \widehat {\Omega }^p_T\). This embedding is compact, see [11]. Hence, the law of Brownian motion satisfies the condition of Theorem 4.1, as do (by similar arguments) the laws of fractional Brownian motion (for any Hurst index) and diffusion processes.
Proof of Theorem 4.1
We follow the proof of [3, Th. 2.6].
By dominated convergence, we can first approximate f by a measurable function \(f^K\), bounded by some constant K, s.t. \(\left \lVert f - f^K\right \rVert _{L^p} \le \epsilon \). By abstract results from measure theory (Lusin’s theorem and Tietze’s extension theorem, cf. the proof of Proposition 2.6), we can further find \(\widetilde {f} \in C_b\left ( \Lambda _T^p; \mathbb {R} \right )\) s.t. \(\left \lVert f^K - \widetilde {f}\right \rVert _{L^p} \le \epsilon \). Hence, we can assume without loss of generality that our initial \(f \in C_b(\Lambda _T^p; \mathbb {R})\).
By Theorem 8.4 in chapter “The Signature Kernel”, \(L_{\text{sig}}(\Lambda _T^p) \subset C_b(\Lambda _T^p;\mathbb {R})\) is dense w.r.t. the strict topology. This means that there is a sequence \(f_n \in L_{\text{sig}}(\Lambda _T^p)\) s.t. for any \(\psi : \Lambda _T^p \to \mathbb {R}\) vanishing at infinity, we have that
$$\displaystyle \begin{aligned} \sup_{\mathbf{x} \in \Lambda_T^p} \psi(\mathbf{x}) \left| f(\mathbf{x}) - f_n(\mathbf{x}) \right| \xrightarrow{n \to \infty} 0. \end{aligned}$$
Suppose that we can find a function \(\psi \) vanishing at infinity such that \(\frac {1}{\psi } \in L^p(\mu )\). Then we can conclude that
$$\displaystyle \begin{aligned} \left\lVert f - f_n\right\rVert _{L^p}^p = \int_{\Lambda^p_T} \left( \frac{\psi(\mathbf{x}) \left| f(\mathbf{x}) - f_n(\mathbf{x}) \right|}{\psi(\mathbf{x})} \right)^p \mu(\mathrm{d}\mathbf{x}) \le \left( \sup_{\mathbf{x} \in \Lambda_T^p} \psi(\mathbf{x}) \left| f(\mathbf{x}) - f_n(\mathbf{x}) \right| \right)^p \int_{\Lambda^p_T} \frac{1}{\psi(\mathbf{x})^p} \, \mu(\mathrm{d}\mathbf{x}) \xrightarrow{n \to \infty} 0. \end{aligned}$$
Existence of such a function \(\psi \) follows from Lemma 4.4 with the normed vector space \(W = \widehat {\Omega }^p_T\), which concludes the argument. □
Lemma 4.3
Let \(\nu \) be a finite measure on \((\mathbb {R}_+, \mathcal {B}(\mathbb {R}_+))\). Then there is an increasing function \(\eta : \mathbb {R}_+ \to \mathbb {R}_+\) s.t. \(\lim _{x \to \infty } \eta (x) = \infty \) and \(\int _{\mathbb {R}_+} \eta (x) \nu (\mathrm {d}x) < \infty \).
The proof is left to the reader as an exercise; see also [3, Lemma 2.8]. The following lemma is a slight generalization of [3, Lemma 2.9].
Lemma 4.4
Let \(\mu \) be a finite measure on a normed vector space W which is supported on a compactly embedded normed vector space \(V \subset W\) with norm \(\left \lVert \cdot \right \rVert _V\). Let \(\eta \) be the function from Lemma 4.3 for the push-forward measure of \(\mu |_V\) under the map \(\left \lVert \cdot \right \rVert _V: V \to \mathbb {R}_+\). Then
$$\displaystyle \begin{aligned} \psi(x) := {\mathbf{1}}_{V}(x) \left( \frac{1}{1 + \eta(\left\lVert x\right\rVert _V)} \right)^{1/p} \end{aligned}$$
vanishes at infinity and satisfies \(\frac {1}{\psi } \in L^p(\mu )\).
Proof
Note that \(\frac {1}{\psi } \in L^p(\mu )\) follows immediately by construction, so we are left to prove that \(\psi \) vanishes at infinity. Since \(\eta \nearrow \infty \), for any \(\epsilon > 0\) there is an \(R > 0\) s.t. \(\left \lVert x\right \rVert _V > R\) implies that \(\left ( \frac {1}{1 + \eta (\left \lVert x\right \rVert _V)} \right )^{1/p} < \epsilon \). The claim follows by noting that \(\{x \in V | \left \lVert x\right \rVert _V \le R\}\) is compact as a subset of W. □
As a corollary of Theorem 4.1, we will see that all (square integrable) martingales can be approximated by integrals of linear functionals of the signature against the driving Brownian motion provided that the filtration is generated by Brownian motion. More concretely, we assume that X satisfies (21) driven by a d-dimensional Brownian motion W s.t. the diffusion matrix \(\sigma \) is invertible everywhere, implying that the filtrations generated by X and W coincide.
Lemma 4.5
Let M be a square integrable one-dimensional martingale with \(M_0 = 0\) w.r.t. the filtration \(\mathcal {F}\). Then there is a sequence \(f_n = (f_n^1, \ldots , f_n^d)\), \(f_n^i \in L_{\mathit{\text{sig}}}(\Lambda _T^p)\), s.t.
$$\displaystyle \begin{aligned} \mathbb{E}\left[ \sup_{0 \le t \le T} \left| M_t - M^n_t \right|^2 \right] \xrightarrow{n \to \infty} 0, \quad M^n_t := \sum_{i=1}^d \int_0^t f_n^i(\mathbf{X}|_{[0,s]}) \, \mathrm{d}W^i_s. \end{aligned}$$
Proof
By the martingale representation theorem, see, e.g., [19], there is an adapted, square integrable process \(\alpha _s\) s.t.
$$\displaystyle \begin{aligned} M_t = \sum_{i=1}^d \int_0^t \alpha^i_s \mathrm{d}W^i_s. \end{aligned}$$
By our usual interpretation of adapted processes as measurable functions on \(\Lambda _T^p\), we can find a measurable f s.t. \(\alpha _t = f(\mathbf {X}|_{[0,t]})\), for all \(t \in [0,T]\). By Theorem 4.1, for every \(i = 1, \ldots , d\), we can find a sequence of functions \(f^i_n \in L_{\text{sig}}(\Lambda _T^p)\) s.t.
$$\displaystyle \begin{aligned} \mathbb{E}\left[ \int_0^T \left( f^i(\mathbf{X}|_{[0,t]}) - f^i_n(\mathbf{X}|_{[0,t]}) \right)^2 \mathrm{d}t \right] \xrightarrow{n \to \infty} 0. \end{aligned}$$
Doob’s inequality together with the Itô isometry directly imply the result. □
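The mechanism behind Lemma 4.5 can be seen numerically in a toy case. The following sketch (our own illustration, with assumed names, not code from this chapter) takes the martingale \(M_t = W_t^2 - t\), whose representation has integrand \(\alpha_s = 2W_s\), and checks that the left-point (Euler) discretization of \(\int_0^t \alpha_s \, \mathrm{d}W_s\) approximates M in the \(\mathbb{E}[\sup_t |\cdot|^2]\) sense better as the integration grid is refined.

```python
import numpy as np

rng = np.random.default_rng(0)
T, fine, n_paths = 1.0, 512, 2000

# Simulate Brownian paths on a fine reference grid.
dW = rng.standard_normal((n_paths, fine)) * np.sqrt(T / fine)
W = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dW, axis=1)], axis=1)
t = np.linspace(0.0, T, fine + 1)
M = W**2 - t  # martingale M_t = W_t^2 - t, integrand alpha_s = 2 W_s

def sup_error(step):
    """Estimate E[sup_t |M_t - M^n_t|^2] for the Euler sum of 2 W_s dW_s
    on a coarse grid with the given step (in fine-grid units)."""
    Mn = np.zeros_like(W)
    for j in range(0, fine, step):
        # Freeze the integrand at the left endpoint of each coarse interval.
        incr = 2.0 * W[:, j:j + 1] * (W[:, j + 1:j + step + 1] - W[:, j:j + 1])
        Mn[:, j + 1:j + step + 1] = Mn[:, j:j + 1] + incr
    return np.mean(np.max((M - Mn)**2, axis=1))

err_coarse, err_fine = sup_error(64), sup_error(4)
print(err_coarse, err_fine)  # the error shrinks as the coarse grid is refined
assert err_fine < err_coarse
```

Doob's inequality and the Itô isometry, as used in the proof above, are exactly what turn the \(L^2\) error of the integrand into the uniform-in-time error estimated here.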
We can now apply this result to the dual martingale formulation of the optimal stopping problem, see (19). Indeed, Lemma 4.5 implies that for any \(M \in \mathcal {M}_0^2\)
$$\displaystyle \begin{aligned} \sup_{0 \le t \le T} \left( Y_t - M_t \right) = \lim_{n \to \infty} \sup_{0 \le t \le T} \left( Y_t - \sum_{i=1}^d \int_0^t f^i_n(\mathbf{X}|_{[0,s]}) \mathrm{d}W^i_s \right), \end{aligned}$$
where the limit is understood in the \(L^2\) sense. We conclude that
$$\displaystyle \begin{aligned} \mathbb{E}\left[ \sup_{0 \le t \le T} \left( Y_t - M_t \right) \right] = \lim_{n \to \infty} \mathbb{E}\left[ \sup_{0 \le t \le T} \left( Y_t - \sum_{i=1}^d \int_0^t f^i_n(\mathbf{X}|_{[0,s]}) \mathrm{d}W^i_s \right) \right]. \end{aligned}$$
A simple argument then yields
$$\displaystyle \begin{aligned} {} V_0 = \inf_{\ell^1, \ldots, \ell^d \in \bigoplus_{i=0}^\infty (\mathbb{R}^{1+d})^{\otimes i}} \mathbb{E}\left[ \sup_{0 \le t \le T} \left( Y_t - \sum_{i=1}^d \int_0^t \langle \ell^i, \, \lambda(\widehat{\mathbb{X}}^{<\infty}_{0,s}) \rangle \mathrm{d}W^i_s \right) \right]. \end{aligned} $$
(25)
We are now left with turning the duality formula (25) into a numerical algorithm. Several issues immediately come to mind: (25) contains several integrals (the stochastic integrals, but also the iterated integrals defining the signature itself) which require discretization, as does the \(\sup \) over the uncountable set of times \([0,T]\). The expectation needs to be approximated. Finally, we need to minimize over linear functionals of the signature.
We start with time discretization. Let us first introduce a grid \(0 = t_0 < t_1 < \cdots < t_N = T\), and restrict the \(\sup \) inside the expectation to a \(\max \) over the grid, i.e.,
$$\displaystyle \begin{aligned} {} V_0^N := \inf_{\ell^1, \ldots, \ell^d \in \bigoplus_{i=0}^\infty (\mathbb{R}^{1+d})^{\otimes i}} \mathbb{E}\left[ \max_{n=0, \ldots, N} \left( Y_{t_n} - \sum_{i=1}^d \int_0^{t_n} \langle \ell^i, \, \lambda(\widehat{\mathbb{X}}^{<\infty}_{0,s}) \rangle \mathrm{d}W^i_s \right) \right]. \end{aligned} $$
(26)
Formula (26) is the martingale duality formula (19) for the discrete-time stopping problem, provided that we still use the continuous-time filtration \((\mathcal {F}_{t_n})_{n=0}^N\), i.e., not the filtration generated by \((X_{t_n})_{n=0}^N\). Hence, the error \(\left \lvert V_0 - V^N_0\right \rvert \) only depends on the underlying processes \(W, X, Y\), but not on the method, and we shall assume that \(\left \lvert V_0 - V^N_0\right \rvert \to 0\) as \(N \to \infty \); see, for instance, [9].
Next, let us truncate the signature at level I, which turns an infinite dimensional minimization problem into a finite-dimensional one, i.e.,
$$\displaystyle \begin{aligned} {} V_0^{I,N} := \inf_{\ell^1, \ldots, \ell^d \in \bigoplus_{i=0}^I (\mathbb{R}^{1+d})^{\otimes i}} \mathbb{E}\left[ \max_{n=0, \ldots, N} \left( Y_{t_n} - \sum_{i=1}^d \int_0^{t_n} \langle \ell^i, \, \lambda(\widehat{\mathbb{X}}^{\le I}_{0,s}) \rangle \mathrm{d}W^i_s \right) \right]. \end{aligned} $$
(27)
It is not difficult to see that \(\lim _{I \to \infty } V^{I,N}_0 = V^N_0\): Indeed, note that \(V^{I,N}_0\) is increasing in I. Besides, for every \(\epsilon >0\) we can find \(I_\epsilon \) such that an \(\epsilon \)-optimal solution \((\ell ^1, \ldots , \ell ^d)\) is contained in \(\bigoplus _{i=0}^{I_\epsilon } (\mathbb {R}^{1+d})^{\otimes i}\). This, however, already implies convergence.
Note that the infimum in (27) is attained, i.e., we can find minimizers \(\ell ^{1,\ast }, \ldots , \ell ^{d,\ast } \in \bigoplus _{i=0}^I (\mathbb {R}^{1+d})^{\otimes i}\). Indeed, we note that the functional
$$\displaystyle \begin{aligned} \ell \mapsto \mathbb{E}\left[ \max_{n=0, \ldots, N} \left( Y_{t_n} - \sum_{i=1}^d \int_0^{t_n} \langle \ell^i, \, \lambda(\widehat{\mathbb{X}}^{\le I}_{0,s}) \rangle \mathrm{d}W^i_s \right) \right] \end{aligned}$$
is convex as an expectation of the pointwise maximum of affine (and, hence, convex) functions, see, e.g., [6, Section 3.2.1 together with Section 3.2.3]. The functional also grows to infinity as \(\left \lvert \ell ^i\right \rvert \to \infty \) and is continuous by dominated convergence (note the use of the robust signature). This implies the existence of minimizers.
If we now replace the exact truncated signature \(\widehat {\mathbb {X}}^{\le I}_{0,s}\) by a discretized version \(\widehat {\mathbb {X}}^{\le I,J}_{0,s}\) based on a grid \(0 = s_0 < \cdots < s_J = T\) (containing \(t_1, \ldots , t_N\)), think of \(J \gg N\), and similarly approximate the stochastic integrals in (27) by Riemann sums based on the same grid, we obtain
$$\displaystyle \begin{aligned} {} V_0^{I,N,J} := \inf_{\ell^1, \ldots, \ell^d \in \bigoplus_{i=0}^I (\mathbb{R}^{1+d})^{\otimes i}} \mathbb{E}\left[ \max_{n=0, \ldots, N} \left( Y_{t_n} - \sum_{i=1}^d \sum_{s_j < t_n} \langle \ell^i, \, \lambda(\widehat{\mathbb{X}}^{\le I,J}_{0,s_j}) \rangle \left( W^i_{s_{j+1}} - W^i_{s_j} \right) \right) \right]. \end{aligned} $$
(28)
A similar argument as above shows that \(\lim _{J \to \infty } V^{I,N,J}_0 = V^{I,N}_0\).
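The discretized signature \(\widehat{\mathbb{X}}^{\le I,J}\) of a piecewise linear interpolation can be computed exactly via Chen's relation: the signature of a linear segment is the tensor exponential of its increment, and signatures of concatenated segments multiply in the truncated tensor algebra. The following sketch (function names are our own, not those of the chapter's notebook) implements this directly with numpy tensors:

```python
import numpy as np
from math import factorial

def outer_power(v, k):
    """k-fold tensor (outer) power of a vector v; k = 0 gives the scalar 1."""
    out = np.array(1.0)
    for _ in range(k):
        out = np.multiply.outer(out, v)
    return out

def tensor_mul(a, b, depth, dim):
    """Product in the tensor algebra truncated at the given depth.
    Signatures are stored as lists [level 0, level 1, ..., level depth]."""
    out = [np.zeros((dim,) * k) for k in range(depth + 1)]
    for k in range(depth + 1):
        for i in range(k + 1):
            out[k] = out[k] + np.multiply.outer(a[i], b[k - i])
    return out

def segment_signature(delta, depth):
    """Signature of a linear segment: the tensor exponential of its increment."""
    return [outer_power(delta, k) / factorial(k) for k in range(depth + 1)]

def truncated_signature(path, depth):
    """Signature of a piecewise linear path (array of shape (J+1, D)),
    assembled segment by segment via Chen's relation."""
    dim = path.shape[1]
    sig = segment_signature(np.zeros(dim), depth)  # identity element exp(0)
    for j in range(len(path) - 1):
        sig = tensor_mul(sig, segment_signature(path[j + 1] - path[j], depth),
                         depth, dim)
    return sig

# Chen's relation: refining a straight line does not change its signature.
line = np.array([[0.0, 0.0], [1.0, 2.0]])
refined = np.array([[0.0, 0.0], [0.25, 0.5], [1.0, 2.0]])
s1, s2 = truncated_signature(line, 3), truncated_signature(refined, 3)
assert all(np.allclose(a, b) for a, b in zip(s1, s2))
```

In practice one would use an optimized signature library rather than this transparent but slow version; the cost grows like \((1+d)^I\) in the truncation level.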
This leaves the final approximation step, namely replacing the expectation by a finite-sample (Monte Carlo) average. We thus arrive at the optimization problem
$$\displaystyle \begin{aligned} \begin{array}{*{20}l} {} V_0^{I,N,J,M} := \inf_{\ell^1, \ldots, \ell^d \in \bigoplus_{i=0}^I (\mathbb{R}^{1+d})^{\otimes i}} \frac{1}{M} \sum_{m=1}^M \bigg[ \max_{n=0, \ldots, N} \bigg( Y^{(m)}_{t_n} \\ - \sum_{i=1}^d \sum_{s_j < t_n} \langle \ell^i, \, \lambda(\widehat{\mathbb{X}}^{\le I,J,(m)}_{0,s_j}) \rangle (W^{i,(m)}_{s_{j+1}} - W^{i,(m)}_{s_j}) \bigg) \bigg], \end{array} \end{aligned} $$
(29)
where \(Y^{(m)}\), \(\widehat {\mathbb {X}}^{\le I,J,(m)}_{0,s_j}\), \(W^{i,(m)}_{s_j}\) are i.i.d. samples from the respective random variables, \(m=1, \ldots , M\).
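As a minimal illustration of (29), the following sketch (all names and the tiny feature set are our own simplifications) solves the sample average problem for the toy case where \(Y = X = W\) is a one-dimensional Brownian motion, replacing the full truncated signature by the low-level features \((1, s, W_s)\) and minimizing the convex objective by subgradient descent. Since the integrand \(\equiv 1\) reproduces \(W\) itself as a martingale, the true value is \(0\), which the optimizer should approach from the initial value \(\frac{1}{M}\sum_m \max_n W^{(m)}_{t_n} > 0\).

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, M = 1.0, 50, 256
dt = T / N

dW = rng.standard_normal((M, N)) * np.sqrt(dt)
W = np.concatenate([np.zeros((M, 1)), np.cumsum(dW, axis=1)], axis=1)
t = np.linspace(0.0, T, N + 1)

# Stand-in for <ell, robust signature>: features psi_j = (1, s_j, W_{s_j}).
feats = np.stack([np.ones_like(W[:, :-1]),
                  np.broadcast_to(t[:-1], W[:, :-1].shape),
                  W[:, :-1]], axis=-1)                 # shape (M, N, 3)

def objective_and_subgrad(ell):
    integrand = feats @ ell                            # (M, N)
    Mart = np.concatenate([np.zeros((M, 1)),
                           np.cumsum(integrand * dW, axis=1)], axis=1)
    gaps = W - Mart                                    # Y_{t_n} - M_{t_n}
    nstar = np.argmax(gaps, axis=1)                    # argmax index per sample
    value = gaps[np.arange(M), nstar].mean()
    # Subgradient: -sum_{s_j < t_{n*}} psi_j dW_j, averaged over samples.
    mask = (np.arange(N)[None, :] < nstar[:, None]).astype(float)
    grad = -(feats * (mask * dW)[:, :, None]).sum(axis=1).mean(axis=0)
    return value, grad

ell = np.zeros(3)
v0, _ = objective_and_subgrad(ell)                     # dual bound at ell = 0
for it in range(400):
    _, g = objective_and_subgrad(ell)
    ell -= 0.5 / np.sqrt(it + 1.0) * g                 # diminishing step sizes
v_final, _ = objective_and_subgrad(ell)
assert 0.0 <= v_final < v0   # upper bound improves toward the true value 0
```

Note that the objective is always nonnegative here because the \(\max\) includes \(n=0\), where the gap vanishes; this mirrors the general fact that (29) is an upper bound for the (sampled) stopping value.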
The problem (29) is an example of a sample average approximation of a stochastic optimization problem. In order to streamline the discussion, let us denote all the sources of randomness in (28) by \(\xi = \left ( (Y_{t_n})_{n=0}^N, (\widehat {\mathbb {X}}^{\le I,J}_{0,s_j})_{j=0}^J, (W^i_{s_j})_{j=0, \ldots , J,\, i=1, \ldots , d} \right )\). Then we have
$$\displaystyle \begin{aligned} V_0^{I,N,J} = \inf_{\ell \in \left( \bigoplus_{i=0}^I (\mathbb{R}^{1+d})^{\otimes i} \right)^d} \mathbb{E}\left[ F(\ell, \xi) \right], \end{aligned}$$
for a function F implicitly defined by (28), whereas (29) is its sample average approximation
$$\displaystyle \begin{aligned} V_0^{I,N,J,M} = \inf_{\ell \in \left( \bigoplus_{i=0}^I (\mathbb{R}^{1+d})^{\otimes i} \right)^d} \frac{1}{M} \sum_{m=1}^M F(\ell, \xi^{(m)}). \end{aligned}$$
The convergence theory for such approximations is well understood. We cite a (slightly modified) result from [17, Chapter 6, Theorem 4].
Lemma 4.6
Let \(F:\mathbb {R}^K \times \mathbb {R}^L \to \mathbb {R}\) be a measurable function s.t. \(\ell \mapsto F(\ell ,\xi )\) is lower semi-continuous for every \(\xi \) and \(\ell \mapsto \mathbb {E}[F(\ell , \xi )]\) is also lower semi-continuous. Additionally assume that \(\ell \mapsto F(\ell ,\xi )\) is convex for almost every \(\xi \) and that the set of minimizers of \(\min _{\ell } \mathbb {E}[F(\ell , \xi )]\) is non-empty and bounded. Then
$$\displaystyle \begin{aligned} \min_{\ell} \frac{1}{M} \sum_{m=1}^M F(\ell, \xi^{(m)}) \xrightarrow{M \to \infty} \min_{\ell} \mathbb{E}\left[ F(\ell, \xi) \right] \quad \text{a.s.} \end{aligned}$$
Since all the conditions of Lemma 4.6 are easily verified for (28), we can conclude that \(\lim _{M\to \infty } V_0^{I,N,J,M} = V_0^{I,N,J}\) a.s. Hence, we obtain (as claimed earlier in Sect. 3.3)
$$\displaystyle \begin{aligned} \lim_{N \to \infty} \lim_{I\to\infty} \lim_{J \to \infty} \lim_{M \to \infty} V_0^{I,N,J,M} = V_0 \text{ a.s.}, \end{aligned}$$
with the caveat that some restrictions between the parameters exist—i.e., we can only choose \(J \ge N\).

5 Numerical Example

The aim of this section is to illustrate the performance of the signature stopping methodology in numerically approximating optimal stopping values. We therefore investigate the optimal stopping of fractional Brownian motions. This choice allows us to explore a wide range of underlying path regularities, including scenarios involving non-Markovianity, all within a unified model.
Code Notebook
The complete and documented code for this numerical example can be found in a Jupyter Notebook in the following GitHub repository:
This notebook explains how to implement the algorithms from Sects. 3.1 and 3.2 in Python. It is complemented with several exercises designed to deepen the understanding of signatures and their usage for parametrizing stopping times.
A one-dimensional fractional Brownian motion is a zero-mean Gaussian process X with a covariance function described by
$$\displaystyle \begin{aligned} {} \mathbb{E}( X_s X_t ) = \frac{1}{2}\left( |s|^{2H} + |t|^{2H} - |t-s|^{2H}\right), \quad 0 \le s \le t, \end{aligned} $$
(30)
where \(H \in (0,1]\) is referred to as the Hurst parameter. The sample paths of X are almost surely \(\alpha \)-Hölder continuous for any \(\alpha < H\). When \(H=\frac {1}{2}\), X represents a standard Brownian motion. However, when \(H \neq \frac {1}{2}\), X is neither a Markov process nor a semimartingale. Figure 1 illustrates the regularity of fractional Brownian sample paths for different Hurst parameters. As a one-dimensional process, X can be trivially lifted to a geometric rough path \(\mathbb {X}\in \Omega ^p_T(\mathbb {R}^d)\) for any \(p > \frac {1}{H}\), defined as \(\mathbb {X} = \pi _{\le \lfloor p \rfloor }\exp _\otimes (X_{s,t})\). In the particular example we consider, the stopped process and the underlying process are identical, i.e., \(Y = X\) is the fractional Brownian motion itself. For all following experiments we fix a unit time horizon \(T = 1.0\).
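For the experiments, sample paths of fractional Brownian motion can be generated exactly on a finite grid by a Cholesky factorization of the covariance matrix (30). Faster circulant-embedding methods exist; this sketch (with our own function names) favors transparency:

```python
import numpy as np

def fbm_covariance(times, H):
    """Covariance matrix (30) of fractional Brownian motion at the given times."""
    s, t = times[:, None], times[None, :]
    return 0.5 * (np.abs(s)**(2 * H) + np.abs(t)**(2 * H) - np.abs(s - t)**(2 * H))

def sample_fbm(n_paths, n_steps, T, H, rng):
    """Exact samples on a uniform grid via Cholesky factorization; X_0 = 0."""
    times = np.linspace(0.0, T, n_steps + 1)[1:]   # drop t = 0 (degenerate row)
    L = np.linalg.cholesky(fbm_covariance(times, H))
    X = L @ rng.standard_normal((n_steps, n_paths))
    return np.vstack([np.zeros(n_paths), X])       # shape (n_steps + 1, n_paths)

# Sanity check: for H = 1/2 the covariance (30) reduces to min(s, t),
# i.e., the covariance of standard Brownian motion.
grid = np.linspace(0.0, 1.0, 11)[1:]
assert np.allclose(fbm_covariance(grid, 0.5),
                   np.minimum(grid[:, None], grid[None, :]))
```

The marginal variance of the samples should match \(\mathrm{Var}(X_t) = t^{2H}\), which provides a simple statistical test of the implementation. The Cholesky approach costs \(O(J^3)\) in the grid size, so for the finest grids used below one would switch to an \(O(J \log J)\) circulant-embedding sampler.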
Fig. 1
Illustration of fractional Brownian motion sample paths for different Hurst parameters H, which constitute strict upper bounds on the paths’ Hölder regularity. The case \(H=0.5\) corresponds to a standard Brownian motion
In Table 1, we present estimates of the optimal stopping values corresponding to various values of H. These estimates were obtained using the signature stopping time parametrizations introduced in Sect. 3.1. Specifically, we assess the performance of both linear and deep signature stopping rules, employing different signature truncation levels for each. The architecture of the deep neural networks consists of 2 hidden layers with 30 neurons each, using ReLU activation functions. The observations indicate a significant improvement when employing higher truncation levels and transitioning from linear to deep signature strategies. Note that, as a consequence of Doob’s optional sampling theorem, the optimal stopping value for standard Brownian motion is 0. Therefore, the values obtained for \(H=0.5\) are consistent with this theoretical benchmark.
Table 1
Lower-biased approximated values to the optimal stopping problem of fractional Brownian motion using linear- and deep signature stopping rules. Evaluations were performed for several Hurst parameters H and truncation levels I using a time discretization of \(N=1000\) steps and around \(2^{20}\) training and testing samples. The maximal Monte Carlo error was 0.001
 
         Linear                                        Deep
H        I=1     I=2     I=3     I=4     I=5           I=1     I=2     I=3     I=4     I=5
0.1      0.350   0.625   0.840   0.995   1.094         1.247   1.346   1.352   1.358   1.359
0.3      0.131   0.213   0.266   0.297   0.307         0.392   0.418   0.417   0.422   0.418
0.5     -0.002   0.000   0.000  -0.001   0.000         0.002   0.000  -0.001   0.000   0.000
0.7      0.091   0.126   0.146   0.155   0.152         0.194   0.200   0.206   0.206   0.205
0.9      0.165   0.219   0.248   0.262   0.258         0.328   0.333   0.336   0.336   0.335
Figure 2 illustrates the estimates of optimal stopping values for a range of Hurst parameters using time discretizations with \( N = 100, 1000, 10{,}000 \) steps. These values were obtained using a deep signature stopping rule with a fixed signature truncation level of \( I=3 \). For comparison, we have also included the benchmark values from [4]. In their work, the authors translated the problem into a Markovian setting by lifting the discretized fractional Brownian motion to a high-dimensional process, which includes its running history. We observe that when using the same time discretization (\( N=100 \)), our signature methods yield comparable results to [4]. It is worth noting, however, that for the signature methodology, increasing the number of time steps does not lead to an increase in the number of model parameters, as the input dimension of the network remains equal to the dimension of the truncated signatures. This contrasts starkly with the lifting approach used in [4]. Particularly for small Hurst parameters, we observe a significant improvement gained from choosing a finer discretization grid. This observation aligns with the theoretical insight from [2] stating that the optimal stopping value of fractional Brownian motion diverges as the Hurst parameter \( H \) approaches zero.
Fig. 2
An illustration of the estimated values (vertical axis) for the optimal stopping problem of fractional Brownian motion across a range of Hurst parameters (horizontal axis). These estimates were obtained using a deep signature stopping rule with a fixed signature truncation level of \(I=3\) on time discretization grids with varying numbers of steps (\(N=100, 1000, 10{,}000\)). The values labeled by “BCJ” are those obtained in [4]
Let us now focus on the case of a Bermudan option with \(N=100\) equidistant exercise opportunities \(0 = t_0 < t_1 < \dots < t_N = T\), where the payoff process \(Y = X\) is again a fractional Brownian motion. In this case, we can utilize the Longstaff-Schwartz method described in Sect. 3.2 and the dual martingale method described in Sect. 3.3 to compute lower and upper bounds. Note that, apart from the signature truncation level I, we have an additional parameter to consider: the number of time grid points, denoted as J, used to approximate the signature of the fractional Brownian motion. In Table 2, we present the results obtained for the choices \(J=100=N\) and \(J=500>N\) for various Hurst parameters. The values from [4] were also obtained on a grid with \(N=100\) points, providing comparable benchmarks for the discrete problem. We observe that the lower bounds obtained with the signature Longstaff-Schwartz method are comparable to those in [4] for \(J=100\), while for \(J=500\), we see a significant improvement. To obtain comparable upper bounds, it appears that choosing a finer grid for signature computation is necessary. We emphasize that even for discrete-time stopping problems with an underlying filtration generated by a non-Markovian continuous-time process, the ability to enhance results by increasing J, thereby incorporating more information about the underlying path without increasing the number of model parameters, is an inherent feature of the signature methodology. There seems to be no evident extension of methods such as those in [4] to incorporate such a feature.
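The backward-induction core of the Longstaff–Schwartz lower bound can be sketched in a few lines once path features are available. The following toy version (our own names; simple polynomial features of the current state stand in for signature features, and \(Y = X = W\) is standard Brownian motion, so the true value is 0 by optional sampling) regresses continuation values on the training set and evaluates the learned policy on fresh paths:

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, M = 1.0, 50, 4000
dt = T / N
dW = rng.standard_normal((2 * M, N)) * np.sqrt(dt)
W = np.concatenate([np.zeros((2 * M, 1)), np.cumsum(dW, axis=1)], axis=1)
train, test = W[:M], W[M:]

def features(x):
    # Toy stand-in for signature features: polynomials of the current state.
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

# Backward induction on the training set: regress the value of continuing
# on features at the current exercise date.
coeffs = [None] * N
value = train[:, N].copy()
for n in range(N - 1, 0, -1):
    A = features(train[:, n])
    coeffs[n], *_ = np.linalg.lstsq(A, value, rcond=None)
    cont = A @ coeffs[n]
    value = np.where(train[:, n] >= cont, train[:, n], value)

# Lower-biased estimate: run the learned policy forward on fresh test paths.
payoff = test[:, N].copy()
stopped = np.zeros(M, dtype=bool)
for n in range(1, N):
    cont = features(test[:, n]) @ coeffs[n]
    ex = (~stopped) & (test[:, n] >= cont)
    payoff[ex] = test[ex, n]
    stopped |= ex
estimate = payoff.mean()
print(estimate)  # should be close to 0, up to Monte Carlo error
```

For the fractional case \(H \neq \frac{1}{2}\), the current state is no longer a sufficient statistic, and the features must be replaced by (truncated) signatures of the time-augmented path, which is precisely the modification studied in this section.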
Table 2
Lower and upper biased approximations to the discrete-time optimal stopping of a fractional Brownian motion with \(N=100\) time points using the Signature Longstaff-Schwartz and dual martingale methods. These estimates were obtained with a signature truncation level of \(I=6\) and different choices for the time grid size J used for the signature approximation. In the right-most column, we provide values from [4] for comparison
 
         J = 100              J = 500              BCJ
H        Lower    Upper       Lower    Upper       Lower    Upper
0.1      1.045    1.129       1.065    1.117       1.048    1.05
0.3      0.363    0.396       0.371    0.392       0.368    0.371
0.5     -0.001    0          -0.002    0           0        0.005
0.7      0.203    0.234       0.205    0.220       0.206    0.208
0.9      0.331    0.357       0.337    0.356       0.335    0.339

6 Conclusions and Outlook

Signatures can serve as a natural set of basis functions for the numerical approximation of functions on path space, analogous to polynomials on finite-dimensional spaces. In this chapter, we show how this idea can be used in the numerical treatment of optimal stopping problems when the underlying stochastic process is not Markovian. More precisely, we derive non-Markovian extensions of some classical algorithms, namely the Longstaff–Schwartz algorithm and Rogers’ dual algorithm. In addition, we present a direct algorithm based on the parameterization of a rich class of stopping times defined in terms of the signature.
Signature-based methods for optimal stopping thus provide a rich and efficient class of algorithms, as also supported by our numerical examples. On the other hand, they can be challenging to implement, due to the large number of terms present in the signature, as well as the cost of computing the signature itself from time series of the underlying process.
There are many open problems and largely unresolved questions in the context of the signature method for optimal stopping and, more generally, stochastic optimal control. Possibly the most prominent problem is to establish quantitative error bounds and rates of convergence for the proposed algorithms and other algorithms based on the signature. Note that the methods presented here require several numerical approximations (truncation of the signature, finite sample, discretization of the iterated integrals constituting the signature), and we would like to obtain sharp error estimates and rates for the total error in terms of these numerical parameters. Note that these approximations interact with each other, e.g., higher degree of truncation may lead to larger discretization errors or variances. Functional Taylor expansions as presented in chapter “Signature and the Functional Taylor Expansion” could serve as important tools for analyzing the approximation errors.
From a more applied point of view, it would be highly desirable to improve the performance of the signature methods if possible. To this end, dimension reduction techniques such as randomized signatures, see, for instance, [7], can be useful. Moreover, detailed case studies for important applications, such as pricing American options in rough volatility models, with an emphasis on obtaining tight upper and lower bounds would be desirable.

Acknowledgements

C.B. gratefully acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy (EXC-2046/1, project ID 390685689), as well as support from DFG CRC/TRR 388 “Rough Analysis, Stochastic Dynamics and Related Fields”, Project B03.
Appendix

Proof of Proposition 2.2

Fix \(t \in [0,T]\). Since \(U_t\) is \(\mathcal {F}_t = \sigma (X_s \,|\, s \in [0,t])\)-measurable, there is a measurable function \(\theta _t \colon \mathcal {C}_t \to \mathbb {R}\) such that
$$\displaystyle \begin{aligned} U_t(\omega) = \theta_t(X(\omega)|_{[0,t]}) \end{aligned} $$
for every \(\omega \in \Omega \). Define \(\phi \colon \mathscr {C}_T \to [0,T] \times \mathcal {C}_T\) as follows: If \(X \in \mathcal {C}_t\), define \(\tilde {X} \in \mathcal {C}_T\) by setting \(\tilde {X}_s := X_{s \wedge t}\) and we set \(\phi (X) = (t,\tilde {X})\). Since \(\phi \) is continuous, it is also measurable. Define \(f \colon [0,T] \times \mathcal {C}_T \to \mathbb {R}\) by \(f(t,X) = \theta _t(X|_{[0,t]})\). For fixed t, \(X \mapsto X_{[0,t]}\) as a map from \(\mathcal {C}_T\) to \(\mathcal {C}_t\) is continuous. Since \(\theta _t\) is measurable, \(X \mapsto f(t,X)\) is measurable. For \(n \in \mathbb {N}\), define \(I^n_k := [k/2^n T, (k+1)/2^n T)\) for \(k = 0, \ldots , 2^n - 2\), \(I^n_{2^n -1} := [(2^n -1)/2^n T, T]\) and \(t^n_k := k/2^n T\). Set
$$\displaystyle \begin{aligned} f_n(t,X) := \sum_{k = 0}^{2^n -1} f(t^n_k,X) \mathbb{1}_{I^n_k}(t) \end{aligned} $$
which is (jointly) measurable for every \(n \in \mathbb {N}\). Set
$$\displaystyle \begin{aligned} \tilde{f}(t,X) := \limsup_{m \to \infty} \limsup_{n \to \infty} f_n(t+1/m,X) \quad \text{and} \quad \theta(X|_{[0,t]}) := (\tilde{f} \circ \phi)(X|_{[0,t]}). \end{aligned} $$
The map \(\theta \) is thus measurable and satisfies
$$\displaystyle \begin{aligned} \theta(X(\omega)|_{[0,t]}) = \limsup_{m \to \infty} \limsup_{n \to \infty} \sum_{k = 0}^{2^n -1} U_{t^n_k}(\omega) \mathbb{1}_{I^n_k}(t + 1/m) = U_t(\omega). \end{aligned} $$

Proof of Proposition 2.6

It suffices to prove that for given \(\varepsilon > 0\) and \(\delta > 0\), there exists a continuous map \(g \colon \mathcal {M} \to \mathbb {R}\) such that
$$\displaystyle \begin{aligned} \mu(|f - g| > \varepsilon) < \delta. \end{aligned} $$
By Lusin’s theorem, there is a closed set \(A \subset \mathcal {M}\) such that \(\mu (A^c) < \delta \) and \(\tilde {g} := f|_A\) is continuous. Since A is closed, from Tietze’s extension theorem, we can extend \(\tilde {g}\) to a continuous map g on the whole space \(\mathcal {M}\). The map g clearly satisfies
$$\displaystyle \begin{aligned} \{x \in \mathcal{M}\, :\, |f(x) - g(x)| > \varepsilon \} \subset A^c \end{aligned} $$
for every \(\varepsilon > 0\), thus the result follows. □
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Title
Optimal Stopping for Non-Markovian Asset Price Processes
Authors
Christian Bayer
Paul P. Hager
Sebastian Riedel
Copyright Year
2026
DOI
https://doi.org/10.1007/978-3-031-97239-3_9
1. P. Bank, C. Bayer, P.P. Hager, S. Riedel, T. Nauen, Stochastic control with signatures. Preprint. arXiv:2406.01585 (2024)
2. C. Bayer, P.P. Hager, S. Riedel, J. Schoenmakers, Optimal stopping with signatures. Ann. Appl. Probab. 33(1), 238–273 (2023)
3. C. Bayer, L. Pelizzari, J. Schoenmakers, Primal and dual optimal stopping with signatures. Preprint. arXiv:2312.03444 (2023)
4. S. Becker, P. Cheridito, A. Jentzen, Deep optimal stopping. J. Mach. Learn. Res. 20(74), 1–25 (2019)
5. H. Boedihardjo, X. Geng, T. Lyons, D. Yang, The signature of a rough path: uniqueness. Adv. Math. 293, 720–737 (2016)
6. S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, Cambridge, 2004)
7. E.M. Compagnoni, A. Scampicchio, L. Biggio, A. Orvieto, T. Hofmann, J. Teichmann, On the effectiveness of randomized signatures as reservoir for learning rough dynamics, in 2023 International Joint Conference on Neural Networks (IJCNN) (IEEE, 2023), pp. 1–8
8. B. Dupire, Functional Itô calculus. Quant. Finance 19(5), 721–729 (2019)
9. E. Ekström, Properties of American option prices. Stoch. Process. Appl. 114(2), 265–278 (2004)
10. P.K. Friz, M. Hairer, A Course on Rough Paths (Springer, Berlin, 2020)
11. P.K. Friz, N.B. Victoir, Multidimensional Stochastic Processes as Rough Paths: Theory and Applications, vol. 120 (Cambridge University Press, Cambridge, 2010)
12. P. Glasserman, Monte Carlo Methods in Financial Engineering (Springer, New York, 2003)
13. F.A. Longstaff, E.S. Schwartz, Valuing American options by simulation: a simple least-squares approach. Rev. Financ. Stud. 14(1), 113–147 (2001)
14. T.J. Lyons, M. Caruana, T. Lévy, Differential Equations Driven by Rough Paths (Springer, Berlin, 2007)
15. G. Peskir, A. Shiryaev, Optimal Stopping and Free-Boundary Problems (Springer, Basel, 2006)
16. L.C. Rogers, Monte Carlo valuation of American options. Math. Finance 12(3), 271–286 (2002)
17. A. Shapiro, Monte Carlo sampling methods, in Handbooks in Operations Research and Management Science, vol. 10 (Elsevier, Amsterdam, 2003), pp. 353–425
18. A.N. Shiryaev, Optimal Stopping Rules, vol. 8 (Springer, Berlin, 2007)
19. S.E. Shreve, Stochastic Calculus for Finance II: Continuous-Time Models, vol. 11 (Springer, New York, 2004)