Abstract
Two stochastic processes can have similar laws but yield a vastly different outcome in applications such as optimal stopping or stochastic programming. The reason is that the usual weak topology does not account for the different available information in time that is stored in the filtration of the underlying process. To address the resulting discontinuities, Aldous introduced the extended weak topology, and subsequently, Hoover and Keisler showed that both, weak topology as well as extended weak topology, are just the first two topologies in a sequence of the so-called adapted weak topologies that get increasingly finer. In this short survey, we give a brief introduction to the recent advances of the applications of signature theory, in particular the higher-rank (expected) signatures, in adapted weak topologies and related fields, and highlight theoretical and computational properties.
1 Introduction to Adapted Topologies
Financial and economic agents frequently encounter decision-making problems. Facing uncertainty, they have to make a sequence of choices in a complex environment, which necessitates a systematic approach. Therefore, to enable the decision-making process, mathematical models are employed, which assist in identifying reasonable courses of action. A common method of incorporating uncertainty into a model is through stochastic processes. The agent has to base their decisions on the available information, which is continually updated in time. This evolving information is formally represented by a filtration. In summary, the mathematical basis of such a framework (throughout the article, for ease of exposition, we consider a discrete-time framework with \(T \in \mathbb {N}\) time steps and \(I := \{ 1, \ldots , T \}\)) involves:
a probability space \((\Omega ,\mathcal {F},\mathbb {P})\);
a filtration \((\mathcal {F}_t)_{t \in I}\), i.e., an increasing sequence of \(\sigma \)-algebras with \(\mathcal {F}_t \subseteq \mathcal {F}\); and
a stochastic process \(X = (X_t)_{t \in I}\) adapted to the filtration \((\mathcal {F}_t)_{t \in I}\), i.e., \(X_t\) is \(\mathcal {F}_t\)-measurable.
Informally speaking, as \(\mathcal {F}_t\) reflects the information of the agent at time t, the adaptedness of X means that the agent has observed the path of X up to time t.
To give an illustrative example of such a setup, we briefly consider the model of a simple financial market that can be found in the classical mathematical finance literature, see for example [40]. There, the stochastic process \(X = (X_t)_{t \in I}\) with \(X_t \in \mathbb {R}^d\) may be used to model the uncertain evolution in time of d asset prices. This means that each coordinate of \(X_t = (X_t^i)_{i = 1}^d\) represents the value of a single asset at a given point in time. Meanwhile, the filtration encodes the information that is available to an investor, on which they can base their decisions. Therefore, the investor can only employ trading strategies that are adapted to this filtration. For example, let \((H_t)_{t \in I}\) be a self-financing dynamic trading strategy. When \(H_t^i\) denotes the number of units of the i-th asset that the investor holds between time t and \(t + 1\) (that is, on \([t,t+1)\)), then this amount is necessarily measurable with respect to \(\mathcal {F}_t\). Moreover, the value of the self-financing portfolio at time t is given by
In more mathematical terms, we consider dynamic trading strategies that are predictable with respect to that filtration. Therefore, the collection of attainable claims of type (1) crucially depends on both the process and the filtration, which we further stress with the example below. Before doing so, we introduce the following notion.
Definition 1.1
A filtered process is a 5-tuple \(\mathbf {X} = (\Omega ^{\mathbf {X}}, \mathcal {F}^{\mathbf {X}}, \mathbb {P}^{\mathbf {X}}, (\mathcal {F}_t^{\mathbf {X}})_{t \in I}, X)\) consisting of a probability space \((\Omega ^{\mathbf {X}}, \mathcal {F}^{\mathbf {X}}, \mathbb {P}^{\mathbf {X}})\), a filtration \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\) and an \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-adapted stochastic process X. The class of all filtered processes is denoted by \(\mathcal {F}\mathcal {P}\).
As the framework studied here can be condensed into such a 5-tuple, filtered processes can be seen as the mathematical basis of models for many phenomena in finance and economics. In what follows, we give an overview of past and recent developments in the systematic study of filtered processes, where the main motivation stems from model sensitivity.
We use \(\mathbb E_{\mathbf {X}}\) to denote the expectation w.r.t. \(\mathbb P^{\mathbf {X}}\) for a filtered process \(\mathbf {X}\). When the probability space is unambiguous from the context, we will omit the subscript and simply write \(\mathbb E\).
Example 1.2
Let us consider the simple probability space \((\Omega ,\mathcal {F},\mathbb {P})\) where \(\Omega := \{ \omega _1,\omega _2\}\) has two elements, \(\mathcal {F} := 2^\Omega \) is the power set of \(\Omega \), and \(\mathbb {P}\) is the uniform distribution on \(\Omega \). We consider the process \(X = (X_t)_{t = 1}^2\) with
for \(\omega \in \Omega \). First assume that the investor’s available information is given by the filtration \((\mathcal {F}_t)_{t = 1}^2\) with \(\mathcal {F}_1 := \{ \emptyset , \Omega \}\) and \(\mathcal {F}_2 := \mathcal {F}\), that is, we consider the filtered process \(\mathbf {X} := (\Omega , \mathcal F, \mathbb P, (\mathcal F_t)_{t = 1}^2, X)\). Then the set of attainable claims of type (1) is given by the collection
$$\displaystyle \begin{aligned} \{ h X_2 : h \in \mathbb{R} \}. \end{aligned}$$
By buying h units of X at time 1, one cannot make a profit without the risk of losing money, since \(h \cdot X_2 \ge 0\) \(\mathbb {P}\)-almost surely if and only if \(h = 0\).
On the other hand, if the investor’s available information is given by the filtration \((\mathcal {G}_t)_{t = 1}^2\) with \(\mathcal G_1 = \mathcal G_2 = \mathcal {F}\), that is, we have the filtered process \(\mathbf {Y} := (\Omega , \mathcal F, \mathbb P, (\mathcal G_t)_{t = 1}^2, X)\) as model, then the set of attainable claims is given by
$$\displaystyle \begin{aligned} \{ H X_2 : H \text{ is }\mathcal{F}\text{-measurable} \}. \end{aligned}$$
Since the investor already knows the value of \(X_2\) at time \(t=1\), they can exploit this information by buying \(H_1 = X_2\) units. This leads to a profit without risk of losing money, since the value of the portfolio at time 2 is \(H X_2 = X_2^2 = 1\).
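The arithmetic of this example is easy to verify in a few lines. The following minimal sketch (the scenario encoding and names are ours) assumes, consistently with the example, that \(X_1 = 0\), \(X_2(\omega_1) = 1\) and \(X_2(\omega_2) = -1\):

```python
# Two-scenario market from Example 1.2: Omega = {w1, w2}, P uniform;
# the process has X_1 = 0 and X_2(w1) = 1, X_2(w2) = -1 (assumed values).
scenarios = {"w1": 1.0, "w2": -1.0}   # values of X_2

# Coarse filtration F_1 = {emptyset, Omega}: the position h chosen at
# time 1 must be the same constant in both scenarios.
def payoff_constant(h):
    return {w: h * x2 for w, x2 in scenarios.items()}

# Any h != 0 risks a loss: the claim h*X_2 is negative in one scenario.
assert any(v < 0 for v in payoff_constant(2.0).values())

# Full-information filtration G_1 = F: H may depend on the scenario,
# so the investor can buy H = X_2 units and collect H*X_2 = X_2^2 = 1.
payoff_insider = {w: x2 * x2 for w, x2 in scenarios.items()}
print(payoff_insider)  # {'w1': 1.0, 'w2': 1.0}: riskless profit
```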
Even though Example 1.2 is admittedly a rather basic example, it still stresses the importance of studying stochastic processes in close combination with their filtrations. Virtually any example of a sequential decision making problem exhibits similar properties, including the optimal stopping problem (cf. Example 1.3) that we will discuss in more detail below.
Before we move on to the detailed discussion, for the readers' convenience we first collect some important notation that will be used throughout the article:
Table of Notations

Spaces
- \(\mathcal {P}(A)\): The space of probability measures on a topological space A (p. 337)
- \(I \to A\): The space of all paths from I into a set A (p. 339)
- \(M^{(r)}\): The path space of rank-r prediction processes (p. 343)
- \(T((\mathbb {R}^d))\): The tensor algebra over \(\mathbb {R}^d\) (p. 348)
- H: A Hilbert space in \(T((\mathbb {R}^d))\) containing the image of the signature (p. 348)
- \(H^{(r+1)}\): A Hilbert space in \(T((\mathbb {R} \oplus H^{(r)}))\) containing the image of \(\Phi ^{(r)}_{\text{Sig}}\) (p. 353)
- \({\mathbf{H}}_{\text{Sig}}^{(r)}\): The RKHS induced by \({\mathbf{k}}^{(r)}_{\text{Sig}}\) (p. 355)

Paths and processes
- \(\mathrm {pp}(\mathbf {X})\): The prediction process of a filtered process \(\mathbf {X}\) (p. 339)
- \(\mathrm {pp}^{(r)}(\mathbf {X})\): The rank-r prediction process of a filtered process \(\mathbf {X}\) (p. 343)
- \({\mathbf{p}}^{(r)}\): The paths in \(M^{(r)}\) (p. 353)
- \(\tilde {{\mathbf {p}}}^{(r)}\): The paths formed by taking the integral of \(\Phi ^{(r-1)}_{\text{Sig}}\) against \({\mathbf{p}}^{(r)}\)
- \(\widehat {\widetilde {\mathrm {pp}}}^{(1)}(\mathbf {X})\): The empirical estimator of \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})\) (p. 364)

Topologies
- \(\mathcal {T}^{(r)}\): The adapted weak topology of rank r (p. 343)
- \(\mathcal {T}\): The adapted weak topology (p. 344)

Signatures
- \(\Phi _{\text{Sig}}\): The signature on \(I \to \mathbb {R}^d\) (p. 349)
- \(\Phi ^{(r)}_{\text{Sig}}\): The rank-r signature on \(M^{(r)}\) (p. 352)
- \(d^{(r)}_{\text{Sig}}\): The rank-r signature metric (p. 354)
- \({\mathbf{k}}_{\text{Sig}}\): The signature kernel (p. 355)
- \({\mathbf {k}}^{(r)}_{\text{Sig}}\): The rank-r signature kernel (p. 355)
- \(\mathcal {D}^{(r)}_{\text{Sig}}\): The rank-r signature kernel MMD (p. 356)
- \(\hat {\mathcal {D}}^{(1)}_{\text{Sig}}\): The empirical estimator of \(\mathcal {D}^{(1)}_{\text{Sig}}\) (p. 364)
1.1 Model Sensitivity
In the past 30 years, there has been substantial progress in our understanding of model uncertainty in finance, starting with the seminal articles by Lyons [60], Avellaneda et al. [8], and Soner et al. [75] on markets with uncertain volatilities, and Hobson [44] and Cont [32] on robust hedging and pricing. Given the numerous contributions to the field in recent years, we cannot do justice to all, but refer readers to Beiglböck et al. [20], Galichon et al. [41] and Bouchard and Nutz [29], and the references therein. Parametric model sensitivity was studied, for example, by Kramkov and Sîrbu [54] and Kardaras and Žitković [50]. More recently, Backhoff et al. [13] shed new light on model sensitivity in finance by suggesting a non-parametric analysis based on the recent notion of adapted Wasserstein distance. We will loosely follow their motivation:
Models are based on data and observations that are naturally noisy and incomplete. For this reason, a model can at best be a faithful approximation of the modeled scenario, and optimal actions based on this model may not be optimal in reality. Consequently, it becomes crucial to gauge the sensitivity of decisions based on slightly erroneous models. In other words, one would like to get a grasp on the following question.
(Q)
How well do decision strategies derived from a slightly erroneous model perform when applied to reality?
However, Question (Q) is not yet well-posed, because we have not clarified when two models should be considered as ‘close’. Consequently, we need a suitable notion of proximity of models. In our setting, this translates to finding an appropriate notion of when two filtered processes (cf. Definition 1.1) \(\mathbf {X}, \mathbf {Y}\) are close. However, before comprehending proximity, it is essential to understand when two filtered processes should be considered equal.
1.2 The Weak Topology
Two random variables \(X,Y\) taking values in \(\mathbb R^d\), not necessarily supported on the same probability space, share the same probabilistic properties if their laws coincide. At the same time, a reasonable notion of proximity is provided by the ubiquitous topology of weak convergence of measures. Let \(C_b(\mathbb R^d)\) be the set of continuous bounded functions on \(\mathbb R^d\). We say that X and Y have the same distribution if
There are numerous ways of metrizing weak convergence of measures on the set of probability distributions on \(\mathbb R^d\), \(\mathcal P(\mathbb R^d)\), such as the Prokhorov metric, Wasserstein distances, or maximum mean discrepancies. Throughout this article, we endow spaces of probability measures with the topology of weak convergence of measures.
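To make one of these metrizations concrete: for two empirical measures on \(\mathbb R\) with equally many atoms, the 1-Wasserstein distance is attained by matching sorted samples (recall that \(W_1\) metrizes weak convergence together with convergence of first moments, so it is slightly stronger than weak convergence alone). A minimal pure-Python sketch:

```python
def wasserstein_1d(xs, ys):
    # W1 between two empirical measures with equally many atoms on R;
    # in one dimension the optimal coupling matches sorted samples.
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# The laws of X_1^eps (values +-eps) and X_1 = 0 from Example 1.3 below
# are close in W1, even though the associated models behave differently:
print(wasserstein_1d([0.001, -0.001], [0.0, 0.0]))  # 0.001
```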
In a first attempt, one could try to identify filtered processes that share the same law, and to call two filtered processes close if their laws are similar. This is rather unsatisfactory, since it completely neglects any information that is encoded in the filtrations, which is inadequate for our purposes, cf. Example 1.2. But even when considering only filtered processes \(\mathbf {X}\) whose filtration \((\mathcal {F}^{\mathbf {X}}_t)_{t \in I}\) is the natural filtration generated by the process X itself, proximity w.r.t. weak convergence of measures is not the correct notion in view of finding an answer to Question (Q). In order to stress the shortcomings of the usual weak topology of measures when employed in our setting, we provide a simple example.
Example 1.3
The optimal stopping problem is perhaps one of the simplest information dependent optimization problems that is encountered on financial markets. Let \((\Omega ,\mathcal {F},\mathbb {P})\), \((\mathcal {F}_t)_{t = 1}^2\), \((\mathcal G_t)_{t = 1}^2\) and X be given as in Example 1.2. For \(\varepsilon > 0\) we consider the simplistic scenarios
for \(\varepsilon \searrow 0 \), where \(\mathrm {ST}(\mathbf {X})\) denotes the set of \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-stopping times.
Example 1.3 provides a very simple example that illustrates the core of the problem. Even though \(X^\varepsilon \to X\) in distribution, the associated filtered processes \({\mathbf {X}}^\varepsilon \) are, from a probabilistic point of view, very different from \(\mathbf {X}\). While X is a martingale on the filtered probability space of \(\mathbf {X}\), the processes \((X^\varepsilon )_{\varepsilon \ge 0}\) are predictable on their respective filtered probability spaces. Of course, one could easily construct more sophisticated examples, also in continuous time. For example, it is possible to approximate a Brownian motion in law by “deterministic” processes, whose paths are completely known at the initial time, and to observe the very same phenomenon; see for example [6, Chapter 4, Section 11].
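The discontinuity of optimal stopping in Example 1.3 can be reproduced numerically. The sketch below assumes the scenario values \(X^\varepsilon(\omega_1) = (\varepsilon, 1)\) and \(X^\varepsilon(\omega_2) = (-\varepsilon, -1)\) with equal probabilities (consistent with the Dirac masses appearing in the computation of the prediction processes below); the function names are ours:

```python
# Optimal stopping values sup E[X_tau] in Example 1.3 (assumed values:
# two equally likely paths, X^eps = (eps, 1) or (-eps, -1)).
def value_trivial_info(eps):
    # F_1 trivial: a stopping time is deterministic, tau = 1 or tau = 2.
    stop_at_1 = 0.5 * eps + 0.5 * (-eps)   # E[X_1^eps] = 0
    stop_at_2 = 0.5 * 1.0 + 0.5 * (-1.0)   # E[X_2^eps] = 0
    return max(stop_at_1, stop_at_2)

def value_full_info(eps):
    # X_1^eps reveals the scenario, so one stops optimally in each one.
    return 0.5 * max(eps, 1.0) + 0.5 * max(-eps, -1.0)

# X^eps -> X in law, yet the stopping values stay bounded apart:
print(value_full_info(1e-3))     # ~0.4995, i.e. close to 1/2
print(value_trivial_info(1e-3))  # 0.0, the value for the martingale X
```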
1.3 The Prediction Process
Multiple mathematicians have addressed the shortcomings presented in Sect. 1.2 and worked on ways of refining the weak topology in order to additionally incorporate the information encoded in the filtration. Notable contributions stem from the works of Aldous [6] and Hoover–Keisler [46]. For an in-depth discussion of the multitude of related works, we refer the interested reader to Sect. 4.
From now on, for \(I = \{1,\ldots ,T\}\) and any set A, we will use the symbol \(I \to A\) to denote the product space \(A^T = A \times \ldots \times A\) (T copies of A), which is the space of paths defined on I taking values in A.
Inspired by earlier work of Knight [53], Aldous defines the prediction process \(\mathrm {pp}(\mathbf {X})\) of a filtered process \(\mathbf {X}\) as the measure-valued process given by
$$\displaystyle \begin{aligned} \mathrm{pp}_t(\mathbf{X}) := \mathbb{P}^{\mathbf{X}}( X \in \cdot | \mathcal{F}_t^{\mathbf{X}})\quad t \in I. \end{aligned} $$
(4)
We have that \(\mathrm {pp}(\mathbf {X})\) takes values in \(I \to \mathcal P(I \to \mathbb R^d)\). As the name suggests, \(\mathrm {pp}_t(\mathbf {X})\) is the predicted law of the process given the information at time t. This process can be used to define an equivalence relation that Aldous dubs synonymity. Two filtered processes \(\mathbf {X}\) and \(\mathbf {Y}\) are called synonymous if \(\mathrm {pp}(\mathbf {X})\) and \(\mathrm {pp}(\mathbf {Y})\) have the same distribution, i.e.,
A sequence \(({\mathbf {X}}^n)_{n \in \mathbb N}\) of filtered processes is said to converge in the extended weak topology to a filtered process \(\mathbf {X}\) if
We have the following observations of topological nature:
1.
It is straightforward to see that the extended weak topology is finer than the weak topology, i.e., convergence in the extended weak topology implies convergence in law of \((X^n)_{n \in \mathbb N}\) to X. Indeed, any function \(g \in C_b(I \to \mathbb R^d)\) has a representative \(f \in C_b(I \to \mathcal P(I \to \mathbb R^d))\) such that
by simply setting \(f(\rho ):= \int g(x) \, d\rho _t(x)\) for every \(\rho \in (I \to \mathcal P(I \to \mathbb R^d))\) and any \(t \in I\), because we have by the tower property
2.
Consider the filtered processes \(({\mathbf {X}}^\varepsilon )_{\varepsilon \ge 0}\) and \(\mathbf {X}\) from Example 1.3. Then a simple computation shows that \(\mathrm {pp}(\mathbf {X}) = (\mathrm {pp}_1(\mathbf {X}), \mathrm {pp}_2(\mathbf {X}))\) satisfies
where \(\delta _{X^\varepsilon (\omega _1)} = \delta _{(\varepsilon ,1)}\) and \(\delta _{X^\varepsilon (\omega _2)} = \delta _{(-\varepsilon ,-1)}\). Hence, \(\mathrm {pp}({\mathbf {X}}^\varepsilon ) \to \mathrm {pp}({\mathbf {X}}^0)\) in distribution, while the laws of \(\mathrm {pp}(\mathbf {X})\) and \(\mathrm {pp}({\mathbf {X}}^0)\) differ. Therefore, Example 1.3 provides an example of a sequence that converges weakly but not with respect to the extended weak topology.
3.
Typical functions in \(C_b(I\to \mathcal P(I \to \mathbb R^d))\) can be built as follows: for each \(t \in I = \{1,\ldots ,T\}\), we choose a function \(g_t \in C_b(I \to \mathbb {R}^d)\); then, given a measure-valued path \(\rho = (\rho _t)_{t \in I} \in I \to \mathcal P(I \to \mathbb R^d)\), we obtain an \(\mathbb {R}^T\)-valued vector by testing each measure \(\rho _t \in \mathcal P(I \to \mathbb R^d)\) against the function \(g_t \in C_b(I \to \mathbb {R}^d)\):
and get an element \(f \in C_b(I\to \mathcal P(I \to \mathbb R^d))\). In fact these functions are used in [46] to study the adapted topologies.
As a concrete example, we assume \(T = 2\) and \(d = 1\); in this case, \((I \to \mathbb {R}^d) = \mathbb {R}^2\) and \((I \to \mathcal {P}(I \to \mathbb {R}^d)) = \mathcal {P}(\mathbb {R}^2) \times \mathcal {P}(\mathbb {R}^2)\). We now take \(g_1(x_1,x_2) = [(x_1 + x_2)\wedge 2]\vee (-2) \), \(g_2(x_1,x_2) = \sin {}(x_1 - x_2)\), and \(h(x_1,x_2) = \exp (-|x_1| -x_2^2)\) to define a function \(f \in C_b(I\to \mathcal P(I \to \mathbb R^d))\) as above. Then for any filtered process \(\mathbf {X}\) on I taking values in \([-1,1]\), we have (as \(\mathrm {pp}(\mathbf {X}) \in \mathcal {P}(\mathbb {R}^2) \times \mathcal {P}(\mathbb {R}^2)\))
We leave a simple exercise to curious readers to check that for filtered processes \(\mathbf {X}\) and \({\mathbf {X}}^\varepsilon \) from Example 1.3, it holds that for the above function f,
so that \(\lim _{\varepsilon \to 0} \mathbb {E}[f(\mathrm {pp}({\mathbf {X}}^\varepsilon ))] \neq \mathbb {E}[f(\mathrm {pp}(\mathbf {X}))]\).
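The exercise can be checked numerically. The display defining f is not reproduced here, so the sketch below assumes the natural composition \(f(\rho) = h\big(\int g_1 \, d\rho_1, \int g_2 \, d\rho_2\big)\) suggested by the construction above; the encoding of discrete measures is ours:

```python
import math

# Test function from the concrete example: g1 is a clamped sum, g2 a
# sine of the increment, and h combines the two resulting moments.
g1 = lambda x1, x2: max(min(x1 + x2, 2.0), -2.0)
g2 = lambda x1, x2: math.sin(x1 - x2)
h  = lambda y1, y2: math.exp(-abs(y1) - y2 ** 2)

def f(rho1, rho2):
    # rho1, rho2: discrete measures on paths, lists of (prob, (x1, x2)).
    m1 = sum(p * g1(*x) for p, x in rho1)
    m2 = sum(p * g2(*x) for p, x in rho2)
    return h(m1, m2)

def E_f_pp(eps):
    if eps is None:  # the process X: trivial information at time 1
        rho1 = [(0.5, (0.0, 1.0)), (0.5, (0.0, -1.0))]
        return 0.5 * f(rho1, [(1.0, (0.0, 1.0))]) + \
               0.5 * f(rho1, [(1.0, (0.0, -1.0))])
    # X^eps: the scenario is already known at time 1, pp_1 is a Dirac.
    up, dn = (eps, 1.0), (-eps, -1.0)
    return 0.5 * f([(1.0, up)], [(1.0, up)]) + \
           0.5 * f([(1.0, dn)], [(1.0, dn)])

print(E_f_pp(None))  # exp(-sin(1)^2), approximately 0.49
print(E_f_pp(1e-6))  # close to exp(-1 - sin(1)^2), approximately 0.18
```

The two expectations stay roughly 0.3 apart as \(\varepsilon \to 0\), which is exactly the failure of extended weak convergence claimed in the text.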
Probabilistic properties, such as being a martingale or being predictable, play an important role in stochastic analysis. Many of these properties are already captured by the prediction process.
Remark 1.5
Let \(\mathbf {X}\) and \(\mathbf {Y}\) be filtered processes that are synonymous, then we have:
(a)
if X is a \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-martingale, then Y is a \((\mathcal {F}_t^{\mathbf {Y}})_{t \in I}\)-martingale;
(b)
if X is \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-predictable, then Y is \((\mathcal {F}_t^{\mathbf {Y}})_{t \in I}\)-predictable;
(c)
if X is a \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-Markov process, then Y is a \((\mathcal {F}_t^{\mathbf {Y}})_{t \in I}\)-Markov process.
To strengthen the intuition, we check the second claim as the other two properties can be shown with similar reasoning. Recall that X is \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-predictable if \(X_{t + 1}\) is \(\mathcal {F}_t^{\mathbf {X}}\)-measurable for all \(t\in I\). Equivalently, we can express this property in terms of the prediction process:
where \(\delta _x\) denotes a Dirac measure at \(x \in \mathbb R^d\) and \((e_t)_\#\) is the pushforward of a measure with respect to the time-t evaluation map \(e_t : (\mathbb R^d)^I \to \mathbb R^d\). Therefore, X is \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-predictable if and only if \((e_t)_\# \mathrm {pp}_t(\mathbf {X})\) takes values in the set of Dirac measures. But, the latter is a property that can be deduced by knowing just the law of \(\mathrm {pp}(\mathbf {X})\). Since \(\mathbf {X}\) and \(\mathbf {Y}\) are synonymous, their prediction processes \(\mathrm {pp}(\mathbf {X})\) and \(\mathrm {pp}(\mathbf {Y})\) have the same distribution. It is immediate from (7) that Y also has to be \((\mathcal {F}^{\mathbf {Y}}_t)_{t \in I}\)-predictable.
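The Dirac criterion (7) is easy to test on discrete examples; a minimal sketch (the dictionary encoding of a random measure is ours):

```python
# Dirac criterion, cf. (7): X_{t+1} is F_t-measurable iff the pushforward
# of pp_t(X) under the evaluation map e_{t+1} is a Dirac measure.
# A measure on two-step paths is encoded as {path: probability};
# paths are tuples, and the 0-based index k corresponds to time k + 1.
def is_dirac_marginal(measure, k):
    marginal = {}
    for path, p in measure.items():
        marginal[path[k]] = marginal.get(path[k], 0.0) + p
    return len(marginal) == 1

# pp_1(X^eps) in scenario omega_1 of Example 1.3: the scenario is known
# at time 1, so pp_1 = delta_{(eps, 1)} and the time-2 marginal is Dirac.
print(is_dirac_marginal({(0.001, 1.0): 1.0}, 1))                  # True

# pp_1(X) under the trivial sigma-algebra F_1: a genuine mixture, so
# X_2 is not F_1-measurable and X is not predictable.
print(is_dirac_marginal({(0.0, 1.0): 0.5, (0.0, -1.0): 0.5}, 1))  # False
```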
In fact, these properties are even preserved under convergence in the extended weak topology, with the exception of the martingale property, that is, Item (a); the latter is also preserved if we additionally assume uniform integrability of the sequence of stochastic processes. We refer readers to [17] for detailed proofs.
Remark 1.5 suggests that synonymity may be the adequate notion of equivalence for filtered processes. But even though the prediction process is suitable to rectify many problems (see also the discussion in Sect. 4), it is still not rich enough to identify the value of sequential decision making problems when admitting processes with general filtrations. We demonstrate this point by means of optimal stopping and stress that similar examples can be constructed for most sequential decision making problems.
Example 1.6
Let \(U,V\) be independent fair coin flips such that \(\mathbb {P}(U =1) = \mathbb {P}(U=0) = \mathbb {P}(V=1) = \mathbb {P}(V=0) = 1/2\), \(I = \{1,2,3\}\) and X is given by
Observe that \(\mathcal {F}_1\) encodes at time 1 whether or not we will know the value of \(X_3\) at time 2, while \(\mathcal G_1\) contains no information. The prediction process is ignorant of this kind of information, which is of “higher order”. In fact, computing the prediction processes at time 1, we get
$$\displaystyle \begin{aligned} \mathrm{pp}_1(\mathbf{X}) = \mathrm{law}( X | \mathcal{F}_1) = \mathrm{law}( X | U ) = \mathrm{law}(X) = \mathrm{law}( X | \mathcal G_1) = \mathrm{pp}_1(\mathbf{Y}), \end{aligned} $$
where we used the independence of U and V. In particular, we have shown that \(\mathrm {pp}(\mathbf {X}) = \mathrm {pp}(\mathbf {Y})\). On the other hand, computing the optimal stopping values we find
We leave the verification of the above inequality as an exercise to curious readers by checking that \(\tau ^{\mathbf {X}} := {\mathbf {1}}_{\{U = 0\}} + 2 {\mathbf {1}}_{\{U = 1, V=0\}} + 3 {\mathbf {1}}_{\{U = 1, V=1\}}\) and \(\tau ^{\mathbf {Y}} := 2 {\mathbf {1}}_{\{U = 1, V=0\}} + 3 {\mathbf {1}}_{\{U = 1, V=1\}} + 3{\mathbf {1}}_{\{U = 0\}}\) are optimal stopping times for \(\sup _{\tau \in \mathrm {ST}(\mathbf {X})} \mathbb E[X_\tau ]\) and \(\sup _{\tau \in \mathrm {ST}(\mathbf {Y})} \mathbb E[X_\tau ]\), respectively.
1.4 Higher-Rank Prediction Processes
We have seen that Aldous' notion of synonymity does not capture all probabilistic properties of a filtered process; in particular, it is agnostic to information of higher order, as described in Example 1.6. Motivated by this observation, Hoover–Keisler [46] define prediction processes of higher rank. For a rank \(r \in \mathbb N\), the rank-r prediction process \(\mathrm {pp}^{(r)}\) of a filtered process \(\mathbf {X}\) is inductively defined by
Schematically, we can think of the path space of the rank r prediction process as
$$\displaystyle \begin{aligned} M^{(0)} &:= (I \to \mathbb{R}^d),\\ M^{(1)} &:= (I \to \mathcal{P}(I \to \mathbb{R}^d)),\\ &\;\;\vdots\\ M^{(r)} &:= (I \to \mathcal{P}(\underbrace{(I \to \ldots \mathcal{P}(I \to \mathbb{R}^d) \ldots )}_{M^{(r-1)}})). \end{aligned}$$
The space \(M^{(r)}\) is constructed by alternately applying the following two operations:
mapping a Polish space S to the set of Borel probability measures \(\mathcal P(S)\) on S;
taking the product \((I \to S) = S^I\) of a Polish space S.
As both operations preserve the property of being Polish,1 it is immediate that \(M^{(r)}\) is a Polish space, which mathematically justifies the inductive construction in (8). Two filtered processes \(\mathbf {X}\) and \(\mathbf {Y}\) have the same rank-r adapted distribution if
$$\displaystyle \begin{aligned} {} \mathbb E[f(\mathrm{pp}^{(r)}(\mathbf{X}))] = \mathbb E[f(\mathrm{pp}^{(r)}(\mathbf{Y}))]\quad \text{for all } f \in C_b(M^{(r)}). \end{aligned} $$
(9)
A sequence \(({\mathbf {X}}^n)_{n \in \mathbb N}\) of filtered processes is said to converge in the adapted weak topology of rankr to a filtered process \(\mathbf {X}\) if
Finally, the adapted distribution of a filtered process \(\mathbf {X}\) is the law of \(\mathrm {pp}^{(\infty )}(\mathbf {X})\) given by
$$\displaystyle \begin{aligned} {} \mathrm{pp}_t^{(\infty)}(\mathbf{X}) := (\mathrm{pp}_t^{(0)}(\mathbf{X}), \mathrm{pp}_t^{(1)}(\mathbf{X}), \dots) \quad \text{for all } t \in I, \end{aligned} $$
(11)
where \(\mathrm {pp}^{(\infty )}(\mathbf {X}) = (\mathrm {pp}^{(\infty )}_t(\mathbf {X}))_{t \in I}\) takes values in
$$\displaystyle \begin{aligned} M^{(\infty)} := (I \to (\mathbb{R}^d \times \mathcal P (M^{(0)}) \times \mathcal P(M^{(1)}) \times \dots )). \end{aligned}$$
Before proceeding, we remark that in the present finite discrete-time setting with T time steps, it suffices to consider the prediction process of rank \(T - 1\) instead of the whole adapted distribution. This interesting property is discussed in more detail in Remark 1.7 below. Therefore, the reader may choose to ignore the case \(r = \infty \) and read it instead as \(r = T - 1\). We say that a sequence \(({\mathbf {X}}^n)_{n \in \mathbb N}\) of filtered processes converges in the adapted weak topology to \(\mathbf {X}\) if (10) holds for \(r = \infty \). Hence, convergence in the adapted weak topology is precisely convergence in distribution of the corresponding adapted distributions. Clearly, the class \(\mathcal {F}\mathcal {P}\) of all filtered processes is a proper class containing a lot of redundancy. As a next step, we use the equivalence relation \(\sim _{\infty }\) induced by the adapted distribution to put filtered processes with the “same probabilistic properties” into a single equivalence class. We define the factor space of filtered processes
After factorization, we end up with the set \(\text{FP}\), on which we can define the adapted weak topology of rank \(r \in \mathbb N \cup \{ 0, \infty \}\), denoted by \(\mathcal T^{(r)}\), whose mode of convergence is described by (10). In analogy to working with \(L_p\)-spaces, we will for the most part not distinguish between a single filtered process and its \(\sim _\infty \)-equivalence class. Moreover, we will write \(\mathcal T := \mathcal T^{(\infty )}\) for the adapted weak topology.
Remark 1.7
We compile some facts on \(\mathcal T^{(r)}\), the adapted weak topology of rank r:
(1)
Observe that \(\mathcal T^{(0)}\) corresponds to the usual weak topology, while \(\mathcal T^{(1)}\) is precisely Aldous’ extended weak topology and \(\mathcal T = \mathcal T^{(\infty )}\) is the adapted weak topology.
(2)
The topologies \(\mathcal T^{(r)}\) become increasingly finer. We have
where \(e_t\) is the time-t evaluation map and \(\delta _x\) denotes a Dirac measure at \(x \in M^{(r)}\). The second equality is by definition of the push-forward operation and the last is due to \(\mathrm {pp}^{(r)}(\mathbf {X})\) being \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-adapted. Consequently, one can build a continuous map
such that for all \(\mathbf {X} \in \text{FP}\), \(\iota (\mathrm {pp}^{(r + 1)}(\mathbf {X})) = \mathrm {pp}^{(r)}(\mathbf {X})\) almost surely, which readily provides the inclusion \(\mathcal T^{(r)} \subseteq \mathcal T^{(r + 1)}\). For additional details, we refer to [17, Theorem 4.11 and Theorem 7.1].
(3)
As we consider the setting where \(I = \{1,\ldots ,T\}\) consists of \(T \in \mathbb N\) times, it is not necessary to consider prediction processes of arbitrary rank. In fact, we have by [17, Lemma 4.7 and Lemma 4.10] that, for every \(r \in \mathbb N\), there exists a continuous map
$$\displaystyle \begin{aligned} F : M^{(T-1)} \to M^{(r)} \end{aligned}$$
with the property that \(F(\mathrm {pp}^{(T-1)}(\mathbf {X})) = \mathrm {pp}^{(r)}(\mathbf {X})\) almost surely, for all \(\mathbf {X} \in \text{FP}\). It is straightforward to see that \(\mathcal T^{(r)} = \mathcal T^{(T-1)} = \mathcal T\) for all \(r \ge T - 1\), which means that all the necessary information is contained in the rank-\((T-1)\) prediction process.
(4)
In fact, as long as \(r < T - 1\), the inclusion \(\mathcal T^{(r)} \subseteq \mathcal T^{(r + 1)}\) is strict. This can be seen with the construction shown in Example 1.6; see also [46, Example 3.2] and [17, Proposition 7.2].
(5)
For more detailed descriptions and examples of \(\mathrm {pp}^{(r)} (\mathbf {X})\), \(M^{(r)}\) and \(\mathcal T^{(r)}\) we refer readers to [46] and [17]. We also recommend readers to focus on the case \(r=1\) at the first reading.
Theorem 1.8
The topological space\((\mathit{\text{FP}},\mathcal T)\)is Polish, that is, separable and completely metrizable.
A proof of Theorem 1.8 can be found in [17, Theorem 1.3] and [22, Theorem 1.5]. It is worth mentioning that, even though the filtered processes form a proper class, \((\text{FP},\mathcal T)\) is tractable in the sense that it is a Polish space and simple processes (for example, Markov processes living on finite probability spaces, cf. [17, Theorem 5.4]) are dense. The main purpose of the adapted weak topology is to finally have a topology that is sufficiently strong to ensure continuity of sequential decision making problems, but which is not unnecessarily strong. We refer to Sect. 4 below for a more detailed discussion. Coming back to our introductory example, Example 1.3, we have continuity of optimal stopping.
Lemma 1.9 (Continuity of Optimal Stopping)
Let \(c : (I \to \mathbb R^d) \times I \to \mathbb R\) be continuous, bounded and non-anticipative.2 Then we have continuity of the value function
When there are only two time points, \(I = \{1, 2\}\), optimal stopping boils down to deciding at time \(t = 1\) whether or not to stop. At time \(t = 1\), the gain from stopping immediately is \(c(X,1) = \mathbb E_{\mathbf {X}}[c(X,1)|\mathcal {F}_1^{\mathbf {X}}]\), while the expected profit of stopping at \(t = 2\) is \(\mathbb E_{\mathbf {X}}[ c(X,2) | \mathcal {F}_1^{\mathbf {X}}]\). Consequently, the value of the optimal stopping problem is given by
for \(\rho = (\rho _1,\rho _2) \in M^{(1)}\). As c is continuous and bounded, the same is true for F. Using the latter, the claim follows directly from the definition of \(\mathcal T\) and the representation given in (14). □
The proof for general \(T \in \mathbb N\) is essentially the same as in the case \(T = 2\) but with an additional induction over the number of time steps. The value \(v_c(\mathbf {X})\) is given by the associated Snell envelope \((U_t^{\mathbf {X}})_{t \in I \cup \{ 0 \} }\) at time 0. This means that we have
for \(t \in I\), and \(U_0 = \mathbb E_{\mathbf {X}}[U_1^{\mathbf {X}}]\). Similarly, we define recursively
$$\displaystyle \begin{aligned} U_T^{\mathbf{X}} &= \underbrace{c\big(\mathrm{pp}^{(0)}(\mathbf{X}),\, T\big)}_{=:\, F_T(\mathrm{pp}^{(0)}(\mathbf{X}))}, \\ U_t^{\mathbf{X}} &= \underbrace{c\big(\iota_t(\mathrm{pp}^{(t)}(\mathbf{X})),\, t\big)}_{=:\, F_{T-t+1}(\mathrm{pp}^{(t)}(\mathbf{X}))} \vee \big(F_{T-t+1} \,\#\, \mathrm{pp}^{(t)}(\mathbf{X})\big), \end{aligned}$$
where \(\iota _t : M^{(t)} \to M^{(0)}\), defined analogously to \(\iota \) in Remark 1.7 (2), is a continuous map with \(\iota _t(\mathrm {pp}^{(t)}(\mathbf {X})) = \mathrm {pp}^{(0)}(\mathbf {X}) = X\). Hence, \(F_{T - t} : M^{(t)} \to \mathbb R\) is continuous. In particular, we can construct a continuous map \(F_0 : \mathcal P (M^{(T-1)}) \to \mathbb R\) with
From the definition of convergence in \(\mathcal T\), cf. (10), we conclude that \(v_c\) is continuous. □
2 Higher-Rank (Expected) Signatures
In the introduction we established that the higher-rank adapted weak topology \(\mathcal {T}^{(r)}\), \(r \ge 1\), on the space of filtered processes \(\text{FP}\) is finer than the weak topology (the latter corresponds to \(r = 0\)), and that it is essential for restoring continuity of multi-stage optimization problems, such as the optimal stopping problems discussed in Example 1.2, which are in general discontinuous with respect to the weak topology. It should therefore come as no surprise that many theoretical tools which are valid for the weak topology no longer work in the higher-rank cases. For instance, the expectation of the classical signature feature, which characterizes the laws of stochastic processes as a moment generating function for path-valued random variables, cannot distinguish two filtered processes with respect to the rank-1 adapted weak topology. Motivated by these observations, we modify the classical signature feature accordingly to introduce a family of so-called higher-rank signatures, which adequately capture the higher-rank adapted weak topologies whilst retaining the advantageous theoretical and computational properties of the classical signature. For technical simplicity, throughout the whole section we will implicitly assume that
(A1)
All filtered processes take values in a compact subset \(\mathcal {K} \subset \mathbb {R}^d\).
2.1 Properties of Higher-Rank (Expected) Signatures
In this subsection (which is based on [28, Sections 3,4]), we introduce the notion of higher-rank signatures, whose expectations can be used to distinguish filtered processes with respect to the higher-rank adapted weak topologies.
To provide the reader with sufficient intuition for these increasingly complex concepts, let us first disregard the time components and discuss the static case.
The Static Case
Since \(\mathcal {K} \subset \mathbb {R}^d\) is a compact subset, the Stone–Weierstrass theorem asserts that the space of all polynomials on \(\mathcal {K}\) is dense in \(C_b(\mathcal {K})\) in the topology of uniform convergence. Furthermore, a fundamental result in probability theory states that two \(\mathcal {K}\)-valued random variables \(X,Y\) have the same law if and only if, for all \(k\in \mathbb {N}\), they have the same k-th moment. Moreover, a sequence \((X^n)_{n \in \mathbb N}\) of \(\mathcal {K}\)-valued random variables converges to X in distribution if and only if, for all \(k \in \mathbb {N}\), their k-th moments converge. These well-known results can be reformulated in the notation of chapter “The Signature Kernel” as follows:
where \(a_k \cdot b_k\) denotes the usual inner product of \(a_k\) and \(b_k\) in \((\mathbb {R}^d)^{\otimes k}\) which is obtained from the Euclidean inner product on \(\mathbb {R}^d\). Next, define the monomial map \(\Phi _{\text{Mon}}: \mathbb {R}^d \to H\) by
where \(x^{\otimes k} := (x_1^{m_1}\ldots x_d^{m_d})_{m_1+ \ldots + m_d =k, m_i \in \mathbb {N}_0} \in (\mathbb {R}^d)^{\otimes k}\) for a vector \(x = (x_1,\ldots ,x_d) \in \mathbb {R}^d\). Thanks to the Stone–Weierstrass theorem we have
Writing \(\Phi _{\text{Mon}}^{n}(x) := \frac {x^{\otimes n}}{n!}\), \(n \in \mathbb {N}_0\) for the n-th component of \(\Phi _{\text{Mon}}(x)\), notice that \(\mathbb {E}[\Phi _{\text{Mon}}^{n}(X)] = \mathbb {E}[\frac {X^{\otimes n}}{n!}]\) is the n-th moment of X (up to the constant \(1/n!\)).
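As a quick numerical illustration, the following sketch computes the truncated monomial map and the re-scaled empirical moments \(\mathbb {E}[X^{\otimes k}/k!]\); the truncation depth, the uniform sample law, and all identifiers are our own assumptions.

```python
# A minimal numpy sketch of the truncated monomial map Phi_Mon and the
# re-scaled moments E[X^{(x)k}/k!]; depth, sample law, and all names
# are our own assumptions.
import numpy as np
from math import factorial

def tensor_power(x, k):
    """k-fold tensor power x^{(x)k} as a k-dimensional array (k = 0 gives 1)."""
    out = np.array(1.0)
    for _ in range(k):
        out = np.multiply.outer(out, x)
    return out

def phi_mon(x, depth):
    """Truncated monomial map (x^{(x)k}/k!)_{0 <= k <= depth}."""
    return [tensor_power(x, k) / factorial(k) for k in range(depth + 1)]

def expected_phi_mon(samples, depth):
    """Empirical re-scaled moments E[X^{(x)k}/k!] from i.i.d. samples."""
    feats = [phi_mon(x, depth) for x in samples]
    return [np.mean([f[k] for f in feats], axis=0) for k in range(depth + 1)]

rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=(5000, 2))   # K = [-1, 1]^2
moments = expected_phi_mon(samples, depth=3)
# moments[1] approximates the mean E[X]; moments[2] approximates E[X^{(x)2}]/2
```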
Remark 2.1
In fact, the preceding results regarding polynomials and re-scaled moments can be generalized further by relaxing the compactness assumption. To this end, we need to normalize the monomial map \(\Phi _{\text{Mon}}\) in order to guarantee integrability. According to [31, Section 3.2], the renormalized version of \(\Phi _{\text{Mon}}\) is called the robust monomial map, and can be represented as \(\Lambda \circ \Phi _{\text{Mon}}\) for a so-called tensor normalization \(\Lambda : H \to \{(a_0,a_1,\ldots ) \in H: \|(a_0,a_1,\ldots )\|_H \le 1\}\).
Now, let us move on to the case of filtered processes with respect to the weak topology, i.e., the rank \(r=0\), cf. Sect. 1.4. In this setting, it is well-known that the notions of monomials and robust monomials on a compact subset \(\mathcal {K} \subset \mathbb {R}^d\) admit natural generalizations to the path space \(M^{(0)} = (I \to \mathcal {K})\), called signatures and robust signatures, respectively.
The Rank \(r = 0\) Case
The classical signature theory tells us that the notion of monomials \(\Phi _{\text{Mon}}\) on \(\mathcal {K} \subset \mathbb {R}^d\) from the static case has a natural generalization to \(\mathcal {K}\)-valued paths, which is called the signature. To elaborate, a \(\mathcal K\)-valued (discrete-time) path \(\mathbf {\mathit {x}} = (\mathbf {\mathit {x}}_1,\ldots ,\mathbf {\mathit {x}}_T) \in M^{(0)} = (I \to \mathcal {K})\) is identified with a piecewise-linear path \(\mathbf {\mathit {x}}: [0,T] \to \mathbb {R}^d\) via
and we write \(\mathbf {\mathit {x}}_t = (x^1_t, \ldots , x^d_t)\) for each \(t \in [0,T]\). In other words, we embed the discrete-time path space \(M^{(0)}\) into the space of continuous paths with bounded variation defined on \([0,T]\), which we denote by \((\mathbb {R}^d)_{\text{1-var}}\) as in chapter “The Signature Kernel”. Then the signature of the path \(\mathbf {\mathit {x}} \in (I \to \mathbb {R}^d)\) is defined by
is the iterated Riemann integral for \(k \ge 1\), and H is given as in (15). Using the same notation as in chapter “The Signature Kernel”, for each \(k \in \mathbb {N}\) we use \(\Phi _{\text{Sig},k}(\mathbf {\mathit {x}})\) to denote the k-th iterated integral of \(\mathbf {\mathit {x}}\), namely \(\Phi _{\text{Sig},k}(\mathbf {\mathit {x}}) = \int d\mathbf {\mathit {x}}^{\otimes k} \in (\mathbb {R}^d)^{\otimes k}\).
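For a piecewise-linear path the iterated integrals can be computed exactly from the increments, since the signature of a linear segment is the tensor exponential of its increment and signatures multiply over concatenation (Chen's identity). A minimal truncated sketch (our own names and depth; no time augmentation):

```python
# Truncated signature of the piecewise-linear embedding of a discrete path,
# via Chen's identity: Sig(x) = exp(Δ_1 x) ⊗ ... ⊗ exp(Δ_T x).
import numpy as np

def tensor_mult(a, b, depth):
    """Product in the tensor algebra T((R^d)), truncated at `depth`."""
    out = [np.zeros_like(lvl) for lvl in a]
    for k in range(depth + 1):
        for i in range(k + 1):
            out[k] = out[k] + np.multiply.outer(a[i], b[k - i])
    return out

def tensor_exp(x, depth):
    """exp(x) = sum_k x^{(x)k}/k! for a level-1 element x in R^d."""
    levels, cur, fact = [np.array(1.0)], np.array(1.0), 1.0
    for k in range(1, depth + 1):
        cur, fact = np.multiply.outer(cur, x), fact * k
        levels.append(cur / fact)
    return levels

def signature(path, depth):
    """Truncated signature of the piecewise-linear interpolation of `path`."""
    d = path.shape[1]
    sig = [np.array(1.0)] + [np.zeros((d,) * k) for k in range(1, depth + 1)]
    for t in range(1, len(path)):
        sig = tensor_mult(sig, tensor_exp(path[t] - path[t - 1], depth), depth)
    return sig

path = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])   # right, then up
sig = signature(path, depth=2)
# sig[1] is the total increment (1, 1); the antisymmetric part of sig[2]
# encodes the Lévy area of the path
```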
Thanks to the algebraic and analytic properties of iterated integrals, the following theorem holds true and implies that the signature feature \(\Phi _{\text{Sig}}\) plays the role of the sequence of monomials on the path space \(M^{(0)}\): linear combinations of its components are dense in the space of continuous bounded functions on the path space, and the expectation of \(\Phi _{\text{Sig}}(X)\) (called the expected signature) can be used to characterize the weak convergence of the laws of stochastic processes (viewed as path-valued random variables). We summarize these facts in the following theorem, which can also be found in chapter “The Signature Kernel”.
Theorem 2.2
Let\(\mathcal {K} \subset \mathbb {R}^d\)be compact. Let\(M^{(0)} = (I \to \mathcal {K})\)denote the set of paths taking values in\(\mathcal {K}\). For every\(\mathbf {\mathit {x}} \in M^{(0)}\), we define the time-augmented version of\(\mathbf {\mathit {x}}\)as\(\bar {\mathbf {\mathit {x}}}(t) = (t, \mathbf {\mathit {x}}_t) \in \mathbb {R}^{d+1}\). Then the time-augmented signature feature3
\(\bar \Phi _{\mathit{\text{Sig}}}\)is characteristic to the space of signed measures on\(M^{(0)}\), i.e., for signed measures\(\mu \), \(\nu \)on\(M^{(0)}\)we have
In other words, the (time-augmented) signature metric\(d^{(0)}_{\mathit{\text{Sig}}}: \mathcal {P}(M^{(0)}) \times \mathcal {P}(M^{(0)}) \to \mathbb {R}_+\)given by
metrizes the weak topology on\(\mathcal {P}(M^{(0)})\).
Remark 2.3
In order to ensure the above characteristicness, it is necessary to consider the signature of the time-augmented path \(\bar {\mathbf {\mathit {x}}}_t = (t, \mathbf {\mathit {x}}_t) \in \mathbb {R}^{d+1}\), see e.g. [31] for a detailed discussion of this issue. From now on, to keep the notation as simple as possible, we will not distinguish the signature feature from its time-augmented version, and will write \(\Phi _{\text{Sig}}\) instead of \(\bar \Phi _{\text{Sig}}\) for time-augmented signatures.
As in the static case, one can use any tensor normalization \(\Lambda \) to define a robust signature by \(\Lambda \circ \Phi _{\text{Sig}}\). Using this robust signature one can show that all of the above results hold for general \(\mathbb {R}^d\)-valued paths, dropping the compactness assumption, see [31]. To avoid unnecessary technicalities, we will stick to signature features for paths taking values in a compact subset \(\mathcal {K}\) of \(\mathbb {R}^d\) in the present article.
This leads us to the following natural question:
(Q)
Can we adopt the signature methodology to define the moment generating function and a signature-type metric as in Theorem 2.2 to characterize the rank-r adapted weak topology on the space of filtered processes \(\text{FP}\), for any rank \(r \in \mathbb {N}_0\)?
The Rank \(r = 1\) Case
As the rank-0 adapted weak topology on \(\text{FP}\) is induced by the weak topology on \(\mathcal {P}(M^{(0)})\), Theorem 2.2 shows that the signature feature \(\Phi _{\text{Sig}}\) and the signature metric \(d_{\text{Sig}}\) achieve this goal when \(r=0\). Now, let us proceed with \(r=1\). According to the definition, the rank-1 adapted weak topology \(\mathcal T^{(1)}\) is the initial topology induced by the law of the rank 1-prediction process \(\mathrm {pp}^{(1)}(\mathbf {X}) = \mathrm {law}(X|\mathcal {F}^{\mathbf {X}}_t)_{t \in I} \in M^{(1)}\) with respect to the weak topology on \(\mathcal {P}(M^{(1)})\) where \(M^{(1)} = (I \to \mathcal {P}(M^{(0)}))\). Therefore, to establish a signature-type metric for the rank-1 adapted weak topology, a natural idea is to define the signature feature for the rank 1-prediction process, which can be viewed as an \(M^{(1)}\)-valued random variable. The only technical issue here is that the paths \(\mathbf {\mathit {p}} = (\mathbf {\mathit {p}}_t)_{t \in I}\) in \(M^{(1)}\) are measure-valued paths, and it is not clear how to define iterated integrals for paths with values in a non-linear space (here, the non-linear space is \(\mathcal {P}(M^{(0)})\)).4
One natural way to circumvent this difficulty is to transform a measure-valued path into a path taking values in a linear space, then use the iterated integrals to define the signature feature for the latter one, and use it as the signature for the original measure-valued path. In our situation, we can use the signature feature for \(\mathbb {R}^d\)-valued paths to achieve such a transformation. More precisely, for a given measure-valued path \(\mathbf {\mathit {p}} = (\mathbf {\mathit {p}}_t)_{t \in I} \in M^{(1)}\), for each \(t \in I\) we define \(\tilde {\mathbf {\mathit {p}}}_t\) as the Bochner integral of the H-valued mapping \(\Phi _{\text{Sig}}\):
By a slight abuse of notation, we will still use \(\tilde {\mathbf {\mathit {p}}}\) to denote the time-augmented version of the path \(\tilde {\mathbf {\mathit {p}}}\), i.e., \(\tilde {\mathbf {\mathit {p}}}_t = (t, \int _{\mathbf {\mathit {x}} \in M^{(0)}} \Phi _{\text{Sig}}(\mathbf {\mathit {x}}) \mathbf {\mathit {p}}_t (d \mathbf {\mathit {x}})) \in \mathbb {R} \oplus H\) for each \(t \in I\). Then, similar to \(\mathbb {R}^d\)-valued paths, we identify this discrete-time H-valued path with the piecewise-linear H-valued path \(\tilde {\mathbf {\mathit {p}}}_t = (i-t)\tilde {\mathbf {\mathit {p}}}_{i-1} + (t-(i-1))\tilde {\mathbf {\mathit {p}}}_{i}, \quad t \in [i-1,i]\), which belongs to the space of H-valued paths on \([0,T]\) with bounded variation. Subsequently, we can define iterated integrals and form a (time-augmented) signature for \(\tilde {\mathbf {\mathit {p}}} = (\tilde {\mathbf {\mathit {p}}}_t)_{t \in I}\) by
where \(H^{(2)} \subset T((\mathbb {R} \oplus H)) = \prod _{k=0}^\infty (\mathbb {R}\oplus H)^{\otimes k}\) is a Hilbert space embedded in the tensor algebra \(T((\mathbb {R} \oplus H))\) over the Hilbert space \(\mathbb {R} \oplus H\), whose inner product \(\langle \cdot , \cdot \rangle _{H^{(2)}}\) is induced by the inner product \(\langle \cdot , \cdot \rangle _{\mathbb {R} \oplus H}\) on \(\mathbb {R} \oplus H\). For a complete description of \(H^{(2)}\) we refer to [47, Appendix A] or [28, Appendix B]. Here we remark that the superscript \((2)\) of \(H^{(2)}\) means that \(H^{(2)}\) is formed by taking the tensor algebra over vector space twice, i.e., \(H^{(2)}\) is contained in the tensor algebra over another tensor algebra \(H \subset T((\mathbb {R}^d))\).
is called the rank-1 signature of the measure-valued path \(\mathbf {\mathit {p}} = (\mathbf {\mathit {p}}_t)_{t \in I} \in M^{(1)}\).
Now, let \(\mathbf {X} = \left ( \Omega ^{\mathbf {X}}, \mathcal {F}^{\mathbf {X}}, \mathbb {P}^{\mathbf {X}}, (\mathcal {F}_t^{\mathbf {X}})_{t \in I}, X \right ) \in \text{FP}\) be a filtered process. Its rank 1-prediction process \(\mathrm {pp}^{(1)}(\mathbf {X}) = \mathrm {law}(X|\mathcal {F}^{\mathbf {X}}_t), t \in I\), is an \(M^{(1)}\)-valued random variable, and it is easy to see that
$$\displaystyle \begin{aligned} \widetilde{\mathrm{pp}}^{(1)}(\mathbf{X})_t := \mathbb{E}_{\mathbf{X}}[\Phi_{\text{Sig}}(X) | \mathcal{F}^{\mathbf{X}}_t], \quad t \in I \end{aligned}$$
is an H-valued martingale, so that \(\Phi _{\text{Sig}}^{(1)}(\mathrm {pp}^{(1)} (\mathbf {X})) = \Phi _{\text{Sig}} (t \mapsto \widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})_t)\).
The following theorem justifies the above construction, since the so-defined rank-1 signature for measure-valued paths satisfies all properties (e.g., universality and characteristicness) which hold for the signature for \(\mathbb {R}^d\)-valued paths. For a detailed proof, we refer readers to [28, Theorem 2].
Theorem 2.5
The rank-1 signature\(\Phi ^{(1)}_{\mathit{\text{Sig}}}\)has the following properties:
(i)
\(\Phi ^{(1)}_{\mathit{\text{Sig}}}\)is universal to\(C_b(M^{(1)})\)equipped with the uniform topology, i.e., for every\(f \in C_b(M^{(1)})\)and\(\varepsilon > 0\), there is\(\ell \in H^{(2)}\)with
(ii)
\(\Phi ^{(1)}_{\mathit{\text{Sig}}}\)is characteristic to the space of signed measures on\(M^{(1)}\). In particular, two filtered processes\(\mathbf {X}\)and\(\mathbf {Y}\)have the same rank-1 adapted distribution if and only if
metrizes the rank-1 adapted weak topology\(\mathcal {T}^{(1)}\)on\(\mathit{\text{FP}}\).
The General Higher Rank Case
Thanks to the hierarchical structure of the higher-rank prediction processes, we can repeat the above construction to obtain the rank-r signature, for any \(r \in \mathbb {N}_0\), which then can be used to characterize the rank-r adapted weak topology as in Theorem 2.5. Roughly speaking, as the rank-\((r+1)\) adapted weak topology is induced by the rank \((r+1)\)-prediction process, which is constructed as the prediction process of the rank r-prediction process, the corresponding rank-\((r+1)\) signature \(\Phi ^{(r+1)}_{\text{Sig}}\) is obtained by taking the signature feature of the paths formed via (taking expectations of) the rank-r signature.
The concrete procedure is described as follows, which is also reflected in the diagram below:
(Diagram: the rank-\((r+1)\) signature \(\Phi ^{(r+1)}_{\text{Sig}}\) maps \(M^{(r+1)} = (I \to \mathcal {P}(M^{(r)}))\) to \(H^{(r+2)} \subset T((\mathbb {R} \oplus H^{(r+1)}))\); it factors through the \(H^{(r+1)}\)-valued path space \((I \to H^{(r+1)})\), which is reached from \(M^{(r+1)}\) by integrating the rank-r signature \(\Phi ^{(r)}_{\text{Sig}}\), and from which the signature \(\Phi _{\text{Sig}}\) of Hilbert-space-valued paths leads to \(H^{(r+2)}\).)
1.
Assume that for \(r \in \mathbb {N}_0\) we have defined the rank-r signature \(\Phi ^{(r)}_{\text{Sig}}: M^{(r)} \to H^{(r+1)}\) for measure-valued paths \(\mathbf {\mathit {p}}^{(r)} = (\mathbf {\mathit {p}}^{(r)}_t)_{t \in I} \in M^{(r)}\), where \(H^{(r+1)}\) is a Hilbert space. By convention we set \(H^{(0)} := \mathbb {R}^d\) and \(H^{(1)} := H\), and \(\Phi ^{(0)}_{\text{Sig}} = \Phi _{\text{Sig}}\) is the usual signature on \(M^{(0)}\) defined as in Theorem 2.2.
2.
Consider the measure-valued path space \(M^{(r+1)} = (I \to \mathcal {P}(M^{(r)}))\). For a path \(\mathbf {\mathit {p}}^{(r+1)} = (\mathbf {\mathit {p}}^{(r+1)}_t)_{t \in I} \in M^{(r+1)}\) and for each \(t \in I\), \(\mathbf {\mathit {p}}^{(r+1)}_t \in \mathcal {P}(M^{(r)})\) is a probability measure on the path space \(M^{(r)}\). Hence, the following \(H^{(r+1)}\)-valued Bochner integral is well-defined:
In particular, we obtain an \(H^{(r+1)}\)-valued path \(t \mapsto \tilde {\mathbf {\mathit {p}}}^{(r+1)}_t\). Again, by an abuse of notation, we will still use \(\tilde {\mathbf {\mathit {p}}}^{(r+1)}\) to denote the time-augmented version of the path \(\tilde {\mathbf {\mathit {p}}}^{(r+1)}\). Note that now the codomain of \(\tilde {\mathbf {\mathit {p}}}^{(r+1)}_t\) is \(\mathbb {R} \oplus H^{(r+1)}\).
3.
Let \(H^{(r+2)}\) be a Hilbert space embedded in the tensor algebra \(T((\mathbb {R} \oplus H^{(r+1)}))= \prod _{k=0}^\infty (\mathbb {R} \oplus H^{(r+1)})^{\otimes k}\) whose inner product \(\langle \cdot ,\cdot \rangle _{H^{(r+2)}}\) is induced by the inner product \(\langle \cdot ,\cdot \rangle _{\mathbb {R} \oplus H^{(r+1)}}\). Then, use the iterated integrals of the \(\mathbb {R} \oplus H^{(r+1)}\)-valued path \(t \mapsto \tilde {\mathbf {\mathit {p}}}^{(r+1)}_t\)5 to define
as the rank-\((r+1)\) signature for paths in \(M^{(r+1)}\).
Using the above construction, Theorem 2.5 can be further generalized to any rank \(r \in \mathbb {N}_0\). It suffices to notice that for a filtered process \(\mathbf {X} \in \text{FP}\), the rank r-prediction process \(\mathrm {pp}^{(r)}(\mathbf {X})\) is an \(M^{(r)}\)-valued random variable, and the rank-r adapted weak topology on \(\text{FP}\) is the initial topology induced by the rank r-prediction process \(\mathrm {pp}^{(r)}: \text{FP} \to \mathcal {P}(M^{(r)})\).
Theorem 2.6
For every rank\(r \in \mathbb {N}_0\)the rank-r signature\(\Phi ^{(r)}_{\mathit{\text{Sig}}}\)has the following properties:
(i)
\(\Phi ^{(r)}_{\mathit{\text{Sig}}}\)is universal to\(C_b(M^{(r)})\)equipped with the uniform topology, i.e., for every\(f \in C_b(M^{(r)})\)and\(\varepsilon > 0\), there is\(\ell \in H^{(r+1)}\)with
(ii)
\(\Phi ^{(r)}_{\mathit{\text{Sig}}}\)is characteristic to the space of signed measures on\(M^{(r)}\). In particular, two filtered processes\(\mathbf {X}\)and\(\mathbf {Y}\)have the same rank-r adapted distribution if and only if
metrizes the rank-r adapted weak topology\(\mathcal {T}^{(r)}\)on\(\mathit{\text{FP}}\).
See [28, Theorem 4 and Remark 7] for a detailed proof of Theorem 2.6.
Remark 2.7
Again, by composing the rank-r signature with a tensor normalization \(\Lambda \) we can generalize the above results to general filtered processes taking values in \(\mathbb {R}^d\) and drop the compactness assumption. For more details, we refer to [28, Theorem 2, Remark 7].
2.2 Higher-Rank (Expected) Signatures and Kernel Learning
From the construction of higher-rank signatures in the last section, we see that the rank-r signature \(\Phi ^{(r)}_{\text{Sig}}\) takes values in a Hilbert space \(H^{(r+1)}\), for \(r \in \mathbb {N}_0\). The inner product structure on \(H^{(r+1)}\) naturally allows us to use a kernel learning approach to handle \(\Phi ^{(r)}_{\text{Sig}}\) for any \(r \ge 0\), exactly as the classical signature kernel learning method (corresponding to the case \(r=0\)) discussed in [57, Section 7]; in particular, one can apply kernel learning techniques to compute the higher-rank signature metrics, which will be reported in the next subsection.
Note that in the sequel we implicitly assume that all paths are time-augmented when taking signatures of them.
First, let us quickly review the connection between the usual signature \(\Phi _{\text{Sig}}\) and kernel learning, which is discussed systematically in chapter “The Signature Kernel”. As before, let \(\Phi _{\text{Sig}}: M^{(0)} \to H\) denote the robust signature feature mapping an \(\mathbb {R}^d\)-valued path to a vector in the Hilbert space H. Then we obtain an associated Reproducing Kernel Hilbert Space6 (RKHS) over \(M^{(0)}\) induced by \(\Phi _{\text{Sig}}\) and \(\langle \cdot , \cdot \rangle _H\), which will be denoted by \(\mathbf {\mathit {H}}_{\text{Sig}}\), together with its kernel
Moreover, we have \(\mathbf {\mathit {H}}_{\text{Sig}} = \{f: M^{(0)} \to \mathbb {R}: f(\mathbf {\mathit {x}}) = \langle h, \Phi _{\text{Sig}}(\mathbf {\mathit {x}}) \rangle _H, h \in H \}\). In fact, it is not hard to check that these two Hilbert spaces H and \(\mathbf {\mathit {H}}_{\text{Sig}}\) are isometric by considering the mapping \(h \in H \mapsto f(\cdot ) = \langle h, \Phi _{\text{Sig}}(\cdot )\rangle _H \in \mathbf {\mathit {H}}_{\text{Sig}}\); in other words, the function space \(\mathbf {\mathit {H}}_{\text{Sig}}\) just provides an alternative way of thinking about the Hilbert space H embedded in the tensor algebra.
Now, for any \(r \ge 0\), we can define the so-called rank-r signature kernel \(\mathbf {\mathit {k}}_{\text{Sig}}^{(r)}: M^{(r)} \times M^{(r)} \to \mathbb {R}\) with an associated RKHS \(\mathbf {\mathit {H}}^{(r)}_{\text{Sig}}\) (where we set \(\mathbf {\mathit {k}}^{(0)}_{\text{Sig}} = \mathbf {\mathit {k}}_{\text{Sig}}\) and \(\mathbf {\mathit {H}}^{(0)}_{\text{Sig}} = \mathbf {\mathit {H}}_{\text{Sig}}\)) in the same spirit as before: invoking that the rank-r signature \(\Phi _{\text{Sig}}^{(r)}\) on \(M^{(r)}\) takes values in a Hilbert space \(H^{(r+1)}\), we obtain the associated reproducing kernel
for any \(\mathbf {\mathit {p}}^{(r)}, \boldsymbol {q}^{(r)} \in M^{(r)}\).
Recall that (see e.g. [31, Section 6]) a natural distance between probability measures on any topological space \(\mathcal {X}\) is obtained by fixing a class of functions \(\mathscr {G} \subset \mathbb {R}^{\mathcal {X}}\) and defining the integral probability metric between \(\mu ,\nu \in \mathcal {P}(\mathcal {X})\) as
For instance, if \(\mathscr {G}\) denotes the space of Lipschitz functions on \(\mathcal {X}\) with Lipschitz constant 1, then the above integral probability metric is exactly the first order Wasserstein distance. An insight from the theory of kernel learning is that if \(\mathscr {G}\) is the unit ball of a RKHS \((\mathbf {\mathit {H}},\mathbf {\mathit {k}})\), then the above integral probability metric is called a Maximum Mean Discrepancy (MMD), and the reproducing property implies that
where \(X,X^\prime ,Y,Y^\prime \) are independent with \(X,X^\prime \sim \mu \) and \(Y,Y^\prime \sim \nu \).
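The MMD expansion above is straightforward to estimate once Gram matrices are available. The following sketch uses a Gaussian RBF kernel as a stand-in for the signature kernel, purely for illustration; all names are our own.

```python
# Estimating E[k(X,X')] - 2 E[k(X,Y)] + E[k(Y,Y')] from samples.
# A Gaussian RBF kernel on R^d stands in for the signature kernel.
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gram matrix k(x_i, y_j) = exp(-|x_i - y_j|^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd_squared(X, Y, kernel=rbf_kernel):
    """V-statistic estimate of the squared MMD between the sample laws."""
    return (kernel(X, X).mean() - 2.0 * kernel(X, Y).mean()
            + kernel(Y, Y).mean())

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(400, 2))   # samples from mu
Y = rng.normal(0.0, 1.0, size=(400, 2))   # fresh samples from mu
Z = rng.normal(3.0, 1.0, size=(400, 2))   # samples from a shifted law
assert mmd_squared(X, Y) < mmd_squared(X, Z)   # equal laws give smaller MMD
```

The same estimator applies verbatim with \(\mathbf {\mathit {k}}^{(r)}_{\text{Sig}}\) in place of the RBF kernel, with sampled prediction processes in place of points in \(\mathbb {R}^d\).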
The next proposition shows that the higher-rank signature metrics \(d^{(r)}_{\text{Sig}}(\cdot ,\cdot )\) are MMDs induced by the higher-rank signature kernels \(\mathbf {\mathit {k}}^{(r)}_{\text{Sig}}(\cdot ,\cdot )\) defined above. This is straightforward to prove, and we leave it as an exercise for the curious reader.
Proposition 2.8
For any\(r \in \mathbb {N}_0\), the rank-r signature kernel MMD\(\mathcal {D}^{(r)}_{\mathit{\text{Sig}}}\)on\(\mathit{\text{FP}}\)defined by
In the next subsection we will show that the above expansion of \(\mathcal {D}^{(r)}_{\text{Sig}}(\mathbf {X}, \mathbf {Y})^2\) can easily be estimated from finite samples of \(\mathrm {law}(X)\) and \(\mathrm {law}(Y)\), as the higher-rank signature kernel \(\mathbf {\mathit {k}}^{(r)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) is relatively cheap to evaluate, provided \(\mathbf {X}\) and \(\mathbf {Y}\) are both endowed with their natural filtrations.
2.3 Algorithms for Computing Higher-Rank (Expected) Signature
In the last subsection we have seen that the higher-rank signature metric (equivalently, the higher-rank signature kernel MMD) metrizes the higher-rank adapted weak topology \(\mathcal {T}^{(r)}\) on the space of filtered processes \(\text{FP}\). Although these theoretical results are satisfactory, the numerical computation of these distances is admittedly a highly challenging task, as the Hilbert spaces \(H^{(r)}\) and \(\mathbf {\mathit {H}}^{(r)}_{\text{Sig}}\) are infinite dimensional with a very complicated algebraic structure. In this subsection we introduce two possible algorithms for computing these higher-rank signature distances, which have been validated in numerical experiments, although there is certainly room for further improvement.
To simplify our notation and give a clearer description of the algorithms, let us impose the following assumption throughout:
(A2)
All filtered processes are equipped with the natural filtration, that is, we focus on the subspace \(\text{FP}_{\text{plain}}\) of \(\text{FP}\) given by
We will apply a backward recursion method to derive algorithms that efficiently compute the expected signature \(\mathbb {E}[\Phi _{\text{Sig}}(\mathbf {X})]\) and the expected rank-1 signature \(\mathbb {E}[\Phi ^{(1)}_{\text{Sig}}(\mathbf {X})]\), provided \(\mathbf {X} \in \text{FP}_{\text{plain}}\) is a nowhere recombining (that is, its value at time t uniquely determines \(X_1,\ldots ,X_{t-1}\)) Markov chain with finitely many states. We refer readers to [28, Section 5] for a detailed description of this algorithm; here we only summarize the essential idea and main steps.
Recall that higher-rank signatures take values in some tensor algebras. Therefore, to compute them we need the following lemma concerning dynamic programming in an algebra (see [28, Lemma 3]).
Lemma 2.9
Let A be a (Banach) algebra and\(Z = (Z_t)_{t \in I}\)be a nowhere recombining Markov chain taking values in A with finitely many states. The function
where\(Z_{t+1} \ldots Z_T\)and\(b\,u_{t+1}(b)\)are calculated using the multiplication defined on the algebra A.
If \(\mathbf {X} \in \text{FP}_{\text{plain}}\) is a nowhere recombining Markov chain, then the signature process \(Z_t = \Phi _{\text{Sig}}(\mathbf {X}|_{[0,t]})\) can also be viewed as a nowhere recombining Markov chain taking values in the tensor algebra \(T((\mathbb {R}^d))\). As \(I = \{1,\ldots ,T\}\) is discrete, it is well-known that
where \(\otimes \) is the tensor product on \(T((\mathbb {R}^d))\), \(\exp \) is the tensor exponential on \(T((\mathbb {R}^d))\) such that \(\exp (x) = \sum _{n=0}^\infty \frac {x^{\otimes n}}{n!}\) and \(\Delta _t X = X_t - X_{t-1}\). Hence, Lemma 2.9 hints at an efficient way to compute the expected signature \(\mathbb {E}[\Phi _{\text{Sig}}(\mathbf {X})] := u_0(X_0)\) (we set \(X_0 := 0\)) since the value function u satisfies the recursion
Algorithm 1 Pseudo-code for \( \mathbb {E}[\Phi _{\text{Sig}}(\mathbf {X})]\)
The Markov chain \(\mathbf {X}\) is represented as a rooted tree with root r, where \(p(a,b)\) denotes the transition probability from node a to its child b:

procedure ExpSig0(a):
  if a.children = ∅ then return 1
  sum ← 0
  for b ∈ a.children do
    sum ← sum + p(a, b) · exp(b − a) ⊗ ExpSig0(b)
  return sum

Output: ExpSig0(r) = \(\mathbb {E}[\Phi _{\text{Sig}}(\mathbf {X})]\).
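Under the stated assumptions, Algorithm 1 fits in a few lines once a truncated tensor algebra is available. The tree encoding (children stored as (state, probability) pairs), the truncation depth, and all names below are our own choices:

```python
# Backward recursion for E[Phi_Sig(X)] on a nowhere recombining chain,
# truncating T((R^d)) at level DEPTH. Hypothetical encoding: `children`
# maps a node's state to its (child state, transition probability) pairs.
import numpy as np

DEPTH, D = 2, 1   # truncation level and state-space dimension

def t_mult(a, b):
    """Product of two truncated tensors (lists of levels 0..DEPTH)."""
    out = [np.zeros((D,) * k) for k in range(DEPTH + 1)]
    for k in range(DEPTH + 1):
        for i in range(k + 1):
            out[k] = out[k] + np.multiply.outer(a[i], b[k - i])
    return out

def t_exp(x):
    """Tensor exponential exp(x) = sum_n x^{(x)n}/n! of a level-1 element."""
    levels, cur, fact = [np.array(1.0)], np.array(1.0), 1.0
    for k in range(1, DEPTH + 1):
        cur, fact = np.multiply.outer(cur, x), fact * k
        levels.append(cur / fact)
    return levels

def exp_sig(state, children):
    """ExpSig0: E[exp(Δ X) ⊗ ... | current node] by backward recursion."""
    if state not in children:                 # leaf: empty product = 1
        return t_exp(np.zeros(D))
    out = [np.zeros((D,) * k) for k in range(DEPTH + 1)]
    for child, prob in children[state]:
        term = t_mult(t_exp(np.array(child) - np.array(state)),
                      exp_sig(child, children))
        out = [o + prob * t for o, t in zip(out, term)]
    return out

# one-step tree: X_0 = 0, then X_1 = +-1 with probability 1/2 each
children = {(0.0,): [((1.0,), 0.5), ((-1.0,), 0.5)]}
es = exp_sig((0.0,), children)
# es[1] = E[X_1 - X_0] = 0 and es[2] = E[(X_1 - X_0)^{(x)2}]/2 = 1/2
```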
The computation of the rank-1 signature \(\Phi ^{(1)}_{\text{Sig}}(\mathbf {X})\) follows along the same lines. We now apply Lemma 2.9 with \(A = T((H))\) (recall that H denotes the target space of \(\Phi _{\text{Sig}}(X)\)) and \(Z_t = \exp ({\Delta _t \widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})}) = \exp (\mathbb {E}[\Phi _{\text{Sig}}(X) \vert X_{t}]-\mathbb {E}[\Phi _{\text{Sig}}(X)\vert X_{t-1}])\). Note that the value \(Z_t = a\) depends on \(X_t = x\) and \(X_{t-1} = x^\prime \); since \( X \) is nowhere recombining, the path segment \((X_s)_{s = 1}^t\) with \( X_t = x \) is uniquely determined, so \(X_{t-1} = x^\prime \) is in fact determined by \(X_t = x\). Based on this observation, the function \( v_t(a) := \mathbb {E}[Z_{t+1}\cdots Z_T|Z_t=a ] \) from Lemma 2.9 can be viewed as a function of \(X_t = x\) and therefore satisfies the recursion
where we used the symbol \(*\) to denote the tensor product on \(T((H))\). Of course, the exponential appearing in the above expression is taken with respect to the operation \(*\), that is, for \(h \in H\), \(\exp (h) = \sum _{n=0}^\infty \frac {h^{*n}}{n!}\). One can readily check that \(v_0(X_0) = \mathbb {E}[\Phi ^{(1)}_{\text{Sig}}(\mathbf {X})]\) (with \(X_0 := 0\) as usual) is the quantity we want to compute.
This recursion is more involved as it requires two evaluations of \( \mathbb {E}[\Phi _{\text{Sig}}(X)\vert X_{t}] \) at every step. However, since the process \(\mathbf {X}\) is nowhere recombining, we can rewrite this as the following system
where \(\Phi _{\text{Sig}}(X \vert X_t = y)\) denotes the signature of the path segment \((X_s)_{s = 1}^t\) with \( X_t = y \), which is well defined since \( X \) is nowhere recombining by assumption. Unfortunately, \( v_t(x) \) depends on both \( u_t \) and \( u_{t+1}\), and since multiplication in \(T((H))\) is non-commutative, there is no way to separate the two dependencies. Because of this, \( u_t(x) \) needs to be computed before \( v_t(x) \), and the recursion is best solved by caching the relevant values at every function call. This approach is outlined in Algorithm 2.
Algorithm 2 Pseudo-code for \( \mathbb {E}[\Phi ^{(1)}_{\text{Sig}}(\mathbf {X})]\)
The Markov chain \(\mathbf {X}\) is again represented as a rooted tree with root r, and \(s(a)\) denotes the signature of the sample path ending at node a. The procedure ExpSig1(a) initializes two accumulators \(a_1, a_2\) to zero, iterates over the children b of a, and updates \(a_1\) and \(a_2\) through recursive calls to ExpSig1(b) combined with tensor exponentials and tensor products; the output is ExpSig1(r) = \(\mathbb {E}[\Phi ^{(1)}_{\text{Sig}}(\mathbf {X})]\).
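For intuition, the quantity computed by Algorithm 2 can also be obtained by brute force on a tiny example: enumerate all sample paths, build the conditional expected signature path \(t \mapsto \mathbb {E}[\Phi _{\text{Sig}}(X)|\mathcal {F}_t]\) for each, take its signature, and average. This forgoes the efficiency of the backward recursion and serves only as a sanity check; the truncations, the omission of time augmentation, and all names are ours.

```python
# Brute-force cross-check of the rank-1 expected signature on a two-step
# symmetric walk: inner signature truncated at level 1 (so it flattens to
# a vector in R^2), outer signature truncated at level 2.
import numpy as np

DEPTH, DIM = 2, 2   # outer truncation level; dimension of flattened H

def t_mult(a, b):
    out = [np.zeros((DIM,) * k) for k in range(DEPTH + 1)]
    for k in range(DEPTH + 1):
        for i in range(k + 1):
            out[k] = out[k] + np.multiply.outer(a[i], b[k - i])
    return out

def t_exp(x):
    levels, cur, fact = [np.array(1.0)], np.array(1.0), 1.0
    for k in range(1, DEPTH + 1):
        cur, fact = np.multiply.outer(cur, x), fact * k
        levels.append(cur / fact)
    return levels

def signature(path):
    """Truncated signature of a piecewise-linear path via Chen's identity."""
    sig = t_exp(np.zeros(DIM))               # identity element
    for t in range(1, len(path)):
        sig = t_mult(sig, t_exp(path[t] - path[t - 1]))
    return sig

# symmetric walk X_0 = 0, X_t = X_{t-1} +- 1: a martingale, so at inner
# truncation level 1 the conditional expected signature path flattens to
# M_t = (1, E[X_2 | F_t]) = (1, X_t)
paths = [(0.0, a, a + b) for a in (1.0, -1.0) for b in (1.0, -1.0)]
acc = None
for x in paths:
    M = np.array([[1.0, xt] for xt in x])    # flattened H-valued path
    s = signature(M)
    acc = s if acc is None else [u + v for u, v in zip(acc, s)]
rank1 = [u / len(paths) for u in acc]
# rank1[1] = (0, 0), and the (2,2) entry of rank1[2] equals 1 (the cross
# terms of the increments average out)
```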
The above algorithm based on the backward recursion method provides a practical way to compute higher-rank signatures numerically. However, in this algorithm one needs to truncate the signature features \(\Phi _{\text{Sig}}\) in every step, and it is a notorious property of truncated signatures that their dimension grows exponentially with the truncation level: \(\Phi _{\text{Sig},k}(\mathbf {\mathit {x}}) = (1, \int d\mathbf {\mathit {x}}, \ldots , \int d\mathbf {\mathit {x}}^{\otimes k}) \in (\mathbb {R}^d)^{1 + d + \ldots + d^k}\); the situation gets worse if one truncates the higher-rank signatures \(\Phi ^{(1)}_{\text{Sig}}\). On the other hand, in many applications (see Sect. 3) we are more interested in the higher-rank signature distance between two filtered processes than in the higher-rank signature itself. We can therefore make use of the inner product structure of the target spaces of the higher-rank signatures to design a more efficient algorithm for computing these distances, which are \(\mathbb {R}\)-valued objects, as we show next.
Method 2: Conditional KME and Signature Kernel PDE Trick
The aim of this algorithm is to compute the rank-1 signature kernel MMD \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) for \(\mathbf {X}, \mathbf {Y} \in \text{FP}_{\text{plain}}\). This algorithm can be naturally extended to compute higher-rank signature MMD. In contrast to the first algorithm, we will employ here techniques from kernel learning extensively. The statements below are mainly borrowed from [73] and [47].
The starting point of this algorithm is the property from Proposition 2.8 that the square of \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) admits the representation
where \(Z,Z^\prime ,\hat Z,\hat Z^\prime \) are independent with \(Z,Z^\prime \sim \mathrm {pp}^{(1)}(\mathbf {X})\) and \(\hat Z,\hat Z^\prime \sim \mathrm {pp}^{(1)}(\mathbf {Y})\).
Thus, to compute \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) it suffices to compute the three inner-product expectations on the right-hand side of the above equation separately. To avoid notational excess, we assume from now on, without loss of generality, that \(\mathbf {X}\) and \(\mathbf {Y}\) live on a common probability space such that \(Z = \mathrm {pp}^{(1)}(\mathbf {X})\) and \(\hat Z = \mathrm {pp}^{(1)}(\mathbf {Y})\) are independent. As an example, let us compute \(\mathbb {E}[\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X}), \mathrm {pp}^{(1)}(\mathbf {Y}))]\). We decompose the numerical solution into the following four steps; for more details on this algorithm, we refer readers to [73] or [47]. Let us first summarize the main ideas in the following informal style.
Informal Description
Recall that what we need to compute is the expectation of the rank-1 signature kernel evaluated at the prediction processes \(\mathrm {pp}^{(1)}(\mathbf {X})\) and \(\mathrm {pp}^{(1)}(\mathbf {Y})\), namely \(\mathbb {E}[\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X}), \mathrm {pp}^{(1)}(\mathbf {Y}))]\). Suppose samples \(\hat x^i \sim \mathrm {law}(\mathbf {X})\), \(i=1,\ldots ,M\), and \(\hat y^j \sim \mathrm {law}(\mathbf {Y})\), \(j=1,\ldots ,N\), are given. We can then use the empirical measures \(\mu ^M := \frac {1}{M}\sum _{i=1}^M \delta _{\hat x^i}\) and \(\nu ^N := \frac {1}{N}\sum _{j=1}^N\delta _{\hat y^j}\) to construct an empirical estimator
for \(\mathbb {E}[\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X}), \mathrm {pp}^{(1)}(\mathbf {Y}))]\). Based on this observation, our task boils down to the numerical computation of the inner product \(\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X})(\hat x^i), \mathrm {pp}^{(1)}(\mathbf {Y})(\hat y^j))\) for each pair \((i,j)\). A well-known result is that for \(\mathbb {R}^d\)-valued piecewise smooth paths x and y defined on \([0,T]\) (recall that in our case all paths are discrete on \(0,1,\ldots ,T\), and we view them as piecewise-linear paths on \([0,T]\) by linear interpolation), the signature kernel \(\mathbf {\mathit {k}}_{\text{Sig}}(x,y)\) is the terminal value \(v^{x,y}(T,T)\) of the unique solution \(v^{x,y}(\cdot ,\cdot )\) to the Goursat problem
see e.g. [30] or [57, Section 7.2, PDE Solver]. Actually, by the very definition of the rank-1 signature kernel \(\mathbf {\mathit {k}}_{\text{Sig}}^{(1)}(\cdot ,\cdot )\), which is induced by the signature of paths with values in a Hilbert space H, the same computation as in the classical case above shows that \(\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X})(\hat x^i), \mathrm {pp}^{(1)}(\mathbf {Y})(\hat y^j))\) is the terminal value \(u^{i,j}(T,T)\) of the unique solution \(u^{i,j}\) to the Goursat problem
where \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X}) = \mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X) | \mathcal {F}^{\mathbf {X}}_t]\) and \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y}) = \mathbb {E}_{\mathbf {Y}}[\Phi _{\text{Sig}}(Y) | \mathcal {F}^{\mathbf {Y}}_t]\) are conditional expectations of the H-valued random variables \(\Phi _{\text{Sig}}(X)\) and \(\Phi _{\text{Sig}}(Y)\), respectively, which can be viewed as functions on the path space \((I \to \mathbb {R}^d)\) since the filtrations \(\mathcal {F}^{\mathbf {X}}\) and \(\mathcal {F}^{\mathbf {Y}}\) are natural. Clearly, if we can compute the inner product \(\big \langle \frac {\partial }{\partial t}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \frac {\partial }{\partial s}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j)\big \rangle _{H}\), then we can solve the above Goursat-type PDE and obtain the desired quantity \(\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X})(\hat x^i), \mathrm {pp}^{(1)}(\mathbf {Y})(\hat y^j))\). Hence, our goal is to numerically approximate this inner product. Since \(\widetilde {\mathrm {pp}}^{(1)}_t(\mathbf {X}) = \mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X) | \mathcal {F}^{\mathbf {X}}_t]\) and \(\widetilde {\mathrm {pp}}^{(1)}_t(\mathbf {Y}) = \mathbb {E}_{\mathbf {Y}}[\Phi _{\text{Sig}}(Y) | \mathcal {F}^{\mathbf {Y}}_t]\), we need to find a consistent estimator of these conditional expectations in terms of the data \((\hat x^i)_{1\le i\le M}\) and \((\hat y^j)_{1\le j\le N}\).
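Before turning to the estimation of the conditional expectations, the PDE trick itself can be demonstrated in the rank-0 case. The sketch below uses a simple explicit finite-difference scheme (a simplified stand-in for the higher-order solvers of [30]); for the one-dimensional linear paths \(x_t = y_t = t\) on \([0,1]\) the exact kernel value is \(\sum _{k \ge 0} 1/(k!)^2 = I_0(2) \approx 2.2796\), which the scheme recovers:

```python
import numpy as np

def sig_kernel_pde(x, y):
    """Signature kernel of two piecewise-linear paths via the Goursat PDE.

    x: (P, d) array, y: (Q, d) array of path values on a time grid.
    Simple explicit finite differences; see [30] for higher-order schemes."""
    dx = np.diff(x, axis=0)            # increments of x, shape (P-1, d)
    dy = np.diff(y, axis=0)            # increments of y, shape (Q-1, d)
    A = dx @ dy.T                      # A[p, q] = <dx_p, dy_q>
    P, Q = A.shape
    u = np.ones((P + 1, Q + 1))        # boundary data u(0, .) = u(., 0) = 1
    for p in range(P):
        for q in range(Q):
            u[p + 1, q + 1] = (u[p + 1, q] + u[p, q + 1] - u[p, q]
                               + A[p, q] * (u[p + 1, q] + u[p, q + 1]) / 2)
    return u[-1, -1]
```

The scheme is symmetric in its two arguments, and refining either time grid improves the approximation.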
From the above discussions, we can conclude that the algorithm of computing the rank-1 signature kernel MMD \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) consists of the following subtasks:
1.
Use samples \((\hat x^i)_{1\le i\le M} \sim \mathrm {law}(\mathbf {X})\) and \((\hat y^j)_{1\le j\le N} \sim \mathrm {law}(\mathbf {Y})\) to construct empirical estimators for conditional kernel means \(\widetilde {\mathrm {pp}}^{(1)}_t(\mathbf {X}) = \mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X) | \mathcal {F}^{\mathbf {X}}_t]\) and \(\widetilde {\mathrm {pp}}^{(1)}_t(\mathbf {Y}) = \mathbb {E}_{\mathbf {Y}}[\Phi _{\text{Sig}}(Y) | \mathcal {F}^{\mathbf {Y}}_t]\). See Step 1 and Step 2 below.
2.
Apply the empirical estimators constructed in the previous steps to approximate the inner product \(\big \langle \frac {\partial }{\partial t}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \frac {\partial }{\partial s}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j)\big \rangle _{H}\), so that one can find a numerical solution of the Goursat problem
for each pair of samples \((\hat x^i, \hat y^j)\). The resulting quantity \(u^{i,j}(T,T)\) is an approximation of the rank-1 signature kernel \(\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X})(\hat x^i), \mathrm {pp}^{(1)}(\mathbf {Y})(\hat y^j))\). From this we can build an empirical estimator \(\hat {\mathcal {D}}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y}) = \frac {1}{NM}\sum _{i=1}^M\sum _{j=1}^N u^{i,j}(T,T)\) for \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\). See Step 3 and Step 4 below.
3.
It is interesting to notice that thanks to the concrete form of the estimators of \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X}) = \mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X) | \mathcal {F}^{\mathbf {X}}_t]\) and \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y}) = \mathbb {E}_{\mathbf { Y}}[\Phi _{\text{Sig}}(Y) | \mathcal {F}^{\mathbf {Y}}_t]\), the computation of the inner product \(\big \langle \frac {\partial }{\partial t}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \frac {\partial }{\partial s}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j) \big \rangle _H\) on the Hilbert space H can be reduced to finding numerical solutions of Goursat PDEs on \(\mathbb {R}^d\), namely solving
where \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X}) = \mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X)|\mathcal {F}^{\mathbf {X}}_t], t \in I\).
Step 1: Estimate the Conditional Kernel Mean Embedding
Our initial task is to estimate the conditional kernel mean \(\mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X)|\mathcal {F}^{\mathbf {X}}_t] \in H\) for each \(t \in I\). To do this, we adopt the conditional kernel mean embedding scheme proposed in [65]. Roughly speaking, as \(\mathcal {F}^{\mathbf {X}}_t = \sigma (X_s, s\le t)\) is the natural filtration generated by X, there is a measurable function \(F_t: (I_t \to \mathbb {R}^d) \to H\) such that
where \(I_t = \{1,\ldots ,t\}\), and \(F_t\) is the minimizer of the loss function \(\mathbb {E}_{\mathbf {X}}[\|\Phi _{\text{Sig}}(X) - F(X|_{I_t})\|_{H}^2]\) among all measurable functions \(F \in L^2((I_t \to \mathbb {R}^d), H, \mathbb {P}^{\mathbf {X}} \circ (X|_{I_t})^{-1})\).
Step 2: Solve Minimization Problem via Empirical Data
Usually, one uses the unique minimizer of the so-called regularized empirical surrogate loss with regularization parameter \(\lambda _M >0\), given by
as a good estimator for \(F_t\), where \(\hat x^i, i = 1,\ldots ,M\) are i.i.d. sample paths of \(\mathrm {law}(X)\), and \(\mathscr {G}^t \subset L^2((I_t \to \mathbb {R}^d), H, \mathbb {P}^{\mathbf {X}} \circ (X|_{I_t})^{-1})\) is a properly chosen Hilbert space which is dense in \(L^2((I_t \to \mathbb {R}^d), H, \mathbb {P}^{\mathbf {X}} \circ (X|_{I_t})^{-1})\). Here we remark that such a special construction of \(\mathscr {G}^t\) will significantly simplify the minimization problem of the functional \(\hat {\mathcal {E}}_{X|_{I_t}, M, \lambda _M}(F)\), and the concrete construction of \(\mathscr {G}^t\) is completely based on the universality of the signature.
As the next step, it can be shown (see [65]) that, given samples \(\hat x^i, i =1,\ldots ,M\), the unique minimizer \(\hat F_t^{(\hat x^i)_{i=1,\ldots ,M}} \in \mathscr {G}^t\) (for simplicity written \(\hat F_t\) below) of the regularized empirical surrogate loss with regularization parameter \(\lambda _M\) takes the form
where \(\mathbf {\mathit {k}}_{\text{Sig}}^{X,t}(\cdot ) = (\mathbf {\mathit {k}}_{\text{Sig}}(\hat x^i|_{I_t}, \cdot ))_{i=1,\ldots ,M}:(I_t \to \mathbb {R}^d) \to \mathbb {R}^M\) is an \(\mathbb {R}^M\)-valued (continuous) function on the path space \((I_t \to \mathbb {R}^d)\), \({\mathbf {K}}^X_{t,t}= ([{\mathbf {K}}^X_{t,t}]_{i,j})_{i,j=1,\ldots ,M}\) is an \(M \times M\) matrix with entries \([{\mathbf {K}}^X_{t,t}]_{i,j} = \mathbf {\mathit {k}}_{\text{Sig}}(\hat x^i|_{I_t}, \hat x^j|_{I_t}) = \langle \Phi _{\text{Sig}}(\hat x^i), \Phi _{\text{Sig}}(\hat x^j) \rangle _H\), \(\Phi _{\text{Sig}}^{X,T}= (\Phi _{\text{Sig}}(\hat x^i))_{i=1,\ldots ,M}\) is a vector in \(H^M\) and \(\text{I}_M\) is the \(M \times M\)-identity matrix. Clearly, for a given \(x|_{I_t} \in (I_t \to \mathbb {R}^d)\), we first evaluate the function \(\mathbf {\mathit {k}}_{\text{Sig}}^{X,t} (\cdot )\) against it and obtain an \(\mathbb {R}^M\) row vector \((\langle \Phi _{\text{Sig}}(\hat x^i|_{I_t}), \Phi _{\text{Sig}}(x|_{I_t}) \rangle _H)_{i=1,\ldots ,M}\), and then multiply it with the \(M \times M\)-matrix \(({\mathbf {K}}^X_{t,t} + M\lambda _M \text{I}_M)^{-1}\) to get another \(\mathbb {R}^M\) row vector \(\mathbf {\mathit {k}}_{\text{Sig}}^{X,t} (x|_{I_t}) ({\mathbf {K}}^X_{t,t} + M\lambda _M \text{I}_M)^{-1} := (v_1,\ldots ,v_M)\); finally, we multiply the latter vector \((v_1,\ldots ,v_M)\) with the \(H^M\)-vector \(\Phi ^{X,T}_{\text{Sig}} = (\Phi _{\text{Sig}}(\hat x^i))_{i=1,\ldots ,M}\) to get a vector \(\sum _{i=1}^M v_i \Phi _{\text{Sig}}(\hat x^i)\) in the Hilbert space H. The above explains why \(\hat F_t(\cdot )\) defined in (16) is an H-valued function on \((I_t \to \mathbb {R}^d)\).
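The linear algebra behind (16) can be sketched with a generic kernel in place of the signature kernel; in the snippet below the Gaussian kernel, the scalar "paths", and the two-dimensional feature map are all illustrative stand-ins. The weights \((v_1,\ldots ,v_M)\) come from a single linear solve, and for a tiny regularization parameter the estimator interpolates the training features:

```python
import numpy as np

def kme_weights(K, k_vec, lam):
    """Weights (v_1, ..., v_M) of the estimator (16):
    v = k_vec (K + M * lam * I)^{-1}, so F_hat(x) = sum_i v_i Phi(x_i).
    K is symmetric, so one symmetric solve suffices."""
    M = K.shape[0]
    return np.linalg.solve(K + M * lam * np.eye(M), k_vec)

def rbf(a, b, s=0.15):
    # Stand-in kernel; the actual algorithm uses the signature kernel here.
    return np.exp(-(a - b) ** 2 / (2 * s * s))

x = np.linspace(-1.0, 1.0, 8)          # toy "training paths" (scalars)
K = rbf(x[:, None], x[None, :])        # Gram matrix, role of K^X_{t,t}
Phi = np.stack([x, x ** 2], axis=1)    # toy finite-dimensional feature map

v = kme_weights(K, rbf(x, x[3]), 1e-12)
F_hat_at_x3 = v @ Phi                  # estimated conditional feature at x[3]
```

With \(\lambda _M \to 0\) and an invertible Gram matrix, the weight vector at a training point collapses to a standard basis vector, i.e. \(\hat F_t(\hat x^i) \approx \Phi (\hat x^i)\).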
Step 3: Construction of Empirical Estimator for Higher-Rank Signature MMD
Given i.i.d. sample paths \((\hat x^i)_{i \le M}\) of \(\mathrm {law}(X)\), from Step 2 we can define the H-valued stochastic process \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})_t := \hat F_t^{(\hat x^i)_{i=1,\ldots ,M}}(\cdot ), t \in I\), which serves as an empirical estimator for \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})\). Similarly, given i.i.d. sample paths \((\hat y^j)_{j \le N}\) of \(\mathrm {law}(Y)\), we define the H-valued stochastic process \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})_t = \hat F^{(\hat y^j)_{j=1,\ldots ,N}}_t(\cdot ), t \in I\), which is the empirical estimator for \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y})\). Then, for the empirical measures \(\hat \mu ^M := \frac {1}{M}\sum _{i=1}^M\delta _{\hat x^i}\) and \(\hat \nu ^N := \frac {1}{N}\sum _{j=1}^N \delta _{\hat y^j}\), we define
(note that here \(\Phi _{\text{Sig}}(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X}))\) denotes the signature of the H-valued process \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})\)) and use it as an empirical estimator for the rank-1 signature kernel MMD \(\mathcal {D}_{\text{Sig}}^{(1)}(\mathbf {X},\mathbf {Y})\), given samples \((\hat x^i)_{i=1,\ldots ,M}\) and \((\hat y^j)_{j=1,\ldots ,N}\). The next theorem justifies the validity of this estimator: it yields an accurate estimate of \(\mathcal {D}_{\text{Sig}}^{(1)}(\mathbf {X},\mathbf {Y})\) provided the sample sizes M and N are large. The corresponding proof can be found in [73, Appendix, Theorem 7].
Theorem 2.10
For\(\mathbf {X},\mathbf {Y} \in \mathit{\text{FP}}_{\mathit{\text{plain}}}\), \(\hat {\mathcal {D}}^{(1)}_{\mathit{\text{Sig}}}(\mathbf {X},\mathbf {Y})\)is a consistent estimator of the rank-1 robust signature kernel MMD, i.e.,
in probability as\(M,N \to \infty \), provided that the regularization parameters\(\lambda _M\), \(\lambda _N\), cf. (16), decay to 0 slower than\(\mathcal {O}(M^{-1/2})\)and\(\mathcal {O}(N^{-1/2})\), respectively.
Step 4: Evaluation of Higher-Rank Signature Kernels: A PDE Trick
Thanks to Theorem 2.10, the approximation of the rank-1 signature MMD \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) boils down to computing its empirical counterpart \(\hat {\mathcal {D}}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\). Starting from the expression
and inserting the expressions of the empirical measures \(\hat \mu ^M = \frac {1}{M}\sum _{i=1}^M\delta _{\hat x^i}\) and \(\hat \nu ^N = \frac {1}{N}\sum _{j=1}^N \delta _{\hat y^j}\) into the above expectation, we obtain that
Since this identity holds for every pair of indices \((i,j)\), the task of computing this empirical estimator is in turn reduced to evaluating the empirical version of the rank-1 signature kernels
of the H-valued paths \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i)\) and \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j)\): each such kernel is the terminal value \(u^{i,j}(T,T)\) of the unique solution to the following hyperbolic PDE, also called the Goursat problem:
where we consider \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i)\) and \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j)\) as continuous H-valued paths via linear interpolation. Furthermore, in view of (16) and the fact that \(\mathbf {\mathit {k}}_{\text{Sig}}(\hat x^k,\hat y^\ell ) = \langle \Phi _{\text{Sig}}(\hat x^k), \Phi _{\text{Sig}}(\hat y^\ell ) \rangle _H\), using some linear algebra calculations we can readily check for \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})_t(\hat x^i) = \hat F^{(\hat x^k)_{k \le M}}_t(\hat x^i)\) and \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})_s(\hat y^j) = \hat F^{(\hat y^\ell )_{\ell \le N}}_s(\hat y^j)\) that
Now, applying again the signature kernel PDE trick to the signature kernels above, we see that \(\mathbf {\mathit {k}}_{\text{Sig}}(\hat x^k,\hat y^\ell )\) is the unique solution \(\tilde u^{k,\ell }(T,T)\) of the Goursat problem
Similarly, the quantity \(\mathbf {\mathit {k}}_{\text{Sig}}(\hat x^k|_{I_t},\hat x^i|_{I_t})\) can be computed by finding the terminal value of \(v_{k,i}(t,t)\) with \(v_{k,i}\) being the unique solution to the Goursat problem
As a result, by solving a family of Goursat PDEs we can compute the above vectors and matrices; inserting them back into (20) yields an estimate of the inner product \( \big \langle \widehat {\mathrm {pp}}^{(1)}(\mathbf {X})_t(\hat x^i), \widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})_s(\hat y^j)\big \rangle _{H}\). This gives us the coefficients of PDE (19), which we can then solve to obtain the rank-1 signature kernel \(\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j) )\) for each pair \((i,j)\). Finally, in view of the polarization formula (18), we can use \(\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j) )\) to compute the squared Hilbert norms \(\|\Phi _{\text{Sig}}(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X}))(\hat x^i) - \Phi _{\text{Sig}}(\widehat {\mathrm {pp}}^{(1)}(\mathbf {Y}))(\hat y^j)\|_{H^{(2)}}^2\) for each pair \((i,j)\), which in turn allows us to obtain the sum
Here, we remark that one can use the finite difference method from [30, Section 3.1] to solve the above Goursat problem, see Algorithm 3 below. Note that therein f denotes the solution operator of this scheme, which returns the update for each step \((s_p, t_q) \to (s_{p+1}, t_{q+1})\), where \((s_p)_p\) and \((t_q)_q\) are discretizations of \([0,T]\).
The whole algorithm can be summarized via the following pseudo-code. We remark that in this algorithm the distance \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\), a priori a distance between objects in an infinite-dimensional Hilbert space, can be efficiently estimated by solving a (possibly large) system of low-dimensional hyperbolic PDEs (i.e., Goursat problems), whose coefficients depend on the chosen sample paths \(\hat x^i \sim \mathrm {law}(X)\) and \(\hat y^j \sim \mathrm {law}(Y)\).
Algorithm 3 (PDE solver): given an input matrix \(M \in \mathbb {R}^{T \times T}\), it returns the solution matrix \(u\) of the Goursat problem. The boundary entries of \(u\) are initialized to 1, and the scheme iterates over indices p and q, updating \(u\) via the solution operator f applied to neighbouring entries of \(u\) and \(M\). Depending on the flag "full", it returns either the full matrix \(u\) or only the terminal entry \(u[-1,-1]\).
Algorithm (ZerothRankGram): given sample paths \(\{\hat x^i\}_{i=1}^M \sim \mathrm {law}(X)\) and \(\{\hat y^j\}_{j=1}^N \sim \mathrm {law}(Y)\) and a Boolean flag "full", it forms the matrix \(M[i,j,p,q] = \langle \hat x^i_{s_p}, \hat y^j_{t_q}\rangle \) of pairwise inner products, applies the PDE solver to each pair \((i,j)\), and returns the Gram tensor \(G \in \mathbb {R}^{M \times N \times P \times Q}\) with \(G[i,j,p,q] = \mathbf {\mathit {k}}_{\text{Sig}}(\hat x^i|_{s_p}, \hat y^j|_{t_q})\), or only \(G[:,:,-1,-1]\) if "full" is False.
Algorithm (FirstRankGram): given Gram matrices \(G_{XX}, G_{XY}, G_{YY}\) and a regularization hyperparameter \(\lambda \), it computes an empirical estimator of the tensor \(M \in \mathbb {R}^{M \times N \times T \times T}\) of inner products of the estimated prediction processes. Looping over p and q from 1 to T, it forms the weight matrices \(W_X\) and \(W_Y\) via matrix inversions and multiplications (cf. (16)) and assembles \(M[i,j,p,q]\) from them.
Algorithm (rank-1 signature kernel MMD): given sample paths \(\{\hat x^i\}_{i=1}^M \sim \mathrm {law}(X)\) and \(\{\hat y^j\}_{j=1}^N \sim \mathrm {law}(Y)\) and a hyperparameter \(\lambda \), compute \(G_{XX} \leftarrow \text{ZerothRankGram}(\{\hat x^i\}, \{\hat x^i\}, \text{full}=\text{True})\), and analogously \(G_{XY}\) and \(G_{YY}\); then \(G^1_{XX} \leftarrow \text{FirstRankGram}(G_{XX}, G_{XX}, G_{XX}, \lambda )\), \(G^1_{XY} \leftarrow \text{FirstRankGram}(G_{XX}, G_{XY}, G_{YY}, \lambda )\) and \(G^1_{YY} \leftarrow \text{FirstRankGram}(G_{YY}, G_{YY}, G_{YY}, \lambda )\); finally return \(\mathrm {avg}(G^1_{XX}) - 2\,\mathrm {avg}(G^1_{XY}) + \mathrm {avg}(G^1_{YY})\) as the empirical estimator \(\hat {\mathcal {D}}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\).
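The final return statement has a direct rank-0 analogue which fits in a few lines. The following self-contained sketch (illustrative only; the rank-1 version would additionally route the Gram matrices through the conditional KME step of FirstRankGram) assembles the biased estimator \(\mathrm {avg}(G_{XX}) - 2\,\mathrm {avg}(G_{XY}) + \mathrm {avg}(G_{YY})\) from Goursat-PDE evaluations of the ordinary signature kernel:

```python
import numpy as np

def sig_kernel(x, y):
    # Goursat-PDE evaluation of the (rank-0) signature kernel of two
    # piecewise-linear paths; explicit finite differences.
    dx, dy = np.diff(x, axis=0), np.diff(y, axis=0)
    A = dx @ dy.T                       # A[p, q] = <dx_p, dy_q>
    P, Q = A.shape
    u = np.ones((P + 1, Q + 1))         # boundary data u(0, .) = u(., 0) = 1
    for p in range(P):
        for q in range(Q):
            u[p + 1, q + 1] = (u[p + 1, q] + u[p, q + 1] - u[p, q]
                               + A[p, q] * (u[p + 1, q] + u[p, q + 1]) / 2)
    return u[-1, -1]

def gram(xs, ys):
    # Terminal-value Gram matrix (the full = False case of ZerothRankGram).
    return np.array([[sig_kernel(x, y) for y in ys] for x in xs])

def sig_mmd2(xs, ys):
    """Biased empirical squared signature-kernel MMD:
    avg(G_XX) - 2 avg(G_XY) + avg(G_YY), rank-0 analogue of the
    final return statement of the pseudo-code."""
    return gram(xs, xs).mean() - 2 * gram(xs, ys).mean() + gram(ys, ys).mean()
```

By construction the biased estimator is a squared RKHS norm, so it vanishes when the two sample sets coincide and is positive when the empirical embeddings differ.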
3 Applications
In this section, we present two concrete applications of higher-rank (expected) signatures to problems related to adapted weak topologies.
3.1 Classification of Filtered Processes in Adapted Weak Topologies
This subsection is mainly borrowed from [28, Section 5.1], which showed a significant application of the higher-rank expected signatures to distinguish stochastic processes in hypothesis testing. The Python code of the algorithm is available at: https://github.com/PatricBonnier/Higher-rank-signature-regression.
First recall that expected signatures (the expectations of \(\Phi _{\text{Sig}}\) in our notation) are currently explored in machine learning applications, since they provide a hierarchical description of the law of a stochastic process; in the terminology of statistical learning, the signature map is a so-called "universal and characteristic" feature map. However, the expected signature can only be used to metrize weak convergence and completely ignores the filtration. We now use the algorithms from the previous section to demonstrate that already in an arguably very simple classification task, the use of higher-rank expected signatures can be beneficial.
Define two processes \(\mathbf {X}, \mathbf {Y} \in \text{FP}\) with their natural filtrations as follows7
where c denotes a constant and \( N_1, N_2, M_1, M_2 \) are independent binomial random variables, and \( \varepsilon > 0 \) is fixed. Note that for a fixed c and small \(\varepsilon \) the processes \({\mathbf {X}}^{c}\) and \({\mathbf {Y}}^{c}\) have similar laws, despite being essentially different in the rank-1 adapted weak topology, analogous to Example 1.3. We now endow \(\text{FP}\) with a probability measure \(\mu \) as follows: a sample from \(\mu \) consists of sampling \(C \sim N(0,1)\) and then selecting with probability \(0.5\) the process \({\mathbf {X}}^C\) and with probability \(0.5\) the process \({\mathbf {Y}}^C\). Figure 2 shows one sample trajectory for each sample from \(\mu \).
Fig. 2
The sample paths of \( \mu \) are plotted above for \( \varepsilon = 0.01 \). They are marked as X in red or Y in green, depending on whether the sample comes from \( {\mathbf {X}}^C \) or \( {\mathbf {Y}}^C \)
We ran the following experiment: we sampled 1000 processes from \(\mu \) and labelled each according to whether \({\mathbf {X}}^{c}\) or \({\mathbf {Y}}^{c}\) was sampled. We then compared their distances in the rank-0 signature metric \(d_{\text{Sig}}\) (based on the usual signature) and in the rank-1 signature distance \(d^{(1)}_{\text{Sig}}\) (based on the rank-1 signature), truncated at level 6 and level 3 respectively, using Algorithm 1. The data was then split into a training set and a test set, both of size 500, and for \(50 \le m \le 500\) a support vector machine classifier was trained on m data points from the training set with the signature and the rank-1 signature as features.
Figure 3 shows the accuracies of the resulting classifiers on the test set. Observe that for small values of \(\varepsilon \) the classifier based on the usual signature feature (i.e., the criterion is the weak topology) is essentially guessing, and even for larger values it does not converge well. The classifier based on the rank-1 signature feature (i.e., the criterion is the rank-1 adapted weak topology), however, converges immediately, which is to be expected, as the rank-1 signature separates \({\mathbf {X}}^{c}\) and \({\mathbf {Y}}^{c}\) independently of \(\varepsilon \).
Fig. 3
The accuracies of the linear classifiers trained on \( \Phi _0(X) \) (classification based on the weak topology) and \( \Phi _1(X) \) (classification based on the rank-1 adapted weak topology) are plotted against the number of samples used, in green and red respectively. Solid lines are used for \( \varepsilon = 10^{-4} \) and dashed lines for \( \varepsilon = 10^{-2} \)
We emphasize that although this is a toy example, it demonstrates how the expected signature can fail to pick up essential properties of a model, and that higher-rank (expected) signatures provide additional features that linearize complex dependencies between law and filtration.
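The experimental protocol, stripped of all signature computations, can be sketched as follows. Everything below is synthetic and illustrative: placeholder Gaussian features play the role of the rank-0 and rank-1 signature features, and a nearest-centroid rule stands in for the SVM; the point is only the train/test pattern, not the numbers of Fig. 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_features(n, sep):
    # Placeholder features: class 1 is shifted by `sep` in every coordinate.
    # A small `sep` mimics barely-separating (rank-0-like) features, a large
    # `sep` mimics well-separating (rank-1-like) features.
    labels = rng.integers(0, 2, n)
    feats = rng.standard_normal((n, 4)) + sep * labels[:, None]
    return feats, labels

def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    # Minimal stand-in for the SVM classifier used in the experiment.
    c0 = train_x[train_y == 0].mean(axis=0)
    c1 = train_x[train_y == 1].mean(axis=0)
    pred = (np.linalg.norm(test_x - c1, axis=1)
            < np.linalg.norm(test_x - c0, axis=1)).astype(int)
    return (pred == test_y).mean()

xw, yw = make_features(1000, sep=0.05)   # "weak-topology" features
xa, ya = make_features(1000, sep=2.0)    # "rank-1 adapted" features
acc_weak = nearest_centroid_accuracy(xw[:500], yw[:500], xw[500:], yw[500:])
acc_adapted = nearest_centroid_accuracy(xa[:500], ya[:500], xa[500:], ya[500:])
```

As in the experiment, the classifier built on barely-separating features hovers near chance level, while the one built on well-separating features is accurate already with modest training sizes.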
3.2 Optimal Stopping via Adapted Distribution Regression
In this subsection, which is a shortened version of [47, Section 3], we will briefly show how to employ the higher-rank signatures in a kernel regression approach to learn the value functions of optimal stopping problems. By Theorem 2.6 we know that the rank-r signature kernel MMD \(\mathcal {D}^{(r)}_{\text{Sig}}(\cdot ,\cdot )\) metrizes the rank-r adapted weak topology, for \(r \in \mathbb N_0\). This fact together with a classical result from kernel learning implies the following important and useful theorem, whose proof can be found in [47, Appendix, proof of Theorem 2.14]:
Theorem 3.1
Let\(\sigma > 0\)be fixed and assume all filtered processes take values in a compact subset\(\mathcal {K} \subset \mathbb {R}^d\). For each\(r \in \mathbb {N}_0\)we define the kernel\(K^{(r)}_{\mathit{\text{Sig}}}: \mathit{\text{FP}} \times \mathit{\text{FP}}\to \mathbb {R}\),
Then the RKHS generated by\(K^{(r)}_{\mathit{\text{Sig}}}\)is dense in the space\(C_b((\mathit{\text{FP}},\mathcal {T}^{(r)}))\)w.r.t. the uniform topology. In particular, if\(v(\cdot ) \in C_b((\mathit{\text{FP}},\mathcal {T}^{(r)}))\), then for any\(\varepsilon > 0\), there exist an\(M \in \mathbb {N}\), \({\mathbf {Y}}^i \in \mathit{\text{FP}}\)and scalars\(a_i \in \mathbb {R}\), \(i=1,\ldots ,M\), such that
As the construction of \(K^{(r)}_{\text{Sig}}\) is based on the rank-r signature kernel MMD which is in turn induced by the rank-r signature kernel \( {\mathbf{k}}^{(r)}_{\text{Sig}}\), by an abuse of notation we will also call \(K^{(r)}_{\text{Sig}}\) the rank-r signature kernel.
Since the rank-0 adapted weak topology is exactly the weak topology, the above theorem yields that any weakly continuous bounded function \(v(\cdot )\) on \(\text{FP}\) can be approximated by (linear) regression on the classical signature kernel \(K^{(0)}_{\text{Sig}}(\mathbf {X}, \mathbf {Y}) = \exp (-\sigma ^2 \mathcal {D}_{\text{Sig}}(\mathbf {X}, \mathbf {Y})^2)\). Such an approach is called distribution regression (via the signature kernel) in [58]. This regression scheme is of particular interest in mathematical finance: suppose that \(v: \text{FP} \to \mathbb {R}\) is the pricing functional of a bounded payoff of some path-dependent financial derivative (e.g., a European option), which is certainly weakly continuous, and we are given a set of pairs \(({\mathbf {Y}}^j, y^j) \in \text{FP} \times \mathbb {R}, j =1,\ldots , M\), where \(y^j = v({\mathbf {Y}}^j)\) is the quoted fair price under model \({\mathbf {Y}}^j\). Then one can perform the above signature kernel regression on the data set \(({\mathbf {Y}}^j, y^j)_{j=1,\ldots ,M}\) to determine the coefficients \(a_j, j= 1,\ldots , M\) such that
Moreover, if \(\mathbf {X} \in \text{FP}\) is a new financial model, to compute the fair price \(v(\mathbf {X})\) on the market \(\mathbf {X}\), one just needs to compute the signature kernel MMD \(\mathcal {D}_{\text{Sig}}(\cdot , {\mathbf {Y}}^j)^2\) for all \(j=1,\ldots ,M\) by the second algorithm in the last section, and then use \(\sum _{j=1}^M a_j \exp (-\sigma ^2 \mathcal {D}_{\text{Sig}}(\cdot , {\mathbf {Y}}^j)^2)\) as an approximating price (with an error up to \(\varepsilon \)).
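The regression step itself is elementary once the MMD matrix is available. In the sketch below the distances and prices are synthetic (the function names, the toy distance indexed by a volatility parameter, and the linear "price curve" are all illustrative, not taken from [58]); the coefficients \(a_j\) are obtained by a regularized linear solve, and a new model is then priced by evaluating \(\sum _j a_j \exp (-\sigma ^2 \mathcal {D}_{\text{Sig}}(\cdot , {\mathbf {Y}}^j)^2)\):

```python
import numpy as np

def fit_coefficients(D2, prices, sigma=10.0, reg=1e-8):
    # D2[i, j]: squared MMD between training models i and j (synthetic here;
    # in the text these would come from the Goursat-PDE algorithm).
    K = np.exp(-sigma**2 * D2)
    return np.linalg.solve(K + reg * np.eye(len(K)), prices)

def predict(d2_new, coeffs, sigma=10.0):
    # d2_new[j]: squared MMD between the new model and training model j.
    return np.exp(-sigma**2 * d2_new) @ coeffs

# Toy family of models indexed by volatility, with a made-up distance.
vols = np.linspace(0.1, 0.5, 5)
D2 = (vols[:, None] - vols[None, :]) ** 2
prices = 10.0 * vols + 1.0            # illustrative "fair prices"
a = fit_coefficients(D2, prices)
```

With a small regularizer the fitted expansion reproduces the training prices, and pricing a new model only requires its vector of squared MMDs to the training models.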
Now let us consider the value function in the optimal stopping problem, or, in the context of mathematical finance, the pricing functional of American-type options,
for a given (bounded and continuous) function \(c: (I \to \mathbb {R}^d) \times I \to \mathbb {R}\). From Example 1.3 we know that in general it is not weakly continuous, whilst by Lemma 1.9 we know that \(v_c(\cdot )\) is continuous for \(\mathcal {T}^{(T-1)}\). In fact, if we restrict to filtered processes with natural filtration, then on \(\text{FP}_{\text{plain}}\) this functional is continuous w.r.t. the rank-1 adapted weak topology \(\mathcal {T}^{(1)}\) (see [14, Theorem 1.1]). The above observations tell us that in order to apply a regression for computing the fair price of American-type options, one should use the rank-1 signature kernel MMD \(\mathcal {D}^{(1)}_{\text{Sig}}\) instead of \(\mathcal {D}_{\text{Sig}}\) in the usual distribution regression. We call such a regression approach with higher-rank signature kernels the adapted distribution regression.
To illustrate the power of adapted distribution regression, here we empirically investigate the effectiveness of the proposed signature kernel algorithm for two optimal stopping problems:
1.
the pricing of an American basket option with geometric put payoff;
2.
the pricing of American options under the rough Bergomi model.
More precisely, treating these as distribution regression problems from probability measures on path space to \(\mathbb {R}\), we endow the classical Kernel Ridge Regression (KRR) algorithm with the characteristic kernel \(K^{(1)}_{\text{Sig}}\) studied in previous sections and compare its performance against classical algorithms. The code is available at https://github.com/maudl3116/higherOrderKME.
3.2.1 American Basket Option Pricing
The numerical example in this subsection is taken from [47, Section 3.5.1]. It is well known that an American option can be approximated by a Bermudan option, which can be exercised only at specific dates \(t_0 < t_1 < t_2 < \ldots < t_N\); if the time grid is sufficiently fine, the approximation is accurate. For \(d \in \mathbb {N}\), we consider the Black-Scholes model, using a d-dimensional geometric Brownian motion \((X_t)_{t\geq 0}\) for the stock prices. At time \(t \in [0,T]\) with initial stock prices \(X_t = x \in \mathbb {R}^d\), the price \(V(t,x)\) of the American option can be formulated as the following optimal stopping problem
with payoff function \(\Phi : \mathbb {R}^d \to \mathbb {R}_+\), where the supremum is taken over all stopping times. Here, we consider a geometric put payoff defined as follows
To perform KRR, we consider the classical supervised learning setting and we form a dataset \(\{(\mathcal {X}_i, y_i)\}_{i=1}^N\) of \(N = 100\) input-output pairs, where
\(\mathcal {X}_i = \{x^{i,j} : [0,T] \to \mathbb {R}^d\}_{j=1}^{n_i}\) is a set of \(n_i\) paths sampled from a d-dimensional Black-Scholes model over the time grid \(\{0, T/\ell , 2T/\ell , \ldots , T\}\) with \(\ell = 10\), with fixed interest rate \(r = 2 \%\), \(d=20\), and volatility \(\sigma _i \in [0.1,0.5]\) sampled uniformly at random, as in [43]. For simplicity we set \(n_1 = \cdots = n_N = n\);
\(y_i\) is the price of the American option obtained via a binomial-tree method using a tree of depth \(10{,}000\) with strike \(v=100\) and maturity \(T=1\).
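For a single asset, the lattice pricer used to generate such labels can be sketched with a standard Cox-Ross-Rubinstein tree (a one-dimensional toy stand-in; the experiment above prices a 20-dimensional geometric basket with a tree of depth 10,000):

```python
import numpy as np

def crr_price(S0, K, r, sigma, T, steps, american=True):
    """Cox-Ross-Rubinstein binomial pricer for a put option."""
    dt = T / steps
    u = np.exp(sigma * np.sqrt(dt))
    d = 1.0 / u
    p = (np.exp(r * dt) - d) / (u - d)     # risk-neutral up-probability
    disc = np.exp(-r * dt)
    # stock prices at maturity, ordered from highest to lowest node
    S = S0 * u ** np.arange(steps, -1, -1) * d ** np.arange(0, steps + 1)
    V = np.maximum(K - S, 0.0)
    for n in range(steps - 1, -1, -1):
        S = S0 * u ** np.arange(n, -1, -1) * d ** np.arange(0, n + 1)
        V = disc * (p * V[:-1] + (1 - p) * V[1:])   # roll back one step
        if american:
            V = np.maximum(V, K - S)                # early-exercise check
    return V[0]
```

With early exercise disabled, the tree converges to the Black-Scholes European value, which gives a simple sanity check; the American price always dominates it.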
The dataset is randomly split into a training set of size \(N_{\text{train}}=90\) and a test set of size \(N_{\text{test}}=10\). We fit the KRR on the training set and evaluate the model on the test set.
We count how many sample paths n are required by each method (namely the kernels induced by rank-0 and rank-1 signatures) to achieve an approximation error \(\epsilon \) relative to the target price obtained using the lattice method. We use the Mean Absolute Percentage Error (MAPE), that is \(\epsilon =N_{\text{test}}^{-1}\sum _{i=1}^{N_{\text{test}}}{|(\hat {y}_i-y_i)/y_i|}\), where \(\{\hat {y}_i\}_{i=1}^{N_{\text{test}}}\) are the prices predicted by either algorithm. In this supervised learning task we use the code provided in [43] to apply the LS algorithm with basis functions given by Taylor polynomials up to degree two, as recommended in [43], to obtain the prices \(y_i, i =1, \ldots , N_{\text{test}}\).
As shown in Table 1, the higher-rank signature method, that is, KRR with the rank-1 signature kernel \(K^{(1)}_{\text{Sig}}\), requires significantly fewer sample paths from the models to reach a desired precision than the classical signature kernel \(K^{(0)}_{\text{Sig}}\). We also observe that for the (higher-rank) signature methods, the regression is fitted on \(N_{\text{train}}\) base models and so depends on \(n\times N_{\text{train}}\) sample paths. However, for any new prediction the \(n\times N_{\text{train}}\) training sample paths are fixed and do not need to be re-sampled; only n additional samples are required. Hence, for a total of \(N_{\text{test}}\) new predictions, the overall number of sample paths is \(n(N_{\text{train}} + N_{\text{test}}) \approx nN_{\text{test}}\) when \(N_{\text{train}} \ll N_{\text{test}}\).
Table 1
Geometric put option pricing under the Black-Scholes model. The columns show the number of sample paths n needed to reach a MAPE \(\epsilon \) on the test set for each method.

| Method | \(\epsilon =10\%\) | \(\epsilon =5\%\) | \(\epsilon =2.5\%\) |
|---|---|---|---|
| Rank-0 signature kernel \(K_{\text{Sig}}^{(0)}\) | 50 | 250 | 1000 |
| Rank-1 signature kernel \(K_{\text{Sig}}^{(1)}\) | 50 | 200 | 500 |
3.2.2 Pricing American Options Under the Rough Bergomi Model
In this subsection, which is taken from [73], we make use of Support Vector Regression (SVR) equipped with the rank-1 signature kernel \(K^{(1)}_{\text{Sig}}\) to address the pricing of American call options,
\[ V(t,S_t) = \sup _{\tau } \mathbb {E}\big [e^{-r(\tau - t)}\, \Phi (S_\tau ) \,\big |\, \mathcal {F}_t\big ], \]
where \(\Phi (S_t) = (S_t - v)_+\) and the supremum is taken over all stopping times \(\tau \) with values in \([t,T]\). Here, we assume that market price dynamics follow a rough Bergomi model, introduced in [18], which is defined by
\[ \mathrm {d}S_t = r S_t\,\mathrm {d}t + S_t \sqrt {V_t}\,\mathrm {d}B_t, \qquad V_t = \xi \exp \Big (\eta \, Y^a_t - \frac {\eta ^2}{2}\, t^{2a+1}\Big ), \qquad Y^a_t = \sqrt {2a+1}\int _0^t (t-s)^a\, \mathrm {d}W^1_s, \]
with \(B = \rho W^1 + \sqrt {1-\rho ^2}\, W^2\), where \(W^1,W^2\) are two independent Brownian motions and \(a \in \left (-\tfrac 12, 0\right )\) is the roughness index.
We employ the hybrid scheme of [24] to efficiently simulate sample paths of the Volterra process \(Y^a\), with time complexity \(\mathcal {O}(n\log (n))\), where n is the number of time steps. Option prices are numerically obtained using the Monte Carlo algorithm proposed by Longstaff and Schwartz (LS) in [59] with \(500{,}000\) sample paths from the rough Bergomi model with \(\rho =-0.9\), \(\eta =1.9\) and \(\xi =0.09\). The strike is \(v=1\), the maturity is \(T=1\), we use 10 time steps, and the interest rate is \(r=0.05\). For the distribution regression task, we consider 100 values \(\{a_i\}_{i=1}^{100}\) of the roughness index a, chosen uniformly at random from \(\left ]-0.5, -0.2\right ]\). Each \(a_i\) yields a different price process \(S^a\). We regress the corresponding option price \(V^a(0,S_0=1)\) on \(m \in \mathbb {N}\) sample paths from \(S^a\), using an SVR endowed with the rank-1 signature kernel \(K^{(1)}_{\text{Sig}}\), and benchmark the method against the rank-0 signature kernel regression algorithm for the same number m of sample paths, where \(m = 100, 250, 500\).
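For prototyping, the Volterra process \(Y^a\) can also be sampled exactly on a fixed grid from its Gaussian covariance. The sketch below is our own code, an \(\mathcal {O}(n^3)\) alternative to the \(\mathcal {O}(n\log n)\) hybrid scheme of [24]: it uses \(\mathbb {E}[Y^a_t Y^a_s] = (2a+1)\int _0^{s\wedge t}(t-u)^a(s-u)^a\,\mathrm {d}u\), approximated by a midpoint rule, together with a robust matrix square root.

```python
import numpy as np

def volterra_cov(times, a, nquad=2000):
    """Covariance of Y^a_t = sqrt(2a+1) * int_0^t (t-s)^a dW_s on a sorted grid of positive times."""
    n = len(times)
    cov = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            m = times[j]                                  # min(t_i, t_j): grid sorted ascending
            u = (np.arange(nquad) + 0.5) * (m / nquad)    # midpoint nodes avoid the endpoint singularity
            val = (2 * a + 1) * np.mean((times[i] - u) ** a * (times[j] - u) ** a) * m
            cov[i, j] = cov[j, i] = val
    return cov

def sample_volterra(times, a, nsamples, rng):
    """Draw exact joint samples of (Y^a_{t_1}, ..., Y^a_{t_n}) from the Gaussian law."""
    w, v = np.linalg.eigh(volterra_cov(times, a))
    factor = v * np.sqrt(np.clip(w, 0.0, None))  # quadrature error may leave tiny negative eigenvalues
    return rng.standard_normal((nsamples, len(times))) @ factor.T

def rough_bergomi_variance(y, times, a, xi=0.09, eta=1.9):
    """Spot variance V_t = xi * exp(eta * Y^a_t - eta^2/2 * t^(2a+1))."""
    return xi * np.exp(eta * y - 0.5 * eta**2 * np.asarray(times) ** (2 * a + 1))

times = np.linspace(0.1, 1.0, 10)
samples = sample_volterra(times, a=-0.3, nsamples=5, rng=np.random.default_rng(1))
spot_var = rough_bergomi_variance(samples, times, a=-0.3)
```

Since \(\operatorname {Var}(Y^a_t) = t^{2a+1}\), the diagonal of the computed covariance matrix (and hence \(\mathbb {E}[V_t] = \xi \)) gives a quick sanity check on the quadrature.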
As shown in Table 2, SVR with kernel \(K^{(1)}_{\text{Sig}}\) systematically outperforms the classical signature kernel \(K^{(0)}_{\text{Sig}}\).
Table 2
American call option pricing under the rough Bergomi model. Standard deviations (in parentheses) are obtained by repeating the experiment 10 times with different train-test splits (the m-sample is fixed).

| Predictive MSE \(\times 10^{-6}\) | \(m=100\) | \(m=250\) | \(m=500\) |
|---|---|---|---|
| Rank-0 signature kernel \(K_{\text{Sig}}^{(0)}\) | 5.27 (\(\pm \) 1.53) | 1.36 (\(\pm \) 0.37) | 0.98 (\(\pm \) 0.20) |
| Rank-1 signature kernel \(K_{\text{Sig}}^{(1)}\) | 3.86 (\(\pm \) 1.41) | 1.40 (\(\pm \) 0.42) | 0.83 (\(\pm \) 0.15) |
4 Bibliographical Discussion
In this section we give a short bibliographical overview of past and recent developments regarding adapted topologies.
The prediction process was first introduced by Knight [53] in 1975 to study measurable processes in continuous time. In this very general setting, the usefulness of the prediction process stems from the fact that, even though no regularity beyond measurability is assumed a priori for the process itself, the prediction process, when constructed with the right-continuous filtration generated by the process, is more regular: it has càdlàg paths and satisfies the strong Markov property.
In the unpublished seminal monograph titled “Weak convergence and the general theory of processes” from 1981, Aldous [6] establishes the notion of “extended weak convergence”. While our discussion in Sect. 1.3 is restricted to a discrete-time framework, his work was concerned with continuous time and stochastic bases of filtered processes that satisfy the usual conditions. With Knight’s prediction processes as the main device, Aldous aimed to connect the theory of weak convergence, as laid down in Billingsley’s influential book [25], with the, at the time, new developments in the ‘general theory of stochastic processes’ of the Strasbourg school and the martingale approach to stochastic processes. His main motivation stems from theoretical as well as practical deliberations on certain unsatisfactory features of the usual weak topology with regard to stochastic processes: on the one hand, stochastic operations such as the Doob-Meyer decomposition are not solely path transformations, and on the other hand, stochastic optimization problems such as optimal stopping are discontinuous w.r.t. the weak topology, cf. Example 1.3. We also refer to his works [5, 7].
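Aldous's motivating discontinuity is easy to reproduce numerically. The toy computation below is our own construction, in the spirit of the chapter's Example 1.3 (which we do not reproduce): two two-step processes whose laws are \(\epsilon \)-close yield optimal stopping values roughly \(1/2\) apart, because only one of the two filtrations reveals the final move in advance.

```python
# Two-step processes with X0 = 0 and payoff X_tau, solved by backward induction.
# Process A: X1 = eps*xi,  X2 = xi   (X1 reveals the sign of X2)
# Process B: X1 = eps*xi', X2 = xi   (xi' independent of xi, so X1 is uninformative)
eps = 0.01

def stop_value(scenarios):
    """Optimal stopping value sup_tau E[X_tau] for a two-step scenario tree.

    scenarios: list of (prob, x1, law of X2 on this branch as [(prob, x2), ...]).
    """
    value_at_1 = 0.0
    for prob, x1, x2_law in scenarios:
        continuation = sum(p * x for p, x in x2_law)  # wait until time 2
        value_at_1 += prob * max(x1, continuation)    # stop at time 1 vs continue
    return max(0.0, value_at_1)                       # stopping at time 0 pays X0 = 0

A = [(0.5,  eps, [(1.0,  1.0)]),
     (0.5, -eps, [(1.0, -1.0)])]
B = [(0.5,  eps, [(0.5, 1.0), (0.5, -1.0)]),
     (0.5, -eps, [(0.5, 1.0), (0.5, -1.0)])]

value_a, value_b = stop_value(A), stop_value(B)   # 0.495 vs 0.005 for eps = 0.01
```

As \(\epsilon \to 0\), both processes converge weakly to the same limit, yet the values converge to \(1/2\) and \(0\) respectively: exactly the discontinuity that the extended weak topology repairs.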
From a model-theoretic perspective, Keisler [51] introduces in a seminar in 1979 the ‘adapted probability logic’ for filtered probability spaces carrying stochastic processes, which aims to formalize the probabilistic language of stochastic processes. Subsequently, influenced by the work of Aldous, Hoover–Keisler [46] establish notions of equivalence of filtered processes based on higher-rank prediction processes, see Sect. 1.4, which try to capture all relevant information, in particular the higher-order information that is unseen by the rank 1-prediction process, as demonstrated in Example 1.6. They do so based on the concept of adapted distribution, cf. (11), which they introduce and further explore. Two filtered processes have the same adapted distribution if they share the same ‘probabilistic properties’, such as having the same law, the same Doob-Meyer decomposition, the same value for filtration-dependent functionals, and being a martingale or predictable. We refer to [46] for a precise discussion of this topic.
In independent work, Hellwig [42] studies in 1996 the question of continuity of sequential decision making problems in the context of mathematical economics. We put his framework in perspective with our initial setup from Sect. 1. For each time \(t \in I\), we have a state space \(\mathcal X_t\) that models the set of possible states of the world, and a set \(\mathcal A_t\) of actions available to the agent at time t (where \(\mathcal A_t\) is a compact metric space and \(\mathcal A := \prod _{t \in I} \mathcal A_t\)). Parallel to the introduction, we model the uncertain evolution of the states of the world, as well as the information flow, by a filtered process \(\mathbf {X}\). For simplicity, we assume here that the stochastic basis of \(\mathbf {X}\) supports a uniform random variable independent of \(\mathcal {F}_T\), which allows us to skip working with additional randomization; this assumption is without loss of generality since any equivalence class of filtered processes admits a representative with that property. Given a state of the world \(x \in \mathcal X\) and a chosen sequence of actions \(a \in \mathcal A\), the agent's utility is described by the utility function \(u(x,a)\). The goal of the agent is to maximize the utility \(U(\mathbf {X})\) by sequentially choosing actions, where
\[ U(\mathbf {X}) = \max _{A} \mathbb {E}\big [u(X, A)\big ] \tag {23} \]
and the maximum is taken over the set of all admissible plans of action \(A = (A_t)_{t \in I}\), that is, \((\mathcal {F}_t^{\mathbf {X}})\)-adapted stochastic processes with paths in \(\mathcal A\). Hellwig observed that, even when restricting to \(\text{FP}_{\text{plain}}\) (filtered processes equipped with the natural filtration of the underlying process), the weak topology is not strong enough to guarantee continuity of (23). Thus a stronger notion of convergence, the information topology, has to be introduced. In view of the already introduced prediction process, this topology is most easily described as the initial topology w.r.t. the family of maps given by the marginals of the prediction process
This topology yields, when \(u \in C_b(\mathcal X \times \mathcal A)\), continuity of the utility maximization problem (23) on \(\text{FP}_{\text{plain}}\), see also the discussion “On all adapted topologies are equal” below.
Starting from Wasserstein distances, which are a powerful tool to compare probability distributions, see [74, 78], and have become a staple even beyond applied and theoretical probability theory, versions of Wasserstein distances have been introduced for stochastic processes. In this context we mention the works of Vershik on the classification of filtrations [76, 77], where an iterated Kantorovich distance is central to his criterion for standardness of filtrations, see also [37, Theorem 2.7], and Rüschendorf’s modification of the Wasserstein distance running under the name of ‘Markov constructions’ [71]. The extension from two to multiple time steps was carried out by Pflug [66] under the name ‘nested distance’ to study continuity and stability of multistage stochastic optimization problems; we refer to [67] for a comprehensive overview of this extensive body of work. Lassalle [56] introduces a ‘causal optimal transport problem’ with an associated distance to study concentration inequalities on the Wiener space, see also the recent article of Föllmer [39] in a similar vein. In the same spirit, adapted Wasserstein distances [11, 17] were introduced to study sensitivity in mathematical finance [13, 16]. Notably, in finite discrete time the adapted Wasserstein distance coincides with the nested distance. It was shown in [17] that the adapted Wasserstein distance naturally extends to the space of filtered processes \(\text{FP}\) and defines there a complete and geodesically complete metric, compatible with the adapted weak topology. This insight allows one to define new ways of taking averages and barycenters of families of stochastic processes that respect the geometry of the paths as well as the underlying filtrations. For example, the barycenter of a family of martingales is again a martingale, which is not the case for classical Wasserstein barycenters [4, 55].
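In finite discrete time, the nested distance of [66] can be computed on scenario trees by backward induction, solving a small optimal transport problem at every pair of nodes with the ground cost augmented by the value one level below. The sketch below is our own illustration, restricted to trees with at most two children per node (so each transport problem is solved exactly at an endpoint of the one-parameter coupling family); the tree encoding and helper names are ours, and the worked input is a pair of two-step processes with nearly identical laws but different information flows.

```python
def ot_cost(p, q, cost):
    """Exact optimal transport cost between distributions with at most two atoms each."""
    if len(p) == 1:
        return sum(qj * cost[0][j] for j, qj in enumerate(q))
    if len(q) == 1:
        return sum(pi * cost[i][0] for i, pi in enumerate(p))
    lo, hi = max(0.0, p[0] + q[0] - 1.0), min(p[0], q[0])

    def total(t):  # coupling with mass t on (atom 0, atom 0); the objective is linear in t
        pi = [[t, p[0] - t], [q[0] - t, 1.0 - p[0] - q[0] + t]]
        return sum(pi[i][j] * cost[i][j] for i in range(2) for j in range(2))

    return min(total(lo), total(hi))

# a node is (value, [(prob, child), ...]); an empty child list marks a leaf
def nested(u, v):
    """Backward induction: OT between child distributions with cost |x - y| + value below."""
    cu, cv = u[1], v[1]
    if not cu or not cv:
        return 0.0
    cost = [[abs(ca[1][0] - cb[1][0]) + nested(ca[1], cb[1]) for cb in cv] for ca in cu]
    return ot_cost([ca[0] for ca in cu], [cb[0] for cb in cv], cost)

def nested_distance(root_a, root_b):
    return abs(root_a[0] - root_b[0]) + nested(root_a, root_b)

eps = 0.01
leaf = lambda x: (x, [])
# Process A: the first move reveals the final value; Process B: it does not.
tree_a = (0.0, [(0.5, ( eps, [(1.0, leaf( 1.0))])),
                (0.5, (-eps, [(1.0, leaf(-1.0))]))])
tree_b = (0.0, [(0.5, ( eps, [(0.5, leaf(1.0)), (0.5, leaf(-1.0))])),
                (0.5, (-eps, [(0.5, leaf(1.0)), (0.5, leaf(-1.0))]))])

nd = nested_distance(tree_a, tree_b)   # 1.0, although the path laws are only of order eps apart
```

The classical 1-Wasserstein distance between the two path laws is of order \(\epsilon \), while the nested distance stays near 1: it separates the two information structures exactly as the adapted weak topology does.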
Optimal transport-based methods for the numerical computation of adapted distances were explored in [38, 68], and the adapted empirical measure for the statistical estimation of laws of stochastic processes was studied, e.g., in [1, 9]. Closely related to these advances are the recently investigated distances between solutions of SDEs [15, 26, 70] and the Knothe–Rosenblatt distance [21].
Finally, we also report on the approach based on signature methods in [28], which opens up the framework of adapted topologies to computationally efficient algorithms based on kernel methods. For more details, we refer to the preceding Sects. 2 and 3.
On All Adapted Topologies Are Equal
Remarkably, in discrete time all of the approaches discussed above yield the same topology on \(\text{FP}_{\text{plain}}\), see [14, Theorem 1.1] and [63, Theorem 11]. This is important, since the nested path spaces of the higher-rank prediction processes are quite complex. Consequently, as long as one considers only processes with their generated filtration, which is typically the case in practice, it is not necessary from a topological point of view to consider higher-rank prediction processes. In mathematical terms: if \(\mathbf {X}, {\mathbf {X}}^1, {\mathbf {X}}^2, \dots \) is a sequence in \(\text{FP}_{\text{plain}}\), then \({\mathbf {X}}^n \to \mathbf {X}\) in the adapted weak topology \(\mathcal T\) if and only if this convergence holds in the rank 1-adapted weak topology \(\mathcal T^{(1)}\). We stress that, in view of Example 1.3 (and in general for sequential decision making problems), this result requires that the limit of the sequence is an element of \(\text{FP}_{\text{plain}}\). This emphasizes what was already observed in Example 1.6: the higher-rank prediction processes capture higher-order information, which is not needed when considering natural filtrations.
Further Applications of Adapted Topologies
Over the past 40 years, the adapted weak topology and related concepts have found application in multiple fields such as economics, mathematical finance, stochastic control, stochastic analysis and even machine learning. We try to provide a representative cross-section: adapted topologies have been used in the study of optimal stopping [33], game options [35], robust hedging [36] and utility maximisation [19] with proportional transaction costs, convergence of filtrations and stability of backward SDEs [34], stability of the Doob-Meyer decomposition [61], martingale representation [64], weak transport theory [3, 10, 12, 49], enlargement of filtrations [2], Skorokhod representation of sequences of stochastic processes [23, 45], stochastic exit time problems [69], continuity in stochastic finance [13], sensitivity analysis of stochastic control problems [16], and a hypothesis test for martingality [27]. Moreover, we refer to [72] for a comprehensive survey on information structures and associated stochastic control topologies for decision making problems. Related ideas were also recently explored in the context of machine learning; we mention, for example, chain rule optimal transport [62], causal Wasserstein GANs [52, 79], and conditional optimal transport [48]. Higher-rank expected signatures [47, 73] were employed in quantitative finance for real-world calibration and optimal stopping.
Acknowledgements
CL gratefully acknowledges support from the National Key R&D Program of China (No. 2023YFA1010900).
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
A function \(c: (I \to \mathbb R^d) \times I \to \mathbb R\) is non-anticipative if for each \(t \in I\), the value of \(c(x, t)\) only depends on \(x_1,\ldots ,x_t\), i.e., the trajectory of the path x before time t.
Although \(\mathcal {P}(M^{(0)})\) is a subset of the linear space of all signed measures on \(M^{(0)}\), it is quite hard to compute the resulting signature explicitly if one uses the associated linear structure to define the iterated integrals of measure-valued paths.
1.
B. Acciaio, S. Hou, Convergence of adapted empirical measures on \(\mathbb {R}^d\). Preprint. arXiv:2211.10162 (2022)
2.
B. Acciaio, J. Backhoff-Veraguas, A. Zalashko, Causal optimal transport and its links to enlargement of filtrations and continuous-time stochastic optimization. Stoch. Process. Appl. 130(5), 2918–2953 (2020)
3.
B. Acciaio, M. Beiglböck, G. Pammer, Weak transport for non-convex costs and model-independence in a fixed-income market. Math. Finance 31(4), 1423–1453 (2021)
4.
B. Acciaio, D. Kršek, G. Pammer, Multicausal transport: barycenters and dynamic matching. Preprint. arXiv:2401.12748 (2024)
5.
D. Aldous, Stopping times and tightness. Ann. Probab. 6, 335–340 (1978)
6.
D. Aldous, Weak convergence and general theory of processes. Unpublished monograph, Department of Statistics, University of California, Berkeley, July 1981
7.
D. Aldous, Stopping times and tightness. II. Ann. Probab. 17, 586–595 (1989)
8.
M. Avellaneda, A. Levy, A. Parás, Pricing and hedging derivative securities in markets with uncertain volatilities. Appl. Math. Finance 2(2), 73–88 (1995)
9.
J. Backhoff, D. Bartl, M. Beiglböck, J. Wiesel, Estimating processes in adapted Wasserstein distance. Ann. Appl. Probab. 32(1), 529–550 (2022)
10.
J. Backhoff-Veraguas, G. Pammer, Stability of martingale optimal transport and weak optimal transport. Ann. Appl. Probab. 32(1), 721–752 (2022)
11.
J. Backhoff-Veraguas, M. Beiglböck, Y. Lin, A. Zalashko, Causal transport in discrete time and applications. SIAM J. Optim. 27(4), 2528–2562 (2017)
12.
J. Backhoff-Veraguas, M. Beiglböck, G. Pammer, Existence, duality, and cyclical monotonicity for weak transport costs. Calc. Var. Partial Differential Equations 58(6), 203 (2019)
13.
J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, M. Eder, Adapted Wasserstein distances and stability in mathematical finance. Finance Stoch. 24(3), 601–632 (2020)
14.
J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, M. Eder, All adapted topologies are equal. Probab. Theory Relat. Fields 178, 1125–1172 (2020)
15.
J. Backhoff-Veraguas, S. Källblad, B.A. Robinson, Adapted Wasserstein distance between the laws of SDEs. Preprint. arXiv:2209.03243 (2022)
16.
D. Bartl, J. Wiesel, Sensitivity of multiperiod optimization problems with respect to the adapted Wasserstein distance. SIAM J. Financ. Math. 14(2), 704–720 (2023)
17.
D. Bartl, M. Beiglböck, G. Pammer, The Wasserstein space of stochastic processes. Preprint. arXiv:2104.14245v1 (2021)
18.
C. Bayer, P. Friz, J. Gatheral, Pricing under rough volatility. Quant. Finance 16(6), 887–904 (2016)
19.
E. Bayraktar, L. Dolinskyi, Y. Dolinsky, Extended weak convergence and utility maximisation with proportional transaction costs. Finance Stoch. 24(4), 1013–1034 (2020)
20.
M. Beiglböck, P. Henry-Labordère, F. Penkner, Model-independent bounds for option prices—a mass transport approach. Finance Stoch. 17(3), 477–501 (2013)
21.
M. Beiglböck, G. Pammer, A. Posch, The Knothe–Rosenblatt distance and its induced topology. Preprint. arXiv:2312.16515 (2023)
22.
M. Beiglböck, G. Pammer, S. Schrott, X. Zhang, Representing general stochastic processes as martingale laws. Preprint. arXiv:2312.16725 (2023)
23.
M. Beiglböck, S. Pflügl, S. Schrott, A probabilistic view on the adapted Wasserstein distance. Preprint. arXiv:2406.19810 (2024)
24.
M. Bennedsen, A. Lunde, M.S. Pakkanen, Hybrid scheme for Brownian semistationary processes. Finance Stoch. 21(4), 931–965 (2017)
25.
P. Billingsley, Convergence of Probability Measures. Wiley Series in Probability and Statistics, 2nd edn. (Wiley, New York, 1999)
26.
J. Bion-Nadal, D. Talay, On a Wasserstein-type distance between solutions to stochastic differential equations. Ann. Appl. Probab. 29(3), 1609–1639 (2019)
27.
J. Blanchet, J. Wiesel, E. Zhang, Z. Zhang, Empirical martingale projections via the adapted Wasserstein distance. Preprint. arXiv:2401.12197 (2024)
28.
P. Bonnier, C. Liu, H. Oberhauser, Adapted topologies and higher rank signatures. Ann. Appl. Probab. 33(3), 2136–2175 (2023)
29.
B. Bouchard, M. Nutz, Arbitrage and duality in nondominated discrete-time models. Ann. Appl. Probab. 25(2), 823–859 (2015)
30.
T. Cass, T. Lyons, C. Salvi, W. Yang, The signature kernel is the solution of a Goursat PDE. SIAM J. Math. Data Sci. 3(3), 873–899 (2021)
31.
I. Chevyrev, H. Oberhauser, Signature moments to characterize laws of stochastic processes. J. Mach. Learn. Res. 23, 1–42 (2022)
32.
R. Cont, Model uncertainty and its impact on the pricing of derivative instruments. Math. Finance 16(3), 519–547 (2006)
33.
F. Coquet, S. Toldo, Convergence of values in optimal stopping and convergence of optimal stopping times. Electron. J. Probab. 12, 207–228 (2007)
34.
F. Coquet, J. Mémin, L. Słominski, On weak convergence of filtrations, in Séminaire de Probabilités XXXV (2001), pp. 306–328
35.
Y. Dolinsky, Applications of weak convergence for hedging of game options. Ann. Appl. Probab. 20(5), 1891–1906 (2010)
36.
Y. Dolinsky, Hedging of game options under model uncertainty in discrete time. Electron. Commun. Probab. 19, 1–11 (2014)
37.
L. Dubins, J. Feldman, M. Smorodinsky, B. Tsirelson, Decreasing sequences of \(\sigma \)-fields and a measure change for Brownian motion. Ann. Probab. 24(2), 882–904 (1996)
38.
S. Eckstein, G. Pammer, Computational methods for adapted optimal transport. Ann. Appl. Probab. 34(1A), 675–713 (2024)
39.
H. Föllmer, Optimal couplings on Wiener space and an extension of Talagrand’s transport inequality, in Stochastic Analysis, Filtering, and Stochastic Optimization: A Commemorative Volume to Honor Mark H.A. Davis’s Contributions (Springer, 2022), pp. 147–175
40.
H. Föllmer, A. Schied, Stochastic Finance: An Introduction in Discrete Time (Walter de Gruyter, Berlin, 2011)
41.
A. Galichon, P. Henry-Labordère, N. Touzi, A stochastic control approach to no-arbitrage bounds given marginals, with an application to lookback options. Ann. Appl. Probab. 24(1), 312–336 (2014)
42.
M.F. Hellwig, Sequential decisions under uncertainty and the maximum theorem. J. Math. Econ. 25(4), 443–464 (1996)
43.
C. Herrera, F. Krach, P. Ruyssen, J. Teichmann, Optimal stopping via randomized neural networks. Front. Math. Finance 3(1), 31–77 (2024)
44.
D.G. Hobson, Robust hedging of the lookback option. Finance Stoch. 2, 329–347 (1998)
45.
D.N. Hoover, Convergence in distribution and Skorokhod convergence for the general theory of processes. Probab. Theory Relat. Fields 89, 239–259 (1991)
46.
D.N. Hoover, H.J. Keisler, Adapted probability distributions. Trans. Am. Math. Soc. 286(1), 159–201 (1984)
47.
B. Horvath, M. Lemercier, C. Liu, T. Lyons, C. Salvi, Optimal stopping via distribution regression: a higher rank signature approach. Preprint. arXiv:2304.01479 (2023)
48.
B. Hosseini, A.W. Hsu, A. Taghvaei, Conditional optimal transport on function spaces. Preprint. arXiv:2311.05672 (2023)
49.
B. Jourdain, G. Pammer, An extension of martingale transport and stability in robust finance. Electron. J. Probab. 29, 1–30 (2024)
50.
C. Kardaras, G. Žitković, Stability of the utility maximization problem with random endowment in incomplete markets. Math. Finance 21(2), 313–333 (2011)
51.
J. Keisler, Hyperfinite models of adapted probability logic. Ann. Pure Appl. Logic 31, 71–86 (1986)
52.
K. Klemmer, T. Xu, B. Acciaio, D.B. Neill, SPATE-GAN: improved generative modeling of dynamic spatio-temporal patterns with an autoregressive embedding loss, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 4 (June 2022), pp. 4523–4531
53.
F.B. Knight, A predictive view of continuous time processes. Ann. Probab. 3(4), 573–596 (1975)
54.
D. Kramkov, M. Sîrbu, Sensitivity analysis of utility-based prices and risk-tolerance wealth processes. Ann. Appl. Probab. 16(4), 2140–2194 (2006)
55.
D. Kršek, G. Pammer, General duality and dual attainment for adapted transport. Preprint. arXiv:2401.11958 (2024)
56.
R. Lassalle, Causal transport plans and their Monge–Kantorovich problems. Stoch. Anal. Appl. 36(3), 452–484 (2018)
57.
D. Lee, H. Oberhauser, The signature kernel. Preprint. arXiv:2305.04625 (2023)
58.
M. Lemercier, C. Salvi, T. Damoulas, E. Bonilla, T. Lyons, Distribution regression for sequential data, in International Conference on Artificial Intelligence and Statistics (2021)
59.
F.A. Longstaff, E.S. Schwartz, Valuing American options by simulation: a simple least-squares approach. Rev. Financ. Stud. 14(1), 113–147 (2001)
60.
T. Lyons, Uncertain volatility and the risk-free synthesis of derivatives. Appl. Math. Finance 2(2), 117–133 (1995)
61.
J. Mémin, Stability of Doob-Meyer decomposition under extended convergence. Acta Math. Appl. Sin. 19(2), 177–190 (2003)
62.
F. Nielsen, K. Sun, Chain rule optimal transport, in Progress in Information Geometry: Theory and Applications (Springer, 2021), pp. 191–217
63.
G. Pammer, A note on the adapted weak topology in discrete time. Electron. Commun. Probab. 29, 1–13 (2024)
64.
A. Papapantoleon, D. Possamaï, A. Saplaouras, Stability results for martingale representations: the general case. Trans. Am. Math. Soc. 372(8), 5891–5946 (2019)
65.
J. Park, K. Muandet, A measure-theoretic approach to kernel conditional mean embeddings, in Advances in Neural Information Processing Systems, vol. 33 (2020), pp. 21247–21259
66.
G. Pflug, Version-independence and nested distributions in multistage stochastic optimization. SIAM J. Optim. 20(3), 1406–1420 (2010)
67.
G. Pflug, A. Pichler, Multistage Stochastic Optimization. Springer Series in Operations Research and Financial Engineering (Springer, Cham, 2014)
68.
A. Pichler, M. Weinhardt, The nested Sinkhorn divergence to learn the nested distance. Comput. Manag. Sci. 19(2), 269–293 (2022)
69.
C. Reisinger, Y. Zhang, Regularity and stability of feedback relaxed controls. SIAM J. Control Optim. 59(5), 3118–3151 (2021)
70.
B.A. Robinson, M. Szölgyenyi, Bicausal optimal transport for SDEs with irregular coefficients. Preprint. arXiv:2403.09941 (2024)
71.
L. Rüschendorf, The Wasserstein distance and approximation theorems. Probab. Theory Relat. Fields 70(1), 117–129 (1985)
72.
N. Saldi, S. Yüksel, Geometry of information structures, strategic measures and associated stochastic control topologies. Probab. Surv. 19, 450–532 (2022)
73.
C. Salvi, M. Lemercier, C. Liu, B. Horvath, T. Damoulas, T. Lyons, Higher order kernel mean embeddings to capture filtrations of stochastic processes, in Advances in Neural Information Processing Systems, vol. 34 (2021), pp. 16635–16647
74.
F. Santambrogio, Optimal Transport for Applied Mathematicians (Birkhäuser, Cham, 2015)
75.
M. Soner, N. Touzi, J. Zhang, Quasi-sure stochastic analysis through aggregation. Electron. J. Probab. 16, 1844–1879 (2011)
76.
A.M. Vershik, Theory of decreasing sequences of measurable partitions. Algebra Anal. 6(4), 1–68 (1994)
77.
A.M. Vershik, The theory of filtrations of subalgebras, standardness, and independence. Russian Math. Surv. 72(2), 257 (2017)
78.
C. Villani, Optimal Transport: Old and New, vol. 338 (Springer, Berlin, 2009)
79.
T. Xu, L.K. Wenliang, M. Munn, B. Acciaio, COT-GAN: generating sequential data via causal optimal transport, in Advances in Neural Information Processing Systems, vol. 33 (2020), pp. 8798–8809