
Adapted Topologies and Higher-Rank Signatures

  • Open Access
  • 2026
  • Chapter
Abstract

This chapter delves into the intricate world of adapted topologies and higher-rank signatures, focusing on their application in financial decision-making under uncertainty. It begins by introducing the concept of adapted topologies, which are essential for modeling the uncertain evolution of financial markets and the information available to investors. The text then explores the notion of higher-rank signatures, which extend the classical signature theory to capture more complex probabilistic properties of stochastic processes. The chapter provides a detailed overview of past and recent developments in the systematic study of filtered processes, highlighting the importance of model sensitivity. It also discusses the theoretical tools and algorithms for computing higher-rank signatures, including the prediction process and higher-rank prediction processes. The practical applications of these concepts are illustrated through examples such as the pricing of American options under the rough Bergomi model and the classification of filtered processes in adapted weak topologies. The chapter concludes with a bibliographical discussion, providing a historical context and further reading on the topic. Readers will gain a deeper understanding of how adapted topologies and higher-rank signatures can be used to address continuity and sensitivity in sequential decision-making problems, making this chapter a valuable resource for professionals in the field.

1 Introduction to Adapted Topologies

Financial and economic agents frequently encounter decision-making problems. Facing uncertainty, they have to make a sequence of choices in a complex environment, which necessitates a systematic approach. Mathematical models are therefore employed to support the decision-making process and to assist in identifying reasonable courses of action. A common method of incorporating uncertainty into a model is through stochastic processes. The agent has to base their decisions on the available information, which is continually updated in time; this evolving information is formally represented by a filtration. Summarizing, the mathematical base of such a framework, where for ease of exposition we work throughout the article in a discrete-time setting with \(T \in \mathbb {N}\) time steps and \(I := \{ 1, \ldots , T \}\), involves:
  • a probability space \((\Omega ,\mathcal {F},\mathbb {P})\);
  • a filtration \((\mathcal {F}_t)_{t \in I}\), i.e., an increasing sequence of \(\sigma \)-algebras with \(\mathcal {F}_t \subseteq \mathcal {F}\); and
  • a stochastic process \(X = (X_t)_{t \in I}\) adapted to the filtration \((\mathcal {F}_t)_{t \in I}\), i.e., \(X_t\) is \(\mathcal {F}_t\)-measurable.
Informally speaking, as \(\mathcal {F}_t\) reflects the information of the agent at time t, the adaptedness of X means that the agent has observed the path X up to time t.
To give an illustrative example of such a setup, we briefly consider the model of a simple financial market as found in the classical mathematical finance literature, see for example [40]. There, the stochastic process \(X = (X_t)_{t \in I}\) with \(X_t \in \mathbb {R}^d\) may be used to model the uncertain evolution in time of d asset prices. This means that each coordinate of \(X_t = (X_t^i)_{i = 1}^d\) represents the value of a single asset at a given point in time. Meanwhile, the filtration incorporates the information that is available to an investor, on which they can base their decisions. Therefore, the investor can only employ trading strategies that are adapted to this filtration. For example, let \((H_t)_{t \in I}\) be a self-financing dynamic trading strategy. When \(H_t^i\) denotes the number of units of the i-th asset that the investor holds between time t and \(t + 1\) (that is, on \([t,t+1)\)), then this number is necessarily measurable with respect to \(\mathcal {F}_t\). Moreover, the value of the self-financing portfolio at time t is given by
$$\displaystyle \begin{aligned} {} \sum_{s = 1}^{t - 1} H_s \cdot (X_{s + 1} - X_s). \end{aligned} $$
(1)
In more mathematical terms, we consider dynamic trading strategies that are predictable with respect to that filtration. Therefore, the collection of attainable claims of type (1) crucially depends on both the process and the filtration, as we further stress in the example below. Beforehand, we introduce the following notion.
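As a small illustration (our own, not part of the chapter), the gains formula (1) can be evaluated numerically; the toy price path and positions below are made-up data, and the sketch does not enforce the adaptedness of the positions, it only computes the sum:

```python
import numpy as np

def portfolio_value(H, X, t):
    """Value at time t of the self-financing portfolio in (1):
    sum_{s=1}^{t-1} H_s . (X_{s+1} - X_s), with 1-based time indices."""
    # H[s] holds the positions taken at time s (held on [s, s+1)),
    # X[s] the asset prices at time s; index 0 is unused padding.
    return sum(H[s] @ (X[s + 1] - X[s]) for s in range(1, t))

# Toy example: d = 2 assets over T = 3 time steps.
X = [None,
     np.array([100.0, 50.0]),
     np.array([101.0, 49.0]),
     np.array([103.0, 48.0])]
H = [None,
     np.array([1.0, 2.0]),   # chosen at time 1, must be F_1-measurable
     np.array([0.0, 1.0])]   # chosen at time 2, must be F_2-measurable

print(portfolio_value(H, X, 3))  # 1*1 + 2*(-1) + 0*2 + 1*(-1) = -2.0
```

In a real model, the measurability constraint on \(H_t\) is exactly what the filtration encodes; the sketch treats it as the caller's responsibility.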
Definition 1.1 (Filtered Process)
A filtered process \(\mathbf {X}\) is a 5-tuple
$$\displaystyle \begin{aligned} \mathbf{X} := \left( \Omega^{\mathbf{X}}, \mathcal{F}^{\mathbf{X}}, \mathbb{P}^{\mathbf{X}}, (\mathcal{F}_t^{\mathbf{X}})_{t \in I}, X \right), \end{aligned}$$
consisting of a probability space \((\Omega ^{\mathbf {X}}, \mathcal {F}^{\mathbf {X}}, \mathbb {P}^{\mathbf {X}})\), a filtration \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\) and an \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-adapted stochastic process X. The class of all filtered processes is denoted by \(\mathcal {F}\mathcal {P}\).
As the herein studied framework can be condensed to such a 5-tuple, filtered processes can be seen as the mathematical base of models for many phenomena in finance and economics. In what follows we give an overview on past and recent developments in the systematic study of filtered processes, where the main motivation stems from model sensitivity.
We use \(\mathbb E_{\mathbf {X}}\) to denote the expectation w.r.t. \(\mathbb P^{\mathbf {X}}\) for a filtered process \(\mathbf {X}\). When the probability space is unambiguous from the context, we will omit the subscript and simply write \(\mathbb E\).
Example 1.2
Let us consider the simple probability space \((\Omega ,\mathcal {F},\mathbb {P})\) where \(\Omega := \{ \omega _1,\omega _2\}\) has two elements, \(\mathcal {F} := 2^\Omega \) is the power set of \(\Omega \), and \(\mathbb {P}\) is the uniform distribution on \(\Omega \). We consider the process \(X = (X_t)_{t = 1}^2\) with
$$\displaystyle \begin{aligned} X_1(\omega) = 0, \quad X_2(\omega) := \begin{cases} 1 & \omega = \omega_1, \\ -1 & \omega = \omega_2, \end{cases} \end{aligned}$$
for \(\omega \in \Omega \). First assume that the investor’s available information is given by the filtration \((\mathcal {F}_t)_{t = 1}^2\) with \(\mathcal {F}_1 := \{ \emptyset , \Omega \}\) and \(\mathcal {F}_2 := \mathcal {F}\), that is, we consider the filtered process \(\mathbf {X} := (\Omega , \mathcal F, \mathbb P, (\mathcal F_t)_{t = 1}^2, X)\). Then the set of attainable claims of type (1) is given by the collection
$$\displaystyle \begin{aligned} \{ h X_2 : h \in \mathbb{R} \}. \end{aligned}$$
By buying h units of X at time 1, one cannot make a profit without the risk of losing money, since \(h \cdot X_2 \ge 0\) \(\mathbb {P}\)-almost surely if and only if \(h = 0\).
On the other hand, if the investor’s available information is given by the filtration \((\mathcal {G}_t)_{t = 1}^2\) with \(\mathcal G_1 = \mathcal G_2 = \mathcal {F}\), that is, we have the filtered process \(\mathbf {Y} := (\Omega , \mathcal F, \mathbb P, (\mathcal G_t)_{t = 1}^2, X)\) as model, then the set of attainable claims is given by
$$\displaystyle \begin{aligned} \{ H X_2 : H \text{ is }\mathcal{F}\text{-measurable} \}. \end{aligned}$$
Since the investor already knows the value of \(X_2\) at time \(t=1\), they can exploit this information by buying \(H = X_2\) units. This leads to a profit without risk of losing money, since the value of the portfolio at time 2 is \(H X_2 = X_2^2 = 1\).
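The two scenarios can be checked numerically; the following Python sketch (our own, with our state labels) verifies that no constant position yields a riskless profit under the trivial time-1 information, while the fully informed position \(H = X_2\) pays 1 in both states:

```python
# Omega = {omega_1, omega_2}, X_1 = 0, X_2(omega_1) = 1, X_2(omega_2) = -1.
X2 = {"omega_1": 1.0, "omega_2": -1.0}

# Filtration F: F_1 is trivial, so the position h at time 1 is a constant.
def payoff_F(h):
    return {w: h * X2[w] for w in X2}

# Any h != 0 loses money in some state, so h * X2 >= 0 a.s. only for h = 0.
no_arbitrage = all(
    any(payoff_F(h)[w] < 0 for w in X2)
    for h in (-1.0, -0.5, 0.5, 1.0)
)

# Filtration G: G_1 = F (full information), so H may depend on omega.
H = X2  # buy X2(omega) units, measurable w.r.t. G_1
payoff_G = {w: H[w] * X2[w] for w in X2}  # equals X2^2 = 1 in both states

print(no_arbitrage, payoff_G)
```

The check over four values of h stands in for the full statement, which holds for every real \(h \neq 0\) by the same sign argument.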
Even though Example 1.2 is admittedly rather basic, it stresses the importance of studying stochastic processes in close combination with their filtrations. Virtually any sequential decision-making problem exhibits similar properties, including the optimal stopping problem (cf. Example 1.3) that we will discuss in more detail below.
Before we move on to the detailed discussion, we first collect, for the readers' convenience, some important notation that will be used throughout the article:
Table of Notations

Spaces
  • \(\mathcal {P}(A)\): the space of probability measures on a topological space A (p. 337)
  • \(I \to A\): the space of all paths from I into a set A (p. 339)
  • \(M^{(r)}\): the path space of rank-r prediction processes (p. 343)
  • \(T((\mathbb {R}^d))\): the tensor algebra over \(\mathbb {R}^d\) (p. 348)
  • H: a Hilbert space in \(T((\mathbb {R}^d))\) containing the image of the signature (p. 348)
  • \(H^{(r+1)}\): a Hilbert space in \(T((\mathbb {R} \oplus H^{(r)}))\) containing the image of \(\Phi ^{(r)}_{\text{Sig}}\) (p. 353)
  • \({\mathbf{H}}_{\text{Sig}}^{(r)}\): the RKHS induced by \({\mathbf{k}}^{(r)}_{\text{Sig}}\) (p. 355)

Paths and processes
  • \(\mathrm {pp}(\mathbf {X})\): the prediction process of a filtered process \(\mathbf {X}\) (p. 339)
  • \(\mathrm {pp}^{(r)}(\mathbf {X})\): the rank-r prediction process of a filtered process \(\mathbf {X}\) (p. 343)
  • \( {\mathbf{p}}^{(r)}\): the paths in \(M^{(r)}\) (p. 353)
  • \(\tilde {{\mathbf {p}}}^{(r)}\): the paths formed by taking the integral of \(\Phi ^{(r-1)}_{\text{Sig}}\) against \({\mathbf{p}}^{(r)}\) (p. 353)
  • \(\hat x^i\): the i-th sample path of process \(\mathbf {X}\) (p. 361)
  • \(\hat y^j\): the j-th sample path of process \(\mathbf {Y}\) (p. 361)
  • \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})\): \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X}) = \mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X) | \mathcal {F}^{\mathbf {X}}_t]\), \(t \in I\) (p. 362)
  • \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})\): the empirical estimator of \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})\) (p. 364)

Topologies
  • \(\mathcal {T}^{(r)}\): the adapted weak topology of rank r (p. 343)
  • \(\mathcal {T}\): the adapted weak topology (p. 344)

Signatures
  • \(\Phi _{\text{Sig}}\): the signature on \(I \to \mathbb {R}^d\) (p. 349)
  • \(\Phi ^{(r)}_{\text{Sig}}\): the rank-r signature on \(M^{(r)}\) (p. 352)
  • \(d^{(r)}_{\text{Sig}}\): the rank-r signature metric (p. 354)
  • \( {\mathbf{k}}_{\text{Sig}}\): the signature kernel (p. 355)
  • \({\mathbf {k}}^{(r)}_{\text{Sig}}\): the rank-r signature kernel (p. 355)
  • \(\mathcal {D}^{(r)}_{\text{Sig}}\): the rank-r signature kernel MMD (p. 356)
  • \(\hat {\mathcal {D}}^{(1)}_{\text{Sig}}\): the empirical estimator of \(\mathcal {D}^{(1)}_{\text{Sig}}\) (p. 364)

1.1 Model Sensitivity

In the past 30 years, there has been substantial progress in our understanding of model uncertainty in finance, starting with the seminal articles by Lyons [60], Avellaneda et al. [8] and Soner et al. [75] on markets with uncertain volatilities, and by Hobson [44] and Cont [32] on robust hedging and pricing. Given the numerous contributions to the field in recent years, we cannot do justice to all, but refer readers to Beiglböck et al. [20], Galichon et al. [41] and Bouchard and Nutz [29], and the references therein. Parametric model sensitivity was studied, for example, by Kramkov and Sîrbu [54] and Kardaras and Žitković [50]. More recently, Backhoff et al. [13] shed new light on model sensitivity in finance by suggesting a non-parametric analysis based on the recent notion of adapted Wasserstein distance. We will loosely follow their motivation:
Models are based on data and observations that are naturally noisy and incomplete. For this reason, a model can at best be a faithful approximation of the modeled scenario, and optimal actions based on this model may not be optimal in reality. Consequently, it becomes crucial to gauge the sensitivity of decisions based on slightly erroneous models. In other words, one would like to get a grasp on the following question.
(Q)
How well do decision strategies derived from a slightly erroneous model perform when applied to reality?
 
However, Question (Q) is not yet well-posed, because we have not clarified when two models should be considered as ‘close’. Consequently, we need a suitable notion of proximity of models. In our setting, this translates to finding an appropriate notion of when two filtered processes (cf. Definition 1.1) \(\mathbf {X}, \mathbf {Y}\) are close. However, before comprehending proximity, it is essential to understand when two filtered processes should be considered equal.

1.2 The Weak Topology

Two random variables \(X,Y\) taking values in \(\mathbb R^d\), that are not necessarily supported on the same probability space, share the same probabilistic properties if their laws coincide. At the same time, a reasonable notion of proximity is provided by the ubiquitous topology of weak convergence of measures. Let \(C_b(\mathbb R^d)\) be the set of continuous bounded functions on \(\mathbb R^d\). We say that X and Y have the same distribution if
$$\displaystyle \begin{aligned} \mathbb E[f(X)] = \mathbb E[f(Y)] \quad \text{for all }f \in C_b(\mathbb R^d). \end{aligned} $$
(2)
A sequence \((X^n)_{n \in \mathbb N}\) of \(\mathbb R^d\)-valued random variables is said to converge weakly to a random variable X if
$$\displaystyle \begin{aligned} {} \mathbb E[f(X^n)] \to \mathbb E[f(X)]\quad \text{for all }f \in C_b(\mathbb R^d). \end{aligned} $$
(3)
There are numerous ways of metrizing weak convergence on the set \(\mathcal P(\mathbb R^d)\) of probability distributions on \(\mathbb R^d\), such as the Prokhorov metric, Wasserstein distances or maximum mean discrepancies. Throughout this article, we endow spaces of probability measures with the topology of weak convergence.
In a first attempt, one could try to identify filtered processes that share the same law, and say that two filtered processes are close if their laws are similar. This is rather unsatisfactory, since we completely neglect any information that is encoded in the filtrations, which is inadequate for our purposes, cf. Example 1.2. But even when considering only filtered processes \(\mathbf {X}\) whose filtration \((\mathcal {F}^{\mathbf {X}}_t)_{t \in I}\) is the natural filtration generated by the process X itself, proximity w.r.t. weak convergence of measures is not the correct notion in view of finding an answer to Question (Q). In order to stress the shortcomings of the usual weak topology when employed in our setting, we provide a simple example.
Example 1.3
The optimal stopping problem is perhaps one of the simplest information dependent optimization problems that is encountered on financial markets. Let \((\Omega ,\mathcal {F},\mathbb {P})\), \((\mathcal {F}_t)_{t = 1}^2\), \((\mathcal G_t)_{t = 1}^2\) and X be given as in Example 1.2. For \(\varepsilon > 0\) we consider the simplistic scenarios
$$\displaystyle \begin{aligned} \mathbf{X} :=& \left( \Omega, \mathcal{F}, \mathbb{P}, (\mathcal{F}_t)_{t \in I}, X \right), \\ {\mathbf{X}}^\varepsilon :=& \left( \Omega, \mathcal{F}, \mathbb{P}, (\mathcal G_t)_{t \in I}, X^\varepsilon \right), \end{aligned} $$
where the process \(X^\varepsilon \) is given by
$$\displaystyle \begin{aligned} X_1^\varepsilon(\omega) := \begin{cases} \varepsilon & \omega = \omega_1, \\ -\varepsilon & \omega = \omega_2, \end{cases} \quad \text{and}\quad X_2^\varepsilon(\omega) := X_2(\omega), \end{aligned} $$
see Fig. 1 below. Clearly, we have
$$\displaystyle \begin{aligned} \mathbb E[f(X^\varepsilon)] = \frac 12 \left( f((-\varepsilon,-1)) + f((\varepsilon,1)) \right) \to \frac 12 \left( f((0,-1)) + f((0,1)) \right) = \mathbb E[f(X)], \end{aligned}$$
Fig. 1: Two processes with similar law but different probabilistic properties. A similar illustration can also be found in [13, Figure 1] and [28, Figure 1]
while the optimal stopping values
$$\displaystyle \begin{aligned} \sup_{\tau \in \mathrm{ST}({\mathbf{X}}^\varepsilon)} \mathbb E[X_\tau^\varepsilon] = \frac 12 (1 - \varepsilon) \to \frac 12 > 0 = \sup_{\tau \in \mathrm{ST}(\mathbf{X})} \mathbb E[X_\tau], \end{aligned}$$
for \(\varepsilon \searrow 0 \), where \(\mathrm {ST}(\mathbf {X})\) denotes the set of \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-stopping times.
Example 1.3 is a very simple example that illustrates the core of the problem. Even though \(X^\varepsilon \to X\) in distribution, the associated filtered processes \({\mathbf {X}}^\varepsilon \) are, from a probabilistic point of view, very different from \(\mathbf {X}\): while X is a martingale on the filtered probability space of \(\mathbf {X}\), the processes \((X^\varepsilon )_{\varepsilon \ge 0}\) are predictable on their respective filtered probability spaces. Of course, one could easily consider more sophisticated examples, also in continuous time. For example, it is possible to approximate a Brownian motion in law by “deterministic” processes whose paths are completely known at the initial time, and observe the very same phenomenon; see for example [6, Chapter 4, Section 11].
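The discontinuity of the optimal stopping value along \(X^\varepsilon \to X\) can be reproduced by brute force on the two-point space. The following sketch (our own, with \(\varepsilon = 0.01\)) enumerates the stopping times admissible under each filtration:

```python
import itertools

eps = 0.01
# State space {omega_1, omega_2} with P uniform; tuples are (X_1, X_2).
paths_eps = {"omega_1": (eps, 1.0), "omega_2": (-eps, -1.0)}   # X^eps
paths     = {"omega_1": (0.0, 1.0), "omega_2": (0.0, -1.0)}    # X

def value(paths, taus):
    """Best expected stopped value over the admissible stopping times."""
    return max(0.5 * sum(x[tau[w] - 1] for w, x in paths.items())
               for tau in taus)

# Under the full-information filtration G, tau(omega) may differ per state:
taus_G = [dict(zip(paths_eps, c)) for c in itertools.product((1, 2), repeat=2)]
# Under F, F_1 is trivial, so tau must be deterministic:
taus_F = [{"omega_1": t, "omega_2": t} for t in (1, 2)]

print(value(paths_eps, taus_G))  # 0.5 * (1 - eps) = 0.495
print(value(paths, taus_F))      # 0.0
```

The optimal rule for \(X^\varepsilon\) stops at time 2 in state \(\omega_1\) and at time 1 in state \(\omega_2\), recovering \(\tfrac 12 (1 - \varepsilon)\), while every admissible rule for \(\mathbf{X}\) has value 0.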

1.3 The Prediction Process

Multiple mathematicians have addressed the shortcomings presented in Sect. 1.2 and worked on ways of refining the weak topology so as to additionally incorporate the information encoded in the filtration. Notable contributions stem from the works of Aldous [6] and Hoover–Keisler [46]. For an in-depth discussion of the multitude of related works, we refer the interested reader to Sect. 4.
From now on, for \(I = \{1,\ldots ,T\}\) and any set A, we will use the symbol \(I \to A\) to denote the product space \(A^T = A \times \ldots \times A\) (T copies of A), which is the space of paths defined on I taking values in A.
Inspired by earlier work of Knight [53], Aldous defines the prediction process \(\mathrm {pp}(\mathbf {X})\) of a filtered process \(\mathbf {X}\) as the measure-valued process given by
$$\displaystyle \begin{aligned} \mathrm{pp}_t(\mathbf{X}) := \mathbb{P}^{\mathbf{X}}( X \in \cdot \, | \mathcal{F}_t^{\mathbf{X}}), \quad t \in I. \end{aligned} $$
(4)
We have that \(\mathrm {pp}(\mathbf {X})\) takes values in \(I \to \mathcal P(I \to \mathbb R^d)\). As the name suggests, \(\mathrm {pp}_t(\mathbf {X})\) is the predicted law of the process given the information at time t. This process can be used to define an equivalence relation that Aldous dubs synonymity. Two filtered processes \(\mathbf {X}\) and \(\mathbf {Y}\) are called synonymous if \(\mathrm {pp}(\mathbf {X})\) and \(\mathrm {pp}(\mathbf {Y})\) have the same distribution, i.e.,
$$\displaystyle \begin{aligned} \mathbb E_{\mathbf{X}}[f(\mathrm{pp}(\mathbf{X}))] = \mathbb E_{\mathbf{Y}}[f(\mathrm{pp}(\mathbf{Y}))] \quad \text{for all } f \in C_b(I \to \mathcal P(I \to \mathbb R^d)). \end{aligned} $$
(5)
A sequence \(({\mathbf {X}}^n)_{n \in \mathbb N}\) of filtered processes is said to converge in the extended weak topology to a filtered process \(\mathbf {X}\) if
$$\displaystyle \begin{aligned} \mathbb E_{{\mathbf{X}}^{\mathbf{n}}}[f(\mathrm{pp}({\mathbf{X}}^n))] \to \mathbb E_{\mathbf{X}}[f(\mathrm{pp}(\mathbf{X}))]\quad \text{for all }f \in C_b(I\to\mathcal P(I \to \mathbb R^d)). \end{aligned} $$
(6)
Remark 1.4
We have the following observations of topological nature:
1.
It is straightforward to see that the extended weak topology is finer than the topology of convergence in law of the underlying processes. Indeed, any function \(g \in C_b(I \to \mathbb R^d)\) has a representative \(f \in C_b(I \to \mathcal P(I \to \mathbb R^d))\) such that
$$\displaystyle \begin{aligned} \mathbb E[g(X)] = \mathbb E[f(\mathrm{pp}(\mathbf{X}))]\quad \text{for all filtered processes }\mathbf{X}, \end{aligned}$$
by simply setting \(f(\rho ):= \int g(x) \, d\rho _t(x)\) for every \(\rho \in (I \to \mathcal P(I \to \mathbb R^d))\) and any \(t \in I\), because we have by the tower property
$$\displaystyle \begin{aligned} \mathbb{E}[f(\mathrm{pp}(\mathbf{X}))] = \mathbb{E}\bigg[\int g(x) \, d\mathrm{pp}(\mathbf{X})_t(x)\bigg] = \mathbb{E}[\mathbb{E}[g(X)|\mathcal{F}_t]] = \mathbb E[g(X)]. \end{aligned}$$
 
2.
Consider the filtered processes \(({\mathbf {X}}^\varepsilon )_{\varepsilon \ge 0}\) and \(\mathbf {X}\) from Example 1.3 and their prediction processes. Then a simple computation shows that \(\mathrm {pp}(\mathbf {X}) = (\mathrm {pp}_1(\mathbf {X}), \mathrm {pp}_2(\mathbf {X}))\) satisfies
$$\displaystyle \begin{aligned} \mathrm{pp}_1(\mathbf{X})(\omega_i) = \frac 12 \left( \delta_{(0,-1)} + \delta_{(0,1)} \right) \text{ and } \mathrm{pp}_2(\mathbf{X})(\omega_i) = \delta_{X(\omega_i)}, \quad i=1,2; \end{aligned} $$
where \(\delta _{X(\omega _1)} = \delta _{(0,1)}\) and \(\delta _{X(\omega _2)} = \delta _{(0,-1)}\). Similarly, as \(\mathcal {G}_1 = \mathcal {G}_2 = \mathcal {F}\), we can show that \(\mathrm {pp}({\mathbf {X}}^\varepsilon ) = (\mathrm {pp}_1({\mathbf {X}}^\varepsilon ), \mathrm {pp}_2({\mathbf {X}}^\varepsilon ))\) satisfies that
$$\displaystyle \begin{aligned} \mathrm{pp}_1({\mathbf{X}}^\varepsilon)(\omega_i) = \mathrm{pp}_2({\mathbf{X}}^\varepsilon)(\omega_i) = \delta_{X^\varepsilon(\omega_i)}, \quad i=1,2; \end{aligned} $$
where \(\delta _{X^\varepsilon (\omega _1)} = \delta _{(\varepsilon ,1)}\) and \(\delta _{X^\varepsilon (\omega _2)} = \delta _{(-\varepsilon ,-1)}\). Hence, \(\mathrm {pp}({\mathbf {X}}^\varepsilon ) \to \mathrm {pp}({\mathbf {X}}^0)\) in distribution, while the laws of \(\mathrm {pp}(\mathbf {X})\) and \(\mathrm {pp}({\mathbf {X}}^0)\) differ. Therefore, Example 1.3 provides an example of a sequence that converges weakly but not with respect to the extended weak topology.
 
3.
Typical functions in \(C_b(I\to \mathcal P(I \to \mathbb R^d))\) can be built as follows: for each \(t \in I = \{1,\ldots ,T\}\), we choose a function \(g_t \in C_b(I \to \mathbb {R}^d)\); then, given a measure-valued path \(\rho = (\rho _t)_{t \in I} \in I \to \mathcal P(I \to \mathbb R^d)\), we obtain an \(\mathbb {R}^T\)-valued vector by testing each measure \(\rho _t \in \mathcal P(I \to \mathbb R^d)\) against the function \(g_t \in C_b(I \to \mathbb {R}^d)\):
$$\displaystyle \begin{aligned} \bigg(\int g_1(x) d\rho_1(x), \ldots, \int g_T(x) d\rho_T(x)\bigg) \in \mathbb{R}^T. \end{aligned}$$
Finally, picking a continuous bounded function \(h: \mathbb {R}^T \to \mathbb {R}\), we define
$$\displaystyle \begin{aligned} f(\rho) = h \circ \bigg(\int g_1(x) d\rho_1(x), \ldots, \int g_T(x) d\rho_T(x)\bigg), \end{aligned}$$
and get an element \(f \in C_b(I\to \mathcal P(I \to \mathbb R^d))\). In fact these functions are used in [46] to study the adapted topologies.
As a concrete example, assume \(T = 2\) and \(d = 1\); in this case \((I \to \mathbb {R}^d) = \mathbb {R}^2\) and \((I \to \mathcal {P}(I \to \mathbb {R}^d)) = \mathcal {P}(\mathbb {R}^2) \times \mathcal {P}(\mathbb {R}^2)\). We now take \(g_1(x_1,x_2) = [(x_1 + x_2)\wedge 2]\vee (-2) \), \(g_2(x_1,x_2) = \sin {}(x_1 - x_2)\) and \(h(x_1,x_2) = \exp (-|x_1| -x_2^2)\) to define a function \(f \in C_b(I\to \mathcal P(I \to \mathbb R^d))\) as above. Then for any filtered process \(\mathbf {X}\) on I taking values in \([-1,1]\) (so that the truncation in \(g_1\) is inactive), we have, as \(\mathrm {pp}(\mathbf {X})\) takes values in \(\mathcal {P}(\mathbb {R}^2) \times \mathcal {P}(\mathbb {R}^2)\),
$$\displaystyle \begin{aligned} f(\mathrm{pp}(\mathbf{X})) = \exp(-|\mathbb{E}[X_1 + X_2|\mathcal{F}_1]| - \mathbb{E}[\sin{}(X_1 - X_2)|\mathcal{F}_2]^2), \end{aligned}$$
and in this case
$$\displaystyle \begin{aligned} \mathbb{E}[f(\mathrm{pp}(\mathbf{X}))] = \mathbb{E}[\exp(-|\mathbb{E}[X_1 + X_2|\mathcal{F}_1]| - \mathbb{E}[\sin{}(X_1 - X_2)|\mathcal{F}_2]^2)]. \end{aligned}$$
We leave it as a simple exercise for curious readers to check that, for the filtered processes \(\mathbf {X}\) and \({\mathbf {X}}^\varepsilon \) from Example 1.3 and the above function f,
$$\displaystyle \begin{aligned} \mathbb{E}[f(\mathrm{pp}(\mathbf{X}))] &= \mathbb{E}[\exp(-|\mathbb{E}[X_1 + X_2|\mathcal{F}_1]| -\mathbb{E}[\sin{}(X_1 - X_2)|\mathcal{F}_2]^2)] \\ &=\exp(-\sin^2(1)) \end{aligned} $$
and
$$\displaystyle \begin{aligned} \mathbb{E}[f(\mathrm{pp}({\mathbf{X}}^\varepsilon))] &= \mathbb{E}[\exp(-|\mathbb{E}[X^\varepsilon_1 + X^\varepsilon_2|\mathcal{G}_1]| -\mathbb{E}[\sin{}(X^\varepsilon_1 - X^\varepsilon_2)|\mathcal{G}_2]^2)] \\ &=\exp(-(1+\varepsilon) - \sin^2(1-\varepsilon)), \end{aligned} $$
so that \(\lim _{\varepsilon \to 0} \mathbb {E}[f(\mathrm {pp}({\mathbf {X}}^\varepsilon ))] \neq \mathbb {E}[f(\mathrm {pp}(\mathbf {X}))]\).
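The two expectations can be checked numerically on the two-point space. The following sketch (our own) hard-codes the conditional expectations that are available in closed form in this example, trivial \(\mathcal F_1\) versus fully informative \(\mathcal G_1\), with \(\mathcal F_2 = \mathcal G_2\) the full \(\sigma\)-algebra:

```python
import math

def E_f_pp(paths, full_info_at_1):
    """E[f(pp(X))] on Omega = {omega_1, omega_2} with P uniform, for f built
    from g1(x) = x1 + x2 (truncation inactive here), g2(x) = sin(x1 - x2)
    and h(a, b) = exp(-|a| - b**2); the time-2 sigma-algebra is full."""
    xs = list(paths.values())
    Eg1 = 0.5 * sum(x[0] + x[1] for x in xs)  # E[g1(X)] under a trivial F_1
    total = 0.0
    for x in xs:
        a = (x[0] + x[1]) if full_info_at_1 else Eg1  # E[g1(X) | F_1](omega)
        b = math.sin(x[0] - x[1])                     # E[g2(X) | F_2](omega)
        total += 0.5 * math.exp(-abs(a) - b * b)
    return total

paths   = {"omega_1": (0.0, 1.0), "omega_2": (0.0, -1.0)}
eps = 1e-3
paths_e = {"omega_1": (eps, 1.0), "omega_2": (-eps, -1.0)}

vX  = E_f_pp(paths, full_info_at_1=False)   # exp(-sin(1)**2)
vXe = E_f_pp(paths_e, full_info_at_1=True)  # exp(-(1+eps) - sin(1-eps)**2)
print(vX, vXe)
```

The outputs match the closed forms \(\exp(-\sin^2(1))\) and \(\exp(-(1+\varepsilon) - \sin^2(1-\varepsilon))\) derived above, making the gap in the limit visible.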
 
Probabilistic properties, such as being a martingale or being predictable, play an important role in stochastic analysis. Many of these properties are already captured by the prediction process.
Remark 1.5
Let \(\mathbf {X}\) and \(\mathbf {Y}\) be filtered processes that are synonymous, then we have:
(a)
if X is a \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-martingale, then Y  is a \((\mathcal {F}_t^{\mathbf {Y}})_{t \in I}\)-martingale;
 
(b)
if X is \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-predictable, then Y  is \((\mathcal {F}_t^{\mathbf {Y}})_{t \in I}\)-predictable;
 
(c)
if X is a \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-Markov process, then Y  is a \((\mathcal {F}_t^{\mathbf {Y}})_{t \in I}\)-Markov process.
 
To strengthen the intuition, we check the second claim; the other two properties can be shown by similar reasoning. Recall that X is \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-predictable if \(X_{t + 1}\) is \(\mathcal {F}_t^{\mathbf {X}}\)-measurable for all \(t\in I\) with \(t < T\). Equivalently, we can express this property in terms of the prediction process:
$$\displaystyle \begin{aligned} {} \delta_{X_{t + 1}} = \mathrm{law}(X_{t + 1} | \mathcal{F}_t^{\mathbf{X}}) = (e_{t+1})_\# \mathrm{law}(X | \mathcal{F}_t^{\mathbf{X}}) = (e_{t+1})_\# \mathrm{pp}_t(\mathbf{X}), \end{aligned} $$
(7)
where \(\delta _x\) denotes the Dirac measure at \(x \in \mathbb R^d\) and \((e_s)_\#\) is the pushforward of a measure with respect to the time-s evaluation map \(e_s : (\mathbb R^d)^I \to \mathbb R^d\). Therefore, X is \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-predictable if and only if \((e_{t+1})_\# \mathrm {pp}_t(\mathbf {X})\) takes values in the set of Dirac measures. But the latter is a property that can be deduced from knowing just the law of \(\mathrm {pp}(\mathbf {X})\). Since \(\mathbf {X}\) and \(\mathbf {Y}\) are synonymous, their prediction processes \(\mathrm {pp}(\mathbf {X})\) and \(\mathrm {pp}(\mathbf {Y})\) have the same distribution, and it is immediate from (7) that Y also has to be \((\mathcal {F}^{\mathbf {Y}}_t)_{t \in I}\)-predictable.
In fact, these properties are even closed under convergence in the extended weak topology, with the exception of the martingale property in Item (a); the latter is preserved as well if we additionally assume uniform integrability of the sequence of stochastic processes. We refer readers to [17] for detailed proofs.
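For a finite example, the characterization of predictability can be verified directly: the following sketch (our own encoding, filtrations as partitions of a finite state space) computes the conditional one-step-ahead law \(\mathrm{law}(X_{t+1} \mid \mathcal{F}_t)\) for the processes of Example 1.3 and checks whether it is a Dirac measure:

```python
from collections import Counter
from fractions import Fraction

def one_step_ahead_laws(paths, partition, t):
    """law(X_{t+1} | F_t) on a finite space with uniform P: one conditional
    law per cell of the partition generating F_t (1-based time t)."""
    laws = []
    for cell in partition:
        c = Counter(paths[w][t] for w in cell)  # 0-based index t = time t+1
        laws.append({v: Fraction(n, len(cell)) for v, n in c.items()})
    return laws

def predictable_at(paths, partition, t):
    # Dirac conditional laws <=> X_{t+1} is measurable w.r.t. F_t.
    return all(len(law) == 1 for law in one_step_ahead_laws(paths, partition, t))

eps = 0.25
paths = {"omega_1": (eps, 1.0), "omega_2": (-eps, -1.0)}  # X^eps from Ex. 1.3

trivial = [["omega_1", "omega_2"]]       # F_1 = {emptyset, Omega}
full    = [["omega_1"], ["omega_2"]]     # G_1 = 2^Omega

print(predictable_at(paths, full, 1))    # True: X^eps_2 is G_1-measurable
print(predictable_at(paths, trivial, 1)) # False: law(X_2 | F_1) is not Dirac
```

Being Dirac-valued is preserved under equality of laws of the prediction process, which is exactly the argument used in the proof above.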
Remark 1.5 suggests that synonymity may be the adequate notion of equivalence for filtered processes. But even though the prediction process rectifies many problems (see also the discussion in Sect. 4), it is still not fine enough to identify the value of sequential decision-making problems when admitting processes with general filtrations. We will demonstrate this point by means of optimal stopping and stress that similar examples can be constructed for most sequential decision-making problems.
Example 1.6
Let \(U,V\) be independent fair coin flips with \(\mathbb {P}(U =1) = \mathbb {P}(U=0) = \mathbb {P}(V=1) = \mathbb {P}(V=0) = 1/2\), let \(I = \{1,2,3\}\), and let X be given by
$$\displaystyle \begin{aligned} X := (X_t)_{t \in I} := (0.6,0.4,V) \in \mathbb R^3. \end{aligned}$$
Consider the two filtrations \((\mathcal {F}_t)_{t \in I}\) and \((\mathcal G_t)_{t \in I}\) where
$$\displaystyle \begin{gathered} \mathcal{F}_1 := \sigma(U), \;\, \mathcal{F}_2 := \sigma\left( U, \{U=1,V=0\}, \{U=1,V=1\} \right),\;\; \mathcal{F}_3 := \sigma(U,V), \\ \mathcal G_1 := \{ \emptyset, \Omega \}, \quad \mathcal G_2 := \mathcal{F}_2, \quad \mathcal G_3 := \mathcal{F}_3, \end{gathered} $$
and define the filtered processes
$$\displaystyle \begin{aligned} \mathbf{X} := \left( \Omega, \mathcal{F}, \mathbb P, (\mathcal{F}_t)_{t \in I}, X \right) \quad \text{and}\quad \mathbf{Y} := \left( \Omega, \mathcal{F}, \mathbb P, (\mathcal G_t)_{t \in I}, X \right). \end{aligned}$$
Observe that \(\mathcal {F}_1\) encodes at time 1 whether or not we will know the value of \(X_3\) at time 2, while \(\mathcal G_1\) contains no information. The prediction process is ignorant of this kind of “higher-order” information. In fact, computing the prediction processes at time 1 we get
$$\displaystyle \begin{aligned} \mathrm{pp}_1(\mathbf{X}) = \mathrm{law}( X | \mathcal{F}_1) = \mathrm{law}( X | U ) = \mathrm{law}(X) = \mathrm{law}( X | \mathcal G_1) = \mathrm{pp}_1(\mathbf{Y}), \end{aligned} $$
where we used the independence of U and V. Since moreover \(\mathcal {F}_t = \mathcal G_t\) for \(t = 2,3\), we have shown that \(\mathrm {pp}(\mathbf {X}) = \mathrm {pp}(\mathbf {Y})\). On the other hand, computing the optimal stopping values, we find
$$\displaystyle \begin{aligned} \sup_{\tau \in \mathrm{ST}(\mathbf{X})} \mathbb E[X_\tau] = 0.65 > 0.6 =\sup_{\tau \in \mathrm{ST}(\mathbf{Y})} \mathbb E[X_\tau]. \end{aligned} $$
We leave the verification of the above inequality as an exercise to curious readers by checking that \(\tau ^{\mathbf {X}} := {\mathbf {1}}_{\{U = 0\}} + 2 {\mathbf {1}}_{\{U = 1, V=0\}} + 3 {\mathbf {1}}_{\{U = 1, V=1\}}\) and \(\tau ^{\mathbf {Y}} := 2 {\mathbf {1}}_{\{U = 1, V=0\}} + 3 {\mathbf {1}}_{\{U = 1, V=1\}} + 3{\mathbf {1}}_{\{U = 0\}}\) are optimal stopping times for \(\sup _{\tau \in \mathrm {ST}(\mathbf {X})} \mathbb E[X_\tau ]\) and \(\sup _{\tau \in \mathrm {ST}(\mathbf {Y})} \mathbb E[X_\tau ]\), respectively.
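Since \(\Omega\) has only four points, one can also enumerate all adapted stopping times by brute force. The following sketch (our own encoding of the two filtrations as partitions of \(\Omega\)) reproduces the values 0.65 and 0.6:

```python
import itertools

# Outcomes (u, v) of the two independent fair coin flips, P uniform.
omega = [(0, 0), (0, 1), (1, 0), (1, 1)]
X = {w: (0.6, 0.4, float(w[1])) for w in omega}  # X = (0.6, 0.4, V)

# Filtrations encoded as partitions of Omega at t = 1, 2, 3.
F = [[[(0, 0), (0, 1)], [(1, 0), (1, 1)]],        # F_1 = sigma(U)
     [[(0, 0), (0, 1)], [(1, 0)], [(1, 1)]],      # F_2
     [[w] for w in omega]]                        # F_3 = sigma(U, V)
G = [[list(omega)], F[1], F[2]]                   # G_1 trivial

def is_stopping_time(tau, filtration):
    # {tau = t} must be a union of atoms of F_t, i.e. the indicator of
    # {tau = t} is constant on every atom of the partition at time t.
    return all(len({tau[w] == t for w in atom}) == 1
               for t, part in enumerate(filtration, start=1)
               for atom in part)

def optimal_value(filtration):
    best = float("-inf")
    for assignment in itertools.product((1, 2, 3), repeat=len(omega)):
        tau = dict(zip(omega, assignment))
        if is_stopping_time(tau, filtration):
            best = max(best, sum(X[w][tau[w] - 1] for w in omega) / 4)
    return best

print(round(optimal_value(F), 10))  # 0.65
print(round(optimal_value(G), 10))  # 0.6
```

The maximizers found by the search coincide with \(\tau^{\mathbf{X}}\) and \(\tau^{\mathbf{Y}}\) as given above.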

1.4 Higher-Rank Prediction Processes

We have seen that Aldous’ notion of synonymity does not capture all probabilistic properties of a filtered process; in particular, it is agnostic to higher-order information as described in Example 1.6. Motivated by this observation, Hoover–Keisler [46] define prediction processes of higher rank. For a rank \(r \in \mathbb N\), the rank-r prediction process \(\mathrm {pp}^{(r)}(\mathbf {X})\) of a filtered process \(\mathbf {X}\) is inductively defined by
$$\displaystyle \begin{aligned} {} \mathrm{pp}^{(0)}(\mathbf{X}) := X, \quad \text{and}\quad \mathrm{pp}^{(r)}_t(\mathbf{X}) := \mathrm{law} ( \mathrm{pp}^{(r - 1)}(\mathbf{X})| \mathcal{F}_t^{\mathbf{X}} )\quad \text{for }t \in I. \end{aligned} $$
(8)
Schematically, we can think of the path space of the rank-r prediction process as
$$\displaystyle \begin{aligned} M^{(0)} &:= (I \to \mathbb{R}^d), \\ M^{(1)} &:= (I \to \mathcal{P}(I \to \mathbb{R}^d)), \\ &\;\;\vdots \\ M^{(r)} &:= (I \to \mathcal{P}(\underbrace{(I \to \mathcal{P}( \ldots (I \to \mathbb{R}^d) \ldots ))}_{M^{(r-1)}})). \end{aligned}$$
The space \(M^{(r)}\) is constructed by alternately applying the following two operations:
  • mapping a Polish space S to the set of Borel probability measures \(\mathcal P(S)\) on S;
  • taking the product \((I \to S) = S^I\) of a Polish space S.
As both operations preserve the property of being Polish, it is immediate that \(M^{(r)}\) is a Polish space, which mathematically justifies the inductive construction in (8). Two filtered processes \(\mathbf {X}\) and \(\mathbf {Y}\) have the same rank-r adapted distribution if
$$\displaystyle \begin{aligned} {} \mathbb E[f(\mathrm{pp}^{(r)}(\mathbf{X}))] = \mathbb E[f(\mathrm{pp}^{(r)}(\mathbf{Y}))]\quad \text{for all } f \in C_b(M^{(r)}). \end{aligned} $$
(9)
A sequence \(({\mathbf {X}}^n)_{n \in \mathbb N}\) of filtered processes is said to converge in the adapted weak topology of rank r to a filtered process \(\mathbf {X}\) if
$$\displaystyle \begin{aligned} {} \mathbb E[f(\mathrm{pp}^{(r)}({\mathbf{X}}^n))] \to \mathbb E[f(\mathrm{pp}^{(r)}(\mathbf{X}))] \quad \text{for all }f \in C_b(M^{(r)}). \end{aligned} $$
(10)
Finally, the adapted distribution of a filtered process \(\mathbf {X}\) is the law of \(\mathrm {pp}^{(\infty )}(\mathbf {X})\) given by
$$\displaystyle \begin{aligned} {} \mathrm{pp}_t^{(\infty)}(\mathbf{X}) := (\mathrm{pp}_t^{(0)}(\mathbf{X}), \mathrm{pp}_t^{(1)}(\mathbf{X}), \dots) \quad \text{for all } t \in I, \end{aligned} $$
(11)
where \(\mathrm {pp}^{(\infty )}(\mathbf {X}) = (\mathrm {pp}^{(\infty )}_t(\mathbf {X}))_{t \in I}\) takes values in
$$\displaystyle \begin{aligned} M^{(\infty)} := (I \to (\mathbb{R}^d \times \mathcal P (M^{(0)}) \times \mathcal P(M^{(1)}) \times \dots )). \end{aligned}$$
Before proceeding, we remark that in the present finite discrete-time setting with T time steps, it suffices to consider the prediction process of rank \(T - 1\) instead of the whole adapted distribution. This property is discussed in more detail in Remark 1.7 below; the reader may therefore choose to ignore the case \(r = \infty \) and read it instead as \(r = T - 1\). We say that a sequence \(({\mathbf {X}}^n)_{n \in \mathbb N}\) of filtered processes converges in the adapted weak topology to \(\mathbf {X}\) if (10) holds for \(r = \infty \). Hence, convergence in the adapted weak topology is precisely convergence in distribution of the corresponding adapted distributions. The class \(\mathcal {F}\mathcal {P}\) of all filtered processes is a proper class containing a lot of redundancy. As a next step, we use the equivalence relation \(\sim _{\infty }\) induced by the adapted distribution to put filtered processes with the “same probabilistic properties” into a single equivalence class. We define the factor space of filtered processes
$$\displaystyle \begin{aligned} \text{FP} := \mathcal{F}\mathcal{P} /_{\sim_{\infty}}. \end{aligned} $$
(12)
After factorization we end up with the set \(\text{FP}\), on which we can define the adapted weak topology of rank \(r \in \mathbb N \cup \{ 0, \infty \}\), denoted by \(\mathcal T^{(r)}\), whose mode of convergence is described by (10). In analogy to working with \(L_p\)-spaces, we will for the most part not distinguish between a single filtered process and its \(\sim _\infty \)-equivalence class. Moreover, we write \(\mathcal T := \mathcal T^{(\infty )}\) for the adapted weak topology.
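In finite discrete time, the recursion (8) can be carried out by hand on a finite probability space. The following toy sketch in the spirit of Example 1.6 (the two-outcome process and all names are ours) exhibits two filtered processes with the same path law but different filtrations, whose rank-1 prediction processes differ:

```python
from fractions import Fraction

# Two filtered processes on Omega = {w1, w2}, P(w1) = P(w2) = 1/2, with T = 2.
# Both have the same path law: X = (0, +1) or (0, -1) with probability 1/2 each.
# Process A: F_1 is trivial (no information at time 1).
# Process B: F_1 is the full sigma-algebra (the outcome is known at time 1).
prob = {"w1": Fraction(1, 2), "w2": Fraction(1, 2)}
path = {"w1": (0, 1), "w2": (0, -1)}

def conditional_law(partition):
    """pp^(1)_1 = law(X | F_1), where F_1 is generated by `partition`:
    a map  outcome -> probability measure on the path space."""
    law = {}
    for atom in partition:
        mass = sum(prob[w] for w in atom)
        meas = {}
        for w in atom:
            meas[path[w]] = meas.get(path[w], Fraction(0)) + prob[w] / mass
        for w in atom:
            law[w] = meas
    return law

pp1_A = conditional_law([["w1", "w2"]])    # trivial F_1
pp1_B = conditional_law([["w1"], ["w2"]])  # fully informative F_1

# Same unconditional law at rank 0 ...
assert pp1_A["w1"] == {(0, 1): Fraction(1, 2), (0, -1): Fraction(1, 2)}
# ... but under B the time-1 conditional law is already a Dirac measure
# on the realized path, so the rank-1 prediction processes differ.
assert pp1_B["w1"] == {(0, 1): Fraction(1)}
assert pp1_A["w1"] != pp1_B["w1"]
```

This is exactly the mechanism by which \(\mathcal T^{(1)}\) separates processes that \(\mathcal T^{(0)}\) cannot.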
Remark 1.7
We compile some facts on \(\mathcal T^{(r)}\), the adapted weak topology of rank r:
(1)
Observe that \(\mathcal T^{(0)}\) corresponds to the usual weak topology, while \(\mathcal T^{(1)}\) is precisely Aldous’ extended weak topology and \(\mathcal T = \mathcal T^{(\infty )}\) is the adapted weak topology.
 
(2)
The topologies \(\mathcal T^{(r)}\) become finer as the rank r increases. We have
$$\displaystyle \begin{aligned} \mathcal T^{(0)} \subsetneq \mathcal T^{(1)} \subsetneq \cdots \subsetneq \mathcal T^{(T-1)} = \cdots = \mathcal T^{(\infty)}. \end{aligned}$$
Indeed, this can be seen as follows: For any filtered process \(\mathbf {X}\) we have
$$\displaystyle \begin{aligned} (e_t)_\# \mathrm{pp}^{(r + 1)}_t(\mathbf{X}) = (e_t)_\# \mathrm{law}(\mathrm{pp}^{(r)}(\mathbf{X}) | \mathcal{F}_t^{\mathbf{X}}) = \mathrm{law}(\mathrm{pp}^{(r)}_t(\mathbf{X}) | \mathcal{F}_t^{\mathbf{X}}) = \delta_{\mathrm{pp}^{(r)}_t(\mathbf{X})}, \end{aligned} $$
where \(e_t\) is the time-t evaluation map and \(\delta _x\) denotes a Dirac measure at \(x \in M^{(r)}\). The second equality is by definition of the push-forward operation and the last is due to \(\mathrm {pp}^{(r)}(\mathbf {X})\) being \((\mathcal {F}_t^{\mathbf {X}})_{t \in I}\)-adapted. Consequently, one can build a continuous map
$$\displaystyle \begin{aligned} {} \iota : M^{(r + 1)} \to M^{(r)} \end{aligned} $$
(13)
such that, for all \(\mathbf {X} \in \text{FP}\), \(\iota (\mathrm {pp}^{(r + 1)}(\mathbf {X})) = \mathrm {pp}^{(r)}(\mathbf {X})\) almost surely, which readily provides the inclusion \(\mathcal T^{(r)} \subseteq \mathcal T^{(r + 1)}\). For additional details, we refer to [17, Theorem 4.11 and Theorem 7.1].
 
(3)
As we consider the setting where \(I = \{1,\ldots ,T\}\) consists of \(T \in \mathbb N\) times, it is not necessary to consider prediction processes of arbitrary rank. In fact, we have by [17, Lemma 4.7 and Lemma 4.10] that, for every \(r \in \mathbb N\), there exists a continuous map
$$\displaystyle \begin{aligned} F : M^{(T-1)} \to M^{(r)} \end{aligned}$$
with the property that \(F(\mathrm {pp}^{(T-1)}(\mathbf {X})) = \mathrm {pp}^{(r)}(\mathbf {X})\) almost surely, for all \(\mathbf {X} \in \text{FP}\). It is then straightforward to see that \(\mathcal T^{(r)} = \mathcal T^{(T-1)} = \mathcal T\) for every \(r \ge T - 1\), which means that all the necessary information is contained in the rank-\((T-1)\) prediction process.
 
(4)
In fact, as long as \(r < T - 1\) the inclusion \(\mathcal T^{(r)} \subseteq \mathcal T^{(r + 1)}\) is strict. This can be seen from the construction in Example 1.6, see also [46, Example 3.2] and [17, Proposition 7.2].
 
(5)
For more detailed descriptions and examples of \(\mathrm {pp}^{(r)} (\mathbf {X})\), \(M^{(r)}\) and \(\mathcal T^{(r)}\) we refer readers to [46] and [17]. We also recommend that readers focus on the case \(r=1\) on a first reading.
 
Theorem 1.8
The topological space\((\mathit{\text{FP}},\mathcal T)\)is Polish, that is, separable and completely metrizable.
A proof of Theorem 1.8 can be found in [17, Theorem 1.3] and [22, Theorem 1.5]. It is worth mentioning that, even though the filtered processes form a proper class, \((\text{FP},\mathcal T)\) is tractable in the sense that it is a Polish space and simple processes (for example, Markov processes living on finite probability spaces, cf. [17, Theorem 5.4]) are dense. The main purpose of the adapted weak topology is to finally have a topology that is sufficiently strong to guarantee continuity of sequential decision making problems, but which is not unnecessarily strong. We refer to Sect. 4 below for a more detailed discussion. To return to our introductory example, Example 1.3, we have continuity of optimal stopping.
Lemma 1.9 (Continuity of Optimal Stopping)
Let\(c : (I \to \mathbb R^d) \times I \to \mathbb R\)be continuous, bounded and non-anticipative.2Then we have continuity of the value function
$$\displaystyle \begin{aligned} v_c : (\mathit{\text{FP}},\mathcal T) \to \mathbb R : v_c(\mathbf{X}) := \sup_{\tau \in \mathrm{ST}(\mathbf{X})} \mathbb E[c(X,\tau)]. \end{aligned}$$
For illustrative purposes, we include a self-contained proof of this result.
Proof of Lemma 1.9, case \(T = 2\)
When there are only two time points \(I = \{1, 2\}\), optimal stopping boils down to deciding at time \(t = 1\) whether or not to stop. At time \(t = 1\), the gain from stopping immediately is \(c(X,1) = \mathbb E_{\mathbf {X}}[c(X,1)|\mathcal {F}_1^{\mathbf {X}}]\), while the expected profit from stopping at \(t = 2\) is \(\mathbb E_{\mathbf {X}}[ c(X,2) | \mathcal {F}_1^{\mathbf {X}}]\). Consequently, the value of the optimal stopping problem is given by
$$\displaystyle \begin{aligned} {} v_c(\mathbf{X}) = \mathbb E_{\mathbf{X}}[c(X,1) \vee \mathbb E_{\mathbf{X}}[ c(X,2) | \mathcal{F}_1^{\mathbf{X}} ] ] = \mathbb E_{\mathbf{X}}[F(\mathrm{pp}^{(1)}(\mathbf{X}))], \end{aligned} $$
(14)
where \(F \colon M^{(1)} \to \mathbb {R}\) is given by
$$\displaystyle \begin{aligned} F( \rho ) = \int c(x,1) \, d\rho_1(x) \vee \int c(y,2) \, d\rho_1(y), \end{aligned}$$
for \(\rho = (\rho _1,\rho _2) \in M^{(1)}\). As c is continuous and bounded, the same is true for F. Using the latter, the claim follows directly from the definition of \(\mathcal T\) and the representation given in (14). □
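Formula (14) is easy to evaluate numerically. The following toy sketch (the two-outcome data and the payoff \(c(x,t) = x_t\) are ours) computes \(v_c\) for two filtered processes with identical path laws but different time-1 information; the values differ, which illustrates why \(v_c\) cannot be continuous with respect to the plain weak topology:

```python
from fractions import Fraction

# Value of the two-step optimal stopping problem via formula (14):
#   v_c = E[ c(X,1) v E[c(X,2) | F_1] ],
# with payoff c(x, t) = x_t (non-anticipative, bounded on this finite space).
# Toy data: X = (0, +1) or (0, -1) with probability 1/2 each.
prob = {"w1": Fraction(1, 2), "w2": Fraction(1, 2)}
path = {"w1": (0, 1), "w2": (0, -1)}

def value(partition):
    """v_c when F_1 is generated by `partition`."""
    v = Fraction(0)
    for atom in partition:
        mass = sum(prob[w] for w in atom)
        stop_now = path[atom[0]][0]  # c(X,1) = X_1, here identically 0
        cont = sum(prob[w] * path[w][1] for w in atom) / mass  # E[c(X,2) | F_1]
        v += mass * max(stop_now, cont)
    return v

v_A = value([["w1", "w2"]])    # trivial F_1: cannot exploit the outcome, value 0
v_B = value([["w1"], ["w2"]])  # full F_1: stop at t = 2 only when X_2 = +1
assert (v_A, v_B) == (Fraction(0), Fraction(1, 2))
```

Both processes have the same law, yet \(v_c\) differs, so \(v_c\) is a function of the rank-1 prediction process but not of the law alone.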
Proof of Lemma 1.9, general \(T \in \mathbb N\)
The proof for general \(T \in \mathbb N\) is essentially the same as in the case \(T = 2\) but with an additional induction over the number of time steps. The value \(v_c(\mathbf {X})\) is given by the associated Snell envelope \((U_t^{\mathbf {X}})_{t \in I \cup \{ 0 \} }\) at time 0. This means that we have
$$\displaystyle \begin{aligned} v_c(\mathbf{X}) = U_0^{\mathbf{X}}, \end{aligned}$$
where the Snell envelope is given by the recursive scheme
$$\displaystyle \begin{aligned} U_T^{\mathbf{X}} := c(X,T) \quad \text{and}\quad U_t^{\mathbf{X}} := c(X,t) \vee \mathbb E_{\mathbf{X}}[U_{t + 1}^{\mathbf{X}} | \mathcal{F}_t^{\mathbf{X}}], \end{aligned}$$
for \(t \in I\), and \(U_0^{\mathbf {X}} = \mathbb E_{\mathbf {X}}[U_1^{\mathbf {X}}]\). Similarly, we define recursively
$$\displaystyle \begin{aligned} U_T^{\mathbf{X}} = \underbrace{c(\mathrm{pp}^{(0)}(\mathbf{X}), T)}_{=: F_T(\mathrm{pp}^{(0)}(\mathbf{X}))} \quad\text{and}\quad U_{T-t}^{\mathbf{X}} = \underbrace{c(\iota_t(\mathrm{pp}^{(t)}(\mathbf{X})), T-t) \vee \int F_{T-t+1} \, d\, \mathrm{pp}^{(t)}_{T-t}(\mathbf{X})}_{=: F_{T-t}(\mathrm{pp}^{(t)}(\mathbf{X}))}, \end{aligned}$$
where \(\iota _t : M^{(t)} \to M^{(0)}\), defined analogously to \(\iota \) in Remark 1.7 (2), is a continuous map with \(\iota _t(\mathrm {pp}^{(t)}(\mathbf {X})) = \mathrm {pp}^{(0)}(\mathbf {X}) = X\). Hence, \(F_{T - t} : M^{(t)} \to \mathbb R\) is continuous. In particular, we can construct a continuous map \(F_0 : \mathcal P (M^{(T-1)}) \to \mathbb R\) with
$$\displaystyle \begin{aligned} F_0(\mathrm{law}(\mathrm{pp}^{(T-1)}(\mathbf{X}))) := \mathbb E_{\mathbf{X}}[F_1(\mathrm{pp}^{(T-1)}(\mathbf{X}))] = U_0^{\mathbf{X}} = v_c(\mathbf{X}). \end{aligned}$$
From the definition of convergence in \(\mathcal T\), cf. (10), we conclude continuity of \(v_c\). □
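The backward recursion for the Snell envelope translates directly into code. A minimal sketch (the symmetric random walk, the put-style payoff and all parameters are ours; since the payoff depends only on the current state, the Markov structure lets \(U_t\) be computed nodewise on a binomial tree):

```python
# Backward induction for the Snell envelope:
#   U_T = c(X, T),   U_t = c(X, t) v E[U_{t+1} | F_t],   v_c = U_0 = E[U_1].
# Toy sketch: X is a symmetric +-1 random walk over I = {1, 2, 3}, with
# payoff c(x, t) = max(K - x_t, 0) (an American-put-style reward).
T, K = 3, 1

def snell(t, x):
    """Snell envelope U_t at a node with current state x (p = 1/2 branching)."""
    payoff = max(K - x, 0)
    if t == T:
        return payoff                                   # U_T = c(X, T)
    cont = 0.5 * snell(t + 1, x + 1) + 0.5 * snell(t + 1, x - 1)
    return max(payoff, cont)                            # c(X, t) v E[U_{t+1} | F_t]

# v_c = U_0 = E[U_1], averaging over the first step of the walk
v = 0.5 * snell(1, 1) + 0.5 * snell(1, -1)
```

For these parameters the recursion yields \(v = 1.25\); the same scheme runs on any finite tree once the conditional expectations are replaced by atomwise averages.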

2 Higher-Rank (Expected) Signatures

In the introduction we established that the higher-rank adapted weak topology \(\mathcal {T}^{(r)}\), \(r \ge 1\), on the space of filtered processes \(\text{FP}\) is finer than the weak topology (which corresponds to \(r = 0\)), and that it is essential for ensuring continuity of multi-stage optimization problems, such as the optimal stopping problems discussed in Example 1.2, which are in general discontinuous with respect to the weak topology. Therefore, it should come as no surprise that many theoretical tools which are valid for the weak topology no longer work in the higher-rank cases. For instance, the expectation of the classical signature feature, which characterizes the laws of stochastic processes by acting as a moment generating function for path-valued random variables, cannot distinguish two filtered processes that differ only in the rank-1 adapted weak topology. Motivated by these observations, we modify the classical signature feature accordingly and introduce a family of so-called higher-rank signatures that adequately capture the higher-rank adapted weak topologies whilst retaining the advantageous properties of the classical signature, both in theoretical and computational aspects. For technical simplicity, throughout the whole section we will implicitly assume that
(A1)
All filtered processes take values in a compact subset \(\mathcal {K} \subset \mathbb {R}^d\).
 

2.1 Properties of Higher-Rank (Expected) Signatures

In this subsection (which is based on [28, Sections 3,4]), we introduce the notion of higher-rank signatures, whose expectations can be used to distinguish filtered processes with respect to the higher-rank adapted weak topologies.
To provide the reader with sufficient intuition for these increasingly complex concepts, let us first disregard the time components and discuss the static case.
The Static Case
Since \(\mathcal {K} \subset \mathbb {R}^d\) is a compact subset, the Stone–Weierstrass theorem asserts that the space of all polynomials on \(\mathcal {K}\) is dense in \(C_b(\mathcal {K})\) in the topology of uniform convergence. Furthermore, a fundamental result in probability theory states that two \(\mathcal {K}\)-valued random variables \(X,Y\) have the same law if and only if, for all \(k\in \mathbb {N}\), they have the same k-th moment. Moreover, a sequence \((X^n)_{n \in \mathbb N}\) of \(\mathcal {K}\)-valued random variables converges to X in distribution if and only if, for all \(k \in \mathbb {N}\), their k-th moments converge. These well-known results can be reformulated in terms of the notation used in chapter “The Signature Kernel” as follows:
Consider the Hilbert space
$$\displaystyle \begin{aligned} {} H := \Big\{(a_0,a_1,a_2,\ldots) \in \prod_{k=0}^\infty (\mathbb{R}^d)^{\otimes k}: \bigg(\sum_{k=0}^\infty \|a_k\|^2\bigg)^{1/2} < \infty\Big\}, \end{aligned} $$
(15)
which is embedded in the tensor algebra \(T((\mathbb {R}^d)) := \prod _{k=0}^\infty (\mathbb {R}^d)^{\otimes k}\) and whose inner product is given by
$$\displaystyle \begin{aligned} \langle (a_0,a_1,a_2,\ldots), (b_0,b_1,b_2,\ldots)\rangle_H = \sum_{k=0}^\infty a_k \cdot b_k, \end{aligned}$$
where \(a_k \cdot b_k\) denotes the usual inner product of \(a_k\) and \(b_k\) in \((\mathbb {R}^d)^{\otimes k}\) which is obtained from the Euclidean inner product on \(\mathbb {R}^d\). Next, define the monomial map \(\Phi _{\text{Mon}}: \mathbb {R}^d \to H\) by
$$\displaystyle \begin{aligned} \Phi_{\text{Mon}}(x) := \Big(1,x, \frac{x^{\otimes 2}}{2 !}, \frac{x^{\otimes 3}}{3 !}, \ldots, \frac{x^{\otimes n}}{n !}, \ldots \Big), \end{aligned}$$
where \(x^{\otimes k} := (x_1^{m_1}\ldots x_d^{m_d})_{m_1+ \ldots + m_d =k, m_i \in \mathbb {N}_0} \in (\mathbb {R}^d)^{\otimes k}\) for a vector \(x = (x_1,\ldots ,x_d) \in \mathbb {R}^d\). Thanks to the Stone–Weierstrass theorem we have
$$\displaystyle \begin{gathered} \forall \varepsilon >0, \ \forall f \in C_b(\mathcal{K}), \ \exists \ell \in H : \sup_{x \in \mathcal{K}}|f(x) - \langle \ell, \Phi_{\text{Mon}}(x) \rangle_H|\le \varepsilon, \\ \text{law}(X) = \text{law}(Y) \iff \mathbb{E}[\Phi_{\text{Mon}}(X)] = \mathbb{E}[\Phi_{\text{Mon}}(Y)], \\ \left( \forall f \in C_b(\mathcal{K}), \ \lim_{n \to \infty}\mathbb{E}[f(X_n)] = \mathbb{E}[f(X)] \right) \iff \lim_{n \to \infty} \mathbb{E}[\Phi_{\text{Mon}}(X_n)] = \mathbb{E}[\Phi_{\text{Mon}}(X)]. \end{gathered} $$
Writing \(\Phi _{\text{Mon}}^{n}(x) := \frac {x^{\otimes n}}{n!}\), \(n \in \mathbb {N}_0\) for the n-th component of \(\Phi _{\text{Mon}}(x)\), notice that \(\mathbb {E}[\Phi _{\text{Mon}}^{n}(X)] = \mathbb {E}[\frac {X^{\otimes n}}{n!}]\) is the n-th moment of X (up to the constant \(1/n!\)).
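A truncated version of \(\Phi _{\text{Mon}}\) is straightforward to compute. A minimal sketch (the truncation level, sampling setup and helper names are ours):

```python
import numpy as np

def phi_mon(x, depth):
    """Truncated monomial feature Phi_Mon(x) = (1, x, x^{(x)2}/2!, ...), flattened."""
    levels = [np.array(1.0)]
    for k in range(1, depth + 1):
        levels.append(np.multiply.outer(levels[-1], x) / k)   # x^{(x)k} / k!
    return np.concatenate([lvl.ravel() for lvl in levels])

# E[Phi_Mon(X)] stacks the rescaled moments of X; matching expected features
# up to the truncation level means matching moments up to that order.
rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=(5_000, 2))   # K = [-1, 1]^2 is compact
expected = np.mean([phi_mon(x, 2) for x in samples], axis=0)
# the d components after the leading 1 are exactly the empirical mean of X
assert np.allclose(expected[1:3], samples.mean(axis=0))
```

The level-k block of `expected` is the empirical version of \(\mathbb {E}[X^{\otimes k}]/k!\), so comparing two such vectors compares all moments up to the truncation order at once.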
Remark 2.1
In fact, the preceding results on polynomials and re-scaled moments can be generalized by relaxing the compactness assumption. To this end, one needs to normalize the monomial map \(\Phi _{\text{Mon}}\) in order to guarantee integrability. Following [31, Section 3.2], the renormalized version of \(\Phi _{\text{Mon}}\) is called the robust monomial map and can be represented as \(\Lambda \circ \Phi _{\text{Mon}}\) for a so-called tensor normalization \(\Lambda : H \to \{(a_0,a_1,\ldots ) \in H: \|(a_0,a_1,\ldots )\|_H \le 1\}\).
Now, let us move forward to the case of filtered processes with respect to the weak topology, i.e., when the rank \(r=0\), cf. Sect. 1.4. In this setting, it is well-known that the notions of monomials on a compact subset \(\mathcal {K} \subset \mathbb {R}^d\) admit natural generalizations to the path space \(M^{(0)} = (I \to \mathcal {K})\), which are called signatures and robust signatures, respectively.
The Rank \(r = 0\) Case
The classical signature theory tells us that the notion of monomials \(\Phi _{\text{Mon}}\) on \(\mathcal {K} \subset \mathbb {R}^d\) from the static case has a natural generalization to \(\mathcal {K}\)-valued paths, called the signature. To elaborate, a \(\mathcal K\)-valued (discrete-time) path \(\mathbf {\mathit {x}} = (\mathbf {\mathit {x}}_1,\ldots ,\mathbf {\mathit {x}}_T) \in M^{(0)} = (I \to \mathcal {K})\) is identified with a piecewise-linear path \(\mathbf {\mathit {x}}: [0,T] \to \mathbb {R}^d\) via
$$\displaystyle \begin{aligned} \mathbf{\mathit{x}}_t := (i-t)\mathbf{\mathit{x}}_{i-1} + (t-(i-1))\mathbf{\mathit{x}}_i, \quad t \in [i-1,i], \end{aligned}$$
and we write \(\mathbf {\mathit {x}}_t = (x^1_t, \ldots , x^d_t)\) for each \(t \in [0,T]\). In other words, we embed the discrete-time path space \(M^{(0)}\) into the space of continuous paths with bounded variation defined on \([0,T]\), which we denote by \((\mathbb {R}^d)_{\text{1-var}}\) as in chapter “The Signature Kernel”. Then the signature of the path \(\mathbf {\mathit {x}} \in (I \to \mathbb {R}^d)\) is defined by
$$\displaystyle \begin{aligned} \Phi_{\text{Sig}} (\mathbf{\mathit{x}}) := \Big(1, \int d\mathbf{\mathit{x}}, \int d\mathbf{\mathit{x}}^{\otimes 2}, \ldots\Big) \in H, \end{aligned}$$
where
$$\displaystyle \begin{aligned} \int d\mathbf{\mathit{x}}^{\otimes k} = \Big(\int_{0<t_1<\ldots<t_k < T}dx^{i_1}_{t_1}\ldots dx^{i_k}_{t_k} \Big)_{i_1,\ldots,i_k \in \{ 1,\ldots,d \} } \in (\mathbb{R}^d)^{\otimes k} \end{aligned}$$
is the iterated Riemann integral for \(k \ge 1\), and H is given as in (15). Using the same notation as in chapter “The Signature Kernel”, for each \(k \in \mathbb {N}\) we use \(\Phi _{\text{Sig},k}(\mathbf {\mathit {x}})\) to denote the k-th iterated integrals of \(\mathbf {\mathit {x}}\), namely \(\Phi _{\text{Sig},k}(\mathbf {\mathit {x}}) = \int d\mathbf {\mathit {x}}^{\otimes k} \in (\mathbb {R}^d)^{\otimes k}\).
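For piecewise-linear interpolations the iterated integrals can be computed exactly: each linear segment with increment \(\Delta \) has signature \((1, \Delta , \Delta ^{\otimes 2}/2!, \ldots )\), and signatures of concatenated paths multiply in the tensor algebra (Chen's identity). A minimal truncated sketch (truncation depth and helper names are ours, not a statement of the chapter):

```python
import numpy as np

def seg_sig(delta, depth):
    """Truncated signature of one linear segment: the tensor exponential of delta."""
    levels = [np.array(1.0)]
    for k in range(1, depth + 1):
        levels.append(np.multiply.outer(levels[-1], delta) / k)
    return levels

def chen(a, b, depth):
    """Tensor-algebra product of two truncated signatures (lists of levels)."""
    return [sum(np.multiply.outer(a[i], b[k - i]) for i in range(k + 1))
            for k in range(depth + 1)]

def signature(path, depth):
    """Truncated signature of a discrete path of shape (T, d)."""
    sig = seg_sig(np.zeros(path.shape[1]), depth)   # identity element (1, 0, 0, ...)
    for delta in np.diff(path, axis=0):
        sig = chen(sig, seg_sig(delta, depth), depth)
    return sig

# Level 1 of the signature is always the total increment of the path:
path = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 2.0]])
sig = signature(path, depth=3)
assert np.allclose(sig[1], path[-1] - path[0])
```

As a sanity check, level 2 satisfies the shuffle relation \(S^{(i)}S^{(j)} = S^{(ij)} + S^{(ji)}\), i.e. `sig[2] + sig[2].T` equals the outer product of the total increment with itself.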
Thanks to the algebraic and analytic properties of iterated integrals, the following theorem holds true and shows that the signature feature \(\Phi _{\text{Sig}}\) plays the role of the sequence of monomials on the path space \(M^{(0)}\): linear combinations of its components are dense in the space of continuous bounded functions on the path space, and the expectation of \(\Phi _{\text{Sig}}(X)\) (called the expected signature) characterizes the law of a stochastic process (viewed as a path-valued random variable) as well as weak convergence. We summarize these facts in the following theorem, which can also be found in chapter “The Signature Kernel”.
Theorem 2.2
Let\(\mathcal {K} \subset \mathbb {R}^d\)be compact. Let\(M^{(0)} = (I \to \mathcal {K})\)denote the set of paths taking values in\(\mathcal {K}\). For every\(\mathbf {\mathit {x}} \in M^{(0)}\), we define the time-augmented version of\(\mathbf {\mathit {x}}\)as\(\bar {\mathbf {\mathit {x}}}(t) = (t, \mathbf {\mathit {x}}_t) \in \mathbb {R}^{d+1}\). Then the time-augmented signature feature3
$$\displaystyle \begin{aligned} \bar \Phi_{\mathit{\text{Sig}}}: M^{(0)} \to H \subset T((\mathbb{R}^{d+1})), \quad \mathbf{\mathit{x}} \mapsto \Phi_{\mathit{\text{Sig}}}(\bar{\mathbf{\mathit{x}}}) \end{aligned}$$
has the following properties:
(i)
\(\bar \Phi _{\mathit{\text{Sig}}}\)is characteristic to the space of signed measures on\(M^{(0)}\), i.e., for signed measures\(\mu \), \(\nu \)on\(M^{(0)}\)we have
$$\displaystyle \begin{aligned} \mu = \nu \iff \mathbb{E}_{X \sim \mu}[\bar \Phi_{\mathit{\text{Sig}}}(X)] = \mathbb{E}_{Y \sim \nu}[\bar \Phi_{\mathit{\text{Sig}}}(Y)]. \end{aligned}$$
 
(ii)
For a sequence of\(\mathcal {K}\)-valued stochastic processes\(X, X^1, X^2, \ldots \), it holds that
$$\displaystyle \begin{aligned} \left( \forall f \in C_b(M^{(0)}), \ \lim_{n \to \infty}\mathbb{E}[f(X^n)] = \mathbb{E}[f(X)] \right) & \iff \lim_{n \to \infty} \mathbb{E}[\bar \Phi_{\mathit{\text{Sig}}}(X^n)]\\ &\quad = \mathbb{E}[\bar \Phi_{\mathit{\text{Sig}}}(X)]. \end{aligned} $$
In other words, the (time-augmented) signature metric\(d^{(0)}_{\mathit{\text{Sig}}}: \mathcal {P}(M^{(0)}) \times \mathcal {P}(M^{(0)}) \to \mathbb {R}_+\)given by
$$\displaystyle \begin{aligned} d^{(0)}_{\mathit{\text{Sig}}}(\mu,\nu) := \|\mathbb{E}_{X \sim \mu}[\bar \Phi_{\mathit{\text{Sig}}}(X)] - \mathbb{E}_{Y \sim \nu}[\bar \Phi_{\mathit{\text{Sig}}}(Y)]\|_H, \end{aligned}$$
metrizes the weak topology on\(\mathcal {P}(M^{(0)})\).
 
Remark 2.3
  • In order to ensure the above characteristicness, it is necessary to consider the signature of the time-augmented path \(\bar {\mathbf {\mathit {x}}}_t = (t, \mathbf {\mathit {x}}_t) \in \mathbb {R}^{d+1}\), see e.g. [31] for a detailed discussion of this issue. From now on, to keep the notation as simple as possible, we will not distinguish the signature feature from its time-augmented version, and will write \(\Phi _{\text{Sig}}\) instead of \(\bar \Phi _{\text{Sig}}\) for time-augmented signatures.
  • As in the static case, one can use any tensor normalization \(\Lambda \) to define a robust signature \(\Lambda \circ \Phi _{\text{Sig}}\). Using this robust signature one can show that all of the above results hold for general \(\mathbb {R}^d\)-valued paths, dropping the compactness assumption, see [31]. To avoid unnecessary technicalities, we will stick to signature features for paths taking values in the compact subset \(\mathcal {K}\) of \(\mathbb {R}^d\) in the present article.
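In practice the signature metric \(d^{(0)}_{\text{Sig}}\) is estimated by comparing empirical expected signatures of sample paths. A rough Monte-Carlo sketch truncated to signature levels 1 and 2 (the random-walk models, sample sizes and all names are ours):

```python
import numpy as np

def sig_level12(path):
    """Signature levels 1-2 of a time-augmented discrete path (Chen's identity)."""
    t = np.linspace(0.0, 1.0, len(path))
    p = np.column_stack([t, path])                # time augmentation (t, x_t)
    lvl1 = np.zeros(p.shape[1])
    lvl2 = np.zeros((p.shape[1],) * 2)
    for d in np.diff(p, axis=0):
        lvl2 += np.multiply.outer(lvl1, d) + np.multiply.outer(d, d) / 2
        lvl1 += d
    return np.concatenate([lvl1, lvl2.ravel()])

def sig_distance(samples_x, samples_y):
    """Truncated d_Sig: norm of the difference of empirical expected signatures."""
    ex = np.mean([sig_level12(p) for p in samples_x], axis=0)
    ey = np.mean([sig_level12(p) for p in samples_y], axis=0)
    return float(np.linalg.norm(ex - ey))

rng = np.random.default_rng(1)
walks = lambda scale: rng.normal(0, scale, size=(2_000, 10)).cumsum(axis=1)
d_same = sig_distance(walks(1.0), walks(1.0))   # only Monte-Carlo noise remains
d_diff = sig_distance(walks(1.0), walks(2.0))   # level 2 picks up the larger variance
assert d_diff > d_same
```

The distance between equally-distributed samples does not vanish exactly (it reflects Monte-Carlo error), but it is dominated by the distance between processes with genuinely different laws.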
This leads to a natural question:
(Q)
Can we adopt the signature methodology to define the moment generating function and a signature-type metric as in Theorem 2.2 to characterize the rank-r adapted weak topology on the space of filtered processes \(\text{FP}\), for any rank \(r \in \mathbb {N}_0\)?
 
The Rank \(r = 1\) Case
As the rank-0 adapted weak topology on \(\text{FP}\) is induced by the weak topology on \(\mathcal {P}(M^{(0)})\), Theorem 2.2 shows that the signature feature \(\Phi _{\text{Sig}}\) and the signature metric \(d_{\text{Sig}}\) achieve this goal when \(r=0\). Now, let us proceed with \(r=1\). According to the definition, the rank-1 adapted weak topology \(\mathcal T^{(1)}\) is the initial topology induced by the law of the rank 1-prediction process \(\mathrm {pp}^{(1)}(\mathbf {X}) = \mathrm {law}(X|\mathcal {F}^{\mathbf {X}}_t)_{t \in I} \in M^{(1)}\) with respect to the weak topology on \(\mathcal {P}(M^{(1)})\) where \(M^{(1)} = (I \to \mathcal {P}(M^{(0)}))\). Therefore, to establish a signature-type metric for the rank-1 adapted weak topology, a natural idea is to define the signature feature for the rank 1-prediction process, which can be viewed as an \(M^{(1)}\)-valued random variable. The only technical issue here is that the paths \(\mathbf {\mathit {p}} = (\mathbf {\mathit {p}}_t)_{t \in I}\) in \(M^{(1)}\) are measure-valued paths, and it is not clear how to define iterated integrals for paths with values in a non-linear space (here, the non-linear space is \(\mathcal {P}(M^{(0)})\)).4
One natural way to circumvent this difficulty is to transform a measure-valued path into a path taking values in a linear space, then use the iterated integrals to define the signature feature for the latter one, and use it as the signature for the original measure-valued path. In our situation, we can use the signature feature for \(\mathbb {R}^d\)-valued paths to achieve such a transformation. More precisely, for a given measure-valued path \(\mathbf {\mathit {p}} = (\mathbf {\mathit {p}}_t)_{t \in I} \in M^{(1)}\), for each \(t \in I\) we define \(\tilde {\mathbf {\mathit {p}}}_t\) as the Bochner integral of the H-valued mapping \(\Phi _{\text{Sig}}\):
$$\displaystyle \begin{aligned} \tilde{\mathbf{\mathit{p}}}_t = \int_{\mathbf{\mathit{x}} \in M^{(0)}} \Phi_{\text{Sig}}(\mathbf{\mathit{x}}) \, \mathbf{\mathit{p}}_t (d \mathbf{\mathit{x}}) \in H. \end{aligned}$$
By a slight abuse of notation, we will still use \(\tilde {\mathbf {\mathit {p}}}\) to denote the time-augmented version of the path \(\tilde {\mathbf {\mathit {p}}}\), i.e., \(\tilde {\mathbf {\mathit {p}}}_t = (t, \int _{\mathbf {\mathit {x}} \in M^{(0)}} \Phi _{\text{Sig}}(\mathbf {\mathit {x}}) \mathbf {\mathit {p}}_t (d \mathbf {\mathit {x}})) \in \mathbb {R} \oplus H\) for each \(t \in I\). Then, similarly to \(\mathbb {R}^d\)-valued paths, we identify this discrete-time H-valued path with the piecewise-linear H-valued path \(\tilde {\mathbf {\mathit {p}}}_t = (i-t)\tilde {\mathbf {\mathit {p}}}_{i-1} + (t-(i-1))\tilde {\mathbf {\mathit {p}}}_{i}\), \(t \in [i-1,i]\), which belongs to the space of H-valued paths on \([0,T]\) with bounded variation. Subsequently, we can define iterated integrals and form a (time-augmented) signature of \(\tilde {\mathbf {\mathit {p}}} = (\tilde {\mathbf {\mathit {p}}}_t)_{t \in I}\) by
$$\displaystyle \begin{aligned} \Phi_{\text{Sig}} (\tilde{\mathbf{\mathit{p}}}) = \Big(1, \int d\tilde{\mathbf{\mathit{p}}}, \int d\tilde{\mathbf{\mathit{p}}}^{\otimes 2}, \ldots\Big) \in H^{(2)}, \end{aligned}$$
where \(H^{(2)} \subset T((\mathbb {R} \oplus H)) = \prod _{k=0}^\infty (\mathbb {R}\oplus H)^{\otimes k}\) is a Hilbert space embedded in the tensor algebra \(T((\mathbb {R} \oplus H))\) over the Hilbert space \(\mathbb {R} \oplus H\), whose inner product \(\langle \cdot , \cdot \rangle _{H^{(2)}}\) is induced by the inner product \(\langle \cdot , \cdot \rangle _{\mathbb {R} \oplus H}\) on \(\mathbb {R} \oplus H\). For a complete description of \(H^{(2)}\) we refer to [47, Appendix A] or [28, Appendix B]. Here we remark that the superscript \((2)\) of \(H^{(2)}\) means that \(H^{(2)}\) is formed by taking the tensor algebra over vector space twice, i.e., \(H^{(2)}\) is contained in the tensor algebra over another tensor algebra \(H \subset T((\mathbb {R}^d))\).
Definition 2.4
The \(H^{(2)}\)-valued mapping
$$\displaystyle \begin{aligned} \Phi_{\text{Sig}}^{(1)}(\mathbf{\mathit{p}}) := \Big(1, \int d\tilde{\mathbf{\mathit{p}}}, \int d\tilde{\mathbf{\mathit{p}}}^{\otimes 2}, \ldots\Big) \in H^{(2)} \end{aligned}$$
is called the rank-1 signature of the measure-valued path \(\mathbf {\mathit {p}} = (\mathbf {\mathit {p}}_t)_{t \in I} \in M^{(1)}\).
Now, let \(\mathbf {X} = \left ( \Omega ^{\mathbf {X}}, \mathcal {F}^{\mathbf {X}}, \mathbb {P}^{\mathbf {X}}, (\mathcal {F}_t^{\mathbf {X}})_{t \in I}, X \right ) \in \text{FP}\) be a filtered process. Its rank 1-prediction process \(\mathrm {pp}^{(1)}(\mathbf {X}) = \mathrm {law}(X|\mathcal {F}^{\mathbf {X}}_t), t \in I\), is an \(M^{(1)}\)-valued random variable, and it is easy to see that
$$\displaystyle \begin{aligned} \widetilde{\mathrm{pp}}^{(1)}(\mathbf{X})_t := \mathbb{E}_{\mathbf{X}}[\Phi_{\text{Sig}}(X) | \mathcal{F}^{\mathbf{X}}_t], \quad t \in I \end{aligned}$$
is an H-valued martingale, so that \(\Phi _{\text{Sig}}^{(1)}(\mathrm {pp}^{(1)} (\mathbf {X})) = \Phi _{\text{Sig}} (t \mapsto \widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})_t)\).
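On a finite probability space this construction is fully explicit: compute the H-valued martingale \(t \mapsto \mathbb {E}[\Phi _{\text{Sig}}(X)|\mathcal {F}_t]\) atom by atom, then take the signature of the resulting path. A toy sketch truncated to low signature levels (the two-outcome data, the truncations and all names are ours, not the construction of [28] verbatim):

```python
import numpy as np

prob = {"w1": 0.5, "w2": 0.5}
paths = {"w1": np.array([0.0, 1.0]), "w2": np.array([0.0, -1.0])}

def phi_sig1(x):
    """Level-1 truncation of the time-augmented signature: the total increment."""
    return np.array([len(x) - 1.0, x[-1] - x[0]])

def feature_path(partition, w):
    """Values of t -> E[Phi_Sig(X) | F_t] at t = 1, 2, time-augmented;
    F_1 is generated by `partition`, F_2 is the full sigma-algebra."""
    atom = next(a for a in partition if w in a)
    mass = sum(prob[u] for u in atom)
    mean = sum(prob[u] * phi_sig1(paths[u]) for u in atom) / mass
    return np.array([[1.0, *mean], [2.0, *phi_sig1(paths[w])]])

def sig_level12(p):
    """Signature levels 1-2 of a discrete path, via Chen's identity."""
    lvl1, lvl2 = np.zeros(p.shape[1]), np.zeros((p.shape[1],) * 2)
    for d in np.diff(p, axis=0):
        lvl2 += np.multiply.outer(lvl1, d) + np.multiply.outer(d, d) / 2
        lvl1 += d
    return np.concatenate([lvl1, lvl2.ravel()])

def rank1_expected_sig(partition):
    return sum(prob[w] * sig_level12(feature_path(partition, w)) for w in prob)

sA = rank1_expected_sig([["w1", "w2"]])    # trivial F_1
sB = rank1_expected_sig([["w1"], ["w2"]])  # fully informative F_1
# same path law, but different rank-1 expected signatures
assert not np.allclose(sA, sB)
```

The two processes have identical laws, so their rank-0 expected signatures agree; the rank-1 expected signatures separate them because the conditional-expectation path is constant under the informative filtration and genuinely moves under the trivial one.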
The following theorem justifies the above construction, since the so-defined rank-1 signature for measure-valued paths satisfies all properties (e.g., universality and characteristicness) which hold for the signature for \(\mathbb {R}^d\)-valued paths. For a detailed proof, we refer readers to [28, Theorem 2].
Theorem 2.5
The rank-1 signature\(\Phi ^{(1)}_{\mathit{\text{Sig}}}\)has the following properties:
(i)
\(\Phi ^{(1)}_{\mathit{\text{Sig}}}\)is universal to\(C_b(M^{(1)})\)equipped with the uniform topology, i.e., for every\(f \in C_b(M^{(1)})\)and\(\varepsilon > 0\), there is\(\ell \in H^{(2)}\)with
$$\displaystyle \begin{aligned} \sup_{\mathbf{\mathit{p}} \in M^{(1)}} |f(\mathbf{\mathit{p}}) - \langle \ell, \Phi^{(1)}_{\mathit{\text{Sig}}}(\mathbf{\mathit{p}}) \rangle_{H^{(2)}}| \le \varepsilon. \end{aligned}$$
 
(ii)
\(\Phi ^{(1)}_{\mathit{\text{Sig}}}\)is characteristic to the space of signed measures on\(M^{(1)}\). In particular, two filtered processes\(\mathbf {X}\)and\(\mathbf {Y}\)have the same rank-1 adapted distribution if and only if
$$\displaystyle \begin{aligned} \mathbb{E}_{\mathbf{X}}[\Phi^{(1)}_{\mathit{\text{Sig}}}(\mathrm{pp}^{(1)} (\mathbf{X}))] = \mathbb{E}_{\mathbf{Y}}[\Phi^{(1)}_{\mathit{\text{Sig}}}(\mathrm{pp}^{(1)} (\mathbf{Y}))]. \end{aligned}$$
 
(iii)
The rank-1 signature metric \(d_{\mathit{\text{Sig}}}^{(1)}: \mathit{\text{FP}} \times \mathit{\text{FP}} \to \mathbb {R}_+\) given by
$$\displaystyle \begin{aligned} d^{(1)}_{\mathit{\text{Sig}}}(\mathbf{X},\mathbf{Y}) := \|\mathbb{E}_{\mathbf{X}}[\Phi^{(1)}_{\mathit{\text{Sig}}}(\mathrm{pp}^{(1)}(\mathbf{X}))] - \mathbb{E}_{\mathbf{Y}}[\Phi^{(1)}_{\mathit{\text{Sig}}}(\mathrm{pp}^{(1)}(\mathbf{Y}))]\|_{H^{(2)}}, \end{aligned}$$
metrizes the rank-1 adapted weak topology\(\mathcal {T}^{(1)}\)on\(\mathit{\text{FP}}\).
 
The General Higher Rank Case
Thanks to the hierarchical structure of the higher-rank prediction processes, we can repeat the above construction to obtain the rank-r signature, for any \(r \in \mathbb {N}_0\), which then can be used to characterize the rank-r adapted weak topology as in Theorem 2.5. Roughly speaking, as the rank-\((r+1)\) adapted weak topology is induced by the rank \((r+1)\)-prediction process, which is constructed as the prediction process of the rank r-prediction process, the corresponding rank-\((r+1)\) signature \(\Phi ^{(r+1)}_{\text{Sig}}\) is obtained by taking the signature feature of the paths formed via (taking expectations of) the rank-r signature.
The concrete procedure is described as follows, which is also reflected in the diagram below:
(Diagram: \(\Phi ^{(r+1)}_{\text{Sig}}\) on \(M^{(r+1)} = (I \to \mathcal P(M^{(r)}))\) factors through the path space \((I \to H^{(r+1)})\): integrating \(\Phi ^{(r)}_{\text{Sig}}\) against each \(\mathbf {\mathit {p}}^{(r+1)}_t\) yields an \(H^{(r+1)}\)-valued path, to which \(\Phi _{\text{Sig}}\) is applied, landing in \(H^{(r+2)} \subset T((\mathbb {R} \oplus H^{(r+1)}))\).)
1.
Assume that for \(r \in \mathbb {N}_0\) we have defined the rank-r signature \(\Phi ^{(r)}_{\text{Sig}}: M^{(r)} \to H^{(r+1)}\) for measure-valued paths \(\mathbf {\mathit {p}}^{(r)} = (\mathbf {\mathit {p}}^{(r)}_t)_{t \in I} \in M^{(r)}\), where \(H^{(r+1)}\) is a Hilbert space. By convention we set \(H^{(0)} := \mathbb {R}^d\) and \(H^{(1)} := H\), and \(\Phi ^{(0)}_{\text{Sig}} = \Phi _{\text{Sig}}\) is the usual signature on \(M^{(0)}\) defined as in Theorem 2.2.
 
2.
Consider the measure-valued path space \(M^{(r+1)} = (I \to \mathcal {P}(M^{(r)}))\). For a path \(\mathbf {\mathit {p}}^{(r+1)} = (\mathbf {\mathit {p}}^{(r+1)}_t)_{t \in I} \in M^{(r+1)}\) and each \(t \in I\), \(\mathbf {\mathit {p}}^{(r+1)}_t \in \mathcal {P}(M^{(r)})\) is a probability measure on the path space \(M^{(r)}\). Hence, the following \(H^{(r+1)}\)-valued Bochner integral is well-defined:
$$\displaystyle \begin{aligned} \tilde{\mathbf{\mathit{p}}}^{(r+1)}_t = \int_{\mathbf{\mathit{p}}^{(r)} \in M^{(r)}} \Phi^{(r)}_{\text{Sig}}(\mathbf{\mathit{p}}^{(r)}) \, d \mathbf{\mathit{p}}^{(r+1)}_t (\mathbf{\mathit{p}}^{(r)}) \in H^{(r+1)}. \end{aligned}$$
In particular, we now obtain an \(H^{(r+1)}\)-valued path \(t \mapsto \tilde {\mathbf {\mathit {p}}}^{(r+1)}_t\). Again, by an abuse of notation, we will still use \(\tilde {\mathbf {\mathit {p}}}^{(r+1)}\) to denote the time-augmented version of this path. Note that the codomain of \(\tilde {\mathbf {\mathit {p}}}^{(r+1)}_t\) is now \(\mathbb {R} \oplus H^{(r+1)}\).
 
3.
Let \(H^{(r+2)}\) be a Hilbert space embedded in the tensor algebra \(T((\mathbb {R} \oplus H^{(r+1)}))= \prod _{k=0}^\infty (\mathbb {R} \oplus H^{(r+1)})^{\otimes k}\) whose inner product \(\langle \cdot ,\cdot \rangle _{H^{(r+2)}}\) is induced by the inner product \(\langle \cdot ,\cdot \rangle _{\mathbb {R} \oplus H^{(r+1)}}\). Then, use the iterated integrals of the \((\mathbb {R} \oplus H^{(r+1)})\)-valued path \(t \mapsto \tilde {\mathbf {\mathit {p}}}^{(r+1)}_t\)5 to define
$$\displaystyle \begin{aligned} \hspace{-5pt}\Phi^{(r+1)}_{\text{Sig}}(\mathbf{\mathit{p}}^{(r+1)}) {=} \Phi_{\text{Sig}} (\tilde{\mathbf{\mathit{p}}}^{(r+1)}) {=} \Big(1, \int d\tilde{\mathbf{\mathit{p}}}^{(r+1)}, \int d(\tilde{\mathbf{\mathit{p}}}^{(r+1)})^{\otimes 2}, \ldots \Big)\! \in H^{(r{+}2)} \end{aligned}$$
as the rank-\((r+1)\) signature for paths in \(M^{(r+1)}\).
 
Using the above construction, Theorem 2.5 can be further generalized to any rank \(r \in \mathbb {N}_0\). It suffices to notice that for a filtered process \(\mathbf {X} \in \text{FP}\), the rank-r prediction process \(\mathrm {pp}^{(r)}(\mathbf {X})\) is an \(M^{(r)}\)-valued random variable, and the rank-r adapted weak topology on \(\text{FP}\) is the initial topology induced by the rank-r prediction process \(\mathrm {pp}^{(r)}: \text{FP} \to \mathcal {P}(M^{(r)})\).
Theorem 2.6
For every rank \(r \in \mathbb {N}_0\), the rank-r signature \(\Phi ^{(r)}_{\mathit{\text{Sig}}}\) has the following properties:
(i)
\(\Phi ^{(r)}_{\mathit{\text{Sig}}}\) is universal to \(C_b(M^{(r)})\) equipped with the uniform topology, i.e., for every \(f \in C_b(M^{(r)})\) and \(\varepsilon > 0\), there is \(\ell \in H^{(r+1)}\) with
$$\displaystyle \begin{aligned} \sup_{\mathbf{\mathit{p}}^{(r)} \in M^{(r)}} |f(\mathbf{\mathit{p}}^{(r)}) - \langle \ell, \Phi^{(r)}_{\mathit{\text{Sig}}}(\mathbf{\mathit{p}}^{(r)}) \rangle_{H^{(r+1)}}| \le \varepsilon. \end{aligned}$$
 
(ii)
\(\Phi ^{(r)}_{\mathit{\text{Sig}}}\) is characteristic to the space of signed measures on \(M^{(r)}\). In particular, two filtered processes \(\mathbf {X}\) and \(\mathbf {Y}\) have the same rank-r adapted distribution if and only if
$$\displaystyle \begin{aligned} \mathbb{E}_{\mathbf{X}}[\Phi^{(r)}_{\mathit{\text{Sig}}}(\mathrm{pp}^{(r)} (\mathbf{X}))] = \mathbb{E}_{\mathbf{Y}}[\Phi^{(r)}_{\mathit{\text{Sig}}}(\mathrm{pp}^{(r)} (\mathbf{Y}))], \end{aligned}$$
that is, they have the same rank-r expected signature.
 
(iii)
The rank-r signature metric \(d^{(r)}_{\mathit{\text{Sig}}}: \mathit{\text{FP}} \times \mathit{\text{FP}} \to \mathbb {R}_+\) given by
$$\displaystyle \begin{aligned} d^{(r)}_{\mathit{\text{Sig}}}(\mathbf{X},\mathbf{Y}) := \|\mathbb{E}_{\mathbf{X}}[\Phi^{(r)}_{\mathit{\text{Sig}}}(\mathrm{pp}^{(r)}(\mathbf{X}))] - \mathbb{E}_{\mathbf{Y}}[\Phi^{(r)}_{\mathit{\text{Sig}}}(\mathrm{pp}^{(r)}(\mathbf{Y}))]\|_{H^{(r+1)}}, \end{aligned}$$
metrizes the rank-r adapted weak topology \(\mathcal {T}^{(r)}\) on \(\mathit{\text{FP}}\).
 
See [28, Theorem 4 and Remark 7] for a detailed proof of Theorem 2.6.
Remark 2.7
Again, by composing the rank-r signature with a tensor normalization \(\Lambda \) we can generalize the above results to general filtered processes taking values in \(\mathbb {R}^d\) and drop the compactness assumption. For more details, we refer to [28, Theorem 2, Remark 7].

2.2 Higher-Rank (Expected) Signatures and Kernel Learning

From the construction of higher-rank signatures in the last section, we see that the rank-r signature \(\Phi ^{(r)}_{\text{Sig}}\) takes values in a Hilbert space \(H^{(r+1)}\) for \(r \in \mathbb {N}_0\). The inner product structure on \(H^{(r+1)}\) naturally allows us to handle \(\Phi ^{(r)}_{\text{Sig}}\) for any \(r \ge 0\) via a kernel learning approach, exactly as in the classical signature kernel learning method (corresponding to the case \(r=0\)) discussed in [57, Section 7]; in particular, one can apply kernel learning techniques to compute the higher-rank signature metrics, as will be reported in the next subsection.
Note that in the sequel we implicitly assume that all paths are time-augmented when taking signatures of them.
First let us quickly review the connection between the usual signature \(\Phi _{\text{Sig}}\) and kernel learning, which is discussed systematically in the chapter “The Signature Kernel”. As before, let \(\Phi _{\text{Sig}}: M^{(0)} \to H\) denote the robust signature feature map sending an \(\mathbb {R}^d\)-valued path to a vector in the Hilbert space H. Then we obtain an associated Reproducing Kernel Hilbert Space (RKHS) over \(M^{(0)}\) induced by \(\Phi _{\text{Sig}}\) and \(\langle \cdot , \cdot \rangle _H\), which will be denoted by \(\mathbf {\mathit {H}}_{\text{Sig}}\), together with its kernel
$$\displaystyle \begin{aligned} \mathbf{\mathit{k}}_{\text{Sig}}(\mathbf{\mathit{x}}, \boldsymbol{y}) = \langle \Phi_{\text{Sig}}(\mathbf{\mathit{x}}), \Phi_{\text{Sig}}(\boldsymbol{y}) \rangle_H, \quad \mathbf{\mathit{x}}, \boldsymbol{y} \in M^{(0)}. \end{aligned}$$
In this case, the reproducing kernel induced by \(\Phi _{\text{Sig}}\) is simply given by
$$\displaystyle \begin{aligned} \mathbf{\mathit{k}}_{\text{Sig}}: M^{(0)} \to \mathbf{\mathit{H}}_{\text{Sig}}, \quad \mathbf{\mathit{x}} \mapsto \mathbf{\mathit{k}}_{\text{Sig}}(\mathbf{\mathit{x}}, \cdot) \end{aligned}$$
and the inner product on the RKHS \(\mathbf {\mathit {H}}_{\text{Sig}}\) satisfies that
$$\displaystyle \begin{aligned} \langle \mathbf{\mathit{k}}_{\text{Sig}}(\mathbf{\mathit{x}}), \mathbf{\mathit{k}}_{\text{Sig}}(\boldsymbol{y}) \rangle_{\mathbf{\mathit{H}}_{\text{Sig}}} = \mathbf{\mathit{k}}_{\text{Sig}}(\mathbf{\mathit{x}}, \boldsymbol{y}) = \langle \Phi_{\text{Sig}}(\mathbf{\mathit{x}}), \Phi_{\text{Sig}}(\boldsymbol{y}) \rangle_H. \end{aligned}$$
Moreover, we have \(\mathbf {\mathit {H}}_{\text{Sig}} = \{f: M^{(0)} \to \mathbb {R}: f(\mathbf {\mathit {x}}) = \langle h, \Phi _{\text{Sig}}(\mathbf {\mathit {x}}) \rangle _H, h \in H \}\). In fact, it is not hard to check that these two Hilbert spaces H and \(\mathbf {\mathit {H}}_{\text{Sig}}\) are isometric by considering the mapping \(h \in H \mapsto f(\cdot ) = \langle h, \Phi _{\text{Sig}}(\cdot )\rangle _H \in \mathbf {\mathit {H}}_{\text{Sig}}\); in other words, the function space \(\mathbf {\mathit {H}}_{\text{Sig}}\) just provides an alternative way of thinking about the Hilbert space H embedded in the tensor algebra.
Now, for any \(r \ge 0\), we can define the so-called rank-r signature kernel \(\mathbf {\mathit {k}}_{\text{Sig}}^{(r)}: M^{(r)} \to \mathbf {\mathit {H}}_{\text{Sig}}^{(r)}\) for some RKHS \(\mathbf {\mathit {H}}^{(r)}_{\text{Sig}}\) (where we set \(\mathbf {\mathit {k}}^{(0)}_{\text{Sig}} = \mathbf {\mathit {k}}_{\text{Sig}}\) and \(\mathbf {\mathit {H}}^{(0)}_{\text{Sig}} = \mathbf {\mathit {H}}_{\text{Sig}}\)) in the same spirit as before: since the rank-r signature \(\Phi _{\text{Sig}}^{(r)}\) on \(M^{(r)}\) takes values in a Hilbert space \(H^{(r+1)}\), we obtain the associated reproducing kernel
$$\displaystyle \begin{aligned} \mathbf{\mathit{k}}^{(r)}_{\text{Sig}}: M^{(r)} \to \mathbf{\mathit{H}}^{(r)}_{\text{Sig}}, \quad \mathbf{\mathit{p}}^{(r)} \mapsto \langle \Phi_{\text{Sig}}^{(r)}(\mathbf{\mathit{p}}^{(r)}), \Phi_{\text{Sig}}^{(r)}(\cdot) \rangle_{H^{(r+1)}} \end{aligned}$$
as the rank-r signature kernel, and note that
$$\displaystyle \begin{aligned} \mathbf{\mathit{k}}^{(r)}_{\text{Sig}}(\mathbf{\mathit{p}}^{(r)},\boldsymbol{q}^{(r)}) = \langle \Phi^{(r)}_{\text{Sig}}(\mathbf{\mathit{p}}^{(r)}), \Phi^{(r)}_{\text{Sig}}(\boldsymbol{q}^{(r)}) \rangle_{H^{(r+1)}} \end{aligned}$$
for any \(\mathbf {\mathit {p}}^{(r)}, \boldsymbol {q}^{(r)} \in M^{(r)}\).
Recall that (see e.g. [31, Section 6]) a natural distance for probability measures on any topological space \(\mathcal {X}\) is given by fixing a class of functions \(\mathscr {G} \subset \mathbb {R}^{\mathcal {X}}\) and defining the integral probability metric between \(\mu ,\nu \in \mathcal {P}(\mathcal {X})\) as
$$\displaystyle \begin{aligned} d(\mu,\nu) = \sup_{f \in \mathscr{G}} |\mathbb{E}_{X \sim \mu}[f(X)] - \mathbb{E}_{Y \sim \nu}[f(Y)]|. \end{aligned}$$
For instance, if \(\mathscr {G}\) denotes the space of Lipschitz functions on \(\mathcal {X}\) with Lipschitz constant 1, then the above integral probability metric is exactly the first-order Wasserstein distance. An insight from the theory of kernel learning is that if \(\mathscr {G}\) is the unit ball of an RKHS \((\mathbf {\mathit {H}},\mathbf {\mathit {k}})\), then the above integral probability metric is called a Maximum Mean Discrepancy (MMD), and the reproducing property implies that
$$\displaystyle \begin{aligned} d^2(\mu,\nu) = \mathbb{E}[\mathbf{\mathit{k}}(X,X^\prime)] - 2\mathbb{E}[\mathbf{\mathit{k}}(X,Y)] + \mathbb{E}[\mathbf{\mathit{k}}(Y,Y^\prime)], \end{aligned}$$
where \(X,X^\prime ,Y,Y^\prime \) are independent with \(X,X^\prime \sim \mu \) and \(Y,Y^\prime \sim \nu \).
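Since this expansion only involves kernel evaluations, the squared MMD is straightforward to estimate from samples. The following is a minimal numpy sketch, using a Gaussian (RBF) kernel as a stand-in for \(\mathbf {\mathit {k}}\); the function names and parameters are our own choices, not from the text:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd_squared(X, Y, kernel=rbf_kernel):
    """Empirical estimate of E[k(X,X')] - 2 E[k(X,Y)] + E[k(Y,Y')]."""
    return (kernel(X, X).mean()
            - 2.0 * kernel(X, Y).mean()
            + kernel(Y, Y).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # samples from mu
Y = rng.normal(1.0, 1.0, size=(500, 2))   # samples from nu (shifted mean)
print(mmd_squared(X, X))   # 0: identical sample sets
print(mmd_squared(X, Y))   # clearly positive: mu and nu differ
```

Note that this biased estimator uses all pairs, including the diagonal; Proposition 2.8 below applies the same expansion with the rank-r signature kernel in place of the RBF kernel.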
The next proposition shows that the higher-rank signature metrics \(d^{(r)}_{\text{Sig}}(\cdot ,\cdot )\) are MMDs induced by the higher-rank signature kernels \(\mathbf {\mathit {k}}^{(r)}_{\text{Sig}}(\cdot ,\cdot )\) defined as above. This is straightforward to prove; we leave the proof as an exercise for the curious reader.
Proposition 2.8
For any \(r \in \mathbb {N}_0\), the rank-r signature kernel MMD \(\mathcal {D}^{(r)}_{\mathit{\text{Sig}}}\) on \(\mathit{\text{FP}}\) defined by
$$\displaystyle \begin{aligned} \mathcal{D}^{(r)}_{\mathit{\text{Sig}}}(\mathbf{X}, \mathbf{Y}) := \sup_{f \in \mathbf{\mathit{H}}^{(r)}_{\mathit{\text{Sig}}}: \|f\|_{\mathbf{\mathit{H}}^{(r)}_{\mathit{\text{Sig}}}} = 1} |\mathbb{E}_{\mathbf{X}}[f(\mathrm{pp}^{(r)} (\mathbf{X}))] - \mathbb{E}_{\mathbf{Y}}[f(\mathrm{pp}^{(r)} (\mathbf{Y}))]| \end{aligned}$$
satisfies \(\mathcal {D}^{(r)}_{\mathit{\text{Sig}}}(\mathbf {X}, \mathbf {Y}) = d^{(r)}_{\mathit{\text{Sig}}}(\mathbf {X},\mathbf {Y})\) and
$$\displaystyle \begin{aligned} \mathcal{D}^{(r)}_{\mathit{\text{Sig}}}(\mathbf{X}, \mathbf{Y})^2 = \mathbb{E}[\mathbf{\mathit{k}}^{(r)}_{\mathit{\text{Sig}}}(Z, Z^\prime)] - 2\mathbb{E}[\mathbf{\mathit{k}}^{(r)}_{\mathit{\text{Sig}}}(Z, \hat Z)] + \mathbb{E}[\mathbf{\mathit{k}}^{(r)}_{\mathit{\text{Sig}}}(\hat Z, \hat Z^\prime)], \end{aligned} $$
where \(Z,Z^\prime ,\hat Z,\hat Z^\prime \) are independent with \(Z,Z^\prime \sim \mathrm {pp}^{(r)}(\mathbf {X})\) and \(\hat Z,\hat Z^\prime \sim \mathrm {pp}^{(r)}(\mathbf {Y})\).
In the next subsection we will show that the above expansion of \(\mathcal {D}^{(r)}_{\text{Sig}}(\mathbf {X}, \mathbf {Y})^2\) can easily be estimated from finite samples of \(\mathrm {law}(X)\) and \(\mathrm {law}(Y)\), as the higher-rank signature kernel \(\mathbf {\mathit {k}}^{(r)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) is relatively cheap to evaluate, provided \(\mathbf {X}\) and \(\mathbf {Y}\) are each endowed with their natural filtration.

2.3 Algorithms for Computing Higher-Rank (Expected) Signature

In the last subsection we have seen that the higher-rank signature metric (equivalently, the higher-rank signature kernel MMD) metrizes the higher-rank adapted weak topology \(\mathcal {T}^{(r)}\) on the space of filtered processes \(\text{FP}\). Although these theoretical results look satisfactory, a numerical computation of these distances is admittedly a highly challenging task, as the Hilbert spaces \(H^{(r)}\) and \(\mathbf {\mathit {H}}^{(r)}_{\text{Sig}}\) are infinite dimensional with a very complicated algebraic structure. In this subsection we introduce two possible algorithms for computing these higher-rank signature distances, which have been validated in numerical experiments, although there is certainly room for further improvement.
To simplify notation and give a clearer description of the algorithms, let us impose the following assumptions throughout:
(A2)
All filtered processes are equipped with the natural filtration, that is, we focus on the subspace \(\text{FP}_{\text{plain}}\) of \(\text{FP}\) given by
$$\displaystyle \begin{aligned} \text{FP}_{\text{plain}} = \Big\{\mathbf{X} = \left( \boldsymbol{\Omega}^{\mathbf{X}}, \mathcal{F}^{\mathbf{X}}, \mathbb{P}^{\mathbf{X}}, (\mathcal{F}_t^{\mathbf{X}})_{t \in I}, X \right) \in \text{FP}: \mathcal{F}_t^{\mathbf{X}} = \sigma(X_s,s \le t)\Big\}; \end{aligned}$$
 
(A3)
\(r=1\).
 
Method 1: A Backward Recursion
We will apply a backward recursion method to derive algorithms that efficiently compute the signature \(\Phi _{\text{Sig}}(\mathbf {X})\) and the rank-1 signature \(\Phi ^{(1)}_{\text{Sig}}(\mathbf {X})\), provided \(\mathbf {X} \in \text{FP}_{\text{plain}}\) is a nowhere recombining Markov chain with finitely many states (nowhere recombining means that its value at time t uniquely determines \(X_1,\ldots ,X_{t-1}\)). We refer readers to [28, Section 5] for a detailed description of this algorithm; here we only summarize the essential idea and the main steps.
Recall that higher-rank signatures take values in some tensor algebras. Therefore, to compute them we need the following lemma concerning dynamic programming in an algebra (see [28, Lemma 3]).
Lemma 2.9
Let A be a (Banach) algebra and \(Z = (Z_t)_{t \in I}\) be a nowhere recombining Markov chain taking finitely many values in A. The function
$$\displaystyle \begin{aligned} u_t(a) = \mathbb{E}[Z_{t+1} \ldots Z_T| Z_t = a] \end{aligned}$$
satisfies for every\(t \in I\)and\(a \in A\)the recursion
$$\displaystyle \begin{aligned} u_t(a) = \sum_{b \in A} \mathbb{P}(Z_{t+1} = b| Z_t = a)\, b \, u_{t+1}(b), \end{aligned}$$
where\(Z_{t+1} \ldots Z_T\)and\(bu_{t+1}(b)\)are calculated by using the multiplication defined on the algebra A.
If \(\mathbf {X} \in \text{FP}_{\text{plain}}\) is a nowhere recombining Markov chain, then the signature process \(Z_t = \Phi _{\text{Sig}}(\mathbf {X}|_{[0,t]})\) can also be viewed as a nowhere recombining Markov chain taking values in the tensor algebra \(T((\mathbb {R}^d))\). As \(I = \{1,\ldots ,T\}\) is discrete, it is well-known that
$$\displaystyle \begin{aligned} \Phi_{\text{Sig}}(X) = \exp(\Delta_1 X) \otimes \ldots \otimes \exp(\Delta_T X) \end{aligned}$$
where \(\otimes \) is the tensor product on \(T((\mathbb {R}^d))\), \(\exp \) is the tensor exponential on \(T((\mathbb {R}^d))\) such that \(\exp (x) = \sum _{n=0}^\infty \frac {x^{\otimes n}}{n!}\) and \(\Delta _t X = X_t - X_{t-1}\). Hence, Lemma 2.9 hints at an efficient way to compute the expected signature \(\mathbb {E}[\Phi _{\text{Sig}}(\mathbf {X})] := u_0(X_0)\) (we set \(X_0 := 0\)) since the value function u satisfies the recursion
$$\displaystyle \begin{aligned} u_t(x) = \sum_{y \in \mathbb{R}^d} \mathbb{P}(X_{t+1} = y| X_t = x)\exp(y - x) \otimes u_{t+1}(y). \end{aligned}$$
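The Chen factorization \(\Phi _{\text{Sig}}(X) = \exp (\Delta _1 X) \otimes \ldots \otimes \exp (\Delta _T X)\) is easy to realize numerically at a fixed truncation level K. The following numpy sketch (helper names are our own, not from [28]) represents a truncated tensor-algebra element as a list of level-k arrays and computes the truncated signature of a discrete path:

```python
import numpy as np
from math import factorial

def tensor_mul(a, b, K):
    """Truncated tensor product: level k of a (x) b is sum_{i+j=k} a_i (x) b_j."""
    return [sum(np.tensordot(a[i], b[k - i], axes=0) for i in range(k + 1))
            for k in range(K + 1)]

def tensor_exp(v, K):
    """Tensor exponential of a level-1 element v in R^d: level k is v^{(x) k}/k!."""
    levels, power = [np.array(1.0)], np.array(1.0)
    for k in range(1, K + 1):
        power = np.tensordot(power, v, axes=0)
        levels.append(power / factorial(k))
    return levels

def signature(path, K):
    """Chen's formula: Phi_Sig(x) = exp(Delta_1 x) (x) ... (x) exp(Delta_T x),
    truncated at tensor level K; path is a (T+1, d) array of values."""
    sig = tensor_exp(np.zeros(path.shape[1]), K)   # tensor-algebra unit (1, 0, 0, ...)
    for inc in np.diff(path, axis=0):
        sig = tensor_mul(sig, tensor_exp(inc, K), K)
    return sig

# For a single linear segment with increment a, level k of the signature is a^{(x) k}/k!.
a = np.array([0.5, -0.3])
sig = signature(np.stack([np.zeros(2), a]), K=3)
```

As a consistency check, refining a straight-line path into several colinear segments leaves every signature level unchanged, since colinear tensor exponentials commute.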
Algorithm 1 Pseudo-code for \( \mathbb {E}[\Phi _{\text{Sig}}(\mathbf {X})]\)
The Markov chain \(\mathbf {X}\) is represented as a rooted tree with root r. The procedure ExpSig0(a) returns 1 if a.children is empty; otherwise it initializes sum to 0, updates sum by \(p(a,b) \exp (b-a) \otimes \mathrm {ExpSig0}(b)\) for each b in a.children, and returns sum. The output is \(\mathrm {ExpSig0}(r) = \mathbb {E}[\Phi _{\text{Sig}}(\mathbf {X})]\).
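As a sanity check on the backward recursion, here is a self-contained numpy sketch (the tree representation, node class, and state values are our own choices, not from [28]) that computes the expected truncated signature both by the recursion of Algorithm 1 and by brute-force enumeration of sample paths:

```python
import numpy as np
from math import factorial

def tensor_mul(a, b, K):
    """Truncated tensor product in T((R^d)), levels 0..K."""
    return [sum(np.tensordot(a[i], b[k - i], axes=0) for i in range(k + 1))
            for k in range(K + 1)]

def tensor_exp(v, K):
    """Tensor exponential of v in R^d, truncated at level K."""
    levels, power = [np.array(1.0)], np.array(1.0)
    for k in range(1, K + 1):
        power = np.tensordot(power, v, axes=0)
        levels.append(power / factorial(k))
    return levels

class Node:
    """Node of the rooted tree representing a nowhere recombining Markov chain."""
    def __init__(self, value, children=()):
        self.value = np.asarray(value, dtype=float)
        self.children = list(children)      # list of (transition prob, Node)

def exp_sig(a, K):
    """Backward recursion of Algorithm 1: expected signature of the subtree at a."""
    d = len(a.value)
    if not a.children:
        return tensor_exp(np.zeros(d), K)   # unit element (1, 0, 0, ...)
    total = [np.zeros((d,) * k) for k in range(K + 1)]
    for p, b in a.children:
        term = tensor_mul(tensor_exp(b.value - a.value, K), exp_sig(b, K), K)
        total = [t + p * s for t, s in zip(total, term)]
    return total

def exp_sig_bruteforce(a, K):
    """Reference: enumerate all root-to-leaf paths and average their signatures."""
    def paths(node):
        if not node.children:
            return [(1.0, [node.value])]
        return [(p * q, [node.value] + tail)
                for p, b in node.children for q, tail in paths(b)]
    d = len(a.value)
    total = [np.zeros((d,) * k) for k in range(K + 1)]
    for prob, values in paths(a):
        sig = tensor_exp(np.zeros(d), K)
        for inc in np.diff(np.stack(values), axis=0):
            sig = tensor_mul(sig, tensor_exp(inc, K), K)
        total = [t + prob * s for t, s in zip(total, sig)]
    return total

# A small nowhere recombining chain (distinct states at each time step):
root = Node([0.0], [(0.5, Node([1.0], [(1.0, Node([2.0]))])),
                    (0.5, Node([-1.0], [(1.0, Node([0.5]))]))])
rec = exp_sig(root, K=3)
ref = exp_sig_bruteforce(root, K=3)
```

The recursion visits each node once, whereas the brute-force reference is exponential in the depth of the tree; both agree level by level on this toy example.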
The computation of the rank-1 signature \(\Phi ^{(1)}_{\text{Sig}}(\mathbf {X})\) follows along the same lines. We now apply Lemma 2.9 with \(A = T((H))\) (recall that H denotes the target space of \(\Phi _{\text{Sig}}(X)\)) and \(Z_t = \exp ({\Delta _t \widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})}) = \exp (\mathbb {E}[\Phi _{\text{Sig}}(X) \vert X_{t}]-\mathbb {E}[\Phi _{\text{Sig}}(X)\vert X_{t-1}])\). Note that the value of \(Z_t = a\) depends on \(X_t = x\) and \(X_{t-1} = x^\prime \); since \( X \) is nowhere recombining by assumption, the path segment \((X_s)_{s = 1}^t\) with \( X_t = x \) is uniquely determined, so \(X_{t-1} = x^\prime \) is in fact uniquely determined by \(X_t = x\). Based on this observation, the function \( v_t(a) := \mathbb {E}[Z_{t+1}\cdots Z_T|Z_t=a ] \) from Lemma 2.9 can be viewed as a function of \(X_t = x\) and therefore satisfies the recursion
$$\displaystyle \begin{aligned}\displaystyle v_t(x) = \sum_{y \in \mathbb{R}^d} \mathbb{P}(X_{t+1} = y \vert X_t = x) \\\displaystyle \cdot \exp \Big\{ \mathbb{E}[\Phi_{\text{Sig}}(X) \vert X_{t+1}=y]-\mathbb{E}[\Phi_{\text{Sig}}(X)\vert X_{t}=x] \Big\} * v_{t+1}(y), \end{aligned} $$
where we used the symbol \(*\) to denote the tensor product on \(T((H))\). Of course, the exponential appearing in the above expression is taken with respect to the operation \(*\), that is, for \(h \in H\), \(\exp (h) = \sum _{n=0}^\infty \frac {h^{*n}}{n!}\). One can readily check that \(v_0(X_0) = \mathbb {E}[\Phi ^{(1)}_{\text{Sig}}(\mathbf {X})]\) (with \(X_0 := 0\) as usual) is the quantity we want to compute.
This recursion is more involved as it requires two evaluations of \( \mathbb {E}[\Phi _{\text{Sig}}(X)\vert X_{t}] \) at every step. However, using that the process \(\mathbf {X}\) is nowhere recombining, we can rewrite it as the following system
$$\displaystyle \begin{aligned} v_t(x) &= \sum_{y \in \mathbb{R}^d} \mathbb{P}(X_{t+1} = y \vert X_t = x) \\ &\quad \cdot \exp \big\{ \Phi_{\text{Sig}}(X \vert X_t {=} y) \otimes u_{t+1}(y)-\Phi_{\text{Sig}}(X \vert X_{t-1} {=} x) \otimes u_{t}(x) \big\} * v_{t+1}(y), \\ u_t(x) &= \sum_{y\in \mathbb{R}^d} \mathbb{P}(X_{t+1} = y \vert X_t = x) \exp \big\{ y-x \big\} \otimes u_{t+1}(y), \end{aligned} $$
where \(\Phi _{\text{Sig}}(X \vert X_t = y)\) denotes the signature of the path segment \((X_s)_{s = 1}^t\) with \( X_t = y \), which is well defined since \( X \) is nowhere recombining by assumption. Unfortunately, \( v_t(x) \) depends on both \( u_t \) and \( u_{t+1}\), and since multiplication in \(T((H))\) is non-commutative, there is no way to separate the two dependencies. Because of this, \( u_t(x) \) needs to be computed before \( v_t(x) \), and the recursion is best solved by caching the relevant values at every function call. This approach is outlined in Algorithm 2.
Algorithm 2 Pseudo-code for \( \mathbb {E}[\Phi ^{(1)}_{\text{Sig}}(\mathbf {X})]\)
The Markov chain \(\mathbf {X}\) is again represented as a rooted tree with root r, and s(a) denotes the signature of the sample path ending at node a. The procedure ExpSig1(a) initializes two accumulators to zero and, iterating over the children of a, updates them through recursive calls to ExpSig1 combined with tensor exponentials and tensor products, thereby implementing the coupled recursion for \((u_t, v_t)\) above. The output is \(\mathrm {ExpSig1}(r) = \mathbb {E}[\Phi ^{(1)}_{\text{Sig}}(\mathbf {X})]\).
The above algorithm based on the backward recursion method provides a practical way to compute higher-rank signatures numerically. However, in this algorithm one needs to truncate the signature features \(\Phi _{\text{Sig}}\) in every step, and it is a notorious property of truncated signatures that the dimension grows exponentially as the truncation level increases: \(\Phi _{\text{Sig},k}(\mathbf {\mathit {x}}) = (1, \int d\mathbf {\mathit {x}}, \ldots , \int d\mathbf {\mathit {x}}^{\otimes k}) \in \mathbb {R}^{1 + d + \ldots + d^k}\); the situation gets worse if one truncates the higher-rank signatures \(\Phi ^{(1)}_{\text{Sig}}\). On the other hand, in many applications (see Sect. 3) we are more interested in the higher-rank signature distance between two filtered processes than in the higher-rank robust signature itself. We can therefore make use of the inner product structure of the target spaces of higher-rank robust signatures to design a more efficient algorithm for computing these distances, which are \(\mathbb {R}\)-valued objects, as the next paragraph will show.
Method 2: Conditional KME and Signature Kernel PDE Trick
The aim of this algorithm is to compute the rank-1 signature kernel MMD \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) for \(\mathbf {X}, \mathbf {Y} \in \text{FP}_{\text{plain}}\); it can be naturally extended to compute higher-rank signature MMDs. In contrast to the first algorithm, we will make extensive use of techniques from kernel learning. The statements below are mainly borrowed from [73] and [47].
The starting point of this algorithm is the property from Proposition 2.8 that the square of \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) admits the representation
$$\displaystyle \begin{aligned} \mathcal{D}^{(1)}_{\text{Sig}}(\mathbf{X}, \mathbf{Y})^2 = \mathbb{E}[\mathbf{\mathit{k}}^{(1)}_{\text{Sig}}(Z,Z^\prime)] - 2\mathbb{E}[\mathbf{\mathit{k}}^{(1)}_{\text{Sig}}(Z, \hat Z)] + \mathbb{E}[\mathbf{\mathit{k}}^{(1)}_{\text{Sig}}(\hat Z,\hat Z^\prime)], \end{aligned} $$
where \(Z,Z^\prime ,\hat Z,\hat Z^\prime \) are independent with \(Z,Z^\prime \sim \mathrm {pp}^{(1)}(\mathbf {X})\) and \(\hat Z,\hat Z^\prime \sim \mathrm {pp}^{(1)}(\mathbf {Y})\).
Thus, to compute \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) it suffices to compute the three expectations of inner products on the right-hand side of the above equation separately. To avoid notational excess, we assume from now on, without loss of generality, that \(\mathbf {X}\) and \(\mathbf {Y}\) live on a common probability space such that \(Z = \mathrm {pp}^{(1)}(\mathbf {X})\) and \(\hat Z = \mathrm {pp}^{(1)}(\mathbf {Y})\) are independent. Let us compute \(\mathbb {E}[\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X}), \mathrm {pp}^{(1)}(\mathbf {Y}))]\) as an example. We decompose the numerical solution into four steps; for more details on this algorithm, we refer readers to [73] or [47]. Let us first summarize the main ideas of the algorithm in an informal style.
Informal Description
Recall that what we need to compute is the expectation of the rank-1 signature kernel on the prediction processes \(\mathrm {pp}^{(1)}(\mathbf {X})\) and \(\mathrm {pp}^{(1)}(\mathbf {Y})\), namely \(\mathbb {E}[\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X}), \mathrm {pp}^{(1)}(\mathbf {Y}))]\). Given samples \(\hat x^i \sim \mathrm {law}(\mathbf {X})\), \(i=1,\ldots ,M\), and \(\hat y^j \sim \mathrm {law}(\mathbf {Y})\), \(j=1,\ldots ,N\), we can use the empirical measures \(\mu ^M := \frac {1}{M}\sum _{i=1}^M \delta _{\hat x^i}\) and \(\nu ^N := \frac {1}{N}\sum _{j=1}^N\delta _{\hat y^j}\) to construct an empirical estimator
$$\displaystyle \begin{aligned} \mathbb{E}_{\mu^M \otimes \nu^N}[\mathbf{\mathit{k}}^{(1)}_{\text{Sig}}(\mathrm{pp}^{(1)}(\mathbf{X}), \mathrm{pp}^{(1)}(\mathbf{Y}))] {=} \frac{1}{NM}\sum_{i=1}^M\sum_{j=1}^N \mathbf{\mathit{k}}^{(1)}_{\text{Sig}}(\mathrm{pp}^{(1)}(\mathbf{X})(\hat x^i), \mathrm{pp}^{(1)}(\mathbf{Y})(\hat y^j)) \end{aligned}$$
for \(\mathbb {E}[\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X}), \mathrm {pp}^{(1)}(\mathbf {Y}))]\). Based on this observation, our task boils down to the numerical computation of the inner product \(\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X})(\hat x^i), \mathrm {pp}^{(1)}(\mathbf {Y})(\hat y^j))\) for each pair \((i,j)\). A well-known result is that for \(\mathbb {R}^d\)-valued piecewise smooth paths x and y defined on \([0,T]\) (recall that in our case all paths are discrete on \(0,1,\ldots ,T\), and we view them as piecewise-linear paths on \([0,T]\) by linear interpolation), the signature kernel \(\mathbf {\mathit {k}}_{\text{Sig}}(x,y)\) is the terminal value \(v^{x,y}(T,T)\) of the unique solution \(v^{x,y}(\cdot ,\cdot )\) to the Goursat problem
$$\displaystyle \begin{aligned} \frac{\partial^2 v^{x,y}(s,t)}{\partial s \partial t} = \big\langle \frac{\partial}{\partial t}x_t, \frac{\partial}{\partial s}y_s\big\rangle_{\mathbb{R}^d} v^{x,y}(s,t), \end{aligned}$$
see e.g. [30] or [57, Section 7.2, PDE Solver]. By the very definition of the rank-1 signature kernel \(\mathbf {\mathit {k}}_{\text{Sig}}^{(1)}(\cdot ,\cdot )\), which is essentially induced by the signature of paths with values in the Hilbert space H, the same computation as in the classical case above shows that \(\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X})(\hat x^i), \mathrm {pp}^{(1)}(\mathbf {Y})(\hat y^j))\) is the terminal value \(u^{i,j}(T,T)\) of the unique solution \(u^{i,j}\) to the Goursat problem
$$\displaystyle \begin{aligned} \frac{\partial^2 u^{i,j}(s,t)}{\partial s \partial t} = \big\langle \frac{\partial}{\partial t}\widetilde{\mathrm{pp}}^{(1)}(\mathbf{X})(\hat x^i), \frac{\partial}{\partial s}\widetilde{\mathrm{pp}}^{(1)}(\mathbf{Y})(\hat y^j)\big\rangle_{H} u^{i,j}(s,t), \end{aligned}$$
where \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X}) = \mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X) | \mathcal {F}^{\mathbf {X}}_t]\) and \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y}) = \mathbb {E}_{\mathbf {Y}}[\Phi _{\text{Sig}}(Y) | \mathcal {F}^{\mathbf {Y}}_t]\) are conditional expectations of H-valued random variables \(\Phi _{\text{Sig}}(X)\) and \(\Phi _{\text{Sig}}(Y)\) respectively, which can be viewed as functions on the path space \((I \to \mathbb {R}^d)\), as the filtrations \(\mathcal {F}^{\mathbf {X}}\) and \(\mathcal {F}^{\mathbf {Y}}\) are natural. Clearly, if we can compute the inner product \(\big \langle \frac {\partial }{\partial t}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \frac {\partial }{\partial s}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j)\big \rangle _{H}\), then we would be able to solve the above Goursat type PDE, and then get the desired quantity \(\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X})(\hat x^i), \mathrm {pp}^{(1)}(\mathbf {Y})(\hat y^j))\). Hence, our goal is to numerically approximate \(\big \langle \frac {\partial }{\partial t}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \frac {\partial }{\partial s}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j)\big \rangle _{H}\). To do this, as \(\widetilde {\mathrm {pp}}^{(1)}_t(\mathbf {X}) = \mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X) | \mathcal {F}^{\mathbf {X}}_t]\) and \(\widetilde {\mathrm {pp}}^{(1)}_t(\mathbf {Y}) = \mathbb {E}_{\mathbf { Y}}[\Phi _{\text{Sig}}(Y) | \mathcal {F}^{\mathbf {Y}}_t]\), we should of course find a consistent estimator of the above conditional expectations in terms of data \((\hat x^i)_{1\le i\le M}\) and \((\hat y^j)_{1\le j\le N}\).
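To make the PDE step concrete, here is a minimal numpy sketch of a first-order finite-difference scheme for a Goursat problem of the above type (the function and variable names are our own; see [30] and [57, Section 7.2] for the actual solvers used in practice):

```python
import math
import numpy as np

def goursat_kernel(x, y):
    """First-order scheme for the Goursat problem
        d^2 v / ds dt = <x'(t), y'(s)> v,   v(0, .) = v(., 0) = 1,
    for piecewise-linear paths given as (n, d) arrays of values.
    Integrating the PDE over one grid cell gives the update
        v[i+1, j+1] = v[i+1, j] + v[i, j+1] + (<dx_i, dy_j> - 1) v[i, j]."""
    dx, dy = np.diff(x, axis=0), np.diff(y, axis=0)
    inner = dx @ dy.T                       # inner[i, j] = <dx_i, dy_j>
    v = np.ones((len(dx) + 1, len(dy) + 1))
    for i in range(len(dx)):
        for j in range(len(dy)):
            v[i + 1, j + 1] = (v[i + 1, j] + v[i, j + 1]
                               + (inner[i, j] - 1.0) * v[i, j])
    return v[-1, -1]

# Sanity check on linear paths x_t = t a, y_t = t b on [0, 1], where the
# signature kernel has the closed form sum_k <a, b>^k / (k!)^2.
a, b = np.array([0.6, 0.2]), np.array([0.3, -0.5])
t = np.linspace(0.0, 1.0, 201)[:, None]
k_pde = goursat_kernel(t * a, t * b)
k_exact = sum(np.dot(a, b) ** k / math.factorial(k) ** 2 for k in range(10))
```

The same scheme applies verbatim once the Goursat data \(\big \langle \frac {\partial }{\partial t}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \frac {\partial }{\partial s}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j)\big \rangle _{H}\) has been approximated, as only the array of increment inner products changes.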
From the above discussions, we can conclude that the algorithm of computing the rank-1 signature kernel MMD \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) consists of the following subtasks:
1.
Use samples \((\hat x^i)_{1\le i\le M} \sim \mathrm {law}(\mathbf {X})\) and \((\hat y^j)_{1\le j\le N} \sim \mathrm {law}(\mathbf {Y})\) to construct empirical estimators for conditional kernel means \(\widetilde {\mathrm {pp}}^{(1)}_t(\mathbf {X}) = \mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X) | \mathcal {F}^{\mathbf {X}}_t]\) and \(\widetilde {\mathrm {pp}}^{(1)}_t(\mathbf {Y}) = \mathbb {E}_{\mathbf {Y}}[\Phi _{\text{Sig}}(Y) | \mathcal {F}^{\mathbf {Y}}_t]\). See Step 1 and Step 2 below.
 
2.
Apply the empirical estimators constructed in the previous steps to approximate the inner product \(\big \langle \frac {\partial }{\partial t}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \frac {\partial }{\partial s}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j)\big \rangle _{H}\), so that one can find a numerical solution of the Goursat problem
$$\displaystyle \begin{aligned} \frac{\partial^2 u^{i,j}(s,t)}{\partial s \partial t} = \big\langle \frac{\partial}{\partial t}\widetilde{\mathrm{pp}}^{(1)}(\mathbf{X})(\hat x^i), \frac{\partial}{\partial s}\widetilde{\mathrm{pp}}^{(1)}(\mathbf{Y})(\hat y^j)\big\rangle_{H} u^{i,j}(s,t) \end{aligned}$$
for each pair of samples \((\hat x^i, \hat y^j)\). The resulting quantity \(u^{i,j}(T,T)\) is an approximation of the rank-1 signature kernel \(\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\mathrm {pp}^{(1)}(\mathbf {X})(\hat x^i), \mathrm {pp}^{(1)}(\mathbf {Y})(\hat y^j))\). From this we can build an empirical estimator \(\hat {\mathcal {D}}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y}) = \frac {1}{NM}\sum _{i=1}^M\sum _{j=1}^N u^{i,j}(T,T)\) for \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\). See Step 3 and Step 4 below.
 
3.
It is interesting to notice that, thanks to the concrete form of the estimators of \(\widetilde {\mathrm {pp}}^{(1)}_t(\mathbf {X}) = \mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X) | \mathcal {F}^{\mathbf {X}}_t]\) and \(\widetilde {\mathrm {pp}}^{(1)}_t(\mathbf {Y}) = \mathbb {E}_{\mathbf {Y}}[\Phi _{\text{Sig}}(Y) | \mathcal {F}^{\mathbf {Y}}_t]\), the computation of the inner product \(\big \langle \frac {\partial }{\partial t}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \frac {\partial }{\partial s}\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j) \big \rangle _H\) on the Hilbert space H can be reduced to finding numerical solutions of Goursat PDEs on \(\mathbb {R}^d\), namely solving
$$\displaystyle \begin{aligned} \frac{\partial^2 \tilde u^{i,j}(s,t)}{\partial s \partial t} = \big\langle \frac{\partial}{\partial t}\hat x^i_t, \frac{\partial}{\partial s}\hat y^j_{s}\big\rangle_{\mathbb{R}^d} \tilde u^{i,j}(s,t), \quad (s,t) \in [0,T] \times [0,T]. \end{aligned}$$
Please see Step 4 below for the details.
 
Step 1: Estimation of Conditional Kernel Mean is a Minimization Problem
First, note that we have by definition
$$\displaystyle \begin{aligned} \mathbf{\mathit{k}}^{(1)}_{\text{Sig}}(\mathrm{pp}^{(1)}(\mathbf{X}), \mathrm{pp}^{(1)}(\mathbf{Y})) = \langle \Phi_{\text{Sig}}(\widetilde{\mathrm{pp}}^{(1)}(\mathbf{X})), \Phi_{\text{Sig}}(\widetilde{\mathrm{pp}}^{(1)}(\mathbf{Y})) \rangle_{H}, \end{aligned}$$
where \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X}) = \mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X)|\mathcal {F}^{\mathbf {X}}_t], t \in I\). Hence, our initial task is to estimate the conditional kernel mean \(\mathbb {E}_{\mathbf {X}}[\Phi _{\text{Sig}}(X)|\mathcal {F}^{\mathbf {X}}_t] \in H\) for each \(t \in I\). To do this, we will adopt the conditional kernel mean embedding scheme proposed in [65]. Roughly speaking, as the filtration \(\mathcal {F}^{\mathbf {X}}_t = \sigma (X_s, s\le t)\) is the natural filtration generated by X, there is a measurable function \(F_t: (I_t \to \mathbb {R}^d) \to H\) such that
$$\displaystyle \begin{aligned} \mathbb{E}_{\mathbf{X}}[\Phi_{\text{Sig}}(X)|\mathcal{F}^{\mathbf{X}}_t] = F_t \circ X|_{I_t} \end{aligned}$$
where \(I_t = \{1,\ldots ,t\}\), and \(F_t\) is the minimizer of the loss function \(\mathbb {E}_{\mathbf {X}}[\|\Phi _{\text{Sig}}(X) - F(X|_{I_t})\|_{H}^2]\) among all measurable functions \(F \in L^2((I_t \to \mathbb {R}^d), H, \mathbb {P}^{\mathbf {X}} \circ (X|_{I_t})^{-1})\).
Step 2: Solve Minimization Problem via Empirical Data
In practice, one uses the unique minimizer of the so-called regularized empirical surrogate loss with regularization parameter \(\lambda _M >0\), given by
$$\displaystyle \begin{aligned} \hat{\mathcal{E}}_{X|_{I_t}, M, \lambda_M}(F) := \frac{1}{M}\sum_{i=1}^M \|\Phi_{\text{Sig}}(\hat x^i) - F(\hat x^i|_{I_t})\|_{H}^2 + \lambda_M \|F\|^2_{\mathscr{G}^t} \end{aligned}$$
as a good estimator for \(F_t\), where \(\hat x^i, i = 1,\ldots ,M\) are i.i.d. sample paths of \(\mathrm {law}(X)\), and \(\mathscr {G}^t \subset L^2((I_t \to \mathbb {R}^d), H, \mathbb {P}^{\mathbf {X}} \circ (X|_{I_t})^{-1})\) is a properly chosen Hilbert space which is dense in \(L^2((I_t \to \mathbb {R}^d), H, \mathbb {P}^{\mathbf {X}} \circ (X|_{I_t})^{-1})\). Here we remark that such a special construction of \(\mathscr {G}^t\) will significantly simplify the minimization problem of the functional \(\hat {\mathcal {E}}_{X|_{I_t}, M, \lambda _M}(F)\), and the concrete construction of \(\mathscr {G}^t\) is completely based on the universality of the signature.
As the next step, it can be shown (see [65]) that, given samples \(\hat x^i, i =1,\ldots ,M\), the unique minimizer \(\hat F_t^{(\hat x^i)_{i=1,\ldots ,M}} \in \mathscr {G}^t\) (for simplicity, written \(\hat F_t\) below) of the regularized empirical surrogate loss with regularization parameter \(\lambda _M\) is of the form
$$\displaystyle \begin{aligned} {} \hat F_t(\cdot) = \mathbf{\mathit{k}}_{\text{Sig}}^{X,t} (\cdot) ({\mathbf{K}}^X_{t,t} + M\lambda_M \text{I}_M)^{-1}\Phi_{\text{Sig}}^{X,T} \end{aligned} $$
(16)
where \(\mathbf {\mathit {k}}_{\text{Sig}}^{X,t}(\cdot ) = (\mathbf {\mathit {k}}_{\text{Sig}}(\hat x^i|_{I_t}, \cdot ))_{i=1,\ldots ,M}:(I_t \to \mathbb {R}^d) \to \mathbb {R}^M\) is an \(\mathbb {R}^M\)-valued (continuous) function on the path space \((I_t \to \mathbb {R}^d)\), \({\mathbf {K}}^X_{t,t}= ([{\mathbf {K}}^X_{t,t}]_{i,j})_{i,j=1,\ldots ,M}\) is an \(M \times M\) matrix with entries \([{\mathbf {K}}^X_{t,t}]_{i,j} = \mathbf {\mathit {k}}_{\text{Sig}}(\hat x^i|_{I_t}, \hat x^j|_{I_t}) = \langle \Phi _{\text{Sig}}(\hat x^i), \Phi _{\text{Sig}}(\hat x^j) \rangle _H\), \(\Phi _{\text{Sig}}^{X,T}= (\Phi _{\text{Sig}}(\hat x^i))_{i=1,\ldots ,M}\) is a vector in \(H^M\) and \(\text{I}_M\) is the \(M \times M\)-identity matrix. Clearly, for a given \(x|_{I_t} \in (I_t \to \mathbb {R}^d)\), we first evaluate the function \(\mathbf {\mathit {k}}_{\text{Sig}}^{X,t} (\cdot )\) against it and obtain an \(\mathbb {R}^M\) row vector \((\langle \Phi _{\text{Sig}}(\hat x^i|_{I_t}), \Phi _{\text{Sig}}(x|_{I_t}) \rangle _H)_{i=1,\ldots ,M}\), and then multiply it with the \(M \times M\)-matrix \(({\mathbf {K}}^X_{t,t} + M\lambda _M \text{I}_M)^{-1}\) to get another \(\mathbb {R}^M\) row vector \(\mathbf {\mathit {k}}_{\text{Sig}}^{X,t} (x|_{I_t}) ({\mathbf {K}}^X_{t,t} + M\lambda _M \text{I}_M)^{-1} := (v_1,\ldots ,v_M)\); finally, we multiply the latter vector \((v_1,\ldots ,v_M)\) with the \(H^M\)-vector \(\Phi ^{X,T}_{\text{Sig}} = (\Phi _{\text{Sig}}(\hat x^i))_{i=1,\ldots ,M}\) to get a vector \(\sum _{i=1}^M v_i \Phi _{\text{Sig}}(\hat x^i)\) in the Hilbert space H. The above explains why \(\hat F_t(\cdot )\) defined in (16) is an H-valued function on \((I_t \to \mathbb {R}^d)\).
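To make the linear-algebra structure of (16) concrete, the following numpy sketch computes the coefficient vector \(v = \mathbf{\mathit{k}}_{\text{Sig}}^{X,t}(x)({\mathbf{K}}^X_{t,t} + M\lambda_M \text{I}_M)^{-1}\) and the resulting inner product of two such estimated conditional kernel mean embeddings. The function names are ours, and the Gram matrices are assumed to be precomputed (e.g., by a signature kernel PDE solver).

```python
import numpy as np

def pred_kme_weights(K_tt, k_x, lam):
    """Coefficients v with F_hat_t(x) = sum_i v[i] * Phi_Sig(xhat^i), cf. (16).

    K_tt : (M, M) Gram matrix [k_Sig(xhat^i|_{I_t}, xhat^j|_{I_t})]
    k_x  : (M,)   kernel vector [k_Sig(xhat^i|_{I_t}, x|_{I_t})]
    lam  : ridge regularization parameter lambda_M > 0
    """
    M = K_tt.shape[0]
    # v satisfies v (K_tt + M*lam*I) = k_x; the matrix is symmetric, so solve directly
    return np.linalg.solve(K_tt + M * lam * np.eye(M), k_x)

def pred_kme_inner(v_x, v_y, K_XY):
    """<F_hat_t(x), F_hat_s(y)>_H, where K_XY[i, j] = <Phi_Sig(xhat^i), Phi_Sig(yhat^j)>_H."""
    return float(v_x @ K_XY @ v_y)
```

Representing \(\hat F_t(x)\) by its coefficient vector v is what keeps all later computations finite-dimensional, even though \(\hat F_t(x)\) itself lives in H.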
Step 3: Construction of Empirical Estimator for Higher-Rank Signature MMD
Given i.i.d. sample paths \((\hat x^i)_{i \le M}\) of \(\mathrm {law}(X)\), from Step 2 we can define the H-valued stochastic process \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})_t := \hat F_t^{(\hat x^i)_{i=1,\ldots ,M}}(\cdot ), t \in I\), which serves as an empirical estimator for \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {X})\). Similarly, given i.i.d. sample paths \((\hat y^j)_{j \le N}\) of \(\mathrm {law}(Y)\), we define the H-valued stochastic process \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})_t = \hat F^{(\hat y_j)_{j=1,\ldots ,N}}_t(\cdot ), t \in I\), which is the empirical estimator for \(\widetilde {\mathrm {pp}}^{(1)}(\mathbf {Y})\). Then, for empirical measures \(\hat \mu ^M := \frac {1}{M}\sum _{i=1}^M\delta _{\hat x^i}\) and \(\hat \nu ^N := \frac {1}{N}\sum _{j=1}^N \delta _{\hat y^j}\), we define
$$\displaystyle \begin{aligned} \hat{\mathcal{D}}^{(1)}_{\text{Sig}}(\mathbf{X},\mathbf{Y})^2 := \mathbb{E}_{\hat \mu^M \otimes \hat \nu^N}[\|\Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{X})) - \Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{Y}))\|_{H^{(2)}}^2], \end{aligned}$$
(note that here \(\Phi _{\text{Sig}}(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X}))\) denotes the signature of the H-valued process \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})\)) and use it as an empirical estimator for the rank-1 signature kernel MMD \(\mathcal {D}_{\text{Sig}}^{(1)}(\mathbf {X},\mathbf {Y})\), given samples \((\hat x^i)_{i=1,\ldots ,M}\) and \((\hat y^j)_{j=1,\ldots ,N}\). The next theorem justifies the validity of this estimator: it yields an accurate estimate of \(\mathcal {D}_{\text{Sig}}^{(1)}(\mathbf {X},\mathbf {Y})\) provided the sample sizes M and N are sufficiently large. The corresponding proof can be found in [73, Appendix, Theorem 7].
Theorem 2.10
For\(\mathbf {X},\mathbf {Y} \in \mathit{\text{FP}}_{\mathit{\text{plain}}}\), \(\hat {\mathcal {D}}^{(1)}_{\mathit{\text{Sig}}}(\mathbf {X},\mathbf {Y})\)is a consistent estimator of the rank-1 signature kernel MMD, i.e.,
$$\displaystyle \begin{aligned} \hat{\mathcal{D}}^{(1)}_{\mathit{\text{Sig}}}(\mathbf{X},\mathbf{Y}) \to \mathcal{D}^{(1)}_{\mathit{\text{Sig}}}(\mathbf{X},\mathbf{Y}) \end{aligned}$$
in probability as\(M,N \to \infty \), provided that the regularization parameters\(\lambda _M\), \(\lambda _N\), cf. (16), decay to 0 slower than\(\mathcal {O}(M^{-1/2})\)and\(\mathcal {O}(N^{-1/2})\), respectively.
Step 4: Evaluation of Higher-Rank Signature Kernels: A PDE Trick
Thanks to Theorem 2.10, the approximation of the rank-1 signature MMD \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\) boils down to computing its empirical counterpart \(\hat {\mathcal {D}}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\). Starting from the expression
$$\displaystyle \begin{aligned} \hat{\mathcal{D}}^{(1)}_{\text{Sig}}(\mathbf{X},\mathbf{Y})^2 = \mathbb{E}_{\hat \mu^M \otimes \hat \nu^N}[\|\Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{X})) - \Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{Y}))\|_{H^{(2)}}^2] \end{aligned} $$
and inserting the expressions of the empirical measures \(\hat \mu ^M = \frac {1}{M}\sum _{i=1}^M\delta _{\hat x^i}\) and \(\hat \nu ^N = \frac {1}{N}\sum _{j=1}^N \delta _{\hat y^j}\) into the above expectation, we obtain that
$$\displaystyle \begin{aligned} \begin{array}{*{20}l} {} \mathbb{E}_{\hat \mu^M \otimes \hat \nu^N}[\|\Phi_{\text{Sig}}(\widehat{\text{pp}}^{(1)}(\mathbf{X})) - \Phi_{\text{Sig}}(\widehat{\text{ pp}}^{(1)}(\mathbf{Y}))\|_{H^{(2)}}^2] \\ = \frac{1}{NM}\sum_{i=1}^M\sum_{j=1}^N \|\Phi_{\text{Sig}}(\widehat{\text{pp}}^{(1)}(\mathbf{X}))(\hat x^i) - \Phi_{\text{Sig}}(\widehat{\text{ pp}}^{(1)}(\mathbf{Y}))(\hat y^j)\|_{H^{(2)}}^2. \end{array} \end{aligned} $$
(17)
Since
$$\displaystyle \begin{aligned} \begin{array}{*{20}l} {} \|\Phi_{\text{Sig}}(\widehat{\text{pp}}^{(1)}(\mathbf{X}))(\hat x^i) - \Phi_{\text{Sig}}(\widehat{\text{pp}}^{(1)}(\mathbf{Y}))(\hat y^j)\|_{H^{(2)}}^2 \\ = \langle \Phi_{\text{Sig}}(\widehat{\text{pp}}^{(1)}(\mathbf{X}))(\hat x^i) - \Phi_{\text{Sig}}(\widehat{\text{ pp}}^{(1)}(\mathbf{Y})) (\hat y^j), \\ \Phi_{\text{Sig}}(\widehat{\text{pp}}^{(1)}(\mathbf{X}))(\hat x^i) - \Phi_{\text{Sig}}(\widehat{\text{pp}}^{(1)}(\mathbf{Y}))(\hat y^j)\rangle_{H^{(2)}} \end{array} \end{aligned} $$
(18)
holds for every pair of indices \((i,j)\), the task of computing this empirical estimator is in turn reduced to evaluating the empirical version of the rank-1 signature kernels
$$\displaystyle \begin{aligned} \mathbf{\mathit{k}}^{(1)}_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{X})(\hat x^i), \widehat{\mathrm{pp}}^{(1)}(\mathbf{Y})(\hat y^j) ) = \langle \Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{X}))(\hat x^i), \Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{Y}))(\hat y^j)\rangle_{H^{(2)}}, \end{aligned}$$
for all \(i,j\). By [30, Theorem 2.5] or [57, Section 7.2, PDE Solver], we know that the evaluation of the signature kernel
$$\displaystyle \begin{aligned} \langle \Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{X}))(\hat x^i), \Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{Y}))(\hat y^j)\rangle_{H^{(2)}}\end{aligned}$$
of the H-valued paths \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i)\) and \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j)\) is given by the unique solution \(u^{i,j}(T,T)\) of the following hyperbolic PDE, also known as the Goursat problem:
$$\displaystyle \begin{aligned} \begin{array}{*{20}l} {} \frac{\partial^2 u^{i,j}(s,t)}{\partial s \partial t} = \big\langle \frac{\partial}{\partial t}\widehat{\text{pp}}^{(1)}(\mathbf{X})_t(\hat x^i), \frac{\partial}{\partial s}\widehat{\text{pp}}^{(1)}(\mathbf{Y})_s(\hat y^j)\big\rangle_{H} u^{i,j}(s,t) \\ =\big\langle \widehat{\text{pp}}^{(1)}(\mathbf{X})_t(\hat x^i) - \widehat{\text{pp}}^{(1)}(\mathbf{X})_{t-1}(\hat x^i), \\ \widehat{\text{pp}}^{(1)}(\mathbf{Y})_s(\hat y^j) - \widehat{\text{pp}}^{(1)}(\mathbf{Y})_{s-1}(\hat y^j) \big\rangle_{H} u^{i,j}(s,t), \end{array} \end{aligned} $$
(19)
where we consider \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i)\) and \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j)\) as continuous H-valued paths via linear interpolation. Furthermore, in view of (16) and the fact that \(\mathbf {\mathit {k}}_{\text{Sig}}(\hat x^k,\hat y^\ell ) = \langle \Phi _{\text{Sig}}(\hat x^k), \Phi _{\text{Sig}}(\hat y^\ell ) \rangle _H\), using some linear algebra calculations we can readily check for \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})_t(\hat x^i) = \hat F^{(\hat x^k)_{k \le M}}_t(\hat x^i)\) and \(\widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})_s(\hat y^j) = \hat F^{(\hat y^\ell )_{\ell \le N}}_s(\hat y^j)\) that
$$\displaystyle \begin{aligned} \begin{array}{*{20}l} {} \big\langle \widehat{\text{pp}}^{(1)}(\mathbf{X})_t(\hat x^i), \widehat{\text{pp}}^{(1)}(\mathbf{Y})_s(\hat y^j)\big\rangle_{H} = \langle \hat F^{(\hat x^k)_{k \le M}}_t(\hat x^i), \hat F^{(\hat y^\ell)_{\ell \le N}}_s(\hat y^j) \rangle_H \\ = \langle \mathbf{\mathit{k}}_{\text{Sig}}^{X,t} (\hat x^i) ({\mathbf{K}}^X_{t,t} + M\lambda_M \text{I}_M)^{-1}\Phi_{\text{Sig}}^{X,T}, \mathbf{\mathit{k}}_{\text{Sig}}^{Y,s} (\hat y^j) ({\mathbf{K}}^Y_{s,s} + N\lambda_N\text{I}_N)^{-1}\Phi_{\text{Sig}}^{Y,T} \rangle_H \\ =\mathbf{\mathit{k}}^{\hat x^i}_t ({\mathbf{K}}^{X}_{t,t} + M \lambda_M \text{I}_M)^{-1}{\mathbf{K}}^{X,Y}_{T,T}({\mathbf{K}}^{Y}_{s,s} + N \lambda_N \text{I}_N)^{-1}\mathbf{\mathit{k}}^{\hat y^j}_s, \end{array} \end{aligned} $$
(20)
where \(\mathbf {\mathit {k}}^{\hat x^i}_t \in \mathbb {R}^M\) and \(\mathbf {\mathit {k}}_s^{\hat y^j} \in \mathbb {R}^N\) are the vectors
$$\displaystyle \begin{aligned} {} [\mathbf{\mathit{k}}^{\hat x^i}_t]_k = \mathbf{\mathit{k}}_{\text{Sig}}(\hat x^k|_{I_t}, \hat x^i|_{I_t}), \quad [\mathbf{\mathit{k}}^{\hat y^j}_s]_\ell = \mathbf{\mathit{k}}_{\text{Sig}}(\hat y^\ell|_{I_s}, \hat y^j|_{I_s}), \quad 1 \le k \le M, 1 \le \ell \le N, \end{aligned} $$
(21)
and \({\mathbf {K}}^X_{t,t} \in \mathbb {R}^{M \times M}\), \({\mathbf {K}}^{X,Y}_{T,T} \in \mathbb {R}^{M \times N}\) and \({\mathbf {K}}^Y_{s,s}\in \mathbb {R}^{N \times N}\) are the matrices
$$\displaystyle \begin{aligned} {} &[{\mathbf{K}}^X_{t,t}]_{k, \ell} = \mathbf{\mathit{k}}_{\text{Sig}}(\hat x^k|_{I_t}, \hat x^\ell|_{I_t}), \quad [{\mathbf{K}}^{X,Y}_{T,T}]_{k,\ell} = \mathbf{\mathit{k}}_{\text{Sig}}(\hat x^k,\hat y^\ell), \\ &[{\mathbf{K}}^Y_{s,s}]_{k,\ell} = \mathbf{\mathit{k}}_{\text{Sig}}(\hat y^k|_{I_s}, \hat y^\ell|_{I_s}). \end{aligned} $$
(22)
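Putting (20)-(22) together, the following numpy sketch (function and variable names are ours) assembles the grid of inner products \(\langle \widehat{\mathrm{pp}}^{(1)}(\mathbf{X})_t(\hat x^i), \widehat{\mathrm{pp}}^{(1)}(\mathbf{Y})_s(\hat y^j)\rangle_H\) for one pair \((i,j)\) and takes second-order differences to obtain the increment inner products that drive PDE (19):

```python
import numpy as np

def goursat_coeffs(G_X, G_XY, G_Y, i, j, lam_M, lam_N):
    """Increment inner products of the estimated prediction processes, cf. (19)-(20).

    G_X[k, l, t] = k_Sig(xhat^k|_{I_t}, xhat^l|_{I_t}),  shape (M, M, T+1)
    G_XY[k, l]   = k_Sig(xhat^k, yhat^l),                shape (M, N)
    G_Y[k, l, s] = k_Sig(yhat^k|_{I_s}, yhat^l|_{I_s}),  shape (N, N, T+1)
    """
    M, N = G_XY.shape
    T1 = G_X.shape[2]
    # row t of A is k_Sig^{X,t}(xhat^i) (K^X_{t,t} + M lam_M I)^{-1}, cf. (20)-(21)
    A = np.stack([np.linalg.solve(G_X[:, :, t] + M * lam_M * np.eye(M), G_X[:, i, t])
                  for t in range(T1)])
    B = np.stack([np.linalg.solve(G_Y[:, :, s] + N * lam_N * np.eye(N), G_Y[:, j, s])
                  for s in range(T1)])
    ip = A @ G_XY @ B.T  # ip[t, s] = <pp(X)_t(xhat^i), pp(Y)_s(yhat^j)>_H
    # mixed increments <pp(X)_t - pp(X)_{t-1}, pp(Y)_s - pp(Y)_{s-1}>_H
    return np.diff(np.diff(ip, axis=0), axis=1)
```

This is only a direct transcription of the formulas; the pseudo-code algorithms below organize the same computation over all pairs \((i,j)\) at once.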
Now, applying again the signature kernel PDE trick to the signature kernels above, we see that \(\mathbf {\mathit {k}}_{\text{Sig}}(\hat x^k,\hat y^\ell )\) is the unique solution \(\tilde u^{k,\ell }(T,T)\) of the Goursat problem
$$\displaystyle \begin{aligned} \frac{\partial^2 \tilde u^{k,\ell}(s,t)}{\partial s \partial t} = \big\langle \frac{\partial}{\partial t}\hat x^k_t, \frac{\partial}{\partial s}\hat y^\ell_{s}\big\rangle_{\mathbb{R}^d} \tilde u^{k,\ell}(s,t), \quad (s,t) \in [0,T] \times [0,T]. \end{aligned}$$
Similarly, the quantity \(\mathbf {\mathit {k}}_{\text{Sig}}(\hat x^k|_{I_t},\hat x^i|_{I_t})\) can be computed as the terminal value \(v^{k,i}(t,t)\), where \(v^{k,i}\) is the unique solution to the Goursat problem
$$\displaystyle \begin{aligned} \frac{\partial^2 v^{k,i}(s,r)}{\partial s \partial r} = \big\langle \frac{\partial}{\partial s}\hat x^k_s, \frac{\partial}{\partial r}\hat x^i_{r}\big\rangle_{\mathbb{R}^d} v^{k,i}(s,r), \quad (s,r) \in [0,t] \times [0,t]. \end{aligned}$$
As a result, by solving a family of Goursat PDEs we can compute the above vectors and matrices; inserting them back into (20) yields an estimate of the inner product \( \big \langle \widehat {\mathrm {pp}}^{(1)}(\mathbf {X})_t(\hat x^i), \widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})_s(\hat y^j)\big \rangle _{H}\). This gives the coefficients of PDE (19), which we can then solve to obtain the rank-1 signature kernel \(\mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j) )\) for each pair \((i,j)\). Finally, in view of the expansion (18), these kernel values yield the squared Hilbert norms \(\|\Phi _{\text{Sig}}(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X}))(\hat x^i) - \Phi _{\text{Sig}}(\widehat {\mathrm {pp}}^{(1)}(\mathbf {Y}))(\hat y^j)\|_{H^{(2)}}^2\) for each pair \((i,j)\), which in turn allows us to compute the sum
$$\displaystyle \begin{aligned} \frac{1}{NM}\sum_{i=1}^M\sum_{j=1}^N \|\Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{X}))(\hat x^i) - \Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{Y}))(\hat y^j)\|_{H^{(2)}}^2, \end{aligned}$$
and the latter is exactly the empirical rank-1 signature kernel distance
$$\displaystyle \begin{aligned} \mathbb{E}_{\hat \mu^M \otimes \hat \nu^N}[\|\Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{X})) - \Phi_{\text{Sig}}(\widehat{\mathrm{pp}}^{(1)}(\mathbf{Y}))\|_{H^{(2)}}^2] = \mathcal{D}^{(1)}_{\text{Sig}}(\mathbf{X}, \mathbf{Y})^2. \end{aligned}$$
Here, we remark that one can use the finite difference method from [30, Section 3.1] to solve the above Goursat problems; see Algorithm 3 below. Therein, f denotes the solution operator of this scheme, which returns the update for each step \((s_p, t_q) \to (s_{p+1}, t_{q+1})\), where \((s_p)_p\) and \((t_q)_q\) are discretizations of \([0,T]\).
The whole algorithm can be summarized in pseudo-code as follows. We remark that in this algorithm the distance \(\mathcal {D}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\), although defined via an infinite-dimensional Hilbert space, can be efficiently estimated by solving a (possibly large) system of low-dimensional hyperbolic PDEs (i.e., Goursat problems), whose coefficients depend on the chosen sample paths \(\hat x^i \sim \mathrm {law}(X)\) and \(\hat y^j \sim \mathrm {law}(Y)\).
Algorithm 3 \(\mathsf {PDESolve}\)\(\mathcal {O}(T^2)\)
https://static-content.springer.com/image/chp%3A10.1007%2F978-3-031-97239-3_10/MediaObjects/617302_1_En_10_Figaac_HTML.png
The image shows pseudocode for \(\mathsf {PDESolve}\): given an input matrix \(M \in \mathbb {R}^{T \times T}\) of increment inner products, the algorithm initializes the boundary entries of the solution matrix u to 1, iterates over indices p and q to update u via the solution operator f applied to entries of u and M, and, depending on a flag, returns either the full matrix u or its last entry \(u[-1,-1]\).
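For concreteness, here is a minimal numpy version of \(\mathsf{PDESolve}\), using the explicit second-order finite-difference update of [30] as the solution operator f; this is a sketch under our reading of that scheme, with step sizes absorbed into the increment inner products.

```python
import numpy as np

def pde_solve(M, full=False):
    """Solve the discrete Goursat problem u_{st} = <increments> * u on a grid.

    M[p, q] is the inner product of the p-th increment of one path with the
    q-th increment of the other; the solution grid u has shape (P+1, Q+1)
    with boundary condition u[0, :] = u[:, 0] = 1.
    """
    P, Q = M.shape
    u = np.ones((P + 1, Q + 1))
    for p in range(P):
        for q in range(Q):
            m = M[p, q]
            # second-order explicit update (the solution operator f)
            u[p + 1, q + 1] = ((u[p + 1, q] + u[p, q + 1]) * (1 + 0.5 * m + m * m / 12)
                               - u[p, q] * (1 - m * m / 12))
    return u if full else u[-1, -1]
```

For two constant paths all increments vanish and the solver returns exactly 1, consistent with the signature kernel of trivial paths.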
Algorithm 4 \(\mathsf {ZerothRankGram}\)\(\mathcal {O}(dM^2T^2)\)
https://static-content.springer.com/image/chp%3A10.1007%2F978-3-031-97239-3_10/MediaObjects/617302_1_En_10_Figaad_HTML.png
The image shows pseudocode for \(\mathsf {ZerothRankGram}\): inputs are sample paths \(\{\hat x^i\}_{i=1}^M \sim \mathrm {law}(X)\) and \(\{\hat y^j\}_{j=1}^N \sim \mathrm {law}(Y)\), together with a Boolean flag full. The output is a Gram tensor \(G \in \mathbb {R}^{M \times N \times P \times Q}\) with entries \(G[i,j,p,q] = \mathbf {\mathit {k}}_{\text{Sig}}(\hat x^i|_{I_{s_p}}, \hat y^j|_{I_{t_q}})\), or only the slice \(G[:,:,-1,-1]\) if full is False. The algorithm fills a tensor M with the inner products \(\langle \hat x^i_{s_p}, \hat y^j_{t_q}\rangle \), performs matrix operations on M, and solves each \(G[i,j]\) with the PDE solver.
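A vectorized numpy sketch of \(\mathsf{ZerothRankGram}\) along these lines (function names ours; the finite-difference update is kept inline so the snippet is self-contained) could read:

```python
import numpy as np

def zeroth_rank_gram(X, Y, full=True):
    """Gram tensor G[i, j, p, q] ~ k_Sig(x^i restricted to [0, s_p], y^j to [0, t_q]).

    X : (M, P+1, d) array of sample paths, Y : (N, Q+1, d).
    """
    # inc[i, j, p, q] = <x^i_{s_{p+1}} - x^i_{s_p}, y^j_{t_{q+1}} - y^j_{t_q}>
    inc = np.einsum('ipd,jqd->ijpq', np.diff(X, axis=1), np.diff(Y, axis=1))
    Mn, N, P, Q = inc.shape
    G = np.ones((Mn, N, P + 1, Q + 1))  # boundary condition of the Goursat problem
    for p in range(P):
        for q in range(Q):
            m = inc[:, :, p, q]
            # same second-order update as in PDESolve, applied to all pairs at once
            G[:, :, p + 1, q + 1] = ((G[:, :, p + 1, q] + G[:, :, p, q + 1])
                                     * (1 + 0.5 * m + m * m / 12)
                                     - G[:, :, p, q] * (1 - m * m / 12))
    return G if full else G[:, :, -1, -1]
```

Broadcasting over the sample indices \((i,j)\) is what makes the quadratic cost in M and N manageable in practice.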
Algorithm 5 \(\mathsf {InnerProdPredCondKME}\)\(\mathcal {O}(T^2M^3)\)
https://static-content.springer.com/image/chp%3A10.1007%2F978-3-031-97239-3_10/MediaObjects/617302_1_En_10_Figaae_HTML.png
The image shows pseudocode for \(\mathsf {InnerProdPredCondKME}\), which computes an empirical estimator of a tensor \(\mathbf {M} \in \mathbb {R}^{M \times N \times T \times T}\), cf. (20). Inputs are the Gram matrices \(G_{XX}, G_{XY}, G_{YY}\) and the hyperparameter \(\lambda \). The algorithm loops over indices p and q from 1 to T; the key operations are matrix inversions and multiplications that update the weight matrices \(W_X\) and \(W_Y\), from which the entries \(\mathbf {M}[i,j,p,q]\) are computed.
Algorithm 6 \(\mathsf {FirstRankGram}\)\(\mathcal {O}(T^2M^3)\)
https://static-content.springer.com/image/chp%3A10.1007%2F978-3-031-97239-3_10/MediaObjects/617302_1_En_10_Figaaf_HTML.png
The image shows pseudocode for \(\mathsf {FirstRankGram}\), estimating \(G^{(1)}_{X,Y} \in \mathbb {R}^{M \times N}\) with entries \(G^{(1)}_{X,Y}[i,j] = \mathbf {\mathit {k}}^{(1)}_{\text{Sig}}(\widehat {\mathrm {pp}}^{(1)}(\mathbf {X})(\hat x^i), \widehat {\mathrm {pp}}^{(1)}(\mathbf {Y})(\hat y^j))\). Inputs are the Gram matrices \(G_{XX}, G_{XY}, G_{YY}\) and the hyperparameter \(\lambda \). The algorithm computes \(\mathbf {M} \leftarrow \mathsf {InnerProdPredCondKME}(G_{XX}, G_{XY}, G_{YY}, \lambda )\), updates \(\mathbf {M}\) via slicing and arithmetic operations, solves \(G^{(1)}_{X,Y}[i,j] \leftarrow \mathsf {PDESolve}(\mathbf {M}[i,j])\) for all i and j, and returns \(G^{(1)}_{X,Y}\).
Algorithm 7 \(\mathsf {FirstRankMMD}\)\(\mathcal {O}(dM^2T^2+T^2M^3)\)
https://static-content.springer.com/image/chp%3A10.1007%2F978-3-031-97239-3_10/MediaObjects/617302_1_En_10_Figaag_HTML.png
The image shows pseudocode for \(\mathsf {FirstRankMMD}\), which computes an empirical estimator of the rank-1 signature kernel MMD \(\hat {\mathcal {D}}^{(1)}_{\text{Sig}}(\mathbf {X},\mathbf {Y})\). Inputs are sample paths \(\{\hat x^i\}_{i=1}^M \sim \mathrm {law}(X)\) and \(\{\hat y^j\}_{j=1}^N \sim \mathrm {law}(Y)\), together with the hyperparameter \(\lambda \). The algorithm computes \(G_{XX} \leftarrow \mathsf {ZerothRankGram}(\{\hat x^i\}, \{\hat x^i\}, \text{full} = \text{True})\), and analogously \(G_{XY}\) and \(G_{YY}\); then \(G^{(1)}_{XX} \leftarrow \mathsf {FirstRankGram}(G_{XX}, G_{XX}, G_{XX}, \lambda )\), \(G^{(1)}_{XY} \leftarrow \mathsf {FirstRankGram}(G_{XX}, G_{XY}, G_{YY}, \lambda )\) and \(G^{(1)}_{YY} \leftarrow \mathsf {FirstRankGram}(G_{YY}, G_{YY}, G_{YY}, \lambda )\), and it returns \(\mathrm {avg}(G^{(1)}_{XX}) - 2\,\mathrm {avg}(G^{(1)}_{XY}) + \mathrm {avg}(G^{(1)}_{YY})\).
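The return value of \(\mathsf{FirstRankMMD}\) is simply the expansion of the squared norm in (17)-(18) as Gram-matrix averages; a one-function numpy sketch (name ours):

```python
import numpy as np

def mmd_sq_from_grams(G_XX, G_XY, G_YY):
    """Plug-in estimator D^2 = avg(G_XX) - 2 avg(G_XY) + avg(G_YY)."""
    return float(G_XX.mean() - 2.0 * G_XY.mean() + G_YY.mean())
```

When the two empirical measures coincide, the three averages cancel and the estimated squared MMD is zero, as expected.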

3 Applications

In this section, we present two concrete applications of higher-rank (expected) signatures to problems related to adapted weak topologies.

3.1 Classification of Filtered Processes in Adapted Weak Topologies

This subsection is mainly borrowed from [28, Section 5.1], which showed a significant application of the higher-rank expected signatures to distinguish stochastic processes in hypothesis testing. The Python code of the algorithm is available at: https://github.com/PatricBonnier/Higher-rank-signature-regression.
First recall that expected signatures (the expectations of \(\Phi _{\text{Sig}}\) in our notation) are widely explored in machine learning applications, since they provide a hierarchical description of the law of a stochastic process; in the terminology of statistical learning, the signature map is a so-called “universal and characteristic” feature map. However, the expected signature can only metrize weak convergence and completely ignores the filtration. We now use the algorithms from the previous section to demonstrate that already in arguably the simplest classification task, the use of higher-rank expected signatures can be beneficial.
Define two processes \(\mathbf {X}, \mathbf {Y} \in \text{FP}\) with their natural filtrations as follows
$$\displaystyle \begin{aligned} {\mathbf{X}}^{c}: & X_0 = 0, X_1 = N_1, \qquad \qquad \quad X_2 = c + N_2, \\ {\mathbf{Y}}^{c}: & Y_0 = 0, Y_1 = \sqrt{1-\varepsilon^2}M_1 + \varepsilon c, Y_2 = c+M_2 . \end{aligned} $$
where c denotes a constant, \( N_1, N_2, M_1,M_2 \) are independent binomial random variables, and \( \varepsilon > 0 \) is fixed. Note that for a fixed c and small \(\varepsilon \), the processes \({\mathbf {X}}^{c}\) and \({\mathbf {Y}}^{c}\) have similar laws, yet \({\mathbf {X}}^{c}\) and \({\mathbf {Y}}^{c}\) are essentially different in the rank-1 adapted weak topology, analogous to Example 1.3. We now endow \(\text{FP}\) with a probability measure \(\mu \) as follows: a sample from \(\mu \) consists of sampling \(C \sim N(0,1)\) and then selecting, each with probability \(0.5\), either the process \({\mathbf {X}}^C\) or the process \({\mathbf {Y}}^C\). Figure 2 shows one sample trajectory for each sample from \(\mu \).
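For illustration, a small numpy routine for sampling labelled paths from \(\mu\); purely as a simplifying assumption of this sketch, we draw the innovations \(N_k, M_k\) as standard normals rather than binomials.

```python
import numpy as np

def sample_mu(n, eps, seed=None):
    """Draw n labelled three-step paths (0, first step, second step) from mu."""
    rng = np.random.default_rng(seed)
    C = rng.standard_normal(n)              # C ~ N(0, 1)
    labels = rng.integers(0, 2, size=n)     # 0 -> X^C, 1 -> Y^C, probability 1/2 each
    Z1 = rng.standard_normal(n)             # plays the role of N_1 resp. M_1
    Z2 = rng.standard_normal(n)             # plays the role of N_2 resp. M_2
    step1 = np.where(labels == 0, Z1, np.sqrt(1.0 - eps**2) * Z1 + eps * C)
    step2 = C + Z2                          # same second step for both classes
    return np.stack([np.zeros(n), step1, step2], axis=1), labels
```

For small eps the first step of \(\mathbf{Y}^C\) leaks only a tiny amount of information about C, which is exactly what makes the two classes nearly indistinguishable in law.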
Fig. 2
The sample paths of \( \mu \) are plotted above for \( \varepsilon = 0.01 \). They are marked as either X in red, or Y in green depending on if the sample comes from \( {\mathbf {X}}^C \) or \( {\mathbf {Y}}^C \)
Full size image
We ran the following experiment: we sampled 1000 processes from \(\mu \) and labelled each process according to whether \({\mathbf {X}}^{c}\) or \({\mathbf {Y}}^{c}\) was sampled. We then compared their distances in the rank-0 signature metric \(d_{\text{Sig}}\) (based on the usual signature) and in the rank-1 signature distance \(d^{(1)}_{\text{Sig}}\) (based on the rank-1 signature), truncated at level 6 and level 3 respectively, using Algorithm 1. The data was then split into a training set and a test set, both of size 500, and for \(50 \le m \le 500\) a support vector machine classifier was trained on m data points from the training set with the signature and the rank-1 signature as features.
Figure 3 shows the accuracies of the resulting classifiers on the test set. Observe that for small values of \(\varepsilon \) the classifier based on the usual signature feature (i.e., on the weak topology) is essentially guessing, and even for larger values it does not converge well; by contrast, the classifier based on the rank-1 signature feature (i.e., on the rank-1 adapted weak topology) converges immediately, which is to be expected, as the rank-1 signature separates \({\mathbf {X}}^{c}\) and \({\mathbf {Y}}^{c}\) independently of \(\varepsilon \).
Fig. 3
The accuracies of the linear classifiers trained on \( \Phi _0(X) \) (classification based on the weak topology) and \( \Phi _1(X) \) (classification based on the rank-1 adapted weak topology) are plotted against the number of samples used, in green and red respectively. Solid lines are used for \( \varepsilon = 10^{-4} \) and dashed lines for \( \varepsilon = 10^{-2} \)
Full size image
We emphasize that although this is a toy example, it demonstrates how the expected signature can fail to pick up essential properties of a model, and that higher-rank (expected) signatures provide additional features that linearize complex dependencies between law and filtration.

3.2 Optimal Stopping via Adapted Distribution Regression

In this subsection, which is a shortened version of [47, Section 3], we will briefly show how to employ the higher-rank signatures in a kernel regression approach to learn the value functions of optimal stopping problems. By Theorem 2.6 we know that the rank-r signature kernel MMD \(\mathcal {D}^{(r)}_{\text{Sig}}(\cdot ,\cdot )\) metrizes the rank-r adapted weak topology, for \(r \in \mathbb N_0\). This fact together with a classical result from kernel learning implies the following important and useful theorem, whose proof can be found in [47, Appendix, proof of Theorem 2.14]:
Theorem 3.1
Let\(\sigma > 0\)be fixed and assume all filtered processes take values in a compact subset\(\mathcal {K} \subset \mathbb {R}^d\). For each\(r \in \mathbb {N}_0\)we define the kernel\(K^{(r)}_{\mathit{\text{Sig}}}: \mathit{\text{FP}} \times \mathit{\text{FP}}\to \mathbb {R}\),
$$\displaystyle \begin{aligned} K^{(r)}_{\mathit{\text{Sig}}}(\mathbf{X}, \mathbf{Y}) := \exp(-\sigma^2 \mathcal{D}^{(r)}_{\mathit{\text{Sig}}}(\mathbf{X}, \mathbf{Y})^2) \end{aligned}$$
Then the RKHS generated by\(K^{(r)}_{\mathit{\text{Sig}}}\)is dense in the space\(C_b((\mathit{\text{FP}},\mathcal {T}^{(r)}))\)w.r.t. the uniform topology. In particular, if\(v(\cdot ) \in C_b((\mathit{\text{FP}},\mathcal {T}^{(r)}))\), then for any\(\varepsilon > 0\), there exist\(M \in \mathbb {N}\), \({\mathbf {Y}}^i \in \mathit{\text{FP}}\), and scalars\(a_i \in \mathbb {R}\), \(i=1,\ldots ,M\), such that
$$\displaystyle \begin{aligned} \sup_{\mathbf{X} \in \text{FP}}\Big|v(\mathbf{X}) - \sum_{i=1}^M a_i \exp(-\sigma^2 \mathcal{D}^{(r)}_{\mathit{\text{Sig}}}(\mathbf{X}, {\mathbf{Y}}^i)^2)\Big| \le \varepsilon. \end{aligned}$$
As the construction of \(K^{(r)}_{\text{Sig}}\) is based on the rank-r signature kernel MMD which is in turn induced by the rank-r signature kernel \( {\mathbf{k}}^{(r)}_{\text{Sig}}\), by an abuse of notation we will also call \(K^{(r)}_{\text{Sig}}\) the rank-r signature kernel.
Since the rank-0 adapted weak topology is exactly the weak topology, the above theorem yields that any weakly continuous bounded function \(v(\cdot )\) on \(\text{FP}\) can be approximated by a (linear) regression on the classical signature kernel \(K^{(0)}_{\text{Sig}}(\mathbf {X}, \mathbf {Y}) = \exp (-\sigma ^2 \mathcal {D}_{\text{Sig}}(\mathbf {X}, \mathbf {Y})^2)\). Such an approach is called distribution regression (via the signature kernel) in [58]. This regression scheme is of particular interest in the context of mathematical finance: suppose that \(v: \text{FP} \to \mathbb {R}\) is the pricing functional of a bounded payoff of some path-dependent financial derivative (e.g., a European option), which is certainly weakly continuous, and that we are given a set of pairs \(({\mathbf {Y}}^j, y^j) \in \text{FP} \times \mathbb {R}, j =1,\ldots , M\), where \(y^j = v({\mathbf {Y}}^j)\) is the observed fair price under model \({\mathbf {Y}}^j\). Then one can perform the above signature kernel regression on the data set \(({\mathbf {Y}}^j, y^j)_{j=1,\ldots ,M}\) to determine coefficients \(a_j, j= 1,\ldots , M\), such that
$$\displaystyle \begin{aligned} \Big|v(\cdot) - \sum_{j=1}^M a_j \exp(-\sigma^2 \mathcal{D}_{\text{Sig}}(\cdot, {\mathbf{Y}}^j)^2)\Big| \le \varepsilon. \end{aligned}$$
Moreover, if \(\mathbf {X} \in \text{FP}\) is a new financial model, to compute the fair price \(v(\mathbf {X})\) on the market \(\mathbf {X}\) one just needs to compute the signature kernel MMDs \(\mathcal {D}_{\text{Sig}}(\mathbf {X}, {\mathbf {Y}}^j)^2\) for all \(j=1,\ldots ,M\) via the second algorithm of the last section, and then use \(\sum _{j=1}^M a_j \exp (-\sigma ^2 \mathcal {D}_{\text{Sig}}(\mathbf {X}, {\mathbf {Y}}^j)^2)\) as an approximate price (with an error of at most \(\varepsilon \)).
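A minimal numpy sketch of this distribution regression step, assuming the squared MMDs have already been computed (function names are ours, and the plain linear solve stands in for whatever fitting procedure one prefers):

```python
import numpy as np

def fit_price_regression(D2_train, y, sigma, ridge=0.0):
    """Coefficients a_j with v(.) ~ sum_j a_j exp(-sigma^2 D_Sig(., Y^j)^2).

    D2_train[i, j] = D_Sig(Y^i, Y^j)^2 between training models, y[j] = v(Y^j).
    An optional ridge term can be added for numerical stability.
    """
    K = np.exp(-sigma**2 * np.asarray(D2_train))
    return np.linalg.solve(K + ridge * np.eye(len(y)), np.asarray(y, dtype=float))

def predict_price(a, D2_new, sigma):
    """Approximate price for a new model X, with D2_new[j] = D_Sig(X, Y^j)^2."""
    return float(np.exp(-sigma**2 * np.asarray(D2_new)) @ a)
```

With no ridge term the fitted combination interpolates the training prices exactly, so feeding back the row of squared MMDs of a training model recovers its own price.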
Now let us consider the value function in the optimal stopping problem, or, in the context of mathematical finance, the pricing functional of American-type options,
$$\displaystyle \begin{aligned} v_c(\mathbf{X}) := \sup_{\tau \in \mathrm{ST}(\mathbf{X})} \mathbb E[c(X,\tau)]\end{aligned}$$
for a given (bounded and continuous) function \(c: (I \to \mathbb {R}^d) \times I \to \mathbb {R}\). From Example 1.3 we know that in general it is not weakly continuous, whilst by Lemma 1.9 we know that \(v_c(\cdot )\) is continuous for \(\mathcal {T}^{(T-1)}\). In fact, if we restrict to filtered processes with natural filtration, then on \(\text{FP}_{\text{plain}}\) this functional is continuous w.r.t. the rank-1 adapted weak topology \(\mathcal {T}^{(1)}\) (see [14, Theorem 1.1]). The above observations tell us that in order to apply a regression for computing the fair price of American-type options, one should use the rank-1 signature kernel MMD \(\mathcal {D}^{(1)}_{\text{Sig}}\) instead of \(\mathcal {D}_{\text{Sig}}\) in the usual distribution regression. We call such a regression approach with higher-rank signature kernels the adapted distribution regression.
To illustrate the power of adapted distribution regression, here we empirically investigate the effectiveness of the proposed signature kernel algorithm for two optimal stopping problems:
1.
the pricing of an American basket option with geometric put payoff;
 
2.
the pricing of American options under the rough Bergomi model.
 
More precisely, treating these as distribution regression problems from probability measures on path space to \(\mathbb {R}\), we endow the classical Kernel Ridge Regression (KRR) algorithm with the characteristic kernel \(K^{(1)}_{\text{Sig}}\) studied in previous sections and compare its performance against classical algorithms. The code is available at https://github.com/maudl3116/higherOrderKME.

3.2.1 American Basket Option Pricing

The numerical example in this subsection is taken from [47, Section 3.5.1]. It is well known that an American option can be approximated by a Bermudan option, which can be exercised only at specific dates \(t_0 < t_1 < t_2 < \ldots < t_N\); if the time grid is sufficiently fine, the approximation is accurate. For \(d \in \mathbb {N}\), we consider the Black-Scholes model, using a d-dimensional geometric Brownian motion \((X_t)_{t\geq 0}\) for the stock prices. At time \(t \in [0,T]\) with initial stock prices \(X_t = x \in \mathbb {R}^d\), the price of the American option \(V(t,x)\) can be formulated as the following optimal stopping problem
$$\displaystyle \begin{aligned} V(t,x) := \sup_{\tau \in [t,T]} \mathbb{E}\biggl[ e^{-r(\tau-t)} \Phi(X_{\tau}) \bigg| X_t=x \biggr] \end{aligned}$$
with payoff function \(\Phi : \mathbb {R}^d \to \mathbb {R}_+\), where the supremum is taken over all stopping times. Here, we consider a geometric put payoff defined as follows
$$\displaystyle \begin{aligned} \Phi(x) := \bigg(v - \Big(\prod_{i=1}^d x_i\Big)^{1/d} \bigg)_+\end{aligned}$$
with strike \(v > 0\).
To perform KRR, we consider the classical supervised learning setting and we form a dataset \(\{(\mathcal {X}_i, y_i)\}_{i=1}^N\) of \(N = 100\) input-output pairs, where
  • \(\mathcal {X}_i = \{x^{i,j} : [0,T] \to \mathbb {R}^d\}_{j=1}^{n_i}\) is a set of \(n_i\) paths sampled over the time grid \(\{0, T/\ell , 2T/\ell , \ldots , T\}\) with \(\ell = 10\) from a d-dimensional Black-Scholes model with fixed interest rate \(r = 2 \%\), \(d=20\), and volatility \(\sigma _i\) sampled uniformly at random from \([0.1,0.5]\), as in [43]. For simplicity we set \(n_1 = \ldots = n_N = n\);
  • \(y_i\) is the price of the American option obtained via a binomial-tree method using a tree of depth \(10{,}000\) with strike \(v=100\) and maturity \(T=1\).
The dataset is randomly split into a training set of size \(N_{\text{train}}=90\) and a test set of size \(N_{\text{test}}=10\). We fit the KRR on the training set and evaluate the model on the test set.
We count how many sample paths n are required by each method (namely, the kernels induced by rank-0 and rank-1 signatures) to produce an approximation error \(\epsilon \) relative to the target price obtained using the lattice method. We use the Mean Absolute Percentage Error (MAPE), that is, \(\epsilon =N_{\text{test}}^{-1}\sum _{i=1}^{N_{\text{test}}}{|(\hat {y}_i-y_i)/y_i|}\), where \(\{\hat {y}_i\}_{i=1}^{N_{\text{test}}}\) are the prices predicted by either algorithm. In this supervised learning task we use the code provided in [43] to apply the LS algorithm with basis functions given by the Taylor polynomials up to degree two, as recommended in [43], to obtain the prices \(y_i, i =1, \ldots , N_{\text{test}}\).
As shown in Table 1, the higher-rank signature method, that is, KRR with the rank-1 signature kernel \(K^{(1)}_{\text{Sig}}\), requires significantly fewer sample paths from the models to reach a desired precision than KRR with the classical signature kernel \(K^{(0)}_{\text{Sig}}\). We also note that for the (higher-rank) signature methods the regression is fitted on \(N_{\text{train}}\) base models and so depends on \(n\times N_{\text{train}}\) sample paths. However, for any new prediction the \(n\times N_{\text{train}}\) training sample paths are fixed and do not need to be re-sampled; only n additional samples are required. So for a total of \(N_{\text{test}}\) new predictions we need \(n(N_{\text{train}} + N_{\text{test}}) \approx nN_{\text{test}}\) sample paths when \(N_{\text{train}} \ll N_{\text{test}}\).
Table 1
Geometric put option pricing under the Black-Scholes model. The columns show the number of samples n needed to produce a MAPE of \(\epsilon \) on the test set for each numerical method

| Method | \(\epsilon =10\%\) | \(\epsilon =5\%\) | \(\epsilon =2.5\%\) |
| --- | --- | --- | --- |
| Rank-0 signature kernel \(K^{(0)}_{\text{Sig}}\) | 50 | 250 | 1000 |
| Rank-1 signature kernel \(K^{(1)}_{\text{Sig}}\) | 50 | 200 | 500 |

3.2.2 Pricing American Options Under the Rough Bergomi Model

In this subsection, which is taken from [73], we make use of Support Vector Regression (SVR) equipped with the rank-1 signature kernel \(K^{(1)}_{\text{Sig}}\) to address the pricing of American call options, whose value is
$$\displaystyle \begin{aligned} V(t,x) = \sup_{\tau \in [t,T]} \mathbb{E}\biggl[ e^{-r(\tau-t)} \Phi(S_{\tau}) \bigg| S_t=x \biggr], \end{aligned}$$
where \(\Phi (S_t) = (S_t - v)_+\), and the supremum is taken over all stopping times. Here, we assume that market price dynamics follow a rough Bergomi model, introduced in [18], which is defined as follows
$$\displaystyle \begin{aligned} S_t &= \exp \left\{ \int_0^t \sqrt{ V_u } \mathrm{d}B_u - \frac{1}{2}\int_0^t V_u \mathrm{d}u \right\}, && B_u = \rho W_u^1 + \sqrt{1 - \rho^2}W_u^2, \\ V_t &= \xi\ \exp \left\{ \eta Y^a_t - \frac{\eta^2}{2} t^{2a + 1}\right\}, && Y_t^a = \sqrt{2a + 1} \int_0^t (t - u)^a \mathrm{d}W^1_u, \end{aligned} $$
where \(W^1,W^2\) are two independent Brownian motions.
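The model above can be simulated with a naive left-point Riemann-sum discretization of the Volterra process \(Y^a\). The following is only a sketch with illustrative parameter values, and it is less accurate than the hybrid scheme used in the experiments; the kernel \((t-u)^a\) is simply evaluated on the grid, which underestimates the variance near the singularity.

```python
import numpy as np

def rough_bergomi_paths(n_paths, n_steps, T=1.0, a=-0.4, rho=-0.9, eta=1.9, xi=0.09, seed=0):
    """Naive left-point Riemann-sum simulation of the rough Bergomi model."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    t = np.linspace(0.0, T, n_steps + 1)
    dw1 = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    dw2 = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    # Volterra process Y_t = sqrt(2a+1) * int_0^t (t-u)^a dW^1_u (left-point sum)
    y = np.zeros((n_paths, n_steps + 1))
    for k in range(1, n_steps + 1):
        weights = (t[k] - t[:k]) ** a              # (t_k - t_j)^a for j < k
        y[:, k] = np.sqrt(2 * a + 1) * dw1[:, :k] @ weights
    # variance process V_t = xi * exp(eta * Y_t - eta^2/2 * t^{2a+1})
    v = xi * np.exp(eta * y - 0.5 * eta**2 * t ** (2 * a + 1))
    # price via Euler on log S with B = rho*W1 + sqrt(1-rho^2)*W2, S_0 = 1
    db = rho * dw1 + np.sqrt(1 - rho**2) * dw2
    log_s = np.cumsum(np.sqrt(v[:, :-1]) * db - 0.5 * v[:, :-1] * dt, axis=1)
    s = np.exp(np.hstack([np.zeros((n_paths, 1)), log_s]))
    return s, v

s, v = rough_bergomi_paths(n_paths=200, n_steps=50)
```

Because each log-increment uses the variance at the left grid point, the simulated price remains a discrete-time martingale, which is a useful sanity check.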
We employ the hybrid scheme of [24] to efficiently simulate sample paths of the Volterra process \(Y^a\), with time complexity \(\mathcal {O}(n\log (n))\), where n is the number of time steps. Option prices are numerically obtained using the Monte Carlo algorithm proposed by Longstaff and Schwartz (LS) in [59] with \(500{,}000\) sample paths from the rough Bergomi model with \(\rho =-0.9, \eta =1.9\) and \(\xi =0.09\). The strike is \(v=1\), the maturity is \(T=1\), we use 10 time steps and the interest rate is \(r=0.05\). For the distribution regression task, we consider 100 values \(\{a_i\}_{i=1}^{100}\) for the roughness index a, chosen uniformly at random from \(\left ]-0.5, -0.2\right ]\). Each \(a_i\) yields a different price process \(S^{a_i}\). We regress the corresponding option price \(V^{a_i}(0,S_0=1)\) on \(m \in \mathbb {N}\) sample paths from \(S^{a_i}\), using an SVR endowed with the rank-1 signature kernel \(K^{(1)}_{\text{Sig}}\), and benchmark the method against the rank-0 signature kernel regression for the same number m of sample paths, where \(m = 100, 250, 500\).
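The LS algorithm of [59] can be sketched as follows, with a quadratic polynomial regression basis fitted on in-the-money paths. For a self-contained check, the sketch is run on toy Black-Scholes paths with illustrative parameters rather than rough Bergomi samples; it returns a lower-bound estimate of the American put value.

```python
import numpy as np

def longstaff_schwartz_put(paths, strike, r, T):
    """Longstaff-Schwartz estimate of an American put price from simulated paths,
    with quadratic polynomial basis (S^2, S, 1) regressed on in-the-money paths."""
    n_paths, n_times = paths.shape
    dt = T / (n_times - 1)
    cash = np.maximum(strike - paths[:, -1], 0.0)   # exercise value at maturity
    for k in range(n_times - 2, 0, -1):
        cash *= np.exp(-r * dt)                     # discount continuation one step
        s = paths[:, k]
        payoff = np.maximum(strike - s, 0.0)
        itm = payoff > 0
        if itm.sum() >= 3:
            basis = np.vander(s[itm], 3)            # columns: S^2, S, 1
            coef, *_ = np.linalg.lstsq(basis, cash[itm], rcond=None)
            cont = basis @ coef                     # estimated continuation value
            exercise = payoff[itm] > cont
            cash[itm] = np.where(exercise, payoff[itm], cash[itm])
    return float(np.exp(-r * dt) * cash.mean())

# toy check: geometric Brownian motion paths (Black-Scholes world)
rng = np.random.default_rng(1)
n, m = 20000, 50
dt = 1.0 / m
z = rng.normal(size=(n, m))
logs = np.cumsum((0.05 - 0.5 * 0.2**2) * dt + 0.2 * np.sqrt(dt) * z, axis=1)
paths = 100.0 * np.exp(np.hstack([np.zeros((n, 1)), logs]))
price = longstaff_schwartz_put(paths, strike=100.0, r=0.05, T=1.0)
```

In the experiments this routine is applied to the simulated rough Bergomi paths; the resulting prices \(V^{a_i}\) are then the regression targets for the SVR.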
As shown in Table 2, SVR with kernel \(K^{(1)}_{\text{Sig}}\) systematically outperforms the classical signature kernel \(K^{(0)}_{\text{Sig}}\).
Table 2
American call option pricing under the rough Bergomi model. Entries show the predictive MSE \(\times 10^{-6}\); standard deviations (in parentheses) are obtained by repeating the experiment 10 times with different train-test splits (the m-sample is fixed)

| Method | \(m=100\) | \(m=250\) | \(m=500\) |
| --- | --- | --- | --- |
| Rank-0 signature kernel \(K^{(0)}_{\text{Sig}}\) | 5.27 (\(\pm \) 1.53) | 1.36 (\(\pm \) 0.37) | 0.98 (\(\pm \) 0.20) |
| Rank-1 signature kernel \(K^{(1)}_{\text{Sig}}\) | 3.86 (\(\pm \) 1.41) | 1.40 (\(\pm \) 0.42) | 0.83 (\(\pm \) 0.15) |

4 Bibliographical Discussion

In this section we give a short bibliographical overview on past and recent developments with regard to adapted topologies.
The prediction process was first introduced by Knight [53] in 1975 to study measurable processes in continuous time. In this very general setting, the usefulness of the prediction process stems from the fact that, even though no regularity of the process itself beyond measurability is assumed a priori, the prediction process, taken with respect to the right-continuous filtration generated by the process, is more regular: it has càdlàg paths and satisfies the strong Markov property.
In the unpublished seminal monograph titled “Weak convergence and the general theory of processes” from 1981, Aldous [6] establishes the notion of “extended weak convergence”. While our discussion from Sect. 1.3 is restricted to a discrete-time framework, his work was concerned with continuous time and stochastic bases of filtered processes that satisfy the usual conditions. With Knight’s prediction processes as the main device, Aldous aimed to connect the theory of weak convergence as laid down in Billingsley’s influential book [25] with the, at the time, new developments in the ‘general theory of stochastic processes’ of the Strasbourg school and the martingale approach to stochastic processes. His main motivation stems from theoretical as well as practical deliberations on certain unsatisfactory features of the usual weak topology in regard to stochastic processes: on the one hand, stochastic operations such as the Doob-Meyer decomposition are not solely path transformations, and on the other hand, stochastic optimization problems such as optimal stopping are discontinuous w.r.t. the weak topology, cf. Example 1.3. We also refer to his works [5, 7].
From a model-theoretic perspective, Keisler [51] introduced in a seminar in 1979 the ‘adapted probability logic’ for filtered probability spaces with stochastic processes, which aims to formalize the probabilistic language of stochastic processes. Subsequently, influenced by the work of Aldous, Hoover–Keisler [46] establish notions of equivalence of filtered processes based on higher-rank prediction processes, see Sect. 1.4, which aim to capture all relevant information, in particular the higher-order information that is unseen by the rank 1-prediction process, as demonstrated in Example 1.6. They do so based on the concept of adapted distribution, cf. (11), which they introduce and then further explore. Two filtered processes have the same adapted distribution if they share the same ‘probabilistic properties’, such as having the same law, the same Doob-Meyer decomposition, the same values of filtration-dependent functionals, and being a martingale or predictable. We refer to [46] for a precise discussion of this topic.
In independent work, Hellwig [42] studied in 1996 the question of continuity of sequential decision making problems in the context of mathematical economics. We put his framework in perspective with our initial setup from Sect. 1. For each time \(t \in I\), we have a state space \(\mathcal X_t\) that models the set of possible states of the world, and a set \(\mathcal A_t\) of actions available to the agent at time t, where \(\mathcal A_t\) is a compact metric space and \(\mathcal A := \prod _{t \in I} \mathcal A_t\). Parallel to the introduction, we model the uncertain evolution of the states of the world, as well as the information, by a filtered process \(\mathbf {X}\). For simplicity, we assume here that the stochastic basis of \(\mathbf {X}\) supports a uniform random variable independent of \(\mathcal {F}_T\), which allows us to avoid working with additional randomization. This assumption is without loss of generality since any equivalence class of filtered processes admits a representative with that property. Given a state of the world \(x \in \mathcal X\) and a chosen sequence of actions \(a \in \mathcal A\), the agent's utility is described by the utility function \(u(x,a)\). The goal of the agent is to maximize their utility \(U(\mathbf {X})\) by sequentially choosing actions, where
$$\displaystyle \begin{aligned} {} U(\mathbf{X}) := \max_A \mathbb E\left[ u(X,A) \right] \end{aligned} $$
(23)
and the maximum is taken over the set of all admissible plans of action \(A = (A_t)_{t \in I}\), that is, \((\mathcal {F}_t^{\mathbf {X}})\)-adapted stochastic processes with paths in \(\mathcal A\). Hellwig observed that, even when restricting to \(\text{FP}_{\text{plain}}\) (that is, filtered processes equipped with the natural filtration of the underlying process), the weak topology is not strong enough to guarantee continuity of (23). Thus, a stronger notion of convergence has to be specified: the information topology. In view of the already introduced prediction process, this topology is most easily described as the initial topology w.r.t. the family of maps of marginals of the prediction process
$$\displaystyle \begin{aligned} \text{FP} \to \mathcal P( \mathcal X) : \mathbf{X} \mapsto \mathrm{law}(\mathrm{pp}_t(\mathbf{X})), \quad t\in I. \end{aligned}$$
This topology yields, when \(u \in C_b(\mathcal X \times \mathcal A)\), continuity of the utility maximization problem (23) on \(\text{FP}_{\text{plain}}\), see also the discussion “On all adapted topologies are equal” below.
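Hellwig's setting is general (the utility may depend on the whole path and action sequence), but under the simplifying assumptions of an additive utility and a finite controlled Markov chain, the maximization in (23) reduces to backward induction. The following sketch, with illustrative transition matrices and rewards, makes the role of adaptedness concrete: the action chosen at time t may use only the information available at time t.

```python
import numpy as np

def max_expected_utility(P, r):
    """Backward induction for U = max_A E[ sum_t r_t(X_t, A_t) ] over adapted plans A,
    for a finite controlled Markov chain: P[a] is the transition matrix under action a,
    r[t, x, a] is the reward at time t in state x under action a."""
    T, n_states, n_actions = r.shape
    V = np.zeros(n_states)  # value-to-go after the final step
    for t in range(T - 1, -1, -1):
        # adaptedness: the time-t action depends only on information up to time t,
        # which for a Markov chain reduces to the current state
        Q = r[t] + np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V = Q.max(axis=1)   # optimal value as a function of the current state
    return V

# toy example: two states, action 0 keeps the state, action 1 flips it;
# reward 1 per step while in state 0, over T = 2 steps
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1: switch
r = np.zeros((2, 2, 2))
r[:, 0, :] = 1.0
V0 = max_expected_utility(P, r)
```

Starting in state 0 the agent stays and collects reward in both steps; starting in state 1 the optimal adapted plan switches immediately and collects reward only in the second step.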
Starting from Wasserstein distances, which are a powerful tool to compare probability distributions, see [74, 78], and have become a staple even beyond applied and theoretical probability theory, versions of Wasserstein distances have been introduced for stochastic processes. In this context we want to mention the works of Vershik on the classification of filtrations [76, 77], where an iterated Kantorovich distance is central to his criterion for standardness of filtrations, see also [37, Theorem 2.7], and Rüschendorf’s modification of the Wasserstein distance running under the name of ‘Markov constructions’ [71]. The extension from two to multiple time steps was carried out by Pflug [66] under the name “nested distance” to study continuity and stability of multistage stochastic optimization problems. We refer to [67] for a comprehensive overview of this extensive body of work. Lassalle [56] introduces a ‘causal optimal transport problem’ with an associated distance to study concentration inequalities on the Wiener space; see also the recent article of Föllmer [39] in a similar vein. In the same spirit, adapted Wasserstein distances [11, 17] were introduced to study sensitivity in mathematical finance [13, 16]. Notably, in finite discrete time the adapted Wasserstein distance coincides with the nested distance. It was shown in [17] that the adapted Wasserstein distance naturally extends to the space of filtered processes \(\text{FP}\) and defines there a complete and geodesically complete metric, compatible with the adapted weak topology. This insight allows one to define new ways of taking averages and barycenters of families of stochastic processes that respect the geometry of the paths as well as the underlying filtrations. For example, the barycenter of a family of martingales is again a martingale, which is not the case for classical Wasserstein barycenters, see [4, 55].
Optimal transport-based methods for the numerical resolution of these problems were explored in [38, 68], and the adapted empirical measure for the statistical estimation of laws of stochastic processes was studied, e.g., in [1, 9]. Closely related to these advances are the recently investigated distances for solutions of SDEs [15, 26, 70] and the Knothe–Rosenblatt distance [21].
Finally, we also report on the approach based on signature methods in [28], which opens up the framework of adapted topologies to computationally efficient algorithms based on kernel methods. For more details, we refer to the preceding Sects. 2 and 3.
On All Adapted Topologies Are Equal
Remarkably, in discrete time all of the approaches discussed above yield the same topology on \(\text{FP}_{\text{plain}}\), see [14, Theorem 1.1] and [63, Theorem 11]. This is of importance, since the nested path spaces of the higher-rank prediction processes are quite complex. Consequently, as long as one considers only processes with their generated filtration, which is typically the case in practice, from a topological point of view it is not necessary to consider higher-rank prediction processes. In mathematical terms, this means that when \(\mathbf {X}, {\mathbf {X}}^1, {\mathbf {X}}^2, \dots \) is a sequence in \(\text{FP}_{\text{plain}}\), then \({\mathbf {X}}^n \to \mathbf {X}\) in the adapted weak topology \(\mathcal T\) if and only if this convergence holds in the rank 1-adapted weak topology \(\mathcal T^{(1)}\). We want to stress that, in view of Example 1.3 (and in general for sequential decision making problems), this result requires that the limit of the sequence is an element of \(\text{FP}_{\text{plain}}\). This result emphasizes what was already observed in Example 1.6: the higher-rank prediction process captures information of higher order, which is not needed when considering natural filtrations.
Further Applications of Adapted Topologies
Over the past 40 years, the adapted weak topology and related concepts have found application in multiple fields such as economics, mathematical finance, stochastic control, stochastic analysis and even machine learning. We try to provide a representative cross section: adapted topologies have found application in the study of optimal stopping [33], game options [35], robust hedging [36] and utility maximisation [19] with proportional transaction costs, convergence of filtrations and stability of backward SDEs [34], stability of the Doob-Meyer decomposition [61], martingale representation [64], weak transport theory [3, 10, 12, 49], enlargement of filtrations [2], Skorokhod representation of sequences of stochastic processes [23, 45], stochastic exit time problems [69], continuity in stochastic finance [13], sensitivity analysis of stochastic control problems [16], and in a hypothesis test for martingality [27]. Moreover, we refer to [72] for a comprehensive survey on information structures and associated stochastic control topologies for decision making problems. Related ideas were also recently explored in the context of machine learning; we mention, for example, chain rule optimal transport [62], causal Wasserstein GANs [52, 79], and conditional optimal transport [48]. Higher-rank expected signatures [47, 73] were employed in the context of quantitative finance for real-world calibration and optimal stopping.

Acknowledgements

CL acknowledges support from the National Key R&D Program of China (No. 2023YFA1010900).
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Title
Adapted Topologies and Higher-Rank Signatures
Authors
Chong Liu
Gudmund Pammer
Copyright Year
2026
DOI
https://doi.org/10.1007/978-3-031-97239-3_10
1
We equip \((I \to S)\) with the product topology and \(\mathcal P(S)\) with weak convergence of measures.
 
2
A function \(c: (I \to \mathbb R^d) \times I \to \mathbb R\) is non-anticipative if for each \(t \in I\), the value of \(c(x, t)\) only depends on \(x_1,\ldots ,x_t\), i.e., the trajectory of the path x before time t.
 
3
Here the Hilbert space H is defined as in (15) by replacing \(\mathbb {R}^{d}\) with \(\mathbb {R}^{d+1}\).
 
4
Although \(\mathcal {P}(M^{(0)})\) is a subset of the linear space of all signed measures on \(M^{(0)}\), it is quite hard to compute the resulting signature explicitly if one uses the associated linear structure to define the iterated integrals of measure-valued paths.
 
5
As before, we use linear interpolation to extend the discrete-time path \(\tilde {\mathbf {\mathit {p}}}^r\) to a piecewise linear path on \([0,T]\).
 
6
For the definition of RKHS, see chapter “The Signature Kernel”.
 
7
This example is taken from [28, Section 5.1].
 
1. B. Acciaio, S. Hou, Convergence of adapted empirical measures on \(\mathbb {R}^d\). Preprint. arXiv:2211.10162 (2022)
2. B. Acciaio, J. Backhoff-Veraguas, A. Zalashko, Causal optimal transport and its links to enlargement of filtrations and continuous-time stochastic optimization. Stoch. Process. Appl. 130(5), 2918–2953 (2020)
3. B. Acciaio, M. Beiglböck, G. Pammer, Weak transport for non-convex costs and model-independence in a fixed-income market. Math. Finance 31(4), 1423–1453 (2021)
4. B. Acciaio, D. Kršek, G. Pammer, Multicausal transport: barycenters and dynamic matching. Preprint. arXiv:2401.12748 (2024)
5. D. Aldous, Stopping times and tightness. Ann. Probab. 6, 335–340 (1978)
6. D. Aldous, Weak convergence and general theory of processes. Unpublished monograph, Department of Statistics, University of California, Berkeley, July 1981
7. D. Aldous, Stopping times and tightness. II. Ann. Probab. 17, 586–595 (1989)
8. M. Avellaneda, A. Levy, A. Parás, Pricing and hedging derivative securities in markets with uncertain volatilities. Appl. Math. Finance 2(2), 73–88 (1995)
9. J. Backhoff, D. Bartl, M. Beiglböck, J. Wiesel, Estimating processes in adapted Wasserstein distance. Ann. Appl. Probab. 32(1), 529–550 (2022)
10. J. Backhoff-Veraguas, G. Pammer, Stability of martingale optimal transport and weak optimal transport. Ann. Appl. Probab. 32(1), 721–752 (2022)
11. J. Backhoff-Veraguas, M. Beiglböck, Y. Lin, A. Zalashko, Causal transport in discrete time and applications. SIAM J. Optim. 27(4), 2528–2562 (2017)
12. J. Backhoff-Veraguas, M. Beiglböck, G. Pammer, Existence, duality, and cyclical monotonicity for weak transport costs. Calc. Var. Partial Differential Equations 58(6), 203 (2019)
13. J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, M. Eder, Adapted Wasserstein distances and stability in mathematical finance. Finance Stoch. 24(3), 601–632 (2020)
14. J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, M. Eder, All adapted topologies are equal. Probab. Theory Relat. Fields 178, 1125–1172 (2020)
15. J. Backhoff-Veraguas, S. Källblad, B.A. Robinson, Adapted Wasserstein distance between the laws of SDEs. Preprint. arXiv:2209.03243 (2022)
16. D. Bartl, J. Wiesel, Sensitivity of multiperiod optimization problems with respect to the adapted Wasserstein distance. SIAM J. Financ. Math. 14(2), 704–720 (2023)
17. D. Bartl, M. Beiglböck, G. Pammer, The Wasserstein space of stochastic processes. Preprint. arXiv:2104.14245v1 (2021)
18. C. Bayer, P. Friz, J. Gatheral, Pricing under rough volatility. Quant. Finance 16(6), 887–904 (2016)
19. E. Bayraktar, L. Dolinskyi, Y. Dolinsky, Extended weak convergence and utility maximisation with proportional transaction costs. Finance Stoch. 24(4), 1013–1034 (2020)
20. M. Beiglböck, P. Henry-Labordère, F. Penkner, Model-independent bounds for option prices—a mass transport approach. Finance Stoch. 17(3), 477–501 (2013)
21. M. Beiglböck, G. Pammer, A. Posch, The Knothe–Rosenblatt distance and its induced topology. Preprint. arXiv:2312.16515 (2023)
22. M. Beiglböck, G. Pammer, S. Schrott, X. Zhang, Representing general stochastic processes as martingale laws. Preprint. arXiv:2312.16725 (2023)
23. M. Beiglböck, S. Pflügl, S. Schrott, A probabilistic view on the adapted Wasserstein distance. Preprint. arXiv:2406.19810 (2024)
24. M. Bennedsen, A. Lunde, M.S. Pakkanen, Hybrid scheme for Brownian semistationary processes. Finance Stoch. 21(4), 931–965 (2017)
25. P. Billingsley, Convergence of Probability Measures. Wiley Series in Probability and Statistics, 2nd edn. (Wiley, New York, 1999)
26. J. Bion-Nadal, D. Talay, On a Wasserstein-type distance between solutions to stochastic differential equations. Ann. Appl. Probab. 29(3), 1609–1639 (2019)
27. J. Blanchet, J. Wiesel, E. Zhang, Z. Zhang, Empirical martingale projections via the adapted Wasserstein distance. Preprint. arXiv:2401.12197 (2024)
28. P. Bonnier, C. Liu, H. Oberhauser, Adapted topologies and higher rank signatures. Ann. Appl. Probab. 33(3), 2136–2175 (2023)
29. B. Bouchard, M. Nutz, Arbitrage and duality in nondominated discrete-time models. Ann. Appl. Probab. 25(2), 823–859 (2015)
30. T. Cass, T. Lyons, C. Salvi, W. Yang, The signature kernel is the solution of a Goursat PDE. SIAM J. Math. Data Sci. 3(3), 873–899 (2021)
31. I. Chevyrev, H. Oberhauser, Signature moments to characterize laws of stochastic processes. J. Mach. Learn. Res. 23, 1–42 (2022)
32. R. Cont, Model uncertainty and its impact on the pricing of derivative instruments. Math. Finance 16(3), 519–547 (2006)
33. F. Coquet, S. Toldo, Convergence of values in optimal stopping and convergence of optimal stopping times. Electron. J. Probab. 12, 207–228 (2007)
34. F. Coquet, J. Mémin, L. Słominski, On weak convergence of filtrations, in Séminaire de Probabilités XXXV (2001), pp. 306–328
35. Y. Dolinsky, Applications of weak convergence for hedging of game options. Ann. Appl. Probab. 20(5), 1891–1906 (2010)
36. Y. Dolinsky, Hedging of game options under model uncertainty in discrete time. Electron. Commun. Probab. 19, 1–11 (2014)
37. L. Dubins, J. Feldman, M. Smorodinsky, B. Tsirelson, Decreasing sequences of \(\sigma \)-fields and a measure change for Brownian motion. Ann. Probab. 24(2), 882–904 (1996)
38. S. Eckstein, G. Pammer, Computational methods for adapted optimal transport. Ann. Appl. Probab. 34(1A), 675–713 (2024)
39. H. Föllmer, Optimal couplings on Wiener space and an extension of Talagrand’s transport inequality, in Stochastic Analysis, Filtering, and Stochastic Optimization: A Commemorative Volume to Honor Mark H.A. Davis’s Contributions (Springer, 2022), pp. 147–175
40. H. Föllmer, A. Schied, Stochastic Finance: An Introduction in Discrete Time (Walter de Gruyter, Berlin, 2011)
41. A. Galichon, P. Henry-Labordère, N. Touzi, A stochastic control approach to no-arbitrage bounds given marginals, with an application to lookback options. Ann. Appl. Probab. 24(1), 312–336 (2014)
42. M.F. Hellwig, Sequential decisions under uncertainty and the maximum theorem. J. Math. Econ. 25(4), 443–464 (1996)
43. C. Herrera, F. Krach, P. Ruyssen, J. Teichmann, Optimal stopping via randomized neural networks. Front. Math. Finance 3(1), 31–77 (2024)
44. D.G. Hobson, Robust hedging of the lookback option. Finance Stoch. 2, 329–347 (1998)
45. D.N. Hoover, Convergence in distribution and Skorokhod convergence for the general theory of processes. Probab. Theory Relat. Fields 89, 239–259 (1991)
46. D.N. Hoover, H.J. Keisler, Adapted probability distributions. Trans. Am. Math. Soc. 286(1), 159–201 (1984)
47. B. Horvath, M. Lemercier, C. Liu, T. Lyons, C. Salvi, Optimal stopping via distribution regression: a higher rank signature approach. Preprint. arXiv:2304.01479 (2023)
48. B. Hosseini, A.W. Hsu, A. Taghvaei, Conditional optimal transport on function spaces. Preprint. arXiv:2311.05672 (2023)
49. B. Jourdain, G. Pammer, An extension of martingale transport and stability in robust finance. Electron. J. Probab. 29, 1–30 (2024)
50. C. Kardaras, G. Žitković, Stability of the utility maximization problem with random endowment in incomplete markets. Math. Finance 21(2), 313–333 (2011)
51. J. Keisler, Hyperfinite models of adapted probability logic. Ann. Pure Appl. Logic 31, 71–86 (1986)
52. K. Klemmer, T. Xu, B. Acciaio, D.B. Neill, SPATE-GAN: Improved generative modeling of dynamic spatio-temporal patterns with an autoregressive embedding loss, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 4 (2022), pp. 4523–4531
53. F.B. Knight, A predictive view of continuous time processes. Ann. Probab. 3(4), 573–596 (1975)
54. D. Kramkov, M. Sîrbu, Sensitivity analysis of utility-based prices and risk-tolerance wealth processes. Ann. Appl. Probab. 16(4), 2140–2194 (2006)
55. D. Kršek, G. Pammer, General duality and dual attainment for adapted transport. Preprint. arXiv:2401.11958 (2024)
56. R. Lassalle, Causal transport plans and their Monge–Kantorovich problems. Stoch. Anal. Appl. 36(3), 452–484 (2018)
57. D. Lee, H. Oberhauser, The signature kernel. Preprint. arXiv:2305.04625 (2023)
58. M. Lemercier, C. Salvi, T. Damoulas, E. Bonilla, T. Lyons, Distribution regression for sequential data, in International Conference on Artificial Intelligence and Statistics (2021)
59. F.A. Longstaff, E.S. Schwartz, Valuing American options by simulation: a simple least-squares approach. Rev. Financ. Stud. 14(1), 113–147 (2001)
60. T. Lyons, Uncertain volatility and the risk-free synthesis of derivatives. Appl. Math. Finance 2(2), 117–133 (1995)
61. J. Mémin, Stability of Doob-Meyer decomposition under extended convergence. Acta Math. Appl. Sin. 19(2), 177–190 (2003)
62. F. Nielsen, K. Sun, Chain rule optimal transport, in Progress in Information Geometry: Theory and Applications (Springer, 2021), pp. 191–217
63. G. Pammer, A note on the adapted weak topology in discrete time. Electron. Commun. Probab. 29, 1–13 (2024)
64. A. Papapantoleon, D. Possamaï, A. Saplaouras, Stability results for martingale representations: the general case. Trans. Am. Math. Soc. 372(8), 5891–5946 (2019)
65. J. Park, K. Muandet, A measure-theoretic approach to kernel conditional mean embeddings, in Advances in Neural Information Processing Systems, vol. 33 (2020), pp. 21247–21259
66. G. Pflug, Version-independence and nested distributions in multistage stochastic optimization. SIAM J. Optim. 20(3), 1406–1420 (2010)
67. G. Pflug, A. Pichler, Multistage Stochastic Optimization. Springer Series in Operations Research and Financial Engineering (Springer, Cham, 2014)
68. A. Pichler, M. Weinhardt, The nested Sinkhorn divergence to learn the nested distance. Comput. Manag. Sci. 19(2), 269–293 (2022)
69. C. Reisinger, Y. Zhang, Regularity and stability of feedback relaxed controls. SIAM J. Control Optim. 59(5), 3118–3151 (2021)
70. B.A. Robinson, M. Szölgyenyi, Bicausal optimal transport for SDEs with irregular coefficients. Preprint. arXiv:2403.09941 (2024)
71. L. Rüschendorf, The Wasserstein distance and approximation theorems. Probab. Theory Relat. Fields 70(1), 117–129 (1985)
72. N. Saldi, S. Yüksel, Geometry of information structures, strategic measures and associated stochastic control topologies. Probab. Surv. 19, 450–532 (2022)
73. C. Salvi, M. Lemercier, C. Liu, B. Horvath, T. Damoulas, T. Lyons, Higher order kernel mean embeddings to capture filtrations of stochastic processes, in Advances in Neural Information Processing Systems, vol. 34 (2021), pp. 16635–16647
74. F. Santambrogio, Optimal Transport for Applied Mathematicians (Birkhäuser, New York, 2015)
75. M. Soner, N. Touzi, J. Zhang, Quasi-sure stochastic analysis through aggregation. Electron. J. Probab. 16, 1844–1879 (2011)
76. A.M. Vershik, Theory of decreasing sequences of measurable partitions. Algebra Anal. 6(4), 1–68 (1994)
77. A.M. Vershik, The theory of filtrations of subalgebras, standardness, and independence. Russian Math. Surv. 72(2), 257 (2017)
78. C. Villani, Optimal Transport: Old and New, vol. 338 (Springer, Cham, 2009)
79. T. Xu, L.K. Wenliang, M. Munn, B. Acciaio, COT-GAN: Generating sequential data via causal optimal transport, in Advances in Neural Information Processing Systems, vol. 33 (2020), pp. 8798–8809