
The Signature Kernel

  • Open Access
  • 2026
  • OriginalPaper
  • Chapter

Abstract

The chapter delves into the signature kernel, a powerful tool in machine learning for handling sequential data. It begins by introducing the concept of kernel methods and their application to non-Euclidean domains, particularly sequences of arbitrary length. The text explores how the path signature, a feature map for sequential data, can be used to lift a kernel on a domain to a signature kernel on sequences. It discusses the theoretical guarantees of the signature kernel, including universality, characteristicness, and invariance to time-reparametrization. The chapter also covers practical aspects, such as the computational complexity of the signature kernel and methods for its approximation. Additionally, it touches on the application of the signature kernel in various domains, including hypothesis testing and Gaussian processes. The text concludes with a discussion on the potential of the signature kernel in addressing challenges in machine learning and its role in advancing the field. Readers will gain insights into the theoretical underpinnings and practical applications of the signature kernel, making it a valuable resource for professionals in the field.

1 Introduction

Suppose we have a collection of data valued in an arbitrary set \(\mathcal {X}\) which we wish to study by solving machine learning problems such as classification or regression. Kernel methods provide a powerful and flexible framework for tackling a wide range of such problems. In this framework, one begins with a kernel1 on \(\mathcal {X}\),
$$\displaystyle \begin{aligned} {} \operatorname{k} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \end{aligned} $$
(1)
which acts as a measure of similarity between data points. Kernel methods allow us to use a rich set of features that capture structure in the underlying data while avoiding the explicit computation of such features; they have become a mainstay of machine learning on arbitrary domains \(\mathcal {X}\) [62].
In this article we focus on the case where the domain is a set of sequences of arbitrary length. Sequential data are ubiquitous in practice, usually in the form of discrete time series. Consider the set of sequences of arbitrary length L in the set \(\mathcal {X}\), denoted by
$$\displaystyle \begin{aligned} \mathcal{X}_{\text{seq}} := \{\mathbf{x}=({\mathbf{x}}_0,\ldots,{\mathbf{x}}_L) : L \ge 0,\,{\mathbf{x}}_i \in \mathcal{X}\}. \end{aligned} $$
(2)
Constructing kernels for non-Euclidean domains such as \(\mathcal {X}_{\text{seq}}\) is often challenging. In particular, even when \(\mathcal {X}=\mathbb {R}^d\) is linear, the space of sequences \(\mathcal {X}_{\text{seq}}\) is non-linear, since there is no natural addition operation for sequences of different lengths. In this chapter, we discuss how the path signature from chapter “A Primer on the Signature Method in Machine Learning”, viewed as a feature map for sequential data, can be used to lift a kernel \(\operatorname {k}\) on \(\mathcal {X}\) (Eq. 1) to a signature kernel
$$\displaystyle \begin{aligned} \operatorname{k}_{\operatorname{\mathsf{Sig}}} : \mathcal{X}_{\text{seq}} \times \mathcal{X}_{\text{seq}} \to \mathbb{R} \end{aligned} $$
(3)
on \(\mathcal {X}_{\text{seq}}\). We primarily consider \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) on the domain of discrete-time sequences \(\mathcal {X}_{\text{seq}}\) since this is how we typically apply it in practice, but the domain of \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) naturally extends to continuous time paths. More generally, the signature kernel is of interest for three primary reasons.
(i)
It inherits attractive theoretical guarantees from stochastic analysis and the properties of the classical signature. These manifest as universality (the ability to approximate non-linear functions), characteristicness (the ability to characterize probability measures), invariance to time-reparametrization, and convergence in the scaling limit when sequences (discrete time) approximate paths (continuous time).
 
(ii)
It overcomes bottlenecks of the signature features, both in terms of computational complexity and in terms of expressiveness, since it allows us to use signatures of paths after they have been lifted to an infinite-dimensional state space. That is, it can indirectly compute and use tensors in infinite-dimensional spaces that describe the underlying sequence.
 
(iii)
It leverages modular tools from kernel learning. Given a kernel \(\operatorname {k}\) with good theoretical properties, these typically transfer to the associated signature kernel \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\), and general tools from kernel learning apply. For example, universality and characteristicness [64] of \(\operatorname {k}\) are preserved for the associated signature kernel \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\). This also implies that \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) induces a computable metric, the MMD metric, between laws of stochastic processes. Further, Gaussian processes can be defined on the genuine non-Euclidean domain \(\mathcal {X}_{\text{seq}}\) by using \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) as a covariance function, generic kernel quadrature constructions can be used, etc.
 
A guiding theme for this chapter is that the iterated integrals which constitute the signature, and hence the signature kernel, play the same role for sequences or paths as monomials do for vectors in \(\mathbb {R}^d\); see Table 1. The key message is that this can be simultaneously a blessing and a curse. Algebraically, polynomials have attractive properties, but they quickly become computationally prohibitive in high dimensions; analytically, trouble arises on non-compact sets, and often a more flexible class than simple monomials forms a better basis. Kernelization allows us to circumvent these drawbacks.
Table 1
Comparison between classical monomials and the signature

|                    | Monomial Map | Path Signature |
|--------------------|--------------|----------------|
| Domain             | \(\mathcal {X} = \mathbb {R}^d\) | \(\mathcal {X}_{\text{1-var}}\) |
| Co-domain          | \(\prod _{m \ge 0} (\mathbb {R}^d)^{\otimes m}\) | \(\prod _{m \ge 0} (\mathbb {R}^d)^{\otimes m}\) |
| Map                | \(\Phi _{\mathsf {Mon}}(\mathbf {x}) = \left (\frac {{\mathbf {x}}^{\otimes m}}{m!}\right )_{m \ge 0}\) | \(\Phi _{\mathsf {Sig}}(\mathbf {x}) = \left ( \int d{\mathbf {x}}^{\otimes m}\right )_{m \geq 0}\) |
| Type of tensors    | symmetric | non-symmetric |
| Continuity         | w.r.t. Euclidean metric | w.r.t. 1-variation metric |
| Factorial decay    | \(\|\Phi _{\mathsf {Mon},m}(\mathbf {x})\| \leq \frac {\|\mathbf {x}\|^m}{m!}\) | \(\|\Phi _{\mathsf {Sig},m}(\mathbf {x})\| \leq \frac {\|\mathbf {x}\|_{1}^m}{m!}\) |
| Scaling            | \(\Phi _{\mathsf {Mon},m}(\lambda \mathbf {x}) = \lambda ^m \Phi _{\mathsf {Mon},m}(\mathbf {x})\) | \(\Phi _{\mathsf {Sig},m}(\lambda \mathbf {x}) = \lambda ^m \Phi _{\mathsf {Sig},m}(\mathbf {x})\) |
| On compact sets    | compact \(\mathcal {K} \subset \mathcal {X}\) | compact \(\mathcal {K} \subset \mathcal {X}_{\text{1-var}}\) |
| Universality       | w.r.t. \(C(\mathcal {K},\mathbb {R})\) | w.r.t. \(C(\bar {\mathcal {K}}, \mathbb {R})\) |
| Characteristicness | w.r.t. \(\mathcal {P}(\mathcal {K})\) | w.r.t. \(\mathcal {P}(\bar {\mathcal {K}})\) |
This chapter focuses on theoretical aspects of the signature kernel. It is complemented by the User’s Guide to KSig [66], which focuses on algorithmic aspects of the signature kernel and GPU-accelerated computations, and documents the Scikit-learn-compatible Python package KSig.2
Notation
We use \([d] := \{1, \ldots , d\}\) to denote finite index sets, and \(\mathbb {R}^{\mathcal {X}} := \{ f: \mathcal {X} \to \mathbb {R}\}\) to denote the space of real-valued functions on \(\mathcal {X}\).

2 Monomials, Polynomials, and Tensors

Classical multivariate monomials and moments are convenient tools for approximating functions on \(\mathbb {R}^d\) and, respectively, representing the law of vector-valued random variables via the moment sequence. Such monomials and moments can be concisely expressed using tensors. Recall from chapter “An Introduction to Tensors for Path Signatures” that the space of tensors of degree m in d dimensions, denoted \((\mathbb {R}^d)^{\otimes m}\), is a \(d^m\)-dimensional vector space. The coordinates of \((\mathbb {R}^d)^{\otimes m}\) are indexed by multi-indices \(I = (i_1, \ldots , i_m)\), where \(i_j \in [d]\). In particular, we can express the degree m monomials of a vector \(\mathbf {x}=({\mathbf {x}}^1,\ldots ,{\mathbf {x}}^d) \in \mathbb {R}^d\) as
$$\displaystyle \begin{aligned} {} {\mathbf{x}}^{\otimes m} \in (\mathbb{R}^d)^{\otimes m}, \end{aligned} $$
(4)
where the coordinate of \({\mathbf {x}}^{\otimes m}\) with respect to the multi-index \(I = (i_1, \ldots , i_m)\) is
$$\displaystyle \begin{aligned} {\mathbf{x}}^{i_1} \cdot {\mathbf{x}}^{i_2} \cdots {\mathbf{x}}^{i_m}. \end{aligned}$$
We define the monomial map \(\Phi _{\mathsf {Mon}}: \mathbb {R}^d \to \prod _{m \ge 0} (\mathbb {R}^d)^{\otimes m}\) and the degree m monomial map \(\Phi _{\mathsf {Mon}, m} : \mathbb {R}^d \to (\mathbb {R}^d)^{\otimes m}\) by
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Mon}}(\mathbf{x}) := \left(1,\mathbf{x},\frac{{\mathbf{x}}^{\otimes 2}}{2!},\frac{{\mathbf{x}}^{\otimes 3}}{3!},\ldots\right) \quad \text{and} \quad \Phi_{\mathsf{Mon},m}(\mathbf{x}):= \frac{{\mathbf{x}}^{\otimes m}}{m!} \end{aligned}$$
respectively. The map \(\Phi _{\mathsf {Mon}}\) takes a vector \(\mathbf {x} = ({\mathbf {x}}^i)_{i \in [d]} \in \mathbb {R}^d\) as an input, and outputs a sequence of tensors of increasing degree. This sequence consists of a scalar \({\mathbf {x}}^{\otimes 0} := 1\), a vector \({\mathbf {x}}^{\otimes 1} =\mathbf {x} \), a matrix \({\mathbf {x}}^{\otimes 2} = ({\mathbf {x}}^{i_1}{\mathbf {x}}^{i_2})_{i_1,i_2 \in [d]}\), a tensor \({\mathbf {x}}^{\otimes 3} = ({\mathbf {x}}^{i_1}{\mathbf {x}}^{i_2}{\mathbf {x}}^{i_3})_{i_1,i_2,i_3 \in [d]}\), and so on. The co-domain of \(\Phi _{\mathsf {Mon}}\) is a graded (in \(m=0,1,2,\ldots \)) linear space,3 hence we can apply a linear functional \(\mathbf {w}\) to \(\Phi _{\mathsf {Mon}}\).
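As a concrete illustration (not part of the chapter), the truncated monomial map can be computed with iterated outer products; the helper name `monomial_map` is ours.

```python
import math
import numpy as np

def monomial_map(x, M):
    """Truncated monomial map: returns [1, x, x^{(2)}/2!, ..., x^{(M)}/M!],
    where the degree-m entry has coordinates x[i1]*...*x[im]/m!."""
    feats = [np.array(1.0)]                   # degree 0: the scalar 1
    power = np.array(1.0)
    for m in range(1, M + 1):
        power = np.multiply.outer(power, x)   # m-fold outer product x^{(m)}
        feats.append(power / math.factorial(m))
    return feats

x = np.array([1.0, 2.0])
phi = monomial_map(x, 3)
```

The degree-2 entry is the symmetric matrix \(({\mathbf {x}}^{i}{\mathbf {x}}^{j}/2!)_{i,j}\), so a linear functional of `phi` truncated at degree M is exactly a polynomial of degree M.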
When we restrict the domain of \(\Phi _{\mathsf {Mon}}\) to a compact subset \(\mathcal {K} \subset \mathbb {R}^d\), the Stone–Weierstrass theorem guarantees that for every \(f \in C(\mathcal {K},\mathbb {R})\) and \(\epsilon >0\), there exists a \(M \ge 1\) and coefficients \({\mathbf {w}}^0, {\mathbf {w}}^1, \ldots , {\mathbf {w}}^d,{\mathbf {w}}^{1,1},\ldots ,{\mathbf {w}}^{d,d},{\mathbf {w}}^{1,1,1},\ldots ,{\mathbf {w}}^{d,\ldots ,d} \in \mathbb {R}\), equivalently represented4 as \(\mathbf {w} \in \bigoplus _{m \geq 0}(\mathbb {R}^d)^{\otimes m}\), such that
$$\displaystyle \begin{aligned} \sup_{\mathbf{x} \in \mathcal{K}} \Bigg| f(\mathbf{x}) - \underbrace{\sum_{m=0}^{M} \sum_{i_1,\ldots,i_m \in [d]} \frac{{\mathbf{w}}^{i_1, \ldots, i_m}\, {\mathbf{x}}^{i_1} \cdots {\mathbf{x}}^{i_m}}{m!}}_{= \langle \mathbf{w}, \Phi_{\mathsf{Mon}}(\mathbf{x}) \rangle} \Bigg| < \epsilon. \end{aligned} $$
Hence, any choice of \(\mathbf {w}\) induces a function
$$\displaystyle \begin{aligned} \langle \mathbf{w}, \Phi_{\mathsf{Mon}}(\cdot)\rangle : \mathcal{K} \to \mathbb{R}, \end{aligned}$$
which is a linear function in terms of \(\Phi _{\mathsf {Mon}}\).
Another well-known result about monomials is that they can characterize laws of random variables; in fact, on compact sets, convergence of the moment sequence is equivalent to classical weak convergence. We summarize both facts in the following proposition.
Proposition 2.1
Let\(\mathcal {K} \subset \mathcal {X} = \mathbb {R}^d\)be compact. Then for every\(f \in C(\mathcal {K},\mathbb {R})\)and\(\epsilon >0\)there exists some\(\mathbf {w} \in \bigoplus _{m \geq 0} (\mathbb {R}^d)^{\otimes m}\)such that
$$\displaystyle \begin{aligned} {} \sup_{\mathbf{x} \in \mathcal{K}}|f(\mathbf{x}) - \langle \mathbf{w}, \Phi_{\mathsf{Mon}}(\mathbf{x}) \rangle| < \epsilon. \end{aligned} $$
(5)
Let\(\mathbf {X},{\mathbf {X}}_1,{\mathbf {X}}_2,\ldots \)be a sequence of\(\mathcal {K}\)-valued random variables. Then
$$\displaystyle \begin{aligned} \operatorname{Law}(\mathbf{X}) \mapsto \mathbb{E}[\Phi_{\mathsf{Mon}}(\mathbf{X})]:= \left(1,\mathbb{E}[\mathbf{X}], \frac{\mathbb{E}[{\mathbf{X}}^{\otimes 2}]}{2!},\ldots \right)\end{aligned}$$
is injective and the following are equivalent:
1.
\(\forall f \in C(\mathcal {K},\mathbb {R}),\quad \lim _{n \to \infty }\mathbb {E}[f({\mathbf {X}}_n)] = \mathbb {E}[f(\mathbf {X})]\);
 
2.
\(\forall m \in \mathbb {N}, \quad \lim _{n \to \infty }\mathbb {E}[{\mathbf {X}}^{\otimes m}_n]= \mathbb {E}[{\mathbf {X}}^{\otimes m}]\).
 
Proof
That (1) implies (2) follows immediately since polynomials are bounded on compacts, hence they form a subset of \(C_b(\mathcal {K},\mathbb {R})\). To see the reverse implication, note that the compactness assumption implies tightness of the laws of \(({\mathbf {X}}_n)\), which allows us to use Prokhorov’s Theorem. This reduces the proof to showing that if \(({\mathbf {X}}_n)\) converges weakly along a subsequence to a random variable \(\mathbf {Y}\), then \(\operatorname {Law}(\mathbf {Y})=\operatorname {Law}(\mathbf {X})\). But if \({\mathbf {X}}_n\) converges weakly to \(\mathbf {Y}\), then \(\lim _n \mathbb {E}[p({\mathbf {X}}_n)]=\mathbb {E}[p(\mathbf {Y})]\) for any polynomial p, since polynomials are bounded and continuous on compacts; on the other hand, (2) gives \(\lim _n \mathbb {E}[p({\mathbf {X}}_n)]=\mathbb {E}[p(\mathbf {X})]\). Hence we must have \(\mathbb {E}[p(\mathbf {X})]=\mathbb {E}[p(\mathbf {Y})]\) for any polynomial p. Finally, since polynomials are dense in \(C(\mathcal {K},\mathbb {R})\) (by the Stone–Weierstrass theorem mentioned above), the result follows. □
The above shows that linear functionals of monomials, that is, linear combinations of the coordinates of \(\Phi _{\mathsf {Mon}}\), approximate continuous functions arbitrarily well. Further, the expected value of the coordinates of \(\Phi _{\mathsf {Mon}}\) characterizes the law of a random variable, and convergence of these coordinates induces the classical weak convergence topology on the space of probability measures.
Remark 2.2 (Universality and Characteristicness)
In the statistical learning literature, these properties of \(\Phi _{\mathsf {Mon}}\) are known as universality of \(\Phi _{\mathsf {Mon}}\) with respect to the class of bounded and continuous functions and characteristicness of \(\Phi _{\mathsf {Mon}}\) with respect to probability measures on \(\mathcal {K}\). The general definition calls a map \(\Phi : \mathcal {K} \to H\) from a set \(\mathcal {K}\) into a Hilbert space H
(i)
universal with respect to a function class \(\mathcal {F} \subset \mathbb {R}^{\mathcal {K}}\) if the space of linear functionals \(\mathcal {F}_0 := \{ \langle \mathbf {w}, \Phi (\cdot )\rangle : \mathcal {K} \rightarrow \mathbb {R} \, : \, \mathbf {w} \in H\}\) is dense in \(\mathcal {F}\);
 
(ii)
characteristic with respect to probability measures \(\mathcal {P}(\mathcal {K})\) if the expectation map \(\mu \mapsto \mathbb {E}_{X \sim \mu }[\Phi (X)]\) that maps a probability measure on \(\mathcal {K}\) into H is injective.
 
In fact, if one uses a more general definition of characteristicness in terms of distributions, that is, elements of the topological dual \(\mathcal {F}'\), it can be shown that characteristicness and universality are equivalent. This makes many proofs easier; see [15, Definition 6] and [64].
The proof of Proposition 2.1 made heavy use of the compactness assumption but this is not a deficiency of the proof. The following example shows that things really go wrong on non-compact spaces.
A Warning: Compactness and Boundedness are Essential
Rather than a compact subset \(\mathcal {K}\), consider the entire space \(\mathcal {X} = \mathbb {R}^d\). In this case, moments may not exist for random variables on \(\mathcal {X}\) with unbounded support but the situation is even worse: even if all moments exist, they fail to characterize the law of the random variable.
This already happens in dimension \(d=1\), as the following classical counterexample shows (Fig. 1). Suppose \(\mathbf {X}\) is a standard log-normal, that is, \(\log (\mathbf {X})\sim N(0,1)\). There exists another real-valued random variable \(\mathbf {Y}\) that is not log-normal,5 so that \(\operatorname {Law}(\mathbf {X}) \neq \operatorname {Law}(\mathbf {Y})\), but for all \(m \in \mathbb {N}\),
$$\displaystyle \begin{aligned} \mathbb{E}[{\mathbf{X}}^m] =\mathbb{E}[{\mathbf{Y}}^m] = \exp\left(\frac{m^2}{2}\right). \end{aligned}$$
Hence, “characteristicness” fails on non-compact sets. Similarly, universality, that is, the uniform approximation of functions \(f(\mathbf {x})\) as in Eq. (5), is hopeless, since polynomials tend to \(\pm \infty \) as \(\|\mathbf {x}\| \rightarrow \infty \).
Fig. 1
The probability density functions of \(\mathbf {X}\) (left) and \(\mathbf {Y}\) (right)
Polynomials are algebraically and analytically well-behaved on compact sets, but the fundamental issue is that while their algebraic properties extend to non-compact sets (they still form a ring), their analytic and probabilistic properties break down because they grow too quickly. We emphasize that “tightness” arguments are not so useful here: for any \(\epsilon >0\) it is easy to find a compact set that carries \((1-\epsilon )\) of the mass of the probability measure; the issue is that the functions we care about, monomials, explode on the part of the space that carries the remaining \(\epsilon \) of the mass. A classical way to deal with this is to restrict to random variables with conditions on moment growth, but this is unsatisfactory for our purposes, since such assumptions quickly become abstract and, in general, cannot be verified in a non-parametric inference setting as is common in machine learning tasks. After all, things already went wrong for a simple log-normal. Finally, a closely related issue is that moments are not robust statistics: if we have to estimate expectations from i.i.d. samples, small errors in these samples can lead to large estimation errors.
We will return to all these issues but for now, we ignore this issue with non-compact sets and ask a much more basic question: What is the natural generalization of monomials from \(\mathcal {X} = \mathbb {R}^d\) to the space \(\mathcal {X}_{\text{seq}}\) of sequences in \(\mathbb {R}^d\)?

3 Iterated Integrals as Non-Commutative Monomials

We will now go beyond the classical setting and consider monomials for sequential data. For simplicity we consider paths and sequences in \(\mathcal {X}=\mathbb {R}^d\) here, but in Sect. 6 we discuss infinite-dimensional state spaces. While we work with discrete sequences in \(\mathcal {X}_{\text{seq}}\) in practice, it is often convenient to formulate our objects for continuous paths. We can naturally view a discrete sequence as a continuous path by piecewise-linear interpolation. Indeed, suppose we have a sequence \(\mathbf {x} = ({\mathbf {x}}_0, \ldots , {\mathbf {x}}_L) \in \mathcal {X}_{\text{seq}}\). By abuse of notation, we define the corresponding continuous path \(\mathbf {x}: [0,L] \rightarrow \mathbb {R}^d\) by
$$\displaystyle \begin{aligned} {} \mathbf{x}(t) = (i-t) {\mathbf{x}}_{i-1} + (t-i+1){\mathbf{x}}_i, \quad \quad t \in [i-1, i]. \end{aligned} $$
(6)
In the continuous setting, we will consider bounded variation paths
$$\displaystyle \begin{aligned} \mathcal{X}_{\text{1-var}}:= \{\mathbf{x}\in C([0,T],\mathcal{X}) \,:\, T>0, \,\, \|\mathbf{x} \|_1 <\infty,\, \mathbf{x}(0)=0\} \end{aligned}$$
where
$$\displaystyle \begin{aligned} \|\mathbf{x}\|_{1}:= \sup_{\substack{0=t_0 < \cdots <t_n=T\\ n \in \mathbb{N}}} \sum_{i=1}^n|\mathbf{x}(t_{i})-\mathbf{x}(t_{i-1})| \end{aligned}$$
denotes the usual 1-variation of a path. We restrict to paths which begin at the origin since the method that we discuss in this article (in particular, the path signature) is translation invariant. Further, we take \(\mathcal {X}=\mathbb {R}^d\) finite-dimensional for simplicity of exposition. There are several ways to put a topology on \(\mathcal {X}_{\text{1-var}}\); we use the same one as in [15]. Finally, we note that Eq. 6 defines an embedding
$$\displaystyle \begin{aligned} \mathcal{X}_{\text{seq}} \hookrightarrow \mathcal{X}_{\text{1-var}}. \end{aligned}$$
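In code, this embedding is just Eq. (6) evaluated segment by segment, and for a piecewise-linear path the 1-variation is simply the sum of increment norms (a minimal sketch; the helper names are ours):

```python
import numpy as np

def interpolate(x_seq):
    """Piecewise-linear path x: [0, L] -> R^d from Eq. (6)."""
    x_seq = np.asarray(x_seq, dtype=float)
    def path(t):
        i = int(np.clip(np.ceil(t), 1, len(x_seq) - 1))   # t lies in [i-1, i]
        return (i - t) * x_seq[i - 1] + (t - i + 1) * x_seq[i]
    return path

def one_variation(x_seq):
    """1-variation of the interpolated path; for piecewise-linear paths the
    supremum over partitions is attained at the sample points."""
    dx = np.diff(np.asarray(x_seq, dtype=float), axis=0)
    return np.linalg.norm(dx, axis=1).sum()

x_seq = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]])
path = interpolate(x_seq)
```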
Chapter “A Primer on the Signature Method in Machine Learning” introduced the path signature,
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Sig}}(\mathbf{x}) := \left(1,\int d\mathbf{x}, \int d{\mathbf{x}}^{\otimes 2}, \ldots \right) \in \prod_{m \ge 0} (\mathbb{R}^d)^{\otimes m} \end{aligned}$$
which maps bounded variation paths to a sequence of tensors. The degree m tensor is given as an m-times iterated integral of the underlying path \(\mathbf {x}: [0,T] \to \mathbb {R}^d\),
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Sig},m}(\mathbf{x}):= \int d\mathbf{x} ^{\otimes m} := \int_{0<t_1<\cdots<t_m<T} d\mathbf{x}(t_1) \otimes \cdots \otimes d\mathbf{x} (t_m) \in (\mathbb{R}^d)^{\otimes m}, \end{aligned}$$
and by convention \((\mathbb {R}^d)^{\otimes 0}=\mathbb {R}\). The upper bound T in the domain of integration is given by the domain of the path \(\mathbf {x}: [0,T] \to \mathbb {R}^d\). Alternatively, in coordinates, we can identify
$$\displaystyle \begin{aligned} {} (\mathbb{R}^d)^{\otimes m}\ni\int d{\mathbf{x}}^{\otimes m} \simeq \left(\int_{0<t_1<\cdots < t_m <T} d{\mathbf{x}}^{i_1}(t_1) \cdots d{\mathbf{x}}^{i_m}(t_m)\right)_{i_1,\ldots,i_m \in [d]} \end{aligned}$$
where
$$\displaystyle \begin{aligned} \int_{0<t_1<\cdots < t_m <T} d{\mathbf{x}}^{i_1}(t_1) \cdots d{\mathbf{x}}^{i_m}(t_m)= \int_{0<t_1<\cdots < t_m <T} \dot{\mathbf{x}}^{i_1}(t_1) \cdots \dot{\mathbf{x}}^{i_m}(t_m)dt_1 \cdots dt_m \end{aligned}$$
is well-defined as Riemann–Stieltjes integral since bounded variation paths are almost everywhere differentiable. For less regular paths, some care has to be taken about the definition of the iterated integrals that constitute the signature, but the same principles apply (see Sect. 9); for paths in infinite-dimensional state spaces the same construction applies and we discuss the case of Hilbert spaces in Sect. 6. Such iterated integrals can be viewed as a generalization of monomials, and here we highlight the similarities.
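For piecewise-linear paths, the first two iterated integrals have a closed form in terms of the increments, which the following numpy sketch computes (the helper name `sig_levels_1_2` is ours, not from the chapter):

```python
import numpy as np

def sig_levels_1_2(x_seq):
    """Exact degree-1 and degree-2 signature tensors of the piecewise-linear
    interpolation of x_seq, from the increments dx_k = x_k - x_{k-1}."""
    dx = np.diff(np.asarray(x_seq, dtype=float), axis=0)
    d = dx.shape[1]
    s1 = dx.sum(axis=0)                    # level 1: ∫ dx = x_L - x_0
    s2 = np.zeros((d, d))
    run = np.zeros(d)                      # x(t) - x_0 just before segment k
    for step in dx:
        # ∫_{t1<t2} dx(t1) ⊗ dx(t2): cross terms plus the within-segment part
        s2 += np.outer(run, step) + 0.5 * np.outer(step, step)
        run += step
    return s1, s2
```

For a straight line \(t \mapsto t \cdot \mathbf {v}\) this returns \((\mathbf {v}, \mathbf {v}^{\otimes 2}/2!)\), while a genuinely two-dimensional path produces a non-symmetric level-2 tensor.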
A Graded Description.
One way to think about the signature map
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Sig}}: \mathcal{X}_{\text{1-var}} \to \prod_{m \ge 0}(\mathbb{R}^d)^{\otimes m},\,\quad \mathbf{x} \mapsto \left(1,\int d\mathbf{x},\int d{\mathbf{x}}^{\otimes 2},\ldots\right),\end{aligned}$$
is that it is the natural generalization of the monomial map
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Mon}}:\mathcal{X} \to \prod_{m \ge 0} (\mathbb{R}^d)^{\otimes m}, \quad \mathbf{x} \mapsto \left(1,\mathbf{x}, \frac{{\mathbf{x}}^{\otimes 2}}{2!},\ldots\right), \end{aligned}$$
from the domain \(\mathcal {X}\) to \(\mathcal {X}_{\text{1-var}}\). Both describe the underlying object (a vector or a path) as a series of tensors of increasing degree.
Signatures Generalize Monomials.
The signature \(\Phi _{\mathsf {Sig}}\) is a strict generalization of \(\Phi _{\mathsf {Mon}}\). For a straight line, \([0,1]\ni t \mapsto t \cdot \mathbf {v}\) with \(\mathbf {v} \in \mathbb {R}^d\), we have
$$\displaystyle \begin{aligned} {} \Phi_{\mathsf{Sig}}(t \mapsto t \cdot \mathbf{v}) = \left(1,\mathbf{v},\frac{{\mathbf{v}}^{\otimes 2}}{2!},\frac{{\mathbf{v}}^{\otimes 3}}{3!},\ldots\right)=\Phi_{\mathsf{Mon}}(\mathbf{v}), \end{aligned} $$
(7)
which follows immediately from the definition of \(\Phi _{\mathsf {Sig}}\) since
$$\displaystyle \begin{aligned} \int d (t\cdot \mathbf{v})^{\otimes m} = \int_{0 < t_1 < \cdots <t_m <1} {\mathbf{v}}^{\otimes m} dt_1 \cdots dt_m = \frac{{\mathbf{v}}^{\otimes m}}{m!}.\end{aligned}$$
Despite the similarity in their definitions, there are also important differences to highlight.
A Genuine Non-linear Domain.
Unlike \(\mathbb {R}^d\), the domain \(\mathcal {X}_{\text{1-var}}\) of \(\Phi _{\mathsf {Sig}}\) is not linear: there is no natural addition of elements in \(\mathcal {X}_{\text{1-var}}\). In particular, there is no natural way to add paths \(\mathbf {x}:[0,T]\to \mathbb {R}^d\) and \(\mathbf {y}:[0,T'] \to \mathbb {R}^d\) for \(T \neq T'\).
Asymmetric Tensors.
There is an important difference in the co-domains of the maps \(\Phi _{\mathsf {Sig}}\) and \(\Phi _{\mathsf {Mon}}\). The latter takes values in the subset of \(\prod _{m \ge 0} (\mathbb {R}^d)^{\otimes m}\) of symmetric tensors. For example, if we identify the degree \(m=2\) tensors in \(\Phi _{\mathsf {Mon}}\) as \(d\times d\) matrices, then \({\mathbf {x}}^{\otimes 2}\) is a symmetric matrix, since the \((i,j)\) coordinate of \({\mathbf {x}}^{\otimes 2}\) is simply the usual product of the i-th and j-th coordinate of \(\mathbf {x}\), and \({\mathbf {x}}^i \cdot {\mathbf {x}}^j = {\mathbf {x}}^j \cdot {\mathbf {x}}^i\).
However, in general the tensors in \(\Phi _{\mathsf {Sig}}\) are not symmetric. For example, by once again considering the \((i,j)\) and \((j,i)\) coordinate of \(\int d{\mathbf {x}}^{\otimes 2}\), we have in general that
$$\displaystyle \begin{aligned} \int_{0<t_1<t_2<T} d{\mathbf{x}}^i(t_1) d{\mathbf{x}}^j(t_2) \neq \int_{0<t_1<t_2<T} d{\mathbf{x}}^j(t_1) d{\mathbf{x}}^i(t_2). \end{aligned}$$
Non-commutative Monomials.
Because \(\Phi _{\mathsf {Sig}}\) results in non-symmetric tensors, the algebraic structure behind these monomials is non-commutative. Given two tensors \(\mathbf {v} \in (\mathbb {R}^d)^{\otimes m}\) and \(\mathbf {w} \in (\mathbb {R}^d)^{\otimes k}\), the tensor product \(\mathbf {v} \otimes \mathbf {w} \in (\mathbb {R}^d)^{\otimes (m+k)}\) is defined coordinate-wise as
$$\displaystyle \begin{aligned} (\mathbf{v} \otimes \mathbf{w})^{i_1, \ldots, i_{m+k}} := {\mathbf{v}}^{i_1, \ldots, i_m} \cdot {\mathbf{w}}^{i_{m+1}, \ldots, i_{m+k}}. \end{aligned}$$
In general, the tensor product is non-commutative, \(\mathbf {v} \otimes \mathbf {w} \neq \mathbf {w} \otimes \mathbf {v}\). However, when we are restricted to symmetric tensors such as in \(\Phi _{\mathsf {Mon}}\), the tensor product is commutative
$$\displaystyle \begin{aligned} ({\mathbf{x}}^{\otimes m} \otimes {\mathbf{x}}^{\otimes k})^{i_1, \ldots, i_{m+k}} = {\mathbf{x}}^{i_1} \cdots {\mathbf{x}}^{i_{m+k}} = ({\mathbf{x}}^{\otimes k} \otimes {\mathbf{x}}^{\otimes m})^{i_1, \ldots, i_{m+k}}. \end{aligned}$$
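Both facts are easy to check numerically with `np.multiply.outer` (illustration only):

```python
import numpy as np

v = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])
vw = np.multiply.outer(v, w)   # (v ⊗ w)^{i,j} = v^i * w^j
wv = np.multiply.outer(w, v)
# the tensor product is non-commutative in general ...
assert not np.allclose(vw, wv)
# ... but commutes on the symmetric tensors produced by Φ_Mon
x = np.array([1.0, 2.0])
x2 = np.multiply.outer(x, x)
assert np.allclose(np.multiply.outer(x2, x), np.multiply.outer(x, x2))
```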
Remark 3.1
We refer to \(\Phi _{\mathsf {Sig}}\) as the signature (feature) map, but developing a path into a series of iterated integrals is a classical construction that has been studied by many communities under different names, such as Chen–Fliess series, Magnus expansion, time-ordered exponential, path exponential, and chronological calculus. Modern origins go back to Chen’s work in [10–12], which can be thought of as an extension of classical Volterra series [61]. The rich Lie-algebraic properties of the signature received much attention in the study of free Lie algebras [55], and in the 1980s control theorists used \(\Phi _{\mathsf {Sig}}\) extensively to linearize controlled differential equations [6, 27]. Many of these results were then generalized in rough path theory [49] to deal with “rough” trajectories, such as the sample paths of stochastic processes, where standard integration does not apply. A detailed history is beyond the scope of this article, but the above references point to a rich literature of disparate communities which have used (variations of) what we denote \(\Phi _{\mathsf {Sig}}\) for their own research agendas.

4 Universality and Characteristicness of the Signature

In order to further establish the relationship between the monomials \(\Phi _{\mathsf {Mon}}:\mathbb {R}^d \to \prod _{m \ge 0} (\mathbb {R}^d)^{\otimes m}\) and the signature \(\Phi _{\mathsf {Sig}}:\mathcal {X}_{\text{1-var}} \to \prod _{m \ge 0}(\mathbb {R}^d)^{\otimes m}\), we now consider how the signature can be used to study functions and measures on \(\mathcal {X}_{\text{1-var}}\). In particular, given some \(\mathbf {w} \in \bigoplus _{m \geq 0} (\mathbb {R}^d)^{\otimes m}\), we define a function \(\langle \mathbf {w}, \Phi _{\mathsf {Sig}}(\cdot )\rangle : \mathcal {X}_{\text{1-var}} \to \mathbb {R}\) as before; in coordinates,
$$\displaystyle \begin{aligned} \langle \mathbf{w},\Phi_{\mathsf{Sig}}(\mathbf{x}) \rangle := \sum_{m=0}^M \sum_{i_1,\ldots,i_m \in [d]} {\mathbf{w}}^{i_1,\ldots,i_m} \int_{0<t_1<\cdots <t_m<T} d{\mathbf{x}}^{i_1}(t_1)\cdots d{\mathbf{x}}^{i_m}(t_m).\end{aligned}$$
A direct calculation shows that the product of two such linear functionals can be expressed again as a linear functional of \(\Phi _{\mathsf {Sig}}\). More formally, for \({\mathbf {w}}_1,{\mathbf {w}}_2\in \bigoplus _{m \geq 0} (\mathbb {R}^d)^{\otimes m}\) there exists some \({\mathbf {w}}_{1,2} \in \bigoplus _{m \geq 0} (\mathbb {R}^d)^{\otimes m}\) such that
$$\displaystyle \begin{aligned} {} \langle {\mathbf{w}}_1, \Phi_{\mathsf{Sig}}(\mathbf{x}) \rangle \langle {\mathbf{w}}_2,\Phi_{\mathsf{Sig}}(\mathbf{x}) \rangle = \langle {\mathbf{w}}_{1,2}, \Phi_{\mathsf{Sig}}(\mathbf{x}) \rangle \end{aligned} $$
(8)
for all \(\mathbf {x} \in \mathcal {X}_{\text{1-var}}\). Put differently, products of (non-commutative) polynomials are again (non-commutative) polynomials. The proof is a direct application of integration by parts; the algebraic (shuffle) structure of \({\mathbf {w}}_{1,2}\) gives rise to many interesting algebraic questions, but these are beyond the scope of this short introduction. The primary point is that (8) shows that the linear functionals of paths induced by \(\mathbf {w} \in \bigoplus _{m \geq 0} (\mathbb {R}^d)^{\otimes m}\) form an algebra. If we additionally have the point-separating property, that is, if the map
$$\displaystyle \begin{aligned} \mathbf{x} \mapsto \Phi_{\mathsf{Sig}}(\mathbf{x}) \end{aligned}$$
is injective, then we can immediately generalize Proposition 2.1 from compact subsets in \(\mathbb {R}^d\) to compact subsets in \(\mathcal {X}_{\text{1-var}}\). However, in general, \(\mathbf {x} \mapsto \Phi _{\mathsf {Sig}}(\mathbf {x})\) is only injective up to tree-like equivalence. We refer to [34] for details on tree-like equivalence.
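The simplest instance of Eq. (8): for degree-one functionals \({\mathbf {w}}_1 = e_i\) and \({\mathbf {w}}_2 = e_j\), integration by parts gives \({\mathbf {w}}_{1,2} = e_i \otimes e_j + e_j \otimes e_i\), so the product of two level-1 signature entries equals the symmetrized level-2 entry. A numerical check on a piecewise-linear path (the helper recomputes signature levels 1 and 2 from the increments; illustration only, not from the chapter):

```python
import numpy as np

def sig_levels_1_2(x_seq):
    """Levels 1 and 2 of the signature of the piecewise-linear path."""
    dx = np.diff(np.asarray(x_seq, dtype=float), axis=0)
    s1, s2 = dx.sum(axis=0), np.zeros((dx.shape[1],) * 2)
    run = np.zeros(dx.shape[1])
    for step in dx:
        s2 += np.outer(run, step) + 0.5 * np.outer(step, step)
        run += step
    return s1, s2

rng = np.random.default_rng(0)
s1, s2 = sig_levels_1_2(rng.standard_normal((10, 3)))
# shuffle identity: <e_i, S> <e_j, S> = <e_i ⊗ e_j + e_j ⊗ e_i, S>
assert np.allclose(np.outer(s1, s1), s2 + s2.T)
```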
We can work around this tree-like equivalence in the following Proposition 4.1 by introducing a time coordinate: rather than using \(t \mapsto \mathbf {x}(t) \in \mathbb {R}^d\), we consider
$$\displaystyle \begin{aligned} {} t \mapsto \bar{\mathbf{x}} (t) := (t,\mathbf{x}(t))\in \mathbb{R}^{d+1}. \end{aligned} $$
(9)
When we consider such time-parametrized paths, the signature \(\Phi _{\mathsf {Sig}}\) is injective, and we obtain the following generalization of Proposition 2.1.
Proposition 4.1
Let\(\mathcal {K} \subset \mathcal {X}_{\mathit{\text{1-var}}}\)be a compact set and denote with\(\bar {\mathcal {K}}=\{\bar {\mathbf {x}}: \mathbf {x} \in \mathcal {K}\}\)the set of paths in\(\mathcal {K}\)augmented with a time coordinate from Eq.9. Then for every\(f\in C(\bar {\mathcal {K}},\mathbb {R})\)and\(\epsilon >0\)there exists a\(\mathbf {w} \in \bigoplus (\mathbb {R}^{d+1})^{\otimes m}\)such that
$$\displaystyle \begin{aligned} \sup_{\mathbf{x} \in \mathcal{K}} | f(\bar{\mathbf{x}}) - \langle \mathbf{w}, \Phi_{\mathsf{Sig}}(\bar{\mathbf{x}}) \rangle | < \epsilon. \end{aligned}$$
Let\(\mathbf {X},{\mathbf {X}}_1,{\mathbf {X}}_2,\ldots \)be a sequence of\(\mathcal {K}\)-valued random variables, that is, each\({\mathbf {X}}_i=({\mathbf {X}}_i(t))_{t \in [0,T_i]}\)is a stochastic process. Then the expected signature map
$$\displaystyle \begin{aligned} \operatorname{Law}(\bar{\mathbf{X}}) \mapsto \mathbb{E}[\Phi_{\mathsf{Sig}}(\bar{\mathbf{X}})] \end{aligned}$$
is injective and the following are equivalent:
1.
\(\forall f \in C(\bar {\mathcal {K}},\mathbb {R}),\quad \lim _{n \to \infty }\mathbb {E}[f( \bar {\mathbf {X}}_n)] = \mathbb {E}[f(\bar {\mathbf {X}})]\),
 
2.
\(\forall m \in \mathbb {N}, \quad \lim _{n \to \infty }\mathbb {E}[\int d \bar {\mathbf {X}}^{\otimes m}_n]= \mathbb {E}[\int d\bar {\mathbf {X}}^{\otimes m}]\).
 
This proposition is a direct generalization of Proposition 2.1; in fact, it can be proved in the same way. The only difference is that one needs to show that \(\mathbf {x} \mapsto \Phi _{\mathsf {Sig}}(\bar {\mathbf {x}})\) is injective. However, this is not too difficult, see for example [47]. Without the augmentation with a time coordinate, the same statement holds when \(\mathcal {K}\) is replaced by the set of tree-like equivalence classes of paths.6 With all of these properties in mind, we can see that the signature map behaves very similarly to classical monomials, and we summarize these results in Table 1.

5 The Good, the Bad, and the Ugly

While classical monomials have many attractive properties, they also come with some limitations that are inherited by the “non-commutative monomials”, i.e. the iterated integrals, that constitute \(\Phi _{\mathsf {Sig}}\).
Good.
Monomials are building blocks to construct more complicated functions and have a well-understood algebraic structure. For example, linear combinations of monomials (i.e. polynomials) are closed under multiplication and form a ring, and any continuous function with compact support can be uniformly approximated by polynomials.
Bad.
  • Monomials and hence also polynomials fail miserably on non-compact domains; they are great at providing local approximations but explode on unbounded domains.
  • On unbounded domains, the moment sequence may not exist. But even if it does, that is, if all moments exist, the moment sequence does not in general uniquely characterize the law of a random variable.
  • Computational complexity: The tensor \({\mathbf {x}}^{\otimes m} \in (\mathbb {R}^d)^{\otimes m}\) has \(\binom {d+m-1}{m} =O(d^m)\) coordinates, so the computational cost quickly becomes prohibitive even for moderate d; this makes regression \(f(\mathbf {x}) \approx \langle \mathbf {w}, \Phi _{\mathsf {Mon}}(\mathbf {x}) \rangle \) and working with moments \(\mathbb {E}[{\mathbf {X}}^{\otimes m}]\) too expensive.
  • Non-robustness: Small changes in the underlying probability measure can lead to large changes in the moment sequence. This becomes an issue for estimation; e.g. one outlier among the sample can completely destroy moment estimates.
Ugly.
While monomials can theoretically approximate functions and capture probability distributions, they can be a poor choice in practice. In data science, arguably the most successful approach is to learn the approximation from the data itself. In practice, this means providing a large parameterized family of functions and optimizing over the parameters when fitting to data, and standard monomials are rarely the optimal choice for such a family.
Because \(\Phi _{\mathsf {Sig}}\) generalizes \(\Phi _{\mathsf {Mon}}\), it shares the same issues. However, all these issues are well-known and modern statistics and machine learning have developed tools to address these shortcomings. The remainder of this article will focus on the following two tools and develop them for \(\Phi _{\mathsf {Sig}}\).
  • Kernelization allows us to simultaneously consider a rich set of non-linearities while avoiding the combinatorial explosion in the computation of monomials. This is done by sidestepping the computation of individual monomials by obtaining efficient algorithms to directly compute the inner product between monomials.
  • Robust Statistics was developed in the 1980s [40] to provide both a formalism to quantify the robustness of a given statistic and tools to make a given statistic robust in this precise mathematical sense. We use this to simultaneously resolve the issues with \(\Phi _{\mathsf {Mon}}\) and \(\Phi _{\mathsf {Sig}}\) on non-compact sets and deal with their general non-robustness.
The problems of extending kernelization and robustness methods to the path signature \(\Phi _{\mathsf {Sig}}\) address independent issues although they can be combined. We discuss kernelization in Sects. 6 and 7; robustification is discussed in Sect. 8.

6 Kernel Learning

We briefly recall the main ideas behind kernelization and we refer to [17, 62] for a detailed introduction to kernel learning.
Kernelization in a Nutshell
Given a set \(\mathcal {K}\), we want a feature map \(\varphi : \mathcal {K} \to H\) that embeds elements of \(\mathcal {K}\) into a Hilbert space H and provides sufficient non-linearities so that, firstly, linear functionals of \(\varphi \), viewed as maps \(\mathbf {x} \mapsto \langle \mathbf {w}, \varphi (\mathbf {x}) \rangle \), are expressive enough to approximate a rich set of functions and, secondly, \(\mathbb {E}[\varphi (\mathbf {X})]\) captures the law of a \(\mathcal {K}\)-valued random variable \(\mathbf {X}\). The examples we encountered are \(\mathcal {K} \subset \mathcal {X} = \mathbb {R}^d\) with \(\varphi =\Phi _{\mathsf {Mon}}\) as the feature map and \(\mathcal {K} \subset \mathcal {X}_{\text{1-var}}\) with \(\varphi =\Phi _{\mathsf {Sig}}\) as the feature map. Even for \(\mathcal {K}=\mathbb {R}\) there are many choices of \(\varphi \) and H, so let us take a step back and simply ask what could be a good candidate for the feature space H.
We want H to be linear, large, and canonical with respect to the underlying data space \(\mathcal {K}\). A canonical linear space that we can associate with any set \(\mathcal {K}\) is the space of real-valued functions \(\mathbb {R}^{\mathcal {K}} := \{f : \mathcal {K} \to \mathbb {R}\}\) on \(\mathcal {K}\). Hence, we take H to be a subset of \(\mathbb {R}^{\mathcal {K}}\), that is, H will be a space of real-valued functions on \(\mathcal {K}\). Consequently the feature map \(\varphi \) will be function-valued,
$$\displaystyle \begin{aligned} {} \mathcal{X}\ni\mathbf{x} \mapsto \varphi(\mathbf{x}) := \operatorname{k}(\mathbf{x},\cdot) \in H\subset \mathbb{R}^{\mathcal{K}}. \end{aligned} $$
(10)
While this has the potential to provide a rich set of “non-linearities” by embedding \(\mathcal {K}\) in the (generically infinite-dimensional) space H, the drawback is that learning algorithms require the evaluation of \(\varphi (\mathbf {x})=\operatorname {k}(\mathbf {x},\cdot )\) which makes the direct use of (10) infeasible. Kernel learning rests on three observations.
1.
If we assume that \(\operatorname {k}\) is positive definite, we can identify H as the Reproducing Kernel Hilbert Space (RKHS) induced by \(\operatorname {k}\), that is H is the closure of
$$\displaystyle \begin{aligned} \left\{\mathbf{x} \mapsto \sum_{i=1}^n a_i \operatorname{k}({\mathbf{x}}_i,\mathbf{x})\, :\, a_i \in \mathbb{R},\, {\mathbf{x}}_i \in \mathcal{K},\, n \in \mathbb{N}\right\} \subset \mathbb{R}^{\mathcal{K}}. \end{aligned}$$
This induces the reproducing property, where \(\langle f, \operatorname {k}(\mathbf {x},\cdot ) \rangle _H = f(\mathbf {x})\) for \(f \in H\). As a consequence of the reproducing property, we have \(\langle \varphi (\mathbf {x}), \varphi (\mathbf {y}) \rangle _H = \operatorname {k}(\mathbf {x},\mathbf {y}) \). Hence, we may think of \(\operatorname {k}\) as a “nonlinear inner product”.
 
2.
The evaluation of \(\operatorname {k}(\mathbf {x},\mathbf {y})\) can often be made computationally cheap even though it is the inner product of elements \(\varphi (\mathbf {x})\) and \(\varphi (\mathbf {y})\) in (the often infinite-dimensional) H. For example, arguably the most popular kernel is the radial basis function (RBF) kernel, \(\operatorname {k}(\mathbf {x},\mathbf {y})=\exp (-\sigma |\mathbf {x}-\mathbf {y}|^2)\), which is the inner product of infinite-dimensional features \(\varphi (\mathbf {x})\) and \(\varphi (\mathbf {y})\). That the inner product of feature maps \(\varphi \) is computationally cheap is often referred to as the “kernel trick”.
 
3.
Many learning algorithms can be reformulated so that they do not require the evaluation of \(\varphi (\mathbf {x})\) but only the computation of the inner product \(\operatorname {k}(\mathbf {x},\mathbf {y})=\langle \varphi (\mathbf {x}),\varphi (\mathbf {y})\rangle \).
 
To summarize, for a given dataset \(\mathcal {D}=\{{\mathbf {x}}_1,\ldots ,{\mathbf {x}}_N\} \subset \mathcal {X}\), kernelization allows us to work with highly non-linear feature maps \(\varphi (\mathbf {x})\) taking values in an infinite dimensional space H by only evaluating the Gram matrix \((\operatorname {k}({\mathbf {x}}_i,{\mathbf {x}}_j))_{i,j \in [N]}\).
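To make the third observation concrete, the following is a minimal kernel ridge regression sketch that touches the data only through the Gram matrix; the RBF static kernel, the toy one-dimensional data, and the names `krr_fit`/`krr_predict` with parameters `sigma` and `lam` are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    # "nonlinear inner product": k(x, y) = <phi(x), phi(y)>_H,
    # where phi itself is never materialized
    return float(np.exp(-sigma * np.sum((x - y) ** 2)))

def krr_fit(X, y, lam=1e-3):
    # only the Gram matrix (k(x_i, x_j))_{i,j in [N]} enters the solver
    G = np.array([[rbf(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def krr_predict(X, alpha, x_new):
    # f(x) = sum_i alpha_i k(x_i, x) is an element of the RKHS H
    return sum(a * rbf(xi, x_new) for a, xi in zip(alpha, X))

X = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
y = np.array([0.0, 1.0, 4.0])          # samples of x -> x**2
alpha = krr_fit(X, y)
```

The fitted function is represented purely by the coefficients `alpha` and the kernel evaluations, never by an explicit feature vector.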
Remark 6.1
Kernelization leverages duality for a complexity trade-off: while we do not have to worry about the dimension of H in the computation of kernels, we must compute the entire Gram matrix, which scales quadratically in the number of samples N. This is a direct consequence of the duality that underlies kernelization.
To gain some intuition we revisit and kernelize the classical monomial feature map from Sect. 2,
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Mon}}: \mathbb{R}^d \to \prod_{m \ge 0} (\mathbb{R}^d)^{\otimes m}, \quad \mathbf{x} \mapsto \left(1,\mathbf{x}, \frac{{\mathbf{x}}^{\otimes 2}}{2},\ldots\right). \end{aligned}$$
Further, we denote with
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Mon}, :M} : \mathbb{R}^d \to \prod_{m \ge 0} (\mathbb{R}^d)^{\otimes m}, \quad \mathbf{x} \mapsto \left(1,\mathbf{x}, \frac{{\mathbf{x}}^{\otimes 2}}{2},\ldots,\frac{{\mathbf{x}}^{\otimes M}}{M!},0,0,\ldots\right), \end{aligned}$$
the same map, but truncated at level M; here \(0 \in (\mathbb {R}^d)^{\otimes m}\) denotes the zero tensor at each degree \(m > M\). The co-domain of \(\Phi _{\mathsf {Mon}}\) and \(\Phi _{\mathsf {Mon}, :M}\) is the same, but the latter has only finitely many non-zero entries. We refer to \(\Phi _{\mathsf {Mon}, :M}\) as truncated at M.
Inner Product of Tensors
There is a natural inner product on \((\mathbb {R}^d)^{\otimes m}\) by setting
$$\displaystyle \begin{aligned} \langle {\mathbf{s}}_m,{\mathbf{t}}_m \rangle_{m}:= \sum_{i_1, \ldots, i_m =1}^d {\mathbf{s}}^{i_1, \ldots, i_m}_m \cdot {\mathbf{t}}^{i_1, \ldots, i_m}_m, \end{aligned}$$
where \({\mathbf {s}}_m, {\mathbf {t}}_m \in (\mathbb {R}^d)^{\otimes m}\). This inner product can be extended to the subset of \(\prod _{m \ge 0} (\mathbb {R}^d)^{\otimes m}\) as
$$\displaystyle \begin{aligned} \langle \mathbf{s}, \mathbf{t} \rangle := \sum_{m=0}^\infty \langle {\mathbf{s}}_m, {\mathbf{t}}_m\rangle_{m}, \end{aligned}$$
whenever the above sum converges, where \(\mathbf {s} = ({\mathbf {s}}_0, {\mathbf {s}}_1, \ldots ), \mathbf {t} = ({\mathbf {t}}_0, {\mathbf {t}}_1, \ldots ) \in \prod _{m \ge 0} (\mathbb {R}^d)^{\otimes m}\). Hence, if we denote by \({\mathbf {e}}_i \in \mathbb {R}^d\) the vector that is all zeros except for a 1 at entry i, then an orthonormal basis of \(\prod _{m \geq 0} (\mathbb {R}^d)^{\otimes m}\) is given by
$$\displaystyle \begin{aligned} \{{\mathbf{e}}_{i_1} \otimes \cdots \otimes {\mathbf{e}}_{i_m} \, :\, i_1,\ldots,i_m \in [d],\, m \in \mathbb{N}\}. \end{aligned}$$
The following is an elementary but key property that we use throughout: for all \({\mathbf {x}}_1,\ldots ,{\mathbf {x}}_m,{\mathbf {y}}_1,\ldots ,{\mathbf {y}}_m \in \mathbb {R}^d\), we have
$$\displaystyle \begin{aligned} {} \langle {\mathbf{x}}_1 \otimes \cdots \otimes {\mathbf{x}}_m, {\mathbf{y}}_1 \otimes \cdots \otimes {\mathbf{y}}_m \rangle = \langle {\mathbf{x}}_1,{\mathbf{y}}_1\rangle \cdots \langle {\mathbf{x}}_m,{\mathbf{y}}_m \rangle \end{aligned} $$
(11)
This already hints at how we can reduce the computational complexity. A naive evaluation of the left-hand side requires us to store the \(d^m\) coordinates of the tensors \({\mathbf {x}}_1 \otimes \cdots \otimes {\mathbf {x}}_m\) and \({\mathbf {y}}_1 \otimes \cdots \otimes {\mathbf {y}}_m \) and then compute the inner product, hence requires \(O(d^m)\) memory. In contrast, evaluating the right hand side requires just \(O(dm)\) memory.
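The gap between the two sides of (11) is easy to verify numerically; the following sketch, with hypothetical choices \(d=4\) and \(m=3\), builds the rank-one tensors explicitly and compares the result with the product of Euclidean inner products.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3                      # illustrative dimension and tensor degree
xs = [rng.standard_normal(d) for _ in range(m)]
ys = [rng.standard_normal(d) for _ in range(m)]

def rank_one(vs):
    # build v_1 (x) ... (x) v_m explicitly: O(d^m) memory
    t = vs[0]
    for v in vs[1:]:
        t = np.tensordot(t, v, axes=0)   # outer product
    return t

lhs = float(np.sum(rank_one(xs) * rank_one(ys)))       # left-hand side of (11)
rhs = float(np.prod([x @ y for x, y in zip(xs, ys)]))  # right-hand side: O(dm)
```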
Example 6.2 (Avoiding the Computational Complexity)
Let \(\mathcal {X} \subset \mathbb {R}^d\) and define
$$\displaystyle \begin{aligned} \operatorname{k}_{\mathsf{Mon}, :M}^0:\mathcal{X} \times \mathcal{X} \to \mathbb{R}, \quad \operatorname{k}_{\mathsf{Mon}, :M}^0(\mathbf{x},\mathbf{y}) := \langle \Phi_{\mathsf{Mon}, :M}(\mathbf{x}), \Phi_{\mathsf{Mon}, :M}(\mathbf{y})\rangle.\end{aligned}$$
We use the superscript 0 to indicate that this is the “0”-th version, which we improve in the next example. For now, note that using (11) we get
$$\displaystyle \begin{aligned} {} \operatorname{k}_{\mathsf{Mon}, :M}^0(\mathbf{x},\mathbf{y})=\sum_{m=0}^M \frac{\langle {\mathbf{x}}^{\otimes m}, {\mathbf{y}}^{\otimes m}\rangle}{(m!)^2} = \sum_{m=0}^M \frac{\langle \mathbf{x}, \mathbf{y}\rangle^m}{(m!)^2}, \end{aligned} $$
(12)
which already shows that the computational cost is linear in d. We can make this even more efficient by recalling Horner’s scheme: the formula (12) is a polynomial in \(\langle \mathbf {x}, \mathbf {y}\rangle \) which can be efficiently computed as
$$\displaystyle \begin{aligned} {} k_{\mathsf{Mon}, :M}^0(\mathbf{x},\mathbf{y}) = 1 + \langle \mathbf{x}, \mathbf{y}\rangle \left( 1 + \frac{\langle \mathbf{x}, \mathbf{y}\rangle}{2^2}\left(1 + \ldots \frac{\langle \mathbf{x}, \mathbf{y}\rangle}{(M-1)^2}\left( 1 + \frac{\langle \mathbf{x},\mathbf{y}\rangle}{M^2}\right)\right)\right). \end{aligned} $$
(13)
This “Horner-type” evaluation has three advantages:
1.
it requires only M additions and M multiplications in contrast to the naive evaluation of (12);
 
2.
it is numerically stable, again in contrast to (12), which adds powers of scalars that live on different scales and thus makes floating point arithmetic challenging; and
 
3.
it allows us to replace the inner product \(\langle \mathbf {x}, \mathbf {y} \rangle \) with the kernel \(\operatorname {k}(\mathbf {x},\mathbf {y})\) which we describe in the example below.
 
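The advantages above can be illustrated with a short sketch comparing a direct evaluation of (12) with the nested evaluation (13); the function names and the test vectors are illustrative choices.

```python
import math
import numpy as np

def k_mon_naive(x, y, M):
    # direct evaluation of Eq. (12): sum_m <x,y>^m / (m!)^2
    s = float(x @ y)
    return sum(s ** m / math.factorial(m) ** 2 for m in range(M + 1))

def k_mon_horner(x, y, M):
    # Horner-type evaluation of Eq. (13): M additions and M multiplications
    s = float(x @ y)
    val = 1.0
    for m in range(M, 0, -1):
        val = 1.0 + (s / m ** 2) * val
    return val

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
```

Both evaluations agree to machine precision, but the Horner form never raises \(\langle \mathbf {x},\mathbf {y}\rangle \) to a high power directly.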
Example 6.3 (Composing Non-Linearities)
We now have an efficient way to evaluate \(\operatorname {k}_{\mathsf {Mon}, :M}^0\) but it is well-known that this kernel does not perform well in practice [62]. The reason is the aforementioned fact that although monomials have nice theoretical properties, they suffer from a lack of expressivity; in practice they are outperformed by other kernels on \(\mathbb {R}^d\) that are inner products of other non-linearities, e.g. the RBF kernel.
A natural way to address this is to first apply a non-linear map
$$\displaystyle \begin{aligned} \varphi: \mathbb{R}^d \to H\end{aligned}$$
and only subsequently compute the monomials in H, that is we consider
$$\displaystyle \begin{aligned} {} \Phi_{\mathsf{Mon}, :M}\left( \varphi(\mathbf{x}) \right)\equiv \left(1,\varphi(\mathbf{x}),\frac{\varphi(\mathbf{x})^{\otimes 2}}{2!}, \ldots, \frac{\varphi(\mathbf{x})^{\otimes M}}{M!},0,0,\ldots\right). \end{aligned} $$
(14)
This preserves the algebraic structure but allows for a far more expressive feature map since \(\varphi \) adds (infinitely many) additional non-linearities. If we take this underlying map to be the identity, \(\varphi = \operatorname {id}\), then (14) coincides with (4); but in general we can use arbitrary non-linearities \(\varphi \). Unfortunately, a direct use of the map \(\Phi _{\mathsf {Mon}}\circ \varphi \) is impossible in practice since we cannot directly evaluate \(\varphi \) when H is infinite-dimensional, even for \(M=2\). The key remark now is that if we take \(\varphi (\mathbf {x}) = \operatorname {k}(\mathbf {x},\cdot )\) to be the feature map of a kernel \(\operatorname {k}:\mathbb {R}^d \times \mathbb {R}^d \to \mathbb {R}\) that has H as its RKHS, then the inner product
$$\displaystyle \begin{aligned} \operatorname{k}_{\mathsf{Mon}, :M}(\mathbf{x},\mathbf{y}) &:= \langle \Phi_{\mathsf{Mon}, :M} \circ \varphi(\mathbf{x}), \Phi_{\mathsf{Mon}, :M} \circ \varphi(\mathbf{y}) \rangle \\ &=\langle \Phi_{\mathsf{Mon}, :M} \circ \operatorname{k}(\mathbf{x},\cdot), \Phi_{\mathsf{Mon}, :M} \circ \operatorname{k}(\mathbf{y},\cdot) \rangle \end{aligned} $$
(15)
can be computed exactly! In fact, it just amounts to replacing the Euclidean inner product \(\langle \mathbf {x}, \mathbf {y} \rangle \) in Eqs. (12) and (13) with \(\operatorname {k}(\mathbf {x},\mathbf {y})\). Note the kernel \(\operatorname {k}_{\mathsf {Mon}, :M}\) depends on M and \(\varphi \); the optimal selection of these is important and we discuss this further in Sect. 7.3.
While the calculation itself is elementary, it is remarkable that this can be done despite the genuine infinite-dimensional co-domain of \(\Phi _{\mathsf {Mon}, :M} \circ \varphi \); such a trade-off of losing direct access to the feature map but being able to compute an inner product is often called a kernel trick. Thus the kernel \(\operatorname {k}_{\mathsf {Mon}, :M}\) inherits the nice algebraic properties of monomials and also the power to access an infinite-dimensional H via \(\operatorname {k}\).
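As a sketch of this substitution, the following reuses the Horner scheme of (13) with the Euclidean inner product replaced by a static kernel, as in Eq. (15); the RBF choice and the name `k_mon_lifted` are illustrative assumptions.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return float(np.exp(-sigma * np.sum((x - y) ** 2)))

def k_mon_lifted(x, y, M, static_kernel=rbf):
    # Eq. (15): monomials of phi(x) = k(x, .) in the RKHS H; only the
    # static kernel value k(x, y) = <phi(x), phi(y)>_H is ever computed.
    s = static_kernel(x, y)
    val = 1.0
    for m in range(M, 0, -1):      # Horner scheme from Eq. (13)
        val = 1.0 + (s / m ** 2) * val
    return val

x, y = np.array([0.0, 0.0]), np.array([1.0, 1.0])
```

Swapping `static_kernel` changes the (implicit) feature space H while the algebraic structure of the monomials is untouched.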

7 The Signature Kernel

We now apply the same kernelization approach as in Sect. 6 to the signature feature map \(\Phi _{\mathsf {Sig}}\) in order to obtain a computable kernel equipped with further non-linearities. We begin by considering a kernel
$$\displaystyle \begin{aligned} \operatorname{k} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}, \end{aligned} $$
(16)
on the state space of our paths; henceforth we refer to \(\operatorname {k}\) as the static kernel. Instead of lifting elements of \(\mathbb {R}^d\) into the RKHS H of \(\operatorname {k}\), we use it to map the path \(\mathbf {x}\),
$$\displaystyle \begin{aligned} t \mapsto \mathbf{x}(t) \in \mathbb{R}^d, \end{aligned}$$
into a path \(\operatorname {k}_{\mathbf {x}}\),
$$\displaystyle \begin{aligned} t \mapsto \operatorname{k}_{\mathbf{x}}(t) := \operatorname{k}(\mathbf{x}(t),\cdot) \in H, \end{aligned}$$
that now evolves in H. For a generic static kernel \(\operatorname {k}:\mathcal {X} \times \mathcal {X} \to \mathbb {R}\), its RKHS H will be infinite-dimensional, hence the path \(\operatorname {k}_{\mathbf {x}}:t\mapsto \operatorname {k}_{\mathbf {x}}(t)\) evolves in an infinite-dimensional space. Our aim is to use the signature of the path \(\operatorname {k}_{\mathbf {x}}\),
$$\displaystyle \begin{aligned} {} \Phi_{\mathsf{Sig}}(\operatorname{k}_{\mathbf{x}})= \left(1, \int d \operatorname{k}_{\mathbf{x}}, \int d \operatorname{k}_{\mathbf{x}}^{\otimes 2},\ldots \right), \end{aligned} $$
(17)
and its truncated version
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Sig},:{M}}(\operatorname{k}_{\mathbf{x}}) := \left(1, \int d \operatorname{k}_{\mathbf{x}}, \int d \operatorname{k}_{\mathbf{x}}^{\otimes 2},\ldots, \int d \operatorname{k}_{\mathbf{x}}^{\otimes M},0,0,\ldots\right) \end{aligned} $$
(18)
as a way to represent the path \(\mathbf {x}\). Once again, the intuition is that
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Sig}}(\operatorname{k}_{\mathbf{x}}) \in \prod_{m \ge 0} H^{\otimes m}\end{aligned}$$
is much more expressive than \(\Phi _{\mathsf {Sig}}(\mathbf {x})\) since it computes iterated integrals of the path after the non-linearities of \(\operatorname {k}\) have been applied; when \(\operatorname {k}\) is non-linear and the RKHS of \(\operatorname {k}\) is very large, we obtain a much richer set of coordinates (of iterated integrals) to choose from.
Again the bad news is, of course, that even for \(m=2\) storing \(\int d \operatorname {k}_{\mathbf {x}}^{\otimes 2} \in H^{\otimes 2}\) exactly is in general infeasible for the interesting case when H is infinite-dimensional. The main insight in [42] is that, perhaps surprisingly, the inner product
$$\displaystyle \begin{aligned} {} \operatorname{k}_{\mathsf{Sig}, :M}(\mathbf{x},\mathbf{y}) := \langle \Phi_{\mathsf{Sig},:{M}}(\operatorname{k}_{\mathbf{x}}), \Phi_{\mathsf{Sig},:{M}}(\operatorname{k}_{\mathbf{y}}) \rangle \end{aligned} $$
(19)
can be computed exactly when \(M<\infty \) for piecewise linear paths \(\mathbf {x},\mathbf {y}\) and, perhaps less surprisingly, inherits many of the attractive properties of the signature features.
Theorem 7.1 ([42, Remark B.10])
Let\(\operatorname {k}:\mathbb {R}^d\times \mathbb {R}^d \to \mathbb {R}\)be a kernel and let\(M \in \mathbb {N}\)and let
$$\displaystyle \begin{aligned} \operatorname{k}_{\mathsf{Sig}, :M}: \mathcal{X}_{\mathit{\text{1-var}}} \times \mathcal{X}_{\mathit{\text{1-var}}} \to \mathbb{R} \end{aligned} $$
(20)
be as defined in (19). Then\(\operatorname {k}_{\mathsf {Sig}, :M}(\mathbf {x},\mathbf {y})\)can be evaluated exactly for piecewise linear paths\(\mathbf {x}=({\mathbf {x}}_1,\ldots ,{\mathbf {x}}_{L_{\mathbf {x}}})\), \(\mathbf {y}=({\mathbf {y}}_1,\ldots ,{\mathbf {y}}_{L_{\mathbf {y}}})\)in\(O(M^3L^2)\)computational steps where\(L=\max (L_{\mathbf {x}},L_{\mathbf {y}})\).
An explicit algorithm is given in [42, Algorithm 6], where the truncation degree and approximation order are both set to M. We provide a summary of this dynamic programming algorithm in Sect. 7.2. Even in the case of \(M = \infty \), when \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}(\mathbf {x},\mathbf {y})\equiv \langle \Phi _{\mathsf {Sig},:{\infty }}(\mathbf {x}), \Phi _{\mathsf {Sig},:{\infty }}(\mathbf {y}) \rangle \equiv \langle \Phi _{\mathsf {Sig}}(\mathbf {x}), \Phi _{\mathsf {Sig}}(\mathbf {y}) \rangle \), good approximations of \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}(\mathbf {x},\mathbf {y})\) exist for piecewise linear paths via PDE methods, which is also discussed in Sect. 7.2.

7.1 Approximations

The above Theorem 7.1 provides a powerful method to compute signature kernels and allows for many applications. However, besides the exact evaluation of \(\operatorname {k}_{\mathsf {Sig}, :M}\), approximations are important for various reasons.
Gram Matrix Approximations.
In a machine learning context, the computational bottleneck is not just the evaluation of a single \(\operatorname {k}_{\mathsf {Sig}, :M}(\mathbf {x},\mathbf {y})\) but the computation of a Gram matrix
$$\displaystyle \begin{aligned} G = (\operatorname{k}_{\mathsf{Sig}, :M}(\mathbf{x},\mathbf{y}))_{\mathbf{x},\mathbf{y} \in \mathcal{D}} \end{aligned}$$
where \(\mathcal {D} \subset \mathcal {X}_{\text{seq}}\) is the finite set of data we are given. However, naively evaluating each of the \(|\mathcal {D}|^2\) entries of this matrix is typically not feasible since datasets can be very large. General kernel learning approaches such as the Nyström method can be applied to replace the original Gram matrix by a low-rank approximation, which can have linear complexity in \(|\mathcal {D}|\). Similarly, in a Gaussian process (GP) context, one not only needs to approximate G but also to calculate its inverse as part of the Bayesian update rule. This scales even worse than quadratically in the number of data points, and again generic GP methods have been developed to deal with this.
Untruncated Signature Kernel Approximations.
We regard the truncation level \(M \in \mathbb {N}_{\infty }\) as a hyperparameter that should be tuned on the data. While Theorem 7.1 gives an exact algorithm for finite M (in practice, the algorithm of Theorem 7.1 is feasible for \(M<10\)), one would like to know how well we can approximate the signature kernel with a large M or even \(M=\infty \).
Discrete to Continuous Time Approximation.
Sometimes the data is genuinely a discrete-time sequence, but sometimes the underlying object is a continuous-time path of which we are only given discrete-time samples. The signature kernel is well-defined for continuous-time paths, hence it can be useful to quantify the time-discretization error. These limiting paths are typically not of bounded variation, for example Brownian motion is only of finite p-variation for \(p>2\), so some care has to be taken.
All three of the above points address different issues, and the following discusses them in more detail.
Gram-Matrix Approximations
A general issue in kernel learning is that the computation of the Gram matrix scales quadratically in the number \(N=|\mathcal {D}|\) of data points \(\mathcal {D}\). In particular, each data point \(\mathbf {x} \in \mathcal {D}\) corresponds to a sequence, hence we would require \(N^2\) evaluations of the signature kernel. Even if a single evaluation of \(\operatorname {k}_{\mathsf {Sig}, :M}\) is cheap, this quadratic complexity becomes quickly prohibitive. A generic method that applies to any kernel \(\operatorname {k}\) is the Nyström approximation which, in its simplest form, approximates a general kernel Gram matrix
$$\displaystyle \begin{aligned} G=(\operatorname{k}({\mathbf{x}}_i,{\mathbf{y}}_j))_{i,j=1,\ldots,N} \end{aligned}$$
by sampling uniformly at random a subset \(J\subset [N]\) of size r and defining
$$\displaystyle \begin{aligned} \tilde G = C W^{-1} C^\top \end{aligned}$$
where \(C= (\operatorname {k}({\mathbf {x}}_i,{\mathbf {x}}_j))_{i=1,\ldots ,N,\, j \in J}\) and \(W=(\operatorname {k}({\mathbf {x}}_i,{\mathbf {x}}_j))_{i,j \in J}\). Note that \(\tilde G\) is a random matrix of rank at most r, hence the computational cost scales as \(r^2\), which can be a huge reduction if \(r\ll N\). Besides uniform subsampling of columns, more sophisticated Nyström approximation schemes are available [21]. All these generic methods can be applied to the signature kernel: selecting J amounts to selecting a subset of the sequences in our dataset. However, this only addresses the complexity in the number N of sequences, but not the other computational bottleneck, which is the quadratic complexity in the sequence length L. A simple way to address the latter is to subsample each sequence, which can work but is a brute-force approach. A more elaborate option is to also apply the Nyström approximation at the level of time-points; [42, Algorithm 5] shows that these two Nyström approximations can be done simultaneously: informally, one does not only reweight the importance of sequences but also of time-points within these sequences.
Another situation where the cost of the matrix calculations become quickly too prohibitive is for Gaussian processes. We discuss Gaussian processes with signature kernels in Sect. 9 and again generic GP approximation methods apply. However, we can further reduce the complexity of our algorithms by exploiting the specific structure of the signature kernel. This can be achieved by selecting special tensors [69] or by using diagonalization arguments [45].
Truncation Approximation
The signature tensors decay factorially (see Table 1), and this allows us to quantify the difference between truncated and untruncated signatures as follows.
Proposition 7.2
Let\(\Phi _{\mathsf {Sig},:{\infty }}:=\Phi _{\mathsf {Sig}}\)be the untruncated signature map and let\(M> 0\). Then, for any\(\mathbf {x} \in \mathcal {X}_{\mathit{\text{1-var}}}\), we have
$$\displaystyle \begin{aligned} {} \|\Phi_{\mathsf{Sig},:{\infty}}(\operatorname{k}_{\mathbf{x}}) - \Phi_{\mathsf{Sig},:{M}}(\operatorname{k}_{\mathbf{x}})\| \leq \frac{e^{\|\operatorname{k}_{\mathbf{x}}\|_1} \|\operatorname{k}_{\mathbf{x}}\|^{M+1}_1}{(M+1)!}. \end{aligned} $$
(21)
A proof can be found in [15, Lemma 46]. Despite the nice factorial decay, we emphasize that the 1-variation norm of the path appears in the bound. Furthermore, we can obtain an analogous bound on the truncation error for the kernel by applying Cauchy–Schwarz.
Proposition 7.3
Let\(\operatorname {k}_{\mathsf {Sig},:\infty }\)be the untruncated signature kernel and\(\operatorname {k}_{\mathsf {Sig},:M}\)be the signature kernel truncated at\(M> 0\). Then, for any\(\mathbf {x}, \mathbf {y} \in \mathcal {X}_{\mathit{\text{1-var}}}\), we have
$$\displaystyle \begin{aligned} |\operatorname{k}_{\mathsf{Sig},: \infty}(\mathbf{x}, \mathbf{y}) - \operatorname{k}_{\mathsf{Sig},:M}(\mathbf{x}, \mathbf{y})| \leq \frac{e^{C} C^{M+1}}{(M+1)!}, \end{aligned} $$
where\(C = \max \{ \|\operatorname {k}_{\mathbf {x}}\|_1, \|\operatorname {k}_{\mathbf {y}}\|_1\}\).
Proposition 7.3 quantitatively shows that by choosing a sufficiently high truncation level M, we obtain a good approximation to \(\operatorname {k}_{\mathsf {Sig},:\infty }\). In practice, \(M=10\) is a feasible computation and since \(10! \approx 3.6 \times 10^6\), the contribution of the tail in the sum is negligible. However, we draw attention to the fact that the 1-variation of the sequence appears as a constant. Hence, the level M at which the factorial decay kicks in is determined by the path length. The situation is analogous to computing the matrix exponential, where the matrix norm determines this truncation level, see Figure 1 in [50].
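To get a feeling for when the factorial decay kicks in, one can evaluate the right-hand side of Proposition 7.3 directly; the values of C below are illustrative stand-ins for the 1-variation of short and long paths.

```python
import math

def truncation_bound(C, M):
    # right-hand side of Proposition 7.3: e^C * C^(M+1) / (M+1)!
    return math.exp(C) * C ** (M + 1) / math.factorial(M + 1)

# For a path of small 1-variation the bound is already tiny at M = 10,
# while for a long path the factorial decay has not yet set in.
short_path = truncation_bound(2.0, 10)    # very small
long_path = truncation_bound(20.0, 10)    # astronomically large
```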
Instead of using an exact algorithm for finite M and regarding \(\operatorname {k}_{\mathsf {Sig},:M}\) as an approximation to \(\operatorname {k}_{\mathsf {Sig},:\infty }\), one can also directly approximate \(\operatorname {k}_{\mathsf {Sig},:\infty }\) by numerically solving a PDE [57]. In this case, the approximation error to \(\operatorname {k}_{\mathsf {Sig},:\infty }\) is due to the discretization of the PDE solver. In particular, the error is quantified in terms of the step size; see [57] for further details.
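As a sketch of the PDE viewpoint, the following implements a first-order explicit finite-difference scheme for the Goursat equation \(\partial _s \partial _t u = \langle \dot {\mathbf {x}}(s), \dot {\mathbf {y}}(t) \rangle \, u\) with boundary condition \(u(0,\cdot )=u(\cdot ,0)=1\), using the Euclidean inner product as static kernel. This is a simplified illustration under these assumptions, not the higher-order solver of [57]; the function name and the refinement parameter are hypothetical.

```python
import math
import numpy as np

def sig_kernel_pde(x, y, refine=50):
    """First-order finite-difference sketch of the untruncated signature
    kernel of two piecewise linear paths (points given as rows)."""
    def subdivide(p, r):
        # insert r - 1 extra points per linear segment to refine the grid
        out = [p[0]]
        for a, b in zip(p[:-1], p[1:]):
            out.extend(a + (b - a) * k / r for k in range(1, r + 1))
        return np.array(out)

    X = subdivide(np.asarray(x, float), refine)
    Y = subdivide(np.asarray(y, float), refine)
    A = np.diff(X, axis=0) @ np.diff(Y, axis=0).T   # A[i, j] = <dx_i, dy_j>
    u = np.ones((A.shape[0] + 1, A.shape[1] + 1))   # u(0, .) = u(., 0) = 1
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            # explicit update for the Goursat PDE  u_st = <dx(s), dy(t)> u
            u[i + 1, j + 1] = u[i + 1, j] + u[i, j + 1] + (A[i, j] - 1.0) * u[i, j]
    return float(u[-1, -1])
```

For a single linear segment the exact value is \(\sum _m \langle \mathbf {a},\mathbf {b}\rangle ^m/(m!)^2\) with \(\mathbf {a},\mathbf {b}\) the increments, which makes the discretization error easy to check on toy inputs.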
Discrete Time Approximation
While we have formulated the kernel in terms of continuous paths, this immediately provides a kernel for discrete sequences via the piecewise linear interpolation from Eq. 6. Given a discrete sequence \(\mathbf {x} = ({\mathbf {x}}_0, \ldots , {\mathbf {x}}_L) \in \mathcal {X}_{\text{seq}}\) and the static kernel, we consider the induced continuous path \(\operatorname {k}_{\mathbf {x}}:[0,L] \to H\), defined by
$$\displaystyle \begin{aligned} {} \operatorname{k}_{\mathbf{x}}(t) = (i-t)\operatorname{k}_{{\mathbf{x}}_{i-1}} + (t-i+1)\operatorname{k}_{{\mathbf{x}}_i}, \quad \quad t\in [i-1, i]. \end{aligned} $$
(22)
Then, we can use the algorithms from [42] with the complexity bound from Theorem 7.1 to efficiently compute signature kernels for discrete data.
We can also consider sequences as discretizations of continuous paths. Given a continuous path \(\mathbf {x}: [0,T] \rightarrow \mathbb {R}^d\), we consider the discretization \({\mathbf {x}}^\pi \in \mathcal {X}_{\text{seq}}\) with respect to a partition \(\pi = (t_i)_{i=0}^L\) of the interval \([0,T]\), where \(0 = t_0 < \ldots < t_L = T\). Applying the above interpolation to \({\mathbf {x}}^\pi \), we obtain an approximate continuous path \(\operatorname {k}_{\mathbf {x}}^\pi \) induced by \({\mathbf {x}}^\pi \). Thus, we obtain an approximation \(\Phi _{\mathsf {Sig},:{M}}(\operatorname {k}_{\mathbf {x}}^\pi )\), which converges to the signature \(\Phi _{\mathsf {Sig}}(\operatorname {k}_{\mathbf {x}})\), as \(M\to \infty \) and as the mesh of the partition \(\pi \), \(\sup _{t_i \in \pi } |t_i-t_{i-1}|\), vanishes. The convergence rate is given by the following result; further estimates can be found in [42].
Proposition 7.4 ([42, Corollary 4.3])
Let \(\mathbf {x} \in \mathcal {X}_{\mathit{\text{1-var}}}\) be a path with domain \([0,T]\) and let \(\pi = (t_i)_{i=0}^L\) be a partition of \([0,T]\). Then,
$$\displaystyle \begin{aligned} \|\Phi_{\mathsf{Sig},:{\infty}}(\operatorname{k}_{\mathbf{x}}) - \Phi_{\mathsf{Sig},:{M}}(\operatorname{k}_{\mathbf{x}}^\pi)\| \leq e^{\|\operatorname{k}_{\mathbf{x}}\|_1}\|\operatorname{k}_{\mathbf{x}}\|_1 \cdot \max_{i=1, \ldots, L} \|\operatorname{k}_{\mathbf{x}}|_{[t_{i-1}, t_i]}\|_1. \end{aligned} $$
(23)

7.2 Algorithms

A direct evaluation of the signature kernel \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}:\mathcal {X}_{\text{seq}} \times \mathcal {X}_{\text{seq}} \to \mathbb {R}\) as an inner product is only feasible if the underlying static kernel \(\operatorname {k}: \mathcal {X} \times \mathcal {X} \to \mathbb {R}\) has a low-dimensional RKHS H. If \(\mathcal {X}=\mathbb {R}^d\) and \(\operatorname {k}\) is the Euclidean inner product, then \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) is just the inner product of plain-vanilla signatures, and even this quickly becomes costly. However, the intuition is that by first lifting the path into a path that evolves in a high- or infinite-dimensional RKHS H, we gain much more expressive power. Indeed, this is also what the empirical data shows; see e.g. [69]. In particular, using popular choices such as RBF, Matérn, or Laplace for the static kernel results in a signature kernel \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) that outperforms the “trivial” signature kernel induced by the inner product kernel on \(\mathbb {R}^d\). The computation for nontrivial kernels cannot be done directly since the RKHS H of a generic kernel \(\operatorname {k}\) is infinite-dimensional, so we cannot exactly compute the signature of paths evolving in H. However, there are at least two different algorithms which yield effective approximations.
Dynamic Programming.
The idea is to mimic the argument for the monomial kernel \(\operatorname {k}_{\mathsf {Mon}}\) in \(\mathbb {R}^d\), given in Examples 6.2 and 6.3 in Sect. 6. Concretely, given \(\mathbf {x}=({\mathbf {x}}_0,\ldots ,{\mathbf {x}}_L) \in \mathcal {X}_{\text{seq}}\), we consider the signature of \(\operatorname {k}_{\mathbf {x}}\), which is the sequence \(\mathbf {x}\) lifted to a sequence in H defined in Eq. 22. First, the signature of one linear segment (see Eq. 7) is given as
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Sig}}(\operatorname{k}_{\mathbf{x}}|_{[i-1, i]}) = \Phi_{\mathsf{Mon}}(\delta_i\operatorname{k}_{\mathbf{x}}) = \left(1, \delta_i\operatorname{k}_{\mathbf{x}}, \frac{(\delta_i\operatorname{k}_{\mathbf{x}})^{\otimes 2}}{2!}, \ldots\right), \end{aligned}$$
where \(\delta _i \operatorname {k}_{\mathbf {x}} := \operatorname {k}_{{\mathbf {x}}_{i}}-\operatorname {k}_{{\mathbf {x}}_{i-1}}=\operatorname {k}({\mathbf {x}}_{i},\cdot )-\operatorname {k}({\mathbf {x}}_{i-1},\cdot ) \in H\). Second, Chen's identity, which expresses sequence concatenation as tensor multiplication, allows us to express the signature of the entire piecewise linear path as
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Sig}}(\operatorname{k}_{\mathbf{x}}) = \Phi_{\mathsf{Mon}}(\delta_1\operatorname{k}_{\mathbf{x}}) \otimes \ldots \otimes \Phi_{\mathsf{Mon}}(\delta_L\operatorname{k}_{\mathbf{x}}). \end{aligned}$$
The degree m component can be explicitly expressed as
$$\displaystyle \begin{aligned} \Phi_{\mathsf{Sig},m}(\operatorname{k}_{\mathbf{x}}) = \sum_{\mathbf{i}: 1 \leq i_1 \leq \ldots \leq i_{m} \leq L} \frac{1}{\mathbf{i}!} \delta_{i_1}\operatorname{k}_{\mathbf{x}} \otimes \ldots \otimes \delta_{i_m} \operatorname{k}_{\mathbf{x}}\in H^{\otimes m}, \end{aligned}$$
where for \(\mathbf {i} = (i_1, \ldots , i_m)\), we define \(\mathbf {i}! := n_1! \cdots n_k!\) if \(\mathbf {i}\) contains \(k = |\{i_1, \ldots , i_m\}|\) distinct elements, and \(n_1, \ldots , n_k\) denotes the number of times they occur in \(\mathbf {i}\). Therefore, the truncated signature kernel can be written as
$$\displaystyle \begin{aligned} \operatorname{k}_{\mathsf{Sig}, :M}(\mathbf{x}, \mathbf{y}) & = \sum_{m=0}^M \,\sum_{\substack{\mathbf{i}: 1 \leq i_1 \leq \ldots \leq i_{m} \leq L\\\mathbf{j}: 1 \leq j_1 \leq \ldots \leq j_{m} \leq L}} \frac{1}{\mathbf{i}!\cdot \mathbf{j}!} \prod_{r=1}^m \nabla_{i_r, j_r}^{\operatorname{k}}(\mathbf{x}, \mathbf{y}), \end{aligned} $$
where
$$\displaystyle \begin{aligned} \nabla_{i, j}^{\operatorname{k}}(\mathbf{x}, \mathbf{y}) :=& \langle \delta_{i} \operatorname{k}_{\mathbf{x}}, \delta_{j} \operatorname{k}_{\mathbf{y}}\rangle_H \\ =& \operatorname{k}({\mathbf{x}}_{i-1}, {\mathbf{y}}_{j-1}) - \operatorname{k}({\mathbf{x}}_i, {\mathbf{y}}_{j-1}) - \operatorname{k}({\mathbf{x}}_{i-1}, {\mathbf{y}}_j) + \operatorname{k}({\mathbf{x}}_i, {\mathbf{y}}_j). \end{aligned} $$
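The cross-difference matrix \(\nabla^{\operatorname{k}}\) is the only quantity the dynamic-programming algorithm needs from the static kernel, and it is obtained from the static Gram matrix by second-order differencing. A minimal sketch (the function name is ours):

```python
import numpy as np

def nabla_matrix(x, y, static_kernel):
    """Cross-difference matrix: nabla[i, j] = <delta_i k_x, delta_j k_y>_H,
    computed from static-kernel evaluations on the sequence entries only."""
    G = np.array([[static_kernel(a, b) for b in y] for a in x])
    return G[1:, 1:] - G[1:, :-1] - G[:-1, 1:] + G[:-1, :-1]
```

With the Euclidean inner product as static kernel, this recovers the Gram matrix of the raw increments, which serves as a consistency check; any other kernel (e.g. an RBF) can be plugged in unchanged.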
Finally, we note that we can apply Horner’s scheme to obtain
$$\displaystyle \begin{aligned} \operatorname{k}_{\mathsf{Sig},:M}(\mathbf{x}, \mathbf{y}) = 1 + \sum_{\substack{1 \leq i_1 \leq L\\ 1 \leq j_1 \leq L}} \nabla_{i_1, j_1}^{\operatorname{k}}(\mathbf{x}, \mathbf{y}) \left(1 + \sum_{\substack{i_1 \leq i_2 \leq L\\ j_1 \leq j_2 \leq L}} \frac{\nabla_{i_2, j_2}^{\operatorname{k}}(\mathbf{x}, \mathbf{y})}{C_2({\mathbf{i}}_2, {\mathbf{j}}_2)} \left( \cdots \left(1 + \sum_{\substack{i_{M-1} \leq i_M \leq L\\ j_{M-1} \leq j_M \leq L}} \frac{\nabla_{i_M, j_M}^{\operatorname{k}}(\mathbf{x}, \mathbf{y})}{C_M({\mathbf{i}}_M, {\mathbf{j}}_M)} \right) \cdots \right) \right) \end{aligned} $$
(24)
where \({\mathbf {i}}_m = (i_1, \ldots , i_m)\), \({\mathbf {j}}_m = (j_1, \ldots , j_m)\), and \(C_m({\mathbf {i}}_m, {\mathbf {j}}_m) = n_{{\mathbf {i}}_m} \cdot n_{{\mathbf {j}}_m}\), where \(n_{{\mathbf {i}}_m} \ge 1\) is the largest number such that \(i_k = i_m\) for all \(k = m-n_{{\mathbf {i}}_m}+1, \ldots , m\), and \(n_{{\mathbf {j}}_m}\) is defined analogously. The constants \(C_m({\mathbf {i}}_m, {\mathbf {j}}_m)\) account for the counting of repetitions from the \(\frac {1}{\mathbf {i}!\cdot \mathbf {j}!}\) term. It is straightforward to implement (24) as a recursive algorithm which inherits the advantages of the classic Horner scheme.
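As a sanity check of the derivation above, the following sketch evaluates the truncated signature kernel twice: once via the explicit sum over non-decreasing index tuples, and once by brute force, i.e. by computing truncated signatures with Chen's identity and taking their inner product. We take the Euclidean inner product as static kernel so that signatures can be computed explicitly; function names are ours, and the direct enumeration is exponential in M, so this is for illustration only:

```python
import itertools
import math
import numpy as np

def mult_factorial(idx):
    """i! from the text: product of the factorials of the multiplicities."""
    return math.prod(math.factorial(idx.count(v)) for v in set(idx))

def sig_kernel_sum(x, y, M):
    """Truncated signature kernel via the explicit sum over non-decreasing
    index tuples, with the Euclidean inner product as static kernel."""
    dx, dy = np.diff(x, axis=0), np.diff(y, axis=0)
    nabla = dx @ dy.T                      # <delta_i x, delta_j y>
    total = 1.0                            # m = 0 term
    for m in range(1, M + 1):
        for ii in itertools.combinations_with_replacement(range(len(dx)), m):
            for jj in itertools.combinations_with_replacement(range(len(dy)), m):
                w = math.prod(nabla[a, b] for a, b in zip(ii, jj))
                total += w / (mult_factorial(ii) * mult_factorial(jj))
    return total

def truncated_signature(x, M):
    """Signature up to level M via Chen's identity over linear segments."""
    d = x.shape[1]
    sig = [np.ones(())] + [np.zeros((d,) * m) for m in range(1, M + 1)]
    for delta in np.diff(x, axis=0):
        seg, t = [np.ones(())], np.ones(())
        for m in range(1, M + 1):
            t = np.multiply.outer(t, delta) / m      # delta^{(x) m} / m!
            seg.append(t)
        # Chen's identity: tensor-multiply the truncated series
        sig = [sum(np.multiply.outer(sig[k], seg[m - k]) for k in range(m + 1))
               for m in range(M + 1)]
    return sig
```

Summing the level-wise inner products of `truncated_signature(x, M)` and `truncated_signature(y, M)` agrees with `sig_kernel_sum(x, y, M)` to machine precision, which confirms the \(\frac{1}{\mathbf{i}!}\) weights.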
PDE Solver.
Another approach was given in [57], which focuses on the case \(M=\infty \). The idea is to define the function \(K: [0,L]^2 \rightarrow \mathbb {R}\) by the signature kernel
$$\displaystyle \begin{aligned} K(s,t) = \operatorname{k}_{\operatorname{\mathsf{Sig}}}(\mathbf{x}|_{[0,s]}, \mathbf{y}|_{[0,t]}) \end{aligned}$$
evaluated on \(\mathbf {x}\) restricted to \([0,s]\) and \(\mathbf {y}\) restricted to \([0,t]\). Then, this function satisfies the following PDE
$$\displaystyle \begin{aligned} \frac{\partial^2 K}{\partial s \partial t} (s,t) = \nabla_{\lceil s \rceil, \lceil t \rceil}^{\operatorname{k}}(\mathbf{x}, \mathbf{y}) \cdot K(s,t), \quad K(s,0) = K(0, t) = 1. \end{aligned}$$
Such PDEs are known as Goursat PDEs. Hence, the computation boils down to applying a PDE solver; [57] provides such a scheme and analyzes its error rate. A priori, the PDE approach is focused on \(M=\infty \), but cross-validating M often improves performance, and [8] provides a generalization of the PDE approach to approximate \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) for any \(M \ge 1\).
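A minimal first-order finite-difference scheme for this Goursat problem integrates the PDE cell by cell over a refined grid; the following is a simplified sketch of the idea and not the higher-order scheme of [57] (names and tolerances are ours):

```python
import numpy as np

def sig_kernel_pde(x, y, n=100):
    """First-order Goursat scheme for d^2K/dsdt = <x'(s), y'(t)> K with
    K(s,0) = K(0,t) = 1, for piecewise-linear paths x, y of shape (L+1, d).
    n is the number of grid refinements per segment."""
    nabla = np.diff(x, axis=0) @ np.diff(y, axis=0).T   # segment coefficients
    C = np.kron(nabla, np.full((n, n), 1.0 / n**2))     # coefficient * h^2 per cell
    P, Q = C.shape
    K = np.ones((P + 1, Q + 1))
    for i in range(P):
        for j in range(Q):
            # integrate the PDE over one cell, lower-left rectangle rule
            K[i + 1, j + 1] = K[i + 1, j] + K[i, j + 1] - K[i, j] + C[i, j] * K[i, j]
    return K[-1, -1]
```

For two single linear segments with increments \(\mathbf{a}, \mathbf{b}\), the exact value is \(\sum _{m \ge 0} \langle \mathbf{a},\mathbf{b}\rangle ^m/(m!)^2\), which gives a closed-form check of the solver.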
Both algorithms work well in practice, and packages are available7 that adapt the underlying algorithms to work on GPUs. These two algorithms are very different, even for a single evaluation of \(\operatorname {k}_{\mathsf {Sig},:M}(\mathbf {x},\mathbf {y})\). The dynamic programming approach is exact for finite M, and for \(M=\infty \) the approximation error stems from the factorial decay; in contrast, the PDE approach provides an approximation whose error is determined by the properties of the numerical scheme the PDE solver uses, such as the step size. Similarly, these two algorithms are also amenable to different computational strategies when the whole Gram matrix needs to be evaluated. More recently, another approach based on computing random Fourier signature features has been developed [70]. The trade-offs between these choices can vary between datasets, and we refer the reader to [70, Section 4] for detailed complexity and empirical comparisons, and to [66] for a user’s guide to the KSig package.

7.3 Hyperparameter Tuning

Given a fast algorithm to compute the Gram matrix for a given parameter choice, the vast majority of the computational budget should be spent on hyperparameter tuning. This is true in general for any kernel but in particular for the signature kernel where the parameter choice can drastically influence performance. The signature kernel has the following hyperparameters.
1.
Static Kernel Parameters\(\Theta _{\operatorname {k}}\). The signature kernel takes as input a kernel \(\operatorname {k}\) on \(\mathcal {X}\) and turns it into a kernel \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) on \(\mathcal {X}_{\text{seq}}\). Typically, the static kernel \(\operatorname {k}\) is parametrized; for example, the classic RBF kernel \(\operatorname {k}(\mathbf {x},\mathbf {y}) = \exp (-\theta \|\mathbf {x}-\mathbf {y}\|^2)\) has a hyperparameter \(\theta \). In general, hyperparameters are not limited to scalars and we denote with \(\Theta _{\operatorname {k}}\) the set of all possible values of such hyperparameters. Consequently, the signature kernel \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) then inherits all these hyperparameters of the static kernel \(\operatorname {k}\).
 
2.
Truncation Parameters\(\mathbb {N}_\infty \). In practice, the optimal choice of the truncation level \(M \in \mathbb {N}_\infty \) varies widely between datasets and can be as low as \(M=2\). Informally, one expects a trade-off between the non-linearity of the static kernel \(\operatorname {k}\) and the optimal truncation level M, but even without kernelization the choice of optimal M is subtle [25]. Indeed, an extreme case is the trivial inner product kernel, for which experiments show that the optimal M is large; in contrast, for RBF and other kernels, lower choices of M are typically optimal. However, these are stylized facts and vary between datasets.
 
3.
Preprocessing Parameters\(\Theta _{\text{preprocess}}\). It is common practice in time-series modelling to preprocess the original time-series. For example, instead of the original sequence in \(\mathbb {R}^d\) one considers a sequence in \(\mathbb {R}^{2d}\) where the additional coordinates are given by lagging the original time series (a sliding window). Similarly, for multi-variate time-series the different coordinates might evolve on different scales and it makes sense to normalize them; see for example [69] for some standard time-series preprocessing for the signature kernel. In general, there is a large variety of preprocessing steps and we denote these with \(\Theta _{\text{preprocess}}\).
 
4.
Algebraic Parameters\(\Theta _{\text{algebra}}\). Signatures can be generalized in several ways and one can replace iterated integrals by other objects. How exactly these objects are constructed goes beyond the scope of this article, but this often allows us to increase the empirical performance. We do not further discuss this here but refer to the “non-geometric” signatures in [42, 68] which require us to go beyond the usual (Hopf-)algebraic treatment of signatures, see [20]. This induces another parameter-set \(\Theta _{\text{algebra}}\).
 
5.
Normalization Parameters\(\Theta _{\text{normalization}}\). As we discuss in the next section, a normalization makes the signature kernel a more robust statistic; in particular, this resolves the aforementioned issues about non-compactness. There are different ways to normalize, and this is further discussed in [15]. Furthermore, we can independently rescale the inner products for tensors of different degrees to obtain different signature kernels [8]. We denote these normalization and rescaling parameters \(\Theta _{\text{normalization}}\).
 
Therefore the hyperparameter set of the signature kernel \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) built from a static kernel \(\operatorname {k}\) with parameter set \(\Theta _{\operatorname {k}}\) is
$$\displaystyle \begin{aligned} \Theta_{\operatorname{k}_{\operatorname{\mathsf{Sig}}}} := \Theta_{\operatorname{k}} \times \mathbb{N}_\infty \times \Theta_{\text{preprocess}} \times \Theta_{\text{algebra}} \times \Theta_{\text{normalization}}. \end{aligned}$$
We emphasize that (as for most kernels) the hyperparameter selection is essential to get strong performance. Depending on the problem this requires different approaches; a basic distinction is between supervised and unsupervised problems.
Supervised Learning.
An objective function is given, and in principle straightforward cross-validation allows us to optimize \(\theta \in \Theta _{\operatorname {k}_{\operatorname {\mathsf {Sig}}}}\) by grid search. However, since \(\Theta _{\operatorname {k}_{\operatorname {\mathsf {Sig}}}}\) is usually high-dimensional, more sophisticated hyperparameter-tuning algorithms, such as Bayesian optimization, can be very useful and are implemented in most machine learning toolboxes. Either way, standard hyperparameter tuning can be used, and simple grid search typically works well.
Unsupervised Learning.
In contrast to supervised problems, the situation is less straightforward since the choice of optimization criterion is often not immediate. Even worse, standard heuristics such as the median heuristic [28] are not justified for the signature kernel and indeed typically lead to poor performance. For example, we discuss this in the context of hypothesis testing in Sect. 9. There, the median heuristic is used for some kernels, but there is no reason why this heuristic should apply to the signature kernel, and indeed it would result in tests with poor power. In general, for unsupervised problems the specific application must be taken into account.

8 Robust Moments, Signatures, and Kernels

In Sect. 5, we saw that monomial features \(\Phi _{\mathsf {Mon}}\) lose the ability to approximate functions and characterize laws of random variables when we consider unbounded domains. These issues extend generically to the signature features \(\Phi _{\mathsf {Sig}}\). While compactness can be a reasonable assumption for monomials in the case of \(\mathcal {K} \subset \mathbb {R}^d\), this is a much stronger assumption8 to make for signatures in the case of \(\mathcal {K} \subset \mathcal {X}_{\text{1-var}}\).
Example 8.1 (No Characteristicness)
There exist two different stochastic processes \(\mathbf {X},\mathbf {Y}\) with sample paths that are straight lines, such that
$$\displaystyle \begin{aligned} \mathbb{E}[\Phi_{\mathsf{Sig}}(\mathbf{X})]=\mathbb{E}[\Phi_{\mathsf{Sig}}(\mathbf{Y})]. \end{aligned}$$
For example, one can take
$$\displaystyle \begin{aligned} \mathbf{X}(t)=t \cdot \mathbf{V} \text{ and } \mathbf{Y}(t) = t\cdot \mathbf{W}\end{aligned}$$
where \(\mathbf {V}\) is a random variable in \(\mathbb {R}^2\) drawn from the density
$$\displaystyle \begin{aligned} \mathbf{V} \sim \rho_1(\mathbf{v})=\frac{1}{2 \pi \cdot {\mathbf{v}}_1 {\mathbf{v}}_2} \exp\left(\frac{-(\log^2({\mathbf{v}}_1) + \log^2({\mathbf{v}}_2))}{2}\right) \end{aligned}$$
and \(\mathbf {W}\) drawn from
$$\displaystyle \begin{aligned} \rho_2({\mathbf{w}}_1, {\mathbf{w}}_2) = \rho_1({\mathbf{w}}_1,{\mathbf{w}}_2) \prod_{i=1}^2 \left(1 + \sin{}(2\pi \log({\mathbf{w}}_i))\right), \end{aligned}$$
shown in Fig. 2. One option is to make stronger model assumptions on the decay rate of moments, which would allow characterizing the law; in \(\mathbb {R}^d\) this is classic, and for paths we refer to [13]. However, in an inference context we usually do not want to make strong assumptions, and the above example is a warning that things can go genuinely wrong, even for trivial stochastic processes. A different, but related, issue is the notion of robustness in statistics. At a high level, a robust statistic is one which is stable with respect to small perturbations of the underlying measure.
Fig. 2
Probability density functions of \(\mathbf {V}\) (left) and \(\mathbf {W}\) (right)
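Since \(\rho _2\) factorizes coordinate-wise, the claim reduces to the classical fact that the lognormal distribution is not determined by its moments: the sin-perturbation leaves every integer moment unchanged. This can be checked numerically after the substitution \(u = \log w\), which turns the k-th moment into \(\int e^{ku}\varphi (u)\,du\) against the standard normal density \(\varphi \) (a simple Riemann-sum sketch; names and tolerances are ours):

```python
import numpy as np

def lognormal_moment(k, perturbed=False, grid=200001, cutoff=12.0):
    """k-th moment of the (possibly sin-perturbed) lognormal marginal,
    computed after the substitution u = log(w) by a Riemann sum."""
    u = np.linspace(-cutoff, cutoff, grid)
    phi = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
    f = np.exp(k * u) * phi
    if perturbed:
        f = f * (1.0 + np.sin(2 * np.pi * u))      # the perturbation of rho_2
    return float(np.sum(f) * (u[1] - u[0]))
```

Both integrals agree with the lognormal moment \(e^{k^2/2}\) for every integer k, because \(\int e^{ku}\varphi (u)\sin (2\pi u)\,du\) vanishes for integer k.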
Example 8.2 (No Robustness)
Suppose \(\mathbf {X}\) is a real-valued random variable with law \(\mu \). Fix \(\epsilon \in (0,1)\) and \(\mathbf {x} \in \mathbb {R}\) and let \({\mathbf {X}}_{\mathbf {x},\epsilon }\) be the perturbed random variable with law \(\mu _{\mathbf {x},\epsilon } = (1-\epsilon ) \mu + \epsilon \delta _{\mathbf {x}}\). When \(\mathbf {x} \gg \mathbb {E}[\mathbf {X}]\), it is easy to see that the outlier Dirac measure \(\delta _{\mathbf {x}}\) has outsized influence on the mean \(\mathbb {E}[{\mathbf {X}}_{\mathbf {x},\epsilon }]\), and this influence is magnified for higher moments \(\mathbb {E}[{\mathbf {X}}_{\mathbf {x},\epsilon }^{\otimes m}]\).
The reason Example 8.2 is worrisome is that in practice we are only given a finite number of samples to estimate our statistics, such as \(\mathbb {E}[{\mathbf {X}}^{\otimes m}]\) or \(\mathbb {E}[\int d{\mathbf {X}}^{\otimes m}]\). This finite sample is taken from real-world observations, and it is common that such data is not clean; that is, outliers such as \(\mathbf {x} \gg \mathbb {E}[\mathbf {X}]\) in the above example are not uncommon due to sensor errors, data recording, etc.9 However, some statistics can still be well-estimated, and the design of such robust features has been a central topic of modern statistics [35, 40]. Ironically, the standard machine learning benchmarks for time-series are fairly clean in terms of outliers, but such outliers appear often in “real-world” applications.

8.1 Robust Statistics

The above example motivates the notion of the influence function, which quantifies the robustness of a statistic.
Definition 8.3 ([35])
Let \(\mathcal {X}\) be a topological space and \(\mathcal {P}(\mathcal {X})\) be the space of Borel probability measures on \(\mathcal {X}\). For a map \(T: \mathcal {P}(\mathcal {X}) \rightarrow \mathbb {R}\), we define the influence function ofTat\(\mu \in \mathcal {P}(\mathcal {X})\), denoted \(\mathsf {IF}(\cdot ; T, \mu ) : \mathcal {X} \to \mathbb {R}\), by
$$\displaystyle \begin{aligned} \mathsf{IF}(\mathbf{x}; T, \mu) := \lim_{\epsilon \searrow 0} \frac{T(\epsilon \delta_{\mathbf{x}} + (1-\epsilon)\mu) - T(\mu)}{\epsilon} \end{aligned}$$
for each \(\mathbf {x} \in \mathcal {X}\) where this limit exists. We say that T is B-robust at\(\mu \) if \(\sup _{\mathbf {x} \in \mathcal {X}} |\mathsf {IF}(\mathbf {x}; T, \mu )| < \infty \).
Example 8.2 shows that the statistic “the m-th moment”,
$$\displaystyle \begin{aligned} T(\mu)= \int {\mathbf{x}}^{\otimes m}\, \mu(d\mathbf{x})\end{aligned}$$
is not B-robust, and by the same argument we immediately see that expected signatures are not robust statistics for describing laws of stochastic processes. What goes wrong in both cases is that the monomials, \({\mathbf {x}}^{\otimes m}\) resp. \(\int d{\mathbf {x}}^{\otimes m}\), explode as the argument grows large: if we observe an extreme value \(\mathbf {x}\) among our N samples, this results in an \(\epsilon =\frac {1}{N}\)-weighted Dirac at \(\mathbf {x}\) in the resulting empirical measure, and the contribution of \({\mathbf {x}}^{\otimes m}\), weighted with mass \(\epsilon \), can dominate the estimator.
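For the m-th moment in one dimension the influence function can be computed explicitly: the statistic is linear in the measure, so \(T(\epsilon \delta _{\mathbf {x}} + (1-\epsilon )\mu ) = (1-\epsilon )T(\mu ) + \epsilon {\mathbf {x}}^m\) and hence \(\mathsf {IF}(\mathbf {x}; T, \mu ) = {\mathbf {x}}^m - T(\mu )\), which is unbounded in \(\mathbf {x}\). A short numerical sketch of the difference quotient from Definition 8.3 (names are ours):

```python
import numpy as np

def moment_statistic(samples, m):
    """T(mu) = m-th moment of the empirical measure of the samples."""
    return np.mean(samples ** m)

def influence(x, samples, m, eps=1e-6):
    """Finite-epsilon difference quotient of Definition 8.3 for T = m-th
    moment, using T(eps*delta_x + (1-eps)*mu) = (1-eps)*T(mu) + eps*x^m."""
    T = moment_statistic(samples, m)
    return ((1 - eps) * T + eps * x**m - T) / eps
```

The difference quotient equals \(x^m - T(\mu )\) up to rounding and grows without bound in x, which is exactly the failure of B-robustness.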

8.2 Robust Signatures

One way to address this issue is to compensate for the exponential growth as \(\mathbf {x}\) becomes large. In the case of the monomial map for \(\mathbf {x} \in \mathbb {R}^d\), this can be achieved by rescaling \({\mathbf {x}}^{\otimes m}\) to \(\lambda _m(\mathbf {x}) {\mathbf {x}}^{\otimes m}\), where \(\lambda _m: \mathbb {R}^d \rightarrow \mathbb {R}\) is a function which decays quickly as \(\|\mathbf {x}\| \to \infty \). What makes the precise choice of the penalty \(\lambda \) challenging is that it must decay quickly enough to make the statistics robust, while not decaying so quickly that the universal and characteristic properties are lost. In particular, we wish to have robust analogues of Propositions 2.1 and 4.1. As a technical note, there are two differences when we consider these robust analogues.
1.
We consider the space of continuous bounded functions \(C_b(\mathcal {X}, \mathbb {R})\) rather than continuous functions.
 
2.
In order to deal with non-locally compact spaces, we equip the space of continuous bounded functions in the next two theorems with the strict topology. This was introduced in [29], and further discussion on its application to robustness can be found in [15].
 
In the following, we reformulate the scaling factors \(\lambda _m\) into a single normalization map \(\lambda : \prod _{m \ge 0} (\mathbb {R}^d)^{\otimes m} \to \prod _{m \ge 0} (\mathbb {R}^d)^{\otimes m}\).
Theorem 8.4
There exists a map\(\lambda : \prod _{m \ge 0}(\mathbb {R}^d)^{\otimes m} \to \prod _{m \ge 0} (\mathbb {R}^d)^{\otimes m}\)such that
$$\displaystyle \begin{aligned} \lambda \circ\Phi_{\mathsf{Mon}}: \mathbb{R}^d \to \prod_{m \ge 0}(\mathbb{R}^d)^{\otimes m}, \quad \mathbf{x} \mapsto \lambda (\Phi_{\mathsf{Mon}}(\mathbf{x})), \end{aligned}$$
1.
is universal to\(C_b(\mathbb {R}^d,\mathbb {R})\)equipped with the strict topology, and
 
2.
is characteristic to the space of signed measures on\(\mathbb {R}^d\).
 
Moreover,\(\mu \mapsto \mathbb {E}_{\mathbf {X} \sim \mu }[\lambda \circ \Phi _{\mathsf {Mon}}(\mathbf {X})]\)is B-robust.
In particular, the set \(\{\mathbf {x} \mapsto \langle \mathbf {w}, \lambda \circ \Phi _{\mathsf {Mon}}(\mathbf {x}) \rangle \, : \, \mathbf {w} \in \bigoplus (\mathbb {R}^d)^{\otimes m}\}\) is dense in \(C_b(\mathbb {R}^d,\mathbb {R})\) in the strict topology, and the normalized moments \(\mathbb {E}[\lambda \circ \Phi _{\mathsf {Mon}}(\mathbf {X})]\) always characterize the law of \(\mathbf {X}\). The existence of the function \(\lambda \) is not trivial, and while \(\lambda \) is not given in explicit form, it can be efficiently computed, see [15, Proposition 14]. The same argument applies to signatures.
Theorem 8.5
There exists a map\(\lambda : \prod _{m \ge 0} \bar H^{\otimes m} \to \prod _{m \ge 0} \bar H^{\otimes m}\)such that
$$\displaystyle \begin{aligned} \lambda \circ\Phi_{\mathsf{Sig}}: H_{1-var} \to \prod_{m \ge 0} \bar H^{\otimes m}, \quad \mathbf{x} \mapsto \lambda (\Phi_{\mathsf{Sig}}(\bar{\mathbf{x}})) \end{aligned}$$
1.
is universal to10\(C_b(\bar H_{1-var},\mathbb {R})\)equipped with the strict topology, and
 
2.
is characteristic to the space of signed measures on\(H_{1-var}\).
 
Moreover,\(\mu \mapsto \mathbb {E}_{\mathbf {X} \sim \mu }[\lambda \circ \Phi _{\mathsf {Sig}}(\bar {\mathbf {X}})]\)is B-robust.
We formulated Theorem 8.5 for paths evolving in Hilbert spaces in view of kernelization but the result also holds for the ordinary signature with \(\operatorname {k}=\operatorname {id}\) and \(H=\mathbb {R}^d\). Further, the analogous statements hold without adding a time coordinate by considering functions or measures on tree-like equivalence classes of paths, see [15, Theorem 21].
Remark 8.6 (Tensor Normalization as Path Normalization)
The normalization \(\lambda \) on the level of the tensors is equivalent to a normalization of the underlying path. That is, \(\lambda \circ \Phi _{\mathsf {Sig}}(\mathbf {x})\) can be rewritten as \(\Phi _{\mathsf {Sig}}(\frac {\mathbf {x}}{\Lambda (\mathbf {x})})\) for a function \(\Lambda : \mathcal {X}_{1-var} \to \mathbb {R}\) of the path \(\mathbf {x}\). In this way, we can think of \(\lambda \) as a very specific data normalization, see [15].
Remark 8.7 (Nearly Compact Support Does Not Help)
We emphasize that it can be easy to find, for every \(\epsilon >0\), a compact set that carries \(1-\epsilon \) of the mass of the probability measure. The issue is that the functions we are interested in, monomials or iterated integrals, behave badly on the remaining set carrying mass \(\epsilon \). The normalization \(\lambda \) turns these functions into bounded, hence well-behaved, functions while still preserving the necessary algebraic structure.

8.3 Robust (Signature) Kernels

The normalization \(\lambda \) allows us to overcome the drawbacks of our feature maps, \(\Phi _{\mathsf {Mon}}\) and \(\Phi _{\mathsf {Sig}}\) on non-compact sets. Given a static kernel \(\operatorname {k}:\mathcal {X} \times \mathcal {X} \to \mathbb {R}\) with associated feature map \(\varphi : \mathcal {X} \rightarrow H\), we can define robust versions of the associated kernels, \(\operatorname {k}_{\mathsf {Mon}}\) and \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\), by
$$\displaystyle \begin{aligned} \operatorname{k}_{\mathsf{Mon}}(\mathbf{x}, \mathbf{y}) &:= \langle \lambda \circ \Phi_{\mathsf{Mon}}(\varphi(\mathbf{x})), \lambda \circ \Phi_{\mathsf{Mon}}(\varphi(\mathbf{y})) \rangle \\ \operatorname{k}_{\operatorname{\mathsf{Sig}}}(\mathbf{x},\mathbf{y}) &:= \langle \lambda \circ \Phi_{\mathsf{Sig}}(\varphi(\mathbf{x})), \lambda \circ \Phi_{\mathsf{Sig}}(\varphi(\mathbf{y})) \rangle. \end{aligned} $$
With some abuse of notation, we use the same symbols as for the non-robust versions. The above universality and characteristicness results, Theorems 8.4 and 8.5, immediately translate to the corresponding kernelized statements.
From the computational side, using results from [15, 42], the normalization \(\lambda \) can itself be kernelized, which then gives the following.
Proposition 8.8
Let\(\operatorname {k}:\mathcal {X} \times \mathcal {X} \to \mathbb {R}\)be a kernel,\(\mathbf {x},\mathbf {y} \in \mathcal {X}_{\mathit{\text{seq}}}\)be two sequences of maximal length L, and M be the truncation level. Then the robust truncated signature kernel\(\operatorname {k}_{\mathsf {Sig}, :M}(\mathbf {x},\mathbf {y})\)can be evaluated in\(O(L^2(c_{\operatorname {k}}+M^3)+q)\)computational steps and using\(O(M^2L^2+M+q)\)memory where\(c_{\operatorname {k}}\)denotes the cost of a single evaluation of the static\(\operatorname {k}\)and q denotes the cost of finding the non-negative root of a univariate polynomial of degree 2M.
The algorithm in Proposition 8.8 is in fact elementary given that we know how to compute the signature kernel, see [15] for details.
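To illustrate the root-finding step mentioned in Proposition 8.8: scaling a path by \(c\) scales its level-m signature term by \(c^m\), so given the squared level norms \(a_m = \|\Phi _{\mathsf {Sig},m}(\mathbf {x})\|^2\), a normalization constant is a non-negative root of the degree-2M polynomial equation \(\sum _{m=0}^{M} a_m c^{2m} = \text{target}\). The following bisection sketch is ours and not the exact normalization of [15]; the target value is a stand-in:

```python
import numpy as np

def norm_constant(level_sq_norms, target):
    """Find c >= 0 with sum_m a_m * c^(2m) = target by bisection, assuming
    target >= a_0 (the polynomial is increasing in c and equals a_0 at 0)."""
    a = np.asarray(level_sq_norms, dtype=float)
    f = lambda c: sum(am * c ** (2 * m) for m, am in enumerate(a)) - target
    lo, hi = 0.0, 1.0
    while f(hi) < 0:          # grow the bracket until the root is enclosed
        hi *= 2.0
    for _ in range(200):      # bisect down to machine precision
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)
```

Rescaling level m of both arguments by \(c_{\mathbf{x}}^m c_{\mathbf{y}}^m\) then yields the normalized kernel value from the unnormalized level-wise inner products.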

9 Applications and Further Developments

Having an efficiently computable kernel for sequential data in arbitrary spaces allows us to apply kernel methods to a wide variety of problems. Below we discuss some of them.
A Maximum Mean Discrepancy for Stochastic Processes.
Given a set \(\mathcal {X}\), we often care about the space of probability measures \(\mathcal {P}(\mathcal {X})\) on this space. We have encountered the notion of characteristicness which says that we can uniquely represent a measure \(\mu \in \mathcal {P}(\mathcal {X})\) as an expectation, that is
$$\displaystyle \begin{aligned} \mu \mapsto \mathbb{E}_{\mathbf{X} \sim \mu}[\varphi(\mathbf{X})] \end{aligned}$$
is injective. If the co-domain of \(\varphi \), the feature space, has a norm, this induces the metric
$$\displaystyle \begin{aligned} d_{\operatorname{k}}(\mu,\nu) = \|\mathbb{E}_{\mathbf{X} \sim \mu}[\varphi(\mathbf{X})]-\mathbb{E}_{\mathbf{Y} \sim \nu}[\varphi(\mathbf{Y})]\|. \end{aligned}$$
The caveat is that these expectations take values in an infinite-dimensional space. However, for the choice \(\varphi (\mathbf {x})=\operatorname {k}(\mathbf {x},\cdot )\), the above can be efficiently computed; a straightforward computation using the fact that H is a RKHS shows
$$\displaystyle \begin{aligned} \|\mathbb{E}[\varphi(\mathbf{X})]-\mathbb{E}[\varphi(\mathbf{Y})]\|^2 &= \langle \mathbb{E}[\varphi(\mathbf{X})]-\mathbb{E}[\varphi(\mathbf{Y})],\mathbb{E}[\varphi(\mathbf{X})]-\mathbb{E}[\varphi(\mathbf{Y})] \rangle\\ &=\mathbb{E}[\operatorname{k}(\mathbf{X},\mathbf{X}')] -2\mathbb{E}[\operatorname{k}(\mathbf{X},\mathbf{Y})] +\mathbb{E}[\operatorname{k}(\mathbf{Y},\mathbf{Y}')], \end{aligned} $$
where \(\mathbf {X}, \mathbf {X}' \sim \mu \) and \(\mathbf {Y}, \mathbf {Y}' \sim \nu \) are independent copies. The right-hand side is straightforward to estimate when only finite samples of \(\mu \) and \(\nu \) are given; in particular,
$$\displaystyle \begin{aligned} d_{\operatorname{k}}(\mu,\nu) = \sqrt{\mathbb{E}[\operatorname{k}(\mathbf{X},\mathbf{X}')] -2\mathbb{E}[\operatorname{k}(\mathbf{X},\mathbf{Y})] +\mathbb{E}[\operatorname{k}(\mathbf{Y},\mathbf{Y}')]} \end{aligned}$$
is a metric which can be easily estimated with finite samples. The above metric is also known as kernel maximum mean discrepancy (MMD) since a direct calculation shows that
$$\displaystyle \begin{aligned} d_{\operatorname{k}}(\mu,\nu)=\sup_{f \in H: \|f\| \le 1} \left| \int_{\mathcal{X}} f(\mathbf{x}) \mu(d\mathbf{x}) - \int_{\mathcal{X}} f(\mathbf{x}) \nu(d\mathbf{x})\right|. \end{aligned}$$
Hence, \(d_{\operatorname {k}}\) is the maximum difference we can achieve by testing our measure against functions in the unit ball of the RKHS of \(\operatorname {k}\). MMDs have found many applications in machine learning; see [51] for a survey. Having the signature kernel at hand then immediately yields an MMD for probability measures on paths; in particular, we obtain a metric for laws of stochastic processes. The theoretical guarantees of the signature kernel then translate into theoretical guarantees for the signature kernel MMD \(d_{\operatorname {k}_{\operatorname {\mathsf {Sig}}}}\). However, the topology on the space of probability measures induced by the signature MMD is more subtle: when the RKHS is finite-dimensional, the MMD induces weak convergence, but this is not the case for infinite-dimensional RKHS; see [15] for details.
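In practice, the squared MMD is estimated from the three Gram blocks of the two samples; the sketch below (names are ours) applies to any kernel Gram matrix, in particular to a precomputed signature-kernel Gram matrix:

```python
import numpy as np

def mmd2(Kxx, Kyy, Kxy, unbiased=True):
    """Squared-MMD estimate from the Gram blocks Kxx, Kyy, Kxy of a kernel
    evaluated on samples x_1..x_m and y_1..y_n."""
    m, n = Kxx.shape[0], Kyy.shape[0]
    if unbiased:   # U-statistic: drop the diagonal terms
        xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
        yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    else:          # biased V-statistic
        xx, yy = Kxx.mean(), Kyy.mean()
    return xx + yy - 2.0 * Kxy.mean()
```

For the linear kernel on scalars, the biased estimate reduces to \((\bar{x} - \bar{y})^2\), which gives a quick check.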
Hypothesis Testing.
Among the most popular hypothesis testing methods is two-sample testing and signature kernels allow us to do this for stochastic processes. In particular, we can test
$$\displaystyle \begin{aligned} H_0: \operatorname{Law}(\mathbf{X})= \operatorname{Law}(\mathbf{Y}) \text{ against }H_1: \operatorname{Law}(\mathbf{X}) \neq \operatorname{Law}(\mathbf{Y}) \end{aligned}$$
where \(\mathbf {X}=({\mathbf {X}}_t)\) and \(\mathbf {Y}=({\mathbf {Y}}_t)\) are two stochastic processes. The data we are given are m sample trajectories from \(\mathbf {X}\) and n sample trajectories from \(\mathbf {Y}\). Such situations are not uncommon; for example, consider a cohort of patients in a medical trial who are divided into two subgroups (e.g. one taking medication and the other not) and whose medical markers are measured during the trial period. For another example, see [2] for an application of the signature MMD to test and validate hypotheses for real-world economic scenarios. Following [31], we reject \(H_0\) if
$$\displaystyle \begin{aligned} {} d_{\operatorname{k}_{\operatorname{\mathsf{Sig}}}}(\hat \mu_X, \hat \mu_Y) > c(\alpha,m,n). \end{aligned} $$
(25)
Here, \(\hat \mu _X = \frac {1}{m}\sum _{i=1}^m \delta _{{\mathbf {x}}_i}\) and \(\hat \mu _Y = \frac {1}{n}\sum _{j=1}^n \delta _{{\mathbf {y}}_j}\) are the empirical measures given by our observed sequences11 \({\mathbf {x}}_1,\ldots ,{\mathbf {x}}_m,{\mathbf {y}}_1,\ldots ,{\mathbf {y}}_n \in \mathcal {X}_{\text{paths}}\), and \(c(\alpha ,m,n)\) is a constant such that the test (25) has a significance level12 \(\alpha \). In [31], several estimators are proposed for \(d_{\operatorname {k}_{\operatorname {\mathsf {Sig}}}}(\hat \mu _X, \hat \mu _Y)\); these estimators differ in their degree of bias and their computational complexity. The function c is determined by the asymptotic distribution in [31], but this choice leads in general to a very conservative test criterion, that is, one seldom rejects \(H_0\). In practice, permutation tests often result in better empirical performance for MMD hypothesis testing, and this applies especially to the signature kernel; we refer to [15] for details and for experiments and benchmarks against other hypothesis tests. Beyond testing the equivalence of laws, kernel learning allows us to test for independence between the coordinates of a random variable [32]. Note that for stochastic processes, testing independence between coordinates is challenging since statistical dependencies can manifest between different times in different coordinates. However, quantifying and testing independence for stochastic processes has important applications such as Independent Component Analysis for stochastic processes [5, 59, 60].
Finally, note that the signature kernel \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\), and hence the MMD \(d_{\operatorname {k}_{\operatorname {\mathsf {Sig}}}}\), is parametrized by hyperparameters \(\theta _{\operatorname {k}_{\operatorname {\mathsf {Sig}}}} \in \Theta _{\operatorname {k}_{\operatorname {\mathsf {Sig}}}}\), and already for simple supervised learning the hyperparameter choice was essential for good performance. For hypothesis testing, a naive grid-search that chooses the best \(\theta \) using the entire test set is not applicable, since this would introduce correlations which invalidate the test. This is a generic problem for kernel MMDs in testing problems, and several solutions have been proposed which optimize for the test power. One solution is to follow [65], which proposes replacing \(d_{\operatorname {k}_{\operatorname {\mathsf {Sig}}}}\) by
$$\displaystyle \begin{aligned} (\mathbf{x},\mathbf{y}) \mapsto \sup_{\theta \in \Theta} d_{\operatorname{k}_{\operatorname{\mathsf{Sig}}}(\theta)}(\mathbf{x},\mathbf{y}), \end{aligned}$$
where we write \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}(\theta )\) to denote the dependence of \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) on \(\theta \). The above is again a metric and, while not optimal in terms of test power, it guarantees a valid test at the given significance level; moreover, it works well in practice with the signature kernel, see [15] for details. Other methods exist which aim to maximize the test power [33], but they require splitting the data into train/test sets to perform hyperparameter selection. A related issue is the choice of the static kernel \(\operatorname {k}\) itself: in essentially all previous applications of the signature kernel, the Euclidean inner product as static kernel is outperformed by non-linear static kernels such as RBF. However, for hypothesis testing, the signature kernel built over the standard inner product can be competitive with the signature kernel built over non-linear static kernels on some datasets. We refer again to [15] for a discussion of this and its relation to the question of hyperparameter choice. This topic will be discussed further in chapter “Market Generators: A Paradigm Shift in Financial Modeling”.
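As a small illustration of the sup construction, one can take the supremum over a finite grid of static-kernel bandwidths; the grid and the RBF kernel standing in for \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}(\theta )\) are assumptions of this sketch, not the construction of [65]:

```python
import numpy as np

def mmd2_from_gram(K, m):
    # Biased squared MMD from the pooled Gram matrix K; the first m
    # rows/columns belong to the first sample.
    return K[:m, :m].mean() + K[m:, m:].mean() - 2.0 * K[:m, m:].mean()

def sup_mmd(X, Y, gammas):
    # sup over a finite grid Theta of bandwidth parameters, with an RBF
    # kernel standing in for the parametrized signature kernel.
    Z = np.concatenate([X, Y])
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return max(
        np.sqrt(max(mmd2_from_gram(np.exp(-g * sq), len(X)), 0.0))
        for g in gammas
    )
```

Since each term is a metric, the pointwise supremum over the grid is again a metric, which is what makes this choice usable inside a valid test.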
Signature MMDs in Generative Models.
The signature MMD is a metric between any two measures on path space and can be computed efficiently. In particular, it makes no parametric assumptions on the two measures being compared, hence it can be used in a generative model to decide whether the generator produces samples that are close enough to a given target distribution. This has been applied to generative adversarial networks (GANs) to model financial data [7] and to generative models for (neural) stochastic differential equations [41]. Again, the choice of hyperparameters and static kernel is challenging from a theoretical perspective, but this is not specific to the signature kernel, and general heuristics from the machine learning literature may apply.
Gaussian Processes with Signature Covariances.
We often require not only point predictions but also estimates of the associated uncertainties. Bayesian methods provide a structured approach to update and quantify uncertainties; in particular, Gaussian processes (GPs) are a powerful non-parametric method that provides a flexible way to put priors on functions of the data; see [54]. Formally, a Gaussian process on \(\mathcal {X}\) is a stochastic process \(f=(f_{\mathbf {x}})_{\mathbf {x} \in \mathcal {X}}\) indexed by \(\mathbf {x} \in \mathcal {X}\) such that \(f \sim N(m,\operatorname {k})\) is determined by a mean function and a covariance function
$$\displaystyle \begin{aligned} m:\mathcal{X} \to \mathbb{R} \text{ and a positive definite } \operatorname{k}:\mathcal{X} \times \mathcal{X} \to \mathbb{R} \end{aligned}$$
where any finite collection of the \(f_{\mathbf {x}}\) is jointly Gaussian with
$$\displaystyle \begin{aligned} \mathbb{E}[f_{\mathbf{x}}] = m(\mathbf{x}) \text{ and } \operatorname{Cov}(f_{\mathbf{x}},f_{\mathbf{y}}) = \operatorname{k}(\mathbf{x}, \mathbf{y}). \end{aligned}$$
Note that we break here with our convention of denoting random variables by capital letters and write f instead of F, to be consistent with the GP literature. Without loss of generality, one can take \(m(\mathbf {x})=0\) and consider data in an arbitrary space \(\mathcal {X}\), provided a kernel \(\operatorname {k}\) is given. If the kernel \(\operatorname {k}\) has nice properties, these often translate into desirable properties of the GP. Given the signature kernel, one can immediately consider a GP for sequential data. However, two issues arise, one more theoretical and one more applied:
  • The existence of Gaussian processes with continuous sample trajectories is not trivial on non-compact spaces; see for example [1] for simple counterexamples. The required regularity estimates to ensure existence can be derived by using a classical result of Dudley [22] and bounding the growth of the covering number which requires us to utilize the structure of \(\operatorname {k}\) and \(\mathcal {X}\). In [69], such covering number estimates are computed for the signature kernel which then gives the existence of a GP indexed by paths; further, it yields a modulus of continuity for the GP sample paths \(\mathbf {x} \mapsto f_{\mathbf {x}}\).
  • Plain vanilla GP inference requires inverting large Gram matrices, which scales with complexity \(O(n^3)\) in the number of samples n. For most datasets this is too costly, and several approximations have been developed; arguably the most popular are variational inference and inducing point methods. Applying these efficiently to the signature kernel requires exploiting the structure of tensors. The article [69] introduced GPs with signature covariances and developed a variational approach based on sparse tensors; [45] developed a variational approach that uses diagonal approximations to the Gram matrix, which allows for fast matrix inversion.
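A minimal sketch of the plain vanilla case: given any positive definite Gram matrix, for instance one produced by a signature kernel implementation, prior samples are drawn via a Cholesky factorization, which is exactly the \(O(N^3)\) step that the sparse and variational methods above are designed to avoid. The RBF Gram matrix on a 1-D grid is only a stand-in here.

```python
import numpy as np

def gp_prior_samples(K, n_draws=3, jitter=1e-6, seed=0):
    # Draw (f_{x_1}, ..., f_{x_N}) from a zero-mean GP with covariance K.
    # The Cholesky factorization is the O(N^3) bottleneck; the jitter term
    # guards against numerically semi-definite Gram matrices.
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    L = np.linalg.cholesky(K + jitter * np.eye(N))
    return L @ rng.standard_normal((N, n_draws))

# Stand-in Gram matrix on a 1-D grid; a signature-kernel Gram matrix over
# sequences could be used in exactly the same way.
t = np.linspace(0.0, 1.0, 50)
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / 0.1 ** 2)
samples = gp_prior_samples(K, n_draws=4)
```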
Strings, Time Series, and Classic Sequence Kernels.
Kernel learning for sequences is a classic topic, and the case when the state space is finite or discrete—in our notation, \(\mathcal {X}\) is a finite or countable set of letters and \(\mathcal {X}_{\text{seq}}\) is the set of strings with letters from \(\mathcal {X}\)—has received much attention. Haussler developed the R-convolution kernel framework [36] that gave rise to string kernels; see for example [46, 48] for algorithms and applications in text learning and biology. These approaches have been extended to general state spaces with the alignment kernel [3, 18, 63], which searches over all possible alignments (“time-warpings”), although this may not result in a positive definite kernel. However, both the classic string kernel and the global alignment kernel arise as special cases or minor variations of the signature kernel construction, and other variations are possible; see [42, Section 5 and Remark 4.10] and [68, Appendix B.3]. This provides theoretical guarantees for these kernels.
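For illustration, one of the simplest string kernels in this family counts shared k-mers (the "spectrum" of a string); this is a basic instance of a convolution kernel, written here as a sketch (efficient implementations use suffix structures):

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    # k-spectrum string kernel: the inner product of the k-mer count
    # vectors of s and t, a simple instance of a convolution kernel.
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(c * ct[w] for w, c in cs.items())
```

The feature map behind this kernel is explicit and finite-dimensional (k-mer counts), which is what makes the connection to iterated-sum constructions such as the signature kernel transparent.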
Time-Parametrization (In-)variance.
The signature \(\mathbf {x} \mapsto \Phi _{\mathsf {Sig}}(\mathbf {x})\) is injective up to translation of the starting point \(\mathbf {x}(0)\) of the path \(\mathbf {x}=(\mathbf {x}(t))_{t \in [0,T]}\) and tree-like equivalence [34]. The latter means that paths are identified up to reparametrization and backtracking. In particular, \(\Phi _{\mathsf {Sig}}(\mathbf {x})=\Phi _{\mathsf {Sig}}({\mathbf {x}}^\rho )\) whenever \(\rho :[0,S]\to [0,T]\) is non-decreasing and surjective and \({\mathbf {x}}^\rho (t):= \mathbf {x} (\rho (t))\). Adding a time coordinate as in Sect. 4, \(\bar {\mathbf {x}}(t):=(t,\mathbf {x}(t))\), destroys any tree-like equivalence, that is, \(\mathbf {x} \mapsto \Phi _{\mathsf {Sig}}(\bar {\mathbf {x}})\) is injective on \(\mathcal {X}_{\text{1-var}}\) and hence distinguishes paths that differ by a time-change. Removing such a large class of invariances (the set of time-changes is infinite-dimensional) is a powerful tool and has received much attention in engineering [56], where this is known as “time-warping”; in the discrete case of strings discussed above, it is known as global alignment [43]. The theoretical guarantees discussed in Sect. 4 translate to the corresponding statements about functions or probability measures on such equivalence classes.
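This invariance can be checked numerically with nothing beyond Chen's identity for piecewise-linear paths, truncated at level two: a linear segment with increment v has signature \((1, v, v \otimes v / 2)\), and concatenation multiplies these truncated tensors. Refining the time grid is a reparametrization of the same image path, so the signature must not change.

```python
import numpy as np

def sig_level2(points):
    # Signature (truncated at level 2) of the piecewise-linear path through
    # `points`, assembled segment by segment via Chen's identity:
    # a segment with increment v contributes (1, v, outer(v, v) / 2).
    points = np.asarray(points, dtype=float)
    d = points.shape[1]
    S1, S2 = np.zeros(d), np.zeros((d, d))
    for v in np.diff(points, axis=0):
        S2 = S2 + np.outer(S1, v) + np.outer(v, v) / 2.0  # Chen's identity
        S1 = S1 + v
    return S1, S2

# Refining the grid (a reparametrization of the same image path) leaves
# the signature unchanged; adding a time coordinate t -> (t, x(t)) breaks
# this invariance, as discussed above.
path = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
refined = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.0, 0.5], [1.0, 1.0]]
```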
Weak and Adapted Topologies.
The signature MMD is a metric on the space of probability measures on either \(\mathcal {X}_{\text{seq}}\) or \(\mathcal {X}_{\text{paths}}\). Arguably the most important topology on the space of probability measures is the weak topology, and we discussed above when the signature, resp. the signature MMD, metrizes this topology. However, the weak topology completely ignores the filtration of a stochastic process. The need for finer topologies for laws of stochastic processes that take the filtration into account was recognized long ago [38] and has recently seen revived interest [53]. These finer topologies run under the name of higher rank topologies (the higher the rank, the more of the filtration structure is taken into account). In [4], higher rank signatures are introduced which take filtrations into account and allow us to metrize these higher rank topologies. The same methodology that constructs the signature kernel from a static kernel can be carried out with higher-rank signatures, and [39, 58] use the PDE solver to estimate the associated MMD in various machine learning tasks. See chapter “Adapted Topologies and Higher-Rank Signatures” for further details.
Rough Paths.
Throughout we made the assumption that our paths are of bounded variation, that is, we defined \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\) on \(\mathcal {X}_{\text{1-var}}\). However, many stochastic processes have sample paths of unbounded variation, which are therefore not elements of \(\mathcal {X}_{\text{1-var}}\). This is fairly straightforward to address with tools from stochastic analysis and rough path theory. There are two ways to think about this:
a.
In practice we are given sequences and not paths, and these can always be embedded as bounded variation paths, \(\mathcal {X}_{\text{seq}} \hookrightarrow \mathcal {X}_{\text{paths}}\).
 
b.
Even though we can embed any sequence as a bounded variation path as in (a), this can become flawed in the high-frequency scaling limit, since the iterated integrals might not converge. A more radical approach is to define the signature kernel directly for rough paths. This requires us to prescribe the first iterated integrals, since these are no longer well-defined as Riemann–Stieltjes integrals. In [42, Appendix B] this was done for geometric rough paths, which are those for which piecewise linear approximations converge13 and which hence allow for canonical choices of the higher iterated integrals. This covers semimartingales and large classes of other popular stochastic processes. The same algorithm applies if other choices of the higher order integrals are prescribed,14 but these are typically not available in the usual sequence datasets.
 
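Point (a) is the standard route in practice. A minimal sketch of the piecewise-linear embedding \(\mathcal {X}_{\text{seq}} \hookrightarrow \mathcal {X}_{\text{paths}}\), with the uniform time grid on \([0,1]\) as an assumed (and inessential) convention:

```python
import numpy as np

def embed(seq, t):
    # Piecewise-linear embedding of a sequence (x_0, ..., x_L) as a
    # bounded-variation path on [0, 1], evaluated at times t.
    seq = np.asarray(seq, dtype=float)
    grid = np.linspace(0.0, 1.0, len(seq))
    return np.stack(
        [np.interp(t, grid, seq[:, j]) for j in range(seq.shape[1])], axis=-1
    )
```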
Beyond (Geometric) Rough Paths.
Throughout we focused on the classical signature, but for many applications non-geometric or branched rough paths provide a different way to describe a path as a sequence of tensors. Much of the discussion so far can be translated to branched rough paths with minor modifications; that is, one can again define a “signature kernel” between branched rough paths.15 One can go even further and argue that the essential structure is that of embedding a sequence into a non-commutative algebra, in which we “stitch” the sequence elements together by the non-commutative multiplication of the algebra. That this multiplication is non-commutative is essential, since it algebraically captures the sequence order. From this point of view, the signature and kernel discussed so far realize this abstract method by instantiating it with the shuffle Hopf algebra and using the Lie group exponential as embedding; branched rough paths replace the shuffle algebra by the Connes–Kreimer algebra. Following this approach, we can simply parametrize the algebraic structure and learn the best choice from the data. The underlying theory is beyond our current scope, but we refer to [68] for a demonstration of how parametrizing the underlying algebraic structure and going beyond geometric rough paths can improve benchmark performance, often significantly, and how to deal with the computational challenges. Similarly, see the discussion of the non-geometric signature kernel16 in [42, Remark 4.5]. Besides improving performance, such variations lead to interesting and challenging algebraic questions which can no longer be treated within a Hopf-algebraic framework. We refer to [20] for a detailed study of the underlying algebraic structures.
Evolving (Non-Euclidean) Data.
The signature kernel transforms any static kernel on the domain \(\mathcal {X}\) into a kernel on the corresponding space of sequences \(\mathcal {X}_{\text{seq}}\), allowing us to design kernels for objects evolving in general (non-Euclidean) spaces. One example arises from the field of topological data analysis, in which a method called persistent homology allows us to quantify the multi-scale topological structure of data, such as point clouds, in highly nonlinear objects called persistence diagrams [19]. In [14], static persistence diagrams are transformed into paths, and the signature kernel allows us to capture the topological properties of a static point cloud. Given an evolving point cloud, for instance one representing a swarm of agents, we obtain a sequence of persistence diagrams which summarizes the dynamic topological structure of the data, and signature kernels again provide an effective method to perform learning tasks such as regression [30]. Another example is based on the observation that the definition of the signature only requires an inner product on the derivatives of a sequence. For paths evolving on Lie groups, where derivatives lie in the associated Lie algebra, one can formulate the signature kernel in exactly the same way [44]. Finally, the attractive theoretical properties of the signature kernel can be used to analyze classic deep learning architectures; for example, [26] studies RNNs in terms of the signature kernel, and [52] studies the scaling of ResNets from this perspective.
Likelihood-free Inference.
Given a parametrized statistical model, the likelihood function captures its fit to data. Often the underlying models are complex, which leads to intractable likelihood functions and makes finding the optimal parameter infeasible. So-called likelihood-free inference addresses these challenges, and [23] combines the signature kernel and its MMD with density ratio estimation to develop a likelihood-free inference approach; see also [24] for its use in approximate Bayesian computation.
Random Approximations.
Kernel methods allow the implicit use of a high- or infinite-dimensional feature map \(\varphi : \mathcal {X} \to H\) on a dataset \(\{{\mathbf {x}}_1,\ldots ,{\mathbf {x}}_N\} \subset \mathcal {X}\) of N points by replacing the computation and storage of the features \(\varphi ({\mathbf {x}}_1),\ldots ,\varphi ({\mathbf {x}}_N) \in H\) with the evaluation and storage of the Gram matrix \(\left (\operatorname {k}({\mathbf {x}}_i,{\mathbf {x}}_j)\right )_{i,j =1,\ldots ,N}\). However, even if the kernel is cheap to evaluate, the storage cost now scales quadratically in N, which can be prohibitive for many datasets. Moreover, the signature kernel scales quadratically in the sequence length, which results in quadratic complexity in both the number of data points and the sequence length. A classic way kernel learning deals with this is to use randomness to construct lower-dimensional approximations, such as the Nyström method, which builds a low-rank approximation to the full Gram matrix by random subsampling; see [42] for an algorithm that does this for the signature kernel. Another way is to use random Fourier features. The starting point is that different feature maps can give rise to the same kernel,
$$\displaystyle \begin{aligned} \operatorname{k}(\mathbf{x},\mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle = \langle \tilde \varphi(\mathbf{x}), \tilde \varphi(\mathbf{y}) \rangle \end{aligned}$$
where \(\varphi : \mathcal {X} \to H\) and \(\tilde \varphi : \mathcal {X} \to \tilde H\) may not even take values in the same Hilbert space. Random Fourier features construct for a given kernel \(\operatorname {k}\) associated with a feature map \(\varphi \) a low-dimensional and random feature map \(\tilde \varphi : \mathcal {X} \to \mathbb {R}^d\) such that
$$\displaystyle \begin{aligned} \operatorname{k}(\mathbf{x},\mathbf{y}) \approx \langle \tilde \varphi(\mathbf{x}), \tilde \varphi(\mathbf{y})\rangle\end{aligned}$$
with high probability. Random Fourier features have become a mainstay in kernel learning, but they rely on Bochner's theorem, which does not immediately extend to kernels on non-linear domains such as the signature kernel (recall that there is no natural addition on \(\mathcal {X}_{\text{seq}}\)). Nevertheless, [70] mimics a similar argument to construct “Random Fourier Signature Features”. This allows us to combine the strengths of the signature kernel with linear complexity in both the number of sequences and the sequence length, by using an approximation that holds with high probability. Experiments show that this is in general a favourable trade-off and allows us to tackle challenging time-series datasets. We refer the reader to the User's Guide to KSig [66] for a detailed discussion.
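For static kernels on Euclidean data, the classic construction reads as follows; this is the standard random Fourier feature map for the RBF kernel, not the sequence-level construction of [70], which requires additional ideas.

```python
import numpy as np

def rff_features(X, D=2000, gamma=0.5, seed=0):
    # Random Fourier features for the RBF kernel exp(-gamma * ||x - y||^2):
    # by Bochner's theorem its spectral measure is N(0, 2 * gamma * I), so
    # phi(x) = sqrt(2 / D) * cos(W^T x + b) with b ~ U[0, 2*pi) satisfies
    # E[<phi(x), phi(y)>] = k(x, y).
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

Storing the N × D feature matrix instead of the N × N Gram matrix is what turns the quadratic cost in the number of data points into a linear one.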
Learning to Forget.
The signature (kernel) provides a global summary of a sequence, giving equal weight to events in the recent and the distant past. Sometimes this is a strength, but sometimes one needs to emphasize recent events, especially in long time series. So far, this has mostly been addressed by ad hoc approaches, but a principled, data-driven Bayesian approach to “learning to forget” was developed in [67].

Acknowledgements

HO and DL were supported by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA). HO was also supported by the EPSRC grant Datasig [EP/S026347/1].
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Title
The Signature Kernel
Authors
Darrick Lee
Harald Oberhauser
Copyright Year
2026
DOI
https://doi.org/10.1007/978-3-031-97239-3_3
1
Throughout, we use the term kernel to refer to a symmetric positive definite function.
 
3
Addition is defined degree-wise: \(({\mathbf {s}}_0,{\mathbf {s}}_1,{\mathbf {s}}_2,\ldots ) + ({\mathbf {t}}_0,{\mathbf {t}}_1,{\mathbf {t}}_2,\ldots ) := ({\mathbf {s}}_0+{\mathbf {t}}_0,{\mathbf {s}}_1+{\mathbf {t}}_1,{\mathbf {s}}_2+{\mathbf {t}}_2,\ldots )\). That is, we add scalars to scalars, vectors to vectors, etc.
 
4
Recall that \(\mathbf {w} \in \bigoplus _{m \geq 0} (\mathbb {R}^d)^{\otimes m}\) denotes a sequence \(\mathbf {w} = ({\mathbf {w}}^I)\) where \({\mathbf {w}}^I\) is nonzero for finitely many multi-indices I, whereas \(\mathbf {w}' \in \prod (\mathbb {R}^d)^{\otimes m}\) denotes a sequence \(\mathbf {w}' = ({\mathbf {w}}^{\prime I})\) where \({\mathbf {w}}^{\prime I}\) may be nonzero for infinitely many I.
 
5
The random variable \(\mathbf {Y}\) is explicitly specified by the density \(\mathbb {P}(\mathbf {Y} \in dy) = f(y) (1+\sin {}(2 \pi \log (y)))\,dy\), where f denotes the density of the log-normal \(\mathbf {X}\). In fact, [37] shows that there exist uncountably many random variables that have the same moments as \(\mathbf {X}\).
 
6
Many choices of topologies for paths and unparametrized path are possible; for a detailed study for unparametrized paths, see [9].
 
8
Recall that a topological vector space is locally compact if and only if it is finite-dimensional. See also Remark 8.7: even a compact set carrying nearly all the mass of the measure is not enough for characteristicness.
 
9
Formally, we are given samples from \({\mathbf {X}}_{\mathbf {x},\epsilon }\) and not from \(\mathbf {X}\).
 
10
Analogous to Proposition 4.1, \(\bar H_{1-var}\) denotes the paths of \(H_{1-var}\) that are augmented with a time coordinate. Explicitly, paths in \(\bar H_{1-var}\) are paths evolving in \(H \oplus \mathbb {R}\) of bounded 1-variation.
 
11
With a slight abuse of notation, each \({\mathbf {x}}_i \in \mathcal {X}_{\text{seq}}\) denotes a whole sequence and not a single sequence entry as before.
 
12
Recall that the type I error of a test is the probability of falsely rejecting \(H_0\), and the type II error is that of falsely accepting \(H_0\). If the type I error is bounded by \(\alpha \), one says the test has significance level \(\alpha \).
 
13
In the notation of [42, Appendix B], \(\operatorname {k}^{+}_{(d,m)}\) with \(d=\lfloor p \rfloor \), where p denotes the p-variation of the rough path. In our current notation, \(d,m\) are part of the parameter \(\theta _{algebra} \in \Theta _{algebra}\) of \(\operatorname {k}_{\operatorname {\mathsf {Sig}}}\).
 
14
For example, to describe a “rough” path in d dimensions, one could consider sequences \(\mathbf {x}=({\mathbf {x}}_0,\ldots ,{\mathbf {x}}_L)\) in the state space \(\mathcal {X}=G_{m,d}\) of group-like elements; informally, \({\mathbf {x}}_i\) are the first m “iterated integrals” of the underlying path on the interval \([t_i,t_{i+1}]\). In the notation of Sect. 7.3, producing such group-like elements from raw sequential data could be regarded as a preprocessing step.
 
15
This can be found in the arXiv version v1 of [16, Appendix D]; Appendix D is not part of the published version [15] for brevity.
 
16
Denoted as \(\operatorname {k}^+_{(d,m)}\) when \(d\neq m\) in the notation of [42].
 
1. R. Adler, J. Taylor, Random Fields and Geometry. Springer Monographs in Mathematics (Springer, New York, 2009)
2. H. Andrès, A. Boumezoued, B. Jourdain, Signature-based validation of real-world economic scenarios. arXiv preprint arXiv:2208.07251 (2022)
3. C. Bahlmann, B. Haasdonk, H. Burkhardt, Online handwriting recognition with support vector machines-a kernel approach, in Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition (IEEE, New York, 2002), pp. 49–54
4. P. Bonnier, C. Liu, H. Oberhauser, Adapted topologies and higher rank signatures. Ann. Appl. Probab. 33(3), 2136–2175 (2023)
5. P. Bonnier, H. Oberhauser, Signature cumulants, ordered partitions, and independence of stochastic processes. Bernoulli 26(4), 2727–2757 (2020)
6. R.W. Brockett, Volterra series and geometric control theory. Automatica 12(2), 167–176 (1976)
7. H. Buehler, B. Horvath, T. Lyons, I.P. Arribas, B. Wood, A data-driven market simulator for small data environments. arXiv preprint arXiv:2006.14498 (2020)
8. T. Cass, T. Lyons, X. Xu, Weighted signature kernels. Ann. Appl. Probab. 34(1A), 585–626 (2024)
9. T. Cass, W.F. Turner, Topologies on unparameterised path space. J. Funct. Anal. 286(4), 110261 (2024)
10. K.-T. Chen, Iterated integrals and exponential homomorphisms. Proc. Lond. Math. Soc. s3-4(1), 502–512 (1954)
11. K.-T. Chen, Integration of paths, geometric invariants and a generalized Baker–Hausdorff formula. Ann. Math. 65(1), 163–178 (1957)
12. K.-T. Chen, Integration of paths—a faithful representation of paths by noncommutative formal power series. Trans. Am. Math. Soc. 89(2), 395–407 (1958)
13. I. Chevyrev, T. Lyons, Characteristic functions of measures on geometric rough paths. Ann. Probab. 44(6), 4049–4082 (2016)
14. I. Chevyrev, V. Nanda, H. Oberhauser, Persistence paths and signature features in topological data analysis. IEEE Trans. Pattern Anal. Mach. Intell. 42, 192–202 (2018)
15. I. Chevyrev, H. Oberhauser, Signature moments to characterize laws of stochastic processes. J. Mach. Learn. Res. 23(176), 1–42 (2022)
16. I. Chevyrev, H. Oberhauser, Signature moments to characterize laws of stochastic processes. arXiv preprint arXiv:1810.10971v1 (2022)
17. F. Cucker, S. Smale, On the mathematical foundations of learning. Bull. Am. Math. Soc. 39(1), 1–49 (2002)
18. M. Cuturi, J.-P. Vert, Ø. Birkenes, T. Matsui, A kernel for time series based on global alignments, in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing—ICASSP '07, vol. 2 (2006), pp. II-413–II-416
19. T.K. Dey, Y. Wang, Computational Topology for Data Analysis (Cambridge University Press, Cambridge, 2022)
20. J. Diehl, K. Ebrahimi-Fard, N. Tapia, Generalized iterated-sums signatures. J. Algebra 632, 801–824 (2023)
21. P. Drineas, M.W. Mahoney, N. Cristianini, On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res. 6(12), 2153–2175 (2005)
22. R.M. Dudley, Sample functions of the Gaussian process, in Selected Works of R.M. Dudley (Springer, Berlin, 2010), pp. 187–224
23. J. Dyer, P. Cannon, S.M. Schmon, Amortised likelihood-free inference for expensive time-series simulators with signatured ratio estimation, in International Conference on Artificial Intelligence and Statistics (2022)
24. J. Dyer, J. Fitzgerald, B. Rieck, S.M. Schmon, Approximate Bayesian computation for panel data with signature maximum mean discrepancies, in NeurIPS 2022 Temporal Graph Learning Workshop (2022)
25. A. Fermanian, Functional linear regression with truncated signatures. J. Multivar. Anal. 192, 105031 (2020)
26. A. Fermanian, P. Marion, J.-P. Vert, G. Biau, Framing RNN as a kernel method: a neural ODE approach. Adv. Neural Inf. Proces. Syst. 34, 3121–3134 (2021)
27. M. Fliess, Fonctionnelles causales non linéaires et indéterminées non commutatives. Bulletin de la Société Mathématique de France 109, 3–40 (1981)
28. D. Garreau, W. Jitkrittum, M. Kanagawa, Large sample analysis of the median heuristic. arXiv preprint arXiv:1707.07269 (2018)
29. R. Giles, A generalization of the strict topology. Trans. Am. Math. Soc. 161, 467–474 (1971)
30. C. Giusti, D. Lee, Signatures, Lipschitz-free spaces, and paths of persistence diagrams. SIAM J. Appl. Algebra Geom. 7(4), 828–866 (2023)
31. A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, A. Smola, A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012)
32. A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, A. Smola, A kernel statistical test of independence, in Advances in Neural Information Processing Systems, vol. 20 (2007)
33. A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, B.K. Sriperumbudur, Optimal kernel choice for large-scale two-sample tests, in Advances in Neural Information Processing Systems, ed. by F. Pereira, C. Burges, L. Bottou, K. Weinberger, vol. 25 (Curran Associates, Inc., New York, 2012)
34. B. Hambly, T. Lyons, Uniqueness for the signature of a path of bounded variation and the reduced path group. Ann. Math. 171(1), 109–167 (2010)
35. F.R. Hampel, Robust Statistics: The Approach Based on Influence Functions (Wiley, New York, 1986)
36. D. Haussler, Convolution Kernels on Discrete Structures. Technical report (University of California at Santa Cruz, 1999)
37. C.C. Heyde, On a property of the lognormal distribution. J. R. Stat. Soc. Ser. B Methodol. 25(2), 392–393 (1963)
38. D.N. Hoover, H.J. Keisler, Adapted probability distributions. Trans. Am. Math. Soc. 286, 159–201 (1984)
39. B. Horvath, M. Lemercier, C. Liu, T. Lyons, C. Salvi, Optimal stopping via distribution regression: a higher rank signature approach. arXiv preprint arXiv:2304.01479 (2023)
40. P.J. Huber, E.M. Ronchetti, Robust Statistics (Wiley, New York, 2011)
41. P. Kidger, J. Foster, X. Li, H. Oberhauser, T.J. Lyons, Neural SDEs as infinite-dimensional GANs, in International Conference on Machine Learning (PMLR, 2021), pp. 5453–5463
42. F.J. Kiraly, H. Oberhauser, Kernels for sequentially ordered data. J. Mach. Learn. Res. 20(31), 1–45 (2019)
43. J.B. Kruskal, An overview of sequence comparison: time warps, string edits, and macromolecules. SIAM Rev. 25(2), 201–237 (1983)
44. D. Lee, R. Ghrist, Path signatures on Lie groups. arXiv preprint arXiv:2007.06633 (2020)
45. M. Lemercier, C. Salvi, T. Cass, E.V. Bonilla, T. Damoulas, T.J. Lyons, SigGPDE: scaling sparse Gaussian processes on sequential data, in International Conference on Machine Learning (PMLR, 2021), pp. 6233–6242
46. C.S. Leslie, R. Kuang, Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res. 5, 1435–1455 (2004)
47. C. Litterer, H. Oberhauser, On a Chen–Fliess approximation for diffusion functionals. Monatsh. Math. 175(4), 577–593 (2014)
48. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, C.J.C.H. Watkins, Text classification using string kernels. J. Mach. Learn. Res. 2, 419–444 (2002)
49. T. Lyons, Z. Qian, System Control and Rough Paths (Clarendon Press, Oxford, 2007)
50. C. Moler, C. Van Loan, Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev. 45(1), 3–49 (2003)
51. K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, Kernel mean embedding of distributions: a review and beyond. Found. Trends Mach. Learn. 10(1-2), 1–141 (2017)
52. N. Muca Cirone, M. Lemercier, C. Salvi, Neural signature kernels as infinite-width-depth-limits of controlled ResNets, in Proceedings of the 40th International Conference on Machine Learning, ed. by A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett. Proceedings of Machine Learning Research, vol. 202 (PMLR, 2023), pp. 25358–25425
53. G. Pammer, A note on the adapted weak topology in discrete time. arXiv preprint arXiv:2205.00989 (2022)
54. C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning Series (MIT Press, Cambridge, 2005)
55. C. Reutenauer, Free Lie Algebras. London Mathematical Society Monographs (Oxford University Press, Oxford, 1993)
56. H. Sakoe, S. Chiba, Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 26(1), 43–49 (1978)
57. C. Salvi, T. Cass, J. Foster, T. Lyons, W. Yang, The signature kernel is the solution of a Goursat PDE. SIAM J. Math. Data Sci. 3(3), 873–899 (2021)
58. C. Salvi, M. Lemercier, C. Liu, B. Horvath, T. Damoulas, T. Lyons, Higher order kernel mean embeddings to capture filtrations of stochastic processes, in Advances in Neural Information Processing Systems, ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J.W. Vaughan, vol. 34 (Curran Associates, Inc., New York, 2021), pp. 16635–16647
59. A. Schell, A robustness analysis of blind source separation. arXiv preprint arXiv:2303.10104 (2023)
60. A. Schell, H. Oberhauser, Nonlinear independent component analysis for discrete-time and continuous-time signals. Ann. Stat. 51(2), 487–518 (2023)
61. M. Schetzen, The Volterra and Wiener Theories of Nonlinear Systems (Krieger, New York, 2006)
62. B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (MIT Press, Cambridge, 2018)
63. H. Shimodaira, K.-i. Noma, M. Nakai, S. Sagayama, Dynamic time-alignment kernel in support vector machine. Adv. Neural Inf. Proces. Syst. 14 (2001)
64. C.-J. Simon-Gabriel, A. Barp, B. Schölkopf, L. Mackey, Metrizing weak convergence with maximum mean discrepancies. arXiv preprint arXiv:2006.09268 (2020)
65.
go back to reference B.K. Sriperumbudur, K. Fukumizu, A. Gretton, G.R.G. Lanckriet, B. Schölkopf, Kernel choice and classifiability for RKHS embeddings of probability distributions, in NIPS (2009)
66.
go back to reference C. Tóth, D.J.D. Cruz, H. Oberhauser, A User’s Guide to KSIG: GPU-Accelerated Computation of the Signature Kernel. arxiv:2501.07145 (2025)
67.
go back to reference C. Tóth, M. Adachi, M.A. Osborne, H. Oberhauser, Learning to Forget: Bayesian Time Series Forecasting using Recurrent Sparse Spectrum Signature Gaussian Processes, in Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, ed. by Y. Li, S. Mandt, S. Agrawal, E. Khan, vol. 258 (Proceedings of Machine Learning Research, PMLR) pp. 4654–4662 (2025). https://raw.githubusercontent.com/mlresearch/v258/main/assets/toth25b/toth25b.pdfhttps://proceedings.mlr.press/v258/toth25b.html
68.
go back to reference C. Tóth, P. Bonnier, H. Oberhauser, Seq2Tens: An efficient representation of sequences by low-rank tensor projections, in International Conference on Learning Representations (2021)
69.
go back to reference C. Tóth, H. Oberhauser, Bayesian learning from sequential data using Gaussian Processes with Signature Covariances, in Proceedings of International Conference on Machine Learning (ICML), vol. 1 (2020)
70.
go back to reference C. Tóth, H. Oberhauser, Z. Szabó, Random Fourier Signature Features. SIAM J. Math. Data Sci. 7(1), 329–354 (2025). https://doi.org/10.1137/23M1620478MathSciNetCrossRef