A Primer on the Signature Method in Machine Learning

  • Open Access
  • 2026
  • OriginalPaper
  • Chapter

Abstract

This chapter provides a comprehensive introduction to the signature method in machine learning, covering its theoretical foundations and practical applications. It begins with the concept of paths and their use in describing time-ordered data, such as financial time series, text, and evolving networks, and then defines the signature of a path: a sequence of numbers associated with a path, given by iterated integrals. The chapter discusses the fundamental properties of the signature, including its invariance under reparametrization, the shuffle product, Chen’s identities, time reversal, and the log-signature, and explores practical applications of the signature method in machine learning, including feature extraction, classification, and regression, with numerous examples and visualizations illustrating the concepts and computations involved. It also discusses the use of the signature method in industries such as finance, technology, and healthcare, and concludes with further topics and extensions, including the signature of rough paths, path uniqueness, paths with jumps, the moment problem, and higher dimensions. The chapter thus serves as a resource for professionals looking to understand and apply the signature method in their work.

1 Introduction

Paths are used ubiquitously to describe time-ordered data, which appear in a variety of contexts. There are a number of important examples of time-ordered data, some of which we do not often think of as such:
  • Financial time series—e.g. price of a stock or index,
  • Text—e.g. “Lorem ipsum dolor sit amet, consectetur adipiscing elit”,
  • Evolving networks—e.g. time evolution of contacts on social media.
Each of these examples can be mathematically described as a function X from an index set \(I\subset \mathbb R\) of real numbers to a state space \(\mathcal {X}\), that is, a path \(X: I\to \mathcal {X}\). For the three examples above, the corresponding index set and state space are
  • Financial time series—finite interval \(I=[a,b]\) and Euclidean space \(\mathcal {X}=\mathbb R^d\),
  • Text—non-negative integers \(I=\{0,1,\ldots , N\}\) and alphabet \(\mathcal {X}=\{a,A,b,B,\ldots \}\),
  • Evolving networks—non-negative integers \(I=\{0,1,\ldots , N\}\) and space \(\mathcal {X}\) of graphs or adjacency matrices.
In the sequel, we restrict ourselves to time-ordered data of the form \(X: [a,b]\to \mathbb R^d\). One may frequently arrive at this case by composing with a function \(\varphi : \mathcal {X}\to \mathbb R^d\) and by suitably interpolating discrete points if starting from discrete data (we give further details in Sect. 3).
Time-ordered data may be complicated, sometimes requiring a large amount of memory just to store, so in analyzing such data it becomes important to find features that are both descriptive and compact.
Over the past decade, the signature of a path has become a notable feature that can satisfy both of these properties. Roughly speaking, the signature is a sequence of numbers associated with a path, given by iterated integrals. The signature turns out to be highly descriptive, making it a powerful theoretical and practical tool to study paths. In a way that we make precise below, the signature is a natural generalization of polynomials to (parametrized or unparametrized) paths.
This chapter provides an introduction to the theoretical and practical aspects of the signature. The signature has gained attention in the mathematical community over the past 20 years in large part due to its connection with Lyons’ theory of rough paths. However, one of the main points we wish to emphasize is that no knowledge beyond classical integration theory is required to define and study the basic properties of the signature. Indeed, K. T. Chen studied the signature of a path in the 1950s, being one of the first authors to do so, and his primary results can be stated completely in terms of piecewise smooth paths, which already provide an elegant and deep mathematical theory. As such, we aim to make the chapter as self-contained as possible while remaining at an elementary level.
Since the first version of these lecture notes appeared in 2016, the applications of signatures have expanded dramatically; we do not review many of these interesting and important developments. We have instead chosen several conceptual applications which we hope, together with the theoretical foundations, can serve as an introduction to anyone wishing to learn more advanced topics or to apply the signature in practice.
Note
Lectures based on these notes were given at the International Center for Mathematical Sciences, Edinburgh, in 2021 as part of the European Summer School in Financial Mathematics 14th edition. The recordings are available at: https://www.youtube.com/playlist?list=PLUbgZHsSoMEXELP8YJ7Jf863PNFaChact.

2 Theoretical Foundations

The purpose of this section is to introduce the definition of the signature and present its fundamental properties. In Sect. 2.2 we provide the definition of the signature and several motivations for why it is a natural object to investigate. In Sect. 2.3 we discuss the core properties of the signature that we believe are most fundamental; this includes invariance under reparametrization, the shuffle product, Chen’s identities, time reversal, and the log-signature. With the exception of the log-signature, we provide complete proofs of all statements. Finally, in Sect. 2.5, we briefly highlight the role of the signature in the theory of rough paths and discuss further topics, including path uniqueness (i.e. the extent to which the signature determines the underlying path), the case of discontinuous paths, the moment problem for random paths, and recent work on multi-dimensional data.
A further in-depth discussion, along with alternative proofs of many of the results covered in the first part, can now be found in several texts, including the St. Flour lecture notes [57].

2.1 Preliminaries

2.1.1 Paths in Euclidean Space

Paths form one of the basic elements of this theory. A path X with values in \(\mathbb {R}^d\) is a continuous mapping from some interval \([a,b]\) to \(\mathbb {R}^d\), written as \(X:\;[a,b]\to \mathbb {R}^d\). We use the subscript notation \(X_t = X(t)\) to denote dependence on the parameter \(t \in [a,b]\).
Throughout the text, unless otherwise stated, we always assume that paths are piecewise continuously differentiable (more generally, one may assume that the paths are of bounded variation for which the same classical theory holds). By a smooth path, we mean a path which has derivatives of all orders.
Two simple examples of smooth paths in \(\mathbb R^2\) are presented in Fig. 1:
$$\displaystyle \begin{aligned} \text{left panel:}\;\;\;\; X_t &= \{X^1_t,X^2_t\} = \{t, t^3\}\;,\;t\in[-2,2] ,\\ \text{right panel:}\;\;\;\; X_t &= \{X^1_t,X^2_t\} = \{\cos t, \sin t\}\;,\;t\in[0,2\pi] . \end{aligned} $$
Fig. 1
Example of two-dimensional smooth paths
This parametrization generalizes to d dimensions (\(X_t\in \mathbb {R}^d\)) as
$$\displaystyle \begin{aligned} X: [a,b]\to\mathbb{R}^d\;,\;\;X_t=\left\{X^{1}_t,X^2_t,X^3_t,\ldots ,X^d_t\right\}. \end{aligned}$$
An example of a piecewise linear path is presented in Fig. 2:
$$\displaystyle \begin{aligned} X_t = \{X^1_t,X^2_t\} = \{t,f(t)\}\;,\;t\in[0,1], \end{aligned}$$
where f is a piecewise linear function on the time domain \([0,1]\). One possible example of the function f is a stock price at time t. Such non-smooth paths may represent sequential data or time series, typically consisting of successive measurements made over a time interval.
Fig. 2
Example of non-smooth path

2.1.2 Path Integrals

We now briefly review the path (or line) integral. Given two paths \(Y : [a,b] \to \mathbb R\) and \(X : [a,b] \to \mathbb R\), we define the integral of Y  against X by
$$\displaystyle \begin{aligned} \int_a^b Y_t\,dX_t=\int_{a}^{b} Y_t \dot X_t \, dt, \end{aligned}$$
where the last integral is the usual (Riemann) integral of a (piecewise) continuous bounded function and where we use the “upper-dot” notation for differentiation with respect to a single variable: \(\dot X_t = dX_t/dt\).
Example 2.1
Consider the constant path \(Y_t = 1\) for all \(t \in [a,b]\). Then, by the fundamental theorem of calculus, the path integral of Y  against any path \(X : [a,b] \to \mathbb R\) is simply the increment of X:
$$\displaystyle \begin{aligned} \int_a^b dX_t=\int_{a}^{b} \dot X_t \,dt = X_b - X_a. \end{aligned}$$
Example 2.2
Consider the path \(X_t = t\) for all \(t \in [a,b]\). Then \(\dot X_t = 1\) for all \(t \in [a,b]\), and so the path integral of any \(Y : [a,b] \to \mathbb R\) is the usual Riemann integral
$$\displaystyle \begin{aligned} \int_a^b Y_t \, dX_t=\int_{a}^{b} Y_t \,dt. \end{aligned}$$
Example 2.3
We present a numerical example. Consider the 2-dimensional path
$$\displaystyle \begin{aligned} X_t = \{X^1_t,X^2_t\} = \{t^2,t^3\},\;\;\; t \in [0,1]. \end{aligned}$$
Then we can compute the path integral, using \(dX^2_t = \dot {X}^2_t \, dt = 3t^2 \, dt\):
$$\displaystyle \begin{aligned} \int_0^1 X^1_t \, dX^2_t= \int_0^1 t^2 3t^2 \, dt =\frac{3}{5}. \end{aligned}$$
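Such path integrals are easy to approximate numerically. The following sketch (assuming NumPy is available) checks the computation above with a left-point Riemann sum over the increments of \(X^2\), which requires no derivative formula at all:

```python
import numpy as np

# Approximate int_0^1 X^1_t dX^2_t for X_t = (t^2, t^3) by the Riemann sum
# sum_k X^1_{t_k} (X^2_{t_{k+1}} - X^2_{t_k}).
t = np.linspace(0.0, 1.0, 100_001)
X1, X2 = t**2, t**3

integral = float(np.sum(X1[:-1] * np.diff(X2)))
print(integral)  # close to 3/5 = 0.6
```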

2.2 The Signature of a Path

2.2.1 Definition

Having recalled the path integral of one real-valued path against another, we are now ready to define the signature of a path. For a path \(X : [a,b] \to \mathbb R^d\), recall that we denote the coordinate paths by \((X^1_t, \ldots, X^d_t)\), where each \(X^i : [a,b] \to \mathbb R\) is a real-valued path. For any single index \(i \in \{1,\ldots , d\}\), let us define the quantity
$$\displaystyle \begin{aligned} {} S(X)^i_{a,t} = \int_{a < s < t} dX^i_s = X^i_t - X^i_a, \end{aligned} $$
(1)
which is the increment of the i-th coordinate of the path at time \(t \in [a,b]\). We emphasize that \(S(X)^i_{a,\cdot } : [a,b] \to \mathbb R\) is itself a real-valued path. Note that a in the subscript of \(S(X)^i_{a,t}\) is only used to denote the starting point of the interval \([a,b]\).
Now for any pair \(i,j \in \{1, \ldots , d\}\), let us define the double-iterated integral
$$\displaystyle \begin{aligned} {} S(X)^{i,j}_{a,t} = \int_{a < s < t} S(X)^i_{a,s} \, dX^j_s=\int_{a < r < s < t} dX^i_{r}\,dX^j_{s}, \end{aligned} $$
(2)
where \(S(X)^i_{a,s}\) is given by (1).
The integration limits in (2) correspond to integration over a triangle (or, more generally, over a simplex in higher dimensions). We emphasize again that \(S(X)^i_{a,s}\) and \(X^j_s\) are simply real-valued paths, so the expression (2) is a special case of the path integral, and that \(S(X)^{i,j}_{a,\cdot } : [a,b] \to \mathbb R\) is itself a real-valued path.
Likewise, for any triple \(i,j,k \in \{ 1, \ldots , d\}\), we define the triple-iterated integral
$$\displaystyle \begin{aligned} S(X)^{i,j,k}_{a,t} = \int_{a < s < t} S(X)^{i,j}_{a,s} \, dX^k_s =\int_{a < q < r < s < t} dX^i_{q}\, dX^j_{r}\,dX^k_{s}. \end{aligned}$$
Again, since \(S(X)^{i,j}_{a,s}\) and \(X^k_s\) are real-valued paths, the above is just a special case of the path integral, and \(S(X)^{i,j,k}_{a,\cdot } : [a,b] \to \mathbb R\) is itself a real-valued path.
We can continue recursively, and for any integer \(k \geq 1\) and collection of indexes \(i_1, \ldots , i_k \in \{ 1, \ldots , d\}\), we define
$$\displaystyle \begin{aligned} S(X)^{i_1,\ldots, i_k}_{a,t} = \int_{a< s < t} S(X)^{i_1,\ldots, i_{k-1}}_{a,s} \, dX^{i_k}_s. \end{aligned}$$
As before, since \(S(X)^{i_1,\ldots , i_{k-1}}_{a,s}\) and \(X^{i_k}_s\) are real-valued paths, the above is defined as a path integral, and \(S(X)^{i_1,\ldots , i_k}_{a,\cdot } : [a,b] \to \mathbb R\) is a real-valued path. Observe that we may equivalently write
$$\displaystyle \begin{aligned} {} S(X)^{i_1,\ldots, i_k}_{a,t} = \int_{a < t_k < t} \ldots \int_{a < t_1 < t_{2}} dX^{i_1}_{t_1} \ldots \, dX^{i_k}_{t_k}. \end{aligned} $$
(3)
The real number \(S(X)^{i_1,\ldots , i_k}_{a,b}\) is called the k-fold iterated integral of X along the indexes \(i_1,\ldots , i_k\).
Definition 2.4 (Signature)
The signature of a path \(X : [a,b] \to \mathbb R^d\), denoted by \(S(X)_{a,b}\), is the collection (infinite sequence) of all the iterated integrals of X. Formally, \(S(X)_{a,b}\) is the collection of real numbers
$$\displaystyle \begin{aligned} {} S(X)_{a,b}= (1, S(X)^1_{a,b}, \ldots, S(X)^d_{a,b}, S(X)^{1,1}_{a,b}, S(X)^{1,2}_{a,b}, \ldots) \end{aligned}$$
where the “zeroth” term, by convention, is equal to 1, and the index in the superscript runs along the set of all multi-indexes
$$\displaystyle \begin{aligned} {} W = \{(i_1,\ldots, i_k) \mid k \geq 0, i_1,\ldots, i_k \in \{1,\ldots, d\}\}. \end{aligned} $$
(4)
The set W above is also frequently called the set of words on the alphabet \(A = \{1,\ldots , d\}\). Remark that W contains the empty multi-index \(I=\emptyset \), corresponding to \(k=0\) in (4), and \(S(X)^\emptyset _{a,b}=1\), which is the first element on the right-hand side of the definition of \(S(X)_{a,b}\) above. The motivation for the convention \(S(X)^\emptyset _{a,b}=1\) is Chen’s identity, which we discuss in Sect. 2.3.3.
Example 2.5
Consider an alphabet consisting of three letters only: \(\{1,2,3\}\). There are infinitely many words that can be composed from this alphabet, namely
$$\displaystyle \begin{aligned} \{1,2,3\} \rightarrow (\emptyset, 1,2,3,11,12,13,21,22,23,31,32,33,111,112,113,121,\dots). \end{aligned}$$
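For small alphabets, the words are simple to enumerate by length. A minimal Python sketch (the ordering, by length and then lexicographically, is our choice for illustration):

```python
from itertools import product

def words(alphabet, max_len):
    """All words over `alphabet` of length 0..max_len, shortest first."""
    out = [()]  # the empty word, corresponding to k = 0
    for k in range(1, max_len + 1):
        out.extend(product(alphabet, repeat=k))
    return out

W = words((1, 2, 3), 2)
# 1 + 3 + 9 = 13 words: (), (1,), (2,), (3,), (1, 1), (1, 2), ...
print(len(W))
```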
An important property of the signature, which we immediately note, is that the iterated integrals of a path X are independent of the starting point of X. That is, if for some \(x \in \mathbb R^d\), we define the path \(\widetilde X_t = X_t + x\), then \(S(\widetilde X)^{i_1,\ldots , i_k}_{a,b} = S(X)^{i_1,\ldots ,i_k}_{a,b}\).
We often consider the k-th level of the signature, defined as the finite collection of all terms \(S(X)^{i_1,\ldots ,i_k}_{a,b}\) where the multi-index is of length k. For example, the first level of the signature is the collection of d real numbers \(S(X)^1_{a,b}, \ldots , S(X)^d_{a,b}\), and the second level is the collection of \(d^2\) real numbers
$$\displaystyle \begin{aligned} S(X)^{1,1}_{a,b}, \ldots, S(X)^{1,d}_{a,b}, S(X)^{2,1}_{a,b}, \ldots, S(X)^{d,d}_{a,b}. \end{aligned}$$

2.2.2 Examples

Example 2.6 (1-Dimensional Path)
The simplest example of a signature is that of a 1-dimensional path. In this case, our set of indexes (or alphabet) is of size one, \(A = \{1\}\), and the set of multi-indexes (or words) is \(W = \{(1,\ldots , 1) \mid k \geq 0 \}\), where 1 appears k times in \((1,\ldots , 1)\).
Consider the path \(X : [a,b] \to \mathbb R\), \(X_t = t\). One can immediately verify that the signature of X is given by
$$\displaystyle \begin{aligned} \begin{array}{rcl} S(X)^\emptyset_{a,b} & =&\displaystyle 1, \\ S(X)^1_{a,b} & =&\displaystyle X_b - X_a, \\ S(X)^{1,1}_{a,b} & =&\displaystyle \frac{(X_b - X_a)^2}{2!}, \\ S(X)^{1,1,1}_{a,b} & =&\displaystyle \frac{(X_b - X_a)^3}{3!}, \\ & \vdots&\displaystyle . \end{array} \end{aligned} $$
One can in fact show that the above expression of the signature remains true for any path \(X : [a,b] \to \mathbb R\). Hence, for 1-dimensional paths, the signature depends only on the increment \(X_b - X_a\).
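This identity is easy to verify numerically. The sketch below (assuming NumPy) builds the iterated integrals of an arbitrary smooth 1-dimensional path recursively, via cumulative left-point sums, and compares them with \((X_b - X_a)^k/k!\):

```python
import numpy as np
from math import factorial

t = np.linspace(0.0, 1.0, 50_001)
X = np.sin(3 * t) + t**2        # an arbitrary smooth 1-dimensional path

dX = np.diff(X)
S_prev = np.ones_like(t)        # the zeroth iterated integral: constant 1
levels = []
for k in range(1, 4):
    # cumulative approximation of S_k(t) = int_a^t S_{k-1}(s) dX_s
    S_k = np.concatenate([[0.0], np.cumsum(S_prev[:-1] * dX)])
    levels.append(S_k[-1])
    S_prev = S_k

increment = X[-1] - X[0]
for k, val in enumerate(levels, start=1):
    print(k, val, increment**k / factorial(k))  # the two columns agree
```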
Example 2.7
We next present an example of the signature for a 2-dimensional path. Our set of indexes is now \(A = \{1,2\}\), and the set of multi-indexes is
$$\displaystyle \begin{aligned} W = \{(i_1,\ldots, i_k) \mid k \geq 0, i_1,\ldots, i_k \in \{1,2\} \}, \end{aligned}$$
the collection of all finite sequences of 1’s and 2’s. Consider a parabolic path in \(\mathbb {R}^2\), as depicted in Fig. 3, defined by
$$\displaystyle \begin{aligned} \begin{array}{rcl} {} X_t& =&\displaystyle \{X^1_t,X^2_t\}=\{3+t,(3+t)^2\}\;,\;\;\;t\in[0,5]\;,\;(a=0, b=5),\\ dX_t& =&\displaystyle \{dX^1_t,dX^2_t\}=\{dt,2(3+t)\, dt\}. \end{array} \end{aligned} $$
(5)
Fig. 3
Example of a 2-dimensional path parametrized in (5)
A straightforward computation gives
$$\displaystyle \begin{aligned} \begin{array}{rcl} S(X)^\emptyset_{0,5} & =&\displaystyle 1, \\ S(X)^1_{0,5}& =&\displaystyle \int_{0<t<5}dX^{1}_t=\int_{0}^{5}dt=X^1_5-X^1_0=5,\\ S(X)^2_{0,5}& =&\displaystyle \int_{0<t<5}dX^{2}_t=\int_{0}^{5}2\,(3+t)\,dt=X^2_5-X^2_0=55,\\ S(X)^{1,1}_{0,5}& =&\displaystyle \iint_{0<t_1<t_2<5}dX^{1}_{t_1} \,dX^{1}_{t_2}=\int_{0}^5\left[\int_{0}^{t_2}dt_1\right]dt_2=\frac{25}{2},\\ S(X)^{1,2}_{0,5} & =&\displaystyle \iint_{0<t_1<t_2<5}dX^{1}_{t_1} \, dX^{2}_{t_2}=\int_{0}^5\left[\int_{0}^{t_2}dt_1\right]2\,(3+t_2)\,dt_2=\frac{475}{3}, \\ S(X)^{2,1}_{0,5} & =&\displaystyle \iint_{0<t_1<t_2<5}dX^{2}_{t_1} \, dX^{1}_{t_2}=\int_{0}^5\left[\int_{0}^{t_2}2\,(3+t_1)\,dt_1\right]dt_2=\frac{350}{3}, \\ S(X)^{2,2}_{0,5} & =&\displaystyle \iint_{0<t_1<t_2<5}dX^{2}_{t_1} \, dX^{2}_{t_2}=\int_{0}^5\left[\int_{0}^{t_2}2\,(3+t_1)\,dt_1\right]2\,(3+t_2)\,dt_2\\ & =&\displaystyle \frac{3025}{2},\\ S(X)^{1,1,1}_{0,5}& =&\displaystyle \iiint_{0<t_1<t_2<t_3<5}dX^{1}_{t_1} \, dX^{1}_{t_2}\, dX^{1}_{t_3}=\int_0^5\left[\int_{0}^{t_3}\left[\int_{0}^{t_2}dt_1\right]dt_2\right]dt_3\\ & =&\displaystyle \frac{125}{6},\\ & \vdots&\displaystyle . \end{array} \end{aligned} $$
Continuing this way, one can compute every term \(S(X)^{i_1,\ldots ,i_k}_{0,5}\) of the signature for every multi-index \((i_1,\ldots , i_k)\), \(i_1,\ldots , i_k \in \{1,2\}\).
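The values above can be spot-checked numerically. A sketch assuming NumPy, approximating the first two levels by left-point Riemann sums:

```python
import numpy as np

t = np.linspace(0.0, 5.0, 200_001)
X1, X2 = 3 + t, (3 + t) ** 2                 # the path parametrized in (5)
dX1, dX2 = np.diff(X1), np.diff(X2)

def level2(Xi, dXj):
    # S^{i,j}_{0,5} = int_0^5 (X^i_s - X^i_0) dX^j_s
    return float(np.sum((Xi[:-1] - Xi[0]) * dXj))

inc1, inc2 = float(np.sum(dX1)), float(np.sum(dX2))
S11, S12 = level2(X1, dX1), level2(X1, dX2)
S21, S22 = level2(X2, dX1), level2(X2, dX2)

print(inc1, inc2)          # 5 and 55
print(S11, S12, S21, S22)  # close to 25/2, 475/3, 350/3, 3025/2
```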
We will later see examples of paths, namely piecewise linear paths, whose signature can be computed without the need to evaluate integrals; see Example 2.18 and Corollary 2.23.

2.2.3 Picard Iterations: Motivation for the Signature

Before reviewing the basic properties of the signature, we take a moment to show how the signature arises naturally in the classical theory of ordinary differential equations (ODEs). In a sense, this provides one of the first reasons for our interest in the signature.
It is instructive to start the discussion with an intuitive and simple example of Picard’s method. Let us consider the first order ODE
$$\displaystyle \begin{aligned} {} \frac{dy(t)}{dt} = f(t,y(t)), \end{aligned} $$
(6)
where \(y:[a,b] \to \mathbb R\) is a real valued function of \(t\in [a,b]\). Picard’s method allows us to construct an approximate solution to (6) in the form of an iterative series. The integral form of (6) is given by
$$\displaystyle \begin{aligned} y(t) = y(a) + \int_{a}^t f(s,y(s))\,ds. \end{aligned}$$
We now define a sequence of functions \(y_k(t)\), \(k = 0, 1, \ldots \), where the first term is the constant function \(y_0(t) = y(a)\), and for \(k \geq 1\), we define inductively
$$\displaystyle \begin{aligned} y_{k}(t)=y(a) + \int_{a}^t f(s,y_{k-1}(s))\,ds. \end{aligned}$$
The classical Picard–Lindelöf theorem states that, under suitable conditions, the solution to (6) is given by \(y(t) = \lim _{k\rightarrow \infty } y_k(t)\).
Example 2.8
Suppose \(y:[0,T]\to \mathbb R\) solves the ODE
$$\displaystyle \begin{aligned} {} \frac{dy(t)}{dt} = y(t), \;\;\; y(0)=1. \end{aligned} $$
(7)
The first k terms of the Picard iterations are given by
$$\displaystyle \begin{aligned} \begin{array}{rcl} y_0(t) & =&\displaystyle 1,\\ y_1(t) & =&\displaystyle 1 + \int_{0}^t y_0(s) \, ds = 1 + t,\\ y_2(t) & =&\displaystyle 1 + \int_{0}^t y_1(s) \, ds = 1 + t + \frac{1}{2}t^2,\\ y_3(t) & =&\displaystyle 1 + \int_{0}^t y_2(s) \, ds = 1 + t + \frac{1}{2}t^2 + \frac{1}{6}t^3,\\ y_4(t) & =&\displaystyle 1 + \int_{0}^t y_3(s) \, ds = 1 + t + \frac{1}{2}t^2 + \frac{1}{6}t^3 + \frac{1}{24}t^4,\\ & \vdots &\displaystyle \\ y_k(t) & =&\displaystyle \sum_{n=0}^k \frac{1}{n!}t^n, \end{array} \end{aligned} $$
which converges to \(y(t) = e^t\) as \(k\rightarrow \infty \), which is indeed the solution to (7). These approximations are plotted in Fig. 4.
Fig. 4
Example of sequential Picard approximation to the true solution
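The iteration of Example 2.8 is equally simple to carry out on a grid. A sketch assuming NumPy, with each integral taken as a cumulative trapezoid sum:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1001)
dt = t[1] - t[0]

def picard_step(y_prev):
    # y_k(t) = 1 + int_0^t y_{k-1}(s) ds, via cumulative trapezoid slices
    slices = 0.5 * (y_prev[1:] + y_prev[:-1]) * dt
    return 1.0 + np.concatenate([[0.0], np.cumsum(slices)])

y = np.ones_like(t)             # y_0(t) = 1
for _ in range(15):
    y = picard_step(y)

print(y[-1])  # approaches e = 2.71828... as the iteration deepens
```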
We are now ready to consider a controlled differential equation and the role of the signature in its solution. Consider a path \(X : [a,b] \to \mathbb R^d\). Let \(\mathbf L(\mathbb R^d, \mathbb R^e)\) denote the vector space of linear maps from \(\mathbb R^d\) to \(\mathbb R^e\). Equivalently, \(\mathbf L(\mathbb R^d, \mathbb R^e)\) can be regarded as the vector space of \(e \times d\) real matrices. For a path \(Z : [a,b] \to \mathbf L(\mathbb R^d, \mathbb R^e)\), note that we can define the integral
$$\displaystyle \begin{aligned} {} \int_a^b Z_t \, dX_t = \int_a^b Z_t(\dot X_t) \, dt \end{aligned} $$
(8)
as an element of \(\mathbb R^e\) in exactly the same way as the usual path integral, where \(Z_t(\dot X_t)\) takes values in \(\mathbb R^e\) for every \(t\in [a,b]\).
Example 2.9
Consider \(d=2\) and \(e=3\), so that \(\mathbf L(\mathbb R^d, \mathbb R^e)\) can be identified with the space of matrices \(\mathbb R^{e\times d}\). Consider \(a=0,b=1\) and
$$\displaystyle \begin{aligned} Z_t = \begin{pmatrix} 1 & t\\ t^2 & 0\\ 0 & 2\\ \end{pmatrix} \;, \qquad X_t = \begin{pmatrix} t \\ t^3 \end{pmatrix} \end{aligned}$$
for all \(t\in [0,1]\). Then \( \dot X_t = \begin {pmatrix} 1 \\ 3t^2 \end {pmatrix}\) and
$$\displaystyle \begin{aligned} Z_t (\dot X_t) = \begin{pmatrix} 1+3t^3 \\ t^2 \\ 6t^2 \\ \end{pmatrix}\;, \end{aligned}$$
so that the integral in (8) becomes
$$\displaystyle \begin{aligned} \int_0^1 Z_t \, dX_t = \int_0^1 Z_t(\dot X_t) \, dt = \begin{pmatrix} 7/4 \\ 1/3 \\ 2 \\ \end{pmatrix}\;. \end{aligned}$$
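The integral (8) is again straightforward to approximate. A sketch assuming NumPy, applying \(Z_t\) to \(\dot X_t\) pointwise and integrating each component with the trapezoid rule:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 10_001)
one, zero = np.ones_like(t), np.zeros_like(t)

# Z_t as an array of 3x2 matrices, shape (n, 3, 2)
Z = np.stack([
    np.stack([one,     t],       axis=-1),
    np.stack([t**2,    zero],    axis=-1),
    np.stack([zero,    2 * one], axis=-1),
], axis=1)
Xdot = np.stack([one, 3 * t**2], axis=-1)      # derivative of X_t = (t, t^3)

integrand = np.einsum('nij,nj->ni', Z, Xdot)   # Z_t(Xdot_t), shape (n, 3)

# trapezoid rule along the time axis
result = (0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)[:, None]).sum(axis=0)
print(result)  # close to (7/4, 1/3, 2)
```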
For a function \(V : \mathbb R^e \to \mathbf L(\mathbb R^d, \mathbb R^e)\) and a path \(Y : [a,b] \to \mathbb R^e\), we say that Y  solves the controlled differential equation
$$\displaystyle \begin{aligned} dY_t = V(Y_t) \, dX_t, \; \; Y_a = y \in \mathbb R^e, \end{aligned}$$
precisely when, for all times \(t \in [a,b]\),
$$\displaystyle \begin{aligned} {} Y_t = y + \int_a^t V(Y_s) \, dX_s. \end{aligned} $$
(9)
Such equations are of interest in control theory (see e.g. [1, 9]). We call V  the collection of driving vector fields, X the driver, and Y  the solution.
Keeping X fixed, a standard procedure to obtain a solution to (9) is through Picard iterations. For an arbitrary path \(Y : [a,b] \to \mathbb R^e\), define a new path \(F(Y) : [a,b] \to \mathbb R^e\) by
$$\displaystyle \begin{aligned} {} F(Y)_t = y + \int_a^t V(Y_s)\, dX_s. \end{aligned} $$
(10)
Observe that Y  is a solution to (9) if and only if \(F(Y)=Y\), i.e. Y  is a fixed point of F. Consider the sequence of paths \(Y^n_t = F(Y^{n-1})_t\) with initial arbitrary path \(Y^0_t\) (often taken as the constant path \(Y^0_t = y\)). The Picard–Lindelöf theorem implies that, under suitable assumptions, F possesses a unique fixed point Y  and that \(Y^n_t\) converges to Y  as \(n \rightarrow \infty \).
Suppose now \(V : \mathbb R^e \to \mathbf L(\mathbb R^d, \mathbb R^e)\) is a linear map. Note that we may equivalently treat V  as a tuple \(V=(V_1,\ldots ,V_d)\) where each \(V_i \in \mathbf L(\mathbb R^e,\mathbb R^e)\) is defined by \(V_i(y) = V(y)(e_i)\), where \(e_1,\ldots ,e_d\) are the canonical basis vectors of \(\mathbb R^d\). Equation (9) can then be expressed as
$$\displaystyle \begin{aligned} Y_t = y + \sum_{i=1}^d \int_a^t V_i(Y_s)\,dX^i_s. \end{aligned}$$
Let us start the Picard iterations with the initial constant path \(Y^0_t = y\) for all \(t \in [a,b]\). It follows from the fact that each \(V_i\) is a linear map that the iterates of F given by (10) are
$$\displaystyle \begin{aligned} \begin{array}{rcl} Y^0_t & =&\displaystyle y, \\ Y^1_t & =&\displaystyle y + \sum_{i=1}^d\int_a^t V_i(Y^0_s) \, dX^i_s = y+ \sum_{i=1}^d V_i(y)\int_a^t dX^i_s, \\ Y^2_t & =&\displaystyle y + \sum_{i=1}^d\int_a^t V_i(Y^1_s) \, dX^i_s \\ & =&\displaystyle y + \sum_{i=1}^d V_i(y) \int_a^t dX^i_s + \sum_{i,j=1}^d V_i(V_j(y)) \int_a^t \int_a^s dX^j_u \, dX^i_s, \\ & \vdots&\displaystyle \\ Y^n_t & =&\displaystyle y + \sum_{i=1}^d\int_a^t V_i(Y^{n-1}_s)\, dX^i_s \\ & =&\displaystyle y+ \sum_{k = 1}^n \sum_{i_1,\ldots,i_k=1}^d V_{i_k}(\ldots(V_{i_1}(y))\ldots)\int_{a < t_1 < \ldots < t_k < t} dX^{i_1}_{t_1}\ldots \, dX^{i_k}_{t_k},\\ & \vdots&\displaystyle . \end{array} \end{aligned} $$
We recognize in the final expression the signature of X:
$$\displaystyle \begin{aligned} S(X)^{i_1,\ldots,i_k}_{a,t} = \int_{a < t_1 < \ldots < t_k < t} dX^{i_1}_{t_1}\ldots \, dX^{i_k}_{t_k}. \end{aligned}$$
Furthermore, the series above converges at every time \(t\in [a,b]\), as the following exercise shows.
Exercise 2.10
Show that there exists \(C>0\), depending only on the path \(X:[a,b]\to \mathbb R^d\), the vector fields \(V_1,\ldots , V_d\in \mathbf L(\mathbb R^e,\mathbb R^e)\), and initial point \(y\in \mathbb R^e\), such that for all \(k \geq 1\) and \(t\in [a,b]\),
$$\displaystyle \begin{aligned} {} \Big\|\sum_{i_1,\ldots,i_k=1}^d V_{i_k}(\ldots(V_{i_1}(y))\ldots)\int_{a < t_1 < \ldots < t_k < t} dX^{i_1}_{t_1}\ldots \, dX^{i_k}_{t_k}\Big\| \leq \frac{C^k}{k!}. \end{aligned} $$
(11)
The factorial decay in (11) ensures that the limit \(\lim _{n\to \infty } Y^n_t\) exists for all \(t\in [a,b]\). We conclude that the solution \(Y_t\) is completely determined by the signature \(S(X)_{a,t}\) for every \(t \in [a,b]\). In particular, if the signatures of two drivers X and \(\widetilde X\) coincide at time \(t \in [a,b]\), that is, \(S(X)_{a,t} = S(\widetilde X)_{a,t}\), then the corresponding solutions to (9) will also agree at time t for any choice of the linear vector fields V .
An important but far less obvious result is that the same conclusion holds true for non-linear vector fields V . This result was first obtained by Chen [13] for a certain class of piecewise smooth paths, and more recently extended by Hambly and Lyons [36] to paths of bounded variation, and by Boedihardjo et al. [7] to a completely non-smooth setting of geometric rough paths (for which the signature is still well-defined, see Sect. 2.5). The latter class of paths is of particular interest in stochastic analysis.
We end the discussion on ODEs with one of the motivations behind the applications of signatures in machine learning. Suppose we have an instrument that can detect differences between two points \(x,\tilde x\in \mathbb R^d\) whenever \(|x-\tilde x|>\epsilon \), where \(\epsilon >0 \) represents the precision of the instrument. Suppose we are now given two paths (e.g. time series) \(X,\tilde X:[a,b]\to \mathbb R^d\). What is an effective test to determine whether X and \(\tilde X\) differ? Of course, we can apply our instrument to test whether \(X_t=\tilde X_t\) for a selection of times \(t\in [a,b]\), but this may fail to detect a difference when X and \(\tilde X\) are very close, e.g. when \(\sup _{t\in [a,b]}|X_t-\tilde X_t|<\epsilon \).
A curious fact is that ODEs provide another test that may be more effective. Namely, suppose the signals X and \(\tilde X\) interact with another system that we may observe and that this system is given by solutions to the ODEs \(dY = V(Y)\, dX\) and \(d\tilde Y=V(\tilde Y)\, d\tilde X\) with \(Y(a)=\tilde Y(a)\). Then, to determine if X and \(\tilde X\) differ, we can instead apply our instrument to test whether \(Y_b \neq \tilde Y_b\); indeed, \(Y_b \neq \tilde Y_b\) clearly implies \(X\neq \tilde X\). While this statement may at first appear devoid of interest, it turns out that the difference between Y  and \(\tilde Y\) may be much easier to detect than the difference between X and \(\tilde X\), i.e., it is possible that \(\sup _t |X_t-\tilde X_t|<\epsilon \) while \(|Y_b-\tilde Y_b|\) is much larger than \(\epsilon \) (see Fig. 5 for an illustration). As stated above, the signatures \(S(X)_{a,b}, S(\tilde X)_{a,b}\) completely determine the solutions \(Y_b,\tilde Y_b\) respectively (and in fact themselves can be seen as solutions to linear ODEs driven by \(X,\tilde X\)), which motivates the signature as a descriptive feature of time-ordered data.
Fig. 5
The path \(X : [0,T]\to \mathbb R^2\) is an approximation of a two-dimensional Brownian motion and \(\tilde X:[0,T]\to \mathbb R^2\) is a small perturbation of X at every point. The horizontal axis denotes time and the orange and blue lines indicate the two components \(X^1,X^2\); likewise for \(\tilde X\). The paths \(Y,\tilde Y:[0,T] \to \mathbb R\) are solutions to the ODEs \(d Y_t = V(Y_t) \, d X_t\) and \(d \tilde Y_t = V(\tilde Y_t) \, d \tilde X_t\) with the same initial point \(Y_0=\tilde Y_0\) and with vector fields \(V_1(y)=y/100\) and \(V_2(y)=y \sin {}(y)/100\). Although X and \(\tilde X\) are close in the uniform norm, Y  and \(\tilde Y\) differ significantly

2.2.4 Geometric Intuition for the First Two Levels

While the signature is defined analytically using path integrals, we briefly discuss here the geometric meaning of the first two levels. As already mentioned, the first level, given by the terms \((S(X)^1_{a,b},\ldots , S(X)^d_{a,b})\), is simply the increment of the path \(X : [a,b] \to \mathbb R^d\). For the second level, the term \(S(X)^{i,i}_{a,b}\) is equal to \((X^i_b-X^i_a)^2/2\); this relation is a special case of the shuffle product, which we review in Sect. 2.3.2. To give meaning to the term \(S(X)^{i,j}_{a,b}\) for \(i \neq j\), consider the Lévy area (illustrated in Fig. 6), which is the signed area enclosed by the path (solid red line) and the chord (dashed blue line) connecting its endpoints. The Lévy area of a 2-dimensional path \(\{X^1_t,X^2_t\}\) is given by
$$\displaystyle \begin{aligned} {} A=\frac{1}{2}\left(S(X)^{1,2}_{a,b}-S(X)^{2,1}_{a,b}\right). \end{aligned} $$
(12)
See Sect. 3.1.11 for an example computation of the signature where this geometric meaning is further highlighted.
Fig. 6
Example of the signed Lévy area of a curve. The areas above and below the chord connecting the two endpoints are negative and positive, respectively, and are denoted by \(A_{-}\) and \(A_{+}\). The increments along each coordinate are \(\Delta X^1\) and \(\Delta X^2\)
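A quick numerical illustration of (12), assuming NumPy: for the counter-clockwise unit circle \(X_t = (\cos t, \sin t)\), \(t\in[0,2\pi]\), the chord degenerates to a point and the Lévy area equals the enclosed area \(\pi\).

```python
import numpy as np

t = np.linspace(0.0, 2 * np.pi, 200_001)
X1, X2 = np.cos(t), np.sin(t)
dX1, dX2 = np.diff(X1), np.diff(X2)

# S^{1,2} = int (X^1_s - X^1_a) dX^2_s, and symmetrically for S^{2,1}
S12 = float(np.sum((X1[:-1] - X1[0]) * dX2))
S21 = float(np.sum((X2[:-1] - X2[0]) * dX1))

levy_area = 0.5 * (S12 - S21)
print(levy_area)  # close to pi
```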

2.3 Important Properties of the Signature

We now review several fundamental properties of the signature, providing complete proofs for most of them. Several deeper results are discussed in the following Sect. 2.5, but only on an informal level.

2.3.1 Invariance Under Time Reparametrizations

We call a surjective, continuous, non-decreasing function \(\psi : [a,b] \to [a,b]\) a reparametrization. For simplicity, we only consider smooth reparametrizations, although, just like in the definition of the path integral, this is not strictly necessary.
Let \(X, Y : [a,b] \to \mathbb R\) be two real-valued paths and \(\psi : [a,b] \to [a,b]\) a reparametrization. Define the paths \(\widetilde X, \widetilde Y : [a,b] \to \mathbb R\) by \(\widetilde X_t = X_{\psi (t)}\) and \(\widetilde Y_t = Y_{\psi (t)}\). Observe that
$$\displaystyle \begin{aligned} \dot{\widetilde X}_t = \dot{X}_{\psi(t)} \dot{\psi}(t), \end{aligned}$$
from which it follows that
$$\displaystyle \begin{aligned} \int_a^b \widetilde Y_t \, d\widetilde X_t = \int_a^b Y_{\psi(t)} \dot{X}_{\psi(t)} \dot{\psi}(t) \, dt = \int_a^b Y_u \, dX_u, \end{aligned}$$
where the last equality follows by making the substitution \(u = \psi (t)\). This shows that path integrals are invariant under a time reparametrization of both paths.
Consider now a multi-dimensional path \(X : [a,b] \to \mathbb R^d\) and a reparametrization \(\psi :[a,b] \to [a,b]\). As before, denote by \(\widetilde X :[a,b] \to \mathbb R^d\) the reparametrized path \(\widetilde X_t = X_{\psi (t)}\). Since every term of the signature \(S(X)^{i_1,\ldots ,i_k}_{a,b}\) is defined as an iterated path integral of X, it follows from the above that
$$\displaystyle \begin{aligned} S(\widetilde X)^{i_1,\ldots,i_k}_{a,b} = S(X)^{i_1,\ldots,i_k}_{a,b}, \; \; \forall k \geq 0, \; i_1,\ldots, i_k \in \{1,\ldots, d\}. \end{aligned}$$
That is to say, the signature \(S(X)_{a,b}\) is invariant under time reparametrizations of X.
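This invariance is easy to observe numerically. The following minimal sketch (the helper name `sig_12` is ours) approximates the level-2 term \(S(X)^{1,2}_{0,1}\) by a left-point Riemann sum, for the path \(X_t=(t,t^3)\) on \([0,1]\) and for its reparametrization by \(\psi(t)=t^2\):

```python
def sig_12(xs, ys):
    r"""Left-point Riemann sum for S^{1,2} = \int_a^b (X^1_t - X^1_a) dX^2_t
    of a sampled path with coordinates xs = X^1 and ys = X^2."""
    s = 0.0
    for i in range(len(xs) - 1):
        s += (xs[i] - xs[0]) * (ys[i + 1] - ys[i])
    return s

N = 100_000
ts = [i / N for i in range(N + 1)]
orig  = sig_12([t for t in ts], [t ** 3 for t in ts])       # X_t = (t, t^3)
repar = sig_12([t ** 2 for t in ts], [t ** 6 for t in ts])  # X_{psi(t)}, psi(t) = t^2
print(orig, repar)  # both close to the exact value 3/4
```

Both sums approximate \(\int_0^1 t\cdot 3t^2\,dt = 3/4\), illustrating that the change of parametrization leaves the signature term unchanged.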

2.3.2 Shuffle Product

As alluded to in the introduction, the signature is a natural generalization of polynomials to the space of (possibly unparametrized) paths. We already saw a simple example of this in Example 2.6 where the signature of a 1-dimensional path encodes the powers of its increment.
Signatures generalize polynomials to multi-dimensional paths through the shuffle product identity. This identity, shown originally by Ree [67], implies that the product of two terms \(S(X)^{i_1,\ldots , i_k}_{a,b}\) and \(S(X)^{j_1,\ldots , j_m}_{a,b}\) can be expressed as a sum of another collection of terms of \(S(X)_{a,b}\) which only depends on the multi-indexes \((i_1,\ldots , i_k)\) and \((j_1,\ldots , j_m)\); this generalizes the fact that the product of two polynomials is again a polynomial. More precisely,
$$\displaystyle \begin{aligned} S(X)^{i_1,\ldots, i_k}_{a,b}S(X)^{j_1,\ldots, j_m}_{a,b} = \sum_K S(X)^K_{a,b},\end{aligned}$$
where the sum is over all multi-indexes K obtained by ‘shuffling’ \((i_1,\ldots , i_k)\) and \((j_1,\ldots , j_m)\) together in a way that preserves their respective orders. See Example 2.16 for the precise relation with classical multiplication of polynomials, as well as Sect. 2.4 for the relevance of the shuffle product to statistical learning.
To make this statement precise, we define the shuffle product of two multi-indexes. For integers \(k,m\geq 0\), a permutation \(\sigma \) of the set \(\{1,\ldots , k+m\}\) is called a \((k,m)\)-shuffle if \(\sigma ^{-1}(1) < \ldots < \sigma ^{-1}(k)\) and \(\sigma ^{-1}(k+1) < \ldots < \sigma ^{-1}(k+m)\). The list \((\sigma (1),\ldots , \sigma (k+m))\) is also called a shuffle of \((1,\ldots , k)\) and \((k+1,\ldots , k+m)\).
Example 2.11
Consider \(k=2\) and \(m=3\) and the permutation \(\sigma : \{1,\ldots , 5\} \to \{1,\ldots , 5\}\) defined by
$$\displaystyle \begin{aligned} (\sigma(1),\sigma(2),\sigma(3),\sigma(4),\sigma(5)) = (3,1,4,5,2). \end{aligned}$$
Then \(\sigma \) is a \((2,3)\)-shuffle, i.e., \((3,1,4,5,2)\) is a shuffle of \((1,2)\) and \((3,4,5)\), because
$$\displaystyle \begin{aligned} \sigma^{-1}(1)=2 < \sigma^{-1}(2) = 5 \quad \text{and} \quad \sigma^{-1}(3)=1 < \sigma^{-1}(4) = 3 < \sigma^{-1}(5)=4. \end{aligned}$$
Recall that a multi-set is an unordered collection of objects in which each object may appear more than once, e.g. \(\{1,2,2,3,4,4,4\}\) is a multi-set of integers. (The difference between a set and multi-set is that an object appears at most once in a set.)
Definition 2.12 (Shuffle Product)
Let \(\mathrm {Shuffles}(k,m)\) denote the set of all \((k,m)\)-shuffles. Consider two multi-indexes \(I = (i_1,\ldots , i_k)\) and \(J = (j_1,\ldots , j_m)\) with \(i_1,\ldots ,i_k,j_1,\ldots , j_m \in \{ 1,\ldots , d\}\). Define the multi-index
$$\displaystyle \begin{aligned} (r_1,\ldots, r_k, r_{k+1}, \ldots r_{k+m}) = (i_1,\ldots, i_k, j_1,\ldots, j_m). \end{aligned}$$
The shuffle product of I and J, denoted \(I \mathbin{\sqcup\!\sqcup} J\), is a finite multi-set of multi-indexes of length \(k+m\) defined by
$$\displaystyle \begin{aligned} I \mathbin{\sqcup\!\sqcup} J = \left\{ (r_{\sigma(1)}, \ldots, r_{\sigma(k+m)}) \mid \sigma \in \mathrm{Shuffles}(k,m) \right\}. \end{aligned}$$
Example 2.13
The following example shows the shuffle product between \((1,2,1)\) and \((2,3)\) (we drop parentheses and commas for notational simplicity):
$$\displaystyle \begin{aligned} 121 \mathbin{\sqcup\!\sqcup} 23 = \{12123,\, 12213,\, 12213,\, 12231,\, 12231,\, 12321,\, 21213,\, 21231,\, 21321,\, 23121\}. \end{aligned}$$
The multi-indexes \(12213\) and \(12231\) each appear twice, since two distinct \((3,2)\)-shuffles produce each of them.
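The shuffle product can be enumerated by choosing the positions of one multi-index among the \(k+m\) available slots. A minimal sketch (the function name `shuffle_product` is ours) that reproduces the multi-set of Example 2.13:

```python
from collections import Counter
from itertools import combinations
from math import comb

def shuffle_product(I, J):
    """Multi-set of all shuffles of multi-indexes I and J (Definition 2.12),
    returned as a Counter mapping each multi-index to its multiplicity."""
    k, m = len(I), len(J)
    out = Counter()
    # Choosing the positions of J's entries among the k+m slots enumerates
    # exactly the (k, m)-shuffles.
    for pos in combinations(range(k + m), m):
        word, it_i, it_j = [], iter(I), iter(J)
        for p in range(k + m):
            word.append(next(it_j) if p in pos else next(it_i))
        out[tuple(word)] += 1
    return out

sh = shuffle_product((1, 2, 1), (2, 3))
print(sum(sh.values()), comb(5, 2))  # 10 10  (total number of (3,2)-shuffles)
print(sorted(sh.items()))
```

The total count agrees with remark 1 below the definition, and the multiplicities greater than one confirm that the result is genuinely a multi-set.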
We make several important remarks about the above definition.
1.
The number of elements in \(I \mathbin{\sqcup\!\sqcup} J\) is \(\binom {k+m}{k}\), which is equal to the total number of \((k,m)\)-shuffles.
 
2.
\(I \mathbin{\sqcup\!\sqcup} J\) is in general a multi-set (not just a set) as it may contain multiple instances of a single multi-index, e.g. \((i) \mathbin{\sqcup\!\sqcup} (i) = \{(i,i),\, (i,i)\}\) for any \(i\in \{1,\ldots , d\}\) since both permutations of \(\{1,2\}\) are \((1,1)\)-shuffles.
 
3.
The shuffle product is commutative, that is, \(I \mathbin{\sqcup\!\sqcup} J = J \mathbin{\sqcup\!\sqcup} I\).
 
4.
If \(k=0\), then \(I = \emptyset\) and \(I \mathbin{\sqcup\!\sqcup} J\) is a singleton containing the multi-index J.
 
Theorem 2.14 (Shuffle Product Identity)
Consider a path\(X : [a,b] \to \mathbb R^d\)and two multi-indexes\(I = (i_1,\ldots , i_k)\)and\(J = (j_1,\ldots , j_m)\)with\(i_1,\ldots ,i_k,j_1,\ldots , j_m \in \{ 1,\ldots , d\}\). Then
$$\displaystyle \begin{aligned} {} S(X)^{I}_{a,b} S(X)^{J}_{a,b} = \sum_{K \in I \mathbin{\sqcup\!\sqcup} J} S(X)^{K}_{a,b}. \end{aligned} $$
(13)
Proof
For \(k=0\), we have \(I=\emptyset \), and thus \(I \mathbin{\sqcup\!\sqcup} J\) is a singleton containing the multi-index J. In this case (13) becomes \(S(X)^\emptyset _{a,b} S(X)^J_{a,b} = S(X)^J_{a,b}\), which is true since \(S(X)^\emptyset _{a,b}=1\) by definition. This proves the result for \(k=0\), and, by symmetry, for \(m=0\).
It remains to consider \(k,m\geq 1\). We proceed by induction on \(k+m\). For the inductive step, suppose (13) holds for all (possibly empty) multi-indexes of combined length \(k+m-1\). Then, for I and J as in the theorem statement,
$$\displaystyle \begin{aligned} \begin{array}{rcl} S(X)^I_{a,b} S(X)^J_{a,b} & =&\displaystyle \int_a^b S(X)^{i_1,\ldots, i_{k-1}}_{a,s} \, d X^{i_k}_s \int_a^b S(X)^{j_1,\ldots, j_{m-1}}_{a,t} \, d X^{j_m}_t \\ & =&\displaystyle \int_{s,t\in[a,b]^2} S(X)^{i_1,\ldots ,i_{k-1}}_{a,s} S(X)^{j_1,\ldots, j_{m-1}}_{a,t} \, d X^{i_k}_s \, d X^{j_m}_t \\ & =&\displaystyle \int_{a<s<t<b} S(X)^{i_1,\ldots, i_{k-1}}_{a,s}d X^{i_k}_s S(X)^{j_1,\ldots, j_{m-1}}_{a,t} \, d X^{j_m}_t \\ & &\displaystyle \quad +\int_{a<t<s<b} S(X)^{j_1\ldots j_{m-1}}_{a,t} d X^{j_m}_t S(X)^{i_1\ldots i_{k-1}}_{a,s} \, d X^{i_k}_s \\ & =&\displaystyle \int_{a}^b S(X)^{I}_{a,t} S(X)^{j_1\ldots j_{m-1}}_{a,t} \, d X^{j_m}_t +\int_{a}^b S(X)^{J}_{a,s} S(X)^{i_1\ldots i_{k-1}}_{a,s} \, d X^{i_k}_s \end{array} \end{aligned} $$
where we used Fubini's theorem in the second equality. By the inductive hypothesis together with the earlier cases \(m-1=0\) or \(k-1=0\), the final expression is equal to
$$\displaystyle \begin{aligned} \int_a^b \sum_{K \in I \mathbin{\sqcup\!\sqcup} (j_1,\ldots, j_{m-1})} S(X)^{K}_{a,t} \, dX^{j_m}_t + \int_a^b \sum_{K \in (i_1,\ldots, i_{k-1}) \mathbin{\sqcup\!\sqcup} J} S(X)^{K}_{a,s} \, dX^{i_k}_s. \end{aligned}$$
We finally note the combinatorial identity
$$\displaystyle \begin{aligned} I \mathbin{\sqcup\!\sqcup} J = \left\{(K, j_m) : K \in I \mathbin{\sqcup\!\sqcup} (j_1,\ldots, j_{m-1})\right\} \cup \left\{(K, i_k) : K \in (i_1,\ldots, i_{k-1}) \mathbin{\sqcup\!\sqcup} J\right\}, \end{aligned}$$
where the two sets on the right-hand side are disjoint; the proof of this is left as an exercise for the reader. We thus obtain (13) for \(I,J\) of combined length \(k+m\) for \(k,m\geq 1\), which completes the inductive step. □
Example 2.15
Consider a 2-dimensional path \(X : [a,b] \to \mathbb R^2\). The shuffle product identity (13) implies that
$$\displaystyle \begin{aligned} \begin{array}{rcl} S(X)^1_{a,b} S(X)^2_{a,b} & =&\displaystyle S(X)^{1,2}_{a,b} + S(X)^{2,1}_{a,b}, \\ S(X)^{1,2}_{a,b} S(X)^1_{a,b} & =&\displaystyle 2 S(X)^{1,1,2}_{a,b} + S(X)^{1,2,1}_{a,b}. \end{array} \end{aligned} $$
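These identities can be checked numerically. The sketch below (the helper name `sig_term` is ours) approximates signature terms by iterated left-point Riemann sums for a quarter-circle path; coordinate indices are 0-based, so `[0, 1]` stands for the multi-index \((1,2)\):

```python
import math

def sig_term(path, index):
    """Approximate S(X)^{index}_{a,b} for a sampled path (list of tuples)
    by iterated left-point Riemann sums; coordinates are 0-based here."""
    n = len(path)
    level = [1.0] * n  # S^{emptyset}_{a,t} = 1
    for i in index:
        acc = [0.0] * n
        for t in range(1, n):
            acc[t] = acc[t - 1] + level[t - 1] * (path[t][i] - path[t - 1][i])
        level = acc
    return level[-1]

# Quarter circle X_u = (cos u, sin u), u in [0, pi/2].
N = 20_000
path = [(math.cos(math.pi * t / (2 * N)), math.sin(math.pi * t / (2 * N)))
        for t in range(N + 1)]
lhs = sig_term(path, [0]) * sig_term(path, [1])
rhs = sig_term(path, [0, 1]) + sig_term(path, [1, 0])
lhs2 = sig_term(path, [0, 1]) * sig_term(path, [0])
rhs2 = 2 * sig_term(path, [0, 0, 1]) + sig_term(path, [0, 1, 0])
print(lhs, rhs)   # both close to -1
print(lhs2, rhs2)
```

Up to discretization error, the two sides of each identity agree, in accordance with Theorem 2.14.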
Example 2.16
We now show that the shuffle product generalizes the familiar multiplication of (multivariate) polynomials.
As the simplest example, consider the one-dimensional case with \(x\in \mathbb R\) and the linear path \(X:[0,1]\to \mathbb R\), \(X_t=tx\). Following Example 2.6, for any multi-index \((1,\ldots , 1)\) of length k,
$$\displaystyle \begin{aligned} {} S(X)^{1,\ldots, 1}_{0,1} = \frac{x^k}{k!}. \end{aligned} $$
(14)
Consider two multi-indexes I and J of length k and m (consisting only of 1’s). Then, by (14),
$$\displaystyle \begin{aligned} {} S(X)^{I}_{0,1}S(X)^{J}_{0,1} = \frac{x^{k+m}}{k!m!}. \end{aligned} $$
(15)
On the other hand, there are \(\binom {k+m}{k}\) ways to shuffle I and J, and all shuffles lead to the same multi-index with 1 appearing \(k+m\) times. Therefore, by (13),
$$\displaystyle \begin{aligned} {} S(X)^{I}_{0,1}S(X)^{J}_{0,1} = \binom{k+m}{k}\frac{x^{k+m}}{(k+m)!}. \end{aligned} $$
(16)
Since \(\binom {k+m}{k}=(k+m)!/(k!m!)\), we see that the expressions (15) and (16) match.
Consider now the multi-dimensional case \(x=(x_1,\ldots ,x_d)\in \mathbb R^d\) and the linear path \(X:[0,1]\to \mathbb R^d\), \(X_t=tx\) as before. For \(I=(i_1,\ldots ,i_k)\), one can verify that
$$\displaystyle \begin{aligned} {} S(X)^{I}_{0,1} = \frac{x_1^{\beta_1}\cdots x_d^{\beta_d}}{k!}, \end{aligned} $$
(17)
where \(\beta _n\) is the number of times n appears in \(i_1,\ldots , i_k\). In particular, \(S(X)^{I}_{0,1}\) depends only on the elements in I, not their order, and encodes a multivariate monomial of x.
It follows that, for another multi-index \(J=(j_1,\ldots ,j_m)\),
$$\displaystyle \begin{aligned} {} S(X)^{I}_{0,1}S(X)^{J}_{0,1} = \frac{x_1^{\beta_1+\gamma_1}\cdots x_d^{\beta_d+\gamma_d}}{k!m!}, \end{aligned} $$
(18)
where \(\gamma _n\) is the number of times n appears in \(j_1,\ldots ,j_m\).
On the other hand, there are again \(\binom {k+m}{k}\) multi-indexes in \(I \mathbin{\sqcup\!\sqcup} J\), which may now be distinct, but each \(n\in \{1,\ldots , d\}\) appears \(\beta _n+\gamma _n\) times in every multi-index in \(I \mathbin{\sqcup\!\sqcup} J\). Therefore, by (13),
$$\displaystyle \begin{aligned} S(X)^{I}_{0,1}S(X)^{J}_{0,1} = \binom{k+m}{k}\frac{x_1^{\beta_1+\gamma_1}\cdots x_d^{\beta_d+\gamma_d}}{(k+m)!}, \end{aligned}$$
which again agrees with (18).
The shuffle product implies that the product of two terms of the signature can be expressed as a linear combination of higher order terms. This fact has important theoretical and practical consequences, some of which we discuss in Sect. 2.4.

2.3.3 Chen’s Identity

We now describe a property of the signature known as Chen’s identity, which provides an algebraic relationship between paths and their signatures. It furthermore provides a powerful tool in the computation of the signature once its value is known on subintervals (see Example 2.18 below). The simplest version of Chen’s identity is the following.
Theorem 2.17 (Chen’s Identity)
Suppose\(a<b<c\)and\(X:[a,c]\to \mathbb R^d\)is a path. Then, for any multi-index\((i_1, \ldots , i_k)\),
$$\displaystyle \begin{aligned} {} S(X)^{i_1,\ldots, i_k}_{a,c} = \sum_{m=0}^k S(X)^{i_1,\ldots, i_m}_{a,b} S(X)^{i_{m+1},\ldots, i_k}_{b,c}. \end{aligned} $$
(19)
Proof
We proceed by induction on k. The case \(k=0\) is obvious since both the left- and right-hand sides are equal to 1 in this case. Suppose now (19) holds for \(k-1\) with \(k\geq 1\). Then
$$\displaystyle \begin{aligned} \begin{array}{rcl} S(X)^{i_1,\ldots, i_k}_{a,c} & =&\displaystyle \int_a^b S(X)^{i_1,\ldots, i_{k-1}}_{a,t} \, d X_t^{i_k} + \int_b^c S(X)^{i_1,\ldots ,i_{k-1}}_{a,t} \, d X_t^{i_k} \\ & =&\displaystyle S(X)^{i_1,\ldots, i_{k}}_{a,b} +\int_b^c \sum_{m=0}^{k-1} S(X)^{i_1,\ldots ,i_{m}}_{a,b}S(X)^{i_{m+1},\ldots ,i_{k-1}}_{b,t} \, d X_t^{i_k} \\ & =&\displaystyle S(X)^{i_1,\ldots, i_{k}}_{a,b} + \sum_{m=0}^{k-1} S(X)^{i_1,\ldots, i_{m}}_{a,b}S(X)^{i_{m+1},\ldots, i_{k-1}, i_k}_{b,c} \\ & =&\displaystyle \sum_{m=0}^{k} S(X)^{i_1,\ldots, i_{m}}_{a,b}S(X)^{i_{m+1}, \ldots, i_k}_{b,c}, \end{array} \end{aligned} $$
where we used the inductive hypothesis in the second equality. □
The next example shows that Chen’s identity is an effective method to compute the signatures of piecewise linear paths.
Example 2.18 (Signature of Piecewise Linear Path)
Consider the path \(X:[0,2]\to \mathbb R^2\) which is linear on the intervals \([0,1]\) and \([1,2]\) and \(X_0 = (3,0), X_1=(3,1), X_2=(0,1)\), see Fig. 7.
Fig. 7
An example of a piecewise linear path
Then
$$\displaystyle \begin{aligned} (S(X)^{\emptyset}_{0,1},S(X)^2_{0,1},S(X)^{22}_{0,1}) &= (1,1,1/2), \end{aligned} $$
with \(S(X)^{I}_{0,1} = 0\) for all other multi-indexes I of length at most 2, and
$$\displaystyle \begin{aligned} (S(X)^{\emptyset}_{1,2},S(X)^1_{1,2},S(X)^{11}_{1,2}) &= (1,-3,9/2), \end{aligned} $$
with \(S(X)^{I}_{1,2} = 0\) for all other multi-indexes I of length at most 2.
Then by Chen’s identity
$$\displaystyle \begin{aligned} (S(X)^{1}_{0,2},S(X)^{2}_{0,2}) &= (S(X)^{1}_{0,1} + S(X)^{1}_{1,2},S(X)^{2}_{0,1} + S(X)^{2}_{1,2}) \\ &= (-3+0,0+1), \end{aligned} $$
as expected from the fact that \(S(X)^i_{a,b}=X^i_b-X^i_a\). Furthermore, again by Chen’s identity,
$$\displaystyle \begin{aligned} S(X)^{12}_{0,2} &= S(X)^{12}_{0,1} + S(X)^{1}_{0,1}S(X)^{2}_{1,2} + S(X)^{12}_{1,2} = 0 + 0\times 0 + 0 = 0, \\ S(X)^{21}_{0,2} &= S(X)^{21}_{0,1} + S(X)^{2}_{0,1}S(X)^{1}_{1,2} + S(X)^{21}_{1,2} = 0 + 1\times (-3) + 0 = -3, \end{aligned} $$
which is less obvious, and which is consistent with the Lévy area identity (12).
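The computation in Example 2.18 can be reproduced in a few lines of code. The sketch below (the helper names `linear_sig2` and `chen2` are ours) stores levels 1 and 2 of the signature and combines them with Chen's identity (19); coordinate indices are 0-based:

```python
def linear_sig2(dx):
    """Levels 1 and 2 of the signature of a linear segment with increment dx:
    S^i = dx_i and S^{i,j} = dx_i dx_j / 2."""
    d = len(dx)
    lvl1 = list(dx)
    lvl2 = [[dx[i] * dx[j] / 2 for j in range(d)] for i in range(d)]
    return lvl1, lvl2

def chen2(sig_a, sig_b):
    """Combine signatures over [a,b] and [b,c] via Chen's identity (19),
    truncated at level 2."""
    l1a, l2a = sig_a
    l1b, l2b = sig_b
    d = len(l1a)
    l1 = [l1a[i] + l1b[i] for i in range(d)]
    l2 = [[l2a[i][j] + l1a[i] * l1b[j] + l2b[i][j] for j in range(d)]
          for i in range(d)]
    return l1, l2

# Example 2.18: increments (0,1) on [0,1] and (-3,0) on [1,2].
l1, l2 = chen2(linear_sig2((0.0, 1.0)), linear_sig2((-3.0, 0.0)))
print(l1)                  # [-3.0, 1.0]
print(l2[0][1], l2[1][0])  # 0.0 -3.0, i.e. S^{12}_{0,2} and S^{21}_{0,2}
```

The output matches the values obtained by hand above.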
There is a way to formulate Chen’s identity in a more algebraic way. To do so, we need to introduce the algebra of formal power series, which should appear natural in light of the definition of the signature.
Definition 2.19 (Formal Power Series)
Let \(e_1,\ldots , e_d\) be d formal indeterminates. The algebra of (non-commuting) formal power series in d indeterminates is the vector space of all formal series of the form
$$\displaystyle \begin{aligned} \sum_{k = 0}^\infty \sum_{i_1,\ldots, i_k \in \{1,\ldots, d\}} \lambda_{i_1,\ldots,i_k}e_{i_1}\ldots e_{i_k}, \end{aligned}$$
where the second summation runs over all multi-indexes \((i_1,\ldots , i_k)\), \(i_1,\ldots , i_k \in \{1,\ldots , d\}\), and \(\lambda _{i_1,\ldots , i_k}\) are real numbers.
A (non-commuting) formal polynomial is a formal power series for which only a finite number of coefficients \(\lambda _{i_1,\ldots , i_k}\) are non-zero. The terms \(e_{i_1}\ldots e_{i_k}\) are called monomials. The term corresponding to \(k = 0\) is simply a real number \(\lambda _0\). The space of formal power series is also called the space of tensor series over \(\mathbb R^d\). We stress that the power series we consider are non-commutative; for example, the elements \(e_1e_2\) and \(e_2e_1\) are distinct.
The space of formal power series may be naturally equipped with a vector space structure by defining addition and scalar multiplication as
$$\displaystyle \begin{aligned}\displaystyle \Big(\sum_{k = 0}^\infty \sum_{i_1,\ldots, i_k \in \{1,\ldots, d\}} \lambda_{i_1,\ldots,i_k}e_{i_1}\ldots e_{i_k}\Big) + \Big(\sum_{k = 0}^\infty \sum_{i_1,\ldots, i_k \in \{1,\ldots, d\}} \mu_{i_1,\ldots,i_k}e_{i_1}\ldots e_{i_k}\Big) \\\displaystyle = \sum_{k = 0}^\infty \sum_{i_1,\ldots, i_k \in \{1,\ldots, d\}} (\lambda_{i_1,\ldots,i_k} + \mu_{i_1,\ldots, i_k})e_{i_1}\ldots e_{i_k} \end{aligned} $$
and
$$\displaystyle \begin{aligned} c\Big(\sum_{k = 0}^\infty \sum_{i_1,\ldots, i_k \in \{1,\ldots, d\}} \lambda_{i_1,\ldots,i_k}e_{i_1}\ldots e_{i_k}\Big) = \sum_{k = 0}^\infty \sum_{i_1,\ldots, i_k \in \{1,\ldots, d\}} c\lambda_{i_1,\ldots,i_k}e_{i_1}\ldots e_{i_k}. \end{aligned}$$
Moreover, one may define the product \(\otimes \) between monomials by joining together multi-indexes
$$\displaystyle \begin{aligned} e_{i_1}\ldots e_{i_k} \otimes e_{j_1}\ldots e_{j_m} = e_{i_1}\ldots e_{i_k}e_{j_1}\ldots e_{j_m}. \end{aligned}$$
The product \(\otimes \) then extends uniquely and linearly to all power series. We demonstrate the first few terms of the product in the following expression:
$$\displaystyle \begin{aligned} \begin{array}{*{20}l} {} \Big(\sum_{k = 0}^\infty \sum_{i_1,\ldots, i_k \in \{1,\ldots, d\}} \lambda_{i_1,\ldots,i_k}e_{i_1}\ldots e_{i_k}\Big) \otimes \Big(\sum_{k = 0}^\infty \sum_{i_1,\ldots, i_k \in \{1,\ldots, d\}} \mu_{i_1,\ldots,i_k}e_{i_1}\ldots e_{i_k} \Big) \\ = \lambda_0\mu_0 + \sum_{i=1}^d (\lambda_0\mu_i + \lambda_i\mu_0)e_i + \sum_{i,j=1}^d \left( \lambda_0 \mu_{i,j} + \lambda_i\mu_j + \lambda_{i,j}\mu_0\right)e_ie_j + \ldots. \end{array} \end{aligned} $$
(20)
The space of formal power series becomes an algebra when equipped with this vector space structure and product \(\otimes \), i.e. \(\otimes \) is associative and distributive over addition. Note, however, that \(\otimes \) is not commutative as (20) demonstrates.
The reader may have noticed that the indexing set of the monomials \(e_{i_1}\ldots e_{i_k}\) coincides with the indexing set of the terms of the signature of a path \(X : [a,b] \to \mathbb R^d\), namely the collection of all multi-indexes \((i_1,\ldots , i_k)\), \(i_1,\ldots , i_k \in \{1,\ldots , d\}\). It follows that a convenient way to express the signature of X is by a formal power series where the coefficient of each monomial \(e_{i_1}\ldots e_{i_k}\) is set to be \(S(X)^{i_1,\ldots , i_k}_{a,b}\). We use the same symbol \(S(X)_{a,b}\) to denote this representation
$$\displaystyle \begin{aligned} S(X)_{a,b} = \sum_{k = 0}^\infty \sum_{i_1,\ldots, i_k \in \{1,\ldots, d\}} S(X)^{i_1,\ldots, i_k}_{a,b} e_{i_1}\ldots e_{i_k}, \end{aligned}$$
where, as before, the “0-th” level of the signature is \(S(X)^\emptyset _{a,b} = 1\) (corresponding to \(k=0\)).
Example 2.20
Consider \(x=(x^1,\ldots , x^d)\in \mathbb R^d\) and identify this with the formal power series \(x = \sum _{i=1}^d x^ie_i\) (which we denote by the same symbol). Define the formal power series \({\mathbf {e}}^x\) by
$$\displaystyle \begin{aligned} {\mathbf{e}}^x = \sum_{k=0}^\infty \frac{x^{\otimes k}}{k!} = \sum_{k=0}^\infty \frac{(\sum_{i=1}^d x^ie_i)^{\otimes k}}{k!}\;. \end{aligned}$$
Although this series has an infinite number of terms, the coefficient of \(e_{i_1}\ldots e_{i_k}\) in \({\mathbf {e}}^x\) is a monomial in \(x^1,\ldots ,x^d\) and thus \({\mathbf {e}}^x\) is well-defined as a formal power series without issues of convergence. Explicitly, the coefficient of \(e_{i_1}\ldots e_{i_k}\) in \({\mathbf {e}}^x\) is \(x_1^{\beta _1}\ldots x_d^{\beta _d}/k!\), where \(\beta _n\) is the number of times n appears in \(i_1,\ldots , i_k\).
Then, for the path \(X:[0,1]\to \mathbb R^d\) defined by \(X_t=tx\),
$$\displaystyle \begin{aligned} S(X)_{0,1} = {\mathbf{e}}^x, \end{aligned}$$
which is essentially the claim we saw already in (17) from Example 2.16.
To state Chen’s identity in an algebraic way, it remains to define the concatenation of paths.
Definition 2.21 (Concatenation)
For two paths \(X : [a,b] \to \mathbb R^d\) and \(Y: [b,c] \to \mathbb R^d\), we define their concatenation as the path \(X * Y : [a,c] \to \mathbb R^d\) for which \((X * Y)_t = X_t\) for \(t \in [a,b]\) and \((X * Y)_t = X_b + (Y_t - Y_b)\) for \(t \in [b,c]\).
See Fig. 8 for a visual depiction of concatenation. We can now recast Chen’s identity (Theorem 2.17) in an algebraic light, namely that the signature is a map from the space of paths into the space of formal power series that takes the concatenation product \(*\) to the product \(\otimes \). More precisely, we have the following result.
Fig. 8
Example of two paths X and Y  and the resulting concatenation \(X*Y\)
Theorem 2.22 (Chen’s Identity)
For two paths\(X : [a,b] \to \mathbb R^d\)and\(Y: [b,c] \to \mathbb R^d\),
$$\displaystyle \begin{aligned} S(X * Y)_{a,c} = S(X)_{a,b} \otimes S(Y)_{b,c}. \end{aligned}$$
The contents of Theorems 2.17 and 2.22 are identical. We only state the latter version of Chen’s identity to highlight a useful algebraic structure. In particular, it allows us to state the following corollary that extends the computations from Examples 2.18 and 2.20.
Corollary 2.23 (Signature of Piecewise Linear Path)
Let\(X: [0,n] \to \mathbb R^d\)be a path that is linear on each interval\([0,1],\ldots ,[n-1,n]\). Then
$$\displaystyle \begin{aligned} S(X)_{0,n} = {\mathbf{e}}^{X_1-X_0} \otimes\ldots \otimes {\mathbf{e}}^{X_n-X_{n-1}}\;. \end{aligned}$$
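Corollary 2.23 translates directly into an algorithm. Below is a minimal sketch under the assumption that truncated tensor series are stored as dictionaries mapping words (tuples of 0-based indices) to coefficients; all function names are ours:

```python
from itertools import product as words
from math import factorial

def tensor_mul(x, y, depth):
    """Concatenation product in the truncated tensor algebra; series are
    dicts mapping words (tuples of 0-based indices) to coefficients."""
    out = {}
    for u, cu in x.items():
        for v, cv in y.items():
            if len(u) + len(v) <= depth:
                out[u + v] = out.get(u + v, 0.0) + cu * cv
    return out

def tensor_exp(v, depth):
    """e^v for a level-1 element v: the coefficient of e_{i_1}...e_{i_k}
    is v_{i_1}...v_{i_k} / k!  (Example 2.20)."""
    d, out = len(v), {(): 1.0}
    for k in range(1, depth + 1):
        for w in words(range(d), repeat=k):
            c = 1.0
            for i in w:
                c *= v[i]
            out[w] = c / factorial(k)
    return out

def piecewise_linear_sig(points, depth):
    """Corollary 2.23: the signature of the piecewise linear path through
    `points` is the tensor product of exponentials of the increments."""
    sig = {(): 1.0}
    for p, q in zip(points, points[1:]):
        inc = tuple(qi - pi for pi, qi in zip(p, q))
        sig = tensor_mul(sig, tensor_exp(inc, depth), depth)
    return sig

sig = piecewise_linear_sig([(3.0, 0.0), (3.0, 1.0), (0.0, 1.0)], depth=2)
print(sig[(0,)], sig[(1,)], sig[(0, 1)], sig[(1, 0)])  # -3.0 1.0 0.0 -3.0
```

Applied to the path of Example 2.18, this reproduces the level-1 and level-2 terms computed there via Chen's identity.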

2.3.4 Time-Reversal

The time-reversal property informally states that the signature \(S(X)_{a,b}\) of a path \(X : [a,b] \to \mathbb R^d\) is precisely the inverse under the product \(\otimes \) of the signature obtained by running X backwards in time. To make this precise, we make the following definition.
Definition 2.24 (Time-Reversal)
For a path \(X : [a,b] \to \mathbb R^d\), we define its time-reversal as the path \(\overleftarrow X : [a,b] \to \mathbb R^d\) for which \(\overleftarrow X_t = X_{a + b-t}\) for all \(t \in [a,b]\).
See Fig. 9 for a depiction of the time reversal of paths.
Fig. 9
Example of two paths X and Y  and their respective reversals \( \overleftarrow {X}\) and \( \overleftarrow {Y}\)
Theorem 2.25 (Time-Reversed Signature)
For a path\(X : [a,b] \to \mathbb R^d\), it holds that
$$\displaystyle \begin{aligned} {} S(X)_{a,b} \otimes S(\overleftarrow X)_{a,b} = 1. \end{aligned} $$
(21)
Explicitly, for all multi-indexes\((i_1,\ldots , i_k)\),
$$\displaystyle \begin{aligned} {} S(\overleftarrow X)^{i_1,\ldots ,i_k}_{a,b}=(-1)^kS(X)^{i_k,\ldots, i_1}_{a,b}. \end{aligned} $$
(22)
Remark 2.26
The element 1 in (21) should be understood as the formal power series where \(\lambda _0 = 1\) and \(\lambda _{i_1,\ldots , i_k} = 0\) for all \(k \geq 1\) and \(i_1,\ldots , i_k \in \{1,\ldots , d\}\), which is the identity element under the product \(\otimes \).
Proof
We will prove (22). As before, we proceed by induction. The claim is clearly true for \(k=0\) since \(S(X)_{a,b}^\emptyset =1\). Suppose now that (22) is true for some \(k-1\geq 0\). Then
$$\displaystyle \begin{aligned} \begin{array}{rcl} S(\overleftarrow X)^{i_1,\ldots ,i_k}_{a,b} & =&\displaystyle \int_a^b S(\overleftarrow X)^{i_1,\ldots ,i_{k-1}}_{a,t} \, d\overleftarrow X^{i_k}_t \\ & =&\displaystyle \int_a^b (-1)^{k-1}S(X)^{i_{k-1},\ldots ,i_{1}}_{a+b-t,b} \, dX^{i_k}_{a+b-t} \\ & =&\displaystyle \int_a^b (-1)^{k}S(X)^{i_{k-1},\ldots ,i_{1}}_{s,b} \, dX^{i_k}_{s} \\ & =&\displaystyle (-1)^{k}S(X)^{i_k,i_{k-1},\ldots ,i_{1}}_{a,b}, \end{array} \end{aligned} $$
where we used the inductive hypothesis in the second equality, the change of variable \(s=a+b-t\) in the third equality, and in the final equality we used the identity
$$\displaystyle \begin{aligned} S(X)^{j_1,\ldots ,j_k}_{a,b} = \int_a^b S(X)^{j_2,\ldots ,j_k}_{t,b} \, dX^{j_1}_t, \end{aligned}$$
which itself follows from (3). □
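The identity (22) can also be verified numerically. In the sketch below (the helper name `sig_term` is ours), signature terms are approximated by iterated left-point Riemann sums on a sampled path and on its reversal; coordinate indices are 0-based:

```python
def sig_term(path, index):
    """Approximate S(X)^{index} for a sampled path (list of tuples) by
    iterated left-point Riemann sums; coordinates are 0-based here."""
    n = len(path)
    level = [1.0] * n
    for i in index:
        acc = [0.0] * n
        for t in range(1, n):
            acc[t] = acc[t - 1] + level[t - 1] * (path[t][i] - path[t - 1][i])
        level = acc
    return level[-1]

N = 20_000
path = [(t / N, (t / N) ** 2) for t in range(N + 1)]  # X_t = (t, t^2)
rev = path[::-1]                                      # time-reversal of X

# (22): S(rev)^{i} = -S(X)^{i} and S(rev)^{i,j} = (-1)^2 S(X)^{j,i}
print(sig_term(rev, [0]), -sig_term(path, [0]))
print(sig_term(rev, [0, 1]), sig_term(path, [1, 0]))
```

Up to discretization error, the level-1 terms flip sign while the level-2 terms match their order-reversed counterparts, as (22) predicts.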

2.3.5 Log-Signature

We now define a transform of the signature called the log-signature. The log-signature is nothing but the logarithm of the signature in the algebra of formal power series. For a power series
$$\displaystyle \begin{aligned} x = \sum_{k = 0}^\infty \sum_{i_1,\ldots, i_k \in \{1,\ldots, d\}} \lambda_{i_1,\ldots,i_k}e_{i_1}\ldots e_{i_k} \end{aligned}$$
for which \(\lambda _0 > 0\), define its logarithm as the power series given by
$$\displaystyle \begin{aligned} \log x = \log(\lambda_0) - \sum_{n \geq 1}\frac{1}{n}\left(1-\frac{x}{\lambda_0}\right)^{\otimes n}, \end{aligned}$$
where the superscript \(\otimes n\) denotes the n-th power with respect to the product \(\otimes \).
In the case that \(\lambda _0=1\), which is the most important case for us, the logarithm reduces to
$$\displaystyle \begin{aligned} {} \log x = -\sum_{n \geq 1}\frac{1}{n}\left(1 - x\right)^{\otimes n}. \end{aligned} $$
(23)
For example, for a real number \(\lambda \in \mathbb R\) and the series
$$\displaystyle \begin{aligned} x = 1 + \sum_{k \geq 1} \frac{\lambda^k}{k!} e_{1}^{\otimes k}, \end{aligned}$$
one can readily check that
$$\displaystyle \begin{aligned} \log x = \lambda e_1. \end{aligned}$$
In general, \(\log x\) is a series with an infinite number of terms. However, for every multi-index \((i_1,\ldots , i_k)\), the coefficient of \(e_{i_1}\ldots e_{i_k}\) in \(\log x\) depends only on the coefficients of x of the form \(\lambda _{j_1,\ldots ,j_m}\) with \(m \leq k\), of which there are only finitely many, so that \(\log x\) is well-defined without the need to consider convergence of infinite series.
Definition 2.27 (Log-Signature)
For a path \(X : [a,b] \to \mathbb R^d\), the log-signature of X is defined as the formal power series \(\log S(X)_{a,b}\).
For two formal power series x and y, we define their Lie bracket by
$$\displaystyle \begin{aligned} {} [x,y] = x\otimes y - y \otimes x. \end{aligned} $$
(24)
A direct computation shows that the first few terms of the log-signature are given by
$$\displaystyle \begin{aligned} {} \log S(X)_{a,b} = \sum_{i=1}^d S(X)^i_{a,b}e_i + \sum_{1 \leq i < j \leq d} \frac{1}{2}\left(S(X)^{i,j}_{a,b}-S(X)^{j,i}_{a,b}\right)[e_i,e_j] + \ldots. \end{aligned} $$
(25)
In particular, the coefficients of the Lie brackets \([e_i,e_j]\) in the log-signature are precisely the Lévy areas introduced in Sect. 2.2.4.
Example 2.28
Consider the two-dimensional path
$$\displaystyle \begin{aligned} X : [0,2] \to \mathbb R^2, \; \; X_t = \begin{cases} (t,0) &\mbox{if } t \in [0,1], \\ (1,t-1) &\mbox{if } t \in [1,2]. \end{cases} \end{aligned}$$
Note that X is the concatenation of the two linear paths \(Y : [0,1] \to \mathbb R^2\), \(Y_t = (t,0)\) and \(Z:[1,2] \to \mathbb R^2\), \(Z_t = (1,t-1)\). One can readily check (see Example 2.20) that the signatures of Y and Z (as formal power series) are given by
$$\displaystyle \begin{aligned} S(Y)_{0,1} = 1 + \sum_{k \geq 1} \frac{1}{k!}e_{1}^{\otimes k}, \; \; S(Z)_{1,2} = 1 + \sum_{k \geq 1} \frac{1}{k!}e_{2}^{\otimes k}. \end{aligned}$$
It follows from Chen’s identity that
$$\displaystyle \begin{aligned} S(X)_{0,2} = S(Y)_{0,1} \otimes S(Z)_{1,2} = 1 + e_1 + e_2 + \frac{1}{2!}e_1e_1 + \frac{1}{2!}e_2e_2 + e_1e_2 + \ldots. \end{aligned}$$
Hence the first few terms of the log-signature of X are
$$\displaystyle \begin{aligned} \log S(X)_{0,2} = e_1 + e_2 + \frac{1}{2}[e_1,e_2] + \ldots. \end{aligned}$$
In fact, one can readily check that the coefficients of \(\log S(X)_{0,2}\) in the above example are given precisely by the classical Baker–Campbell–Hausdorff formula.
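The computation in Example 2.28 can be reproduced mechanically. The sketch below (the helper names `tensor_mul` and `tensor_log` are ours) applies the truncated logarithm (23) to the level-2 signature of X; tensor series are stored as dictionaries from words to coefficients, and the indeterminates \(e_1, e_2\) correspond to the 0-based indices 0 and 1:

```python
def tensor_mul(x, y, depth):
    """Concatenation product of truncated tensor series (dicts word -> coeff)."""
    out = {}
    for u, cu in x.items():
        for v, cv in y.items():
            if len(u) + len(v) <= depth:
                out[u + v] = out.get(u + v, 0.0) + cu * cv
    return out

def tensor_log(x, depth):
    """Logarithm (23) of a tensor series with constant coefficient 1:
    log x = sum_{n>=1} (-1)^{n+1} (x - 1)^{(x) n} / n, truncated at `depth`."""
    u = {w: c for w, c in x.items() if w != ()}  # u = x - 1
    out, power, sign = {}, dict(u), 1.0
    for n in range(1, depth + 1):
        for w, c in power.items():
            out[w] = out.get(w, 0.0) + sign * c / n
        power = tensor_mul(power, u, depth)
        sign = -sign
    return out

# Truncated signature of Example 2.28 (coefficients of 1, e1, e2, e1e1, ...).
sig = {(): 1.0, (0,): 1.0, (1,): 1.0,
       (0, 0): 0.5, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 0.5}
ls = tensor_log(sig, 2)
print(ls[(0,)], ls[(1,)])      # 1.0 1.0
print(ls[(0, 1)], ls[(1, 0)])  # 0.5 -0.5, i.e. (1/2)[e_1, e_2]
```

The level-2 part comes out antisymmetric, recovering the coefficient \(\tfrac12\) of \([e_1,e_2]\) found above.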
The above example demonstrates the general fact that the log-signature can always be expressed as a power series composed entirely of so-called Lie polynomials. This is the content of the following theorem due to Chen [12], which generalizes the Baker–Campbell–Hausdorff theorem.
Theorem 2.29
Let\(X : [a,b] \to \mathbb R^d\)be a path. Then there exist real numbers\(\lambda _{i_1,\ldots , i_k}\)such that
$$\displaystyle \begin{aligned} {} \log S(X)_{a,b} = \sum_{k \geq 1} \sum_{i_1,\ldots, i_k \in \{1,\ldots, d\}} \lambda_{i_1,\ldots, i_k} [e_{i_1},[e_{i_2},\ldots,[e_{i_{k-1}},e_{i_k}]\ldots ]]. \end{aligned} $$
(26)
Note that the coefficients \(\lambda _{i_1,\ldots , i_k}\) are not unique since the polynomials of the form \([e_{i_1},[e_{i_2},\ldots ,[e_{i_{k-1}},e_{i_k}]\ldots ]]\) are not linearly independent (e.g., \([e_1,e_2] = -[e_2,e_1]\)).
Theorem 2.29 is the only result in this subsection which we do not prove since all proofs, to our knowledge, require a non-trivial detour into Lie algebras. See [72, Cor. 3.5] or [28, Sec. 7.5] for two different proofs.

2.4 Expected Signatures as Moments of Path-Valued Random Variables

We saw in Example 2.16 that the signature generalizes multivariate polynomials to the space of paths and that the shuffle product (Theorem 2.14) generalizes the classical product of polynomials. In this subsection, we highlight the relevance of these observations to the theory of statistical learning by interpreting the expectation of the signature as the moments of path-valued random variables. We assume that the reader is familiar with the basics of probability theory, such as measures and expectation; see, e.g. [5] for background material. See also Sect. 2.5.4 for a further discussion on signatures, moments of random variables, and the Fourier transform.
Consider a probability measure \(\mathbb {P}\) on \(\mathbb R\). The moments of \(\mathbb {P}\), defined as the numbers \(\int _{\mathbb R} x^k \mathbb {P}(dx)\) for \(k=0,1,\ldots \), encode important information about \(\mathbb {P}\) (whenever they are well-defined) and play a prominent role in statistics. To see why, we recall the classical Weierstrass approximation theorem.
Theorem 2.30 (Weierstrass 1885)
Let\(f:[a,b]\to \mathbb R\)be a continuous function and\(\epsilon >0\). Then there exists a polynomial\(p(x)=a_0+a_1x+\cdots +a_kx^k\)with real coefficients\(a_i\)such that\(|f(x)-p(x)| < \epsilon \)for all\(x\in [a,b]\).
The relevance of this theorem to probability theory is that the expectation
$$\displaystyle \begin{aligned} \mathbb{P}(p) = \int_{\mathbb R} p(x) \mathbb{P}(dx) \end{aligned}$$
of any polynomial p under \(\mathbb {P}\) is determined by the moments of \(\mathbb {P}\) due to linearity of integration. Furthermore, recall that two probability measures \(\mathbb {P}\) and \(\mathbb {Q}\) are equal if and only if \(\mathbb {P}(f)=\mathbb {Q}(f)\) for all continuous functions \(f:\mathbb R\to \mathbb R\), see [5, Theorem 25.8]. These two remarks provide a simple way to test whether two probability measures on a finite interval are equal.
Corollary 2.31
Suppose\(\mathbb {P}\)and\(\mathbb {Q}\)are probability measures on\([a,b]\). Then\(\mathbb {P}=\mathbb {Q}\)if and only if the moments of\(\mathbb {P}\)and\(\mathbb {Q}\)are equal.
Proof
The direction \(\Rightarrow \) is obvious, so we now prove \(\Leftarrow \). Consider \(f:[a,b]\to \mathbb R\) continuous and let \(\epsilon >0\). By Theorem 2.30 there exists a polynomial p such that \(|f(x)-p(x)|<\epsilon \) for all \(x\in [a,b]\). In particular, by the triangle inequality applied to integrals, \(|\mathbb {P}(f)-\mathbb {P}(p)|<\epsilon \) and likewise for \(\mathbb {Q}\). But since \(\mathbb {P}(p)=\mathbb {Q}(p)\) by the assumption that the moments of \(\mathbb {P}\) and \(\mathbb {Q}\) match, we see that \(|\mathbb {P}(f)-\mathbb {Q}(f)|<2\epsilon \). Since \(\epsilon \) was arbitrary, it follows that \(\mathbb {P}(f)=\mathbb {Q}(f)\) for any continuous function \(f:[a,b]\to \mathbb R\) and thus \(\mathbb {P}=\mathbb {Q}\). □
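The argument above can be sketched numerically. In the following minimal example (our own illustration, not from the text), we take \(\mathbb {P}\) to be the uniform measure on \([0,1]\), whose k-th moment is \(1/(k+1)\), approximate the continuous function \(f(x)=e^x\) by a degree-8 least-squares polynomial, and recover \(\mathbb {P}(f)\approx e-1\) from the moments alone:

```python
import numpy as np

# P = Uniform[0, 1]; its k-th moment is 1/(k+1).
# Approximate f(x) = exp(x) by a degree-8 polynomial p (Theorem 2.30),
# then compute E_P[p] using only the moments of P (linearity of integration).
xs = np.linspace(0.0, 1.0, 200)
coeffs = np.polynomial.polynomial.polyfit(xs, np.exp(xs), 8)  # low-to-high order
moments = 1.0 / np.arange(1, len(coeffs) + 1)                 # E[x^k] = 1/(k+1)
expectation_from_moments = coeffs @ moments                    # = P(p) ≈ P(f)

# Exact value: ∫_0^1 e^x dx = e - 1
assert abs(expectation_from_moments - (np.e - 1.0)) < 1e-6
```

The small tolerance reflects how fast polynomial approximation of a smooth function converges on a compact interval, in line with the Weierstrass theorem.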
Remark 2.32
It is crucial that \(\mathbb {P}\) and \(\mathbb {Q}\) in the above theorem are defined on a bounded interval \([a,b]\) instead of on \(\mathbb R\). Indeed, one can find two distinct probability measures on \(\mathbb R\) whose moments are equal. Nonetheless, there exists an essentially sharp criterion on the growth rate of the moments of \(\mathbb {P}\) (Carleman’s condition) which guarantees that \(\mathbb {P}\) is determined by its moments, see [42].
We now recall a substantial generalization of Theorem 2.30, known as the Stone–Weierstrass theorem (see [73, Theorem 7.32]).
Theorem 2.33
Let\(\mathcal {X}\)be a compact Hausdorff space. Suppose\(\mathcal {F}\)is a vector space of functions\(f:\mathcal {X}\to \mathbb R\)such that
(i)
every\(f\in \mathcal {F}\)is continuous,
 
(ii)
the constant function 1 is in\(\mathcal {F}\)
 
(iii)
for every\(f,g\in \mathcal {F}\), one has\(fg\in \mathcal {F}\), and
 
(iv)
for all distinct\(x,y\in \mathcal {X}\), there exists\(f\in \mathcal {F}\)such that\(f(x)\neq f(y)\).
 
Then, for any continuous function\(g:\mathcal {X}\to \mathbb R\)and\(\epsilon >0\), there exists\(f\in \mathcal {F}\)such that\(|f(x)-g(x)|<\epsilon \)for all\(x\in \mathcal {X}\).
Put more succinctly, every point-separating algebra of continuous functions that contains 1 is dense in \(C(\mathcal {X}) = \{f:\mathcal {X}\to \mathbb R \,:\, f \text{ is continuous}\}\). (‘Algebra’ refers to item (iii) and ‘point-separating’ refers to item (iv).) We leave it as an exercise for the reader to verify that if \(\mathcal {X}=[a,b]\) and \(\mathcal {F}\) is the space of polynomials, then \(\mathcal {F}\) satisfies the assumptions of Theorem 2.33, and thus Theorem 2.33 implies Theorem 2.30.
Turning back to the signature, let \(\mathcal {X}\) denote the set of paths \(X:[a,b]\to \mathbb R^d\) that we equip with the bounded variation norm \(\|X\|_{BV} = |X_a| + \int _a^b |\dot X_t|dt\). Let \(\mathcal {F}\) denote the space of all functions \(f:\mathcal {X}\to \mathbb R\) of the form
$$\displaystyle \begin{aligned} {} X \mapsto \sum_{i=1}^n \lambda_i S(X)^{I_i}_{a,b}, \end{aligned} $$
(27)
where \(n\geq 1\), \(\lambda _i\in \mathbb R\), and \(I_i\) is a multi-index. We claim that all the conditions of Theorem 2.33 except (iv) are satisfied:
(i)
For every multi-index I, the function \(X\mapsto S(X)^I_{a,b}\) is continuous, the verification of which we leave as an exercise for the reader, hence every \(f\in \mathcal {F}\) is continuous.
 
(ii)
Since \(S(X)^\emptyset _{a,b}=1\), we clearly have \(1\in \mathcal {F}\).
 
(iii)
For every pair of multi-indexes I and J, by the shuffle product (Theorem 2.14), the function \(X\mapsto S(X)^I_{a,b}S(X)^J_{a,b}\) is in \(\mathcal {F}\), which implies that \(fg\in \mathcal {F}\) whenever \(f,g\in \mathcal {F}\).
 
Unfortunately, it is clear that the point-separating property (iv) is not satisfied. There are three reasons for this:
  • The signature is base-point invariant, i.e. \(S(X)_{a,b}=S(X+x)_{a,b}\), so no two paths that differ by a shift in \(\mathbb R^d\) can be separated by functions in \(\mathcal {F}\).
  • The signature is parametrization invariant (see Sect. 2.3.1), so no two paths that differ by a reparametrization of \([a,b]\) can be separated.
  • By Chen’s identity (Theorem 2.22) and the time-reversal property (Theorem 2.25), if X can be written as \(X=Y*\overleftarrow Y\) for a path Y , then \(S(X)_{a,b} = 1\) (understood as the formal power series), which is equal to the signature of a constant path Z, i.e. \(Z_t=z\) for all \(t\in [a,b]\) for some fixed \(z\in \mathbb R^d\). Therefore, X and Z cannot be separated, i.e. \(f(X)=f(Z)\) for all \(f\in \mathcal {F}\).
Nonetheless, it is possible to recover (iv) after a suitable adjustment of the space of paths \(\mathcal {X}\). The minimal adjustment necessary to achieve (iv) is to suitably quotient \(\mathcal {X}\) by the kernel of S, which consists of all tree-like paths (see Sect. 2.5.2).
However, we demonstrate a simpler, more pragmatic approach to achieve (iv), which is to shrink \(\mathcal {X}\). More precisely, let us fix \(y\in \mathbb R^{d-1}\) and consider the set of paths
$$\displaystyle \begin{aligned} \mathcal{Y} = \{Y = (Y^1,\ldots, Y^d) \in \mathcal{X} \,:\, \forall t\in[a,b]\;\; Y^1_t=t,\;\; Y_a = (a,y)\in\mathbb R^d \} . \end{aligned}$$
In plain words, \(\mathcal {Y}\) consists of those paths \(Y: [a,b]\to \mathbb R^d\) in \(\mathcal {X}\) such that \(Y^1\) is the time-component and whose initial point is \((a,y)\). (The fact that the first component is time is not crucial. What is crucial is that there exists a fixed strictly monotone function \(f:[a,b]\to \mathbb R\) such that every \(Y\in \mathcal {Y}\) is of the form \(Y_t=(f_t,\bar Y_t)\).) The next result shows that the signature separates the points of \(\mathcal {Y}\).
Proposition 2.34
Suppose that\(X,Y\in \mathcal {Y}\)with\(S(X)^I_{a,b}=S(Y)^I_{a,b}\)for all multi-indexes I. Then\(X=Y\).
Proof
Recall from Example 2.6 that \(S(X)^{1,\ldots ,1}_{a,t}=\frac {(t-a)^k}{k!}\) where ‘1’ appears k times in \(1,\ldots ,1\). Then, for any \(i=2,\ldots ,d\),
$$\displaystyle \begin{aligned} S(X)^{1,\ldots, 1, i}_{a,b} = \int_a^b S(X)^{1,\ldots,1}_{a,t}\dot X^i_t \, dt = \int_a^b \frac{(t-a)^k}{k!} \dot X^i_t \, dt. \end{aligned}$$
It follows that the sequence \(S(X)^{i}_{a,b},S(X)^{1,i}_{a,b},S(X)^{1,1, i}_{a,b},\ldots \) encodes all the moments of the measure \(\dot X^i_t \, dt\) on \([a,b]\). The fact that \(S(X)_{a,b}=S(Y)_{a,b}\) thus implies that the measures \(\dot X^i_t \, dt\) and \(\dot Y^i_t \, dt\) have the same moments and are therefore equal (we leave this final implication as an exercise to the reader with the hint that one should use Theorem 2.30). Since \(X_t= X_a+\int _a^t\dot X_s \,ds\) and \(X_a= Y_a\), it follows that \(X=Y\). □
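The moment identity at the heart of this proof is easy to check numerically. As a hedged sketch (the path \(Y_t=(t,t^2)\) on \([0,1]\) is our own illustrative choice, not from the text), note that \(\dot Y^2_t = 2t\), so \(S(Y)^{1,\ldots ,1,2}_{0,1}\) with k ones equals \(\int _0^1 \frac {t^k}{k!}\, 2t \, dt = \frac {2}{k!\,(k+2)}\):

```python
import math
import numpy as np

# Illustrative path Y_t = (t, t^2) on [0, 1], so that dY^2/dt = 2t.
# Then S(Y)^{1,...,1,2}_{0,1} with k ones is the integral of t^k/k! * 2t,
# whose closed form is 2 / (k! * (k + 2)).
t = np.linspace(0.0, 1.0, 100001)
ydot2 = 2.0 * t
for k in range(4):
    f = t**k / math.factorial(k) * ydot2
    riemann = np.sum((f[:-1] + f[1:]) / 2.0 * np.diff(t))  # trapezoid rule
    closed_form = 2.0 / (math.factorial(k) * (k + 2))
    assert abs(riemann - closed_form) < 1e-8
```

Each such entry of the signature is (up to the factor \(k!\)) the k-th moment of the measure \(\dot Y^2_t\,dt\) on \([0,1]\), exactly as used in the proof.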
We therefore see that the set of functions \(\mathcal {F}\) defined by (27) separates the points of \(\mathcal {Y}\). Consequently, as we already verified all the other conditions for Theorem 2.33, we obtain the following result, which shows that arbitrary continuous functions on paths can be approximated by linear functions of the signature.
Proposition 2.35
Consider a compact set\(K\subset \mathcal {Y}\), continuous function\(g: K\to \mathbb R\), and\(\epsilon >0\). Then there exists\(f\in \mathcal {F}\)such that\(|f(X)-g(X)|<\epsilon \)for all\(X\in K\). More precisely, there exists an integer\(n\geq 1\), real numbers\(\lambda _1,\ldots ,\lambda _n\), and multi-indexes\(I_1,\ldots , I_n\)such that, for all\(X\in K\),
$$\displaystyle \begin{aligned} \Big|g(X) - \sum_{i=1}^n \lambda_i S(X)^{I_i}_{a,b}\Big| < \epsilon. \end{aligned}$$
By precisely the same argument used to prove Corollary 2.31, we obtain the following corollary of Proposition 2.35 that justifies the title of this subsection.
Corollary 2.36
Suppose\(\mathbb {P}\)and\(\mathbb {Q}\)are probability measures on a compact set\(K\subset \mathcal {Y}\)with equal expected signatures, i.e.\(\int _K S(X)^I_{a,b} \mathbb {P}(dX)=\int _K S(X)^I_{a,b} \mathbb {Q}(dX)\)for all multi-indexes I. Then\(\mathbb {P}=\mathbb {Q}\).
Remark 2.37
It is possible to significantly extend this result, in particular dropping the assumption that \(\mathbb {P}\) and \(\mathbb {Q}\) are defined on a compact subset of paths, see Sect. 2.5.4.

2.5 Further Topics and Extensions

We conclude the first part of this chapter with a brief discussion on several more advanced topics. In particular, we discuss
1.
the signature of rough paths,
 
2.
the extent to which the signature determines the underlying path,
 
3.
extensions to jump processes,
 
4.
the role of the signature in the moment problem for random paths,
 
5.
extensions to higher dimensions.
 

2.5.1 Rough Paths

Our discussion of the signature has so far been restricted to paths which are piecewise continuously differentiable (or more generally of bounded variation). This restriction was needed to ensure that the iterated integrals of the path existed as Riemann–Stieltjes integrals. More generally, one can define the iterated integrals of a path using the Young integral for any path of finite p-variation with \(1 \leq p < 2\). The Young integral goes beyond the “classical” definition of the integral and is already able to cover a class of paths substantially more irregular than those of bounded variation.
A problem arises for paths of infinite p-variation for every \(p < 2\) (a situation of great interest in stochastic analysis because the sample paths of Brownian motion almost surely have finite p-variation if and only if \(p > 2\)): there is no well-defined notion of an iterated integral for such paths. We stress that this is not due to any technical limitation of the Young integral, but rather to the fact that there is no canonical candidate for the iterated integrals; in general, there are many inequivalent ways to define them.
One of the key observations of T. Lyons in his introduction of rough paths in [53] was that if one defines the first \(\lfloor p \rfloor \) iterated integrals of a path X of finite p-variation, then there is a unique way to obtain all the other iterated integrals, and hence the signature of X. This notion of a path of finite p-variation, along with its first \(\lfloor p\rfloor \) iterated integrals (which may be defined arbitrarily provided that they satisfy Chen’s identity and possess finite p-variation), is precisely the definition of a p-rough path.
A key result in the theory of rough paths, known as the universal limit theorem, is that one can give meaning to controlled differential equations where the driver is a so-called geometricp-rough path, and where the solution depends in a continuous way on the driver provided that the space of p-rough paths is equipped with a suitable topology (known as the p-variation metric). We refer the interested reader to the St. Flour lecture notes of Lyons et al. [57] for an introduction to the theory of rough paths and to the monograph of Friz and Hairer [26] for a recent treatment of the topic.

2.5.2 Path Uniqueness

As discussed in Sect. 2.2.3, the signature of a path \(X : [a,b] \to \mathbb R^d\) is all that is needed to determine the endpoint of the solution to a linear (and, less trivially, non-linear) differential equation driven by X. This was first shown by Chen [13] in the smooth setting, and later extended by Hambly and Lyons [36] and Boedihardjo et al. [7] to less regular paths. The works of these authors show that the signature captures deep geometric properties of a path, which we briefly discuss here.
A natural question that one may ask is the following: is a path completely determined by its signature? The answer is no for the reasons described above Proposition 2.34. For example, one can never recover from the signature the exact speed at which the path is traversed (due to invariance under time reparametrizations), nor can one tell apart the signature of a trivial constant path and that of a path concatenated with its time-reversal.
However, a highly non-trivial fact is that this is essentially the only information lost by the signature. For example, for a path X that never crosses itself, the signature completely describes the image and direction of traversal of the path (that is, all the points that X visits and the order in which it visits them), modulo its starting point. This demonstrates the signature’s ability to completely determine the geometric properties of a path that does not possess degeneracies consisting of movements going directly back onto itself. More precisely, it is shown in [7, 36] that, for two paths \(X,Y\), one has \(S(X)= S(Y)\) if and only if X and Y  are tree-like equivalent. However, it remains a challenging problem to effectively recover properties of a path from its signature: see Lyons and Xu [55, 56] and Geng [31] for progress on this question.

2.5.3 Paths with Jumps

For many theoretical and practical applications, it is necessary to consider paths with jumps (i.e. discontinuous paths). This is important, for example, in finance [23], superdiffusive dynamical systems [18, 22, 62], and queuing theory [76]. It turns out that, under suitable regularity assumptions on the path (satisfied, for example, by almost every sample path of semi-martingales), there exists a meaningful way to build a signature.
The basic idea is to connect the jumps of a discontinuous path \(X:[a,b]\to \mathbb R^d\) with straight lines by adding an ‘infinitesimal’ extra time \(\delta \) to the interval. The resulting object is a continuous path \(X^\delta :[a,b+\delta ] \to \mathbb R^d\), for which the signature can be computed as normal and is independent of the extra time \(\delta \). For example, if X is piecewise constant with finitely many jumps, the signature of \(X^\delta \) is the same as that of the piecewise linear path connecting the jumps of X in the respective order. This procedure to build a continuous path \(X^\delta \) from X works under natural regularity assumptions, e.g. for X càdlàg (right-continuous with left-limits), and does not require that X has finitely many jumps. These ideas stem from the work of Marcus [59, 60] (see also [3, 11, 46] and the references therein) and were later developed in the context of rough paths [14, 15, 27, 77] (see in particular [14, 15], where the requirement that the jumps of X are connected with straight lines is relaxed).
We also mention another way to interpret rough paths with jumps, which relies on so-called forward (or Itô) integration [29] and leads to a notion of signature; the resulting object is most naturally interpreted in the sense of branched rough paths [34], as it no longer satisfies the shuffle product identity.

2.5.4 Moment Problem and Random Paths

The moment problem in probability theory is the following: given two probability measures \(\mathbb {P}\) and \(\mathbb {Q}\) on \(\mathbb R\) with equal moments, i.e. \(\int _{\mathbb R} x^k\mathbb {P}(dx) = \int _{\mathbb R} x^k\mathbb {Q}(dx)\) for all \(k \geq 0\), does it hold that \(\mathbb {P}=\mathbb {Q}\)? As alluded to in Remark 2.32, the answer is, in general, negative, but there exist sufficient (and necessary) conditions on the growth of the moments \(\int _{\mathbb R} x^k\mathbb {P}(dx)\) under which the answer is positive.
Recall also from Sect. 2.4, especially Corollary 2.36, that the shuffle product identity and the Stone–Weierstrass theorem together imply that a probability measure on a compact set of paths (on which the signature map is injective) is uniquely determined by its expected signature. This gives a positive answer to the moment problem in the simplest possible case (compact support of measures).
Given that, for probability measures on \(\mathbb R\), compactness of support is a sufficient but not necessary condition to solve the moment problem, it is natural to ask whether one can relax the compactness assumption in the solution to the moment problem for signatures. The answer turns out to be ‘yes’ and the following sufficient condition was found in [16]: provided that the expectations \(\mathbb {E}[S(X)^{i_1,\ldots ,i_k}_{a,b}]\) decay with k faster than any geometric sequence, then these expectations uniquely determine the law of \(S(X)_{a,b}\).1
The condition of [16] is satisfied, for example, by classes of Lévy processes including Brownian motion (suitably interpreted as geometric rough paths), for which the expected signature is explicitly known [27], and for classes of Gaussian and Markovian processes for which the expected signature is not explicitly known [10, 16]. In turn, it is known [8, 50] that certain processes (e.g. Brownian motion stopped upon exiting a domain) do not satisfy the necessary moment decay condition of [16], and it remains open whether the laws of their signatures are determined by their expected values. A related open problem is to find sufficient and necessary conditions for a solution to the moment problem for signatures, akin to the sharp Carleman’s condition (see Remark 2.32).
The method used in [16] to resolve the moment problem for signatures is to introduce a suitable noncommutative Fourier transform, which, as in the classical case, uniquely determines the law of any random variable without any assumptions on the moments. Again for Lévy processes, it is possible to derive a Lévy–Khintchine formula that determines the Fourier transform explicitly, see [14, Sec. 5.3.2].

2.5.5 Higher Dimensions

We conclude this section by mentioning several recent works that aim to extend the concept of signatures to functions defined over multi-dimensional space, i.e. functions \(X:\Omega \to \mathbb R^d\) where \(\Omega \subset \mathbb R^n\) with possibly \(n\geq 2\), instead of just paths \(X:[a,b]\to \mathbb R^d\). Due to the lack of a canonical ordering on points in \(\mathbb R^n\) for \(n\ge 2\), it is non-trivial to extend the notion of signature to such X in a way that preserves properties such as the shuffle product and Chen’s identity from Sect. 2.3. Nonetheless, such an extension clearly has value since many types of data, e.g., images, are spatial (rather than temporal) and lack a time-ordered structure.
Zhang et al. [81], inspired by [75], recently defined 2D signatures and applied this concept to image and texture classification. Their approach was based on rectangular increments and was restricted to level-1 and level-2 (generalizations of) signatures. Giusti et al. [32] provided an (essentially topological) extension of signatures based on cubical mapping spaces that applies to arbitrary dimensions and levels. The authors therein show generalizations of the shuffle product and Chen’s identity and are able to partially address the uniqueness question discussed in Sect. 2.5.2. Recently, Lee and Oberhauser [48] and then [20, 47] provided another generalization of the signature to surfaces which stems from higher gauge theory. See also [21] that provides a candidate for a multi-dimensional signature based on the notion of a model from the theory of regularity structures [35], which is used in the analysis of singular stochastic partial differential equations. This generalization, however, relies on several (a priori ad hoc) choices, including an integration operator that replaces the (somewhat canonical) Heaviside function in the one-dimensional case; it remains open if important properties of the signature extend to this construction. Finally, we mention [24] that introduces two-parameter sums signatures with applications to a multi-dimensional version of dynamic time warping.

3 Practical Applications

One practical application of the signature transformation lies in machine learning. As has already been discussed, the signature summarizes important information about a path. When the path is composed of a sequential data stream \(\{X_i\}\), the terms of the signature \(S(X)^{ijk\ldots }\) are good candidates for characteristic features of the data stream. The shuffle product property allows us to represent a non-linear function of the signature as a linear combination of iterated integrals; this is similar to a basis function expansion and applies naturally to regression models. In the next sections we demonstrate numerical computations of signatures of paths from various data streams and applications of signatures to machine learning problems, namely classification.

3.1 Elementary Ingredients and Their Properties

In this section, we introduce the essential ingredients of the signature method and demonstrate how to perform basic computations.

3.1.1 Paths from Discrete Data

Our first task is to transform discrete sequential (i.e., time series) data into continuous paths, as signatures are typically defined for continuous paths rather than a collection of discrete data points. We start with basic computations of the signature applied to data streams. Let X be a data stream of \(N+1\) points in d dimensions (\(X_{t_i}\in \mathbb {R}^d\)) observed at times \(t_0 < t_1 < t_2 < \ldots < t_N\):
$$\displaystyle \begin{aligned} {} X=\left\{\Bigl(X^{1}_{t_i},X^2_{t_i},X^3_{t_i},\ldots ,X^d_{t_i}\Bigr)\right\}_{i=0}^N. \end{aligned} $$
(28)
One can think of \(\Bigl (X^{1}_{t_i},X^2_{t_i},X^3_{t_i},\ldots ,X^d_{t_i}\Bigr )\) as the coordinates of a data point in \(\mathbb {R}^d\) recorded at time \(t_i\). For example, consider two one-dimensional sequences of length four:
$$\displaystyle \begin{aligned} \{X^1_{t_i}\}_{i=0}^{N=3} &= \{1,3,8,6\},{} \end{aligned} $$
(29)
$$\displaystyle \begin{aligned} \{X^2_{t_i}\}_{i=0}^{N=3} &= \{1,4,2,5\}.{} \end{aligned} $$
(30)
We are interested in transforming this discrete series into a continuous path. Among various ways to find this transformation, we focus on two main approaches:
(a)
Piecewise linear interpolation.
This interpolation method connects consecutive data points by linear segments (a local linear approximation), so the resulting continuous path is a piecewise combination of linear functions. Formally, for each pair of points \(X_{t_i}\) and \(X_{t_{i+1}}\), the path on the interval \(t\in [t_i, t_{i+1}]\) is defined by
$$\displaystyle \begin{aligned} {} \hat{X}_t = X_{t_i} + \frac{t-t_i}{t_{i+1} - t_i}\left(X_{t_{i+1}} - X_{t_i}\right). \end{aligned} $$
(31)
By concatenating the linearly interpolated segments, one arrives at a continuous path \(\hat {X}_t\) defined for all \(t\in [t_0, t_N]\).
 
(b)
Rectilinear interpolation (i.e. axis path).
Another way to connect \(X_{t_i}\) and \(X_{t_{i+1}}\) data points is by updating consecutively each coordinate of \(X_{t_i}\) in a piecewise linear fashion along the d axes until we reach \(X_{t_{i+1}}\). In this way, on the interval \(t\in [t_i, t_{i+1}]\) with \(i=0,1, \ldots , N-1\), one augments a pair of two consecutive points by \(d-1\)auxiliary data points of which we take the piecewise linear interpolation as in (31). For example, if \(d=3\), the rectilinear path over \([t_i,t_{i+1}]\) is the piecewise linear interpolation of the four data points
$$\displaystyle \begin{aligned} \{(X^1_{t_i}, X^2_{t_i}, X^3_{t_i}), (X^1_{t_{i+1}}, X^2_{t_i},X^3_{t_i}), (X^1_{t_{i+1}}, X^2_{t_{i+1}},X^3_{t_i}), (X^1_{t_{i+1}}, X^2_{t_{i+1}},X^3_{t_{i+1}})\}. \end{aligned}$$
Remark that, for the computation of the signature, it is not important how the rectilinear path is precisely parametrized over \([t_i,t_{i+1}]\) due to the parametrization invariance of the signature (Sect. 2.3.1).
 
The piecewise linear and rectilinear interpolation schemes using the data streams in (29) and (30) are presented in Fig. 10a and b respectively. Note that the auxiliary points denoted by empty red circles were added to construct the rectilinear paths.
Fig. 10
Examples of piecewise linear (a) and rectilinear (b) interpolation schemes applied to the two-dimensional stream \(\{(X^1, X^2)\}\) from (29) and (30)
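Both interpolation schemes are straightforward to implement. The following sketch (helper names `linear_path` and `rectilinear_points` are our own, not from the text) evaluates the piecewise linear path (31) and builds the auxiliary points of the rectilinear path:

```python
import numpy as np

def linear_path(times, points, t):
    """Evaluate the piecewise linear interpolation (31) at time t."""
    times, points = np.asarray(times, float), np.asarray(points, float)
    i = np.clip(np.searchsorted(times, t, side="right") - 1, 0, len(times) - 2)
    w = (t - times[i]) / (times[i + 1] - times[i])
    return points[i] + w * (points[i + 1] - points[i])

def rectilinear_points(points):
    """Insert axis-aligned auxiliary points between consecutive data points."""
    points = np.asarray(points, float)
    out = [points[0]]
    for p, q in zip(points[:-1], points[1:]):
        cur = p.copy()
        for k in range(len(p)):        # update one coordinate at a time
            cur = cur.copy()
            cur[k] = q[k]
            out.append(cur.copy())
    return np.array(out)

# The 2D stream built from (29) and (30):
pts = [[1, 1], [3, 4], [8, 2], [6, 5]]
rect = rectilinear_points(pts)
assert rect.tolist() == [[1, 1], [3, 1], [3, 4], [8, 4], [8, 2], [6, 2], [6, 5]]
assert np.allclose(linear_path([0, 1, 2, 3], pts, 0.5), [2.0, 2.5])
```

The red auxiliary points of Fig. 10b are exactly the intermediate rows returned by `rectilinear_points`.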

3.1.2 Auxiliary Path Transformations

Although data points can be transformed directly into paths as in Sect. 3.1.1, in certain cases, auxiliary data transformations may improve statistical learning with signatures for downstream tasks. Such transformations may increase or decrease the dimension d of the data stream and lead to more effective representations of sequential data that enhance the model’s ability to learn, generalize, and make accurate predictions. In this section, we present several frequently used transformations.

3.1.3 Cumulative Sum

The cumulative sum, also known as the running total, is the sequence of partial sums of a given sequence of numbers. In other words, it is the sequence of sums of the elements of the original sequence, where each sum is calculated by adding the previous sum to the current element of the sequence. More precisely, the cumulative sum is defined by
$$\displaystyle \begin{aligned} {} \mathrm{CS}(\{X_{t_i}\}_{i=0}^{N}) = \{X_{t_0},\, X_{t_0}+X_{t_1},\, \dots,\, S_k,\, \dots,\, S_N\}, \qquad S_k = \sum_{i=0}^{k}X_{t_i}. \end{aligned} $$
(32)
Using again the previous example (29) and (30), the new data streams become
$$\displaystyle \begin{aligned} \{\tilde{X^1}\} = \mathrm{CS}(X^1) &= \{1, 4, 12, 18\},\\ \{\tilde{X^2}\} = \mathrm{CS}(X^2) &= \{1, 5, 7, 12\}. \end{aligned} $$
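In code, (32) is simply a cumulative sum; a minimal check against the values above:

```python
import numpy as np

x1, x2 = [1, 3, 8, 6], [1, 4, 2, 5]  # the streams from (29) and (30)
assert np.cumsum(x1).tolist() == [1, 4, 12, 18]
assert np.cumsum(x2).tolist() == [1, 5, 7, 12]
```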

3.1.4 Base-Point Augmentation

As discussed in Sect. 2, the signature transformation is translation invariant, meaning that paths at different locations in \(\mathbb {R}^d\) produce the same signature. In some cases it may be useful to remove this translation invariance, for example, to capture the absolute values of data streams rather than only their increments. To do this, an anchor data point, typically the origin 0, can be added at the beginning of the data stream, which removes the translation invariance of the signature transformation. Using examples (29) and (30), the augmented data streams with \(N+2\) data points become
$$\displaystyle \begin{aligned} {} \begin{aligned} \mathrm{BP}(\{X^1_{t_i}\}_{i=0}^{N=3}) = \{0, \{X^1_{t_i}\}_{i=0}^{N=3}\} &= \{0,1,3,8,6\},\\ \mathrm{BP}(\{X^2_{t_i}\}_{i=0}^{N=3}) = \{0, \{X^2_{t_i}\}_{i=0}^{N=3}\} &= \{0,1,4,2,5\}. \end{aligned} \end{aligned} $$
(33)
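The augmentation itself is a one-line operation; the following sketch (the helper name `base_point` is our own) reproduces (33):

```python
import numpy as np

def base_point(x, anchor=0.0):
    """Prepend an anchor value so that shifted copies of x are distinguishable."""
    return np.concatenate(([anchor], np.asarray(x, dtype=float)))

assert base_point([1, 3, 8, 6]).tolist() == [0.0, 1.0, 3.0, 8.0, 6.0]
assert base_point([1, 4, 2, 5]).tolist() == [0.0, 1.0, 4.0, 2.0, 5.0]
```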

3.1.5 Time Augmentation

This transformation adds a univariate component with monotonically increasing (or decreasing) values that can be interpreted as an auxiliary time. For example, consider a new (unevenly sampled) stream \(\{t\}\)
$$\displaystyle \begin{aligned} \{t_i\}_{i=0}^{N=3} = \{0,1,3,6\}, \end{aligned}$$
and augmenting \(\{X^1\}_{i=0}^{N=3}\), \(\{X^2\}_{i=0}^{N=3}\) and \(\{(X^1, X^2)\}_{i=0}^{N=3}\) from (29) and (30) leads to
$$\displaystyle \begin{aligned} \mathrm{TA}(\{X^1_{t_i}\}_{i=0}^{N=3}) &= \{(t_i, X^1_{t_i})\}_{i=0}^{N=3} = \{(0,1), (1,3), (3,8), (6,6)\},\\ \mathrm{TA}(\{X^2_{t_i}\}_{i=0}^{N=3}) &= \{(t_i, X^2_{t_i})\}_{i=0}^{N=3} = \{(0,1), (1,4), (3,2), (6,5)\},\\ \mathrm{TA}(\{(X^1_{t_i}, X^2_{t_i})\}_{i=0}^{N=3}) &= \{(t_i, X^1_{t_i}, X^2_{t_i})\}_{i=0}^{N=3} = \{(0,1,1), (1,3,4), (3,8,2), (6,6,5)\}. \end{aligned} $$
The signature of a stream integrated against a monotonic auxiliary stream (‘time’) can capture stream-specific effects and can be related to common statistics (e.g., mean). Intuitively, the integral of a stream against the auxiliary time component will describe the area under the path, which in some applications may capture important properties of the data process via the path and serve as an informative feature for downstream analytical tasks.
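As a minimal sketch, time augmentation amounts to stacking the time stream as an extra column:

```python
import numpy as np

t = [0, 1, 3, 6]
x1, x2 = [1, 3, 8, 6], [1, 4, 2, 5]   # streams from (29) and (30)

ta = np.column_stack((t, x1, x2))     # TA({(X^1, X^2)})
assert ta.tolist() == [[0, 1, 1], [1, 3, 4], [3, 8, 2], [6, 6, 5]]
```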
In some cases, knowing the difference between the timestamps may be informative, particularly for unevenly distributed data. To illustrate this scenario, consider streams \(\{X^1\}\) and \(\{X^2\}\) and their respective timestamps \(\{t_i\}\) as shown in Table 1a.
Table 1
Example streams
(a)
\(t_i\)        \(X^1\)   \(X^2\)
01-January     1         1
02-January     3         4
04-January     8         2
07-January     6         5
08-January     9         3
(b)
\(t_i\)        \(\hat{t}_i\)   \(\Delta t_i\)   \(X^1\)   \(X^2\)
01-January     0               —                1         1
02-January     1               1                3         4
04-January     3               2                8         2
07-January     6               3                6         5
08-January     7               1                9         3
In addition to the original timescale stream \(\{t_i\}\), we can introduce two auxiliary time-related streams:
$$\displaystyle \begin{aligned} \begin{array}{rcl} \hat{t}_i & =&\displaystyle \{t_i - t_0\}_{i=0}^{N=4} = \{0, 1, 3, 6, 7\},\\ \Delta t_i & =&\displaystyle \{t_{i} - t_{i-1}\}_{i=1}^{N=4} = \{1, 2, 3, 1\}. \end{array} \end{aligned} $$
Putting all streams together and aligning them in time, we get a 5-dimensional stream as shown in Table 1b. Intuitively, \(\{\hat {t}_i\}\) corresponds to time since the beginning and \(\{\Delta t_i\}\) to time differences between two consecutive timestamps. Note that, since there is no time difference for the first data point, some heuristic can be used to fill in the missing value, such as arbitrarily setting it to zero.
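Both auxiliary time streams are one-liners given the day offsets of the timestamps in Table 1:

```python
import numpy as np

t = np.array([0, 1, 3, 6, 7])   # day offsets of 01-, 02-, 04-, 07-, 08-January
t_hat = t - t[0]                # time since the beginning
dt = np.diff(t)                 # gaps between consecutive timestamps

assert t_hat.tolist() == [0, 1, 3, 6, 7]
assert dt.tolist() == [1, 2, 3, 1]
```

Note that `dt` has one fewer entry than `t`, which is exactly the missing first value discussed above.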

3.1.6 Lead-Lag Transformation

A lead-lag transformation shifts a time series in time by a specified number of steps: a series is “led” when shifted forward in time and “lagged” when shifted backward. Lead-lag transformations can be used to analyze the relationships between different signals. In our setting, the lead-lag transformation maps a one-dimensional path into a two-dimensional path, and certain statistics, such as the quadratic variation of a stochastic process, can be represented in terms of the signature of the lead-lag transformed process (see Sect. 3.1.14).
Let X be a d-dimensional data stream with \(N+1\) data points as in (28). Following the definition in [25], its lead and lag transformed streams \(\{X^{\mathrm {Lead}}_{t_j}\}_{j=0}^{2N}\) and \(\{X^{\mathrm {Lag}}_{t_j}\}_{j=0}^{2N}\), each with \(2N+1\) data points, are defined as
$$\displaystyle \begin{aligned} {} X^{\mathrm{Lead}}_{t_j} \mapsto \begin{cases} X_{t_i} & \text{if} \;\; j=2i,\\ X_{t_i} & \text{if} \;\; j=2i-1,\\ \end{cases} \end{aligned} $$
(34)
and
$$\displaystyle \begin{aligned} {} X^{\mathrm{Lag}}_{t_j} \mapsto \begin{cases} X_{t_i} & \text{if} \;\; j=2i,\\ X_{t_i} & \text{if} \;\; j=2i+1.\\ \end{cases} \end{aligned} $$
(35)
It is clear from (34) and (35) that each value of i corresponds to three values of j:
$$\displaystyle \begin{aligned} \begin{array}{rcl} j & =&\displaystyle 2i-1,\\ j & =&\displaystyle 2i,\\ j & =&\displaystyle 2i+1. \end{array} \end{aligned} $$
For example, given a one-dimensional sequence with four values
$$\displaystyle \begin{aligned} \{X_{t_i}\}_{i=0}^{N=3} = \{X_{t_0}, X_{t_1}, X_{t_2}, X_{t_3}\}, \end{aligned}$$
by substituting \(i\in [0,1,2,3]\) in (34) the components of the lead stream of \(\{X_{t_i}\}\) are
$$\displaystyle \begin{aligned} \begin{array}{rcl} X^{\mathrm{Lead}}_{t_0} & = &\displaystyle X_{t_0},\;\; X^{\mathrm{Lead}}_{t_1} = X^{\mathrm{Lead}}_{t_2} = X_{t_1}, \\ X^{\mathrm{Lead}}_{t_3} & = &\displaystyle X^{\mathrm{Lead}}_{t_4} = X_{t_2},\;\; X^{\mathrm{Lead}}_{t_5} = X^{\mathrm{Lead}}_{t_6} = X_{t_3}, \end{array} \end{aligned} $$
and correspondingly of the lag (35) stream of \(\{X_{t_i}\}\) are
$$\displaystyle \begin{aligned} \begin{array}{rcl} X^{\mathrm{Lag}}_{t_0} & = &\displaystyle X^{\mathrm{Lag}}_{t_1} = X_{t_0},\;\; X^{\mathrm{Lag}}_{t_2} = X^{\mathrm{Lag}}_{t_3} = X_{t_1}, \\ X^{\mathrm{Lag}}_{t_4} & = &\displaystyle X^{\mathrm{Lag}}_{t_5} = X_{t_2},\;\; X^{\mathrm{Lag}}_{t_6} = X^{\mathrm{Lag}}_{t_7} = X_{t_3}, \end{array} \end{aligned} $$
or more compactly
$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathrm{Lead}: \{X_{t_i}\}_{i=0}^{3}\mapsto\{X^{\mathrm{Lead}}_{t_j}\}_{j=0}^{6} & = &\displaystyle \{X_{t_0}, X_{t_1}, X_{t_1}, X_{t_2}, X_{t_2}, X_{t_3}, X_{t_3}\}, \\ \mathrm{Lag}: \{X_{t_i}\}_{i=0}^{3}\mapsto\{X^{\mathrm{Lag}}_{t_j}\}_{j=0}^{6} & = &\displaystyle \{X_{t_0}, X_{t_0}, X_{t_1}, X_{t_1}, X_{t_2}, X_{t_2}, X_{t_3}\}. \end{array} \end{aligned} $$
One can immediately see that the lead and lag transforms are obtained from the original data stream by duplicating each data point and then deleting the first (lead) or last (lag) point of the doubled sequence; the two streams are time-shifted copies of each other.
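This doubling construction is easy to implement. Below is a minimal Python sketch (the function name `lead_lag` is ours): each point is duplicated, and dropping the first or last point of the doubled sequence yields the lead or lag stream, respectively.

```python
def lead_lag(x):
    """Lead and lag transforms of a sequence, per (34)-(35).

    Duplicate every point; the lead stream drops the first point of the
    doubled sequence, the lag stream drops the last.
    """
    doubled = [v for v in x for _ in range(2)]
    return doubled[1:], doubled[:-1]

# Four-point example from the text:
lead, lag = lead_lag([1, 3, 8, 6])
# lead == [1, 3, 3, 8, 8, 6, 6], lag == [1, 1, 3, 3, 8, 8, 6]
```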
Considering the sequence \(X^1\) from (29), its lead-lag (LL) mapping is given by
$$\displaystyle \begin{aligned} {} \mathrm{LL}\;:\;X^1 = \left\{1, 3, 8, 6\right\} \mapsto \begin{cases} X^{1,\mathrm{Lead}}&= \left\{1, 3, 3, 8, 8, 6, 6\right\},\\ X^{1,\mathrm{Lag}} &= \left\{1, 1, 3, 3, 8, 8, 6\right\},\\ \end{cases} \end{aligned} $$
(36)
and illustrated in Fig. 11. The corresponding lead-lag transformation of \(X^2\) from (30) is given by
$$\displaystyle \begin{aligned} \mathrm{LL}\;:\;X^2 = \left\{1, 4, 2, 6\right\} \mapsto \begin{cases} X^{2,\mathrm{Lead}}&= \left\{1, 4, 4, 2, 2, 6, 6\right\},\\ X^{2,\mathrm{Lag}} &= \left\{1, 1, 4, 4, 2, 2, 6\right\}, \end{cases} \end{aligned}$$
Fig. 11
Lead and lag paths of \(X^1\)
and illustrated in Fig. 12.
Fig. 12
Lead and lag paths of \(X^2\)

3.1.7 Multivariate Time Series as a Path in \(\mathbb {R}^d\)

The advantage of using the path representation of sequential data is that a d-dimensional process can be embedded as a path in \(\mathbb {R}^d\). For example, the two univariate streams \(\{X^1\}\) from (29) and \(\{X^2\}\) from (30) can be embedded into a path in \(\mathbb {R}^2\) as
$$\displaystyle \begin{aligned} \begin{array}{rcl} \{X_{t_i}\}_{i=0}^{N=3} & =&\displaystyle \{\bigl(X^1_{t_i}, X^2_{t_i} \bigr)\}_{i=0}^{N=3}\\ & =&\displaystyle \{(1, 1), (3, 4), (8, 2), (6, 6)\}, \end{array} \end{aligned} $$
the piecewise linear and rectilinear interpolations of which are displayed in Fig. 10. This can be easily extended to \(\mathbb {R}^3\) if we add another univariate stream of the same length:
$$\displaystyle \begin{aligned} \{X^3_{t_i}\}_{i=0}^{N=3} = \{9,2,7,1\}, \end{aligned}$$
resulting in
$$\displaystyle \begin{aligned} \begin{array}{rcl} \{X_{t_i}\}_{i=0}^{N=3} & =&\displaystyle \{\bigl(X^1_{t_i}, X^2_{t_i}, X^3_{t_i} \bigr)\}_{i=0}^{N=3}\\ & =&\displaystyle \{(1, 1, 9), (3, 4, 2), (8, 2, 7), (6, 6, 1)\}. \end{array} \end{aligned} $$
The resulting paths in 3-dimensions using piecewise linear and rectilinear interpolations are shown in Fig. 13.
Fig. 13
Example of embedding three data streams in \( \mathbb {R}\) into a single path in \( \mathbb {R}^3\) using piecewise linear (left) and rectilinear (right) interpolations respectively
Another example of a multidimensional embedding involves the lead-lag transform. Suppose we are interested in capturing the quadratic variation in one of the multivariate streams. As we will see in Sect. 3.1.14, we can compute the quadratic variation using the lead-lag transformation. Since the lead-lag transformation increases the number of points from \(N+1\) to \(2N+1\) (cf. Sect. 3.1.6), all other stream lengths should be adjusted accordingly. It is worth noticing that, due to reparametrization invariance, the signatures of the lead and lag transformations of streams are equivalent:
$$\displaystyle \begin{aligned} S\Bigl(\bigl\{\bigl(X^1, X^2\bigr)\bigr\}\Bigr) \equiv S\Bigl(\{\bigl(X^{1, \mathrm{Lead}}, X^{2, \mathrm{Lead}}\bigr)\}\Bigr) \equiv S\Bigl(\{\bigl(X^{1, \mathrm{Lag}}, X^{2, \mathrm{Lag}}\bigr)\}\Bigr). \end{aligned}$$
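This invariance is easy to check numerically: the lead (or lag) transform of a piecewise linear path only duplicates points, duplicated points contribute zero increments, and zero increments leave every iterated integral unchanged. A minimal sketch truncated at level 2 (the function name and example data are ours):

```python
def sig_level2_flat(points):
    """(S^1, S^2, S^11, S^12, S^21, S^22) of a piecewise linear path in R^2,
    accumulated segment by segment via Chen's identity."""
    a = b = s11 = s12 = s21 = s22 = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        s11 += a * dx + dx * dx / 2
        s12 += a * dy + dx * dy / 2
        s21 += b * dx + dx * dy / 2
        s22 += b * dy + dy * dy / 2
        a, b = a + dx, b + dy
    return (a, b, s11, s12, s21, s22)

X = [(1, 1), (3, 4), (8, 2), (6, 5)]            # a small example path in R^2
X_lead = [p for p in X for _ in range(2)][1:]   # duplicate points, drop first
X_lag = [p for p in X for _ in range(2)][:-1]   # duplicate points, drop last
```

The three signatures agree exactly, since the duplicated points add segments of zero length.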

3.1.8 Further Path Transformations

In the preceding subsections, we presented several examples of path augmentations, but many others exist, each serving a different purpose: the invisibility-reset augmentation, which adds information about the initial position of the path; coordinate projections, which allow the signature of a subset of coordinates to be computed individually; data rescaling; and random projections, which map a high-dimensional path onto a lower-dimensional subspace using a random matrix. For further details, see [64].

3.1.9 Signed Area

Before we continue, it is important to emphasize the significance of the direction in which we travel along a path and the sign of the area it encloses. Intuitively, the signed area enclosed by a path is the area of the 2-dimensional region it bounds, taken with a positive or negative sign depending on the orientation of the path. A positively oriented path (or curve) is a simple closed path in the plane such that, when following the path, the interior always lies on the left; for a negatively oriented path, the interior lies on the right. In other words, the orientation of a path is determined by the direction in which the path turns as it is traced: if the path turns clockwise, it is negatively oriented; if it turns counterclockwise, it is positively oriented.
For example, if a path is traced clockwise, the signed area of the enclosed region would be negative (Fig. 14). If the path is traced counterclockwise, the signed area would be positive (Fig. 14). A combination of positively and negatively oriented paths and their respective areas is shown in Fig. 14. The sign of an area is also related to the sign of the winding number of a path [6].
Fig. 14
Examples of signed areas defined by the orientation of a path enclosing them. (a) Negative area \(A_{-}\). (b) Positive area \(A_{+}\). (c) Positive and negative areas

3.1.10 The Relative Movement as a Signed Area

Consider a path in \(\mathbb {R}^2\) given by \(X = \{X^1,X^2\}\). If an increase (resp. decrease) in the component \(X^1\) is followed by an increase (resp. decrease) in the component \(X^2\), then the area A given by (12) is positive (Fig. 15). If the relative moves of the components \(X^1\) and \(X^2\) are in the opposite direction, then the area A is negative (Fig. 16).
Fig. 15
Example of a positive Lévy area that corresponds to two paths, \(\{X^1\}\) and \(\{X^2\}\), with relative movement in the same directions
Fig. 16
Example of a negative Lévy area that corresponds to two paths, \(\{X^1\}\) and \(\{X^2\}\), with relative movement in opposite directions

3.1.11 Computation and Geometry of the Signature

In Sect. 2, we outlined the mathematical foundations of the signature of a path and defined the signature terms as multiple iterated integrals. In this subsection, we provide another example of a computation and the geometric meaning of the signature. Consider the two univariate streams \(\{(X^1, X^2)\}\)
$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \{(X^1)\} & =&\displaystyle \{0, 1, 2, 3, 4, 5\},\\ \{(X^2)\} & =&\displaystyle \{8, 4, 5, 1, 10, 3\}, \end{array} \end{aligned} $$
(37)
embedded into a piecewise linear path in \(\mathbb {R}^2\) (Fig. 17a). Using the definition of the signature (??), we compute its first terms truncated at level \(L=2\) as
$$\displaystyle \begin{aligned} S(X) & = (1, S^{1}, S^{2}, S^{1,1}, S^{1,2}, S^{2,1}, S^{2,2}) \\ & = (1, 5, -5, 12.5, -10.5, -14.5, 12.5) . \end{aligned} $$
Fig. 17
Examples of two univariate data streams (37) embedded in \( \mathbb {R}^2\) and the geometric meaning of their signature terms \(S^{1}\) and \(S^{2}\). (a) A piecewise linear path \(\{(X^1, X^2)\} \in \mathbb {R}^2\) constructed from \(\{(X^1)\}\) and \(\{(X^2)\}\) using data points as (37). (b) Geometric meaning of the first two terms \(S^{1}\) and \(S^{2}\) of the signature of the path \(\{(X^1, X^2)\}\)
One can easily confirm (cf. (1)) that the first order signature terms correspond to the total increments along each data stream:
$$\displaystyle \begin{aligned} \begin{array}{rcl} S^{1} & =&\displaystyle \int dX^1_t = X^1_{t_5} - X^1_{t_0} = 5-0 = 5,\\ S^{2} & =&\displaystyle \int dX^2_t = X^2_{t_5} - X^2_{t_0} = 3-8 = -5, \end{array} \end{aligned} $$
which are depicted in Fig. 17b. The second order signature terms are computed using double iterated integrals (cf. (2)):
$$\displaystyle \begin{aligned} {} \begin{aligned} S^{1,1} &= \int S^{1}\, dX^1_t = \int \int dX^1_t \, dX^1_t = \frac{1}{2!}\left(\int dX^1_t\right)^2 = \frac{1}{2!}(S^{1})^2 = 12.5,\\ S^{1,2} &= \int S^{1}\, dX^2_t = \int \int dX^1_t \, dX^2_t = -10.5,\\ S^{2,1} &= \int S^{2}\, dX^1_t = \int \int dX^2_t \, dX^1_t = -14.5,\\ S^{2,2} &= \int S^{2}\, dX^2_t = \int \int dX^2_t \, dX^2_t = \frac{1}{2!}\left(\int dX^2_t\right)^2 = \frac{1}{2!}(S^{2})^2 = 12.5. \end{aligned} \end{aligned} $$
(38)
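The level-1 and level-2 terms above can be reproduced for any piecewise linear path by accumulating increments segment by segment via Chen's identity (a minimal pure-Python sketch; the function name is ours):

```python
def sig_level2(points):
    """Level-1 and level-2 signature terms of a piecewise linear path.

    For a segment with increment dx appended to a path whose level-1
    signature so far is s1, Chen's identity gives
        S^{i,j} += s1[i] * dx[j] + dx[i] * dx[j] / 2.
    """
    d = len(points[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for p, q in zip(points, points[1:]):
        dx = [qi - pi for qi, pi in zip(q, p)]
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * dx[j] + dx[i] * dx[j] / 2
        for i in range(d):
            s1[i] += dx[i]
    return s1, s2

# The two streams from (37), embedded as a piecewise linear path in R^2:
path = list(zip([0, 1, 2, 3, 4, 5], [8, 4, 5, 1, 10, 3]))
s1, s2 = sig_level2(path)
# s1 == [5.0, -5.0]; s2 == [[12.5, -10.5], [-14.5, 12.5]]
```

From these values the Lévy area \(\tfrac12(S^{1,2}-S^{2,1}) = 2\) and the shuffle identity \(S^{1}\,S^{2} = S^{1,2}+S^{2,1}\) follow directly.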
The second order signature terms \(S^{1,2}\) and \(S^{2,1}\) and their geometric interpretation as areas are presented in Fig. 18.
Fig. 18
The geometric meaning of the terms \(S^{1,2}\) and \(S^{2,1}\) of \(\{(X^1, X^2)\}\) defined in (37). The left panel (a) represents the area enclosed by the path and two perpendicular dashed lines passing through the endpoints of the path, while the right panel (b) shows another possibility for the area to be enclosed by the path and two perpendicular dashed lines passing through the endpoints. (a) Area represented by the term \(S^{1,2}\). (b) Area represented by the term \(S^{2,1}\)
Switching the order of integration over the path in the terms \(S^{1,2}\) and \(S^{2,1}\) gives rise to two areas which complement each other and add up to the total area of a rectangle with side lengths equal to the total increments \(S^{1}\) and \(S^{2}\), which also corresponds to the shuffle product relation
$$\displaystyle \begin{aligned} \begin{array}{rcl} S^{1}\cdot S^{2} & =&\displaystyle S^{1,2} + S^{2,1},\\ 5\cdot (-5) & =&\displaystyle -10.5 + (-14.5). \end{array} \end{aligned} $$
The geometric interpretation of the higher order terms is less intuitive and we omit this discussion. It is worth noticing that the areas corresponding to the second order signature terms (\(S^{1,2}\) and \(S^{2,1}\)) carry positive and negative signs depending on their position relative to the horizontal dashed line. If we split the areas in Fig. 18 into individual areas that lie above and underneath the horizontal dashed lines, we obtain partitions as shown in Fig. 19.
Fig. 19
A schematic partitioning of the areas \(S^{1,2}\) and \(S^{2,1}\) enclosed by the path and their corresponding dashed lines into a sum of individual contributions. Circles denote the point of intersection of the path and the corresponding dashed lines. (a) The total area \(S^{1,2} = A_1 + A_2 + A_3\). (b) The total area \(S^{2,1} = \tilde {A}_1 + \tilde {A}_2 + \tilde {A}_3\)
The intersection points, denoted by red circles in Fig. 19, can be easily computed as
$$\displaystyle \begin{aligned} \begin{array}{rcl} A & =&\displaystyle (5/2, 3), \;\;\;B = (29/9, 3),\\ \tilde{A} & =&\displaystyle (34/9, 8), \;\;\;\tilde{B} = (30/7, 8). \end{array} \end{aligned} $$
Subsequently, one can directly compute the individual areas
$$\displaystyle \begin{aligned} \begin{array}{rcl} A_1 & =&\displaystyle -5, \;\;\; A_2 = 13/18, \;\;\; A_3 = -56/9,\\ \tilde{A}_1 & =&\displaystyle -119/9, \;\;\; \tilde{A}_2 = 32/63, \;\;\; \tilde{A}_3 = -25/14, \end{array} \end{aligned} $$
and demonstrate that the results
$$\displaystyle \begin{aligned} \begin{array}{rcl} S^{1,2} & =&\displaystyle A_1 + A_2 + A_3 = -10.5,\\ S^{2,1} & =&\displaystyle \tilde{A}_1 + \tilde{A}_2 + \tilde{A}_3 = -14.5 \end{array} \end{aligned} $$
correspond to the direct computation of the signature terms as shown in (38). Note that the sign of an area is determined by the direction of the path that encloses the area, as demonstrated in Fig. 14.
The Lévy area (12) of the path \(\{(X^1, X^2)\}\) (37) is shown in Fig. 20 and is given by
$$\displaystyle \begin{aligned} {} A_{\text{L{\'e}vy}}=\frac{1}{2}\left(S^{1,2}-S^{2,1}\right) = 2. \end{aligned} $$
(39)
Fig. 20
Example of the signed Lévy area enclosed by a piecewise linear path and the chord that connects the two endpoints. The light blue area is negative and the pink area is positive according to the direction of travel along the path (from left to right)
Using the intersection point \((3.4, 4.6)\) between the chord and the linear segment between \(t_3\) and \(t_4\), one can easily compute the areas \(A_{+}=6.8\) and \(A_{-}=-4.8\) and confirm that \(A_{\text{L{\'e}vy}} = A_{+}+A_{-}\) is equivalent to (39).

3.1.12 The Log-Signature of a Path

The shuffle product identity (Theorem 2.14) implies that any polynomial function on signatures can be written as a linear combination of signature elements. However, these relations imply that the signature contains ‘repeated information’, e.g. \(S(X)^{i,i}_{a,b}\) can always be determined from the level 1 signature since \(S(X)^{i,i}_{a,b}=\frac 12 (S(X)^i_{a,b})^2\); for some tasks (e.g. linear regression) this repetition can be useful, as it increases the number of potentially useful features, but in some cases, one may wish to remove it. The log-signature (also known as a logarithmic signature), as defined in Sect. 2.3.5, is a condensed form of the signature that removes this repeated information: it contains the same information as the signature but with fewer terms.
For example, consider a two-dimensional data stream \(\{X\}\) and the vector space of formal series in two non-commuting indeterminates \(\{e_1, e_2\}\) (see Definition 2.19). Recalling the log-signature from Definition 2.27 and the result of Theorem 2.26, the log-signature truncated at level \(L=4\) can be written as
$$\displaystyle \begin{aligned} {} \begin{aligned} \log S(X) &= \lambda_{1}e_1 + \lambda_{2}e_2+ \lambda_{12}[e_1, e_2] \\ &\quad + \lambda_{112}[e_1,[e_1,e_2]] + \lambda_{212}[e_2,[e_1,e_2]] \\ &\quad +\lambda_{1112}[e_1,[e_1,[e_1,e_2]]] + \lambda_{2112}[e_2,[e_1,[e_1,e_2]]] \\ &\quad + \lambda_{2212}[e_2,[e_2,[e_1,e_2]]], \end{aligned} \end{aligned} $$
(40)
where \(\lambda _{i_1,\ldots i_k}\in \mathbb R\) are coefficients and \([x,y] = x\otimes y - y \otimes x\) is the Lie bracket as defined in (24). There are a priori no relations between the coefficients \(\lambda _1,\ldots , \lambda _{2212}\) (in fact, for any choice of \(\lambda _1,\ldots , \lambda _{2212}\), one can find a piecewise linear path X so that \(\log S(X)\) is equal to the right-hand side of (40), see [28, Theorem 7.28]). This is in contrast to \(S(X)\) because the shuffle identities imply non-trivial constraints on its coefficients (e.g. \(S(X)^1S(X)^2 = S(X)^{1,2}+S(X)^{2,1}\)). Since \(S(X)\) and \(\log S(X)\) determine each other, this shows that \(\log S(X)\) is a compressed version of \(S(X)\).
The correspondence between the signature and log-signature terms can be computed from the definition (23). For example, building on (25), the correspondence between the coefficients of the signature and log-signature up to the level \(L=3\) is
$$\displaystyle \begin{aligned} {} \begin{aligned} \lambda_1 &= S^{1}, \;\;\;\lambda_2 = S^{2}, \;\;\;\lambda_{12} = \frac{1}{2!}\left(S^{1,2} - S^{2,1}\right),\\ \lambda_{112} &= \frac{1}{3!}\left(S^{1,1,2} + S^{2,1,1} - 2 S^{1,2,1}\right), \\ \lambda_{212} &= -\frac{1}{3!}\left(S^{1,2,2} + S^{2,2,1} - 2 S^{2,1,2}\right), \end{aligned} \end{aligned} $$
(41)
and it is possible to extend this correspondence to any truncation level L.
Let us illustrate the geometric intuition for the low-order log-signature terms. Using the relationships in (41), the log-signature coefficients up to level \(L=3\) of a two-dimensional path \(X = \{(X^1, X^2)\}\) from (37) are
$$\displaystyle \begin{aligned} \log S(X) &= (\lambda_{1}, \lambda_{2}, \lambda_{12}, \lambda_{112}, \lambda_{212}) \\ &= (5, -5, 2, -12, -35/6). \end{aligned}$$
One can immediately see that \(\lambda _1\) and \(\lambda _2\) are the total increments along each of the axes and \(\lambda _{12}\) corresponds to the Lévy area (12) as depicted in Fig. 20. The higher order log-signature terms \(L\ge 3\) correspond to “areas of areas” [70] and can be represented as iterated Lie brackets of non-commuting indeterminates. The log-signature, containing the same information as the signature, can likewise be used as a feature set in statistical and machine learning applications, at times with more efficiency, see e.g. [51, 52, 71].
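The level-2 part of this correspondence can be checked directly by taking the tensor logarithm of the truncated signature, \(\log(1+t) = t - \tfrac12\, t\otimes t\) up to level 2, using the values computed for the path (37) (a minimal sketch; the variable names are ours):

```python
# Signature of the path from (37), truncated at level 2.
S1 = [5.0, -5.0]                         # (S^1, S^2)
S2 = [[12.5, -10.5],                     # S2[i][j] holds S^{i+1,j+1}
      [-14.5, 12.5]]

# Tensor logarithm truncated at level 2: log(1+t) = t - (t (x) t)/2,
# so the level-2 part of log S(X) is S2 - (S1 (x) S1)/2.
L2 = [[S2[i][j] - S1[i] * S1[j] / 2 for j in range(2)] for i in range(2)]

lam_1, lam_2 = S1        # level-1 log-signature coefficients
lam_12 = L2[0][1]        # coefficient of [e1, e2]: the Levy area
```

The diagonal of `L2` vanishes and the off-diagonal entries are \(\pm\lambda_{12}\), so at level 2 only the single coefficient \(\lambda_{12} = 2\) (the Lévy area) survives, in agreement with \(\log S(X) = (5, -5, 2, \ldots)\) and with (41).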

3.1.13 Numerical Computation of the Signature

To streamline the development and application of the signature method, several numerical packages with Python interfaces have been developed for computing the signature. The ‘esig’ package is implemented in C++ with an intuitive Python interface [74]; it computes signature and log-signature coefficients and provides additional auxiliary tools. The ‘iisignature’ package offers similar functionality, and its implementation details are described in [69, 71]. In addition, the ‘signatory’ Python package was developed to enable the integration of signature transformation layers in deep learning applications; it runs on CPUs and, if available, on GPUs to accelerate computations [38]. More information about the usage of these numerical packages can be found in their respective documentation.

3.1.14 Further Relationship Between the Signature Terms and Data Points

We now illustrate how the signature allows us to compute statistical moments of data streams. Consider the lead-lag path (Fig. 21) constructed from (29) via its lead-lag transformation (36). We can decompose this figure into three right-angled isosceles triangles (Fig. 21). Note that the direction of movement along this path, starting from the point \(X^1_{t_0}\) and moving towards the endpoint \(X^1_{t_3}\), corresponds to a negative sign, thus the total area will be negative. All three triangles have the same direction of movement, so the total area is the sum of their individual contributions.
Fig. 21
Decomposition of the full lead-lag path into three individual parts. The direction of movement along the path is indicated by arrows
The absolute value of the total area is given by the sum of half the areas of the squares
$$\displaystyle \begin{aligned} |A| &= \frac{1}{2}\left[\left(X^{1}_{t_1} - X^{1}_{t_0}\right)\left(X^{1}_{t_1} - X^{1}_{t_0}\right) + \left(X^{1}_{t_2} - X^{1}_{t_1}\right)\left(X^{1}_{t_2} - X^{1}_{t_1}\right)\right. \\ &\qquad + \left.\left(X^{1}_{t_3} - X^{1}_{t_2}\right)\left(X^{1}_{t_3} - X^{1}_{t_2}\right)\right]\\ &= \frac{1}{2}\left[(3-1)^2 + (8-3)^2 + (6-8)^2\right] = 16.5, \end{aligned} $$
which exactly equals the Lévy area enclosed by the lead-lag path of \(\{X^1\}\) and the straight line connecting the endpoints, and can be computed via the signature terms (12). Let us write
$$\displaystyle \begin{aligned} {} [X]_t = \sum_{i=1}^{N}\left(X_{t_{i}}- X_{t_{i-1}}\right)^2, \end{aligned} $$
(42)
which has the simple meaning of the quadratic variation of the path constructed from \(\{X_{t_i}\}_{i=0}^N\) and is related to the variance of the path. Thus one can generally write for any sequence \(\{X_{t_i}\}_{i=0}^N\)
$$\displaystyle \begin{aligned} A_{\text{L{\'e}vy}}^{\{(X^{\mathrm{Lead}}, X^{\mathrm{Lag}})\}} = \frac{1}{2}[X]. \end{aligned}$$
For example, for the lead-lag path \(\{(X^{1, \mathrm {Lead}}, X^{1, \mathrm {Lag}})\}\) from (36), originating from the one-dimensional sequence (29), the first order log-signature terms correspond to the total increments along each dimension, and the second order term is the Lévy area, which equals half the quadratic variation of the one-dimensional sequence:
$$\displaystyle \begin{aligned} \begin{array}{rcl} \log S\left(\{(X^{1, \mathrm{Lead}}, X^{1, \mathrm{Lag}})\}\right) & =&\displaystyle \left(\Delta X^{1, \mathrm{Lead}}, \Delta X^{1, \mathrm{Lag}}, \frac{1}{2}[X^1], \ldots \right) \\ & =&\displaystyle (5, 5, 16.5, \ldots ). \end{array} \end{aligned} $$
Equivalently, one can write the above result in terms of the signature with all the terms included as
$$\displaystyle \begin{aligned} S\left(\{(X^{1, \mathrm{Lead}}, X^{1, \mathrm{Lag}})\}\right) = (1, 5, 5, 12.5, 29, -4, 12.5, \ldots ) \end{aligned}$$
and confirm that
$$\displaystyle \begin{aligned} [\{X^1\}] = S^{1,2}\left(\{(X^{1, \mathrm{Lead}}, X^{1, \mathrm{Lag}})\}\right) - S^{2,1}\left(\{(X^{1, \mathrm{Lead}}, X^{1, \mathrm{Lag}})\}\right). \end{aligned}$$
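A short sketch (function names ours) confirms this relation numerically for the sequence (29): the quadratic variation equals \(S^{1,2}-S^{2,1}\) of the lead-lag path, and the Lévy area is half of it.

```python
def quadratic_variation(x):
    """Quadratic variation of a sequence, as in (42)."""
    return sum((b - a) ** 2 for a, b in zip(x, x[1:]))

def lead_lag_area_terms(x):
    """S^{1,2} and S^{2,1} of the piecewise linear lead-lag path of x."""
    doubled = [v for v in x for _ in range(2)]
    path = list(zip(doubled[1:], doubled[:-1]))   # (lead, lag) pairs
    a = b = s12 = s21 = 0.0
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        dx, dy = x1 - x0, y1 - y0
        s12 += a * dy + dx * dy / 2
        s21 += b * dx + dx * dy / 2
        a, b = a + dx, b + dy
    return s12, s21

s12, s21 = lead_lag_area_terms([1, 3, 8, 6])
# s12 == 29.0 and s21 == -4.0, matching the signature terms quoted above
```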
Next we demonstrate certain properties of paths which originate from composite transformations. Consider the one-dimensional sequence \(\{X^1 \}\) from (29), and apply to it the series of transformations given by the cumulative sum \(\mathrm {CS}(\cdot)\) defined by (32), followed by base-point augmentation \(\mathrm {BP}(\cdot)\) defined by (33), followed by the lead-lag transformation \(\mathrm {LL}(\cdot)\) defined by (34) and (35). The resulting sequence can be written as the composition of transformations
$$\displaystyle \begin{aligned} {} \{\tilde{X}^1\} = \mathrm{LL} \circ \mathrm{BP} \circ \mathrm{CS}(\{X^1\}) = \mathrm{LL}\big(\mathrm{BP}\big(\mathrm{CS}(\{X^1\})\big)\big). \end{aligned} $$
(43)
It follows that, defining \(\{Y^1\} = \mathrm {BP}\big (\mathrm {CS}(\{X^1\})\big )\),
$$\displaystyle \begin{aligned} \begin{array}{rcl} \{X^1_{t_i}\}_{i=0}^{N=3} & =&\displaystyle \{1,3,8,6\}, \\ \mathrm{CS}(\{X^1\}) & {=}&\displaystyle \{1,4,12,18\}, \\ \{Y^1\} & {=}&\displaystyle \{0,1,4,12,18\}, \\ \{\tilde{X}^1\} & {=}&\displaystyle \mathrm{LL}\big(\mathrm{BP}\big(\mathrm{CS}(\{X^1\})\big)\big) {} \\ & {=}&\displaystyle \{(Y^{1,\mathrm{Lead}}, Y^{1,\mathrm{Lag}})\} \\ & {=}&\displaystyle \{(0,0),(1,0),(1,1),(4,1), (4,4),(12,4),(12,12), (18,12),(18,18)\}. \end{array} \end{aligned} $$
(44)
The resulting \(\{\tilde {X}^1\}\) path is shown in Fig. 22.
Fig. 22
The subsequent transformations of the original path \(\{X^1\}\) into a resulting path \(\{\tilde {X}^1\} = \mathrm {LL} \big (\mathrm {BP} \big (\mathrm {CS}(\{X^1\}) \big ) \big )\) as shown in (44)
Explicitly (44) can be written as
$$\displaystyle \begin{aligned} \begin{array}{rcl} Y^{1,\mathrm{Lead}} & =&\displaystyle \{0, S_0, S_0, S_1, S_1, S_2, S_2, S_3, S_3\},\\ Y^{1,\mathrm{Lag}} & =&\displaystyle \{0,0,S_0, S_0, S_1, S_1, S_2, S_2, S_3\}, \end{array} \end{aligned} $$
where \(S_k\) is a partial sum defined in (32). We denote the signature of the composite transformation \(\{\tilde {X}^1\}\) from (43) as
$$\displaystyle \begin{aligned} {} \tilde{S} = S(\{\tilde{X}^1\}) = (1, \tilde{S}^{1}, \tilde{S}^{2}, \tilde{S}^{1,1}, \tilde{S}^{1,2}, \tilde{S}^{2,1}, \tilde{S}^{2,2}, \ldots ). \end{aligned} $$
(45)
Numerical evaluation of the expression in (45) yields
$$\displaystyle \begin{aligned} {} \tilde{S} = (1, 18, 18, 162, 217, 107, 162, \ldots ). \end{aligned} $$
(46)
From the definition of the signature terms and the cumulative sum, it follows that
$$\displaystyle \begin{aligned} \begin{array}{rcl} \tilde{S}^{1} & =&\displaystyle \int d {Y}^{1,\mathrm{Lead}} = {Y}^{1,\mathrm{Lead}}_{t_{2N}} - {Y}^{1,\mathrm{Lead}}_{t_0} = \Delta{Y^{1,\mathrm{Lead}}}\\ & =&\displaystyle \mathrm{BP}\big(\mathrm{CS}(\{X^1\})\big)_{t_4} - \mathrm{BP}\big(\mathrm{CS}(\{X^1\})\big)_{t_0}\\ & =&\displaystyle S_{k=3} - 0\\ & =&\displaystyle X^1_{t_0}+X^1_{t_1}+X^1_{t_2}+X^1_{t_3} - 0\\ & =&\displaystyle 1+3+8+6 = 18. \end{array} \end{aligned} $$
Similarly, one can show that \(\tilde {S}^{2} = \tilde {S}^{1}\). Therefore, the first order terms of the signature of the transformed sequence \(\{\tilde {X}^1\}\) equal the sum of the original sequence \(\{{X}^1\}\):
$$\displaystyle \begin{aligned} \tilde{S}^{1} = \tilde{S}^{2} = \Delta{Y^{1,\mathrm{Lead}}} = \Delta{Y^{1,\mathrm{Lag}}}. \end{aligned}$$
This result can be easily confirmed numerically by comparing the corresponding terms in (45) and (46). In a similar fashion, one can demonstrate that the quadratic variation (42) of the transformed sequence, \([\{\tilde {X^1}\}]\), is proportional to the variance of the original sequence \(\{X^1\}\) from (29).
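The composite transformation (43) can be sketched directly in Python (a minimal sketch; the function names are ours, and \(\mathrm{BP}\) is assumed to prepend the origin as in (33)):

```python
from itertools import accumulate

def cumulative_sum(x):
    """CS, as in (32): running partial sums."""
    return list(accumulate(x))

def base_point(x):
    """BP, as in (33): prepend the origin."""
    return [0] + list(x)

def lead_lag_pairs(x):
    """LL, as in (34)-(35), zipped into (lead, lag) points in R^2."""
    doubled = [v for v in x for _ in range(2)]
    return list(zip(doubled[1:], doubled[:-1]))

tilde_X1 = lead_lag_pairs(base_point(cumulative_sum([1, 3, 8, 6])))
# Reproduces the nine points of (44).
```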

3.2 Machine Learning with Signature Features

The signature of a path is a collection of iterated integrals that succinctly capture the path’s geometric properties and encode its essential information. This makes it particularly well-suited for machine learning applications involving sequential data and time series analysis. The workflow is simple and summarized in the following algorithm:
$$\displaystyle \begin{aligned} \text{data} &\rightarrow \text{path} \rightarrow \text{signature of path} \rightarrow \text{features of data}\\ &\rightarrow \text{learning algorithms} \end{aligned} $$
This algorithm is general and works for any type of sequential data that can be embedded into a continuous path. The extracted features can be used for various types of machine learning applications, including both supervised and unsupervised learning. For example, one can classify time series or distinguish clusters of data. One of the advantages of feature extraction with the signature method is that the signature is sensitive to the geometric shape of a path. This sensitivity has led to a successful application of the signature method to the Chinese character recognition problem [33]. One of the most well-known and natural applications of the signature method is in quantitative finance, namely the analysis of time series data [25, 49]. Time series are ordered sequential data and thus natural candidates for creating a path, computing its signature, and applying machine learning algorithms for further analysis. Moreover, if the input data come from several parallel sources, the result is a multi-dimensional path. An example of such data is panel data (in Econometrics) or longitudinal data (in Medicine, Psychology, Biostatistics, etc.), which involves repeated observations of the same quantities over periods of time.
Concluding this brief introduction to the applications of the signature method, we emphasize that this method introduces a new way of working with data: thinking of data as geometric paths and using the signature to analyze them. In the rest of this subsection, we demonstrate how to apply the signature transformation to problems in statistical machine learning with streamed data, focusing on the common analytical task of classification. However, other machine learning and inferential statistical tasks with streamed data can also be performed using the signature formalism, including regression, clustering, and kernel learning, some of which we review in Sect. 3.3.

3.2.1 Classification of Handwritten Digits: An Example

In this section, we demonstrate the application of the signature method to the machine learning task of pen-based classification of handwritten digits. The dataset is available from the University of California, Irvine, Machine Learning Repository [2]. It comprises handwritten digits in the range 0–9, written by 44 writers (250 digits each) using a tablet with a stylus. Each sample is a collection of consecutive data points given by \((x, y)\) coordinates. The raw integer values captured from the tablet were normalized to lie in the range between 0 and 100, and the consecutive data points were resampled and represented as a sequence of coordinate pairs \((x_i, y_i), i=1,\dots ,8\), regularly spaced along the trajectory. An example of ten random digits in tabular format is shown in Table 2 and the first four rows are displayed in Fig. 23.
Fig. 23
Pictorial representation of the first four rows from the Table 2. From left to right: ‘0’, ‘3’, ‘1’ and ‘7’. Solid red dots represent the starting point \((x_1, y_1)\)
Table 2
An example of ten random samples of handwritten digits in tabular format from the Pen-based Recognition of Handwritten Digits data set [2]. Columns \(x_i\) and \(y_i\) \((i=1,\dots ,8)\) correspond to coordinates (features) and the last column designates the class label of the digit
| \(x_1\) | \(y_1\) | \(x_2\) | \(y_2\) | \(x_3\) | \(y_3\) | \(x_4\) | \(y_4\) | \(x_5\) | \(y_5\) | \(x_6\) | \(y_6\) | \(x_7\) | \(y_7\) | \(x_8\) | \(y_8\) | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 87 | 0 | 44 | 20 | 4 | 66 | 0 | 100 | 30 | 89 | 75 | 49 | 100 | 13 | 77 | 0 |
| 23 | 76 | 52 | 100 | 100 | 98 | 79 | 71 | 63 | 50 | 82 | 19 | 49 | 0 | 0 | 7 | 3 |
| 0 | 40 | 32 | 59 | 71 | 82 | 100 | 100 | 85 | 75 | 71 | 50 | 64 | 25 | 75 | 0 | 1 |
| 0 | 100 | 53 | 95 | 80 | 66 | 69 | 30 | 66 | 0 | 100 | 30 | 61 | 36 | 7 | 39 | 7 |
| 0 | 67 | 43 | 74 | 78 | 91 | 100 | 100 | 95 | 74 | 87 | 49 | 84 | 23 | 96 | 0 | 1 |
| 10 | 97 | 2 | 47 | 24 | 1 | 74 | 0 | 100 | 43 | 82 | 90 | 33 | 100 | 0 | 65 | 0 |
| 0 | 48 | 30 | 67 | 65 | 85 | 100 | 100 | 84 | 75 | 65 | 51 | 47 | 25 | 34 | 0 | 1 |
| 55 | 70 | 73 | 100 | 62 | 66 | 49 | 26 | 30 | 0 | 0 | 2 | 50 | 3 | 100 | 6 | 1 |
| 100 | 100 | 66 | 88 | 32 | 66 | 7 | 39 | 0 | 10 | 44 | 0 | 62 | 18 | 14 | 22 | 6 |
| 85 | 100 | 43 | 76 | 11 | 48 | 0 | 17 | 44 | 0 | 100 | 10 | 59 | 20 | 1 | 10 | 6 |
Each data sample is represented by a two-dimensional coordinate sequence \((x_i, y_i) \in \mathcal {X}\), which can be seen as a path in \(\mathbb {R}^2\). For simplicity, we apply the signature transformation without composite augmentations. Following the steps outlined in Sect. 3.1.11, the signature is the feature map
$$\displaystyle \begin{aligned} S:\mathcal{X} \rightarrow \Sigma, \end{aligned}$$
where \(S^{I}\in \Sigma \) are the elements of the infinite signature series (??) and I is a multi-index defined in (4). For example, applying the level-2 truncated signature transformation on the first four rows
$$\displaystyle \begin{aligned} \begin{array}{rcl} X^{\mathrm{zero}} = \{(x_i, y_i)\}_{i=1}^{N=8} & =&\displaystyle \{(12, 87), (0, 44), (20, 4), (66, 0),\\ & &\displaystyle (100, 30), (89, 75), (49, 100), (13, 77)\},\\ X^{\mathrm{three}} & =&\displaystyle \{(23, 76), (52, 100), (100, 98), (79, 71),\\ & &\displaystyle (63, 50), (82, 19), (49, 0), (0, 7)\},\\ X^{\mathrm{one}} & =&\displaystyle \{(0, 40), (32, 59), (71, 82), (100, 100),\\ & &\displaystyle (85, 75), (71, 50), (64, 25), (75, 0)\},\\ X^{\mathrm{seven}} & =&\displaystyle \{(0, 100), (53, 95), (80, 66), (69, 30),\\ & &\displaystyle (66, 0), (100, 30), (61, 36), (7, 39)\} \end{array} \end{aligned} $$
yields
$$\displaystyle \begin{aligned} \begin{array}{rcl} S(X^{\mathrm{zero}}) & =&\displaystyle (1, 1, -10, 0.5, 7044.5, -7054.5, 50),\\ S(X^{\mathrm{three}}) & =&\displaystyle (1, -23, -69, 264.5, -4893, 6480, 2380.5),\\ S(X^{\mathrm{one}}) & =&\displaystyle (1, 75, -40, 2812.5, -4660, 1660, 800),\\ S(X^{\mathrm{seven}}) & =&\displaystyle (1, 7, -61, 24.5, -3693, 3266, 1860.5). \end{array} \end{aligned} $$
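These vectors can be reproduced with the same segment-by-segment computation as in Sect. 3.1.11; below is a minimal sketch for the first row (the function name is ours, and the flattening order \((1, S^{1}, S^{2}, S^{1,1}, S^{1,2}, S^{2,1}, S^{2,2})\) follows the text):

```python
def sig2_flat(points):
    """Level-2 truncated signature (1, S^1, S^2, S^11, S^12, S^21, S^22)
    of a piecewise linear path in R^2."""
    a = b = 0.0                       # running level-1 terms S^1, S^2
    s11 = s12 = s21 = s22 = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        s11 += a * dx + dx * dx / 2
        s12 += a * dy + dx * dy / 2
        s21 += b * dx + dx * dy / 2
        s22 += b * dy + dy * dy / 2
        a, b = a + dx, b + dy
    return (1.0, a, b, s11, s12, s21, s22)

X_zero = [(12, 87), (0, 44), (20, 4), (66, 0),
          (100, 30), (89, 75), (49, 100), (13, 77)]
# sig2_flat(X_zero) == (1.0, 1.0, -10.0, 0.5, 7044.5, -7054.5, 50.0)
```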
Applying the signature transformation on each row of the data matrix in Table 2 will generate the signature feature matrix given in Table 3.
Table 3
The signature feature matrix from the original data presented in Table 2 and the corresponding labels
| 1 | \(S^{1}\) | \(S^{2}\) | \(S^{1,1}\) | \(S^{1,2}\) | \(S^{2,1}\) | \(S^{2,2}\) | Label |
|---|---|---|---|---|---|---|---|
| 1 | 1 | -10 | 0.5 | 7044.5 | -7054.5 | 50 | 0 |
| 1 | -23 | -69 | 264.5 | -4893 | 6480 | 2380.5 | 3 |
| 1 | 75 | -40 | 2812.5 | -4660 | 1660 | 800 | 1 |
| 1 | 7 | -61 | 24.5 | -3693 | 3266 | 1860.5 | 7 |
| 1 | 96 | -67 | 4608 | -7123 | 691 | 2244.5 | 1 |
| 1 | -10 | -32 | 50 | 7388.5 | -7068.5 | 512 | 0 |
| 1 | 34 | -48 | 578 | -4179 | 2547 | 1152 | 1 |
| 1 | 45 | -64 | 1012.5 | 178 | -3058 | 2048 | 1 |
| 1 | -86 | -78 | 3698 | 5984 | 724 | 3042 | 6 |
| 1 | -84 | -90 | 3528 | 6028.5 | 1531.5 | 4050 | 6 |
Now, one can use the signature feature matrix as input to statistical hypothesis tests for downstream analytical tasks or machine learning algorithms to predict the class labels.
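As a toy illustration of the downstream step, even a nearest-neighbour rule can separate samples using rows of the signature feature matrix in Table 3 (a minimal sketch using three rows; a real experiment would use the full training set and a proper train/test split):

```python
import math

# (features, label) rows taken from Table 3: one '0' and one '6' sample.
train = [
    ((1, 1, -10, 0.5, 7044.5, -7054.5, 50), 0),
    ((1, -86, -78, 3698, 5984, 724, 3042), 6),
]
# The last row of Table 3; its true label is 6.
query = (1, -84, -90, 3528, 6028.5, 1531.5, 4050)

def nearest_label(q, rows):
    """Label of the training row closest to q in Euclidean distance."""
    return min(rows, key=lambda r: math.dist(q, r[0]))[1]

predicted = nearest_label(query, train)
# predicted == 6
```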
Another interesting observation is that the higher order signature terms can capture more nuanced patterns of paths. Consider only samples of handwritten ‘zero’ and ‘eight’ digits, such as those shown in Fig. 24. Both ‘zero’ and ‘eight’ shapes share a similar geometric pattern: the starting and end points usually coincide. However, the areas enclosed by the two shapes will differ due to their orientations, as discussed in Sect. 3.1.9. The clear differences between the distributions of the first order \(S^{1}, S^{2}\) and second order \(S^{1,2}, S^{2,1}\) signature terms are shown in Fig. 25 and can improve the performance of machine learning or statistical approaches.
Fig. 24
Examples of handwritten zeros and eights. From left to right: ‘8’, ‘0’, ‘0’, ‘8’. Solid red dot represents the starting point \((x_1, y_1)\)
Fig. 25
The distribution of the first (left panel) and second (right panel) signature terms describing ‘zero’ and ‘eight’ digits. The second order signature terms, that are proportional to a signed enclosed area, demonstrate good separability between the two classes
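The orientation effect can be made concrete with a short sketch (a synthetic circle stands in for the digit strokes; the helper `signed_area` is illustrative, not a library routine). For a closed two-dimensional path, the antisymmetric level-2 combination \((S^{1,2} - S^{2,1})/2\) equals the signed enclosed area, so reversing the orientation of the path flips its sign:

```python
import numpy as np

def signed_area(path):
    """Signed area enclosed by a closed piecewise-linear path,
    recovered from the antisymmetric level-2 signature terms:
    area = (S^{1,2} - S^{2,1}) / 2."""
    s1 = np.zeros(2)   # running level-1 signature (displacement so far)
    s12 = s21 = 0.0    # running S^{1,2} and S^{2,1}
    for a, b in zip(path[:-1], path[1:]):
        inc = b - a
        # iterated integrals over a linear segment, accumulated via Chen's identity
        s12 += s1[0] * inc[1] + inc[0] * inc[1] / 2
        s21 += s1[1] * inc[0] + inc[0] * inc[1] / 2
        s1 += inc
    return (s12 - s21) / 2

t = np.linspace(0, 2 * np.pi, 400)
ccw = np.stack([np.cos(t), np.sin(t)], axis=1)  # counter-clockwise unit circle
cw = ccw[::-1]                                  # same circle, orientation reversed

# Reversing the orientation flips the sign of the enclosed area (about +pi vs -pi)
assert np.isclose(signed_area(ccw), np.pi, atol=1e-3)
assert np.isclose(signed_area(cw), -np.pi, atol=1e-3)
```

This is exactly why the second-order terms in Fig. 25 separate the two digit classes: the loops of a ‘0’ and an ‘8’ are traversed with different orientations, so the signed areas, and hence \(S^{1,2}\) and \(S^{2,1}\), differ even when the level-1 increments are similar.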
Furthermore, the signature feature matrix can also be used for unsupervised learning tasks, including dimensionality reduction for clustering approaches. For example, by applying the Uniform Manifold Approximation and Projection (UMAP) dimension reduction method [61] to the signature feature matrix, one can find clusters of points belonging to the same class label, see Fig. 26.
Fig. 26
Projection of the signature feature space from \( \mathbb {R}^7\) onto \( \mathbb {R}^2\) using UMAP. The right column designates class labels \(0, \dots , 9\)

3.3 Overview of Signature-Based Machine Learning Applications

Over the past two decades, the signature method has been the subject of extensive research and application across various scientific domains. The method has found success in diverse areas such as finance, healthcare, natural language processing, wearable devices, and computer vision, demonstrating its versatility and expressive power. This adaptability has established the signature method as a valuable asset in ongoing research on modern machine learning techniques for streamed data.
The signature method’s application to sequential data representation has evolved substantially, extending far beyond its original scope and undergoing considerable refinement. To conclude, we aim to provide a concise overview of some notable examples where the signature method has been employed to tackle diverse challenges in machine learning and data analysis. Through these examples, we hope to showcase the wide-ranging implications and potential of this innovative approach in further advancing the field of machine learning.
Already in 2005, Lyons and Sidorova applied rough path theory to the problem of sound compression [54]. They presented a new, signature-based approach that provides an alternative to traditional Fourier and wavelet transforms. The main difference is that the Fourier approach is linear, whereas the signature method accounts for non-linear dependencies.
The signature transformation has been extensively used in financial applications. Field et al. [25] applied the signature transformation to multidimensional financial data streams to identify atypical market behavior and recognize patterns of trading algorithms. In [49], Levin et al. proposed a new non-parametric regression method based on the expected signature of paths. Truncating the signature results in an efficient local description and offers a provable efficiency gain over linear path-segment descriptions with the same number of features. This approach can potentially achieve significant dimension reduction in highly oscillatory data streams. In numerical examples, their method achieves accuracy similar to that of the Gaussian process approach but with lower computational cost, especially for large sample sizes. Truncated signature-based regression and classification models have further led to numerous applications in finance, such as obtaining approximate solutions to the problem of optimal execution [37], non-parametric pricing and hedging of exotic derivatives [58], and many others [4].
The ability of the signature transformation to handle complex data streams, reduce the dimensionality of the feature space, and enhance prediction accuracy makes it a valuable asset in healthcare and medicine. As wearable technology becomes more common, these devices generate enormous amounts of data. By analyzing data from connected devices, e.g., heart rate monitors, fitness trackers, sleep monitors, gait sensors, and other modalities, researchers can gain insights into patients’ health conditions, for example by identifying patients with neurological disorders who have a higher risk of falls [68]. Electronic health records (EHRs) contain vast amounts of data, including patient demographics, medical history, medications, and lab results. Applying the signature method to EHRs can help identify patterns and trends, enable better disease management, and support early detection of potential health issues, such as the early onset of sepsis [65] and the diagnosis of Alzheimer’s disease using brain image-derived biomarkers (e.g., the hippocampus, ventricle, and whole-brain volumes) [63]. The signature method has also been applied to data from self-reporting tools and mobile applications to detect longitudinal patterns [44] in clinical trials [30], predict mood changes and episodes [43], derive clinically meaningful information from daily self-reported mood ratings to classify patients’ diagnoses and predict mood scores [66], and efficiently capture chronological information in medical texts [45].
The signature approach has been applied to computer vision tasks such as human-pose estimation [80] and handwriting recognition in various languages [78, 79]. It has also been used in deep learning [39], neural controlled differential equations [40], and topological data analysis [19]. Signatures have furthermore been combined with kernel methods applicable to problems in classification and hypothesis testing [17, 41].
Further examples of applications of rough paths, signatures, and neural controlled differential equations can be found at https://www.datasig.ac.uk/papers.

Acknowledgements

We are thankful to Horatio Boedihardjo, Simone Dari, Terry Lyons, Hao Ni, Harald Oberhauser, and Danyu Yang for fruitful discussions and improvements of the manuscript. IC acknowledges support from the EPSRC via the New Investigator Award EP/X015688/1. AK is supported in part by the National Institute for Health and Care Research (NIHR) AI Award (AI_AWARD02183) and by a research grant from GlaxoSmithKline. AK wishes to thank the Oxford-Man Institute of Quantitative Finance for hospitality during the course of this work and gratefully acknowledges the support of the Wellcome Trust grant No: 102616/Z/13/Z, “CONBRIO”.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Title
A Primer on the Signature Method in Machine Learning
Authors
Ilya Chevyrev
Andrey Kormilitzin
Copyright Year
2026
DOI
https://doi.org/10.1007/978-3-031-97239-3_1
1. It is necessary to speak of the law of the signature \(S(X)_{a,b}\) instead of the law of the path \(X\) because the signature does not, in general, uniquely determine the path; see Proposition 2.34 and the discussion beforehand.
 
1. A.A. Agrachev, Y.L. Sachkov, Control Theory from the Geometric Viewpoint, volume 87 of Encyclopaedia of Mathematical Sciences (Springer, Berlin, 2004), Control Theory and Optimization, II
2. E. Alpaydin, F. Alimoglu, Pen-based recognition of handwritten digits. UCI Machine Learning Repository (1998). https://doi.org/10.24432/C5MG6K
3. D. Applebaum, Lévy Processes and Stochastic Calculus, volume 116 of Cambridge Studies in Advanced Mathematics, 2nd edn. (Cambridge University Press, Cambridge, 2009)
4. I.P. Arribas, Signatures in machine learning and finance. PhD thesis, University of Oxford, 2020
5. P. Billingsley, Probability and Measure. Wiley Series in Probability and Statistics (Wiley, Hoboken, 2012), Anniversary edition, with a foreword by Steve Lalley and a brief biography of Billingsley by Steve Koppes
6. H. Boedihardjo, H. Ni, Z. Qian, Uniqueness of signature for simple curves. J. Funct. Anal. 267(6), 1778–1806 (2014)
7. H. Boedihardjo, X. Geng, T. Lyons, D. Yang, The signature of a rough path: uniqueness. Adv. Math. 293, 720–737 (2016)
8. H. Boedihardjo, J. Diehl, M. Mezzarobba, H. Ni, The expected signature of Brownian motion stopped on the boundary of a circle has finite radius of convergence. Bull. Lond. Math. Soc. 53(1), 285–299 (2021)
9. R.W. Brockett, Finite Dimensional Linear Systems, volume 74 of Classics in Applied Mathematics (SIAM, Philadelphia, 2015). Reprint of the 1970 original
10. T. Cass, M. Ogrodnik, Tail estimates for Markovian rough paths. Ann. Probab. 45(4), 2477–2504 (2017)
11. A. Chechkin, I. Pavlyukevich, Marcus versus Stratonovich for systems with jump noise. J. Phys. A 47(34), 342001 (2014)
12. K.-T. Chen, Integration of paths, geometric invariants and a generalized Baker-Hausdorff formula. Ann. Math. (2) 65, 163–178 (1957)
13. K.-T. Chen, Integration of paths—a faithful representation of paths by non-commutative formal power series. Trans. Amer. Math. Soc. 89, 395–407 (1958)
14. I. Chevyrev, Random walks and Lévy processes as rough paths. Probab. Theory Relat. Fields 170(3–4), 891–932 (2018)
15. I. Chevyrev, P.K. Friz, Canonical RDEs and general semimartingales as rough paths. Ann. Probab. 47(1), 420–463 (2019)
16. I. Chevyrev, T. Lyons, Characteristic functions of measures on geometric rough paths. Ann. Probab. 44(6), 4049–4082 (2016)
17. I. Chevyrev, H. Oberhauser, Signature moments to characterize laws of stochastic processes. J. Mach. Learn. Res. 23(176), 1–42 (2022)
18. I. Chevyrev, P.K. Friz, A. Korepanov, I. Melbourne, Superdiffusive limits for deterministic fast-slow dynamical systems. Probab. Theory Relat. Fields 178(3–4), 735–770 (2020)
19. I. Chevyrev, V. Nanda, H. Oberhauser, Persistence paths and signature features in topological data analysis. IEEE Trans. Pattern Anal. Mach. Intell. 42(1), 192–202 (2020)
20. I. Chevyrev, J. Diehl, K. Ebrahimi-Fard, N. Tapia, A multiplicative surface signature through its Magnus expansion. arXiv e-prints, June (2024)
21. I. Chevyrev, A. Gerasimovičs, H. Weber, Feature engineering with regularity structures. J. Sci. Comput. 98(1), Paper No. 13 (2024)
22. I. Chevyrev, A. Korepanov, I. Melbourne, Superdiffusive limits beyond the Marcus regime for deterministic fast-slow systems. Commun. Am. Math. Soc. 4, 746–786 (2024)
23. R. Cont, P. Tankov, Financial Modelling with Jump Processes. Chapman & Hall/CRC Financial Mathematics Series (Chapman & Hall/CRC, Boca Raton, 2004)
24. J. Diehl, L. Schmitz, Two-parameter sums signatures and corresponding quasisymmetric functions. Preprint. arXiv:2210.14247 (2022)
25. J. Field, L.G. Gyurkó, M. Kontkowski, T. Lyons, Extracting information from the signature of a financial data stream. Preprint. arXiv:1307.7244 (2014)
26. P.K. Friz, M. Hairer, A Course on Rough Paths. Universitext (Springer, Cham, 2014). With an introduction to regularity structures
27. P.K. Friz, A. Shekhar, General rough integration, Lévy rough paths and a Lévy-Kintchine-type formula. Ann. Probab. 45(4), 2707–2765 (2017)
28. P.K. Friz, N.B. Victoir, Multidimensional Stochastic Processes as Rough Paths, volume 120 of Cambridge Studies in Advanced Mathematics (Cambridge University Press, Cambridge, 2010)
29. P.K. Friz, H. Zhang, Differential equations driven by rough paths with jumps. J. Differential Equations 264(10), 6226–6301 (2018)
30. J.R. Geddes, A. Gardiner, J. Rendell, M. Voysey, E. Tunbridge, C. Hinds, L.-M. Yu, J. Hainsworth, M.-J. Attenburrow, J. Simon et al., Comparative evaluation of quetiapine plus lamotrigine combination versus quetiapine monotherapy (and folic acid versus placebo) in bipolar depression (CEQUEL): a \(2\times 2\) factorial randomised trial. Lancet Psychiatry 3, 31 (2015)
31. X. Geng, Reconstruction for the signature of a rough path. Proc. Lond. Math. Soc. (3) 114(3), 495–526 (2017)
32. C. Giusti, D. Lee, V. Nanda, H. Oberhauser, A topological approach to mapping space signatures. Adv. Appl. Math. 163(part A), Paper No. 102787 (2025)
33. B. Graham, Sparse arrays of signatures for online character recognition. Preprint. arXiv:1308.0371 (2013)
34. M. Gubinelli, Ramification of rough paths. J. Differential Equations 248(4), 693–721 (2010)
35. M. Hairer, A theory of regularity structures. Invent. Math. 198(2), 269–504 (2014)
36. B. Hambly, T. Lyons, Uniqueness for the signature of a path of bounded variation and the reduced path group. Ann. Math. (2) 171(1), 109–167 (2010)
37. J. Kalsi, T. Lyons, I.P. Arribas, Optimal execution with rough path signatures. SIAM J. Financ. Math. 11(2), 470–493 (2020)
38. P. Kidger, T. Lyons, Signatory: differentiable computations of the signature and logsignature transforms, on both CPU and GPU, in International Conference on Learning Representations (2021). https://github.com/patrick-kidger/signatory
39. P. Kidger, P. Bonnier, I. Perez Arribas, C. Salvi, T. Lyons, Deep signature transforms, in Advances in Neural Information Processing Systems, vol. 32 (2019)
40. P. Kidger, J. Morrill, J. Foster, T. Lyons, Neural controlled differential equations for irregular time series, in Advances in Neural Information Processing Systems, vol. 33 (2020), pp. 6696–6707
41. F.J. Király, H. Oberhauser, Kernels for sequentially ordered data. J. Mach. Learn. Res. 20, Paper No. 31 (2019)
42. B. Korenblum, A. Mascuilli, J. Panariello, A generalization of Carleman’s uniqueness theorem and a discrete Phragmén-Lindelöf theorem. Proc. Amer. Math. Soc. 126(7), 2025–2032 (1998)
43. A. Kormilitzin, K. Saunders, P. Harrison, J. Geddes, T. Lyons, Application of the signature method to pattern recognition in the CEQUEL clinical trial. Preprint. arXiv:1606.02074 (2016)
44. A. Kormilitzin, K.E. Saunders, P.J. Harrison, J.R. Geddes, T. Lyons, Detecting early signs of depressive and manic episodes in patients with bipolar disorder using the signature-based model. Preprint. arXiv:1708.01206 (2017)
45. A. Kormilitzin, N. Vaci, Q. Liu, H. Ni, G. Nenadic, A. Nevado-Holgado, An efficient representation of chronological events in medical texts, in Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis (2020), pp. 97–103
46. T.G. Kurtz, E. Pardoux, P. Protter, Stratonovich stochastic differential equations driven by general semimartingales. Ann. Inst. H. Poincaré Probab. Stat. 31(2), 351–377 (1995)
47. D. Lee, The surface signature and rough surfaces. arXiv e-prints, June (2024)
48. D. Lee, H. Oberhauser, Random surfaces and higher algebra. Preprint. arXiv:2311.08366 (2023)
49. D. Levin, T. Lyons, H. Ni, Learning from the past, predicting the statistics for the future, learning an evolving system. Preprint. arXiv:1309.0260 (2015)
50. S. Li, H. Ni, Expected signature of stopped Brownian motion on d-dimensional \(C^{2,\alpha }\)-domains has finite radius of convergence everywhere: \(2\leq d\leq 8\). J. Funct. Anal. 282(12), Paper No. 109447 (2022)
51. S. Liao, Log signatures in machine learning. PhD thesis, University College London, 2022
52. S. Liao, T. Lyons, W. Yang, H. Ni, Learning stochastic differential equations using RNN with log signature features. Preprint. arXiv:1908.08286 (2019)
53. T.J. Lyons, Differential equations driven by rough signals. Rev. Mat. Iberoam. 14(2), 215–310 (1998)
54. T.J. Lyons, N. Sidorova, Sound compression: a rough path approach, in Proceedings of the 4th International Symposium on Information and Communication Technologies (Trinity College, Dublin, 2005), pp. 223–228
55. T.J. Lyons, W. Xu, Hyperbolic development and inversion of signature. J. Funct. Anal. 272(7), 2933–2955 (2017)
56. T.J. Lyons, W. Xu, Inverting the signature of a path. J. Eur. Math. Soc. (JEMS) 20(7), 1655–1687 (2018)
57. T.J. Lyons, M. Caruana, T. Lévy, Differential Equations Driven by Rough Paths (Springer, Berlin, 2007)
58. T. Lyons, S. Nejad, I. Perez Arribas, Non-parametric pricing and hedging of exotic derivatives. Appl. Math. Finance 27(6), 457–494 (2020)
59. S.I. Marcus, Modeling and analysis of stochastic differential equations driven by point processes. IEEE Trans. Inform. Theory 24(2), 164–172 (1978)
60. S.I. Marcus, Modeling and approximation of stochastic differential equations driven by semimartingales. Stochastics 4(3), 223–245 (1980/1981)
61. L. McInnes, J. Healy, N. Saul, L. Großberger, UMAP: Uniform manifold approximation and projection. J. Open Source Softw. 3(29), 861 (2018)
62. I. Melbourne, R. Zweimüller, Weak convergence to stable Lévy processes for nonuniformly hyperbolic dynamical systems. Ann. Inst. Henri Poincaré Probab. Stat. 51(2), 545–556 (2015)
63. P. Moore, T. Lyons, J. Gallacher, A.D.N. Initiative, Using path signatures to predict a diagnosis of Alzheimer’s disease. PLoS One 14(9), e0222212 (2019)
64. J. Morrill, A. Fermanian, P. Kidger, T. Lyons, A generalised signature method for multivariate time series feature extraction. Preprint. arXiv:2006.00873 (2020)
65. J.H. Morrill, A. Kormilitzin, A.J. Nevado-Holgado, S. Swaminathan, S.D. Howison, T.J. Lyons, Utilization of the signature method to identify the early onset of sepsis from multivariate physiological time series in critical care monitoring. Crit. Care Med. 48(10), e976–e981 (2020)
66. I. Perez Arribas, G.M. Goodwin, J.R. Geddes, T. Lyons, K.E. Saunders, A signature-based machine learning model for distinguishing bipolar disorder and borderline personality disorder. Transl. Psychiatry 8(1), 274 (2018)
67. R. Ree, Lie elements and an algebra associated with shuffles. Ann. Math. (2) 68, 210–220 (1958)
68. R.Z.U. Rehman, Y. Zhou, S. Del Din, L. Alcock, C. Hansen, Y. Guan, T. Hortobágyi, W. Maetzler, L. Rochester, C.J. Lamoth, Gait analysis with wearables can accurately classify fallers from non-fallers: a step toward better management of neurological disorders. Sensors 20(23), 6992 (2020)
69. J. Reizenstein, Calculation of iterated-integral signatures and log signatures. Preprint. arXiv:1712.02757 (2017)
70. J.F. Reizenstein, Iterated-integral signatures in machine learning. PhD thesis, University of Warwick, 2019
71. J.F. Reizenstein, B. Graham, Algorithm 1004: the iisignature library: efficient calculation of iterated-integral signatures and log signatures. ACM Trans. Math. Softw. 46(1), 1–21 (2020)
72. C. Reutenauer, Free Lie Algebras, volume 7 of London Mathematical Society Monographs. New Series (The Clarendon Press/Oxford University Press, New York, 1993). Oxford Science Publications
73. W. Rudin, Principles of Mathematical Analysis. International Series in Pure and Applied Mathematics, 3rd edn. (McGraw-Hill, New York, 1976)
74. T. Lyons, D. Maxwell, esig, Version 0.7.3. Python package. Accessed July 7, 2025 (2025). https://esig.readthedocs.io
75. S. Tindel, K. Chouk, Skorohod and Stratonovich integration in the plane. Electron. J. Probab. 20(39) (2015)
76. W. Whitt, Stochastic-Process Limits. Springer Series in Operations Research (Springer, New York, 2002). An introduction to stochastic-process limits and their application to queues
77. D.R.E. Williams, Path-wise solutions of stochastic differential equations driven by Lévy processes. Rev. Mat. Iberoam. 17(2), 295–329 (2001)
78. D. Wilson-Nunn, T. Lyons, A. Papavasiliou, H. Ni, A path signature approach to online Arabic handwriting recognition, in 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR) (IEEE, Piscataway, 2018), pp. 135–139
79. Z. Xie, Z. Sun, L. Jin, H. Ni, T. Lyons, Learning spatial-semantic context with fully convolutional recurrent network for online handwritten Chinese text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40(8), 1903–1917 (2017)
80. W. Yang, T. Lyons, H. Ni, C. Schmid, L. Jin, Developing the path signature methodology and its application to landmark-based human action recognition, in Stochastic Analysis, Filtering, and Stochastic Optimization: A Commemorative Volume to Honor Mark HA Davis’s Contributions (Springer, 2022), pp. 431–464
81. S. Zhang, G. Lin, S. Tindel, Two-dimensional signature of images and texture classification. Proc. R. Soc. A Math. Phys. Eng. Sci. 478(2266), 20220346 (2022)