This chapter delves into the application of the signature Maximum Mean Discrepancy (sig-MMD) in two-sample statistical tests, highlighting its versatility and robustness in comparing distributions. It provides practical insights for setting up machine learning algorithms, including hyperparameter settings and common pitfalls, to effectively navigate challenges. The text explores the relationship between batch size and the probability of Type 2 errors, demonstrating how path scaling, lead-lag transformation, standardisation, and feature space transformation using the RBF kernel can make the test more robust. Additionally, it quantifies the information carried by the level terms of the signature, showing how higher-level terms contain crucial distributional properties that can be targeted using tailored \(\phi\)-signature kernels. The chapter concludes by demonstrating the power of the test as a function of truncation level, batch size, feature space, and \(\phi\)-signature kernel, making it a powerful statistical tool for two-sample hypothesis testing.
AI Generated: This summary of the content was generated with the help of AI.
Abstract
Maximum Mean Discrepancy (MMD) is a widely used concept in machine learning research which has gained popularity in recent years as a highly effective tool for comparing (finite-dimensional) distributions. Since it is designed as a kernel-based method, the MMD can be extended to path space valued distributions using the signature kernel. The resulting signature MMD (sig-MMD) can be used to define a metric between distributions on path space. Similarly to the original use case of the MMD as a test statistic within a two-sample testing framework, the sig-MMD can be applied to determine if two sets of paths are drawn from the same stochastic process. This chapter is dedicated to understanding the possibilities and challenges associated with applying the sig-MMD as a statistical tool in practice. We introduce and explain the sig-MMD, and provide easily accessible and verifiable examples for its practical use. We present examples that can lead to Type 2 errors in the hypothesis test, falsely indicating that samples have been drawn from the same underlying process (which generally occurs in a limited data setting). We then present techniques to mitigate the occurrence of this type of error.
1 Introduction
The seminal paper A Kernel Two-Sample Test [21] lays the foundation for establishing a versatile and robust framework for analysing and comparing distributions. This framework has been assessed and benchmarked against established classical two-sample tests in the finite-dimensional setting (such as the Kolmogorov-Smirnov test), where it demonstrated superior or comparable performance [12, 19, 21]. It has also been shown to be well-suited for machine learning (ML) applications [2, 10, 28, 33, 49]. This framework is therefore particularly well-known for its applications to analysing finite-dimensional distributions in an ML context.
The test statistic which underpins this framework is based on computing the largest average difference between distributions, over functions in the unit ball of a Reproducing Kernel Hilbert Space (RKHS). This concept has been named the kernel Maximum Mean Discrepancy (MMD) [21]. On an appropriate (rich enough) space, the kernel MMD determines a metric between probability distributions. Using a signature kernel (\(k_{\text{Sig}}\)) (cf. [8, 12, 31, 39]), the aforementioned MMD can be established for measures on path space defined over the set \(\mathcal {X}_{1-\text{var}}\). Similar to the originally proposed framework of the MMD (cf. [21]), its path-valued counterpart (sig-MMD) can also be applied as a test statistic in a two-sample hypothesis test to determine if two collections of time series are drawn from different stochastic processes. If \(\mathbf {X}\) and \(\mathbf {Y}\) are two stochastic processes with laws \(\mathbb {P}_{\mathbf {X}} = \mathbb {P} \circ {\mathbf {X}}^{-1}\) and \(\mathbb {P}_{\mathbf {Y}} = \mathbb {P} \circ {\mathbf {Y}}^{-1}\) respectively, the two-sample test is described by the following null hypothesis (\(H_{0}\)) and alternative hypothesis (\(H_{1}\)):
The sig-MMD has facilitated the use of distribution regression on path space [2, 27, 40] and the training of generative models for time series [7, 28].
Since the path signature is infinite dimensional, extending these concepts to path space using the (full) signature within numerical algorithms is non-trivial. Recent literature focuses on methods for the numerical computation of the signature kernel. These include (i) truncating the signature [31], (ii) using a kernel trick by means of an associated partial differential equation (PDE) to work with the full signature [39, 40], and (iii) using random Fourier features [38, 47]. This chapter aims to demonstrate how to use the sig-MMD in practice, with specific focus on effectively navigating the challenges that may emerge in this context. We give practical insights for setting up the associated ML algorithms that can save the interested user considerable time when implementing these tools. We also suggest hyperparameter settings and discuss common pitfalls. We provide simple and easy-to-verify examples together with explanatory discussions which help build intuition surrounding these tools.
As mentioned in the chapter on signature kernels (Chap. 3 (The Signature Kernel)), in practice, the numerical approximation of the signature kernel is based on various hyperparameters. We study the statistical power of the two-sample hypothesis test as a function of the following subset of hyperparameters:
Static Kernel Parameters \(\Theta _{k}\): Instead of working directly with the original paths, the paths can be embedded in an infinite dimensional feature space using another kernel [31, 39].
Truncation Parameter \(\mathbb {N}_{\infty }\): Since the signature of a path is an infinite dimensional object, it is not possible to work with the full signature directly. One option is to truncate the signature, using only its first \(M \in \mathbb {N}\) terms. Alternatively, when using kernel methods, a kernel trick allows one to work with the full signature; in this case, computing the signature kernel amounts to solving a PDE [40]. Therefore, one works either with an imperfect representation of the signature through its truncated version, or with the full signature by way of the kernel trick.
Preprocessing Parameters \(\Theta _{\text{preprocess}}\): The paths can be preprocessed by applying lead-lag transformations [11, 17, 26, 46], adding a time component [11, 46], and standardising dimensions [28].
Normalisation Parameters \(\Theta _{\text{normalisation}}\): Normalising the signature kernel results in a more robust statistic [12]. Moreover, by rescaling inner products, a generalised form of the signature kernel is obtained through which further sig-MMDs can be constructed [8, 9].
Time Discretisation Parameters \(\Theta _{\text{discrete}}\): In practice, computations are performed on discrete data. The discretisation of the time interval can sometimes be controlled. Finer grids provide better approximations of the stochastic process [13] whilst also increasing the computational burden. In addition, if working with the full signature through the PDE approach, the PDE is solved on a discrete grid which is itself controlled through a hyperparameter.
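To build intuition for the truncation parameter above, the truncated signature of a piecewise-linear path can be computed explicitly: each linear segment with increment \(\Delta\) has signature \(\exp(\Delta) = (1, \Delta, \Delta^{\otimes 2}/2!, \cdots)\), and segments are combined with Chen's relation. Below is a minimal NumPy sketch (function names are our own; for realistic depths and dimensions, a dedicated signature library should be preferred):

```python
import numpy as np

def segment_signature(delta, depth):
    # Signature of one linear segment: the level-m term is delta^{(x)m} / m!
    sig = [np.ones(())]          # level 0 is the scalar 1
    term = np.ones(())
    for m in range(1, depth + 1):
        term = np.multiply.outer(term, delta) / m
        sig.append(term)
    return sig

def chen_product(a, b, depth):
    # Chen's relation: the signature of a concatenated path is the
    # tensor-algebra product of the signatures of its pieces
    return [sum(np.multiply.outer(a[i], b[m - i]) for i in range(m + 1))
            for m in range(depth + 1)]

def truncated_signature(path, depth):
    # Truncated signature of a piecewise-linear path given as a (T, d) array;
    # returns a list of levels, level m having shape (d,) * m
    sig = segment_signature(path[1] - path[0], depth)
    for t in range(1, len(path) - 1):
        sig = chen_product(sig, segment_signature(path[t + 1] - path[t], depth), depth)
    return sig
```

A useful sanity check: for a straight-line path, the level-one term equals the total increment and the level-two term is half its outer square, and refining a straight line into more segments leaves the signature unchanged.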
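The preprocessing transformations listed above are straightforward to implement. A hedged NumPy sketch follows (function names are our own, and conventions for the lead-lag transform vary slightly across references):

```python
import numpy as np

def add_time(path):
    # Augment a (T, d) path with a normalised time channel in [0, 1]
    t = np.linspace(0.0, 1.0, len(path))[:, None]
    return np.concatenate([t, path], axis=1)

def lead_lag(path):
    # Lead-lag transform of a (T, d) path: interleave a leading copy
    # (which jumps first) with a lagging copy, giving a (2T - 1, 2d) path
    doubled = np.repeat(path, 2, axis=0)
    return np.concatenate([doubled[1:], doubled[:-1]], axis=1)

def standardise(paths):
    # Standardise each dimension across a batch of paths of shape (N, T, d)
    mu = paths.mean(axis=(0, 1), keepdims=True)
    sd = paths.std(axis=(0, 1), keepdims=True)
    return (paths - mu) / (sd + 1e-12)
```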
The following sections are aimed at quantifying and interpreting the behaviour of the sig-MMD test statistic in the context of the two-sample hypothesis test. This behaviour is largely characterised by the above parameters, and we therefore discuss the role these parameters play in improving performance on the hypothesis test. The variance of the sig-MMD plays a significant role in characterising this behaviour and is closely linked with the number of sample paths used. Although this can sometimes be addressed directly by drawing more samples from the processes and/or collecting additional data if the paths correspond to a physical system, one cannot disregard the computational bottlenecks associated with the sig-MMD; increasing the sample size is therefore generally not a feasible solution. We provide insights into stabilising the errors associated with the sig-MMD in the two-sample hypothesis testing framework in a high variance setting.
Specifically, we focus on the following:
1.
Hypothesis Testing Errors Associated with the Sig-MMD: In most cases, the standard two-sample test using the sig-MMD on the space of stochastic processes has a high probability of a Type 2 error (acceptance of \(H_0\) when \(H_1\) is true) [3]. We explore techniques for reducing this probability. We show that scaling of the input paths plays a critical role in stabilising the statistical test. The effect of scaling on the signature kernel was first studied in [8, 9], where it was shown that scaling has the equivalent effect of weighting inner products [8, Lemma 2.9], [9, Lemma 2.10]. By weighting inner products, we can zoom in on specific moments of the processes and tailor the two-sample hypothesis test to the specific processes. We leverage the relationship between path scaling and the decaying absolute value of the graded signature terms to construct a two-sample test statistic targeted at improving the test power. In practice, the true distributions of the stochastic processes are not observed; instead, empirical distributions are used to approximate them. Since sample sizes affect the quality of the approximation, we study the likelihood of Type 2 errors occurring across various sample sizes. We also study the effect of path scaling on the likelihood of a Type 1 error occurring.
2.
Information Associated with the Signature Level Terms: We study the effects the different levels of the signature have on the statistical power of the two-sample test. In particular, we focus on the trade-off between the level of truncation (\(\mathbb {N}_{\infty }\)) and the information embedded within the higher-order terms. Using the kernel trick allows us to study the effects of including the full signature when computing the sig-MMD. We then need to account for the factorial decay of the signature terms to ensure that higher-level terms are contributing to the final value of the sig-MMD.
2 General Signature MMD
The path signature is well suited for working with data streams as a feature map for learning tasks. As a transform, the signature is invariant to reparametrisation, filtering out symmetries [39], and a continuous real-valued function on path space can be approximated by a linear function of the signature arbitrarily well [17, 31]. This result is analogous to the classical result that a continuous real-valued function defined on a closed interval can be approximated arbitrarily well by a polynomial function. In our case, the signature corresponds to a polynomial on path space. In terms of learning tasks, instead of fitting a non-linear function on path space, a linear function (i.e. linear regression) with the (truncated) signature terms as regressors can be used.
When a path \(\mathbf {x}\) has finite p-variation (\(p \geq 1\)), the signature terms (\(\Phi _{\text{Sig}, m} \left (\mathbf {x}\right )\)) decay factorially according to the uniform estimate [39], [46, Theorem 3.2], [45, Lemma 5.1]
where \(\left | \left | \cdot \right | \right |_{p}\) denotes the p-variation metric. Computationally, working with the signature is challenging as it is infinite dimensional, and so, a large proportion of signature-based ML approaches use the truncated signature. In terms of absolute value, only a small amount of information is lost by truncating due to the factorial decay [33, 36, 46]. However, we show that higher-level terms play a critical role in capturing distribution differences through the sig-MMD. In a regression setting, specific signature terms can be re-weighted to account for the factorial decay. In the case of the sig-MMD, this is a non-trivial task. This plays a key role in this chapter.
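The factorial decay is easy to see on the simplest example. For a straight-line path with increment \(\Delta\), the level-m signature term is \(\Delta^{\otimes m}/m!\), whose norm is \(\lVert\Delta\rVert^{m}/m!\); once m exceeds \(\lVert\Delta\rVert\), the terms shrink factorially. A small numerical illustration:

```python
import math
import numpy as np

delta = np.array([1.0, 2.0])   # increment of a straight-line path
# norm of the level-m signature term, ||delta||^m / m!, for m = 1, ..., 7
norms = [np.linalg.norm(delta) ** m / math.factorial(m) for m in range(1, 8)]

# decay sets in once m exceeds ||delta|| ~ 2.24, and is then factorially fast
assert all(norms[i] > norms[i + 1] for i in range(1, 6))
```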
2.1 Sig-MMD
Let H be an RKHS associated with kernel k, and let \(\mathcal {F}\) be the unit ball in H defined over a compact metric space \(\mathcal {X}\). Given two distributions \(\mu , \nu \in \mathcal {P} \left (\mathcal {X}\right )\), the MMD between \(\mu \) and \(\nu \) is defined as [21]
If k is a continuous and characteristic kernel, then \(d_{k}\) is a metric [21, Theorem 5]. If H is the RKHS associated with the signature kernel \(k_{\text{Sig}}\), the corresponding MMD is the sig-MMD. Since the signature kernel is characteristic [12, 40] and universal [12] when restricted to a compact subset \(\mathcal {K} \subset \mathcal {X}_{1-\text{var}}\), the sig-MMD is a metric on the space of distributions of stochastic processes defined over paths in \(\mathcal {K}\). Throughout this chapter, \(\mathcal {K} \subset \mathcal {X}_{1-\text{var}}\) denotes a compact subset of \(\mathcal {X}_{1-\text{var}}\).
Throughout this chapter, we consider processes in continuous time. However, in practice, we only have access to discrete datapoints. A path is constructed from the data using linear interpolation. The linearly interpolated paths are of bounded variation and belong to the set \(\mathcal {X}_{1-\text{var}}\). For this reason, throughout this chapter we assume that all paths are elements of \(\mathcal {X}_{1-\text{var}}\), even though their continuous counterparts may not be.
Given two stochastic processes \(\mathbf {X}\) and \(\mathbf {Y}\), to compute the sig-MMD between \(\mathbb {P}_{\mathbf {X}}\) and \(\mathbb {P}_{\mathbf {Y}}\), it is first re-formulated in terms of Kernel Mean Embeddings (KMEs). The KME of a stochastic process maps the distribution of the stochastic process to an element in a RKHS. Let \(H_{\text{Sig}}\) be the RKHS with kernel \(k_{\text{Sig}}\) restricted to paths in the compact subset \(\mathcal {K}\). Since the signature kernel is characteristic when restricted to \(\mathcal {K}\), each distribution defined over paths in \(\mathcal {K}\) is embedded as a unique element within the RKHS \(H_{\text{Sig}}\). Formally, the KME is given by the mapping
with \(\left \lVert \cdot \right \rVert _{H_{\text{Sig}}}\) being the norm induced by the inner product of the RKHS \(H_{\text{Sig}}\).
The KME formulation provided in Eq. (2) shows that the sig-MMD is the norm of the difference between the embeddings of the laws of the stochastic processes. Given independent copies \(\mathbf {X}, {\mathbf {X}}^{\prime } \sim \mathbb {P}_{\mathbf {X}}\) and \(\mathbf {Y}, {\mathbf {Y}}^{\prime } \sim \mathbb {P}_{\mathbf {Y}}\), by expanding the norm into inner products and using the reproducing property of the kernel, an alternative formulation of the sig-MMD is given by [21, Lemma 6]
are available. Here, \(N\) and \(M\) are the sample sizes (batch sizes); throughout this chapter, the terms sample size and batch size are used interchangeably.
Given empirical samples, the MMD can be computed using either a biased or an unbiased estimator [21]. In the numerical examples presented in Sect. 4, we compare the performance of the two estimators when performing the two-sample hypothesis test. We show that in certain cases, the biased estimator has superior performance over the unbiased estimator. A biased estimator is given by
The bias present in the biased estimator is caused by the diagonal terms \(k_{\text{Sig}} \left ({\mathbf {X}}^{i}, {\mathbf {X}}^{i}\right )\) and \(k_{\text{Sig}} \left ({\mathbf {Y}}^{i}, {\mathbf {Y}}^{i}\right )\) included in the computation of the sample averages.
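Both estimators can be sketched from precomputed Gram matrices (a sketch with our own naming; in practice the Gram matrices would come from a signature-kernel implementation). The unbiased estimator excludes the diagonal terms, while the biased one keeps them:

```python
import numpy as np

def mmd_squared(K_xx, K_yy, K_xy, unbiased=True):
    """Squared MMD from precomputed Gram matrices.

    K_xx: (N, N) kernel values k(X_i, X_j); K_yy: (M, M); K_xy: (N, M).
    The unbiased estimator drops the diagonal terms k(X_i, X_i), k(Y_i, Y_i);
    the biased one keeps them, which is the source of its bias.
    """
    n, m = K_xx.shape[0], K_yy.shape[0]
    if unbiased:
        xx = (K_xx.sum() - np.trace(K_xx)) / (n * (n - 1))
        yy = (K_yy.sum() - np.trace(K_yy)) / (m * (m - 1))
    else:
        xx = K_xx.sum() / n**2
        yy = K_yy.sum() / m**2
    xy = K_xy.sum() / (n * m)
    return xx + yy - 2 * xy
```

Note that the unbiased estimate can be negative, whereas the biased one never is.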
2.3 General Signature Kernels
The authors of [8, 9] extend the signature kernel to more generic kernels. A weight function is applied to the level terms of the signature to weight their relative importance in the kernel computation. Let \(\phi \colon \mathbb {N} \cup \left \{0 \right \} \mapsto \mathbb {R}^{+}\) be a weight function and \(\mathbf {s} = \left ({\mathbf {s}}_{0}, {\mathbf {s}}_{1}, \cdots \right )\), \(\mathbf {t} = \left ({\mathbf {t}}_{0}, {\mathbf {t}}_{1}, \cdots \right ) \in \prod _{m \geq 0} \left (\mathbb {R}^{d}\right )^{\otimes m}\). We then define the bilinear form
Given two paths \(\mathbf {x}, \mathbf {y} \in \mathcal {X}_{1-\text{Var}}\) and a weight function \(\phi \colon \mathbb {N} \cup \left \{0\right \} \mapsto \mathbb {R}^{+}\), the \(\phi \)-signature kernel is defined as
The following lemma provides a sufficient condition on the weight function \(\phi \) to guarantee that the \(\phi \)-signature kernel is well-defined [8, Lemma 2.4], [9, Lemma 2.5].
Lemma 2.1
The\(\phi \)-signature kernel is well-defined provided the function\(\phi \colon \mathbb {N} \cup \left \{0 \right \} \mapsto \mathbb {R}^{+}\)is such that the series
Using the \(\phi \)-signature kernel, a general version of the sig-MMD can be defined. The \(\phi \)-MMD between the distributions of stochastic processes \(\mathbf {X}\), \(\mathbf {Y}\) is given by
Since the mapping \(\mu _{\phi } \colon \mathcal {P} \left (\mathcal {K}\right ) \mapsto H_{\phi }\) is injective [9], the \(\phi \)-MMD is a metric. If \(\phi \) is the constant mapping \(\phi \colon \mathbb {N} \cup \left \{0\right \} \mapsto \left \{1\right \}\), the \(\phi \)-signature kernel is \(k_{\text{Sig}}\).
2.4 Example of General Signature Kernels
The \(\phi \)-signature kernel provides a general framework for expanding the set of available signature-based kernels. We now provide a specific example of the \(\phi \)-signature kernel and its broader implications on the \(\phi \)-MMD. This \(\phi \)-MMD is used for the purpose of performing the two-sample hypothesis test in Sect. 4.
2.4.1 Scalar Multiplication
Let \(\mathbf {s} = \left ({\mathbf {s}}_{0}, {\mathbf {s}}_{1}, \cdots \right ) \in \prod _{m \geq 0} \left (\mathbb {R}^{d}\right )^{\otimes m}\) and consider the mapping
for some scalar \(\theta \in \mathbb {R}^{+}\). Incorporating this within the bilinear form \(\langle \cdot , \cdot \rangle _{\phi }\), we have that [8, Lemma 2.9]
Applying the homomorphism \(\gamma _{\theta }\) to elements of the extended tensor algebra is equivalent to scaling the individual bilinear forms \(\langle \cdot , \cdot \rangle _{m}\) within the overall computation of the bilinear form \(\langle \cdot , \cdot \rangle _{\phi }\). In our specific context, \(\mathbf {s}\) and \(\mathbf {t}\) are signatures corresponding to two paths \(\mathbf {x}\) and \(\mathbf {y}\) respectively. In this case, the following holds [8, Corollary 2.10]
where \(\phi \) is the weight function \(\phi _{\theta } \colon \mathbb {N} \cup \left \{0\right \} \to \mathbb {R}^{+}\) defined by \(\phi _{\theta } \left (m\right ) = \theta ^{m}\). Multiplying paths by a constant scalar \(\sqrt {\theta }\) equates to using a \(\phi \)-signature kernel with weight function \(\phi _{\theta }\). Therefore, if both paths are multiplied by the scalar \(\sqrt {\theta }\), the inner product \(\langle \cdot , \cdot \rangle _{m}\) corresponding to level m is scaled by a factor of \(\theta ^{m}\). Through path scaling, one of the following three scenarios holds:
1.
If \(\theta < 1\), a higher weighting is attributed to lower-level terms;
2.
If \(\theta = 1\), the original signature kernel is used; or
3.
If \(\theta > 1\), a higher weighting is attributed to higher-level terms.
In terms of \(\phi \)-MMD, by re-weighting the level terms, we can target the \(\phi \)-MMD to focus on particular distributional properties. For example, by setting a high value for \(\theta \) we would be prioritising higher-order moments over lower-order moments in the computation of the \(\phi \)-MMD. The opposite effect occurs if we use values for \(\theta \) which are less than 1. A key difference between the \(\phi \)-MMD and the \(\phi \)-signature kernel is that in the case of the \(\phi \)-signature kernel, rescaling one of the paths by \(\theta \) is equivalent to scaling both paths by \(\sqrt {\theta }\). This is not the case for the \(\phi \)-MMD. The \(\phi \)-MMD is a function of kernels of the forms
with inputs being \(\mathbb {P}_{\mathbf {X}}, \mathbb {P}_{\mathbf {Y}}\). If we scale only \(\mathbb {P}_{\mathbf {X}}\) by \(\theta \), we would be re-weighting the terms \(k_{\text{Sig}}^{\phi } \left (\mathbf {X}, {\mathbf {X}}^{\prime }\right )\), \(k_{\text{Sig}}^{\phi } \left (\mathbf {X}, \mathbf {Y}\right )\), and \(k_{\text{Sig}}^{\phi } \left (\mathbf {Y}, {\mathbf {Y}}^{\prime }\right )\) by \(\theta ^{2}\), \(\theta \), and 1 respectively. Hence, inconsistent scalings are being applied and, more importantly, different \(\phi \)-signature kernels are being used to compute the \(\phi \)-MMD. Both inputs must therefore be scaled; scaling both by \(\sqrt {\theta }\) is equivalent to re-weighting the level-m signature terms by \(\theta ^{m}\).
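The level-wise effect of scaling can be checked numerically on the simplest case of straight-line paths: their level-m signature terms are \(\Delta^{\otimes m}/m!\), so the level-m inner product is \((\Delta_{\mathbf{x}} \cdot \Delta_{\mathbf{y}})^{m}/(m!)^{2}\), and scaling both paths by \(\sqrt{\theta}\) multiplies it by \(\theta^{m}\). A minimal check of this identity:

```python
import math
import numpy as np

# Straight-line paths with increments dx and dy: the level-m signature terms
# are dx^{(x)m}/m! and dy^{(x)m}/m!, so the level-m inner product is
# (dx . dy)^m / (m!)^2. Scaling both paths by sqrt(theta) multiplies
# the level-m inner product by theta^m.
theta = 4.0
dx, dy = np.array([1.0, -0.5]), np.array([0.3, 2.0])
for m in range(1, 6):
    base = (dx @ dy) ** m / math.factorial(m) ** 2
    scaled = ((np.sqrt(theta) * dx) @ (np.sqrt(theta) * dy)) ** m / math.factorial(m) ** 2
    assert np.isclose(scaled, theta**m * base)
```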
2.5 Detailed Construction of the \(\phi \)-MMD
We reformulate the computation of the \(\phi \)-MMD as a sum over level terms—the signature terms corresponding to a specific level of the signature. We use this reformulation to quantify the contribution of the level terms to the \(\phi \)-MMD.
Definition 2.2
An alphabet \(\mathcal {A}_{d}\) is a set comprising the d letters \(1, \cdots , d\). A word of length k is a sequence of k letters from the alphabet (repetitions are allowed). The set of all words of length k over the alphabet \(\mathcal {A}_{d}\) is denoted by \(\mathcal {W}_{k} \left (\mathcal {A}_{d}\right )\).
For any two d-dimensional stochastic processes \(\mathbf {X}, \mathbf {Y}\) and any \(m \in \mathbb {N}\), define the function \(\Lambda _{m} \colon \mathcal {P} \left (\mathcal {K}\right ) \times \mathcal {P} \left (\mathcal {K}\right ) \mapsto \mathbb {R}\) by
\(\Lambda _{m}\) is the level-m contribution to the expected signature kernel. This can be used to define the level-m contribution to the \(\phi \)-MMD. The level-m contribution is the mapping \(\Gamma _{m}^{\phi } \colon \mathcal {P} \left (\mathcal {K}\right ) \times \mathcal {P} \left (\mathcal {K}\right ) \mapsto \mathbb {R}\) defined by
By restricting the sum to the first k levels, we obtain the truncated \(\phi \)-MMD at level k.
Suppose we have N independent and identically distributed (i.i.d.) samples \({\mathbf {X}}^{1{:}N}\) and M i.i.d. samples \({\mathbf {Y}}^{1{:}M}\). Let \(\delta _{\left (\cdot \right )}\) denote the Dirac measure and for two paths \(\mathbf {X}, \mathbf {Y} \in \mathcal {K}\) denote \(\Lambda _{m} \left ( \delta _{\mathbf {X}}, \delta _{\mathbf {Y}} \right )\) by \(\Lambda _{m} \left (\mathbf {X}, \mathbf {Y}\right )\). The expectations \(\Lambda _{m}\) can be estimated using an unbiased or biased sample average. The unbiased estimators are given by
The biased estimator is denoted by \(\widehat {\Lambda _{m}^{b}}\) and is defined by taking the sum over all terms \(i, j\) including the cross terms. Using the empirical estimator for the \(\Lambda _{m}\) terms, an approximator for the level-m contribution to the \(\phi \)-MMD is given by
Whenever the weight function \(\phi \) is the constant function \(\phi \left (m\right ) = 1\) for all m, the \(\phi \) superscript is omitted from the notation.
By reformulating the sig-MMD in terms of level contributions, we can better understand the role the individual levels play in its computation. Since the level-K term of the expected signature is associated with the \(K{\text{th}}\) moment of the distribution of the stochastic process, differences between the \(K{\text{th}}\) moments of the distributions only start featuring in the computation of the sig-MMD from the level-K term \(\Gamma _{K}^{\phi }\) onwards. Suppose the two distributions have equal moments up to the \(\left (K-1\right ){\text{th}}\) moment and differ in their \(K{\text{th}}\) moment. Since the signature terms decay factorially, the terms \(\widehat {\Gamma _{m}}\) are of order \(\left (m!\right )^{-2}\) for all \(m \leq K\). Hence, terms corresponding to equal moments contribute more to the final value of the sig-MMD, and therefore using the sig-MMD as a statistic for the two-sample hypothesis test could result in a high probability of a Type 2 error. To counteract the effect of the factorial decay, the \(\phi \)-MMD can be used to re-weight the level contributions. Generally, by using weight functions of the form \(\phi \colon m \mapsto \theta ^{m}\) for some \(\theta \in \mathbb {R}^{+}\) with \(\theta > 1\), higher-order properties can be captured since higher weightings are given to higher-level signature terms (Sect. 2.4.1). Numerical examples illustrating this are provided in Sect. 4.
Weight functions of this form can be regarded as a change of units. For example, if the sample paths correspond to percentage returns provided within a normalised range, changing the units to percentages corresponds to the \(\phi \)-MMD with \(\theta = 100^{2}\). As shown in Sect. 4, scalings have an effect on test performance. Consequently, a test performed with normalised returns could reach a different conclusion than one performed with returns expressed as percentages. In contrast to other test statistics (such as the Kolmogorov-Smirnov test statistic [24]), test performance with respect to the sig-MMD is not scale invariant. Instead, the scaling factor should be treated as a hyperparameter and optimised, since it plays an important role in test performance.
3 Two-Sample Hypothesis Testing
The standard two-sample hypothesis test between stochastic processes [12] is used to test whether two collections of time series stem from the same distribution. The null hypothesis under the two-sample test is
For the remainder of this section, we fix the distributions \(\mathbb {P}_{\mathbf {X}}\) and \(\mathbb {P}_{\mathbf {Y}}\). We describe the procedure and the key components required to perform the two-sample test.
Since the MMD is a metric on the space of Borel probability measures on stochastic processes [12, 21], it is used as the test statistic for the hypothesis test. Let \(\widehat {c}_{1-\alpha }\) be the \(\left (1-\alpha \right )\)-quantile of the empirical MMD under the null hypothesis. The null hypothesis is rejected with significance \(\alpha \) if \(\widehat {d_{k_{\text{Sig}}}^{2}} \left (\mathbb {P}_{\mathbf {X}}, \mathbb {P}_{\mathbf {Y}}\right ) > \widehat {c}_{1 - \alpha }\).
In two-sample testing, a Type 1 error occurs when the statistical test determines that the two samples have different underlying distributions when the null hypothesis holds. A Type 2 error occurs when the test concludes that the two samples arise from the same stochastic process (distribution) even though this is not the case.
To quantify the likelihood of a Type 2 error occurring, we need to approximate the distribution of the sig-MMD under both the null and alternative hypotheses. Suppose the empirical MMD has distribution \(\widehat {F_{H_{0}}}\) under the null hypothesis and \(\widehat {F_{H_{1}}}\) under the alternative hypothesis. For a given level \(\alpha \), the probability of a Type 2 error is
To perform the two-sample hypothesis test and compute the probabilities of a Type 1 error and Type 2 error, the following are needed:
1.
An estimate for the \(\left (1-\alpha \right )\)-quantile of the null distribution; and
2.
The distribution of the MMD under the alternative hypothesis.
We describe two techniques for approximating these distributions: bootstrapping and asymptotic analysis. The former is based on re-sampling from the stochastic processes until an ‘accurate’ approximation of the MMD distribution can be established. The latter approach relies heavily on U-statistics [1, 4, 16, 25, 37, 42].
3.1 Bootstrapping
Let \({\mathbf {X}}^{1{:}N^{\prime }}\) and \({\mathbf {Y}}^{1{:}M^{\prime }}\) be two independent collections of paths sampled from \(\mathbb {P}_{\mathbf {X}}\) and \(\mathbb {P}_{\mathbf {Y}}\) respectively with values in \(\mathbb {R}^{d}\). To sample the sig-MMD under the null distribution, we sample batches of sizes \(N_{1}, N_{2}\) from the \(N^{\prime }\) available paths. This procedure is repeated B times to construct the collection
The empirical sig-MMD between these samples is calculated to obtain B distances under the null hypothesis. These B distances are used to construct the empirical null distribution. Repeating a similar procedure, the distribution of the test statistic under the alternative hypothesis is approximated by sampling B collections \(\left \{\left ({\mathbf {X}}^{1{:}N_{1}}, {\mathbf {Y}}^{1{:}M_{1}}\right )\right \}_{i=1}^{B}\) with \(M_{1} < M^{\prime }\), and then computing the test statistic over every collection. The value of B is closely related to the quality of the approximating distribution.
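Given bootstrapped samples of the test statistic under both hypotheses, the probability of a Type 2 error can be estimated empirically: it is the fraction of alternative-hypothesis statistics falling below the empirical \((1-\alpha)\)-quantile of the null samples. A sketch (the function name is our own):

```python
import numpy as np

def type2_error_prob(null_mmds, alt_mmds, alpha=0.05):
    """Estimate the probability of a Type 2 error from bootstrapped samples of
    the test statistic under the null and alternative hypotheses.

    The threshold is the empirical (1 - alpha)-quantile of the null samples;
    a Type 2 error occurs when the statistic under H1 falls below it.
    """
    c = np.quantile(null_mmds, 1 - alpha)
    return np.mean(np.asarray(alt_mmds) < c)
```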
Permutation testing [12] is an alternative approach to sampling the test statistic under the null hypothesis. To perform such a test, the sample paths \({\mathbf {X}}^{1{:}N^{\prime }}\) and \({\mathbf {Y}}^{1{:}M^{\prime }}\) are pooled together. Sampling the empirical sig-MMD under the null is performed by computing the empirical sig-MMD between two separate collections of sizes \(N^{\prime }\) and \(M^{\prime }\) respectively. These collections are sampled uniformly from all possible partitions of the pooled dataset into two groups of size \(N^{\prime }\) and \(M^{\prime }\).
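A permutation test along these lines can be sketched from the Gram matrix of the pooled paths (our own naming and interface; the biased estimator is used for simplicity):

```python
import numpy as np

def permutation_test(K, n, num_perms=500, alpha=0.05, rng=None):
    """Two-sample permutation test given the pooled Gram matrix.

    K is the (n + m, n + m) signature-kernel Gram matrix of the pooled paths,
    with the first n rows/columns corresponding to the X sample.
    Returns (reject, p_value) at significance level alpha.
    """
    rng = np.random.default_rng(rng)
    total = K.shape[0]

    def biased_mmd(idx_x, idx_y):
        xx = K[np.ix_(idx_x, idx_x)].mean()
        yy = K[np.ix_(idx_y, idx_y)].mean()
        xy = K[np.ix_(idx_x, idx_y)].mean()
        return xx + yy - 2 * xy

    observed = biased_mmd(np.arange(n), np.arange(n, total))
    null_samples = []
    for _ in range(num_perms):
        perm = rng.permutation(total)       # random re-partition of the pool
        null_samples.append(biased_mmd(perm[:n], perm[n:]))
    # p-value: fraction of permuted statistics at least as large as observed
    p_value = np.mean(np.array(null_samples) >= observed)
    return p_value < alpha, p_value
```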
3.2 U-Statistics
Let \(\mathcal {M}\) be a compact metric space and consider a distribution \(\mu \in \mathcal {P} \left (\mathcal {M}\right )\). Suppose we have samples \(x_{1}, \cdots , x_{N} \sim \mu \). Let \(h \colon \mathcal {M}^{r} \to \mathbb {R}\) be a kernel function with \(r \leq N\). The U-statistic [1, 4, 16, 25, 37, 42] with kernel h is defined as
where \({\mathbf {P}}_{r, N}\) denotes the set of all \(N!/\left (N-r\right )!\) permutations of size r chosen from \(\left \{1, \cdots , N\right \}\). If h is symmetric, then
where \({\mathbf {C}}_{r, N}\) denotes the set of all combinations of r integers \(i_{1} < \cdots < i_{r}\) from \(\left \{1, \cdots , N\right \}\). For example, if \(r=1\) and \(h \left (x\right ) = x\), the sample mean \(N^{-1} \sum _{i} x_{i}\) is a U-statistic. By setting h to be a linear combination of \(\phi \)-signature kernels, asymptotic results regarding limiting distributions of U-statistics can be applied to the \(\phi \)-MMD.
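A classical worked example with a kernel of order \(r = 2\): taking \(h(a, b) = (a - b)^{2}/2\), the resulting U-statistic is exactly the unbiased sample variance.

```python
import itertools
import numpy as np

x = np.array([1.0, 4.0, 2.0, 8.0, 5.0])
N = len(x)

# symmetric kernel h(a, b) = (a - b)^2 / 2 of order r = 2
h = lambda a, b: 0.5 * (a - b) ** 2

# U-statistic: average of h over all combinations of two distinct indices
u_stat = np.mean([h(x[i], x[j]) for i, j in itertools.combinations(range(N), 2)])

# it recovers the unbiased sample variance
assert np.isclose(u_stat, np.var(x, ddof=1))
```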
3.3 Asymptotic Analysis: Null Distribution
To perform the two-sample hypothesis test and quantify the probabilities of Type 1 and Type 2 errors occurring, an estimate for the threshold value \(\widehat {c}_{1-\alpha }\) is needed. Since \(\widehat {c}_{1-\alpha }\) is the \(\left (1-\alpha \right )\)-quantile of the null distribution \(\widehat {F_{H_{0}}}\), it can be computed by approximating the null distribution.
Suppose we have N samples \({\mathbf {X}}^{1{:}N}, {\mathbf {Y}}^{1{:}N}\). Define \({\mathbf {Z}}^{i} := \left ({\mathbf {X}}^{i}, {\mathbf {Y}}^{i}\right )\). Another unbiased estimator for \(d_{k_{\text{Sig}}} \left ( \mathbb {P}_{\mathbf {X}}, \mathbb {P}_{\mathbf {Y}} \right )^{2}\) is given by the one sample U-statistic [19, Lemma 7]
$$\displaystyle \begin{aligned} \widehat{d_{k_{\text{Sig}}, 2}} \left(\mathbb{P}_{\mathbf{X}}, \mathbb{P}_{\mathbf{Y}}\right)^{2} = \frac{1}{N \left(N-1\right)} \sum_{\substack{i, j=1 \\ i \neq j}}^{N} h \left({\mathbf{Z}}^{i}, {\mathbf{Z}}^{j}\right) \end{aligned}$$
where \(h \left (\cdot , \cdot \right )\) is defined by
The difference between \(\widehat {d_{k_{\text{Sig}}, 2}}\) and \(\widehat {d_{k_{\text{Sig}}}}\) (Eq. (5)) is that the terms \(k \left ({\mathbf {X}}^{i}, {\mathbf {Y}}^{i}\right )\) are not included in the computation of \(\widehat {d_{k_{\text{Sig}}, 2}}\), whilst they are accounted for when computing \(\widehat {d_{k_{\text{Sig}}}}\). As described by the authors of [19], assuming \(\mathbb {E} \left [ h^{2}\right ] < \infty \), for large sample sizes the null distribution can be approximated by an infinite weighted sum of shifted independent \(\chi _{1}^{2}\) random variables as follows
where \(z_{i}\) are i.i.d. mean-zero normally distributed random variables with variance 2. The weight \(\lambda _{l}\) is the solution to the eigenvalue problem
This approach does not necessitate the same number of samples from the two distributions and it can be computationally intensive as it is based on matrix computations [3]. Alternatively, the null distribution of the biased MMD can be approximated using two different techniques, both based on low-order moments of the empirical MMD. One of these approaches uses Pearson curves [19‐21]. The other approach is computationally more efficient and is based on approximating the null distribution of the biased MMD using a gamma distribution [20]. The authors of [29] show that the null distribution of the biased squared MMD can be approximated by a two-parameter gamma distribution whose shape and scale parameters are obtained by matching the first two moments of the empirical MMD.
When closed-form solutions are available for truncated estimates of the two moments, these closed-form solutions are used. When these are not available, empirical estimates are constructed using bootstrapping.
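In practice, a simple alternative route to the threshold \(\widehat {c}_{1-\alpha }\) is a permutation approximation of the null distribution: pool the two samples, repeatedly re-partition the pool at random, recompute the statistic, and take the empirical \(\left (1-\alpha \right )\)-quantile. A generic sketch (the statistic function is a placeholder for any empirical MMD, and the function name is ours):

```python
import numpy as np

def permutation_null_quantile(Z, stat_fn, alpha=0.05, n_perm=200, rng=None):
    """Approximate the (1 - alpha)-quantile of the null distribution.

    Z: pooled array of 2N samples (the two original samples concatenated).
    stat_fn: callable taking two arrays of N samples and returning a scalar
    statistic, e.g. an empirical squared MMD.
    """
    rng = np.random.default_rng(rng)
    N = len(Z) // 2
    stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))            # random re-partition under H0
        stats.append(stat_fn(Z[idx[:N]], Z[idx[N:]]))
    return float(np.quantile(stats, 1.0 - alpha))
```

The test then rejects \(H_{0}\) whenever the statistic computed on the original partition exceeds this quantile.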
3.4 Asymptotic Analysis: Alternative Distribution
If \(\mathbb {E} \left [ h^{2}\right ] < \infty \), the distribution of the squared MMD under the alternative hypothesis converges to a Gaussian according to [19, Theorem 8]:
$$\displaystyle \begin{aligned} \sqrt{N} \left( \widehat{d_{k_{\text{Sig}}, 2}} \left(\mathbb{P}_{\mathbf{X}}, \mathbb{P}_{\mathbf{Y}}\right)^{2} - d_{k_{\text{Sig}}} \left(\mathbb{P}_{\mathbf{X}}, \mathbb{P}_{\mathbf{Y}}\right)^{2} \right) \xrightarrow{d} \mathcal{N} \left(0, \sigma^{2}\right), \end{aligned}$$
where \(\sigma^{2} = 4 \left( \mathbb{E} \left[ \left( \mathbb{E} \left[ h \left(\mathbf{Z}, \mathbf{Z}^{\prime}\right) \mid \mathbf{Z} \right] \right)^{2} \right] - \mathbb{E} \left[ h \left(\mathbf{Z}, \mathbf{Z}^{\prime}\right) \right]^{2} \right)\) and \(\mathcal {N} \left (0, \sigma ^{2}\right )\) denotes a Gaussian random variable with mean 0 and variance \(\sigma ^{2} > 0\).
4 Numerical Examples
In this section we carry out numerical simulations to show the effect different signature kernels have on reducing errors when performing two-sample hypothesis testing. In certain scenarios, the original signature kernel and standard two-sample hypothesis test perform well. In our simulations, we use parametric models to construct examples in which the probability of a Type 2 error occurring is high. This was done to illustrate the importance of re-weighting higher signature levels. The probabilities of a Type 2 error occurring were computed by approximating the distributions of the test statistic under the null and alternative hypotheses using bootstrapping. To compute the probabilities of a Type 1 error occurring, the same procedure was applied, except that both collections of paths were simulated from a stochastic model with the same parameters. When the parameters are chosen very close to each other, few \(\phi \)-signature kernels will reduce the probabilities of errors and more ad hoc weight functions may be required; in some situations, more sample paths will be needed. In practice, linearly interpolated sample paths are used instead of Brownian motion paths. Although Brownian motion paths do not belong to the set \(\mathcal {K}\), the linearly interpolated paths do, and they converge to Brownian motion sample paths as the mesh size tends to 0 (Donsker’s invariance principle [13]). Therefore, the linearly interpolated paths adhere to the framework described in Sect. 2.
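The error probabilities described above can be estimated by repeated simulation: draw fresh collections of paths, run the test, and record the rejection rate. A schematic sketch (the samplers, statistic, and threshold function are placeholders, and the function name is ours):

```python
import numpy as np

def empirical_error_rate(sample_p, sample_q, stat_fn, threshold_fn,
                         n_trials=100, rng=0):
    """Empirical rejection rate of a two-sample test.

    sample_p, sample_q: callables rng -> array of samples (paths in the full
    setting; scalars suffice for this sketch). When sample_q draws from the
    *same* process as sample_p, the result estimates the Type 1 error rate;
    when it draws from a different process, 1 minus the result estimates the
    Type 2 error rate.
    """
    rng = np.random.default_rng(rng)
    rejections = 0
    for _ in range(n_trials):
        A, B = sample_p(rng), sample_q(rng)
        if stat_fn(A, B) > threshold_fn(A, B):   # reject H0
            rejections += 1
    return rejections / n_trials
```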
Throughout the experiments, the significance level was set to \(\alpha = 0.05\). Simulations were mainly run on a GPU for computational efficiency. The code used to generate these results can be found at the following repository: https://github.com/Andrew-Alden/SignatureMMDTesting.
4.1 Scaled Brownian Motion
Consider the time interval \(\left [0, T\right ]\) and two scaled Brownian motions \(\mathbf {S}\) and \(\mathbf {G}\) with dynamics described by the stochastic differential equations (SDEs)
$$\displaystyle \begin{aligned} d\mathbf{S}_{t} = \sigma \, dW_{t}^{1}, \qquad d\mathbf{G}_{t} = \beta \, dW_{t}^{2}, \qquad \mathbf{S}_{0} = \mathbf{G}_{0} = 0. \end{aligned}$$
The parameters \(\sigma , \beta > 0\) control the volatility of the processes and \(\left (W_{t}^{1}\right ), \left (W_{t}^{2}\right )\) are independent Brownian motions. Define the time-augmented processes \(\mathbf {X}, \mathbf {Y}\) as \({\mathbf {X}}_{t} := \left (t, {\mathbf {S}}_{t}\right )\) and \({\mathbf {Y}}_{t} := \left (t, {\mathbf {G}}_{t}\right )\). The two-sample hypothesis test is performed on the stochastic processes \(\mathbf {X}\) and \(\mathbf {Y}\).
The processes \(\mathbf {S}\) and \(\mathbf {G}\) are simulated on a uniform grid \(0 = t_{0} < t_{1} < \cdots < t_{n} = T\) using the Euler discretisation scheme
$$\displaystyle \begin{aligned} \mathbf{S}_{t_{i+1}} = \mathbf{S}_{t_{i}} + \sigma \sqrt{\Delta t}\, \xi_{i}, \qquad \mathbf{G}_{t_{i+1}} = \mathbf{G}_{t_{i}} + \beta \sqrt{\Delta t}\, \zeta_{i}, \end{aligned}$$
where \(\Delta t = t_{i+1} - t_{i}\) and \(\xi _{i}, \zeta _{i}\) are i.i.d. standard normal random variables. If \(\sigma \neq \beta \), the processes \(\mathbf {S}\) and \(\mathbf {G}\) differ in their second moment, so the sig-MMD should distinguish between them. The empirical probabilities of a Type 2 error and a Type 1 error occurring were computed using the empirical distributions of the sig-MMD. To compute the probabilities of a Type 1 error occurring, both collections of sample paths were sampled from \(\mathbf {X}\) (i.e. \(H_{0}\) holds).
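As an illustration (this is our own sketch, not the repository code), the Euler scheme with time augmentation can be written as:

```python
import numpy as np

def simulate_scaled_bm(sigma, n_paths, n_steps, T=1.0, rng=0):
    """Euler scheme for dS_t = sigma dW_t with S_0 = 0, followed by time
    augmentation X_t = (t, S_t). Returns an array of shape
    (n_paths, n_steps + 1, 2) whose last axis is (time, value)."""
    rng = np.random.default_rng(rng)
    dt = T / n_steps
    increments = sigma * np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
    S = np.concatenate([np.zeros((n_paths, 1)),
                        np.cumsum(increments, axis=1)], axis=1)
    t = np.broadcast_to(np.linspace(0.0, T, n_steps + 1), S.shape)
    return np.stack([t, S], axis=-1)   # time-augmented paths
```

The terminal values then have standard deviation close to \(\sigma \sqrt {T}\), which is what the sig-MMD must detect when \(\sigma \neq \beta \).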
In the simulations, \(\sigma \) was set to \(0.2\), \(\beta \) was set to \(0.3\), and the batch size was set to 128. The probability of a Type 2 error occurring was \(72.6\%\) with the biased estimator and \(85.8\%\) with the unbiased estimator. These probabilities correspond to the large overlap in the histograms in Fig. 1. To better understand the underlying cause of the high probability \(85.8\%\), the level contributions (\(\Gamma _{m}^{\phi }\)) were plotted. As can be noted in Fig. 2, the area of overlap decreases as the signature level increases. As can be seen in the plot of the level-1 contribution (\(\Gamma _{1}^{\phi }\)), the histograms completely overlap. This occurs because the processes are driftless. However, from the second level onwards, the histograms start to gradually separate (region of overlap decreases). Due to the factorial decay of the absolute value of the signature terms, even though the histograms start to separate, the level with the largest contribution to the MMD (first level) corresponds to the plot in which the histograms completely overlap. As a result, the probability of a Type 2 error occurring is high.
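To make the level-contribution analysis concrete, note that the first two signature levels of a piecewise-linear path have closed forms: level 1 is the total increment and level 2 is \(\sum _{j} \left ({\mathbf {X}}_{t_{j-1}} - {\mathbf {X}}_{0}\right ) \otimes \Delta _{j} + \frac {1}{2} \sum _{j} \Delta _{j} \otimes \Delta _{j}\). The sketch below (helper names are ours) computes these levels and a biased MMD-style contribution per level; for driftless processes the level-1 features of the two samples share the same distribution, which is why the \(\Gamma _{1}^{\phi }\) histograms completely overlap:

```python
import numpy as np

def sig_levels_1_2(path):
    """First two signature levels of a piecewise-linear path (array (L, d)).
    Level 1 is the total increment; level 2 is
    sum_j (X_{t_{j-1}} - X_0) ⊗ Δ_j + 0.5 * sum_j Δ_j ⊗ Δ_j, flattened."""
    inc = np.diff(path, axis=0)               # increments Δ_j, shape (L-1, d)
    lvl1 = inc.sum(axis=0)
    prefix = np.cumsum(inc, axis=0) - inc     # X_{t_{j-1}} - X_0
    lvl2 = prefix.T @ inc + 0.5 * (inc.T @ inc)
    return lvl1, lvl2.ravel()

def level_contribution(feats_x, feats_y):
    """Biased MMD²-style contribution of one signature level:
    mean within-X inner product + mean within-Y inner product
    - twice the mean cross inner product."""
    Kxx = feats_x @ feats_x.T
    Kyy = feats_y @ feats_y.T
    Kxy = feats_x @ feats_y.T
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()
```

For a 1-dimensional path the level-2 term reduces to half the squared total increment, which makes the helper easy to sanity-check.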
Fig. 1
Null and alternative distributions of the (squared) sig-MMD between two scaled Brownian motions. Batch size of 128 was used and 500 independent simulations were run. (a) Biased. (b) Unbiased
Fig. 2
Level contributions to the sig-MMD between two scaled Brownian motions. Batch size of 128 was used and \(2{,}048\) independent simulations were run. No scaling was applied and the unbiased estimator was used
By using general signature kernels, the test statistic is tailored to target specific moments when performing the two-sample hypothesis test. In this particular example, scaling the paths allows us to focus on the second moment of the processes. In our simulations, we tested various scaling factors. Scaling was applied to the processes \(\mathbf {S}\), \(\mathbf {G}\) and not to the time-augmented processes, so the time index still ran over \(\left [0, T\right ]\); the time index is there only to give the process a time structure. We could equivalently have set T in the time-augmented process to be the inverse of the scaling factor: multiplying the entire process, including the time index, by the scaling factor would then result in time starting at 0 and ending at \(T=1\), which is equivalent to our approach of not scaling the time index.
When re-running all simulations with scaled paths using a scaling factor of 3, the probability of a Type 2 error occurring drops to \(2.0\%\) (Fig. 3a) with the biased estimator and \(6.6\%\) (Fig. 3b) with the unbiased estimator. Scaling the paths by a factor of 3 is equivalent to scaling the parameters \(\sigma , \beta \) by a factor of 3, which amplifies the differences between the distributions. The level contributions to the \(\phi \)-MMD are plotted in Fig. 4; the separation of the histograms is evident as the level increases. A crucial difference between the histograms in Figs. 2 and 4 is that, when scaling is applied (Fig. 4), the absolute values of the higher-level contributions are larger. Hence, higher-level terms, which better capture distributional differences, contribute more to the final test statistic than when using the standard sig-MMD.
Fig. 3
Null and alternative distributions of the (squared) \(\phi \)-MMD between two scaled Brownian motions. Batch size of 128 was used and 500 independent simulations were run. A scaling of 3 was applied. (a) Biased. (b) Unbiased
Fig. 4
Level contributions to the \(\phi \)-MMD between two scaled Brownian motions. Batch size of 128 was used and \(2{,}048\) independent simulations were run. A scaling of 3 was applied and the unbiased estimator was used
In the above example, the batch size was kept fixed at 128. It is important to study the effect of scaling paths on the probability of a Type 2 error occurring as a function of the batch size. In Fig. 5, the probability of a Type 2 error occurring is plotted as a function of the batch size and the scaling factor. Figure 5a corresponds to the biased estimator for the \(\phi \)-MMD and Fig. 5b corresponds to the unbiased estimator. For low batch sizes, scaling reduces the probability of a Type 2 error occurring. However, it does not reduce it to levels below \(30\%\) unless extreme scalings are used (Fig. 5c, d). Also, although the impact of scaling is consistent across the biased and the unbiased estimator, the effect takes place earlier in the biased version for scaling values less than 5. Therefore, for a given batch size, the probability of a Type 2 error occurring is most likely lower when using the biased estimator as opposed to the unbiased estimator if no extreme scaling is applied.
Fig. 5
Probability of a Type 2 error occurring as a function of scaling factor and batch size between two scaled Brownian motions. Solid line corresponds to the mean and the shaded region corresponds to one standard deviation away from the mean. (a) Biased. (b) Unbiased. (c) Biased—extreme scalings. (d) Unbiased—extreme scalings
These results demonstrate the importance of calibrating the signature kernel to focus on specific moments of the distributions. Apart from focusing on Type 2 errors, we ensured that the probability of a Type 1 error occurring did not increase as the scaling was altered, by plotting the distributions of this probability across various scaling factors (Fig. 6). These plots show that the probability is independent of batch size and averages around \(5\%\) in every scenario, corresponding to the significance level set for the test. They confirm that scaling does not negatively affect the probability of a Type 1 error occurring, consistently across both the biased and the unbiased estimator.
Fig. 6
Probability of a Type 1 error occurring as a function of sample size and scaling factor between two scaled Brownian motions. (a) Biased. (b) Unbiased
We now study the effect of path scalings on higher dimensional paths. Consider the processes \(\mathbf {X} = \left( {\mathbf {X}}_{t} = \left( t, S_{t}^{1}, \cdots, S_{t}^{d-1} \right) \right)_{t}\) and \(\mathbf {Y} = \left( {\mathbf {Y}}_{t} = \left( t, G_{t}^{1}, \cdots, G_{t}^{d-1} \right) \right)_{t}\). Each pair \(\left ({\mathbf {S}}^{j}, {\mathbf {G}}^{j}\right )\) consists of scaled Brownian motions with parameters \(\left (\sigma _{j}, \beta _{j}\right )\). We set \(d = 5\) and used the following parameter values
Table 1 contains the probability of a Type 2 error occurring for various batch sizes and dimensions. A scaling factor of 2 was applied to compute the probabilities. Scaling the paths does reduce the probability of a Type 2 error. In addition, scaled paths always had a lower probability than their original (unscaled) counterpart. More importantly, as the batch size increases, the probability does decrease when no scaling is applied since the sig-MMD is a metric and converges as sample size increases. The main effect of scaling the paths in this case is capturing distributional differences with smaller batch sizes. This is crucial when working in low-data regimes and/or working with limited computational resources.
Table 1
Likelihood of a Type 2 error occurring as dimension and batch size increase
| Dimension | Sample size | Biased, scaling (\(\%\)) | Biased, no scaling (\(\%\)) | Unbiased, scaling (\(\%\)) | Unbiased, no scaling (\(\%\)) |
|---|---|---|---|---|---|
| 3 | 16 | 77.8 | 90.5 | 87.8 | 92.2 |
| 3 | 32 | 72.1 | 89.9 | 84.6 | 92.6 |
| 3 | 64 | 58.2 | 87.4 | 75.3 | 90.5 |
| 3 | 128 | 33.2 | 84.0 | 50.3 | 88.1 |
| 3 | 256 | 6.80 | 70.2 | 15.2 | 76.1 |
| 4 | 16 | 77.4 | 86.8 | 88.2 | 92.3 |
| 4 | 32 | 68.4 | 85.4 | 83.9 | 91.3 |
| 4 | 64 | 53.6 | 83.4 | 75.6 | 89.3 |
| 4 | 128 | 31.0 | 73.8 | 57.2 | 83.5 |
| 4 | 256 | 8.13 | 52.4 | 24.5 | 68.8 |
| 5 | 16 | 61.4 | 86.6 | 88.3 | 92.4 |
| 5 | 32 | 49.4 | 85.3 | 83.2 | 91.3 |
| 5 | 64 | 28.9 | 82.0 | 72.8 | 89.1 |
| 5 | 128 | 6.9 | 73.5 | 42.6 | 84.5 |
| 5 | 256 | 0.1 | 49.3 | 4.4 | 67.1 |
4.2 Autoregressive Time Series Models
This example focuses on autoregressive conditional heteroskedasticity (ARCH) processes. Let \(\mathbf {r} = \left ({\mathbf {r}}_{t}\right )\) be a stochastic process with conditional volatility modelled as a generalised autoregressive conditional heteroskedastic (GARCH) process [6, 15], a class of ARCH models [14]. The process \(\mathbf {r}\) is described by the equations
$$\displaystyle \begin{aligned} \mathbf{r}_{t} = \sigma_{t} \epsilon_{t}, \qquad \sigma_{t}^{2} = \omega + \sum_{i=1}^{q} \alpha_{i} \mathbf{r}_{t-i}^{2} + \sum_{j=1}^{p} \beta_{j} \sigma_{t-j}^{2}, \end{aligned}$$
where \(\epsilon _{t}\) are i.i.d. standard normal random variables, \(\omega > 0\), and \(\alpha _{i}, \beta _{j} \geq 0\).
Two processes \({\mathbf {r}}_{1}, {\mathbf {r}}_{2}\) were simulated using the parameters in Table 2. As was done in the previous example (Sect. 4.1), the probability of a Type 2 error occurring in the two-sample hypothesis test was calculated through re-sampling techniques. Using the unbiased estimator, the computation resulted in a probability of \(90.2\%\) (Fig. 7a). Once again, path scaling was sufficient to reduce this probability. When a scaling of \(5.5\) was applied, the probability of a Type 2 error occurring dropped to \(0.0\%\) (Fig. 7b).
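Such paths are cheap to simulate. The sketch below restricts to the GARCH(1,1) case for simplicity and starts each path at the stationary variance; the parameter values shown are placeholders, not the values from Table 2:

```python
import numpy as np

def simulate_garch(omega, alpha, beta, n_paths, n_steps, rng=0):
    """Simulate GARCH(1,1) return paths:
        r_t = sigma_t * eps_t,
        sigma_t^2 = omega + alpha * r_{t-1}^2 + beta * sigma_{t-1}^2,
    with eps_t i.i.d. standard normal. Started at the stationary variance
    omega / (1 - alpha - beta), which requires alpha + beta < 1."""
    rng = np.random.default_rng(rng)
    r = np.zeros((n_paths, n_steps))
    var = np.full(n_paths, omega / (1.0 - alpha - beta))
    for t in range(n_steps):
        r[:, t] = np.sqrt(var) * rng.standard_normal(n_paths)
        var = omega + alpha * r[:, t] ** 2 + beta * var   # conditional variance update
    return r
```

Starting at the stationary variance means the unconditional variance of the simulated returns equals \(\omega / (1 - \alpha - \beta)\) from the first step, avoiding a burn-in period.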
Fig. 7
Null and alternative distributions of the \(\phi \)-MMD between two GARCH models. Batch size of 128 was used and 500 independent simulations were run. (a) Without scaling. (b) Scaling of \(5.5\) was applied
Once again, the empirical distributions of the level contributions to the sig-MMD in the case of two GARCH models contained significant overlap (Fig. 8a). When a scaling of \(5.5\) was applied, the distribution of level terms contained sufficient separation (Fig. 8b). Another important difference between the distributions of the level terms with and without scaling is that when scaling is used, the level contributions in the case of the alternative hypothesis have larger absolute value. As previously discussed, this contributed to reducing the probability of a Type 2 error occurring.
Fig. 8
Null and alternative distributions of the level contributions between two GARCH models. Batch size of 128 was used and 500 simulations were run. (a) Without scaling. (b) Scaling of \(5.5\) was applied
All GARCH simulations were performed using a batch size of 128. We once again plotted the probability of a Type 2 error occurring as the batch size and scaling factor increase (Fig. 9). Comparing Fig. 9 with Fig. 5, scaling again reduces the probability of a Type 2 error occurring; the difference between the two lies in the scaling factor needed. In the case of scaled Brownian motions, the probability started to fall for scalings slightly larger than 1, whereas for GARCH models a high probability was maintained up until a scaling of around \(3.5\). The need for more aggressive scaling when working with GARCH models was confirmed by the distributions of the level terms depicted in Fig. 10. Figure 10a shows the first level contribution as a function of the scaling: scaling does not alter the average contribution under either hypothesis, it mainly affects the variance of the level contribution. For the fourth and sixth level terms (Fig. 10b, c respectively), besides affecting the variance, the scaling also alters the mean contribution. However, this effect occurs at high scaling values; for example, in Fig. 10b there is a noticeable separation in distributions for scaling factors greater than \(4.0\).
Fig. 9
Probability of a Type 2 error occurring between two GARCH models as a function of sample size. (a) Biased. (b) Unbiased
Fig. 10
Distribution of the level terms between two GARCH models as a function of scaling factor. Unbiased estimator was used. (a) Level 1. (b) Level 4. (c) Level 6
We once again study the effect of scaling on the probability of a Type 1 error occurring. The distribution of the probability of a Type 1 error occurring using the \(\phi \)-MMD as a function of various scaling factors and batch sizes is plotted in Fig. 11. On average, the probability remained stable at around the \(5\%\) value.
Fig. 11
Probability of a Type 1 error occurring between two GARCH models as a function of sample size and scaling factor. Unbiased estimator was used
In this example, differences between processes were difficult to capture. This was confirmed by the level contributions plotted in Figs. 8 and 10 and the plot showing the effect of scaling on the probability of a Type 2 error occurring (Fig. 9). Although the processes have different first two moments, higher-level terms of the signature were needed to capture differences.
4.3 Mixture Models
To understand the effect of scaling on the Type 2 error of the statistical test between two stochastic processes which differ in their third moment, we use a mixture of a geometric Brownian motion (GBM) [5] and an Ornstein-Uhlenbeck (OU) process [48]. The GBM is described by the SDE
$$\displaystyle \begin{aligned} dX_{t} = \mu X_{t} \, dt + \sigma X_{t} \, dW_{t}, \end{aligned}$$
and the OU process by
$$\displaystyle \begin{aligned} dY_{t} = \theta \left( m - Y_{t} \right) dt + \nu \, dB_{t}, \end{aligned}$$
and we assume \(\{W_{t}\}, \{B_{t}\}\) are uncorrelated Brownian motions. By choosing specific parameter values, the marginal distributions of \(X_{T}, Y_{T}\) with \(T=1\) are matched in their first two moments whilst differing in their third moment.
We simulate processes according to the parameters specified in Table 3, using paths of length 30 and a batch size of 128. The probability of a Type 2 error occurring was \(94.6\%\), corresponding to the overlap in distributions in Fig. 12a. Scaling the paths did not reduce this probability (Fig. 12b). To understand why, we plot the level contributions. In Fig. 13a, there is overlap between the distributions at each level. When the batch size is increased to 512 and \(2{,}000\), the distributions separate at levels 3, 4, and 5. Therefore, the sig-MMD can distinguish between these processes, but not from only a few paths. Re-plotting the distribution of the level contributions with a scaling of 2 applied (Fig. 14), we still note significant overlap between the null and alternative distributions at a batch size of 128. Increasing the batch size reduces the overlap at levels 3, 4, 5, and 6. As a result of scaling, the absolute values of the level contributions under the alternative hypothesis do not decrease as rapidly as in Fig. 13.
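A mixture sample is generated by drawing each path from the GBM with some mixture probability and from the OU process otherwise, both Euler-discretised. The parameter values below are placeholders (Table 3's values are not reproduced here), and the function name is ours:

```python
import numpy as np

def simulate_mixture(n_paths, n_steps, T=1.0, p=0.5,
                     mu=0.05, sigma=0.2, s0=1.0,        # GBM parameters (placeholders)
                     theta=1.0, m=1.0, nu=0.2, y0=1.0,  # OU parameters (placeholders)
                     rng=0):
    """Each path is a GBM (dX = mu X dt + sigma X dW) with probability p,
    otherwise an OU process (dY = theta (m - Y) dt + nu dB)."""
    rng = np.random.default_rng(rng)
    dt = T / n_steps
    out = np.empty((n_paths, n_steps + 1))
    is_gbm = rng.random(n_paths) < p                 # mixture component per path
    for i in range(n_paths):
        x = s0 if is_gbm[i] else y0
        out[i, 0] = x
        for t in range(n_steps):
            dw = np.sqrt(dt) * rng.standard_normal()
            if is_gbm[i]:
                x = x + mu * x * dt + sigma * x * dw   # Euler step for GBM
            else:
                x = x + theta * (m - x) * dt + nu * dw # Euler step for OU
            out[i, t + 1] = x
    return out
```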
Fig. 12
Null and alternative distributions of the \(\phi \)-MMD between two mixture models. Batch size of 128 and the unbiased estimator were used. (a) No scaling. (b) Scaling of \(2.0\) was applied
Fig. 13
Null and alternative distributions of the level contributions between two mixture models as a function of batch size. 100 independent simulations were run. The level contributions are plotted on a logarithmic scale. (a) Batch size of 128. (b) Batch size of 512. (c) Batch size of 2,000
Fig. 14
Null and alternative distributions of the level contributions between two mixture models as a function of batch size. A scaling of 2 was applied. 100 independent simulations were run. The values are plotted on a logarithmic scale. (a) Batch size of 128. (b) Batch size of 512. (c) Batch size of \(2{,}000\)
Before applying the transformations below, each path is standardised as \(\tilde{\mathbf{X}}_{t} = \left( \mathbf{X}_{t} - \mu_{T} \right) / \sigma_{T}\), where \(\mu _{T}, \sigma _{T}\) are the mean and standard deviation of \(\mathbf {X}\) at the terminal time.
Lead-Lag Transformation
Let \(0 = t_{0} < t_{1/2} < \cdots < t_{l-1/2} < t_{l} = T\) be a partition of \(\left [0, T\right ]\). The lead-lag transformation [3, 23, 46] of a 1-dimensional path \({\mathbf {X}}_{t}\) is the 2-dimensional path \(\left ({\mathbf {X}}^{\text{Lead}}_{t}, {\mathbf {X}}^{\text{Lag}}_{t}\right )\) defined by linear interpolation on the points
$$\displaystyle \begin{aligned} \left(\mathbf{X}^{\text{Lead}}_{t_{j}}, \mathbf{X}^{\text{Lag}}_{t_{j}}\right) = \left(\mathbf{X}_{t_{j}}, \mathbf{X}_{t_{j}}\right), \qquad \left(\mathbf{X}^{\text{Lead}}_{t_{j+1/2}}, \mathbf{X}^{\text{Lag}}_{t_{j+1/2}}\right) = \left(\mathbf{X}_{t_{j+1}}, \mathbf{X}_{t_{j}}\right), \end{aligned}$$
for \(j = 0, \ldots , l-1\): the lead component jumps ahead at the half-points whilst the lag component follows.
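The interpolation points can be generated directly. A minimal sketch, using one common convention for the lead-lag ordering (the literature contains variants):

```python
import numpy as np

def lead_lag(x):
    """Lead-lag transformation of a 1-D sequence x of length l + 1.
    Returns the 2-D interpolation points: at each full time t_j both
    coordinates equal x_j; at each half time t_{j+1/2} the lead coordinate
    has already jumped to x_{j+1} while the lag coordinate stays at x_j."""
    pts = []
    for j in range(len(x) - 1):
        pts.append((x[j], x[j]))        # full point t_j
        pts.append((x[j + 1], x[j]))    # half point t_{j+1/2}: lead moves first
    pts.append((x[-1], x[-1]))          # final point t_l
    return np.array(pts)
```

A sequence of length \(l+1\) thus becomes a 2-dimensional piecewise-linear path on \(2l+1\) points.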
To gain more expressive power, the paths are lifted to the infinite dimensional Hilbert space \(H^{\prime } \left (\sigma _{\text{RBF}}^{2}\right )\) associated with the radial basis function (RBF) kernel with smoothing parameter \(\sigma _{\text{RBF}} > 0\). For \(\mathbf {x}, \mathbf {y} \in \mathbb {R}^{d}\), the RBF kernel is defined by
$$\displaystyle \begin{aligned} k_{\text{RBF}} \left(\mathbf{x}, \mathbf{y}\right) = \exp \left( - \frac{\left|\left| \mathbf{x} - \mathbf{y} \right|\right|_{2}^{2}}{2 \sigma_{\text{RBF}}^{2}} \right), \end{aligned}$$
where \(\left | \left | \cdot \right | \right |_{2}\) denotes the Euclidean norm. By the Riesz representation theorem, there exists a mapping \(\zeta _{\sigma _{\text{RBF}}} \colon \mathbb {R}^{d} \to H^{\prime }\) such that
$$\displaystyle \begin{aligned} k_{\text{RBF}} \left(\mathbf{x}, \mathbf{y}\right) = \left\langle \zeta_{\sigma_{\text{RBF}}} \left(\mathbf{x}\right), \zeta_{\sigma_{\text{RBF}}} \left(\mathbf{y}\right) \right\rangle_{H^{\prime}}. \end{aligned}$$
The path \({\mathbf {X}}_{t}\) is lifted to the infinite dimensional path \(k_{\mathbf {X}} \colon t \mapsto \zeta _{\sigma _{\text{RBF}}} \left ({\mathbf {X}}_{t}\right )\). Since the lifted path is infinite dimensional, it is not possible to work directly with this path. However, using the kernel trick, we work directly with the RBF kernel instead. Computing the signature kernel between two lifted paths can be done using either dynamic programming as described in [31], or directly through the PDE approach [39]. An alternative to directly computing the RBF kernel consists of approximating the RBF kernel using random Fourier features [38, 47]. An important consideration when working with lifted paths is that scaling the original paths does not necessarily translate to weighting the level contributions to the signature kernel since the signature kernel is now computed between the lifted paths. Therefore, we need to ensure that we are scaling the path \(\zeta _{\sigma } \left ({\mathbf {X}}_{t}\right )\) and not the path \({\mathbf {X}}_{t}\).
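The random Fourier feature approximation mentioned above [38, 47] replaces the infinite dimensional lift with a finite random feature map whose inner product approximates the RBF kernel. A minimal sketch (the function name is ours):

```python
import numpy as np

def rff_map(dim, n_features, sigma_rbf, rng=0):
    """Random Fourier feature map z: R^dim -> R^{n_features} such that
    z(x) @ z(y) approximates exp(-||x - y||^2 / (2 sigma_rbf^2)).
    Frequencies are drawn from the kernel's spectral density N(0, I / sigma_rbf^2)."""
    rng = np.random.default_rng(rng)
    W = rng.standard_normal((n_features, dim)) / sigma_rbf
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)   # random phases
    def z(x):
        return np.sqrt(2.0 / n_features) * np.cos(W @ x + b)
    return z
```

The approximation error decays like \(O(1/\sqrt{n_{\text{features}}})\), so a few thousand features already give a close match in low dimensions.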
Keeping the batch size fixed at 128, the paths were first standardised using the above procedure. Performing the hypothesis test with lead-lag paths lifted to the space \(H^{\prime } \left (0.5\right )\), the probability of a Type 2 error occurring reduced from \(94.6\%\) to \(28.8\%\) (Fig. 15a). By applying scaling, this was further reduced to \(21.0\%\).
Fig. 15
Null and alternative distributions of the \(\phi \)-MMD between two mixture models. Path standardisation and lead-lag transformations were performed. The RBF smoothing parameter was set to \(\sigma _{\text{RBF}}= \sqrt {0.5}\). Batch size of 128 was used. (a) No scaling. (b) Scaling of \(0.8\) was applied
To understand the relationship between path scalings and Type 2 errors for lifted paths, we plot the probability of a Type 2 error as a function of the scaling factor and the batch size (Fig. 16). Whenever the probability was significantly reduced, the curve of probabilities had a parabolic shape. A difference from Figs. 5 and 9 is that there, some scaling always reduced the probability of a Type 2 error occurring; this was not necessarily the case when performing the two-sample hypothesis test between mixture models (see Fig. 16d). Moreover, we once again observe that the probability of a Type 1 error occurring is unaffected by the scaling factor used (Fig. 17).
Fig. 16
Probability of a Type 2 error occurring between two mixture models as a function of sample size. (a) Biased—\(\sigma _{\text{RBF}}= \sqrt {0.1}\). (b) Biased—\(\sigma _{\text{RBF}}= \sqrt {0.5}\). (c) Unbiased—\(\sigma _{\text{RBF}}= \sqrt {0.1}\). (d) Unbiased—\(\sigma _{\text{RBF}}= \sqrt {0.5}\)
Fig. 17
Probability of a Type 1 error occurring between two mixture models as a function of sample size and scaling factor. Biased estimator with \(\sigma _{\text{RBF}}= \sqrt {0.5}\)
In Sect. 4, we had full control over the null and alternative hypotheses, i.e. we knew prior to performing the test whether the null or the alternative hypothesis held. In practice, we are not in a controlled environment and need to determine which hypothesis holds. As demonstrated through the simulations, no single test setup is robust in all cases. We presented various techniques for tailoring the two-sample test to the collections of sample paths at hand. In practice, finding an appropriate setup is similar to hyperparameter optimisation in an ML context: as in most ML tasks, performance is highly dependent on specific configurations (learning rates, network architecture, etc.). We propose optimising the scaling factor in the same way, mimicking the optimisation procedure for certain hyperparameters in an ML context. Since we are focused on two-sample testing, the hyperparameters are optimised for a low probability of a Type 2 error occurring. We demonstrated various test configurations and techniques which prove effective in the context of two-sample testing using path signatures. Since the probability of a Type 1 error occurring is largely unaffected across all techniques presented, iterating over the various techniques optimises the test for Type 2 errors whilst maintaining a low rate of Type 1 errors. When performing hyperparameter optimisation in the context of two-sample hypothesis testing, the data must be split into two parts: the first part is used to fine-tune the hyperparameters with respect to some criterion, and the second part is used to perform the hypothesis test [21, 22, 32, 41, 44].
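The split-then-test procedure can be sketched as follows. This is a simplification of the procedures in [21, 22, 32, 41, 44]: the calibration criterion here is simply the estimated p-value on the calibration split, and all names are ours:

```python
import numpy as np

def calibrate_scaling(X, Y, test_fn, scalings, alpha=0.05,
                      train_frac=0.8, rng=0):
    """Split the samples (e.g. 80:20), pick on the first part the scaling
    factor whose test is most sensitive (smallest estimated p-value), then
    run the final hypothesis test on the held-out part.

    test_fn(A, B) -> p-value for H0: A and B are drawn from the same
    distribution (e.g. a permutation test based on an MMD statistic)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    idx = rng.permutation(n)
    n_tr = int(train_frac * n)
    tr, te = idx[:n_tr], idx[n_tr:]
    # calibrate: choose the scaling with the smallest p-value on the first split
    best = min(scalings, key=lambda c: test_fn(c * X[tr], c * Y[tr]))
    # final test on held-out data only
    p_value = test_fn(best * X[te], best * Y[te])
    return best, p_value, bool(p_value < alpha)
```

Crucially, the held-out part is only touched once, so the reported p-value is not biased by the calibration search.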
We now demonstrate how to apply these techniques in an uncontrolled environment. We would like to determine whether the time series of returns corresponding to baskets of assets have the same distribution. Each basket consists of assets from the same sector; the sectors we consider are the technology sector and the consumer-based sector. We use 5 assets from each sector and for each asset we compute the percentage return. The return time series corresponding to each asset is split into multiple time series, each consisting of 15 consecutive observations (approximately 3 weeks of returns). Aggregating all length-15 time series of all assets within the same basket (sector) constitutes the available samples from the stochastic process corresponding to that basket of assets. We use data from 10 September 2014 to 8 September 2024. This is an uncontrolled setting since we do not know in advance whether the returns have the same distribution. The data was split into two parts according to the ratio \(80{:}20\), with \(80\%\) of the data used to calibrate the hyperparameters of the test. We started the calibration procedure using the raw returns without any scaling applied; the probability of a Type 2 error occurring in this setup was calculated at \(99.0\%\) (Fig. 18a). We then performed a feature transformation using an RBF kernel with smoothing parameter set to 1 and ran the two-sample test again. This feature transformation had no effect on the probability of a Type 2 error occurring (Fig. 18b), even when a scaling of 5 was applied (Fig. 18c).
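The preprocessing of each asset, i.e. converting prices to percentage returns and slicing them into non-overlapping length-15 windows, can be sketched as follows (the function name is ours):

```python
import numpy as np

def return_windows(prices, window=15):
    """Percentage returns of a price series, split into non-overlapping
    windows of `window` consecutive observations. An incomplete tail
    window is dropped."""
    prices = np.asarray(prices, dtype=float)
    rets = np.diff(prices) / prices[:-1] * 100.0   # percentage returns
    n_win = len(rets) // window
    return rets[: n_win * window].reshape(n_win, window)
```

Stacking the windows of all 5 assets in a sector then gives the sample of length-15 return paths for that basket.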
Fig. 18
Null and alternative distributions of the \(\phi \)-MMD. 500 independent simulations were run. Batch size of 128 and unbiased estimator were used. (a) No scaling. (b) \(\sigma _{\text{RBF}} = 1\). (c) \(\sigma _{\text{RBF}} = 1\), scaling\(=5\)
Repeating the same tests after transforming the time series using the lead-lag transform (Fig. 19), we obtain a significant reduction in the probability of a Type 2 error occurring when scaling was applied after a feature space transformation. Performing a permutation test on the remaining \(20\%\) of the data after applying a lead-lag transformation, using an RBF kernel with smoothing parameter \(\sqrt {0.5}\), and applying a scaling of 5, the hypothesis test concluded that the distributions are not equal.
Fig. 19
Null and alternative distributions of the \(\phi \)-MMD after applying lead-lag transformation. 500 independent simulations were run. Batch size of 128 and unbiased estimator were used. (a) No scaling applied. (b) \(\sigma _{\text{RBF}} = \sqrt {0.5}\). (c) \(\sigma _{\text{RBF}} = \sqrt {0.5}\), scaling\(=5\)
This chapter focuses on two-sample hypothesis testing between stochastic processes. Whilst the sig-MMD has proven to be an effective tool for two-sample hypothesis testing, careful consideration is needed to fine-tune the test hyperparameters to the samples available. We explored techniques to make the test more robust; namely
1. Path scaling;
2. Lead-lag transformation;
3. Standardisation; and
4. Feature space transformation using the RBF kernel.
The relationship between the batch size and the probability of a Type 2 error occurring was also explored. In addition, we showed that path scalings do not adversely affect the probability of a Type 1 error occurring.
A contribution made through this work is quantifying the amount of information carried by the level terms of the signature. We showed that the level terms of the \(\phi \)-MMD incorporate specific distributional properties which, when analysed, can be used to target particular moments using tailored \(\phi \)-signature kernels. This guides the design of the two-sample test using the \(\phi \)-MMD. We show that higher-level signature terms contain information regarding distributional moments which is crucial for reducing Type 2 errors in the two-sample testing framework.
In summary, the power of the test is a function of the truncation level, batch size, feature space (RBF kernel), and \(\phi \)-signature kernel used. When these are configured appropriately, the two-sample hypothesis test using the \(\phi \)-MMD as a test statistic is a powerful statistical tool.
Acknowledgements
This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under Grant [EP/T517963/1]. The code is publicly available and can be found at the repository: https://github.com/Andrew-Alden/SignatureMMDTesting.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
If the path is multi-dimensional, this involves standardising the individual dimensions. This could be done by subtracting the mean to have each dimension centred around 0.
Since for practical purposes we work with elements from the set \(\mathcal {X}_{\text{Seq}}\), the mapping \(\mu \) extends to this set by means of the natural inclusion \(\mathcal {X}_{\text{Seq}} \hookrightarrow \mathcal {K} \subset \mathcal {X}_{1-\text{var}}\) induced through linear interpolation.
Generally, this is a function of the number of samples. However, if large sample sizes are available, it may not be computationally feasible to use all samples.
This does not affect the results. If \({\mathbf {S}}_{0}, {\mathbf {G}}_{0} \neq 0\) consider the processes \({\mathbf {S}}^{\prime } := \mathbf {S} - {\mathbf {S}}_{0}\) and \({\mathbf {G}}^{\prime } := \mathbf {G} - {\mathbf {G}}_{0}\).
The sub-sampled time series have no overlapping observations.
1. J. Ai, O. Kuželka, Y. Wang, Hoeffding–Serfling inequality for U-statistics without replacement. J. Theor. Probab. 36, 390–408 (2023)
2. A. Alden, C. Ventre, B. Horvath, G. Lee, Model-agnostic pricing of exotic derivatives using signatures, in Proceedings of the Third ACM International Conference on AI in Finance, ICAIF '22 (Association for Computing Machinery, New York, 2022), pp. 96–104
3. H. Andrès, A. Boumezoued, B. Jourdain, Signature-based validation of real-world economic scenarios. ASTIN Bull. 54(2), 410–440 (2024)
4. E. Beutner, H. Zähle, Deriving the asymptotic distribution of U- and V-statistics of dependent data using weighted empirical processes. Bernoulli 18(3), 803–822 (2012)
5. F. Black, M. Scholes, The pricing of options and corporate liabilities. J. Polit. Econ. 81(3), 637–654 (1973)
6. T. Bollerslev, Generalized autoregressive conditional heteroskedasticity. J. Econom. 31(3), 307–327 (1986)
7. H. Bühler, B. Horvath, T. Lyons, I.P. Arribas, B. Wood, A data-driven market simulator for small data environments. Preprint. arXiv:2006.14498 (2020). https://arxiv.org/pdf/2006.14498
T. Cass, T. Lyons, X. Xu, Weighted signature kernels. Ann. Appl. Probab. 34(1A), 585–626 (2024)
10. B.-E. Chérief-Abdellatif, P. Alquier, MMD-Bayes: robust Bayesian estimation via maximum mean discrepancy, in Proceedings of The 2nd Symposium on Advances in Approximate Bayesian Inference, vol. 118. Proceedings of Machine Learning Research (PMLR, 2020), pp. 1–21
11. I. Chevyrev, A. Kormilitzin, A primer on the signature method in machine learning. Preprint. arXiv:1603.03788 (2016). https://arxiv.org/pdf/1603.03788
12. I. Chevyrev, H. Oberhauser, Signature moments to characterize laws of stochastic processes. J. Mach. Learn. Res. 23(176), 1–42 (2022)
13. M.D. Donsker, An Invariance Principle for Certain Probability Limit Theorems. Memoirs of the American Mathematical Society (American Mathematical Society, Providence, 1951)
14. R.F. Engle, Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50(4), 987–1007 (1982)
15. R.F. Engle, A.J. Patton, What good is a volatility model? Quant. Financ. 1(2), 237–245 (2001)
A. Fermanian, Embedding and learning with signatures. Comput. Stat. Data Anal. 157 (2021)
18. X. Geng, H. Ni, C. Wang, Expected signature on a Riemannian manifold and its geometric implications. Preprint. arXiv:2407.13086 (2024). https://arxiv.org/pdf/2407.13086
19. A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, A. Smola, A kernel method for the two-sample-problem, in Advances in Neural Information Processing Systems, vol. 19 (MIT Press, Cambridge, 2006)
20. A. Gretton, K. Fukumizu, Z. Harchaoui, B.K. Sriperumbudur, A fast, consistent kernel two-sample test, in Advances in Neural Information Processing Systems, vol. 22 (Curran Associates, Inc., Red Hook, 2009)
21. A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, A. Smola, A kernel two-sample test. J. Mach. Learn. Res. 13(25), 723–773 (2012)
22. A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, B.K. Sriperumbudur, Optimal kernel choice for large-scale two-sample tests, in Advances in Neural Information Processing Systems, vol. 25 (Curran Associates, Inc., Red Hook, 2012)
23. L.G. Gyurkó, T. Lyons, M. Kontkowski, J. Field, Extracting information from the signature of a financial data stream. Preprint. arXiv:1307.7244 (2014). https://arxiv.org/pdf/1307.7244
24. D. Harrison, D. Sutton, P. Carvalho, M. Hobson, Validation of Bayesian posterior distributions using a multidimensional Kolmogorov–Smirnov test. Mon. Not. R. Astron. Soc. 451(3), 2610–2624 (2015)
25. W. Hoeffding, A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 19(3), 293–325 (1948)
26. B. Hoff, The Brownian Frame Process as a Rough Path. PhD thesis, University of Oxford (2005)
27. B. Horvath, M. Lemercier, C. Liu, T. Lyons, C. Salvi, Optimal stopping via distribution regression: a higher rank signature approach. Preprint. arXiv:2304.01479 (2023). https://arxiv.org/pdf/2304.01479
28. Z. Issa, B. Horvath, M. Lemercier, C. Salvi, Non-adversarial training of neural SDEs with signature kernel scores, in Advances in Neural Information Processing Systems, vol. 36 (Curran Associates, Inc., Red Hook, 2023)
29. N.L. Johnson, S. Kotz, N. Balakrishnan, Continuous Univariate Distributions, vol. 1, 2nd edn. (John Wiley and Sons, Hoboken, 1994)
30. King's College London, King's Computational Research, Engineering and Technology Environment (CREATE) (2022)
31. F.J. Kiraly, H. Oberhauser, Kernels for sequentially ordered data. J. Mach. Learn. Res. 20(31), 1–45 (2019)
32. J.M. Kübler, W. Jitkrittum, B. Schölkopf, K. Muandet, A witness two-sample test, in Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, vol. 151. Proceedings of Machine Learning Research (PMLR, 2022), pp. 1403–1419
33. M. Lemercier, C. Salvi, T. Damoulas, E. Bonilla, T. Lyons, Distribution regression for sequential data, in Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, vol. 130. Proceedings of Machine Learning Research (PMLR, 2021), pp. 3754–3762
34. T. Lyons, H. Ni, Expected signature of Brownian motion up to the first exit time from a domain. Ann. Probab. 43(5), 2729–2762 (2015)
35. K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, Kernel Mean Embedding of Distributions: A Review and Beyond. Found. Trends Mach. Learn. 10(1–2), 1–141 (2017)
36. H. Ni, L. Szpruch, M. Sabate-Vidales, B. Xiao, M. Wiese, S. Liao, Sig-Wasserstein GANs for time series generation, in Proceedings of the Second ACM International Conference on AI in Finance, ICAIF '21 (Association for Computing Machinery, New York, 2022)
A. Rahimi, B. Recht, Random features for large-scale kernel machines, in Advances in Neural Information Processing Systems, vol. 20 (Curran Associates, Inc., Red Hook, 2007)
39. C. Salvi, T. Cass, J. Foster, T. Lyons, W. Yang, The signature kernel is the solution of a Goursat PDE. SIAM J. Math. Data Sci. 3(3), 873–899 (2021)
40. C. Salvi, M. Lemercier, C. Liu, B. Horvath, T. Damoulas, T. Lyons, Higher order kernel mean embeddings to capture filtrations of stochastic processes, in Advances in Neural Information Processing Systems, vol. 34. NeurIPS 2021 (2021)
41. A. Schrab, I. Kim, M. Albert, B. Laurent, B. Guedj, A. Gretton, MMD aggregated two-sample test. J. Mach. Learn. Res. 24(194), 1–81 (2023)
C.-J. Simon-Gabriel, B. Schölkopf, Kernel distribution embeddings: universal kernels, characteristic kernels and kernel metrics on distributions. J. Mach. Learn. Res. 19(44), 1–29 (2018)
44. D.J. Sutherland, H. Tung, H. Strathmann, S. De, A. Ramdas, A.J. Smola, A. Gretton, Generative models and model criticism via optimized maximum mean discrepancy, in 5th International Conference on Learning Representations. ICLR 2017 (2017)
45. T. Lyons, Rough paths, signatures and the modelling of functions on streams, in Proceedings of the International Congress of Mathematicians (2014)
G.E. Uhlenbeck, L.S. Ornstein, On the theory of Brownian motion. Phys. Rev. 36(5), 823–841 (1930)
49. P. Zhang, X. Chen, L. Zhao, W. Xiong, T. Qin, T.-Y. Liu, Distributional reinforcement learning for multi-dimensional reward functions, in Advances in Neural Information Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, 2021), pp. 1519–1529