
Knowledge-Based Systems

Volume 67, September 2014, Pages 305-314

An empirical evaluation of similarity measures for time series classification

https://doi.org/10.1016/j.knosys.2014.04.035

Abstract

Time series are ubiquitous, and a measure to assess their similarity is a core part of many computational systems. In particular, the similarity measure is the most essential ingredient of time series clustering and classification systems. Because of this importance, countless approaches to estimate time series similarity have been proposed. However, there is a lack of comparative studies using empirical, rigorous, quantitative, and large-scale assessment strategies. In this article, we provide an extensive evaluation of similarity measures for time series classification following the aforementioned principles. We consider 7 different measures coming from alternative measure ‘families’ and 45 publicly-available time series data sets originating from a wide variety of scientific domains. We focus on out-of-sample classification accuracy, but in-sample accuracies and parameter choices are also discussed. Our work is based on rigorous evaluation methodologies and includes the use of powerful statistical significance tests to derive meaningful conclusions. The obtained results show that a number of measures are equivalent in terms of accuracy, with one single candidate outperforming the rest. These findings, together with the methodology employed, invite researchers in the field to adopt more consistent evaluation criteria and to make more informed decisions regarding the baseline measures to which new developments should be compared.

Introduction

Data in the form of time series pervades a large number of scientific domains [19], [24]. Observations that unfold over time usually represent valuable information subject to analysis, classification, indexing, prediction, or interpretation [18], [14], [28], [11]. Real-world examples include financial data (e.g., stock market fluctuations), medical data (e.g., electrocardiograms), computer data (e.g., log sequences), or motion data (e.g., location of moving objects). Even object shapes or handwriting can be effectively transformed into time series, facilitating their analysis and retrieval [23], [24].

A core issue when dealing with time series is determining their pairwise similarity, i.e., the degree to which a given time series resembles another. In fact, a time series similarity (or dissimilarity) measure is central to many mining, retrieval, clustering, and classification tasks [14], [28], [11], [21]. Furthermore, there is evidence that simple approaches to such tasks exploiting generic time series similarity measures usually outperform more elaborate, sometimes specifically-targeted strategies. This is the case, for instance, with time series classification, where a one-nearest neighbor approach using a well-known time series similarity measure was found to outperform an exhaustive list of alternatives [53], including decision trees, multi-scale histograms, multi-layer perceptron neural networks, first-order logic rules with boosting, and multiple classifier systems.

Deriving a measure that correctly reflects time series similarities is not straightforward. Apart from dealing with high dimensionality (time series can be roughly considered as multi-dimensional data), the calculation of such measures needs to be fast and efficient [21]. Indeed, with better information gathering tools, the size of time series data sets may continue to increase in the future. Moreover, there is the need for generic/multi-purpose similarity measures, so that they can be readily applied to any data set, whether this application is the final goal or just an initial approach to a given task. This last aspect highlights another desirable quality for time series similarity measures: their robustness to different types of data (cf. [21], [50]).

Over the years, several time series similarity measures have been proposed (for pointers to such measures see, e.g., [28], [11], [50]). Nevertheless, few quantitative comparisons have been made in order to evaluate their efficacy in a multiple-data framework. Apart from being an interesting and important task in itself, time series classification, as opposed to clustering, offers the possibility to straightforwardly assess the merit of time series similarity measures under a controlled, objective, and quantitative framework [21].

In a recent study, Wang et al. [50] perform an extensive comparison of classification accuracies for 9 measures (plus 4 variants) across 38 data sets coming from various scientific domains. One of the main conclusions of the study is that, even though newly proposed measures can be theoretically attractive, the efficacy of some common and well-established measures is, in the vast majority of cases, very difficult to beat. Specifically, dynamic time warping (DTW; [4]) is found to be consistently superior to the other studied measures (or, at worst, equivalent for a few data sets). In addition, the authors emphasize that the Euclidean distance remains a quite accurate, robust, simple, and efficient way of measuring the similarity between two time series. Finally, by looking in detail at the results presented by Wang et al. [50], we can spot a group of time series similarity measures that seems to have an efficacy comparable to DTW: those based on edit distances. In particular, the edit distance for real sequences (EDR; [7]) seems to be very competitive, if not slightly better than DTW. Interestingly, none of the three measures above was initially targeted at generic time series data; all were adopted for this purpose only with hindsight [1], [4], [7]. The intuition behind the Euclidean distance relates to spatial proximity, DTW was initially devised for the specific task of spoken word recognition [43], and edit distances were introduced for measuring the dissimilarity between two strings [27].
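To make the dynamic programming behind DTW concrete, the following minimal Python sketch (our own illustration; the function and variable names are not taken from any of the cited works) computes the classic unconstrained DTW distance between two univariate series:

    import numpy as np

    def dtw(x, y):
        # D[i, j] holds the cost of the cheapest warping path aligning
        # the first i samples of x with the first j samples of y.
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(x[i - 1] - y[j - 1])        # local sample distance
                D[i, j] = cost + min(D[i - 1, j],      # insertion
                                     D[i, j - 1],      # deletion
                                     D[i - 1, j - 1])  # match
        return D[n, m]

In contrast to the lock-step Euclidean distance, the recurrence lets one series locally stretch or compress along the time axis, which is what makes DTW robust to temporally misaligned patterns.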

The study by Wang et al. [50] is, to the best of our knowledge, the only comparative study dealing with time series classification using multiple similarity measures and a large collection of data. In general, the studies introducing a new measure compare it against only a few others, usually using a reduced data set corpus (cf. [21]). Furthermore, there is a lack of agreement in the literature regarding evaluation methodologies. Besides, statistical significance is usually not studied or, at best, improperly evaluated. This is very inconvenient, as robust evaluation methodologies and statistical significance tests are the principal tools by which we can establish, in a formal and rigorous way, differences across the considered measures [45], [16], [9]. In addition, the chosen parameter values for every measure are rarely discussed. All these issues hamper the scientific development of the field, as one is never sure, e.g., which measure should be used as a baseline for future developments, or which parameters are the most sensible choice.

In this work, we perform an empirical evaluation of similarity measures for time series classification. We follow the initiative by Wang et al. [50] and consider a big pool of publicly-available time series data sets (45 in our case). However, instead of additionally focusing on representation methods, computational/storage demands, or more theoretical issues, we here take a pragmatic approach and restrict ourselves to classification accuracy. We believe that this is the most important aspect to be considered in a first stage and that, in contrast to the other aforementioned issues, it is not sufficiently well-covered in the existing literature. As for the considered measures, we decide to include DTW and EDR, as these were found to generally achieve the highest accuracies among all measures compared in Wang et al. [50]. Apart from these two, we choose the Euclidean distance plus 4 different measures not considered in that study, making a total of 7. Further important contributions that differentiate the current work from previous studies include (a) an extensive summary and background of the considered measures, with basic formulations, applications, and references, (b) the formalization of a robust evaluation methodology, exploiting standard out-of-sample cross-validation strategies, (c) the use of rigorous statistical significance tests in order to assess the superiority of a given measure, (d) the evaluation of both train and test accuracies, and (e) the assessment of the chosen parameters for each measure and data set.
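As an illustration of contribution (c), the sketch below shows the kind of pairwise significance testing such a methodology entails: a Wilcoxon signed-rank test on per-data-set accuracies for every pair of measures, combined with Holm's step-down correction for multiple comparisons (both tests appear in the reference list). The function name and data layout are our own assumptions, not the paper's code:

    from itertools import combinations
    import numpy as np
    from scipy.stats import wilcoxon

    def pairwise_holm(acc, alpha=0.05):
        # acc maps measure name -> accuracies, one per data set (same order).
        pairs = list(combinations(sorted(acc), 2))
        pvals = [wilcoxon(acc[a], acc[b]).pvalue for a, b in pairs]
        order = np.argsort(pvals)  # most significant first
        verdict = {}
        for rank, idx in enumerate(order):
            # Holm: test the k-th smallest p-value against alpha / (m - k).
            if pvals[idx] > alpha / (len(pairs) - rank):
                for rest in order[rank:]:  # all remaining tests fail too
                    verdict[pairs[rest]] = False
                break
            verdict[pairs[idx]] = True
        return verdict

The step-down correction controls the family-wise error rate, so a pair of measures is declared different only if its p-value survives a threshold that tightens with the number of comparisons.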

The rest of the paper is organized as follows. Firstly, we provide the background on time series similarity measures, outline some of their applications, and detail their calculation (Section 2). Next, we explain the proposed evaluation methodology (Section 3). Subsequently, we report the obtained results (Section 4). A conclusion section ends the paper (Section 5).

Section snippets

Time series similarity measures

The list of approaches for dealing with time series similarity is vast, and a comprehensive enumeration of them all is beyond the scope of the present work (for that, the interested reader is referred to [13], [50], [14], [28], [31], [11]). In this section, we present several representative examples of different ‘families’ of time series similarity measures: lock-step measures (Euclidean distance), feature-based measures (Fourier coefficients), model-based measures (auto-regressive), and elastic measures (dynamic time warping and edit-based distances).
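To make the family distinction tangible, here is a minimal Python sketch (our own illustration, assuming equal-length, univariate series) contrasting a lock-step measure with a feature-based one: the plain Euclidean distance versus a Euclidean distance computed on the first k Fourier coefficients:

    import numpy as np

    def euclidean(x, y):
        # Lock-step: compares samples at identical time indices only.
        return np.sqrt(np.sum((x - y) ** 2))

    def fourier_distance(x, y, k=10):
        # Feature-based: compares the first k complex Fourier coefficients,
        # which summarize the coarse shape of each series.
        fx, fy = np.fft.rfft(x)[:k], np.fft.rfft(y)[:k]
        return np.sqrt(np.sum(np.abs(fx - fy) ** 2))

Model-based measures follow yet another route: they fit a model (e.g., auto-regressive coefficients) to each series and compare the fitted parameters rather than the raw samples.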

Classification scheme

The efficacy of a time series similarity measure is commonly evaluated by the classification accuracy it achieves [21], [50]. For that, the error ratio of a distance-based classifier is calculated for a given labeled data set, understanding the error ratio as the number of wrongly classified items divided by the total number of tested items. The standard choice for the classifier is the one-nearest neighbor (1NN) classifier. Following Wang et al. [50], we can enumerate several advantages of this choice.
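As a sketch of this evaluation scheme (our own code, with hypothetical names), the error ratio of a 1NN classifier under any dissimilarity measure can be computed as follows:

    import numpy as np

    def one_nn_error_ratio(train, train_labels, test, test_labels, dist):
        # dist is any time series dissimilarity, e.g. the dtw() sketch above.
        errors = 0
        for series, label in zip(test, test_labels):
            d = [dist(series, ref) for ref in train]   # distance to each
            nearest = train_labels[int(np.argmin(d))]  # training series
            errors += int(nearest != label)
        return errors / len(test)  # wrongly classified / total tested

Note that 1NN has no training phase beyond storing the labeled series, so the measured accuracy reflects the quality of the similarity measure itself rather than that of a tuned classifier.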

Classification performance: test

If we look at the overall results, we see that all considered measures clearly outperform the random baseline for practically all the 45 data sets (Table 2). Furthermore, we see that some of them achieve near-perfect accuracies for a number of data sets (e.g., CBF, CinC_ECG_torso, ECGFiveDays, Two_Patterns, or TwoLeadECG). However, no single measure achieves the best performance for all the data sets. The Euclidean distance is found to be the best-performing measure in 2 data sets, FC is the …

Conclusion

From a general perspective, the obtained results show that there is a group of equivalent similarity measures, with no statistically significant differences among them (DTW, EDR, and MJC). The existing literature suggests that some longest common sub-sequence approaches, together with alternative variants of DTW and EDR, could potentially join this group [31], [50]. However, according to the results reported here, the TWED measure originally proposed by Marteau [31] seems to consistently outperform the rest.

Acknowledgements

We thank the people who made available or contributed to the UCR time series repository. This research has been funded by Grants 2009-SGR-1434 from the Generalitat de Catalunya, JAEDOC069/2010 from the Consejo Superior de Investigaciones Científicas, TIN2009-13692-C03-01 and TIN2012-38450-C03-03 from the Spanish Government, and EU FEDER funds.

References (54)

  • Y. Cai, R. Ng, Indexing spatio-temporal trajectories with Chebyshev polynomials, in: Proc. of the ACM SIGMOD Int. Conf....
  • K.P. Chan, A.C. Fu, Efficient time series matching by wavelets, in: Proc. of the IEEE Int. Conf. on Data Engineering,...
  • L. Chen, M.T. Öszu, V. Oria, Robust and fast similarity search for moving object trajectories, in: Proc. of the ACM...
  • G. Das, K.I. Lin, H. Mannila, G. Renganathan, P. Smyth, Rule discovery from time series, in: Proc. of the AAAI Int....
  • J. Demšar, Statistical comparison of classifiers over multiple data sets, J. Machine Learn. Res. (2006)
  • C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases, in: Proc. of the ACM...
  • P. Geurts, Contributions to Decision Tree Induction: Bias/Variance Tradeoff and Time Series Classification, Ph.D....
  • D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (1997)
  • J. Han et al., Data Mining: Concepts and Techniques (2005)
  • T. Hastie et al., The Elements of Statistical Learning, second ed. (2009)
  • M. Hollander et al., Nonparametric Statistical Methods, second ed. (1999)
  • S. Holm, A simple sequentially rejective multiple test procedure, Scand. J. Stat. (1979)
  • H. Kantz et al., Nonlinear Time Series Analysis (2004)
  • E.J. Keogh, Machine learning in time series databases (and everything is a time series!), in: Tutorial at the AAAI Int....
  • E.J. Keogh et al., Dimensionality reduction for fast similarity search in large time series databases, Knowl. Inform. Syst. (2001)
  • E.J. Keogh et al., On the need for time series data mining benchmarks: a survey and empirical demonstration, Data Min. Knowl. Discov. (2003)
  • E.J. Keogh et al., Exact indexing of dynamic time warping, Knowl. Inform. Syst. (2005)