
Knowledge-Based Systems

Volume 67, September 2014, Pages 305-314

An empirical evaluation of similarity measures for time series classification

https://doi.org/10.1016/j.knosys.2014.04.035

Abstract

Time series are ubiquitous, and a measure to assess their similarity is a core part of many computational systems. In particular, the similarity measure is the most essential ingredient of time series clustering and classification systems. Because of this importance, countless approaches to estimate time series similarity have been proposed. However, there is a lack of comparative studies using empirical, rigorous, quantitative, and large-scale assessment strategies. In this article, we provide an extensive evaluation of similarity measures for time series classification following the aforementioned principles. We consider 7 different measures coming from alternative measure ‘families’ and 45 publicly-available time series data sets originating from a wide variety of scientific domains. We focus on out-of-sample classification accuracy, but in-sample accuracies and parameter choices are also discussed. Our work is based on rigorous evaluation methodologies and includes the use of powerful statistical significance tests to derive meaningful conclusions. The obtained results show that a number of measures are equivalent in terms of accuracy, with one single candidate outperforming the rest. These findings, together with the methodology employed, invite researchers in the field to adopt more consistent evaluation criteria and to make more informed decisions regarding the baseline measures to which new developments should be compared.

Introduction

Data in the form of time series pervades a large number of scientific domains [19], [24]. Observations that unfold over time usually represent valuable information subject to analysis, classification, indexing, prediction, or interpretation [18], [14], [28], [11]. Real-world examples include financial data (e.g., stock market fluctuations), medical data (e.g., electrocardiograms), computer data (e.g., log sequences), or motion data (e.g., location of moving objects). Even object shapes or handwriting can be effectively transformed into time series, facilitating their analysis and retrieval [23], [24].

A core issue when dealing with time series is determining their pairwise similarity, i.e., the degree to which a given time series resembles another. In fact, a time series similarity (or dissimilarity) measure is central to many mining, retrieval, clustering, and classification tasks [14], [28], [11], [21]. Furthermore, there is evidence that simple approaches to such tasks exploiting generic time series similarity measures usually outperform more elaborate, sometimes specifically-targeted strategies. This is the case, for instance, with time series classification, where a one-nearest neighbor approach using a well-known time series similarity measure was found to outperform an exhaustive list of alternatives [53], including decision trees, multi-scale histograms, multi-layer perceptron neural networks, first-order logic rules with boosting, and multiple classifier systems.

Deriving a measure that correctly reflects time series similarities is not straightforward. Apart from dealing with high dimensionality (time series can be roughly considered as multi-dimensional data), the calculation of such measures needs to be fast and efficient [21]. Indeed, with better information gathering tools, the size of time series data sets may continue to increase in the future. Moreover, there is the need for generic/multi-purpose similarity measures, so that they can be readily applied to any data set, whether this application is the final goal or just an initial approach to a given task. This last aspect highlights another desirable quality for time series similarity measures: their robustness to different types of data (cf. [21], [50]).

Over the years, several time series similarity measures have been proposed (for pointers to such measures see, e.g., [28], [11], [50]). Nevertheless, few quantitative comparisons have been made in order to evaluate their efficacy in a multiple-data framework. Apart from being an interesting and important task in itself, time series classification, as opposed to clustering, offers the possibility to straightforwardly assess the merit of time series similarity measures under a controlled, objective, and quantitative framework [21].

In a recent study, Wang et al. [50] perform an extensive comparison of classification accuracies for 9 measures (plus 4 variants) across 38 data sets coming from various scientific domains. One of the main conclusions of the study is that, even though newly proposed measures can be theoretically attractive, the efficacy of some common and well-established measures is, in the vast majority of cases, very difficult to beat. Specifically, dynamic time warping (DTW; [4]) is found to be consistently superior to the other studied measures (or, at worst, equivalent for a few data sets). In addition, the authors emphasize that the Euclidean distance remains a quite accurate, robust, simple, and efficient way of measuring the similarity between two time series. Finally, by looking in detail at the results presented by Wang et al. [50], we can spot a group of time series similarity measures that seems to have an efficacy comparable to DTW: those based on edit distances. In particular, the edit distance for real sequences (EDR; [7]) seems to be very competitive, if not slightly better than DTW. Interestingly, none of the three measures above was initially targeted at generic time series data; all were adopted for this purpose only with hindsight [1], [4], [7]. The intuition behind the Euclidean distance relates to spatial proximity, DTW was initially devised for the specific task of spoken word recognition [43], and edit distances were introduced for measuring the dissimilarity between two strings [27].
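To make the dynamic programming behind DTW concrete, the following minimal Python sketch (our own illustration; the function and variable names are not taken from any of the cited works) computes the classic unconstrained DTW distance between two univariate series:

    import numpy as np

    def dtw(x, y):
        # D[i, j] holds the cost of the cheapest warping path aligning
        # the first i samples of x with the first j samples of y.
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(x[i - 1] - y[j - 1])        # local sample distance
                D[i, j] = cost + min(D[i - 1, j],      # insertion
                                     D[i, j - 1],      # deletion
                                     D[i - 1, j - 1])  # match
        return D[n, m]

In contrast to the lock-step Euclidean distance, the recurrence lets one series locally stretch or compress along the time axis, which is what makes DTW robust to temporally misaligned patterns.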

The study by Wang et al. [50] is, to the best of our knowledge, the only comparative study dealing with time series classification using multiple similarity measures and a large collection of data. In general, the studies introducing a new measure compare it against only a few others, usually using a reduced data set corpus (cf. [21]). Furthermore, there is a lack of agreement in the literature regarding evaluation methodologies. Besides, statistical significance is usually not studied or, at best, improperly evaluated. This is very inconvenient, as robust evaluation methodologies and statistical significance tests are the principal tools by which we can establish, in a formal and rigorous way, differences across the considered measures [45], [16], [9]. In addition, the chosen parameter values for every measure are rarely discussed. All these issues hamper the scientific development of the field, as one is never sure, e.g., which measure should be used as a baseline for future developments, or which parameters are the most sensible choice.

In this work, we perform an empirical evaluation of similarity measures for time series classification. We follow the initiative by Wang et al. [50] and consider a big pool of publicly-available time series data sets (45 in our case). However, instead of additionally focusing on representation methods, computational/storage demands, or more theoretical issues, we here take a pragmatic approach and restrict ourselves to classification accuracy. We believe that this is the most important aspect to be considered in a first stage and that, in contrast to the other aforementioned issues, it is not sufficiently well-covered in the existing literature. As for the considered measures, we decide to include DTW and EDR, as these were found to generally achieve the highest accuracies among all measures compared in Wang et al. [50]. Apart from these two, we choose the Euclidean distance plus 4 different measures not considered in that study, making a total of 7. Further important contributions that differentiate the current work from previous studies include (a) an extensive summary and background of the considered measures, with basic formulations, applications, and references, (b) the formalization of a robust evaluation methodology, exploiting standard out-of-sample cross-validation strategies, (c) the use of rigorous statistical significance tests in order to assess the superiority of a given measure, (d) the evaluation of both train and test accuracies, and (e) the assessment of the chosen parameters for each measure and data set.
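As an illustration of contribution (c), the sketch below shows the kind of pairwise significance testing such a methodology entails: a Wilcoxon signed-rank test on per-data-set accuracies for every pair of measures, combined with Holm's step-down correction for multiple comparisons (both tests appear in the reference list). The function name and data layout are our own assumptions, not the paper's code:

    from itertools import combinations
    import numpy as np
    from scipy.stats import wilcoxon

    def pairwise_holm(acc, alpha=0.05):
        # acc maps measure name -> accuracies, one per data set (same order).
        pairs = list(combinations(sorted(acc), 2))
        pvals = [wilcoxon(acc[a], acc[b]).pvalue for a, b in pairs]
        order = np.argsort(pvals)  # most significant first
        verdict = {}
        for rank, idx in enumerate(order):
            # Holm: test the k-th smallest p-value against alpha / (m - k).
            if pvals[idx] > alpha / (len(pairs) - rank):
                for rest in order[rank:]:  # all remaining tests fail too
                    verdict[pairs[rest]] = False
                break
            verdict[pairs[idx]] = True
        return verdict

The step-down correction controls the family-wise error rate, so a pair of measures is declared different only if its p-value survives a threshold that tightens with the number of comparisons.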

The rest of the paper is organized as follows. Firstly, we provide the background on time series similarity measures, outline some of their applications, and detail their calculation (Section 2). Next, we explain the proposed evaluation methodology (Section 3). Subsequently, we report the obtained results (Section 4). A conclusion section ends the paper (Section 5).

Section snippets

Time series similarity measures

The list of approaches for dealing with time series similarity is vast, and a comprehensive enumeration of them all is beyond the scope of the present work (for that, the interested reader is referred to [13], [50], [14], [28], [31], [11]). In this section, we present several representative examples of different ‘families’ of time series similarity measures: lock-step measures (Euclidean distance), feature-based measures (Fourier coefficients), model-based measures (auto-regressive), and elastic measures (dynamic time warping and edit-based distances).
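To make the family distinction tangible, here is a minimal Python sketch (our own illustration, assuming equal-length, univariate series) contrasting a lock-step measure with a feature-based one: the plain Euclidean distance versus a Euclidean distance computed on the first k Fourier coefficients:

    import numpy as np

    def euclidean(x, y):
        # Lock-step: compares samples at identical time indices only.
        return np.sqrt(np.sum((x - y) ** 2))

    def fourier_distance(x, y, k=10):
        # Feature-based: compares the first k complex Fourier coefficients,
        # which summarize the coarse shape of each series.
        fx, fy = np.fft.rfft(x)[:k], np.fft.rfft(y)[:k]
        return np.sqrt(np.sum(np.abs(fx - fy) ** 2))

Model-based measures follow yet another route: they fit a model (e.g., auto-regressive coefficients) to each series and compare the fitted parameters rather than the raw samples.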

Classification scheme

The efficacy of a time series similarity measure is commonly evaluated by the classification accuracy it achieves [21], [50]. For that, the error ratio of a distance-based classifier is calculated for a given labeled data set, understanding the error ratio as the number of wrongly classified items divided by the total number of tested items. The standard choice for the classifier is the one-nearest neighbor (1NN) classifier. Following Wang et al. [50], we can enumerate several advantages of this choice.
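As a sketch of this evaluation scheme (our own code, with hypothetical names), the error ratio of a 1NN classifier under any dissimilarity measure can be computed as follows:

    import numpy as np

    def one_nn_error_ratio(train, train_labels, test, test_labels, dist):
        # dist is any time series dissimilarity, e.g. the dtw() sketch above.
        errors = 0
        for series, label in zip(test, test_labels):
            d = [dist(series, ref) for ref in train]   # distance to each
            nearest = train_labels[int(np.argmin(d))]  # training series
            errors += int(nearest != label)
        return errors / len(test)  # wrongly classified / total tested

Note that 1NN has no training phase beyond storing the labeled series, so the measured accuracy reflects the quality of the similarity measure itself rather than that of a tuned classifier.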

Classification performance: test

If we look at the overall results, we see that all considered measures clearly outperform the random baseline for practically all the 45 data sets (Table 2). Furthermore, we see that some of them achieve near-perfect accuracies for a number of data sets (e.g., CBF, CinC_ECG_torso, ECGFiveDays, Two_Patterns, or TwoLeadECG). However, no single measure achieves the best performance for all the data sets. The Euclidean distance is found to be the best-performing measure in 2 data sets, FC is the …

Conclusion

From a general perspective, the obtained results show that there is a group of equivalent similarity measures, with no statistically significant differences among them (DTW, EDR, and MJC). The existing literature suggests that some longest common sub-sequence approaches, together with alternative variants of DTW and EDR, could potentially join this group [31], [50]. However, according to the results reported here, the TWED measure originally proposed by Marteau [31] seems to consistently outperform the rest.

Acknowledgements

We thank the people who made available or contributed to the UCR time series repository. This research has been funded by Grants 2009-SGR-1434 from the Generalitat de Catalunya, JAEDOC069/2010 from the Consejo Superior de Investigaciones Científicas, TIN2009-13692-C03-01 and TIN2012-38450-C03-03 from the Spanish Government, and EU FEDER funds.

References (54)

  • Y. Cai, R. Ng, Indexing spatio-temporal trajectories with Chebyshev polynomials, in: Proc. of the ACM SIGMOD Int. Conf....
  • K.P. Chan, A.C. Fu, Efficient time series matching by wavelets, in: Proc. of the IEEE Int. Conf. on Data Engineering,...
  • L. Chen, M.T. Öszu, V. Oria, Robust and fast similarity search for moving object trajectories, in: Proc. of the ACM...
  • G. Das, K.I. Lin, H. Mannila, G. Renganathan, P. Smyth, Rule discovery from time series, in: Proc. of the AAAI Int....
  • J. Demšar, Statistical comparison of classifiers over multiple data sets, J. Machine Learn. Res. (2006)
  • C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases, in: Proc. of the ACM...
  • P. Geurts, Contributions to Decision Tree Induction: Bias/Variance Tradeoff and Time Series Classification, Ph.D....
  • D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (1997)
  • J. Han et al., Data Mining: Concepts and Techniques (2005)
  • T. Hastie et al., The Elements of Statistical Learning, second ed. (2009)
  • M. Hollander et al., Nonparametric Statistical Methods, second ed. (1999)
  • S. Holm, A simple sequentially rejective multiple test procedure, Scand. J. Stat. (1979)
  • H. Kantz et al., Nonlinear Time Series Analysis (2004)
  • E.J. Keogh, Machine learning in time series databases (and everything is a time series!), in: Tutorial at the AAAI Int....
  • E.J. Keogh et al., Dimensionality reduction for fast similarity search in large time series databases, Knowl. Inform. Syst. (2001)
  • E.J. Keogh et al., On the need for time series data mining benchmarks: a survey and empirical demonstration, Data Min. Knowl. Discov. (2003)
  • E.J. Keogh et al., Exact indexing of dynamic time warping, Knowl. Inform. Syst. (2005)