skip to main content
column

k-Shape: Efficient and Accurate Clustering of Time Series

Published:02 June 2016Publication History
Skip Abstract Section

Abstract

The proliferation and ubiquity of temporal data across many disciplines has generated substantial interest in the analysis and mining of time series. Clustering is one of the most popular data mining methods, not only due to its exploratory power, but also as a preprocessing step or subroutine for other techniques. In this paper, we describe k-Shape, a novel algorithm for time-series clustering. k-Shape relies on a scalable iterative refinement procedure, which creates homogeneous and well-separated clusters. As its distance measure, k-Shape uses a normalized version of the cross-correlation measure in order to consider the shapes of time series while comparing them. Based on the properties of that distance measure, we develop a method to compute cluster centroids, which are used in every iteration to update the assignment of time series to clusters. An extensive experimental evaluation against partitional, hierarchical, and spectral clustering methods, with the most competitive distance measures, showed the robustness of k-Shape. Overall, k-Shape emerges as a domain-independent, highly accurate, and efficient clustering approach for time series with broad applications.

References

  1. The UCR Time Series Classification/Clustering Homepage. http://www.cs.ucr.edu/~eamonn/time_series_data. Accessed: May 2014.Google ScholarGoogle Scholar
  2. R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In FODO, pages 69--84, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. J. Alon, S. Sclaroff, G. Kollios, and V. Pavlovic. Discovering clusters in motion time-series data. In CVPR, pages 375--381, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. J. Bagnall and G. J. Janacek. Clustering time series from ARMA models with clipped data. In KDD, pages 49--58, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Z. Bar-Joseph, G. Gerber, D. K. Gifford, T. S. Jaakkola, and I. Simon. A new approach to analyzing gene expression time series data. In RECOMB, pages 39--48, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. G. E. Batista, E. J. Keogh, O. M. Tataw, and V. M. de Souza. CID: An efficient complexity-invariant distance for time series. Data Mining and Knowledge Discovery, pages 1--36, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In AAAI Workshop on KDD, pages 359--370, 1994.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Y. Cai and R. Ng. Indexing spatio-temporal trajectories with Chebyshev polynomials. In SIGMOD, pages 599--610, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. L. Chen and R. Ng. On the marriage of Lp-norms and edit distance. In VLDB, pages 792--803, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD, pages 491--502, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Q. Chen, L. Chen, X. Lian, Y. Liu, and J. X. Yu. Indexable PLA for efficient similarity search. In VLDB, pages 435--446, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Y. Chen, M. A. Nascimento, B. C. Ooi, and A. K. Tung. Spade: On shape-based pattern detection in streaming time series. In ICDE, pages 786--795, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  13. J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297--301, 1965.Google ScholarGoogle ScholarCross RefCross Ref
  14. G. Das, K.-I. Lin, H. Mannila, G. Renganathan, and P. Smyth. Rule discovery from time series. In KDD, pages 16--22, 1998.Google ScholarGoogle Scholar
  15. H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: Experimental comparison of representations and distance measures. PVLDB, 1(2):1542--1552, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, pages 419--429, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Filippone, F. Camastra, F. Masulli, and S. Rovetta. A survey of kernel and spectral methods for clustering. Pattern Recognition, 41(1):176--190, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. E. Frentzos, K. Gratsias, and Y. Theodoridis. Index-based most similar trajectory search. In ICDE, pages 816--825, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  19. M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32:675--701, 1937.Google ScholarGoogle ScholarCross RefCross Ref
  20. M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216--231, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  21. M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani. Mining the stock market: Which measure is best? In KDD, pages 487--496, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. R. Giusti and G. E. Batista. An empirical comparison of dissimilarity measures for time series classification. In BRACIS, pages 82--88, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. S. Goddard, S. K. Harms, S. E. Reichenbach, T. Tadesse, and W. J. Waltman. Geospatial decision support for drought risk management. Communications of the ACM, 46(1):35--37, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. D. Q. Goldin and P. C. Kanellakis. On similarity queries for time-series data: Constraint specification and implementation. In CP, pages 137--153, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012.Google ScholarGoogle Scholar
  26. L. Gupta, D. L. Molfese, R. Tammana, and P. G. Simos. Nonlinear alignment and averaging for estimating the evoked potential. IEEE Transactions on Biomedical Engineering, 43(4):348--356, 1996.Google ScholarGoogle ScholarCross RefCross Ref
  27. M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2-3):107--145, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 3rd edition, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. R. Honda, S. Wang, T. Kikuchi, and O. Konishi. Mining of moving objects from time-series images and its application to satellite weather imagery. Journal of Intelligent Information Systems, 19(1):79--93, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. B. Hu, Y. Chen, and E. Keogh. Time series classification under more realistic assumptions. In SDM, pages 578--586, 2013.Google ScholarGoogle Scholar
  31. K. Kalpakis, D. Gada, and V. Puttagunta. Distance measures for effective clustering of ARIMA time-series. In ICDM, pages 273--280, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Y. Katznelson. An introduction to harmonic analysis. Cambridge University Press, 2004.Google ScholarGoogle Scholar
  33. L. Kaufman and P. J. Rousseeuw. Finding groups in data: An introduction to cluster analysis, volume 344. John Wiley & Sons, 2009.Google ScholarGoogle Scholar
  34. E. Keogh. A decade of progress in indexing and mining large time series databases. In VLDB, pages 1268--1268, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Locally adaptive dimensionality reduction for indexing large time series databases. In SIGMOD, pages 151--162, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. E. Keogh and J. Lin. Clustering of time-series subsequences is meaningless: Implications for previous and future research. Knowledge and Information Systems, 8(2):154--177, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. E. Keogh and C. A. Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3):358--386, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  38. C. Kin-pong and F. Ada. Efficient time series matching by wavelets. In ICDE, pages 126--133, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In SIGMOD, pages 289--300, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. C.-S. Li, P. S. Yu, and V. Castelli. MALM: A framework for mining sequence database at multiple abstraction levels. In CIKM, pages 267--272, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. X. Lian, L. Chen, J. X. Yu, G. Wang, and G. Yu. Similarity match over high speed time-series streams. In ICDE, pages 1086--1095, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  42. J. MacQueen. Some methods for classification and analysis of multivariate observations. In BSMSP, pages 281--297, 1967.Google ScholarGoogle Scholar
  43. R. N. Mantegna. Hierarchical structure in financial markets. The European Physical Journal B-Condensed Matter and Complex Systems, 11(1):193--197, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  44. W. Meesrikamolkul, V. Niennattrakul, and C. A. Ratanamahatana. Shape-based clustering for time series data. In PAKDD, pages 530--541. 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. V. Megalooikonomou, Q. Wang, G. Li, and C. Faloutsos. A multiresolution symbolic representation of time series. In ICDE, pages 668--679, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. M. D. Morse and J. M. Patel. An efficient and accurate method for evaluating time series similarity. In SIGMOD, pages 569--580, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. A. Mueen, E. Keogh, and N. Young. Logical-shapelets: An expressive primitive for time series classification. In KDD, pages 1154--1162, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. P. Nemenyi. Distribution-free Multiple Comparisons. PhD thesis, Princeton University, 1963.Google ScholarGoogle Scholar
  49. A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849--856, 2002.Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. V. Niennattrakul and C. A. Ratanamahatana. Shape averaging under time warping. In ECTI-CON, pages 626--629, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  51. T. Oates. Identifying distinctive subsequences in multivariate time series by clustering. In KDD, pages 322--326, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. P. Papapetrou, V. Athitsos, M. Potamias, G. Kollios, and D. Gunopulos. Embedding-based subsequence matching in time-series databases. TODS, 36(3):17, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. J. Paparrizos and L. Gravano. k-Shape: Efficient and accurate clustering of time series. In SIGMOD, pages 1855--1870, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. F. Petitjean, A. Ketterlin, and P. Gançarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678--693, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In KDD, pages 262--270, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. T. Rakthanmanon, E. J. Keogh, S. Lonardi, and S. Evans. Time series epenthesis: Clustering time series streams requires ignoring some data. In ICDM, pages 547--556, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846--850, 1971.Google ScholarGoogle ScholarCross RefCross Ref
  58. C. A. Ratanamahatana and E. Keogh. Making time-series classification more accurate using learned constraints. In SDM, pages 11--22, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  59. E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes. Correlating financial time series with micro-blogging activity. In WSDM, pages 513--522, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):43--49, 1978.Google ScholarGoogle ScholarCross RefCross Ref
  61. Y. Shou, N. Mamoulis, and D. Cheung. Fast and exact warping of time series using adaptive segmental approximations. Machine Learning, 58(2-3):231--267, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. K. Uehara and M. Shimada. Extraction of primitive motion and discovery of association rules from human motion data. In Progress in Discovery Science, pages 338--348. 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, and E. Keogh. Indexing multidimensional time-series. The VLDB Journal, 15(1):1--20, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. M. Vlachos, G. Kollios, and D. Gunopulos. Discovering similar multidimensional trajectories. In ICDE, pages 673--684, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. H. Wang, Y. Cai, Y. Yang, S. Zhang, and N. Mamoulis. Durable queries over historical time series. TKDE, 26(3):595--607, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, and E. Keogh. Experimental comparison of representation methods and distance measures for time series data. Data Mining and Knowledge Discovery, 26(2):275--309, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. T. Warren Liao. Clustering of time series data - a survey. Pattern Recognition, 38(11):1857--1874, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. Y. Xiong and D.-Y. Yeung. Mixtures of ARMA models for model-based time series clustering. In ICDM, pages 717--720, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. J. Yang and J. Leskovec. Patterns of temporal variation in online media. In WSDM, pages 177--186, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. L. Ye and E. Keogh. Time series shapelets: A new primitive for data mining. In KDD, pages 947--956, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. J. Zakaria, A. Mueen, and E. Keogh. Clustering time series using unsupervised-shapelets. In ICDM, pages 785--794, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. k-Shape: Efficient and Accurate Clustering of Time Series
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader