column

k-Shape: Efficient and Accurate Clustering of Time Series

Authors:
John Paparrizos

Columbia University

Columbia University
View Profile

,
Luis Gravano

Columbia University

Columbia University
View Profile

Authors Info & Claims

ACM SIGMOD Record Volume 45 Issue 1March 2016pp 69–76https://doi.org/10.1145/2949741.2949758

Published:02 June 2016Publication History

ACM SIGMOD Record

Abstract

The proliferation and ubiquity of temporal data across many disciplines has generated substantial interest in the analysis and mining of time series. Clustering is one of the most popular data mining methods, not only due to its exploratory power, but also as a preprocessing step or subroutine for other techniques. In this paper, we describe k-Shape, a novel algorithm for time-series clustering. k-Shape relies on a scalable iterative refinement procedure, which creates homogeneous and well-separated clusters. As its distance measure, k-Shape uses a normalized version of the cross-correlation measure in order to consider the shapes of time series while comparing them. Based on the properties of that distance measure, we develop a method to compute cluster centroids, which are used in every iteration to update the assignment of time series to clusters. An extensive experimental evaluation against partitional, hierarchical, and spectral clustering methods, with the most competitive distance measures, showed the robustness of k-Shape. Overall, k-Shape emerges as a domain-independent, highly accurate, and efficient clustering approach for time series with broad applications.

References

The UCR Time Series Classification/Clustering Homepage. http://www.cs.ucr.edu/~eamonn/time_series_data. Accessed: May 2014.Google Scholar
R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In FODO, pages 69--84, 1993. Google ScholarDigital Library
J. Alon, S. Sclaroff, G. Kollios, and V. Pavlovic. Discovering clusters in motion time-series data. In CVPR, pages 375--381, 2003. Google ScholarDigital Library
A. J. Bagnall and G. J. Janacek. Clustering time series from ARMA models with clipped data. In KDD, pages 49--58, 2004. Google ScholarDigital Library
Z. Bar-Joseph, G. Gerber, D. K. Gifford, T. S. Jaakkola, and I. Simon. A new approach to analyzing gene expression time series data. In RECOMB, pages 39--48, 2002. Google ScholarDigital Library
G. E. Batista, E. J. Keogh, O. M. Tataw, and V. M. de Souza. CID: An efficient complexity-invariant distance for time series. Data Mining and Knowledge Discovery, pages 1--36, 2013. Google ScholarDigital Library
D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In AAAI Workshop on KDD, pages 359--370, 1994.Google ScholarDigital Library
Y. Cai and R. Ng. Indexing spatio-temporal trajectories with Chebyshev polynomials. In SIGMOD, pages 599--610, 2004. Google ScholarDigital Library
L. Chen and R. Ng. On the marriage of Lp-norms and edit distance. In VLDB, pages 792--803, 2004. Google ScholarDigital Library
L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD, pages 491--502, 2005. Google ScholarDigital Library
Q. Chen, L. Chen, X. Lian, Y. Liu, and J. X. Yu. Indexable PLA for efficient similarity search. In VLDB, pages 435--446, 2007. Google ScholarDigital Library
Y. Chen, M. A. Nascimento, B. C. Ooi, and A. K. Tung. Spade: On shape-based pattern detection in streaming time series. In ICDE, pages 786--795, 2007.Google ScholarCross Ref
J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297--301, 1965.Google ScholarCross Ref
G. Das, K.-I. Lin, H. Mannila, G. Renganathan, and P. Smyth. Rule discovery from time series. In KDD, pages 16--22, 1998.Google Scholar
H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: Experimental comparison of representations and distance measures. PVLDB, 1(2):1542--1552, 2008. Google ScholarDigital Library
C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, pages 419--429, 1994. Google ScholarDigital Library
M. Filippone, F. Camastra, F. Masulli, and S. Rovetta. A survey of kernel and spectral methods for clustering. Pattern Recognition, 41(1):176--190, 2008. Google ScholarDigital Library
E. Frentzos, K. Gratsias, and Y. Theodoridis. Index-based most similar trajectory search. In ICDE, pages 816--825, 2007.Google ScholarCross Ref
M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32:675--701, 1937.Google ScholarCross Ref
M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216--231, 2005.Google ScholarCross Ref
M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani. Mining the stock market: Which measure is best? In KDD, pages 487--496, 2000. Google ScholarDigital Library
R. Giusti and G. E. Batista. An empirical comparison of dissimilarity measures for time series classification. In BRACIS, pages 82--88, 2013. Google ScholarDigital Library
S. Goddard, S. K. Harms, S. E. Reichenbach, T. Tadesse, and W. J. Waltman. Geospatial decision support for drought risk management. Communications of the ACM, 46(1):35--37, 2003. Google ScholarDigital Library
D. Q. Goldin and P. C. Kanellakis. On similarity queries for time-series data: Constraint specification and implementation. In CP, pages 137--153, 1995. Google ScholarDigital Library
G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012.Google Scholar
L. Gupta, D. L. Molfese, R. Tammana, and P. G. Simos. Nonlinear alignment and averaging for estimating the evoked potential. IEEE Transactions on Biomedical Engineering, 43(4):348--356, 1996.Google ScholarCross Ref
M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2-3):107--145, 2001. Google ScholarDigital Library
J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 3rd edition, 2011. Google ScholarDigital Library
R. Honda, S. Wang, T. Kikuchi, and O. Konishi. Mining of moving objects from time-series images and its application to satellite weather imagery. Journal of Intelligent Information Systems, 19(1):79--93, 2002. Google ScholarDigital Library
B. Hu, Y. Chen, and E. Keogh. Time series classification under more realistic assumptions. In SDM, pages 578--586, 2013.Google Scholar
K. Kalpakis, D. Gada, and V. Puttagunta. Distance measures for effective clustering of ARIMA time-series. In ICDM, pages 273--280, 2001. Google ScholarDigital Library
Y. Katznelson. An introduction to harmonic analysis. Cambridge University Press, 2004.Google Scholar
L. Kaufman and P. J. Rousseeuw. Finding groups in data: An introduction to cluster analysis, volume 344. John Wiley & Sons, 2009.Google Scholar
E. Keogh. A decade of progress in indexing and mining large time series databases. In VLDB, pages 1268--1268, 2006. Google ScholarDigital Library
E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Locally adaptive dimensionality reduction for indexing large time series databases. In SIGMOD, pages 151--162, 2001. Google ScholarDigital Library
E. Keogh and J. Lin. Clustering of time-series subsequences is meaningless: Implications for previous and future research. Knowledge and Information Systems, 8(2):154--177, 2005. Google ScholarDigital Library
E. Keogh and C. A. Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3):358--386, 2005.Google ScholarCross Ref
C. Kin-pong and F. Ada. Efficient time series matching by wavelets. In ICDE, pages 126--133, 1999. Google ScholarDigital Library
F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In SIGMOD, pages 289--300, 1997. Google ScholarDigital Library
C.-S. Li, P. S. Yu, and V. Castelli. MALM: A framework for mining sequence database at multiple abstraction levels. In CIKM, pages 267--272, 1998. Google ScholarDigital Library
X. Lian, L. Chen, J. X. Yu, G. Wang, and G. Yu. Similarity match over high speed time-series streams. In ICDE, pages 1086--1095, 2007.Google ScholarCross Ref
J. MacQueen. Some methods for classification and analysis of multivariate observations. In BSMSP, pages 281--297, 1967.Google Scholar
R. N. Mantegna. Hierarchical structure in financial markets. The European Physical Journal B-Condensed Matter and Complex Systems, 11(1):193--197, 1999.Google ScholarCross Ref
W. Meesrikamolkul, V. Niennattrakul, and C. A. Ratanamahatana. Shape-based clustering for time series data. In PAKDD, pages 530--541. 2012. Google ScholarDigital Library
V. Megalooikonomou, Q. Wang, G. Li, and C. Faloutsos. A multiresolution symbolic representation of time series. In ICDE, pages 668--679, 2005. Google ScholarDigital Library
M. D. Morse and J. M. Patel. An efficient and accurate method for evaluating time series similarity. In SIGMOD, pages 569--580, 2007. Google ScholarDigital Library
A. Mueen, E. Keogh, and N. Young. Logical-shapelets: An expressive primitive for time series classification. In KDD, pages 1154--1162, 2011. Google ScholarDigital Library
P. Nemenyi. Distribution-free Multiple Comparisons. PhD thesis, Princeton University, 1963.Google Scholar
A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849--856, 2002.Google ScholarDigital Library
V. Niennattrakul and C. A. Ratanamahatana. Shape averaging under time warping. In ECTI-CON, pages 626--629, 2009.Google ScholarCross Ref
T. Oates. Identifying distinctive subsequences in multivariate time series by clustering. In KDD, pages 322--326, 1999. Google ScholarDigital Library
P. Papapetrou, V. Athitsos, M. Potamias, G. Kollios, and D. Gunopulos. Embedding-based subsequence matching in time-series databases. TODS, 36(3):17, 2011. Google ScholarDigital Library
J. Paparrizos and L. Gravano. k-Shape: Efficient and accurate clustering of time series. In SIGMOD, pages 1855--1870, 2015. Google ScholarDigital Library
F. Petitjean, A. Ketterlin, and P. Gançarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678--693, 2011. Google ScholarDigital Library
T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In KDD, pages 262--270, 2012. Google ScholarDigital Library
T. Rakthanmanon, E. J. Keogh, S. Lonardi, and S. Evans. Time series epenthesis: Clustering time series streams requires ignoring some data. In ICDM, pages 547--556, 2011. Google ScholarDigital Library
W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846--850, 1971.Google ScholarCross Ref
C. A. Ratanamahatana and E. Keogh. Making time-series classification more accurate using learned constraints. In SDM, pages 11--22, 2004.Google ScholarCross Ref
E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes. Correlating financial time series with micro-blogging activity. In WSDM, pages 513--522, 2012. Google ScholarDigital Library
H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):43--49, 1978.Google ScholarCross Ref
Y. Shou, N. Mamoulis, and D. Cheung. Fast and exact warping of time series using adaptive segmental approximations. Machine Learning, 58(2-3):231--267, 2005. Google ScholarDigital Library
K. Uehara and M. Shimada. Extraction of primitive motion and discovery of association rules from human motion data. In Progress in Discovery Science, pages 338--348. 2002. Google ScholarDigital Library
M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, and E. Keogh. Indexing multidimensional time-series. The VLDB Journal, 15(1):1--20, 2006. Google ScholarDigital Library
M. Vlachos, G. Kollios, and D. Gunopulos. Discovering similar multidimensional trajectories. In ICDE, pages 673--684, 2002. Google ScholarDigital Library
H. Wang, Y. Cai, Y. Yang, S. Zhang, and N. Mamoulis. Durable queries over historical time series. TKDE, 26(3):595--607, 2014. Google ScholarDigital Library
X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, and E. Keogh. Experimental comparison of representation methods and distance measures for time series data. Data Mining and Knowledge Discovery, 26(2):275--309, 2013. Google ScholarDigital Library
T. Warren Liao. Clustering of time series data - a survey. Pattern Recognition, 38(11):1857--1874, 2005. Google ScholarDigital Library
Y. Xiong and D.-Y. Yeung. Mixtures of ARMA models for model-based time series clustering. In ICDM, pages 717--720, 2002. Google ScholarDigital Library
J. Yang and J. Leskovec. Patterns of temporal variation in online media. In WSDM, pages 177--186, 2011. Google ScholarDigital Library
L. Ye and E. Keogh. Time series shapelets: A new primitive for data mining. In KDD, pages 947--956, 2009. Google ScholarDigital Library
J. Zakaria, A. Mueen, and E. Keogh. Clustering time series using unsupervised-shapelets. In ICDM, pages 785--794, 2012. Google ScholarDigital Library

Index Terms

k-Shape: Efficient and Accurate Clustering of Time Series
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information systems applications

Index terms have been assigned to the content through auto-classification.

Recommendations

Shape clustering

This paper aims to address the problem of shape clustering by discovering the common structure which captures the intrinsic structural information of shapes belonging to the same cluster. It is based on a skeleton graph, named common structure skeleton ...
Read More
k-Shape: Efficient and Accurate Clustering of Time Series
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

The proliferation and ubiquity of temporal data across many disciplines has generated substantial interest in the analysis and mining of time series. Clustering is one of the most popular data mining methods, not only due to its exploratory power, but ...
Read More
Shape and texture clustering: Best estimate for the clusters number

The most difficult problem in automatic clustering is the determination of the total number of final clusters N"c"l"u"s"t"e"r. In the present paper, a new method for finding N"c"l"u"s"t"e"r is proposed and is compared with previously developed methods. ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGMOD Record Volume 45, Issue 1
March 2016
73 pages
ISSN:0163-5808
DOI:10.1145/2949741
Editors:
Yanlei Diao
University of Massachusetts Amherst
,
Vanessa Braganholo
Universidade Federal Fluminense
,
Marco Brambilla
Politecnico di Milano
,
Chee Yong Chan
National University of Singapore
,
Rada Chirkova
North Carolina State University
,
Zackary Ives
University of Pennsylvania
,
Anastasios Kementsietsidis
Google Research
,
Jeffrey Naughton
University of Wisconsin-Madison
,
Olga Papaemmanoui
Brandeis Univesity
,
Aditya Parameswaran
University of Illinois
,
Anish Das Sarma
Google Research
,
Alkis Simitsis
HP Labs
,
Wang-Chiew Tan
University of California Santa Cruz
,
Nesime Tatbul
MIT CSAIL
,
Marianne Winslett
University of Illinois
,
Jun Yang
Duke University
Issue’s Table of Contents
Copyright © 2016 Authors
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 2 June 2016
Check for updates
Qualifiers
- column
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 132
  Total Citations
  View Citations
- 1,679
  Total Downloads
- Downloads (Last 12 months)413
- Downloads (Last 6 weeks)54
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

k-Shape: Efficient and Accurate Clustering of Time Series

ACM SIGMOD Record

Abstract

References

Cited By

Index Terms

Recommendations

Shape clustering

k-Shape: Efficient and Accurate Clustering of Time Series

Shape and texture clustering: Best estimate for the clusters number