skip to main content
research-article

YADING: fast clustering of large-scale time series data

Published:01 January 2015Publication History
Skip Abstract Section

Abstract

Fast and scalable analysis techniques are becoming increasingly important in the era of big data, because they are the enabling techniques to create real-time and interactive experiences in data analysis. Time series are widely available in diverse application areas. Due to the large number of time series instances (e.g., millions) and the high dimensionality of each time series instance (e.g., thousands), it is challenging to conduct clustering on large-scale time series, and it is even more challenging to do so in real-time to support interactive exploration.

In this paper, we propose a novel end-to-end time series clustering algorithm, YADING, which automatically clusters large-scale time series with fast performance and quality results. Specifically, YADING consists of three steps: sampling the input dataset, conducting clustering on the sampled dataset, and assigning the rest of the input data to the clusters generated on the sampled dataset. In particular, we provide theoretical proof on the lower and upper bounds of the sample size, which not only guarantees YADING's high performance, but also ensures the distribution consistency between the input dataset and the sampled dataset. We also select L1 norm as similarity measure and the multi-density approach as the clustering method. With theoretical bound, this selection ensures YADING's robustness to time series variations due to phase perturbation and random noise.

Evaluation results have demonstrated that on typical-scale (100,000 time series each with 1,000 dimensions) datasets, YADING is about 40 times faster than the state-of-the-art, sampling-based clustering algorithm DENCLUE 2.0, and about 1,000 times faster than DBSCAN and CLARANS. YADING has also been used by product teams at Microsoft to analyze service performance. Two of such use cases are shared in this paper.

References

  1. Debregeas, A., and Hebrail, G. 1998. Interactive interpretation of Kohonen maps applied to curves. In Proc. of KDD'98. 179--183.Google ScholarGoogle Scholar
  2. Derrick, K., Bill, K., and Vamsi, C. 2012. Large scale/big data federation & virtualization: a case study.Google ScholarGoogle Scholar
  3. D. A. Patterson. 2002. A simple way to estimate the cost of downtime. In Proc. of LISA' 02, pp. 185--188. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Eamonn, K., and Shruti, K. 2002. On the need for time series data mining benchmarks: a survey and empirical demonstration. In Proc. of KDD'02, July 23--26.Google ScholarGoogle Scholar
  5. T. W. Liao. 2005. Clustering of time series data---A survey Pattern Recognit., vol. 38, no. 11, pp. 1857--1874, Nov. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P. Boesiger. 1998. A new correlation-based fuzzy logic clustering algorithm for fMRI, Mag. Resonance Med.Google ScholarGoogle Scholar
  7. B. K. Yi, H. V. Jagadish, and C. Faloutsos. 1998. Efficient retrieval of similar time sequences under time warping. In IEEE Proc. of ICDE, Feb. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Han, M. Kamber. 2001. Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, pp. 346--389. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Chu, K. & Wong, M. 1999. Fast time-series searching with scaling and shifting. In proc. of PODS. pp 237--248. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Popivanov, I. & Miller, R. J. 2002. Similarity search over time series data using wavelets. In proc. of ICDE. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. 2001. Locally adaptive dimensionality reduction for indexing large time series databases. In proc. of ACM SIGMOD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Yi, B. & Faloutsos, C. 2000. Fast time sequence indexing for arbitrary lp norms. In proc. of the VLDB. pp 385--394. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. 2001. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowl. Inf. Syst., 3(3).Google ScholarGoogle Scholar
  14. G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. 2003. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE, 15(5). Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Zhou, S., Zhou, A., Cao, J., Wen, J., Fan, Y., Hu. Y. 2000. Combining sampling technique with DBSCAN algorithm for clustering large spatial databases. In Proc. of the PAKDD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Stuart, Alan. 1962. Basic Ideas of Scientific Sampling, Hafner Publishing Company, New York.Google ScholarGoogle Scholar
  17. M. Ester, H. P. Kriegel, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. ACM SIGKDD 1996, pages 226--231.Google ScholarGoogle Scholar
  18. M. Steinbach, L. Ertoz, and V. Kumar. 2003. Challenges of clustering high dimensional data. In L. T. Wille, editor, New Vistas in Statistical Physics -- Applications in Econophysics, Bioinformatics, and Pattern Recognition. Springer-Verlag.Google ScholarGoogle Scholar
  19. Ng R. T., and Han J. 1994. Efficient and Effective Clustering Methods for Spatial Data Mining, In proc. of VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. S. Guha, R. Rastogi, K. Shim. 1998. CURE: an efficient clustering algorithm for large databases. SIGMOD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. W. Wang, J. Yang, R. Muntz, R. 1997. STING: a statistical information grid approach to spatial data mining, VLDB'97. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Hans-Peter Kriegel, Peer Kröger, Jörg Sander, Arthur Zimek 2011. Density-based Clustering. WIREs Data Mining and Knowledge Discovery 1 (3): 231--240. doi:10.1002/widm.30.Google ScholarGoogle ScholarCross RefCross Ref
  23. Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander. 1999. OPTICS: Ordering Points To Identify the Clustering Structure. ACM SIGMOD. pp. 49--60. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Tao Pei, Ajay Jasra, David J. Hand, A. X. Zhu, C. Zhou. 2009. DECODE: a new method for discovering clusters of different densities in spatial data, Data Min Knowl Disc. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. C. Guo, H. Li, and D. Pan. 2010. An improved piecewise aggregate approximation based on statistical features for time series mining. KSEM'10, pages 234--244. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Beckmann, N.; Kriegel, H. P.; Schneider, R.; Seeger, B. 1990. "The R-tree: an efficient and robust access method for points and rectangles". In proc. of SIGMOD. p. 322. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P. Boesiger, A new correlation-based fuzzy logic clustering algorithm for fMRI, Mag. Resonance Med. 40 (1998).Google ScholarGoogle Scholar
  28. T. W. Liao, B. Bolt, J. Forester, E. Hailman, C. Hansen, R. C. Kaste, J. O'May, Understanding and projecting the battle state, 23rd Army Science Conference, Orlando, FL, 2002.Google ScholarGoogle Scholar
  29. D. Piccolo, A distance measure for classifying ARMA models, J. Time Ser. Anal. 11 (2) (1990) 153--163.Google ScholarGoogle Scholar
  30. D. Tran, M. Wagner, Fuzzy c-means clustering-based speaker verification, in: N. R. Pal, M. Sugeno (Eds.), AFSS 2002, Lecture Notes in Artificial Intelligence, 2275, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Danon L, Díaz-Guilera A, Duch J and Arenas A 2005 J. Stat. Mech. P09008.Google ScholarGoogle Scholar
  32. M. Datar, N. Immorlica, P. Indyk, V. S. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions, Computational Geometry, pp. 253--262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Berchthold S., Keim D., Kriegel H. P. 1996. The X-Tree: An Index Structure for High-Dimensional Data, VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Keogh, E., Zhu, Q., Hu, B., Hao. Y., Xi, X., Wei, L. & Ratanamahatana, C. A. (2011). The UCR Time Series Classification/Clustering Homepage: www.cs.ucr.edu/~eamonn/time_series_data/Google ScholarGoogle Scholar
  35. http://research.microsoft.com/en-us/people/juding/yading.aspxGoogle ScholarGoogle Scholar
  36. A. Hinneburg and H. H. Gabriel. DENCLUE 2.0: Fast Clustering Based on Kernel Density Estimation. In IDA, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Sheih, J., Keogh, E., 2009. iSAX: disk-aware mining and indexing of massive time series data sets. Data Mining and Knowledge Discovery 19 (1), 24--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. D. Barbara. Requirements for clustering data streams. SIGKDD Explorations, 3(2): 23--27, January 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. P. Liu, D. Zhou, N. Wu. 2007. Varied Density Based Spatial Clustering of Application with Noise, In proc. of ICSSSM. 528--531Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. YADING: fast clustering of large-scale time series data
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Proceedings of the VLDB Endowment
          Proceedings of the VLDB Endowment  Volume 8, Issue 5
          January 2015
          181 pages

          Publisher

          VLDB Endowment

          Publication History

          • Published: 1 January 2015
          Published in pvldb Volume 8, Issue 5

          Qualifiers

          • research-article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader