research-article

YADING: fast clustering of large-scale time series data

Authors:
Rui Ding, Qiang Wang, Yingnong Dang, Qiang Fu, Haidong Zhang, Dongmei Zhang

Microsoft Research, Beijing, China

Proceedings of the VLDB Endowment, Volume 8, Issue 5, pp. 473–484. https://doi.org/10.14778/2735479.2735481

Published: 01 January 2015

Abstract

Fast and scalable analysis techniques are becoming increasingly important in the era of big data because they enable real-time, interactive data analysis. Time series are widely available in diverse application areas. Because of the large number of time series instances (e.g., millions) and the high dimensionality of each instance (e.g., thousands), clustering large-scale time series is challenging, and doing so in real time to support interactive exploration is even more challenging.

In this paper, we propose a novel end-to-end time series clustering algorithm, YADING, which automatically clusters large-scale time series quickly and with high-quality results. Specifically, YADING consists of three steps: sampling the input dataset, clustering the sampled dataset, and assigning the rest of the input data to the clusters generated on the sampled dataset. In particular, we provide theoretical proofs of the lower and upper bounds of the sample size, which not only guarantee YADING's high performance but also ensure distribution consistency between the input dataset and the sampled dataset. We also select the L₁ norm as the similarity measure and the multi-density approach as the clustering method. Backed by a theoretical bound, this selection ensures YADING's robustness to time series variations caused by phase perturbation and random noise.
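The three steps named above (sample, cluster the sample, assign the remaining data) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `sample_size` and `radius` parameters, and the fixed seed are all hypothetical. The clustering step uses a simple greedy "leader" clustering as a stand-in for the multi-density method, which the abstract does not specify, and the sample size is passed in by hand rather than derived from the paper's bounds. Only the use of the L₁ norm as the similarity measure is taken from the text.

```python
import random


def l1_distance(a, b):
    """L1 (Manhattan) distance between two equal-length time series."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_sample(sample, radius):
    """Greedy leader clustering on the sample (a placeholder, NOT the
    paper's multi-density method). A series joins the first leader within
    `radius` in L1 distance; otherwise it becomes a new leader."""
    leaders = []
    for series in sample:
        if not any(l1_distance(series, leader) <= radius for leader in leaders):
            leaders.append(series)
    return leaders


def assign_to_clusters(dataset, leaders):
    """Label each series with the index of its nearest leader (L1)."""
    return [
        min(range(len(leaders)), key=lambda i: l1_distance(series, leaders[i]))
        for series in dataset
    ]


def sample_cluster_assign(dataset, sample_size, radius, seed=0):
    """Step 1: sample. Step 2: cluster the sample. Step 3: assign all data.

    For simplicity every series, sampled or not, is labeled by its
    nearest leader. `sample_size` is an assumed, hand-picked value.
    """
    rng = random.Random(seed)
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    leaders = cluster_sample(sample, radius)
    return assign_to_clusters(dataset, leaders)


if __name__ == "__main__":
    # Two visibly different groups of short series (toy data).
    data = [
        [0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [10.0, 10.1, 10.0],
        [10.1, 10.0, 10.1], [0.0, 0.0, 0.1], [10.0, 10.0, 10.1],
    ]
    print(sample_cluster_assign(data, sample_size=4, radius=1.0))
```

The expensive clustering step runs only on the sample, and the remaining series need just one distance computation per cluster representative. That is the general reason a sampling-based design can scale, whatever clustering method is used on the sample.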

Evaluation results demonstrate that on typical-scale datasets (100,000 time series, each with 1,000 dimensions), YADING is about 40 times faster than the state-of-the-art sampling-based clustering algorithm DENCLUE 2.0 and about 1,000 times faster than DBSCAN and CLARANS. YADING has also been used by product teams at Microsoft to analyze service performance. Two such use cases are shared in this paper.

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment Volume 8, Issue 5
January 2015
181 pages
ISSN: 2150-8097
Publisher
VLDB Endowment