research-article

Clustering stream data by exploring the evolution of density mountain

Authors:
Shufeng Gong, Northeastern University, Shenyang, China
Yanfeng Zhang, Northeastern University, Shenyang, China
Ge Yu, Northeastern University and Liaoning University, Shenyang, China

Proceedings of the VLDB Endowment, Volume 11, Issue 4, pp. 393–405. https://doi.org/10.1145/3186728.3164136

Published: 01 December 2017

Abstract

Stream clustering is a fundamental problem in many streaming data analysis applications. Compared to classical batch-mode clustering, stream clustering poses two key challenges: (i) given that the input data change continuously, how can the clustering results be updated incrementally and efficiently? (ii) given that clusters evolve continuously as the data evolve, how can the cluster evolution activities be captured? Unfortunately, most existing stream clustering algorithms can neither update the clustering result in real time nor track the evolution of clusters.

In this paper, we propose EDMStream, a stream clustering algorithm that explores the Evolution of Density Mountain. The density mountain is used to abstract the data distribution, and changes in the density mountain indicate that the data distribution is evolving. We track the evolution of clusters by monitoring the changes of density mountains. We further provide efficient data structures and filtering schemes to ensure that density mountains are updated in real time, which makes online clustering possible. The experimental results on synthetic and real datasets show that, compared to state-of-the-art stream clustering algorithms such as D-Stream, DenStream, DBSTREAM and MR-Stream, our algorithm responds to a cluster update much faster (7-15x faster than the best of the competitors) while achieving comparable cluster quality. Furthermore, EDMStream successfully captures the cluster evolution activities.
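
The abstract does not spell out how a density mountain is built. As a rough illustration only, the sketch below assumes it follows the density-peaks idea of Rodriguez and Laio (Science, 2014): every point depends on its nearest neighbour of higher density, and a point whose dependency distance exceeds a threshold becomes the peak of its own mountain. This is a toy batch version, not the EDMStream algorithm; the function name `density_peaks_labels` and the parameters `radius` and `tau` are hypothetical, and the streaming parts (incremental updates, filtering schemes) are not shown.

```python
import math


def density_peaks_labels(points, radius, tau):
    """Toy batch density-peaks clustering (illustrative, not EDMStream).

    points: list of coordinate tuples
    radius: neighbourhood radius used to count local density (assumed parameter)
    tau:    dependency-distance threshold that separates mountains (assumed parameter)
    Returns a list where labels[i] is the index of the peak point of i's cluster.
    """
    n = len(points)

    # Local density: number of other points within `radius`.
    rho = [
        sum(
            1
            for j in range(n)
            if j != i and math.dist(points[i], points[j]) <= radius
        )
        for i in range(n)
    ]

    # Visit points from densest to sparsest; ties are broken by index so that
    # every point except the first has at least one "higher" candidate.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    parent = [None] * n        # nearest point that precedes i in `order`
    delta = [math.inf] * n     # distance to that point
    for rank, i in enumerate(order):
        for j in order[:rank]:
            d = math.dist(points[i], points[j])
            if d < delta[i]:
                delta[i], parent[i] = d, j

    # A point starts a new mountain if it has no parent or is far from it;
    # otherwise it inherits the label of its parent, which was labelled earlier.
    labels = [None] * n
    for i in order:
        if parent[i] is None or delta[i] > tau:
            labels[i] = i
        else:
            labels[i] = labels[parent[i]]
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(density_peaks_labels(pts, radius=1.5, tau=3.0))
```

In this reading, cluster evolution would show up as changes in the dependency links: a link growing longer than `tau` splits a mountain, and a link shrinking below it merges two.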


Index Terms

Clustering stream data by exploring the evolution of density mountain
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Index terms have been assigned to the content through auto-classification.


Published in
Proceedings of the VLDB Endowment Volume 11, Issue 4
December 2017
133 pages
ISSN: 2150-8097
Editors: Jian Pei (Simon Fraser University), Sihem Amer-Yahia (University of Grenoble Alpes, CNRS)
Publisher
VLDB Endowment
