skip to main content
research-article

Clustering stream data by exploring the evolution of density mountain

Authors Info & Claims
Published:01 December 2017Publication History
Skip Abstract Section

Abstract

Stream clustering is a fundamental problem in many streaming data analysis applications. Comparing to classical batch-mode clustering, there are two key challenges in stream clustering: (i) Given that input data are changing continuously, how to incrementally update their clustering results efficiently? (ii) Given that clusters continuously evolve with the evolution of data, how to capture the cluster evolution activities? Unfortunately, most of existing stream clustering algorithms can neither update the cluster result in real-time nor track the evolution of clusters.

In this paper, we propose a stream clustering algorithm EDMStream by exploring the Evolution of Density Mountain. The density mountain is used to abstract the data distribution, the changes of which indicate data distribution evolution. We track the evolution of clusters by monitoring the changes of density mountains. We further provide efficient data structures and filtering schemes to ensure that the update of density mountains is in real-time, which makes online clustering possible. The experimental results on synthetic and real datasets show that, comparing to the state-of-the-art stream clustering algorithms, e.g., D-Stream, DenStream, DBSTREAM and MR-Stream, our algorithm is able to response to a cluster update much faster (say 7-15x faster than the best of the competitors) and at the same time achieve comparable cluster quality. Furthermore, EDMStream successfully captures the cluster evolution activities.

References

  1. https://arxiv.org/pdf/1710.00867.pdf.Google ScholarGoogle Scholar
  2. C. C. Aggarwal. Data streams: models and algorithms, volume 31. Springer Science and Business Media, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In Proceedings of the VLDB, pages 81--92, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. N. Begum and E. Keogh. Rare time series motif discovery from unbounded streams. VLDBJ, 8(2):149--160, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In Proceedings of the SDM, pages 328--339, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  6. L. Cao, Q. Wang, and E. A. Rundensteiner. Interactive outlier exploration in big data streams. VLDBJ, 7(13):1621--1624, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Y. Chen and L. Tu. Density-based clustering for real-time stream data. In Proceedings of the SIGKDD, pages 133--142, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. H. S. Christopher D. Manning, Prabhakar Raghavan. Introduction to Information Retrieval. Cambridge University Press, 1993.Google ScholarGoogle Scholar
  9. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the KDD, pages 226--231, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Gan and Y. Tao. Dbscan revisited: mis-claim, un-fixability, and approximation. In Proceedings of SIGMOD, pages 519--530, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Gan and Y. Tao. Dynamic density based clustering. In Proceedings of the SIGMOD, pages 1493--1507. ACM, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Gong and Y. Zhang. Eddpc:an efficient distributed density peaks clustering algorithm. Computer Research and Development, 53(6):1400--1409, 2016.Google ScholarGoogle Scholar
  13. M. Hahsler and M. Bolaños. Clustering data streams based on shared density between micro-clusters. IEEE TKDE, 28(6):1449--1461, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. H. Huang and S. P. Kasiviswanathan. Streaming anomaly detection using randomized matrix sketching. VLDBJ, 9(3):192--203, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. C. Isaksson, M. H. Dunham, and M. Hahsler. SOStream: Self Organizing Density-Based Clustering Over Data Stream. Springer Berlin Heidelberg, 2012.Google ScholarGoogle Scholar
  16. P. Kranen, I. Assent, C. Baldauf, and T. Seidl. The clustree: indexing micro-clusters for anytime stream mining. KAIS, 29(2):249--272, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes, and B. Pfahringer. An effective evaluation measure for clustering on evolving data streams. In Proceedings of the KDD, pages 868--876, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. M. Lichman. UCI machine learning repository, http://archive.ics.uci.edu/ml, 2013.Google ScholarGoogle Scholar
  19. J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Berkeley symposium on mathematical statistics and probability, pages 281--297, 1967.Google ScholarGoogle Scholar
  20. M. Oliveira and J. Gama. A framework to monitor clusters evolution applied to economy and finance problems. Intelligent Data Analysis, 16(1):93--111, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical Report, 2006.Google ScholarGoogle Scholar
  22. M. Ranasinghe, G. BeeHua, and T. Barathithasan. Estimating willingness to pay for urban water supply: a comparison of artificial neural networks and multiple regression analysis. Impact Assessment and Project Appraisal, 17(4):273--281, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  23. A. Reiss and D. Stricker. Creating and benchmarking a new dataset for physical activity monitoring. In Proceedings of the Affect and Behaviour Related Assistance, pages 1--8, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. A. Reiss and D. Stricker. Introducing a new benchmarked dataset for activity monitoring. In Proceedings of the ISWC, pages 108--109, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492--1496, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  26. J. A. Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. Carvalho, C. P. L. F. De, and J. Gama. Data stream clustering: A survey. ACM Computing Surveys, 46(1):125--134, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult. Monic: modeling and monitoring cluster transitions. In Proceedings of the SIGKDD, pages 706--711, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. S. J. Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. K. Chan. Cost-based modeling and evaluation for data mining with application to fraud and intrusion detection: Results from the jam project. In Proceedings of the KDD, pages 130--144, 1999.Google ScholarGoogle Scholar
  29. J. R. Vennam and S. Vadapalli. Syndeca: A tool to generate synthetic datasets for evaluation of clustering algorithms. In Proceedings of the COMAD, pages 27--36, 2005.Google ScholarGoogle Scholar
  30. L. Wan, W. K. Ng, X. H. Dang, P. S. Yu, and K. Zhang. Density-based clustering of data streams at multiple resolutions. ACM TKDD, 3(3):49--50, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an efficient data clustering method for very large databases. In Proceedings of the SIGMOD, pages 103--114, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. X. Zhang, C. Furtlehner, C. Germain-Renaud, and M. Sebag. Data stream clustering with affinity propagation. IEEE TKDE, 26(7):1644--1656, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  33. Y. Zhang, S. Chen, and G. Yu. Efficient distributed density peaks for clustering large data sets in mapreduce. IEEE TKDE, 28(12):3218--3230, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Clustering stream data by exploring the evolution of density mountain
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Proceedings of the VLDB Endowment
          Proceedings of the VLDB Endowment  Volume 11, Issue 4
          December 2017
          133 pages
          ISSN:2150-8097
          Issue’s Table of Contents

          Publisher

          VLDB Endowment

          Publication History

          • Published: 1 December 2017
          Published in pvldb Volume 11, Issue 4

          Qualifiers

          • research-article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader