skip to main content
research-article

Data stream clustering: A survey

Published:11 July 2013Publication History
Skip Abstract Section

Abstract

Data stream mining is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. In this context, several data stream clustering algorithms have been proposed to perform unsupervised learning. Nevertheless, data stream clustering imposes several challenges to be addressed, such as dealing with nonstationary, unbounded data that arrive in an online fashion. The intrinsic nature of stream data requires the development of algorithms capable of performing fast and incremental processing of data objects, suitably addressing time and memory limitations. In this article, we present a survey of data stream clustering algorithms, providing a thorough discussion of the main design components of state-of-the-art algorithms. In addition, this work addresses the temporal aspects involved in data stream clustering, and presents an overview of the usually employed experimental methodologies. A number of references are provided that describe applications of data stream clustering in different domains, such as network intrusion detection, sensor networks, and stock market analysis. Information regarding software packages and data repositories are also available for helping researchers and practitioners. Finally, some important issues and open questions that can be subject of future research are discussed.

Skip Supplemental Material Section

Supplemental Material

References

  1. Ackermann, M. R., Martens, M., Raupach, C., Swierkot, K., Lammersen, C., and Sohler, C. 2012. StreamKM++: A clustering algorithm for data streams. ACM J. Exper. Algor. 17, 1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Agarwal, P. K., Har-Peled, S., and Varadarajan, K. R. 2004a. Approximating extent measures of points. J. ACM 51, 4, 606--635. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Aggarwal, C. C. 2003. A framework for diagnosing changes in evolving data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'03). 575--586. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Aggarwal, C. C. 2007. Data Streams -- Models and Algorithms. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Aggarwal, C. C. 2010. A segment-based framework for modeling and mining data streams. Knowl. Inf. Syst, 30, 1, 1--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Aggarwal, C. C., Han, J., Wang, J., and Yu, P. S. 2003. A framework for clustering evolving data streams. In Proceedings of the 29th Conference on Very Large Data Bases (VLDB'03). Vol. 29, 81--92. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Aggarwal, C. C., Han, J., Wang, J., and Yu, P. S. 2004b. A framework for projected clustering of high dimensional data streams. In Proceedings of the 30th Conference on Very Large Data Bases (VLDB'04). Vol. 30, 852--863. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Aggarwal, C. C. and Yu, P. S. 2006. A framework for clustering massive text and categorical data streams. In Proceedings of the 6th International Conference on Data Mining (SDM'06). 479.Google ScholarGoogle Scholar
  9. Aggarwal, C. C. and Yu, P. S. 2008. A framework for clustering uncertain data streams. In Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE'08). 150--159. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Aghbari, Z. A., Kamel, I., and Awad, T. 2012. On clustering large number of data streams. Intell. Data Anal. 16, 1, 69--91.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Amini, A., Wah, T. Y., Saybani, M. R., Aghabozorgi, S. R., and Yazdi, S. 2011. A study of density-grid based clustering algorithms on data streams. In Proceedings of the 8th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'11). IEEE Press, Los Alamitos, CA, 1652--1656.Google ScholarGoogle Scholar
  12. Arabie, P. and Hubert, L. J. 1999. An Overview of Combinatorial Data Analysis. World Scientific Publishing.Google ScholarGoogle Scholar
  13. Arthur, D. and Vassilvitskii, S. 2006. How slow is the k-means method? In Proceedings of the 22nd Annual Symposium on Computational Geometry (SCG'06). ACM Press, New York, 144--153. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Arthur, D. and Vassilvitskii, S. 2007. K-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms. 1027--1035. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. 2002. Models and issues in data stream systems. In Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'02). ACM Press, New York, 1--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Babcock, B., Datar, M., Motwani, R., and O'Callaghan, L. 2003. Maintaining variance and k-medians over data stream windows. In Proceedings of the 22nd ACM SIGMOD-SIGACT SIGART Symposium on Principles of Database Systems. ACM Press, New York, 234--243. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Badoiu, M., Har-Peled, S., and Indyk, P. 2002. Approximate clustering via core-sets. In Proceedings of the 34th ACM Symposium on Theory of Computing (STOC'02). ACM Press, New York, 250--257. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Ball, G. H. and Hall, D. J. 1965. ISODATA. A novel method of data analysis and pattern classification. Tech. rep. Stanford Research Institute, Menlo Park.Google ScholarGoogle Scholar
  19. Barbara, D. 2002. Requirements for clustering data streams. SIGKDD Explorations 3, 23--27. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Bentley, J. L. 1975. Multidimensional binary search trees used for associative searching. Comm. ACM 18, 9, 509--517. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Bentley, J. L. and Saxe, J. B. 1980. Decomposable searching problems I: Static-to-dynamic transformation. J. Algor. 1, 4, 301--358.Google ScholarGoogle ScholarCross RefCross Ref
  22. Bhat, U. N. and Miller, G. K. 2002. Elements of Applied Stochastic Processes 3rd Ed. John Wiley and Sons, New Jersey.Google ScholarGoogle Scholar
  23. Bifet, A. 2010. Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. IOS Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Bifet, A., Holmes, G., Kirkby, R., and Pfahringer, B. 2010. MOA: Massive online analysis. J. Mach. Learn. Res. 11, 1601--1604. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Bradley, P. S. and Fayyad, U. M. 1998a. Refining initial points for k-means clustering. In Proceedings of the 15th International Conference on Machine Learning (ICML'98). Morgan Kaufmann Publishers, San Francisco, 91--99. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Bradley, P. S., Fayyad, U. M., and Reina, C. 1998b. Scaling clustering algorithms to large databases. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD'98). AAAI Press, 9--15.Google ScholarGoogle Scholar
  27. Cao, F., Ester, M., Qian, W., and Zhou, A. 2006. Density-based clustering over an evolving data stream with noise. In Proceedings of the 6th SIAM International Conference on Data Mining. 328--339.Google ScholarGoogle Scholar
  28. Charikar, M. and Guha, S. 1999. Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science. 378--388. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Charikar, M., O'Callaghan, L., and Panigrahy, R. 2003. Better streaming algorithms for clustering problems. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing. ACM Press, New York, 30--39. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Chen, Y. and Tu, L. 2007. Density-based clustering for real-time stream data. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 133--142. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Cho, K., Jo, S., Jang, H., Kim, S. M., and Song, J. 2006. DCF: An efficient data stream clustering framework for streaming applications. In Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA'06). 114--122. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Csernel, B., Clerot, F., and Hebrail, G. 2006. Datastream clustering over tilted windows through sampling. In Proceedings of the Knowledge Discovery from Data Streams Workshop (ECML/PKDD).Google ScholarGoogle Scholar
  33. Dang, X. H., Lee, V. C. S., Ng, W. K., Ciptadi, A., and Ong, K.-L. 2009. An em-based algorithm for clustering data streams in sliding windows. In Proceedings of the 14th International Conference on Database Systems for Advanced Applications. Lecture Notes in Computer Science, vol. 5463, Springer, 230--235. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Domingos, P. and Hulten, G. 2001. A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the 8th International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, 106--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Dunham, M. H., Menga, Y., and Huang, J. 2004. Extensible markov model. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM'04). 371--374. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. 226--231.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Ribeiro, E. F., Barros, R. C., Carvalho, A. C. P. L. F., and Gama, J. 2012. improving the offline clustering stage of data stream algorithms in scenarios with variable number of clusters. In Proceedings of the 27th ACM Symposium on Applied Computing (SAC'12). 572--573. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Farnstrom, F., Lewis, J., and Elkan, C. 2000. Scalability for clustering algorithms revisited. SIGKDD Exploration Newslett. 2, 1, 51--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. 1996. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining. American Association for Artificial Intelligence, Menlo Park, CA, 1--34. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml.Google ScholarGoogle Scholar
  41. Gaber, M. M., Vatsavai, R. R., Omitaomu, O. A., Gama, J., Chawla, N. V., and Ganguly, A. R. 2010. Knowledge Discovery from Sensor Data. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Gama, J. 2010. Knowledge Discovery from Data Streams. Chapman Hall/CRC. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Gama, J. and Gaber, M. M. 2007. Learning from Data Streams: Processing Techniques in Sensor Networks. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Gama, J., Medas, P., Castillo, G., and Rodrigues, P. P. 2004. Learning with drift detection. In Proceedings of the 17th Brazilian Symposium on Artificial Intelligence (SBIA'04). Vol. 3171., 286--295.Google ScholarGoogle Scholar
  45. Gama, J., Pereira, P. R., and Lopes, L. 2011. Clustering distributed sensor data streams using local processing and reduced communication. Intell. Data Anal. 15, 1, 3--28. Google ScholarGoogle ScholarCross RefCross Ref
  46. Gama, J. and Pinto, C. 2006. Discretization from data streams: Applications to histograms and data mining. In Proceedings of the ACM Symposium on Applied Computing (SAC'06). 662--667. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Gan, G., Ma, C., and Wu, J. 2007. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability. SIAM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Gao, M.-M., Liu, J.-Z., and Gao, X.-X. 2010. Application of compound gaussian mixture model clustering in the data stream. In Proceedings of the International Conference on Computer Application and System Modeling (ICCASM'10).Google ScholarGoogle Scholar
  49. Gonzalez, T. F. 1985. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci. 38, 293--306.Google ScholarGoogle ScholarCross RefCross Ref
  50. Guha, S. 2009. Tight results for clustering and summarizing data streams. In Proceedings of the 12th International Conference on Database Theory (ICDT'09). ACM Press, New York, 268--275. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Guha, S., Meyerson, A., Mishra, N., Motwani, R., and O'Callaghan, L. 2003. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Engin. 15, 515--528. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Guha, S., Mishra, N., Motwani, R., and O'Callaghan, L. 2000. Clustering data streams. In Proceedings of the IEEE Symposium on Foundations of Computer Science. 359--366. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Hahsler, M. and Dunham, M. H. 2010. rEMM: Extensible markov model for data stream clustering in r. J. Statist. Softw. 35, 5, 1--31.Google ScholarGoogle ScholarCross RefCross Ref
  54. Hahsler, M. and Dunham, M. H. 2011. Temporal structure learning for clustering massive data streams in real-time. In Proceedings of the SIAM Conference on Data Mining. SIAM/Omnipress, 664--675.Google ScholarGoogle Scholar
  55. Han, J. and Kamber, M. 2000. Data Mining: Concepts and Techniques. Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Har-Peled, S. and Mazumdar, S. 2004. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing. 291--300. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Hulten, G. and Domingos, P. 2003. VFML -- A toolkit for mining high-speed time-changing data streams. Tech. rep. University of Washington. http://www.cs.washington.edu/dm/vfml/.Google ScholarGoogle Scholar
  58. Isaksson, C., Dunham, M. H., and Hahsler, M. 2012. SOStream: Self organizing density-based clustering over data stream. In Proceedings of the 8th International Conference on Machine Learning and Data Mining in Pattern Recognition. Lecture Notes in Computer Science, vol. 7376, Springer, 264--278. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Jain, A. K. 2009. Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31, 651--666. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Jiang, N. and Gruenwald, L. 2006. Research issues in data stream association rule mining. SIGMOD Rec. 35, 1, 14--19. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Kaufman, L. and Rousseeuw, P. J. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Interscience.Google ScholarGoogle Scholar
  62. Kavitha, V. and Punithavalli, M. 2010. Clustering time series data stream - A literature survey. Int. J. Comput. Sci. Inf. Secur. 8, 1, 289--294.Google ScholarGoogle Scholar
  63. Khalilian, M. and Mustapha, N. 2010. Data stream clustering: challenges and issues. In Proceedings of International Multi Conference of Engineers and Computer Scientists. 566--569.Google ScholarGoogle Scholar
  64. Kogan, J. 2007. Introduction to Clustering Large and High-Dimensional Data. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Kontaki, M., Papadopoulos, A. N., and Manolopoulos, Y. 2008. Continuous trend-based clustering in data streams. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery. 251--262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Kranen, P., Assent, I., Baldauf, C., and Seidl, T. 2011. The clustree: Indexing microclusters for anytime stream mining. Knowl. Inf. Syst. 29, 2, 249--272. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Kremer, H., Kranen, P., Jansen, T., Seidl, T., Bifet, A., Holmes, G., and Pfahringer, B.. 2011. An effective evaluation measure for clustering on evolving data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11). ACM Press, New York, 868--876. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. Li, Q., Ma, X., Tang, S., and Xie, S. 2011. Continuously identifying representatives out of massive streams. In Proceedings of the 7th International Conference on Advanced Data Mining and Applications (ADMA'11). Springer, 1--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Li, Y. and Tan, B. H. 2011. Data stream clustering algorithm based on affinity propagation and density. Advanced Materials Res. 267, 444--449.Google ScholarGoogle ScholarCross RefCross Ref
  70. Liu, Y.-B., Cai, J.-R., Yin, J., and Fu, A. W.-C. 2008. Clustering text data streams. J. Comput. Sci. Technol. 23, 1, 112--128.Google ScholarGoogle ScholarCross RefCross Ref
  71. Lloyd, S. P. 1982. Least squares quantization in pcm. IEEE Trans. Inf. Theory 28, 2, 129--137. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. Luhr, S. and Lazarescu, M. 2009. Incremental clustering of dynamic data streams using connectivity based representative points. Data Knowl. Engin. 68, 1--27. Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. MacQueen, J. B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. L. M. Le Cam and J. Neyman, Eds., Vol. 1., 281--297.Google ScholarGoogle Scholar
  74. De Maesschalck, R., Jouan-Rimbaud, D., and Massart, D. L. 2000. The mahalanobis distance. Chemometrics Intell. Laboratory Syst. 50, 1--18.Google ScholarGoogle ScholarCross RefCross Ref
  75. Mahdiraji, A. R. 2009. Clustering data stream: A survey of algorithms. Int. J. Knowl.-Based Intell. Engin. Syst. 13, 2, 39--44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Markov, A. 1971. Extension of the limit theorems of probability theory to a sum of variables connected in a chain. In Dynamic Probabilistic Systems, Vol. 1, R. Howard, Ed., John Wiley and Sons, Chapter Appendix B, 552--577.Google ScholarGoogle Scholar
  77. Metwally, A., Agrawal, D., and Abbadi, A. E. L. 2005. Duplicate detection in click streams. In Proceedings of the 14th International Conference on World Wide Web. ACM Press, New York, 12--21. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Meyerson, A. 2001. Online facility location. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science. 426--431. Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., and Euler, T. 2006. YALE: Rapid prototyping for complex data mining tasks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06). L. Ungar, M. Craven, D. Gunopulos, and T. Eliassi-Rad, Eds., ACM Press, New York, 935--940. Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. O'Callaghan, L., Mishra, N., Meyerson, A., Guha, S., and Motwani, R. 2002. Streaming data algorithms for high-quality clustering. In Proceedings of the 18th International Conference on Data Engineering. 685--694. Google ScholarGoogle ScholarDigital LibraryDigital Library
  81. Oliveira, M. and Gama, J. 2010. MEC --Monitoring clusters' transitions. In Proceedings of the 5th Starting AI Researchers Symposium. IOS Press, 212--224. Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. Ong, K. L, Li, W., Ng, W.-K., and Lim, E.-P. 2004. SCLOPE: An algorithm for clustering data streams of categorical attributes. In Proceedings of the 6th International Conference on Data Warehousing and Knowledge Discovery (KDD'04). 209--218.Google ScholarGoogle Scholar
  83. Oliveira, M. D. B. and Gama, J. 2012. A framework to monitor clusters evolution applied to economy and finance problems. Intell. Data Anal. 16, 1, 93--111.Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. Ostfeld, A., Uber, J. G., Salomons, E., et al. 2008. The battle of the water sensor networks (BWSN): A design challenge for engineers and algorithms. J. Water Resources Plan. Manag. 134, 556.Google ScholarGoogle ScholarCross RefCross Ref
  85. Park, N. H. and Suk Lee, W. 2007. Cell trees: An adaptive synopsis structure for clustering multidimensional on-line data streams. Data Knowl. Engin. 63, 2, 528--549. Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. Ren, J. and Ma, R. 2009. Density-based data streams clustering over sliding windows. In Proceedings of the 6th International Conference on Fuzzy Systems and Knowledge Discovery. Vol. 5. 248--252. Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. Rodrigues, P., Gama, J., and Pedroso, J. P. 2006. ODAC: Hierarchical clustering of time series data streams. In Proceedings of the 6th SIAM International Conference on Data Mining. 499--503.Google ScholarGoogle Scholar
  88. Rodrigues, P. P., Gama, J., and Pedroso, J. P. 2008. Hierarchical clustering of time-series data streams. IEEE Trans. Knowl. Data Engin. 20, 5, 615--627. Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. Serir, L., Ramasso, E., and Zerhouni, N. 2012. Evidential evolving gustafson--kessel algorithm for online data streams partitioning using belief function theory. Int. J. Approximate Reason. 53, 5, 1--22. Google ScholarGoogle ScholarDigital LibraryDigital Library
  90. Shah, R., Krishnaswamy, S., and Gaber, M. M. 2005. Resource-aware very fast k-means for ubiquitous data stream mining. In Proceedings of the 2nd International Workshop on Knowledge Discovery in Data Streams, Held in Conjunction with the 16th European Conference on Machine Learning (ECML'05).Google ScholarGoogle Scholar
  91. Silva, A., Chiky, R., and Hebrail, G. 2011. A clustering approach for sampling data streams in sensor networks. Knowl. Inf. Syst. 32, 1, 1--23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  92. Silva, J. A. and Hruschka, E. R. 2011. Extending k-means-based algorithms for evolving data streams with variable number of clusters. In Proceedings of the 4th International Conference on Machine Learning and Applications (ICMLA'11). Vol. 2. 14--19. Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., and Schult, R. 2006. MONIC: Modeling and monitoring cluster transitions. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06). ACM Press, New York, 706--711. Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. Steinhaus, H. 1956. Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci 1, 801--804.Google ScholarGoogle Scholar
  95. Tasoulis, D. K., Adams, N. M., and Hand, D. J. 2006. Unsupervised clustering in streaming data. In Proceedings of the 6th IEEE International Conference on Data Mining-Workshops (ICDM'06). 638--642. Google ScholarGoogle ScholarDigital LibraryDigital Library
  96. Tavallaee, M., Bagheri, E., Lu, W., and Ghorbani, A. A. 2009. A detailed analysis of the kdd cup 99 data set. In Proceedings of the 2nd IEEE International Conference on Computational Intelligence for Security and Defense Applications. 53--58. Google ScholarGoogle ScholarDigital LibraryDigital Library
  97. Vattani, A. 2009. K-means requires exponentially many iterations even in the plane. In Proceedings of the 25th Annual Symposium on Computational Geometry (SCG'09). ACM Press, New York, 324--332. Google ScholarGoogle ScholarDigital LibraryDigital Library
  98. Wan, R., Yan, X., and Su, X. 2008. A weighted fuzzy clustering algorithm for data stream. In Proceedings of the ISECS International Colloquium on Computing, Communication, Control, and Management. 360--364. Google ScholarGoogle ScholarDigital LibraryDigital Library
  99. Wu, X., Kumar, V., Quinlan, J. R, Ghosh, J., Yang, Q., Motoda, H., Mclachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., and Steinberg, D. 2007. Top 10 algorithms in data mining. Knowl. Inf. Syst. 14, 1--37. Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. Xu, R. and Wunsch, D. 2009. Clustering. Computational Intelligence Series, Wiley-IEEE Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  101. Yang, C. and Zhou, J. 2006. HClustream: A novel approach for clustering evolving heterogeneous data stream. In Proceedings of the 6th IEEE International Conference on Data Mining. 682--688. Google ScholarGoogle ScholarDigital LibraryDigital Library
  102. Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York, 103--114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  103. Zhang, T., Ramakrishnan, R., and Livny, M. 1997. BIRCH: A new data clustering algorithm and its applications. Data Mining Knowl. Discov. 1, 2, 141--182. Google ScholarGoogle ScholarDigital LibraryDigital Library
  104. Zhang, X., Sebag, M., and Germain-Renaud, C. 2009. Multi-scale real-time grid monitoring with job stream mining. In Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'09). 420--427. Google ScholarGoogle ScholarDigital LibraryDigital Library
  105. Zhang, X. and Wang, W. 2010. Self-adaptive change detection in streaming data with nonstationary distribution. In Advanced Data Mining and Applications. Springer, 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  106. Zhang, X., Zhou, X., and Hu, X. 2006. Semantic smoothing for model-based document clustering. In Proceedings of the 6th International Conference on Data Mining (ICDM'06). 1193--1198. Google ScholarGoogle ScholarDigital LibraryDigital Library
  107. Zhou, A., Cao, F., Qian, W., and Jin, C. 2008. Tracking clusters in evolving data streams over sliding windows. Knowl. Inf. Syst. 15, 2, 181--214. Google ScholarGoogle ScholarDigital LibraryDigital Library
  108. Zhu, H., Wang, Y., and Yu, Z. 2010. Clustering of evolving data stream with multiple adaptive sliding window. In Proceedings of the International Conference on Data Storage and Data Engineering (DSDE'10). 95--100. Google ScholarGoogle ScholarDigital LibraryDigital Library
  109. Zhu, Y. and Shasha, D. 2002. StatStream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02). VLBD Endowment, 358--369. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Data stream clustering: A survey

    Recommendations

    Reviews

    Fazli Can

    Data streams, which are continuously generated at uncontrollable, rapid rates, are a relatively new form of temporal data object. Data stream clustering (DSC) is an unsupervised data mining process that looks for hidden patterns in dynamic, massive, and unbounded datasets. Possible DSC applications include analysis of ATM transactions, computer network traffic, meteorological observations, and stock market transactions. In such applications, for example, one may want to find new trends and identify followers. The authors present various algorithms. They first provide an overview, followed by the data structures used for summarization, the time window models used for efficiency, outlier detection mechanisms that distinguish outliers from cluster evolution, and standard clustering algorithms used for actual clustering based on the output of the summarization step. In the later sections, the paper discusses temporal aspects that need to be considered during clustering, the assessment of the generated clustering structures, practical applications, software packages available for data stream clustering, challenges of the problem environment, and future research pointers. The paper is rounded out with a taxonomy that helps to distinguish the analyzed algorithms with respect to their important features. This is a useful paper for researchers, with a rich list of references (despite a typographical error in the reference for Faria et al., 2012). The ACM Digital Library provides online access to a short electronic appendix related to the paper with the time complexity analysis of the presented algorithms. The paper is comprehensive and nicely written. Online Computing Reviews Service

    Access critical reviews of Computing literature here

    Become a reviewer for Computing Reviews.

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Computing Surveys
      ACM Computing Surveys  Volume 46, Issue 1
      October 2013
      551 pages
      ISSN:0360-0300
      EISSN:1557-7341
      DOI:10.1145/2522968
      Issue’s Table of Contents

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 11 July 2013
      • Accepted: 1 January 2013
      • Revised: 1 September 2012
      • Received: 1 June 2012
      Published in csur Volume 46, Issue 1

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader