research-article

Data stream clustering: A survey

Authors:
Jonathan A. Silva

University of São Paulo, São Paulo, Brazil

University of São Paulo, São Paulo, Brazil
View Profile

,
Elaine R. Faria

University of São Paulo and Federal University of Uberlândia, Brazil

University of São Paulo and Federal University of Uberlândia, Brazil
View Profile

,
Rodrigo C. Barros

University of São Paulo, São Paulo, Brazil

University of São Paulo, São Paulo, Brazil
View Profile

,
Eduardo R. Hruschka

University of São Paulo, São Paulo, Brazil

University of São Paulo, São Paulo, Brazil
View Profile

,
André C. P. L. F. de Carvalho

University of São Paulo, Brazil

University of São Paulo, Brazil
View Profile

,
João Gama

University of Porto, Portugal

University of Porto, Portugal
View Profile

Authors Info & Claims

ACM Computing Surveys Volume 46 Issue 1Article No.: 13pp 1–31https://doi.org/10.1145/2522968.2522981

Published:11 July 2013Publication History

ACM Computing Surveys

Abstract

Data stream mining is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. In this context, several data stream clustering algorithms have been proposed to perform unsupervised learning. Nevertheless, data stream clustering imposes several challenges to be addressed, such as dealing with nonstationary, unbounded data that arrive in an online fashion. The intrinsic nature of stream data requires the development of algorithms capable of performing fast and incremental processing of data objects, suitably addressing time and memory limitations. In this article, we present a survey of data stream clustering algorithms, providing a thorough discussion of the main design components of state-of-the-art algorithms. In addition, this work addresses the temporal aspects involved in data stream clustering, and presents an overview of the usually employed experimental methodologies. A number of references are provided that describe applications of data stream clustering in different domains, such as network intrusion detection, sensor networks, and stock market analysis. Information regarding software packages and data repositories are also available for helping researchers and practitioners. Finally, some important issues and open questions that can be subject of future research are discussed.

Supplemental Material

Available for Download

zip

silva.zip (37.5 KB)

Supplemental movie, appendix, image and software files for, Data stream clustering: A survey

References

Ackermann, M. R., Martens, M., Raupach, C., Swierkot, K., Lammersen, C., and Sohler, C. 2012. StreamKM++: A clustering algorithm for data streams. ACM J. Exper. Algor. 17, 1. Google ScholarDigital Library
Agarwal, P. K., Har-Peled, S., and Varadarajan, K. R. 2004a. Approximating extent measures of points. J. ACM 51, 4, 606--635. Google ScholarDigital Library
Aggarwal, C. C. 2003. A framework for diagnosing changes in evolving data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'03). 575--586. Google ScholarDigital Library
Aggarwal, C. C. 2007. Data Streams -- Models and Algorithms. Springer. Google ScholarDigital Library
Aggarwal, C. C. 2010. A segment-based framework for modeling and mining data streams. Knowl. Inf. Syst, 30, 1, 1--29. Google ScholarDigital Library
Aggarwal, C. C., Han, J., Wang, J., and Yu, P. S. 2003. A framework for clustering evolving data streams. In Proceedings of the 29^th Conference on Very Large Data Bases (VLDB'03). Vol. 29, 81--92. Google ScholarDigital Library
Aggarwal, C. C., Han, J., Wang, J., and Yu, P. S. 2004b. A framework for projected clustering of high dimensional data streams. In Proceedings of the 30^th Conference on Very Large Data Bases (VLDB'04). Vol. 30, 852--863. Google ScholarDigital Library
Aggarwal, C. C. and Yu, P. S. 2006. A framework for clustering massive text and categorical data streams. In Proceedings of the 6^th International Conference on Data Mining (SDM'06). 479.Google Scholar
Aggarwal, C. C. and Yu, P. S. 2008. A framework for clustering uncertain data streams. In Proceedings of the 24^th IEEE International Conference on Data Engineering (ICDE'08). 150--159. Google ScholarDigital Library
Aghbari, Z. A., Kamel, I., and Awad, T. 2012. On clustering large number of data streams. Intell. Data Anal. 16, 1, 69--91.Google ScholarDigital Library
Amini, A., Wah, T. Y., Saybani, M. R., Aghabozorgi, S. R., and Yazdi, S. 2011. A study of density-grid based clustering algorithms on data streams. In Proceedings of the 8^th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'11). IEEE Press, Los Alamitos, CA, 1652--1656.Google Scholar
Arabie, P. and Hubert, L. J. 1999. An Overview of Combinatorial Data Analysis. World Scientific Publishing.Google Scholar
Arthur, D. and Vassilvitskii, S. 2006. How slow is the k-means method&quest; In Proceedings of the 22^nd Annual Symposium on Computational Geometry (SCG'06). ACM Press, New York, 144--153. Google ScholarDigital Library
Arthur, D. and Vassilvitskii, S. 2007. K-means++: The advantages of careful seeding. In Proceedings of the 18^th Annual ACM-SIAM Symposium on Discrete Algorithms. 1027--1035. Google ScholarDigital Library
Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. 2002. Models and issues in data stream systems. In Proceedings of the 21^st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'02). ACM Press, New York, 1--16. Google ScholarDigital Library
Babcock, B., Datar, M., Motwani, R., and O'Callaghan, L. 2003. Maintaining variance and k-medians over data stream windows. In Proceedings of the 22^nd ACM SIGMOD-SIGACT SIGART Symposium on Principles of Database Systems. ACM Press, New York, 234--243. Google ScholarDigital Library
Badoiu, M., Har-Peled, S., and Indyk, P. 2002. Approximate clustering via core-sets. In Proceedings of the 34^th ACM Symposium on Theory of Computing (STOC'02). ACM Press, New York, 250--257. Google ScholarDigital Library
Ball, G. H. and Hall, D. J. 1965. ISODATA. A novel method of data analysis and pattern classification. Tech. rep. Stanford Research Institute, Menlo Park.Google Scholar
Barbara, D. 2002. Requirements for clustering data streams. SIGKDD Explorations 3, 23--27. Google ScholarDigital Library
Bentley, J. L. 1975. Multidimensional binary search trees used for associative searching. Comm. ACM 18, 9, 509--517. Google ScholarDigital Library
Bentley, J. L. and Saxe, J. B. 1980. Decomposable searching problems I: Static-to-dynamic transformation. J. Algor. 1, 4, 301--358.Google ScholarCross Ref
Bhat, U. N. and Miller, G. K. 2002. Elements of Applied Stochastic Processes 3^rd Ed. John Wiley and Sons, New Jersey.Google Scholar
Bifet, A. 2010. Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. IOS Press. Google ScholarDigital Library
Bifet, A., Holmes, G., Kirkby, R., and Pfahringer, B. 2010. MOA: Massive online analysis. J. Mach. Learn. Res. 11, 1601--1604. Google ScholarDigital Library
Bradley, P. S. and Fayyad, U. M. 1998a. Refining initial points for k-means clustering. In Proceedings of the 15^th International Conference on Machine Learning (ICML'98). Morgan Kaufmann Publishers, San Francisco, 91--99. Google ScholarDigital Library
Bradley, P. S., Fayyad, U. M., and Reina, C. 1998b. Scaling clustering algorithms to large databases. In Proceedings of the 4^th International Conference on Knowledge Discovery and Data Mining (KDD'98). AAAI Press, 9--15.Google Scholar
Cao, F., Ester, M., Qian, W., and Zhou, A. 2006. Density-based clustering over an evolving data stream with noise. In Proceedings of the 6^th SIAM International Conference on Data Mining. 328--339.Google Scholar
Charikar, M. and Guha, S. 1999. Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of the 40^th Annual Symposium on Foundations of Computer Science. 378--388. Google ScholarDigital Library
Charikar, M., O'Callaghan, L., and Panigrahy, R. 2003. Better streaming algorithms for clustering problems. In Proceedings of the 35^th Annual ACM Symposium on Theory of Computing. ACM Press, New York, 30--39. Google ScholarDigital Library
Chen, Y. and Tu, L. 2007. Density-based clustering for real-time stream data. In Proceedings of the 13^th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 133--142. Google ScholarDigital Library
Cho, K., Jo, S., Jang, H., Kim, S. M., and Song, J. 2006. DCF: An efficient data stream clustering framework for streaming applications. In Proceedings of the 17^th International Conference on Database and Expert Systems Applications (DEXA'06). 114--122. Google ScholarDigital Library
Csernel, B., Clerot, F., and Hebrail, G. 2006. Datastream clustering over tilted windows through sampling. In Proceedings of the Knowledge Discovery from Data Streams Workshop (ECML/PKDD).Google Scholar
Dang, X. H., Lee, V. C. S., Ng, W. K., Ciptadi, A., and Ong, K.-L. 2009. An em-based algorithm for clustering data streams in sliding windows. In Proceedings of the 14^th International Conference on Database Systems for Advanced Applications. Lecture Notes in Computer Science, vol. 5463, Springer, 230--235. Google ScholarDigital Library
Domingos, P. and Hulten, G. 2001. A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the 8^th International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, 106--113. Google ScholarDigital Library
Dunham, M. H., Menga, Y., and Huang, J. 2004. Extensible markov model. In Proceedings of the 4^th IEEE International Conference on Data Mining (ICDM'04). 371--374. Google ScholarDigital Library
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2^nd International Conference on Knowledge Discovery and Data Mining. 226--231.Google ScholarDigital Library
Ribeiro, E. F., Barros, R. C., Carvalho, A. C. P. L. F., and Gama, J. 2012. improving the offline clustering stage of data stream algorithms in scenarios with variable number of clusters. In Proceedings of the 27^th ACM Symposium on Applied Computing (SAC'12). 572--573. Google ScholarDigital Library
Farnstrom, F., Lewis, J., and Elkan, C. 2000. Scalability for clustering algorithms revisited. SIGKDD Exploration Newslett. 2, 1, 51--57. Google ScholarDigital Library
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. 1996. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining. American Association for Artificial Intelligence, Menlo Park, CA, 1--34. Google ScholarDigital Library
Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml.Google Scholar
Gaber, M. M., Vatsavai, R. R., Omitaomu, O. A., Gama, J., Chawla, N. V., and Ganguly, A. R. 2010. Knowledge Discovery from Sensor Data. Springer. Google ScholarDigital Library
Gama, J. 2010. Knowledge Discovery from Data Streams. Chapman Hall/CRC. Google ScholarDigital Library
Gama, J. and Gaber, M. M. 2007. Learning from Data Streams: Processing Techniques in Sensor Networks. Springer. Google ScholarDigital Library
Gama, J., Medas, P., Castillo, G., and Rodrigues, P. P. 2004. Learning with drift detection. In Proceedings of the 17^th Brazilian Symposium on Artificial Intelligence (SBIA'04). Vol. 3171., 286--295.Google Scholar
Gama, J., Pereira, P. R., and Lopes, L. 2011. Clustering distributed sensor data streams using local processing and reduced communication. Intell. Data Anal. 15, 1, 3--28. Google ScholarCross Ref
Gama, J. and Pinto, C. 2006. Discretization from data streams: Applications to histograms and data mining. In Proceedings of the ACM Symposium on Applied Computing (SAC'06). 662--667. Google ScholarDigital Library
Gan, G., Ma, C., and Wu, J. 2007. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability. SIAM. Google ScholarDigital Library
Gao, M.-M., Liu, J.-Z., and Gao, X.-X. 2010. Application of compound gaussian mixture model clustering in the data stream. In Proceedings of the International Conference on Computer Application and System Modeling (ICCASM'10).Google Scholar
Gonzalez, T. F. 1985. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci. 38, 293--306.Google ScholarCross Ref
Guha, S. 2009. Tight results for clustering and summarizing data streams. In Proceedings of the 12^th International Conference on Database Theory (ICDT'09). ACM Press, New York, 268--275. Google ScholarDigital Library
Guha, S., Meyerson, A., Mishra, N., Motwani, R., and O'Callaghan, L. 2003. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Engin. 15, 515--528. Google ScholarDigital Library
Guha, S., Mishra, N., Motwani, R., and O'Callaghan, L. 2000. Clustering data streams. In Proceedings of the IEEE Symposium on Foundations of Computer Science. 359--366. Google ScholarDigital Library
Hahsler, M. and Dunham, M. H. 2010. rEMM: Extensible markov model for data stream clustering in r. J. Statist. Softw. 35, 5, 1--31.Google ScholarCross Ref
Hahsler, M. and Dunham, M. H. 2011. Temporal structure learning for clustering massive data streams in real-time. In Proceedings of the SIAM Conference on Data Mining. SIAM/Omnipress, 664--675.Google Scholar
Han, J. and Kamber, M. 2000. Data Mining: Concepts and Techniques. Morgan Kaufmann. Google ScholarDigital Library
Har-Peled, S. and Mazumdar, S. 2004. On coresets for k-means and k-median clustering. In Proceedings of the 36^th Annual ACM Symposium on Theory of Computing. 291--300. Google ScholarDigital Library
Hulten, G. and Domingos, P. 2003. VFML -- A toolkit for mining high-speed time-changing data streams. Tech. rep. University of Washington. http://www.cs.washington.edu/dm/vfml/.Google Scholar
Isaksson, C., Dunham, M. H., and Hahsler, M. 2012. SOStream: Self organizing density-based clustering over data stream. In Proceedings of the 8^th International Conference on Machine Learning and Data Mining in Pattern Recognition. Lecture Notes in Computer Science, vol. 7376, Springer, 264--278. Google ScholarDigital Library
Jain, A. K. 2009. Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31, 651--666. Google ScholarDigital Library
Jiang, N. and Gruenwald, L. 2006. Research issues in data stream association rule mining. SIGMOD Rec. 35, 1, 14--19. Google ScholarDigital Library
Kaufman, L. and Rousseeuw, P. J. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Interscience.Google Scholar
Kavitha, V. and Punithavalli, M. 2010. Clustering time series data stream - A literature survey. Int. J. Comput. Sci. Inf. Secur. 8, 1, 289--294.Google Scholar
Khalilian, M. and Mustapha, N. 2010. Data stream clustering: challenges and issues. In Proceedings of International Multi Conference of Engineers and Computer Scientists. 566--569.Google Scholar
Kogan, J. 2007. Introduction to Clustering Large and High-Dimensional Data. Cambridge University Press. Google ScholarDigital Library
Kontaki, M., Papadopoulos, A. N., and Manolopoulos, Y. 2008. Continuous trend-based clustering in data streams. In Proceedings of the 10^th International Conference on Data Warehousing and Knowledge Discovery. 251--262. Google ScholarDigital Library
Kranen, P., Assent, I., Baldauf, C., and Seidl, T. 2011. The clustree: Indexing microclusters for anytime stream mining. Knowl. Inf. Syst. 29, 2, 249--272. Google ScholarDigital Library
Kremer, H., Kranen, P., Jansen, T., Seidl, T., Bifet, A., Holmes, G., and Pfahringer, B.. 2011. An effective evaluation measure for clustering on evolving data streams. In Proceedings of the 17^th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11). ACM Press, New York, 868--876. Google ScholarDigital Library
Li, Q., Ma, X., Tang, S., and Xie, S. 2011. Continuously identifying representatives out of massive streams. In Proceedings of the 7^th International Conference on Advanced Data Mining and Applications (ADMA'11). Springer, 1--14. Google ScholarDigital Library
Li, Y. and Tan, B. H. 2011. Data stream clustering algorithm based on affinity propagation and density. Advanced Materials Res. 267, 444--449.Google ScholarCross Ref
Liu, Y.-B., Cai, J.-R., Yin, J., and Fu, A. W.-C. 2008. Clustering text data streams. J. Comput. Sci. Technol. 23, 1, 112--128.Google ScholarCross Ref
Lloyd, S. P. 1982. Least squares quantization in pcm. IEEE Trans. Inf. Theory 28, 2, 129--137. Google ScholarDigital Library
Luhr, S. and Lazarescu, M. 2009. Incremental clustering of dynamic data streams using connectivity based representative points. Data Knowl. Engin. 68, 1--27. Google ScholarDigital Library
MacQueen, J. B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5^th Berkeley Symposium on Mathematical Statistics and Probability. L. M. Le Cam and J. Neyman, Eds., Vol. 1., 281--297.Google Scholar
De Maesschalck, R., Jouan-Rimbaud, D., and Massart, D. L. 2000. The mahalanobis distance. Chemometrics Intell. Laboratory Syst. 50, 1--18.Google ScholarCross Ref
Mahdiraji, A. R. 2009. Clustering data stream: A survey of algorithms. Int. J. Knowl.-Based Intell. Engin. Syst. 13, 2, 39--44. Google ScholarDigital Library
Markov, A. 1971. Extension of the limit theorems of probability theory to a sum of variables connected in a chain. In Dynamic Probabilistic Systems, Vol. 1, R. Howard, Ed., John Wiley and Sons, Chapter Appendix B, 552--577.Google Scholar
Metwally, A., Agrawal, D., and Abbadi, A. E. L. 2005. Duplicate detection in click streams. In Proceedings of the 14^th International Conference on World Wide Web. ACM Press, New York, 12--21. Google ScholarDigital Library
Meyerson, A. 2001. Online facility location. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science. 426--431. Google ScholarDigital Library
Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., and Euler, T. 2006. YALE: Rapid prototyping for complex data mining tasks. In Proceedings of the 12^th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06). L. Ungar, M. Craven, D. Gunopulos, and T. Eliassi-Rad, Eds., ACM Press, New York, 935--940. Google ScholarDigital Library
O'Callaghan, L., Mishra, N., Meyerson, A., Guha, S., and Motwani, R. 2002. Streaming data algorithms for high-quality clustering. In Proceedings of the 18^th International Conference on Data Engineering. 685--694. Google ScholarDigital Library
Oliveira, M. and Gama, J. 2010. MEC --Monitoring clusters' transitions. In Proceedings of the 5^th Starting AI Researchers Symposium. IOS Press, 212--224. Google ScholarDigital Library
Ong, K. L, Li, W., Ng, W.-K., and Lim, E.-P. 2004. SCLOPE: An algorithm for clustering data streams of categorical attributes. In Proceedings of the 6^th International Conference on Data Warehousing and Knowledge Discovery (KDD'04). 209--218.Google Scholar
Oliveira, M. D. B. and Gama, J. 2012. A framework to monitor clusters evolution applied to economy and finance problems. Intell. Data Anal. 16, 1, 93--111.Google ScholarDigital Library
Ostfeld, A., Uber, J. G., Salomons, E., et al. 2008. The battle of the water sensor networks (BWSN): A design challenge for engineers and algorithms. J. Water Resources Plan. Manag. 134, 556.Google ScholarCross Ref
Park, N. H. and Suk Lee, W. 2007. Cell trees: An adaptive synopsis structure for clustering multidimensional on-line data streams. Data Knowl. Engin. 63, 2, 528--549. Google ScholarDigital Library
Ren, J. and Ma, R. 2009. Density-based data streams clustering over sliding windows. In Proceedings of the 6^th International Conference on Fuzzy Systems and Knowledge Discovery. Vol. 5. 248--252. Google ScholarDigital Library
Rodrigues, P., Gama, J., and Pedroso, J. P. 2006. ODAC: Hierarchical clustering of time series data streams. In Proceedings of the 6^th SIAM International Conference on Data Mining. 499--503.Google Scholar
Rodrigues, P. P., Gama, J., and Pedroso, J. P. 2008. Hierarchical clustering of time-series data streams. IEEE Trans. Knowl. Data Engin. 20, 5, 615--627. Google ScholarDigital Library
Serir, L., Ramasso, E., and Zerhouni, N. 2012. Evidential evolving gustafson--kessel algorithm for online data streams partitioning using belief function theory. Int. J. Approximate Reason. 53, 5, 1--22. Google ScholarDigital Library
Shah, R., Krishnaswamy, S., and Gaber, M. M. 2005. Resource-aware very fast k-means for ubiquitous data stream mining. In Proceedings of the 2^nd International Workshop on Knowledge Discovery in Data Streams, Held in Conjunction with the 16^th European Conference on Machine Learning (ECML'05).Google Scholar
Silva, A., Chiky, R., and Hebrail, G. 2011. A clustering approach for sampling data streams in sensor networks. Knowl. Inf. Syst. 32, 1, 1--23. Google ScholarDigital Library
Silva, J. A. and Hruschka, E. R. 2011. Extending k-means-based algorithms for evolving data streams with variable number of clusters. In Proceedings of the 4^th International Conference on Machine Learning and Applications (ICMLA'11). Vol. 2. 14--19. Google ScholarDigital Library
Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., and Schult, R. 2006. MONIC: Modeling and monitoring cluster transitions. In Proceedings of the 12^th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06). ACM Press, New York, 706--711. Google ScholarDigital Library
Steinhaus, H. 1956. Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci 1, 801--804.Google Scholar
Tasoulis, D. K., Adams, N. M., and Hand, D. J. 2006. Unsupervised clustering in streaming data. In Proceedings of the 6^th IEEE International Conference on Data Mining-Workshops (ICDM'06). 638--642. Google ScholarDigital Library
Tavallaee, M., Bagheri, E., Lu, W., and Ghorbani, A. A. 2009. A detailed analysis of the kdd cup 99 data set. In Proceedings of the 2^nd IEEE International Conference on Computational Intelligence for Security and Defense Applications. 53--58. Google ScholarDigital Library
Vattani, A. 2009. K-means requires exponentially many iterations even in the plane. In Proceedings of the 25^th Annual Symposium on Computational Geometry (SCG'09). ACM Press, New York, 324--332. Google ScholarDigital Library
Wan, R., Yan, X., and Su, X. 2008. A weighted fuzzy clustering algorithm for data stream. In Proceedings of the ISECS International Colloquium on Computing, Communication, Control, and Management. 360--364. Google ScholarDigital Library
Wu, X., Kumar, V., Quinlan, J. R, Ghosh, J., Yang, Q., Motoda, H., Mclachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., and Steinberg, D. 2007. Top 10 algorithms in data mining. Knowl. Inf. Syst. 14, 1--37. Google ScholarDigital Library
Xu, R. and Wunsch, D. 2009. Clustering. Computational Intelligence Series, Wiley-IEEE Press. Google ScholarDigital Library
Yang, C. and Zhou, J. 2006. HClustream: A novel approach for clustering evolving heterogeneous data stream. In Proceedings of the 6^th IEEE International Conference on Data Mining. 682--688. Google ScholarDigital Library
Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York, 103--114. Google ScholarDigital Library
Zhang, T., Ramakrishnan, R., and Livny, M. 1997. BIRCH: A new data clustering algorithm and its applications. Data Mining Knowl. Discov. 1, 2, 141--182. Google ScholarDigital Library
Zhang, X., Sebag, M., and Germain-Renaud, C. 2009. Multi-scale real-time grid monitoring with job stream mining. In Proceedings of the 9^th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'09). 420--427. Google ScholarDigital Library
Zhang, X. and Wang, W. 2010. Self-adaptive change detection in streaming data with nonstationary distribution. In Advanced Data Mining and Applications. Springer, 1--12. Google ScholarDigital Library
Zhang, X., Zhou, X., and Hu, X. 2006. Semantic smoothing for model-based document clustering. In Proceedings of the 6^th International Conference on Data Mining (ICDM'06). 1193--1198. Google ScholarDigital Library
Zhou, A., Cao, F., Qian, W., and Jin, C. 2008. Tracking clusters in evolving data streams over sliding windows. Knowl. Inf. Syst. 15, 2, 181--214. Google ScholarDigital Library
Zhu, H., Wang, Y., and Yu, Z. 2010. Clustering of evolving data stream with multiple adaptive sliding window. In Proceedings of the International Conference on Data Storage and Data Engineering (DSDE'10). 95--100. Google ScholarDigital Library
Zhu, Y. and Shasha, D. 2002. StatStream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28^th International Conference on Very Large Data Bases (VLDB'02). VLBD Endowment, 358--369. Google ScholarDigital Library

Index Terms

Data stream clustering: A survey
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis

Recommendations

Fat node leading tree for data stream clustering with density peaks

Detecting clusters of arbitrary shape and constantly delivering the results for newly arrived items are two critical challenges in the study of data stream clustering. However, the existing clustering methods could not deal with these two problems ...
Read More
Evolving soft subspace clustering

A key challenge to most conventional clustering algorithms in handling many real world problems is that, data points in different clusters are often correlated with different subsets of features. To address this problem, subspace clustering has ...
Read More
Data stream clustering: a review
Abstract
Number of connected devices is steadily increasing and these devices continuously generate data streams. Real-time processing of data streams is arousing interest despite many challenges. Clustering is one of the most suitable methods for real-...
Read More

Reviews

Reviewer: Fazli Can

Data streams, which are continuously generated at uncontrollable, rapid rates, are a relatively new form of temporal data object. Data stream clustering (DSC) is an unsupervised data mining process that looks for hidden patterns in dynamic, massive, and unbounded datasets. Possible DSC applications include analysis of ATM transactions, computer network traffic, meteorological observations, and stock market transactions. In such applications, for example, one may want to find new trends and identify followers. The authors present various algorithms. They first provide an overview, followed by the data structures used for summarization, the time window models used for efficiency, outlier detection mechanisms that distinguish outliers from cluster evolution, and standard clustering algorithms used for actual clustering based on the output of the summarization step. In the later sections, the paper discusses temporal aspects that need to be considered during clustering, the assessment of the generated clustering structures, practical applications, software packages available for data stream clustering, challenges of the problem environment, and future research pointers. The paper is rounded out with a taxonomy that helps to distinguish the analyzed algorithms with respect to their important features. This is a useful paper for researchers, with a rich list of references (despite a typographical error in the reference for Faria et al., 2012). The ACM Digital Library provides online access to a short electronic appendix related to the paper with the time complexity analysis of the presented algorithms. The paper is comprehensive and nicely written. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Computing Surveys Volume 46, Issue 1
October 2013
551 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/2522968
Issue’s Table of Contents

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 11 July 2013
- Accepted: 1 January 2013
- Revised: 1 September 2012
- Received: 1 June 2012
Published in csur Volume 46, Issue 1

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Data stream clustering
online clustering
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 384
  Total Citations
  View Citations
- 7,249
  Total Downloads
- Downloads (Last 12 months)326
- Downloads (Last 6 weeks)40
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Data stream clustering: A survey

ACM Computing Surveys

Abstract

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

Fat node leading tree for data stream clustering with density peaks

Evolving soft subspace clustering

Data stream clustering: a review

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Data stream clustering: A survey

ACM Computing Surveys

Abstract

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

Fat node leading tree for data stream clustering with density peaks

Evolving soft subspace clustering

Data stream clustering: a review

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media