Skip to main content
Erschienen in: Knowledge and Information Systems 1/2014

01.10.2014 | Regular Paper

Clustering data streams using grid-based synopsis

verfasst von: Vasudha Bhatnagar, Sharanjit Kaur, Sharma Chakravarthy

Erschienen in: Knowledge and Information Systems | Ausgabe 1/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Continually advancing technology has made it feasible to capture data online for onward transmission as a steady flow of newly generated data points, termed as data stream. Continuity and unboundedness of data streams make storage of data and multiple scans of data an impractical proposition for the purpose of knowledge discovery. Need to learn structures from data in streaming environment has been a driving force for making clustering a popular technique for knowledge discovery from data streams. Continuous nature of streaming data makes it infeasible to look for point membership among the clusters discovered so far, necessitating employment of a synopsis structure to consolidate incoming data points. This synopsis is exploited for building clustering scheme to meet subsequent user demands. The proposed Exclusive and Complete Clustering (ExCC) algorithm captures non-overlapping clusters in data streams with mixed attributes, such that each point either belongs to some cluster or is an outlier/noise. The algorithm is robust, adaptive to changes in data distribution and detects succinct outliers on-the-fly. It deploys a fixed granularity grid structure as synopsis and performs clustering by coalescing dense regions in grid. Speed-based pruning is applied to synopsis prior to clustering to ensure currency of discovered clusters. Extensive experimentation demonstrates that the algorithm is robust, identifies succinct outliers on-the-fly and is adaptive to change in the data distribution. ExCC algorithm is further evaluated for performance and compared with other contemporary algorithms.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Fußnoten
1
Applications like patient monitoring and drug consumption analysis in medical data; sensor networks in seismic studies, mines work in bounded data space.
 
2
Cell is also referred to as grid by some authors, e.g., [47].
 
Literatur
1.
Zurück zum Zitat Abraham A, Hassanien AE, de Leon Ferreira de Carvalho ACP, Snásel V (eds) (2009) Foundations of computational intelligence: data mining, Studies in Computational Intelligence, vol 206. Springer, Berlin Abraham A, Hassanien AE, de Leon Ferreira de Carvalho ACP, Snásel V (eds) (2009) Foundations of computational intelligence: data mining, Studies in Computational Intelligence, vol 206. Springer, Berlin
2.
Zurück zum Zitat Aggarwal CC (ed) (2007) Data streams: models and algorithms. Springer, New York, ISBN 978-0-387-28759-1 Aggarwal CC (ed) (2007) Data streams: models and algorithms. Springer, New York, ISBN 978-0-387-28759-1
3.
Zurück zum Zitat Aggarwal CC, Yu PS (2006) Framework for clustering massive text and categorical data streams. In: Proceedings of the sixth SIAM international conference on data mining, pp 447–471 Aggarwal CC, Yu PS (2006) Framework for clustering massive text and categorical data streams. In: Proceedings of the sixth SIAM international conference on data mining, pp 447–471
4.
Zurück zum Zitat Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of international conference on very large databases, pp 81–92 Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of international conference on very large databases, pp 81–92
5.
Zurück zum Zitat Aggarwal CC, Han J, Yu PS (2004) A framework for projected clustering of high dimensional data streams. In: Proceedings of the thirtieth international conference on very large databases, Morgan Kaufmann, Burlington pp 852–863 Aggarwal CC, Han J, Yu PS (2004) A framework for projected clustering of high dimensional data streams. In: Proceedings of the thirtieth international conference on very large databases, Morgan Kaufmann, Burlington pp 852–863
6.
Zurück zum Zitat Agrawal R, Gehrke J, Gunopolos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining application. In: Proceedings of the ACM SIGMOD international conference on management of data, ACM Press Agrawal R, Gehrke J, Gunopolos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining application. In: Proceedings of the ACM SIGMOD international conference on management of data, ACM Press
7.
Zurück zum Zitat Bailey DG, Johnston CT (2007) Single pass connected components analysis. In: Proceedings of image and vision computing, IEEE. New Zealand Bailey DG, Johnston CT (2007) Single pass connected components analysis. In: Proceedings of image and vision computing, IEEE. New Zealand
8.
Zurück zum Zitat Bailey DG, Johnston CT, Ma N (2008) Connected components analysis of streamed images. In: Proceedings of the international conference on field programmable logic and applications. ACM Press, pp 679–682 Bailey DG, Johnston CT, Ma N (2008) Connected components analysis of streamed images. In: Proceedings of the international conference on field programmable logic and applications. ACM Press, pp 679–682
9.
Zurück zum Zitat Barbára D (2002) Requirements of clustering data streams. SIGKDD Explor 3(2):23–27CrossRef Barbára D (2002) Requirements of clustering data streams. SIGKDD Explor 3(2):23–27CrossRef
10.
Zurück zum Zitat Barbára D, Chen P (2001) Tracking clusters in evolving data sets. In: Proceedings of FLAIRS 2001, special track on knowledge discovery and data mining Barbára D, Chen P (2001) Tracking clusters in evolving data sets. In: Proceedings of FLAIRS 2001, special track on knowledge discovery and data mining
12.
Zurück zum Zitat Bhatnagar V, Kaur S (2007) Exclusive and complete clustering of streams. In: Proceedings of the eighteenth international conference on database and expert systems applications. Germany, pp 629–638 Bhatnagar V, Kaur S (2007) Exclusive and complete clustering of streams. In: Proceedings of the eighteenth international conference on database and expert systems applications. Germany, pp 629–638
14.
Zurück zum Zitat Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the sixth SIAM international conference on data mining. Maryland, pp 326–337 Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the sixth SIAM international conference on data mining. Maryland, pp 326–337
15.
Zurück zum Zitat Chen Y, Tu L (2007) Density-based clustering for real-ime stream data. In: Proceedings of the thirteenth international conference on knowledge discovery and data mining. ACM, San Jose Chen Y, Tu L (2007) Density-based clustering for real-ime stream data. In: Proceedings of the thirteenth international conference on knowledge discovery and data mining. ACM, San Jose
16.
Zurück zum Zitat Cheng CH, Fu AW, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, USA, pp 84–93 Cheng CH, Fu AW, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, USA, pp 84–93
17.
Zurück zum Zitat Dang XH, Lee VCS, Ng WK, Ong KL (2009) Incremental and adaptive clustering stream data over sliding window. In: Proceedings of 20th international conference on database and expert systems applications, Austria, pp 660–674 Dang XH, Lee VCS, Ng WK, Ong KL (2009) Incremental and adaptive clustering stream data over sliding window. In: Proceedings of 20th international conference on database and expert systems applications, Austria, pp 660–674
18.
Zurück zum Zitat Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining. AAAI Press, Portland, pp 226–231 Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining. AAAI Press, Portland, pp 226–231
19.
Zurück zum Zitat Gaber MM, Siddiqi AM (2010) Distributed data stream classification for wireless sensor networks. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, Sierre, pp 1629–1630 Gaber MM, Siddiqi AM (2010) Distributed data stream classification for wireless sensor networks. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, Sierre, pp 1629–1630
20.
Zurück zum Zitat Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. SIGMOD Rec 3:18–26CrossRef Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. SIGMOD Rec 3:18–26CrossRef
21.
Zurück zum Zitat Gao J, Li J, Zhang Z, Tan PN (2005) An incremental data stream clustering algorithms based on dense units detection. In: Proceedings of the ninth Pacific-Asia conference on advances in knowledge discovery and data mining. Springer, Berlin, pp 420–425 Gao J, Li J, Zhang Z, Tan PN (2005) An incremental data stream clustering algorithms based on dense units detection. In: Proceedings of the ninth Pacific-Asia conference on advances in knowledge discovery and data mining. Springer, Berlin, pp 420–425
22.
Zurück zum Zitat Garofalakis M, Gehrke J, Rastogi R (2002) Querying and mining data streams: you only get one look. In: Tutorial notes of the 28th international conference on very large databases, China Garofalakis M, Gehrke J, Rastogi R (2002) Querying and mining data streams: you only get one look. In: Tutorial notes of the 28th international conference on very large databases, China
23.
Zurück zum Zitat Guha S, Mishra N, Motwani R, O’Callaghan L (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of IEEE international conference on data engineering. IEEE Computer Society, USA Guha S, Mishra N, Motwani R, O’Callaghan L (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of IEEE international conference on data engineering. IEEE Computer Society, USA
24.
Zurück zum Zitat Gupta C, Grossman RL (2007) Outlier detection with streaming dyadic decomposition. In: Proceedings of the seventh industrial conference on data mining. Springer, Berlin (Heidelberg), pp 77–91 Gupta C, Grossman RL (2007) Outlier detection with streaming dyadic decomposition. In: Proceedings of the seventh industrial conference on data mining. Springer, Berlin (Heidelberg), pp 77–91
25.
Zurück zum Zitat Han J, Kamber M (2005) Data mining concepts and techniques, 2nd edn. Morgan-Kaufmann, San Francisco Han J, Kamber M (2005) Data mining concepts and techniques, 2nd edn. Morgan-Kaufmann, San Francisco
26.
Zurück zum Zitat He Z, Xu X, Deng S, Huang JZ (2004) Clustering categorical data streams. Comput Res Repos. abs/cs/0412058 He Z, Xu X, Deng S, Huang JZ (2004) Clustering categorical data streams. Comput Res Repos. abs/cs/0412058
27.
Zurück zum Zitat Hinneburg A, Keim DA (1999) Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th international conference on very large databases. Scotland, pp 506–517 Hinneburg A, Keim DA (1999) Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th international conference on very large databases. Scotland, pp 506–517
28.
Zurück zum Zitat Hirsh H (2008) Data mining research: current status and future oppourtunities. Stat Anal Data Min 1(2):104–107CrossRefMathSciNet Hirsh H (2008) Data mining research: current status and future oppourtunities. Stat Anal Data Min 1(2):104–107CrossRefMathSciNet
29.
Zurück zum Zitat Hsu CC, Huang YP (2008) Incremental clustering of mixed data based on distance hierarchy. Expert Syst Appl 35(3):1177–1185CrossRef Hsu CC, Huang YP (2008) Incremental clustering of mixed data based on distance hierarchy. Expert Syst Appl 35(3):1177–1185CrossRef
30.
Zurück zum Zitat Hu X, Zhang X, Lu C, Park EK, Zhou X (2009) Exploiting Wikipedia as external knowledge for document clustering. In: Proceedings of the fifteenth international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp 389–396 Hu X, Zhang X, Lu C, Park EK, Zhou X (2009) Exploiting Wikipedia as external knowledge for document clustering. In: Proceedings of the fifteenth international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp 389–396
31.
Zurück zum Zitat Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef
32.
Zurück zum Zitat Jia C, Tan C, Yong A (2008) A grid and density-based clustering algorithm for processing data stream. In: Proceedings of the second international conference on genetic and evolutionary computing. IEEE Computer Society, Los Alamitos, CA, USA Jia C, Tan C, Yong A (2008) A grid and density-based clustering algorithm for processing data stream. In: Proceedings of the second international conference on genetic and evolutionary computing. IEEE Computer Society, Los Alamitos, CA, USA
33.
Zurück zum Zitat Kailing K, Kriegel HP, Kroger P (2004) Density-connected subspace clustering for high-dimensional data. In: Proceedings of the fourth SIAM international conference on data mining, pp 246–257 Kailing K, Kriegel HP, Kroger P (2004) Density-connected subspace clustering for high-dimensional data. In: Proceedings of the fourth SIAM international conference on data mining, pp 246–257
34.
Zurück zum Zitat Kriegel H, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. Trans Knowl Discov Data 3(1):1–58CrossRef Kriegel H, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. Trans Knowl Discov Data 3(1):1–58CrossRef
35.
Zurück zum Zitat Lee JW, Lee WS (2008) A coarse-grain grid-based subspace clustering method for online muti-dimensional data streams. In: Proceedings of the seventeenth ACM conference on information and knowledge management. ACM Press Lee JW, Lee WS (2008) A coarse-grain grid-based subspace clustering method for online muti-dimensional data streams. In: Proceedings of the seventeenth ACM conference on information and knowledge management. ACM Press
36.
Zurück zum Zitat Lu Y, Sun Y, Xu G, Liu G (2005) A grid-based clustering algorithm for high-dimensional data streams. In: Proceedings of the first international conference on advanced data mining application. China Lu Y, Sun Y, Xu G, Liu G (2005) A grid-based clustering algorithm for high-dimensional data streams. In: Proceedings of the first international conference on advanced data mining application. China
37.
Zurück zum Zitat Luhr S, Lazarescu M (2009) Incremental clustering of dynamic data streams using connectivity-based representative points. Data Knowl Eng 68:1–27CrossRef Luhr S, Lazarescu M (2009) Incremental clustering of dynamic data streams using connectivity-based representative points. Data Knowl Eng 68:1–27CrossRef
38.
Zurück zum Zitat Moise G, Zimek A, Kröger P, Kriegel H, Sander J (2009) Subspace and projected clustering: experimental evaluation and analysis. Knowl Inf Syst 21:299–326CrossRef Moise G, Zimek A, Kröger P, Kriegel H, Sander J (2009) Subspace and projected clustering: experimental evaluation and analysis. Knowl Inf Syst 21:299–326CrossRef
39.
Zurück zum Zitat Motoyoshi M, Miura T, Shioya I (2004) Clustering stream data by regression analysis. In: Proceedings of Australasian workshop on data mining and web intelligence, vol 32. New Zealand, pp 115–120 Motoyoshi M, Miura T, Shioya I (2004) Clustering stream data by regression analysis. In: Proceedings of Australasian workshop on data mining and web intelligence, vol 32. New Zealand, pp 115–120
40.
Zurück zum Zitat Orlowska ME, Sun X, Li X (2006) Can exclusive clustering on streaming data be achieved? SIGKDD Explor 8:102–108CrossRef Orlowska ME, Sun X, Li X (2006) Can exclusive clustering on streaming data be achieved? SIGKDD Explor 8:102–108CrossRef
41.
Zurück zum Zitat Park NH, Lee WS (2004) Statistical grid-based clustering over data streams. ACM SIGMOD Rec 33:32–37CrossRef Park NH, Lee WS (2004) Statistical grid-based clustering over data streams. ACM SIGMOD Rec 33:32–37CrossRef
42.
Zurück zum Zitat Park NH, Lee WS (2007) Cell trees: an adaptive synopsis structure for clustering multi-dimensional on-line data streams. J Data Knowl Eng 63:528–549CrossRef Park NH, Lee WS (2007) Cell trees: an adaptive synopsis structure for clustering multi-dimensional on-line data streams. J Data Knowl Eng 63:528–549CrossRef
43.
Zurück zum Zitat Ruiz C, Menasalvas E, Spiliopoulou M (2009) C-DenStream: using domain knowledge on a data stream. Discovery Science. Lecture notes in computer science, vol 5808. Springer, Berlin, pp 287–301 Ruiz C, Menasalvas E, Spiliopoulou M (2009) C-DenStream: using domain knowledge on a data stream. Discovery Science. Lecture notes in computer science, vol 5808. Springer, Berlin, pp 287–301
44.
Zurück zum Zitat Song M, Wang H (2004) Incremental estimation of Gaussian mixture models for online data stream clustering. In: Proceedings of the international conference on bioinformatics and its applications Song M, Wang H (2004) Incremental estimation of Gaussian mixture models for online data stream clustering. In: Proceedings of the international conference on bioinformatics and its applications
45.
Zurück zum Zitat Tan P, Steinbach M, Kumar V (2007) Introduction to data mining, Pearson Education, ISBN 978-81-317-1472-0 Tan P, Steinbach M, Kumar V (2007) Introduction to data mining, Pearson Education, ISBN 978-81-317-1472-0
46.
Zurück zum Zitat Tasoulis DK, Adams NM, Hand DJ (2006) Unsupervised clustering in streaming data. In: Proceedings of the sixth IEEE international conference on data mining—Workshops, pp 638–642 Tasoulis DK, Adams NM, Hand DJ (2006) Unsupervised clustering in streaming data. In: Proceedings of the sixth IEEE international conference on data mining—Workshops, pp 638–642
47.
Zurück zum Zitat Tu L, Chen Y (2009) Stream data clustering based on grid density and attraction. ACM Trans Knowl Discov Data 3(3):1–27CrossRef Tu L, Chen Y (2009) Stream data clustering based on grid density and attraction. ACM Trans Knowl Discov Data 3(3):1–27CrossRef
50.
Zurück zum Zitat Wan L, Ng WK, Dang XH, Yu PS, Zhang K (2009) Density-based clustering of data streams at multiple resolutions. ACM Trans Knowl Discov Data 3(3):1–28CrossRef Wan L, Ng WK, Dang XH, Yu PS, Zhang K (2009) Density-based clustering of data streams at multiple resolutions. ACM Trans Knowl Discov Data 3(3):1–28CrossRef
51.
Zurück zum Zitat Wang S, Fan Y, Zhang C, Xu H, Hao X, Hu Y (2008) Entropy-based clustering of data streams with mixed numeric and categorical values In: Proceedings of the seventh IEEE/ACIS international conference on computer and information science, pp 140–145 Wang S, Fan Y, Zhang C, Xu H, Hao X, Hu Y (2008) Entropy-based clustering of data streams with mixed numeric and categorical values In: Proceedings of the seventh IEEE/ACIS international conference on computer and information science, pp 140–145
52.
Zurück zum Zitat Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678CrossRef Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678CrossRef
53.
Zurück zum Zitat Yanchang Z, Junde S (2001) GDILC: a grid-based density-isoline clustering algorithm. In: Proceedings of international conference on Info-tech and Info-net, IEEE. pp 140–145 Yanchang Z, Junde S (2001) GDILC: a grid-based density-isoline clustering algorithm. In: Proceedings of international conference on Info-tech and Info-net, IEEE. pp 140–145
54.
Zurück zum Zitat Yanchang Z, Junde S (2003) AGRID: an efficient algorithm for clustering large high-dimensional datasets. In: Proceedings of the seventh Pacific-Asia international conference an advances in knowledge discovery and data mining. Springer, Berlin, pp 271–282 Yanchang Z, Junde S (2003) AGRID: an efficient algorithm for clustering large high-dimensional datasets. In: Proceedings of the seventh Pacific-Asia international conference an advances in knowledge discovery and data mining. Springer, Berlin, pp 271–282
55.
Zurück zum Zitat Yapa RD, Koichi H (2007) A connected component labeling algorithm for grayscale images and application of the algorithm on mammograms. In: Proceedings of the symposium on applied computing. ACM, Seoul Yapa RD, Koichi H (2007) A connected component labeling algorithm for grayscale images and application of the algorithm on mammograms. In: Proceedings of the symposium on applied computing. ACM, Seoul
56.
Zurück zum Zitat Zhou A, Cai Z, Wei L, Qian W (2003) M-kernel merging: towards density estimation over data streams. In: Proceedings of the eighth international conference on database systems for advanced applications. Japan, pp 285–292 Zhou A, Cai Z, Wei L, Qian W (2003) M-kernel merging: towards density estimation over data streams. In: Proceedings of the eighth international conference on database systems for advanced applications. Japan, pp 285–292
Metadaten
Titel
Clustering data streams using grid-based synopsis
verfasst von
Vasudha Bhatnagar
Sharanjit Kaur
Sharma Chakravarthy
Publikationsdatum
01.10.2014
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 1/2014
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-013-0659-1

Weitere Artikel der Ausgabe 1/2014

Knowledge and Information Systems 1/2014 Zur Ausgabe

Premium Partner