Skip to main content
Erschienen in: Knowledge and Information Systems 3/2014

01.09.2014 | Regular Paper

Efficient clustering of uncertain data streams

verfasst von: Cheqing Jin, Jeffrey Xu Yu, Aoying Zhou, Feng Cao

Erschienen in: Knowledge and Information Systems | Ausgabe 3/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Clustering uncertain data streams has recently become one of the most challenging tasks in data management because of the strict space and time requirements of processing tuples arriving at high speed and the difficulty that arises from handling uncertain data. The prior work on clustering data streams focuses on devising complicated synopsis data structures to summarize data streams into a small number of micro-clusters so that important statistics can be computed conveniently, such as Clustering Feature (CF) (Zhang et al. in Proceedings of ACM SIGMOD, pp 103–114, 1996) for deterministic data and Error-based Clustering Feature (ECF) (Aggarwal and Yu in Proceedings of ICDE, 2008) for uncertain data. However, ECF can only handle attribute-level uncertainty, while existential uncertainty, the other kind of uncertainty, has not been addressed yet. In this paper, we propose a novel data structure, Uncertain Feature (UF), to summarize data streams with both kinds of uncertainties: UF is space-efficient, has additive and subtractive properties, and can compute complicated statistics easily. Our first attempt aims at enhancing the previous streaming approaches to handle the sliding-window model by using UF instead of old synopses, inclusive of CluStream (Aggarwal et al. in Proceedings of VLDB, 2003) and UMicro (Aggarwal and Yu in Proceedings of ICDE, 2008). We show that such methods cannot achieve high efficiency. Our second attempt aims at devising a novel algorithm, cluUS , to handle the sliding-window model by using UF structure. Detailed analysis and thorough experimental reports on synthetic and real data sets confirm the advantages of our proposed method.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
2
We use \({\overrightarrow{x}}_i\) to denote a deterministic tuple in Sect. 2.1 and an uncertain tuple from Sect. 2.2, respectively.
 
3
We call a tuple absorbed by a micro-cluster only if it joins in the micro-cluster.
 
Literatur
1.
Zurück zum Zitat Aggarwal CC (2009) Managing and mining uncertain data. Springer, Berlin Aggarwal CC (2009) Managing and mining uncertain data. Springer, Berlin
2.
Zurück zum Zitat Aggarwal CC (2009) On high dimensional projected clustering of uncertain data streams. In: Proceedings of ICDE, pp. 1152–1154 Aggarwal CC (2009) On high dimensional projected clustering of uncertain data streams. In: Proceedings of ICDE, pp. 1152–1154
3.
Zurück zum Zitat Alex N, Hasenfuss A, Hammer B (2009) Patch clustering for massive data sets. Neurocomputing 72:1455–1469CrossRef Alex N, Hasenfuss A, Hammer B (2009) Patch clustering for massive data sets. Neurocomputing 72:1455–1469CrossRef
4.
Zurück zum Zitat Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of VLDB Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of VLDB
5.
Zurück zum Zitat Aggarwal CC, Yu PS (2008) A framework for clustering uncertain data streams. In: Proceedings of ICDE Aggarwal CC, Yu PS (2008) A framework for clustering uncertain data streams. In: Proceedings of ICDE
6.
Zurück zum Zitat Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of ACM SIGACT-SIGMOD symposium on principles of database systems Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of ACM SIGACT-SIGMOD symposium on principles of database systems
7.
Zurück zum Zitat Burdick D, Deshpande PM, Jayram T, Ramakrishnan R, Vaithyanathan S (2005) OLAP over uncertain and imprecise data. In: Proceedings of VLDB Burdick D, Deshpande PM, Jayram T, Ramakrishnan R, Vaithyanathan S (2005) OLAP over uncertain and imprecise data. In: Proceedings of VLDB
8.
Zurück zum Zitat Babcock B, Datar M, Motwani R, O’Callaghan L (2003) Maintaining variance and k-medians over data stream windows. In: Proceedings of ACM PODS Babcock B, Datar M, Motwani R, O’Callaghan L (2003) Maintaining variance and k-medians over data stream windows. In: Proceedings of ACM PODS
9.
Zurück zum Zitat Benjelloun O, Sarma AD, Halevy AY, Widom J (2006) Uldbs: databases with uncertainty and lineage. In: Proceedings of VLDB Benjelloun O, Sarma AD, Halevy AY, Widom J (2006) Uldbs: databases with uncertainty and lineage. In: Proceedings of VLDB
10.
Zurück zum Zitat Chau M, Cheng R, Kao B, Ng J (2006) Uncertain data mining: An example in clustering location data. In: Proceedings of PAKDD Chau M, Cheng R, Kao B, Ng J (2006) Uncertain data mining: An example in clustering location data. In: Proceedings of PAKDD
11.
Zurück zum Zitat Cormode G, Garofalakis M (2007) Sketching probabilistic data streams. In: Proceedings of ACM SIGMOD Cormode G, Garofalakis M (2007) Sketching probabilistic data streams. In: Proceedings of ACM SIGMOD
12.
Zurück zum Zitat Cormode G, McGregor A (2008) Approximation algorithms for clustering uncertain data. In: Proceedings of PODS Cormode G, McGregor A (2008) Approximation algorithms for clustering uncertain data. In: Proceedings of PODS
13.
Zurück zum Zitat Dunn JC (1973) A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. J Cybern 3:32–57CrossRefMATHMathSciNet Dunn JC (1973) A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. J Cybern 3:32–57CrossRefMATHMathSciNet
14.
Zurück zum Zitat Dalvi NN, Suciu D (2004) Efficient query evaluation on probabilistic databases. In: Proceedings of ICDE Dalvi NN, Suciu D (2004) Efficient query evaluation on probabilistic databases. In: Proceedings of ICDE
15.
Zurück zum Zitat Dalvi NN, Suciu D (2007) Management of probabilistic data foundations and challenges. In: Proceedings of ACM PODS Dalvi NN, Suciu D (2007) Management of probabilistic data foundations and challenges. In: Proceedings of ACM PODS
16.
Zurück zum Zitat Ester M, Kriegel H-P, Sander J, Xu X (1996) Density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD Ester M, Kriegel H-P, Sander J, Xu X (1996) Density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD
17.
Zurück zum Zitat Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of SIGMOD, pp 73–84 Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of SIGMOD, pp 73–84
19.
Zurück zum Zitat Jain A, Dubes R (1988) Algorithms for clustering data. Prentice Hall, New Jersey Jain A, Dubes R (1988) Algorithms for clustering data. Prentice Hall, New Jersey
20.
Zurück zum Zitat Jayram T, Kale S, Vee E (2007) Efficient aggregation algorithms for probabilistic data. In: Proceedings of SODA Jayram T, Kale S, Vee E (2007) Efficient aggregation algorithms for probabilistic data. In: Proceedings of SODA
21.
Zurück zum Zitat Jin C, Yi K, Chen L, Yu JX, Lin X (2008) Sliding-window top-k queries on uncertain streams. Proc VLDB Endow 1(1):301–312CrossRef Jin C, Yi K, Chen L, Yu JX, Lin X (2008) Sliding-window top-k queries on uncertain streams. Proc VLDB Endow 1(1):301–312CrossRef
22.
Zurück zum Zitat Kao B, Lee SD, Cheung DW, Ho W-S, Chan KF (2008) Clustering uncertain data using Voronoi diagrams. In: Proceedings of ICDM, pp 333–342 Kao B, Lee SD, Cheung DW, Ho W-S, Chan KF (2008) Clustering uncertain data using Voronoi diagrams. In: Proceedings of ICDM, pp 333–342
23.
Zurück zum Zitat Kriegel H-P, Pfeifle M (2005) Density-based clustering of uncertain data. In: Proceedings of KDD Kriegel H-P, Pfeifle M (2005) Density-based clustering of uncertain data. In: Proceedings of KDD
24.
Zurück zum Zitat Kriegel H-P, Pfeifle M (2005) Hierarchical density-based clustering of uncertain data. In: Proceedings of ICDM Kriegel H-P, Pfeifle M (2005) Hierarchical density-based clustering of uncertain data. In: Proceedings of ICDM
25.
Zurück zum Zitat Lee SD, Kao B, Cheng R (2007) Reducing uk-means to k-means. In: Proceedingd of ICDM workshops, pp 483–488 Lee SD, Kao B, Cheng R (2007) Reducing uk-means to k-means. In: Proceedingd of ICDM workshops, pp 483–488
26.
Zurück zum Zitat Ngai WK, Kao B, Chui CK, Cheng R, Chau M, Yip KY (2006) Efficient clustering of uncertain data. In: Proceedings of ICDM Ngai WK, Kao B, Chui CK, Cheng R, Chau M, Yip KY (2006) Efficient clustering of uncertain data. In: Proceedings of ICDM
27.
Zurück zum Zitat O’Callaghan L, Mishra N, Meyerson A, Guha S (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of ICDE O’Callaghan L, Mishra N, Meyerson A, Guha S (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of ICDE
28.
Zurück zum Zitat Pelekis N, Kopanakis I, Kotsifakos EK, Frentzos E, Theodoridis Y (2011) clustering uncertain trajectories. Knowl Inf Syst 28(1):117–147CrossRef Pelekis N, Kopanakis I, Kotsifakos EK, Frentzos E, Theodoridis Y (2011) clustering uncertain trajectories. Knowl Inf Syst 28(1):117–147CrossRef
29.
Zurück zum Zitat Tao Y, Cheng R, Xiao X, Ngai WK, Kao B, Prabhakar S (2005) Indexing multi-dimensional uncertain data with arbitrary probability density functions. In: Proceedings of VLDB Tao Y, Cheng R, Xiao X, Ngai WK, Kao B, Prabhakar S (2005) Indexing multi-dimensional uncertain data with arbitrary probability density functions. In: Proceedings of VLDB
30.
Zurück zum Zitat Xin D, Halevy AY, Yu C (2007) Data integration with uncertainty. In: Proceedings of VLDB Xin D, Halevy AY, Yu C (2007) Data integration with uncertainty. In: Proceedings of VLDB
31.
Zurück zum Zitat Zhang M, Chen S, Jensen CS, Ooi BC, Zhang Z (2009) Effectively indexing uncertain moving objects for predictive queries. In: Proceedings of VLDB Zhang M, Chen S, Jensen CS, Ooi BC, Zhang Z (2009) Effectively indexing uncertain moving objects for predictive queries. In: Proceedings of VLDB
32.
Zurück zum Zitat Zhang Q, Li F, Yi K (2008) Finding frequent items in probabilistic data. In: Proceedings of SIGMOD Zhang Q, Li F, Yi K (2008) Finding frequent items in probabilistic data. In: Proceedings of SIGMOD
33.
Zurück zum Zitat Zhang W, Lin X, Zhang Y, Wang W, Yu JX (2009) Probilistic skyline operator over sliding windows. In: Proceedings of ICDE Zhang W, Lin X, Zhang Y, Wang W, Yu JX (2009) Probilistic skyline operator over sliding windows. In: Proceedings of ICDE
34.
Zurück zum Zitat Zhang Y, Lin X, ZHU G, Zhang W, Lin Q (2010) Efficient rank based knn query processing over uncertain data. In: Proceedings of ICDE Zhang Y, Lin X, ZHU G, Zhang W, Lin Q (2010) Efficient rank based knn query processing over uncertain data. In: Proceedings of ICDE
35.
Zurück zum Zitat Zhang T, Ramakrishnan R, Livnya M (1996) Birch: an efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD, pp 103–114 Zhang T, Ramakrishnan R, Livnya M (1996) Birch: an efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD, pp 103–114
Metadaten
Titel
Efficient clustering of uncertain data streams
verfasst von
Cheqing Jin
Jeffrey Xu Yu
Aoying Zhou
Feng Cao
Publikationsdatum
01.09.2014
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 3/2014
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-013-0657-3

Weitere Artikel der Ausgabe 3/2014

Knowledge and Information Systems 3/2014 Zur Ausgabe