nach oben

Knowledge and Information Systems

Erschienen in:

01.09.2014 | Regular Paper

Efficient clustering of uncertain data streams

verfasst von: Cheqing Jin, Jeffrey Xu Yu, Aoying Zhou, Feng Cao

Erschienen in: Knowledge and Information Systems | Ausgabe 3/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Clustering uncertain data streams has recently become one of the most challenging tasks in data management because of the strict space and time requirements of processing tuples arriving at high speed and the difficulty that arises from handling uncertain data. The prior work on clustering data streams focuses on devising complicated synopsis data structures to summarize data streams into a small number of micro-clusters so that important statistics can be computed conveniently, such as Clustering Feature (CF) (Zhang et al. in Proceedings of ACM SIGMOD, pp 103–114, 1996) for deterministic data and Error-based Clustering Feature (ECF) (Aggarwal and Yu in Proceedings of ICDE, 2008) for uncertain data. However, ECF can only handle attribute-level uncertainty, while existential uncertainty, the other kind of uncertainty, has not been addressed yet. In this paper, we propose a novel data structure, Uncertain Feature (UF), to summarize data streams with both kinds of uncertainties: UF is space-efficient, has additive and subtractive properties, and can compute complicated statistics easily. Our first attempt aims at enhancing the previous streaming approaches to handle the sliding-window model by using UF instead of old synopses, inclusive of CluStream (Aggarwal et al. in Proceedings of VLDB, 2003) and UMicro (Aggarwal and Yu in Proceedings of ICDE, 2008). We show that such methods cannot achieve high efficiency. Our second attempt aims at devising a novel algorithm, cluUS , to handle the sliding-window model by using UF structure. Detailed analysis and thorough experimental reports on synthetic and real data sets confirm the advantages of our proposed method.

Vorheriger Artikel Recurrent concepts in data streams classification

Nächster Artikel Integration of artificial immune network and K-means for cluster analysis

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Nur mit Berechtigung zugänglich

http://nsidc.org/data/g00807.html.

We use \({\overrightarrow{x}}_i\) to denote a deterministic tuple in Sect. 2.1 and an uncertain tuple from Sect. 2.2, respectively.

We call a tuple absorbed by a micro-cluster only if it joins in the micro-cluster.

http://nsidc.org/data/g00807.html.

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

http://kdd.ics.uci.edu/databases/covertype/covertype.html.

http://mathworld.wolfram.com/NormalDistribution.html.

Aggarwal CC (2009) Managing and mining uncertain data. Springer, Berlin

Aggarwal CC (2009) On high dimensional projected clustering of uncertain data streams. In: Proceedings of ICDE, pp. 1152–1154

Alex N, Hasenfuss A, Hammer B (2009) Patch clustering for massive data sets. Neurocomputing 72:1455–1469CrossRef

Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of VLDB

Aggarwal CC, Yu PS (2008) A framework for clustering uncertain data streams. In: Proceedings of ICDE

Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of ACM SIGACT-SIGMOD symposium on principles of database systems

Burdick D, Deshpande PM, Jayram T, Ramakrishnan R, Vaithyanathan S (2005) OLAP over uncertain and imprecise data. In: Proceedings of VLDB

Babcock B, Datar M, Motwani R, O’Callaghan L (2003) Maintaining variance and k-medians over data stream windows. In: Proceedings of ACM PODS

Benjelloun O, Sarma AD, Halevy AY, Widom J (2006) Uldbs: databases with uncertainty and lineage. In: Proceedings of VLDB

10.

Chau M, Cheng R, Kao B, Ng J (2006) Uncertain data mining: An example in clustering location data. In: Proceedings of PAKDD

11.

Cormode G, Garofalakis M (2007) Sketching probabilistic data streams. In: Proceedings of ACM SIGMOD

12.

Cormode G, McGregor A (2008) Approximation algorithms for clustering uncertain data. In: Proceedings of PODS

13.

Dunn JC (1973) A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. J Cybern 3:32–57CrossRefMATHMathSciNet

14.

Dalvi NN, Suciu D (2004) Efficient query evaluation on probabilistic databases. In: Proceedings of ICDE

15.

Dalvi NN, Suciu D (2007) Management of probabilistic data foundations and challenges. In: Proceedings of ACM PODS

16.

Ester M, Kriegel H-P, Sander J, Xu X (1996) Density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD

17.

Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of SIGMOD, pp 73–84

18.

Hochbaum D, Shmoys D (1985) A best possible heuristic for the k-center problem. Math Oper Res 10(2):180–184CrossRefMATHMathSciNet

19.

Jain A, Dubes R (1988) Algorithms for clustering data. Prentice Hall, New Jersey

20.

Jayram T, Kale S, Vee E (2007) Efficient aggregation algorithms for probabilistic data. In: Proceedings of SODA

21.

Jin C, Yi K, Chen L, Yu JX, Lin X (2008) Sliding-window top-k queries on uncertain streams. Proc VLDB Endow 1(1):301–312CrossRef

22.

Kao B, Lee SD, Cheung DW, Ho W-S, Chan KF (2008) Clustering uncertain data using Voronoi diagrams. In: Proceedings of ICDM, pp 333–342

23.

Kriegel H-P, Pfeifle M (2005) Density-based clustering of uncertain data. In: Proceedings of KDD

24.

Kriegel H-P, Pfeifle M (2005) Hierarchical density-based clustering of uncertain data. In: Proceedings of ICDM

25.

Lee SD, Kao B, Cheng R (2007) Reducing uk-means to k-means. In: Proceedingd of ICDM workshops, pp 483–488

26.

Ngai WK, Kao B, Chui CK, Cheng R, Chau M, Yip KY (2006) Efficient clustering of uncertain data. In: Proceedings of ICDM

27.

O’Callaghan L, Mishra N, Meyerson A, Guha S (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of ICDE

28.

Pelekis N, Kopanakis I, Kotsifakos EK, Frentzos E, Theodoridis Y (2011) clustering uncertain trajectories. Knowl Inf Syst 28(1):117–147CrossRef

29.

Tao Y, Cheng R, Xiao X, Ngai WK, Kao B, Prabhakar S (2005) Indexing multi-dimensional uncertain data with arbitrary probability density functions. In: Proceedings of VLDB

30.

Xin D, Halevy AY, Yu C (2007) Data integration with uncertainty. In: Proceedings of VLDB

31.

Zhang M, Chen S, Jensen CS, Ooi BC, Zhang Z (2009) Effectively indexing uncertain moving objects for predictive queries. In: Proceedings of VLDB

32.

Zhang Q, Li F, Yi K (2008) Finding frequent items in probabilistic data. In: Proceedings of SIGMOD

33.

Zhang W, Lin X, Zhang Y, Wang W, Yu JX (2009) Probilistic skyline operator over sliding windows. In: Proceedings of ICDE

34.

Zhang Y, Lin X, ZHU G, Zhang W, Lin Q (2010) Efficient rank based knn query processing over uncertain data. In: Proceedings of ICDE

35.

Zhang T, Ramakrishnan R, Livnya M (1996) Birch: an efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD, pp 103–114

Titel: Efficient clustering of uncertain data streams
verfasst von: Cheqing Jin
Jeffrey Xu Yu
Aoying Zhou
Feng Cao
Publikationsdatum: 01.09.2014
Verlag: Springer London
Erschienen in: Knowledge and Information Systems / Ausgabe 3/2014
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-013-0657-3

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Weitere Artikel der Ausgabe 3/2014

Introducing semantic variables in mixed distance measures: Impact on hierarchical clustering

Integration of artificial immune network and K-means for cluster analysis

Efficient processing of streaming updates with archived master data in near-real-time data warehousing

Parallel multiple kernel learning: a hybrid alternating direction method of multipliers

On the difficulty of automatically detecting irony: beyond a simple case of negation

Explication and semantic querying of enterprise information systems