ABSTRACT
Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a fixed number of points, generated by selecting well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned, and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than that of clusters found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE not only to outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.
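The two ideas in the abstract, shrunken representative points and sampling followed by partitioning, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the parameters `c` (number of representatives) and `alpha` (shrink fraction), the farthest-point rule used to pick "well scattered" points, and the round-robin partitioning are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def scattered_points(points, c):
    """Pick up to c well-scattered points (assumed rule: farthest-point selection).

    The first pick is the point farthest from the centroid; each later pick
    maximizes its distance to the nearest point already chosen.
    """
    center = centroid(points)
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(c, len(points)):
        remaining = [p for p in points if p not in chosen]
        if not remaining:  # only duplicates of chosen points are left
            break
        chosen.append(
            max(remaining, key=lambda p: min(math.dist(p, q) for q in chosen))
        )
    return chosen


def representatives(points, c=4, alpha=0.5):
    """Shrink each scattered point toward the centroid by the fraction alpha.

    alpha = 0 leaves the points in place; alpha = 1 collapses them onto the
    centroid. Both default values are illustrative assumptions.
    """
    center = centroid(points)
    return [
        tuple(x + alpha * (m - x) for x, m in zip(p, center))
        for p in scattered_points(points, c)
    ]


def sample_and_partition(data, sample_size, num_partitions, seed=0):
    """Draw a random sample, then split it into partitions (round-robin here).

    Each partition would be partially clustered on its own before a second
    pass clusters the partial clusters; that clustering step is omitted.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    return [sample[i::num_partitions] for i in range(num_partitions)]


if __name__ == "__main__":
    cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
    print(representatives(cluster, c=2, alpha=0.5))
    print([len(p) for p in sample_and_partition(list(range(100)), 20, 4)])
```

Shrinking moves an outlying representative farther in absolute terms than one near the center, which is one way to see why the abstract credits the shrink step with dampening the effect of outliers.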
Index Terms
- CURE: an efficient clustering algorithm for large databases