CURE: an efficient clustering algorithm for large databases

Authors:
Sudipto Guha

Stanford University, Stanford, CA

Stanford University, Stanford, CA
View Profile

,
Rajeev Rastogi

Bell Laboratories, Murray Hill, NJ

Bell Laboratories, Murray Hill, NJ
View Profile

,
Kyuseok Shim

Bell Laboratories, Murray Hill, NJ

Bell Laboratories, Murray Hill, NJ
View Profile

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of dataJune 1998Pages 73–84https://doi.org/10.1145/276304.276312

Published:01 June 1998Publication History

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Pages 73–84

ABSTRACT

Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.

References

BKSS90.N. Beckmann, H.-P. Kriegef, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proc. of A CM SIGMOD, pages 322-331, Atlantic City, NJ, May i990. Google ScholarDigital Library
CLR90.Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Press, Massachusetts, 1990. Google ScholarDigital Library
EKSX96.Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial database with noise. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-96), Portland, Oregon, August 1996.Google Scholar
EKX95.Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. A database interface for clustering in large spatial databases. In Int'! Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August 1995.Google Scholar
GRS97.Sudipto Guha, R. Rastogi, and K. Shim. CURE: A clustering algorithm for large databases. Technical report, Bell Laboratories, Murray Hill, 1997.Google Scholar
JD88.Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey, 1988. Google ScholarDigital Library
MR95.R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarDigital Library
NH94.Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. In Proc. of the VLDB Conference, Santiago, Chile, September 1994. Google ScholarDigital Library
Ols93.Clark F. Olson. Parallel algorithms for hierarchical clustering. Technical report, University of California at Berkeley, December 1993. Google ScholarDigital Library
Sam89.H. Samet. The Design and Analysis of Spatial Data Structures. AddisomWesley, 1989. Google ScholarDigital Library
Sam90.Hanan Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley Publishing Company, Inc., New York, 1990. Google ScholarDigital Library
SRF87.T. Sellis, N, Roussopoulos, and C. FMoutsos. The R+ tree: a dynamic index for multi-dimensional objects. In Proc. 13th Int'l Conference on VLDB, pages 507-- 518, England, 1987. Google ScholarDigital Library
Vit85.Jeff Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37-57, 1985. Google ScholarDigital Library
ZRL96.Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: An efficient data clustering method for very large databases. In Proceedings oj' the A CM SIGMOD Conference on Management o/ Data, pages 103-114, Montreal, Canada, June 1996. Google ScholarDigital Library

Index Terms

CURE: an efficient clustering algorithm for large databases

Recommendations

CURE: an efficient clustering algorithm for large databases

Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the ...
Read More
Modified CURE algorithm with enhancement to identify number of clusters

In this paper, we present an effective way of identifying number of clusters k based on density of data in given dataset and optimality of clusters formed. We have used internal evaluation of clustering to choose optimal set of clusters after narrowing ...
Read More
Technology of Vehicle Identification by Laser Radar Based on CURE
ICEICE '12: Proceedings of the 2012 Second International Conference on Electric Information and Control Engineering - Volume 01

this paper proposes an algorithm of vehicle identification by laser radar based on CURE. The algorithm adopts the improved CURE algorithm to cluster laser radar data, and uses the width of the class after clustering as the basis of vehicle ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
June 1998
599 pages
ISBN:0897919955
DOI:10.1145/276304
Chairmen:
Laura Haas
IBM AlmadenResearch Center, San Jose, CA
,
Pamela Drew
Boeing Co.
,
Editors:
Ashutosh Tiwary
Boeing Co.; and Univ. of Washington, Seattle
,
Michael Franklin
Univ. of Maryland, College Park
ACM SIGMOD Record Volume 27, Issue 2
June 1998
595 pages
ISSN:0163-5808
DOI:10.1145/276305
Chairmen:
Laura Haas
IBM Almaden Research Center, San Jose, CA
,
Pamela Drew
Boeing Co.
,
Editor:
Ashutosh Tiwary
Boeing Co.; and Univ. of Washington, Seattle
Issue’s Table of Contents
Copyright © 1998 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 June 1998
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1,617
  Total Citations
  View Citations
- 10,679
  Total Downloads
- Downloads (Last 12 months)1,259
- Downloads (Last 6 weeks)209
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

CURE: an efficient clustering algorithm for large databases

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

CURE: an efficient clustering algorithm for large databases

Modified CURE algorithm with enhancement to identify number of clusters

Technology of Vehicle Identification by Laser Radar Based on CURE