skip to main content
10.1145/276304.276312acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article
Free Access

CURE: an efficient clustering algorithm for large databases

Published:01 June 1998Publication History

ABSTRACT

Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.

References

  1. BKSS90.N. Beckmann, H.-P. Kriegef, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proc. of A CM SIGMOD, pages 322-331, Atlantic City, NJ, May i990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. CLR90.Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Press, Massachusetts, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. EKSX96.Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial database with noise. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-96), Portland, Oregon, August 1996.Google ScholarGoogle Scholar
  4. EKX95.Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. A database interface for clustering in large spatial databases. In Int'! Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August 1995.Google ScholarGoogle Scholar
  5. GRS97.Sudipto Guha, R. Rastogi, and K. Shim. CURE: A clustering algorithm for large databases. Technical report, Bell Laboratories, Murray Hill, 1997.Google ScholarGoogle Scholar
  6. JD88.Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. MR95.R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. NH94.Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. In Proc. of the VLDB Conference, Santiago, Chile, September 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Ols93.Clark F. Olson. Parallel algorithms for hierarchical clustering. Technical report, University of California at Berkeley, December 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Sam89.H. Samet. The Design and Analysis of Spatial Data Structures. AddisomWesley, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Sam90.Hanan Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley Publishing Company, Inc., New York, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. SRF87.T. Sellis, N, Roussopoulos, and C. FMoutsos. The R+ tree: a dynamic index for multi-dimensional objects. In Proc. 13th Int'l Conference on VLDB, pages 507-- 518, England, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Vit85.Jeff Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37-57, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. ZRL96.Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: An efficient data clustering method for very large databases. In Proceedings oj' the A CM SIGMOD Conference on Management o/ Data, pages 103-114, Montreal, Canada, June 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. CURE: an efficient clustering algorithm for large databases

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
            June 1998
            599 pages
            ISBN:0897919955
            DOI:10.1145/276304

            Copyright © 1998 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 June 1998

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            Overall Acceptance Rate785of4,003submissions,20%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader