Elsevier

Information Systems

Volume 25, Issue 5, July 2000, Pages 345-366
Information Systems

Rock: A robust clustering algorithm for categorical attributes

https://doi.org/10.1016/S0306-4379(00)00022-3Get rights and content

Abstract

Clustering, in data mining, is useful to discover distribution patterns in the underlying data. Clustering algorithms usually employ a distance metric based (e.g., euclidean) similarity measure in order to partition the database such that data points in the same partition are more similar than points in different partitions. In this paper, we study clustering algorithms for data with boolean and categorical attributes. We show that traditional clustering algorithms that use distances between points for clustering are not appropriate for boolean and categorical attributes. Instead, we propose a novel concept of links to measure the similarity/proximity between a pair of data points. We develop a robust hierarchical clustering algorithm ROCK that employs links and not distances when merging clusters. Our methods naturally extend to non-metric similarity measures that are relevant in situations where a domain expert/similarity table is the only source of knowledge. In addition to presenting detailed complexity results for ROCK, we also conduct an experimental study with real-life as well as synthetic data sets to demonstrate the effectiveness of our techniques. For data with categorical attributes, our findings indicate that ROCK not only generates better quality clusters than traditional algorithms, but it also exhibits good scalability properties.

References (14)

  • R. Agrawal et al.

    Fast similarity search in the presence of noise, scaling, and translation in time-series databases

  • D. Coppersmith et al.

    Matrix multiplication via arithmetic progressions

  • T.H. Cormen et al.
  • R.O. Duda et al.
  • M. Ester et al.

    A density-based algorithm for discovering clusters in large spatial database with noise

  • M. Ester et al.

    A database interface for clustering in large spatial databases

  • S. Guha et al.

    CURE: A clustering algorithm for large databases

There are more references available in the full text version of this article.

Cited by (723)

  • Boosting cluster tree with reciprocal nearest neighbors scoring

    2024, Engineering Applications of Artificial Intelligence
View all citing articles on Scopus

Recommended by Felipe Carino.

View full text