Article

A robust and scalable clustering algorithm for mixed type attributes in large database environment

Authors:
Tom Chiu

American Century Investments, Kansas City, MO

American Century Investments, Kansas City, MO
View Profile

,
DongPing Fang

SPSS Inc., Chicago, IL

SPSS Inc., Chicago, IL
View Profile

,
John Chen

SPSS Inc., Chicago, IL

SPSS Inc., Chicago, IL
View Profile

,
Yao Wang

SPSS Inc., Chicago, IL

SPSS Inc., Chicago, IL
View Profile

,
Christopher Jeris

SPSS Inc., Chicago, IL

SPSS Inc., Chicago, IL
View Profile

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data miningAugust 2001Pages 263–268https://doi.org/10.1145/502512.502549

Published:26 August 2001Publication History

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 263–268

ABSTRACT

Clustering is a widely used technique in data mining applications to discover patterns in the underlying data. Most traditional clustering algorithms are limited to handling datasets that contain either continuous or categorical attributes. However, datasets with mixed types of attributes are common in real life data mining problems. In this paper, we propose a distance measure that enables clustering data with both continuous and categorical attributes. This distance measure is derived from a probabilistic model that the distance between two clusters is equivalent to the decrease in log-likelihood function as a result of merging. Calculation of this measure is memory efficient as it depends only on the merging cluster pair and not on all the other clusters. Zhang et al [8] proposed a clustering method named BIRCH that is especially suitable for very large datasets. We develop a clustering algorithm using our distance measure based on the framework of BIRCH. Similar to BIRCH, our algorithm first performs a pre-clustering step by scanning the entire dataset and storing the dense regions of data records in terms of summary statistics. A hierarchical clustering algorithm is then applied to cluster the dense regions. Apart from the ability of handling mixed type of attributes, our algorithm differs from BIRCH in that we add a procedure that enables the algorithm to automatically determine the appropriate number of clusters and a new strategy of assigning cluster membership to noisy data. For data with mixed type of attributes, our experimental results confirm that the algorithm not only generates better quality clusters than the traditional k-means algorithms, but also exhibits good scalability properties and is able to identify the underlying number of clusters in the data correctly. The algorithm is implemented in the commercial data mining tool Clementine 6.0 which supports the PMML standard of data mining model deployment.

References

1.Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and Non-Gaussian Clustering. Biometrics, 49, p. 803-821.Google Scholar
2.Fraley, C., and Raftery, A. E. (1998). How Many Clusters? Which Clustering Method? Answers via Model-based Cluster Analysis. Computer Journal, 4, p. 578-588.Google ScholarCross Ref
3.Ganti, V., Gehrke, J., and Ramakrishnan, R. (1999). CACTUS - Clustering Categorical Data Using Summaries. In Proceedings of 1999 SIGKDD Conference. p. 73-82. Google ScholarDigital Library
4.Guha, S., Rastogi, R., and Shim, K. (1999). ROCK: A Robust Clustering Algorithm for Categorical Attributes. URL: http://www.bell-labs.com/proiect/serendip/.Google Scholar
5.Huang, Z. (1998). Extensions to the K-means Algorithm for Clustering Large Datasets with Categorical Values. Data Mining and Knowledge Discovery, 2, p. 283-304. Google ScholarDigital Library
6.MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, p. 281-297.Google Scholar
7.Schwarz, G. (1978). Estimating the dimension of a model. The Annual of Statistics, 6, p. 461-464.Google ScholarCross Ref
8.Zhang, T., Ramakrishnan, R., and Livny M. (1997). BIRCH: A New Data Clustering Algorithm and its Applications. Data Mining and Knowledge Discovery 1 (2). Google ScholarDigital Library

Index Terms

A robust and scalable clustering algorithm for mixed type attributes in large database environment
1. Human-centered computing
  1. Visualization
    1. Visualization application domains
      1. Geographic visualization
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering
    2. Spatial-temporal systems

Recommendations

K-Harmonic means type clustering algorithm for mixed datasets

Display Omitted A K-Harmonic clustering algorithm for mixed data has been presented to reduce random initialization problem for partitional algorithms.The proposed clustering algorithm uses a distance measure developed for mixed datasets.The experiment ...
Read More
Determining the number of clusters using information entropy for mixed data

In cluster analysis, one of the most challenging and difficult problems is the determination of the number of clusters in a data set, which is a basic input parameter for most clustering algorithms. To solve this problem, many algorithms have been ...
Read More
A k-means type clustering algorithm for subspace clustering of mixed numeric and categorical datasets

Almost all subspace clustering algorithms proposed so far are designed for numeric datasets. In this paper, we present a k-means type clustering algorithm that finds clusters in data subspaces in mixed numeric and categorical datasets. In this method, ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
August 2001
493 pages
ISBN:158113391X
DOI:10.1145/502512
Conference Chair:
Doheon Lee
Chonnam National University, Korea
,
General Chair:
Mario Schkolnick
SGI
,
Program Chairs:
Foster Provost
New York University
,
Ramakrishnan Srikant
IBM Almaden Research Center
Copyright © 2001 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 August 2001
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Mixed type of attributes
clustering
log-likelihood
noisy data
number of clusters
Qualifiers
- Article
Conference

Acceptance Rates
KDD '01 Paper Acceptance Rate31of237submissions,13%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 331
  Total Citations
  View Citations
- 4,166
  Total Downloads
- Downloads (Last 12 months)154
- Downloads (Last 6 weeks)16
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A robust and scalable clustering algorithm for mixed type attributes in large database environment

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

K-Harmonic means type clustering algorithm for mixed datasets

Determining the number of clusters using information entropy for mixed data

A k-means type clustering algorithm for subspace clustering of mixed numeric and categorical datasets