Incremental clustering of mixed data based on distance hierarchy
Introduction
Clustering is the unsupervised classification of patterns into groups. It is an important data analysis technique that organizes a collection of patterns into clusters based on similarity (Hsu, 2006, Hsu and Wang, 2005, Jain and Dubes, 1988). Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. Clustering methods have been successfully applied in many fields, including pattern recognition (Anderberg, 1973), biology, psychiatry, psychology, archaeology, geology, geography, marketing, image processing (Jain & Dubes, 1988) and information retrieval (Rasmussen, 1992, Salton and Buckley, 1991). Intuitively, patterns within a valid cluster are more similar to each other than they are to patterns belonging to a different cluster.
Data clustering has been considered a primary data mining method for knowledge discovery, and many clustering algorithms have been proposed in the literature. In general, major clustering methods can be classified as either hierarchical or partitional. A hierarchical method creates a hierarchical decomposition of the given set of data patterns, whereas a partitional approach produces k partitions of the patterns, where each partition represents a cluster. Further classification within each category is possible (Jain & Dubes, 1988). In addition, Jain, Murty, and Flynn (1999) discussed some cross-cutting issues that affect all of the different approaches regardless of their placement in these categories. Being non-incremental or incremental is one of those issues (Hsu, 2006, Hsu and Wang, 2005). Non-incremental clustering methods process all the data patterns at once; these algorithms usually require the entire dataset to be loaded into memory and therefore have high memory-space requirements.
The major advantage of incremental clustering algorithms is that it is not necessary to store the entire pattern matrix in memory, so their space requirements are very small. Incremental clustering considers input patterns one at a time and assigns them to the existing clusters (Jain & Dubes, 1988): a new input pattern is assigned to a cluster without significantly affecting the existing clusters. These algorithms are therefore well suited to dynamic environments and very large datasets, and they have already been applied along these directions (Can, 1993, Ester et al., 1998, Somlo and Adele, 2001).
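As a concrete illustration of this one-pass scheme, the sketch below is a generic nearest-centroid variant (not the algorithm proposed in this article): each incoming pattern joins the closest existing centroid if it lies within a distance threshold, and otherwise opens a new cluster. The function name and the threshold parameter are illustrative assumptions.

```python
import math

def incremental_cluster(patterns, threshold):
    """Assign each pattern to the nearest existing centroid, or open a
    new cluster when the nearest centroid is farther than `threshold`."""
    centroids, counts, labels = [], [], []
    for p in patterns:
        if centroids:
            d, j = min((math.dist(p, c), j) for j, c in enumerate(centroids))
        else:
            d, j = float("inf"), -1
        if d <= threshold:
            counts[j] += 1
            # running-mean update of the winning centroid
            centroids[j] = [c + (x - c) / counts[j]
                            for c, x in zip(centroids[j], p)]
            labels.append(j)
        else:
            centroids.append(list(p))   # the pattern seeds a new cluster
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids
```

Because each pattern is seen exactly once, memory holds only the centroids, not the dataset, which is the space advantage discussed above.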
Most clustering algorithms consider either categorical data or numeric data. However, many datasets nowadays are mixed, containing both categorical and numeric values. A common practice for clustering a mixed dataset is to transform the categorical values into numeric values and then apply a numeric clustering algorithm. Another approach is to compare the categorical values directly, in which two distinct values result in distance 1 while identical values result in distance 0. Nevertheless, neither method takes into account the similarity information embedded between categorical values. Consequently, the clustering results do not faithfully reveal the similarity structure of the dataset (Hsu, 2006, Hsu and Wang, 2005).
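The direct-comparison practice described above (often called simple matching) can be sketched as follows; the function name and the attribute layout (a tuple with a set of numeric indices) are illustrative assumptions, not from the original paper.

```python
def simple_matching_distance(a, b, numeric_idx):
    """Conventional mixed-data distance: squared difference on numeric
    attributes, 0/1 simple matching on categorical attributes."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_idx:
            total += (x - y) ** 2          # numeric: subtraction
        else:
            total += 0.0 if x == y else 1.0  # categorical: 0/1 match
    return total ** 0.5
```

The limitation is visible immediately: every pair of distinct categorical values is exactly distance 1 apart, so any semantic similarity between two categorical values is discarded.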
This article builds on the distance hierarchy (Hsu, 2006, Hsu and Wang, 2005) to propose a new incremental clustering algorithm for mixed datasets, in which the similarity information embedded between categorical attribute values is considered during clustering. In our setting, each attribute of the data is associated with a distance hierarchy, which is an extension of the concept hierarchy (Somlo & Adele, 2001) with link weights representing the distance between concepts. The distance between two mixed data patterns is then calculated according to the distance hierarchies.
It is worth mentioning that the representation scheme of distance hierarchy can generalize some conventional distance computation schemes including the simple matching and the binary encoding for categorical values, and the subtraction method for continuous values and ordinal values.
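To make the idea concrete, the following minimal sketch assumes a distance hierarchy encoded as a parent map with one weight per link, and measures the distance between two leaf values as the summed link weights from each leaf up to their lowest common ancestor; the product hierarchy used as an example is hypothetical.

```python
def dh_distance(x, y, parent, weight):
    """Distance between two values in a distance hierarchy: the sum of
    link weights from each value up to their lowest common ancestor."""
    def path_to_root(v):
        path = [v]
        while v in parent:
            v = parent[v]
            path.append(v)
        return path
    px, py = path_to_root(x), path_to_root(y)
    ancestors = set(py)
    lca = next(v for v in px if v in ancestors)  # lowest common ancestor
    cost = 0.0
    for path in (px, py):
        for v in path:
            if v == lca:
                break
            cost += weight[v]  # weight of the link from v to parent[v]
    return cost

# Hypothetical product hierarchy: PC and Notebook share the ancestor
# "Computer", while TV sits under "Appliance"; all link weights are 1.
parent = {"PC": "Computer", "Notebook": "Computer",
          "TV": "Appliance", "Computer": "Any", "Appliance": "Any"}
weight = {"PC": 1, "Notebook": 1, "TV": 1, "Computer": 1, "Appliance": 1}
```

Under this encoding PC is closer to Notebook than to TV, which simple matching cannot express. The generalization claim above also falls out: a flat hierarchy with every value hanging off the root at weight 0.5 reproduces simple matching (any two distinct values are distance 1 apart), while a degenerate numeric hierarchy reduces to subtraction.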
The rest of this article is organized as follows. Section 2 reviews clustering algorithms and discusses the shortcomings of the conventional approaches to clustering mixed data. Section 3 presents distance hierarchy for categorical data and proposes the incremental clustering algorithm based on distance hierarchies. In Section 4, experimental results on synthetic and real datasets are presented. Conclusions are given in Section 5.
Literature review
Adaptive resonance theory (ART) neural networks model real-time prediction, search, learning, and recognition. ART networks function as models of human cognitive information processing (Carpenter, 1997, Carpenter and Grossberg, 1993, Grossberg, 1980, Grossberg, 1999, Grossberg, 2003). A central feature of all ART systems is a pattern-matching process that compares an external input with the internal memory of an active code. ART1 deals with binary data, while ART2 deals with general analog (continuous) data.
Clustering hybrid data based on distance hierarchy
This paper proposes the distance hierarchy tree structure to better express the degree of similarity between categorical values. The proposed algorithm combines this structure with the adaptive resonance theory network, making it effective for clustering mixed data. This section presents the distance hierarchy for categorical data and proposes the incremental clustering algorithm based on distance hierarchies.
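The full M-ART procedure appears in the complete text; the fragment below is only a rough ART-style sketch, assuming a vigilance-like distance threshold and a pluggable distance function (which could be a hierarchy-based mixed distance). Unlike a real ART network, this simplification does not update the winning prototype after a match.

```python
def art_style_cluster(patterns, distance, vigilance):
    """ART-style incremental assignment: a pattern joins the nearest
    prototype only if it passes the vigilance test; otherwise a new
    prototype (category) is recruited from the pattern itself."""
    prototypes, labels = [], []
    for p in patterns:
        best, best_d = None, float("inf")
        for j, proto in enumerate(prototypes):
            d = distance(p, proto)
            if d < best_d:
                best, best_d = j, d
        if best is not None and best_d <= vigilance:
            labels.append(best)        # resonance: accept the category
        else:
            prototypes.append(p)       # mismatch: recruit a new category
            labels.append(len(prototypes) - 1)
    return labels, prototypes
```

A smaller vigilance threshold yields more, tighter clusters; a larger one yields fewer, coarser clusters.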
Experiments and discussion
This paper develops a prototype system in Borland C++ Builder 6. A series of experiments was performed to verify the method. A mixed synthetic dataset and a UCI dataset were used to show the capability of M-ART to reasonably express and faithfully preserve the distances between categorical data. This section reports the experimental results on both artificial and real data.
Conclusions
Most traditional clustering algorithms can only handle either categorical or numeric values. Although some research results have been published for handling mixed data, they still cannot reasonably express the similarities among categorical data. This paper presents the M-ART algorithm, which can handle mixed datasets directly. The experimental results on synthetic datasets show that the proposed approach can better reveal the similarity structure among data, particularly when the categorical attributes embed similarity information.
References (26)
- Distributed learning, recognition, and prediction by ART and ARTMAP neural networks. Neural Networks (1997).
- Normal and amnesic learning, recognition, and memory by a neural model of cortico-hippocampal interactions. Trends in Neuroscience (1993).
- Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks (1991).
- The link between brain, learning, attention, and consciousness. Consciousness and Cognition (1999).
- Mining of mixed data with application to catalog marketing. Expert Systems with Applications (2007).
- Cluster analysis for applications (1973).
- Barbara, D., Couto, J., & Li, Y. (2002). COOLCAT: An entropy-based algorithm for categorical clustering. In Proceedings...
- Incremental clustering for dynamic information processing. ACM Transactions on Information Systems (1993).
- ART2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics: Special Issue on Neural Networks (1987).
- Dash, M., Choi, K., Scheuermann, P., & Liu, H. (2002). Feature selection for clustering – a filter solution. In...
- How does a brain build a cognitive code? Psychological Review (1980).