Incremental clustering of mixed data based on distance hierarchy

https://doi.org/10.1016/j.eswa.2007.08.049

Abstract

Clustering is an important function in data mining; a typical application is the analysis of customer data. Adaptive resonance theory (ART) networks are popular unsupervised neural networks: ART1 handles binary data, whereas ART2 handles general numeric data. Many information systems collect mixed-type data comprising both numeric and categorical attributes, yet neither ART1 nor ART2 can handle mixed data. If categorical attributes are transformed into a binary encoding, the binary representation fails to reflect the degree of similarity between categorical values, which degrades clustering quality. This paper therefore proposes a modified adaptive resonance theory network (M-ART) together with a concept hierarchy tree to capture the similarity of mixed data. Experiments were conducted on synthetic data and on a real dataset concerning family income. The results show that the M-ART algorithm can process mixed data and achieves good clustering performance.

Introduction

Clustering is the unsupervised classification of patterns into groups. It is an important data analysis technique that organizes a collection of patterns into clusters based on similarity (Hsu, 2006; Hsu & Wang, 2005; Jain & Dubes, 1988). Clustering is useful in many exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. Clustering methods have been successfully applied in many fields, including pattern recognition (Anderberg, 1973), biology, psychiatry, psychology, archaeology, geology, geography, marketing, image processing (Jain & Dubes, 1988), and information retrieval (Rasmussen, 1992; Salton & Buckley, 1991). Intuitively, patterns within a valid cluster are more similar to each other than to patterns belonging to a different cluster.

Data clustering has been considered a primary data mining method for knowledge discovery, and many clustering algorithms have appeared in the literature. In general, major clustering methods can be classified into the hierarchical or the partitional category. A hierarchical method creates a hierarchical decomposition of the given set of data patterns, whereas a partitional approach produces k partitions of the patterns, with each partition representing a cluster. Further classification within each category is possible (Jain & Dubes, 1988). In addition, Jain, Murty, and Flynn (1999) discussed cross-cutting issues that may affect all of the different approaches regardless of their placement in these categories. Being non-incremental or incremental is one such issue (Hsu, 2006; Hsu & Wang, 2005). Non-incremental clustering methods process all the data patterns at once; these algorithms usually require the entire dataset to be loaded into memory and therefore have high memory requirements.

The major advantage of incremental clustering algorithms is that it is not necessary to store the entire pattern matrix in memory, so their space requirements are small. Incremental clustering considers input patterns one at a time and assigns them to the existing clusters (Jain & Dubes, 1988); a new input pattern is assigned to a cluster without significantly affecting the existing clusters. These algorithms are therefore well suited to dynamic environments and to very large datasets, and they have already been applied along these directions (Can, 1993; Ester et al., 1998; Somlo & Adele, 2001).
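As a concrete illustration of this single-pass behavior, the sketch below implements a simple leader-style incremental clustering pass over numeric patterns. The `threshold` parameter and running-mean centroid update are illustrative choices, not the algorithm proposed in this paper.

```python
import math

def incremental_cluster(patterns, threshold):
    """Leader-style incremental clustering: each pattern is seen exactly once
    and is assigned to the nearest existing cluster, or starts a new one.
    Only the centroids are kept, so memory is independent of dataset size."""
    centroids = []   # one centroid per cluster
    counts = []      # member counts, used for running-mean updates
    labels = []
    for x in patterns:
        # find the nearest existing centroid
        best, best_d = None, float("inf")
        for i, c in enumerate(centroids):
            d = math.dist(x, c)
            if d < best_d:
                best, best_d = i, d
        if best is None or best_d > threshold:
            # the pattern is too far from every cluster: open a new one
            centroids.append(list(x))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # fold the pattern into the running mean of the chosen cluster
            counts[best] += 1
            centroids[best] = [c + (xi - c) / counts[best]
                               for c, xi in zip(centroids[best], x)]
            labels.append(best)
    return labels, centroids
```

Because each pattern updates only one centroid, the existing clusters are perturbed only slightly by new input, which is the property that makes this family of algorithms suitable for streaming data.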

Most clustering algorithms consider either categorical or numeric data, yet many datasets containing both categorical and numeric values exist nowadays. A common practice for clustering a mixed dataset is to transform categorical values into numeric values and then apply a numeric clustering algorithm. Another approach is to compare the categorical values directly, so that two distinct values yield distance 1 while identical values yield distance 0. Neither method takes into account the similarity information embedded between categorical values, so the clustering results do not faithfully reveal the similarity structure of the dataset (Hsu, 2006; Hsu & Wang, 2005).
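The weakness of the direct-comparison approach can be seen in a small sketch; the attribute values below are invented for illustration:

```python
def naive_mixed_distance(x, y, is_numeric):
    """Conventional mixed-data distance: subtraction for numeric attributes,
    simple matching (0 if equal, 1 otherwise) for categorical ones."""
    total = 0.0
    for xi, yi, num in zip(x, y, is_numeric):
        if num:
            total += (xi - yi) ** 2            # numeric: squared difference
        else:
            total += 0.0 if xi == yi else 1.0  # categorical: 0/1 matching
    return total ** 0.5

# The flaw: "Coke" vs "Pepsi" and "Coke" vs "Laptop" both yield distance 1,
# although the first pair is intuitively far more similar.
d1 = naive_mixed_distance([0.2, "Coke"], [0.2, "Pepsi"], [True, False])
d2 = naive_mixed_distance([0.2, "Coke"], [0.2, "Laptop"], [True, False])
# d1 == d2 == 1.0
```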

This article builds on distance hierarchy (Hsu, 2006; Hsu & Wang, 2005) to propose a new incremental clustering algorithm for mixed datasets, in which the similarity information embedded between categorical values is considered during clustering. In our setting, each attribute of the data is associated with a distance hierarchy, an extension of the concept hierarchy (Somlo & Adele, 2001) with link weights representing the distances between concepts. The distance between two mixed data patterns is then calculated according to the distance hierarchies.

It is worth mentioning that the distance hierarchy representation generalizes several conventional distance computation schemes, including simple matching and binary encoding for categorical values, and subtraction for continuous and ordinal values.
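A minimal sketch of how a weighted concept hierarchy yields graded distances between categorical values follows; the tree, node names, and link weights are invented for illustration, not taken from the paper:

```python
class DistanceHierarchy:
    """A concept hierarchy whose links carry weights; the distance between
    two categorical values is the weighted path length between their leaves
    through their lowest common ancestor."""
    def __init__(self, parent, weight):
        self.parent = parent   # child node -> parent node (root maps to None)
        self.weight = weight   # node -> weight of the link to its parent

    def path_to_root(self, node):
        path = []
        while node is not None:
            path.append(node)
            node = self.parent[node]
        return path

    def distance(self, a, b):
        pa, pb = self.path_to_root(a), self.path_to_root(b)
        common = set(pa) & set(pb)
        # sum the link weights from each leaf up to the common ancestor
        d = 0.0
        for path in (pa, pb):
            for node in path:
                if node in common:
                    break
                d += self.weight[node]
        return d

parent = {"Any": None, "Drink": "Any", "Food": "Any",
          "Coke": "Drink", "Pepsi": "Drink", "Bread": "Food"}
weight = {"Any": 0.0, "Drink": 0.5, "Food": 0.5,
          "Coke": 0.5, "Pepsi": 0.5, "Bread": 0.5}
dh = DistanceHierarchy(parent, weight)
# Coke-Pepsi share the Drink concept, Coke-Bread only the root:
# dh.distance("Coke", "Pepsi") == 1.0 < dh.distance("Coke", "Bread") == 2.0
```

Unlike simple matching, values sharing a deeper common ancestor come out closer, which is exactly the similarity information the conventional schemes discard.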

The rest of this article is organized as follows. Section 2 reviews clustering algorithms and discusses the shortcomings of the conventional approaches to clustering mixed data. Section 3 presents distance hierarchy for categorical data and proposes the incremental clustering algorithm based on distance hierarchies. In Section 4, experimental results on synthetic and real datasets are presented. Conclusions are given in Section 5.

Section snippets

Literature review

Adaptive resonance theory (ART) neural networks model real-time prediction, search, learning, and recognition, and function as models of human cognitive information processing (Carpenter, 1997; Carpenter & Grossberg, 1993; Grossberg, 1980, 1999, 2003). A central feature of all ART systems is a pattern matching process that compares an external input with the internal memory of an active code. ART1 deals with binary numerical data and ART2 deals with general numerical data.

Clustering hybrid data based on distance hierarchy

This paper proposes a distance hierarchy tree structure to better express the degree of similarity among categorical values. The distance hierarchy tree is combined with the adaptive resonance theory network algorithm, enabling effective clustering of mixed data. This section presents the distance hierarchy for categorical data and proposes the incremental clustering algorithm based on distance hierarchies.
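Under the assumptions that each cluster prototype keeps a running mean for numeric attributes and that hierarchy-derived categorical distances are available in a precomputed lookup table, an M-ART-style pass might be sketched as follows. This is an illustrative simplification, not the paper's exact algorithm; the vigilance-style `vigilance` threshold and the choice to leave the prototype's categorical value fixed are assumptions made for the sketch.

```python
def mart_cluster(patterns, is_numeric, cat_dist, vigilance):
    """Single-pass clustering of mixed patterns: a pattern joins the nearest
    prototype if the mixed distance passes the vigilance test, otherwise it
    seeds a new cluster. `cat_dist` maps pairs of categorical values to a
    hierarchy-derived distance (identical values default to 0)."""
    protos, counts, labels = [], [], []

    def dist(x, p):
        s = 0.0
        for xi, pi, num in zip(x, p, is_numeric):
            if num:
                s += (xi - pi) ** 2
            else:
                s += cat_dist.get((xi, pi), cat_dist.get((pi, xi), 0.0)) ** 2
        return s ** 0.5

    for x in patterns:
        j, dj = min(((i, dist(x, p)) for i, p in enumerate(protos)),
                    key=lambda t: t[1], default=(None, float("inf")))
        if j is None or dj > vigilance:          # vigilance test fails
            protos.append(list(x))
            counts.append(1)
            labels.append(len(protos) - 1)
        else:                                    # resonance: update prototype
            counts[j] += 1
            protos[j] = [p + (xi - p) / counts[j] if num else p
                         for p, xi, num in zip(protos[j], x, is_numeric)]
            labels.append(j)
    return labels, protos

# Illustrative hierarchy-derived distance table (made-up values):
table = {("Coke", "Pepsi"): 1.0, ("Coke", "Bread"): 2.0, ("Pepsi", "Bread"): 2.0}
labels, protos = mart_cluster(
    [(0.2, "Coke"), (0.3, "Pepsi"), (0.9, "Bread")],
    [True, False], table, vigilance=1.5)
```

With graded categorical distances, "Coke" and "Pepsi" patterns can resonate with the same prototype while "Bread" opens a new cluster, which 0/1 matching cannot express.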

Experiments and discussion

A prototype system was developed with Borland C++ Builder 6, and a series of experiments was performed to verify the method. A mixed synthetic dataset and a UCI dataset were used to show the capability of M-ART to reasonably express and faithfully preserve the distances among categorical data. This section reports the experimental results on both artificial and real data.

Conclusions

Most traditional clustering algorithms can handle only categorical or only numeric values. Although some results have been published on handling mixed data, they still cannot reasonably express the similarities among categorical data. This paper presents the M-ART algorithm, which can handle mixed datasets directly. The experimental results on synthetic datasets show that the proposed approach can better reveal the similarity structure among data, particularly when categorical attributes…

References (26)

  • Ester, M., Kriegel, H.-P., Sander, J., Wimmer, M., & Xu, X. (1998). Incremental clustering for mining in a data...
  • Gluck, M.-A., & Corter, J.-E. (1985). Information, uncertainty, and the utility of categories. In Proceedings of the...
  • Grossberg, S. (1980). How does a brain build a cognitive code? Psychological Review.