Incremental clustering of mixed data based on distance hierarchy
Introduction
Clustering is the unsupervised classification of patterns into groups. It is an important data analysis technique that organizes a collection of patterns into clusters based on similarity (Hsu, 2006, Hsu and Wang, 2005, Jain and Dubes, 1988). Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. Clustering methods have been successfully applied in many fields, including pattern recognition (Anderberg, 1973), biology, psychiatry, psychology, archaeology, geology, geography, marketing, image processing (Jain & Dubes, 1988) and information retrieval (Rasmussen, 1992, Salton and Buckley, 1991). Intuitively, patterns within a valid cluster are more similar to each other than they are to patterns belonging to a different cluster.
Data clustering has been considered a primary data mining method for knowledge discovery, and many clustering algorithms have been proposed in the literature. In general, major clustering methods can be classified as either hierarchical or partitional. A hierarchical method creates a hierarchical decomposition of the given set of data patterns, whereas a partitional approach produces k partitions of the patterns, where each partition represents a cluster. Further classification within each category is possible (Jain & Dubes, 1988). In addition, Jain, Murty, and Flynn (1999) discussed some cross-cutting issues that affect all of the different approaches regardless of their placement in these categories. Being non-incremental or incremental is one of those issues (Hsu, 2006, Hsu and Wang, 2005). Non-incremental clustering methods process all the data patterns at once; these algorithms usually require the entire dataset to be loaded into memory and therefore have high memory-space requirements.
The major advantage of incremental clustering algorithms is that it is not necessary to store the entire pattern matrix in memory, so their space requirements are very small. Incremental clustering considers input patterns one at a time and assigns them to the existing clusters (Jain & Dubes, 1988): a new input pattern is assigned to a cluster without significantly affecting the existing clusters. These algorithms are therefore well suited to dynamic environments and very large datasets, and they have already been applied along these directions (Can, 1993, Ester et al., 1998, Somlo and Adele, 2001).
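As a concrete illustration of this one-pass scheme, the sketch below is a generic nearest-centroid variant (not the algorithm proposed in this article): each incoming pattern joins the closest existing centroid if it lies within a distance threshold, and otherwise opens a new cluster. The function name and the threshold parameter are illustrative assumptions.

```python
import math

def incremental_cluster(patterns, threshold):
    """Assign each pattern to the nearest existing centroid, or open a
    new cluster when the nearest centroid is farther than `threshold`."""
    centroids, counts, labels = [], [], []
    for p in patterns:
        if centroids:
            d, j = min((math.dist(p, c), j) for j, c in enumerate(centroids))
        else:
            d, j = float("inf"), -1
        if d <= threshold:
            counts[j] += 1
            # running-mean update of the winning centroid
            centroids[j] = [c + (x - c) / counts[j]
                            for c, x in zip(centroids[j], p)]
            labels.append(j)
        else:
            centroids.append(list(p))   # the pattern seeds a new cluster
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids
```

Because each pattern is seen exactly once, memory holds only the centroids, not the dataset, which is the space advantage discussed above.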
Most clustering algorithms consider either categorical data or numeric data. However, many datasets nowadays are mixed, containing both categorical and numeric values. A common practice for clustering a mixed dataset is to transform the categorical values into numeric values and then apply a numeric clustering algorithm. Another approach is to compare the categorical values directly, in which two distinct values result in distance 1 while identical values result in distance 0. Nevertheless, neither method takes into account the similarity information embedded between categorical values. Consequently, the clustering results do not faithfully reveal the similarity structure of the dataset (Hsu, 2006, Hsu and Wang, 2005).
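The direct-comparison practice described above (often called simple matching) can be sketched as follows; the function name and the attribute layout (a tuple with a set of numeric indices) are illustrative assumptions, not from the original paper.

```python
def simple_matching_distance(a, b, numeric_idx):
    """Conventional mixed-data distance: squared difference on numeric
    attributes, 0/1 simple matching on categorical attributes."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_idx:
            total += (x - y) ** 2          # numeric: subtraction
        else:
            total += 0.0 if x == y else 1.0  # categorical: 0/1 match
    return total ** 0.5
```

The limitation is visible immediately: every pair of distinct categorical values is exactly distance 1 apart, so any semantic similarity between two categorical values is discarded.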
This article builds on the distance hierarchy (Hsu, 2006, Hsu and Wang, 2005) to propose a new incremental clustering algorithm for mixed datasets, in which the similarity information embedded between categorical attribute values is considered during clustering. In our setting, each attribute of the data is associated with a distance hierarchy, which is an extension of the concept hierarchy (Somlo & Adele, 2001) with link weights representing the distance between concepts. The distance between two mixed data patterns is then calculated according to the distance hierarchies.
It is worth mentioning that the representation scheme of distance hierarchy can generalize some conventional distance computation schemes including the simple matching and the binary encoding for categorical values, and the subtraction method for continuous values and ordinal values.
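To make the idea concrete, the following minimal sketch assumes a distance hierarchy encoded as a parent map with one weight per link, and measures the distance between two leaf values as the summed link weights from each leaf up to their lowest common ancestor; the product hierarchy used as an example is hypothetical.

```python
def dh_distance(x, y, parent, weight):
    """Distance between two values in a distance hierarchy: the sum of
    link weights from each value up to their lowest common ancestor."""
    def path_to_root(v):
        path = [v]
        while v in parent:
            v = parent[v]
            path.append(v)
        return path
    px, py = path_to_root(x), path_to_root(y)
    ancestors = set(py)
    lca = next(v for v in px if v in ancestors)  # lowest common ancestor
    cost = 0.0
    for path in (px, py):
        for v in path:
            if v == lca:
                break
            cost += weight[v]  # weight of the link from v to parent[v]
    return cost

# Hypothetical product hierarchy: PC and Notebook share the ancestor
# "Computer", while TV sits under "Appliance"; all link weights are 1.
parent = {"PC": "Computer", "Notebook": "Computer",
          "TV": "Appliance", "Computer": "Any", "Appliance": "Any"}
weight = {"PC": 1, "Notebook": 1, "TV": 1, "Computer": 1, "Appliance": 1}
```

Under this encoding PC is closer to Notebook than to TV, which simple matching cannot express. The generalization claim above also falls out: a flat hierarchy with every value hanging off the root at weight 0.5 reproduces simple matching (any two distinct values are distance 1 apart), while a degenerate numeric hierarchy reduces to subtraction.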
The rest of this article is organized as follows. Section 2 reviews clustering algorithms and discusses the shortcomings of the conventional approaches to clustering mixed data. Section 3 presents distance hierarchy for categorical data and proposes the incremental clustering algorithm based on distance hierarchies. In Section 4, experimental results on synthetic and real datasets are presented. Conclusions are given in Section 5.
Literature review
Adaptive resonance theory (ART) neural networks model real-time prediction, search, learning, and recognition. ART networks function as models of human cognitive information processing (Carpenter, 1997, Carpenter and Grossberg, 1993, Grossberg, 1980, Grossberg, 1999, Grossberg, 2003). A central feature of all ART systems is a pattern-matching process that compares an external input with the internal memory of an active code. ART1 deals with binary data, while ART2 deals with general analog (continuous) data.
Clustering hybrid data based on distance hierarchy
This paper proposes the distance hierarchy tree structure to better express the degree of similarity between categorical values. The proposed algorithm combines this structure with the adaptive resonance theory network, making it effective for clustering mixed data. This section presents the distance hierarchy for categorical data and proposes the incremental clustering algorithm based on distance hierarchies.
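The full M-ART procedure appears in the complete text; the fragment below is only a rough ART-style sketch, assuming a vigilance-like distance threshold and a pluggable distance function (which could be a hierarchy-based mixed distance). Unlike a real ART network, this simplification does not update the winning prototype after a match.

```python
def art_style_cluster(patterns, distance, vigilance):
    """ART-style incremental assignment: a pattern joins the nearest
    prototype only if it passes the vigilance test; otherwise a new
    prototype (category) is recruited from the pattern itself."""
    prototypes, labels = [], []
    for p in patterns:
        best, best_d = None, float("inf")
        for j, proto in enumerate(prototypes):
            d = distance(p, proto)
            if d < best_d:
                best, best_d = j, d
        if best is not None and best_d <= vigilance:
            labels.append(best)        # resonance: accept the category
        else:
            prototypes.append(p)       # mismatch: recruit a new category
            labels.append(len(prototypes) - 1)
    return labels, prototypes
```

A smaller vigilance threshold yields more, tighter clusters; a larger one yields fewer, coarser clusters.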
Experiments and discussion
This paper develops a prototype system in Borland C++ Builder 6. A series of experiments was performed to verify the method. A mixed synthetic dataset and a UCI dataset were used to show the capability of M-ART to reasonably express and faithfully preserve the distances between categorical data. This section reports the experimental results on both artificial and real data.
Conclusions
Most traditional clustering algorithms can only handle either categorical or numeric values. Although some research results have been published for handling mixed data, they still cannot reasonably express the similarities among categorical data. This paper presents the M-ART algorithm, which can handle mixed datasets directly. The experimental results on synthetic datasets show that the proposed approach can better reveal the similarity structure among data, particularly when the categorical attributes embed similarity information.
References (26)
- Distributed learning, recognition, and prediction by ART and ARTMAP neural networks. Neural Networks (1997).
- Normal and amnesic learning, recognition, and memory by a neural model of cortico-hippocampal interactions. Trends in Neuroscience (1993).
- Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks (1991).
- The link between brain, learning, attention, and consciousness. Consciousness and Cognition (1999).
- Mining of mixed data with application to catalog marketing. Expert Systems with Applications (2007).
- Cluster analysis for applications (1973).
- Barbara, D., Couto, J., & Li, Y. (2002). COOLCAT: An entropy-based algorithm for categorical clustering. In Proceedings...
- Incremental clustering for dynamic information processing. ACM Transactions on Information Systems (1993).
- ART2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics: Special Issue on Neural Networks (1987).
- Dash, M., Choi, K., Scheuermann, P., & Liu, H. (2002). Feature selection for clustering – a filter solution. In...
- How does a brain build a cognitive code? Psychological Review (1980).