Growing Self-Organizing Map with cross insert for mixed-type data clustering
Highlights
► Semantic similarity inherent in categorical values can be considered during training.
► The scheme also unifies the representation of numeric, ordinal and nominal values.
► A new neuron-insertion method is devised which prevents generating redundant neurons during growing.
► When applied to data cluster analysis, GMixSOM facilitates obtaining better clustering results.
► Similarity inherent in categorical values is reflected in the projection maps.
Introduction
Large amounts of data are produced daily by millions of transactions and activities in the real world. By means of data mining, one can extract valuable patterns and important clues from massive data for decision making. Hence, data visualization, as part of the data exploration process, has become a beneficial component of data analysis and knowledge discovery [1]. Generally, relationships among high-dimensional data cannot be observed directly by humans unless the data are projected into a low-dimensional space that presents those relationships [1], [2].
The Self-Organizing Map (SOM) [3] is capable of projecting high-dimensional data into a low-dimensional representation space while preserving the topological order of the data. In recent years, many researchers have successfully applied SOM to analyze various data in real-world applications [4], for example database visualization and exploration [5], finding optimal process parameters [6], Web information customization [7], surveillance and human–computer interaction [8], and intrusion detection [9]. Furthermore, several hybrid algorithms take advantage of the SOM to improve overall performance by combining it with other techniques such as fuzzy neighborhood [10], support vector machine and Naïve Bayes [11], genetic algorithms [12], principal component analysis [13], multiple scheduling rules [14], mixture of Gaussians [15], and learning vector quantization [16].
As a variant of the SOM, growing SOMs [17], [18], [19], [20] were proposed to overcome the constraint of a fixed map size in conventional SOMs. The map can grow from a small number of initial neurons to a large map by inserting neurons during training. A dynamic map offers a flexible structure instead of being confined to a predetermined size. Nevertheless, growing SOMs were proposed in the context of numeric data and lack a robust scheme for processing mixed numeric and categorical data. In most cases, 1-of-k coding is adopted, which converts a nominal attribute into a set of binary numeric attributes.
Unfortunately, an inappropriate transformation can result in loss of information and lead to undesired results. Take 1-of-k coding as an example. Assume the drink attribute has a domain of three values {Apple_Juice, Orange_Juice, Coke}. The coding converts the drink attribute to a vector of three binary attributes, say, 〈AppleJ, OrangeJ, Coke〉. The value Apple_Juice is thus encoded as 〈1, 0, 0〉 and Coke as 〈0, 0, 1〉. Under the transformed representation, the similarity or distance between any two of the three values is the same. However, Apple_Juice is intuitively more similar to Orange_Juice than to Coke. The semantics inherent in the nominal values are not preserved by the coding scheme. If such a scheme is adopted, the SOM will fail to reflect the correct topological order implied by the nominal values.
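The equal-distance property of 1-of-k coding can be checked numerically. The following is a minimal sketch rather than code from the paper: the helper names `one_hot` and `euclidean` are illustrative assumptions, and Euclidean distance is assumed as the dissimilarity measure.

```python
import math
from itertools import combinations

# Domain of the drink attribute from the example in the text.
DRINKS = ["Apple_Juice", "Orange_Juice", "Coke"]


def one_hot(value, domain):
    """Return the 1-of-k code of `value` over `domain` (hypothetical helper)."""
    return [1 if v == value else 0 for v in domain]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


codes = {d: one_hot(d, DRINKS) for d in DRINKS}

# Distance between every pair of encoded values.
pair_distances = {
    (a, b): euclidean(codes[a], codes[b]) for a, b in combinations(DRINKS, 2)
}

for (a, b), dist in pair_distances.items():
    print(f"{a} vs {b}: {dist:.4f}")
```

Every pair comes out at the same distance, so the encoding carries no information that Apple_Juice is closer to Orange_Juice than to Coke.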
In this paper, an extended growing SOM model called Growing Mixed-type SOM (GMixSOM) is proposed, which can process mixed-type data while considering the semantics inherent in categorical values. Consequently, the map is able to present a topological order that reflects the semantics of categorical values. The paper is structured as follows. In Section 2, SOM, growing SOM and distance hierarchy are briefly reviewed and the border effect is discussed. In Section 3, the process of GMixSOM and cross insertion are elaborated. In Section 4, experiments on synthetic and real-world datasets are conducted to verify the performance of GMixSOM on mixed-type data. Finally, conclusions are stated in Section 5.
Section snippets
Data visualization
Data visualization facilitates direct observation of the relationships among high-dimensional data. Via data visualization, an analyst can identify all of the promising data partitions that are candidates for further analysis [1]. Dimension reduction, such as Principal Component Analysis (PCA) and Factor Analysis (FA), is a common solution for data visualization. Additionally, principal curves, multidimensional scaling (MDS) and SOM are regarded as a branch of dimension reduction for processing
Distance hierarchy for value similarity
The extended model includes a data structure, the distance hierarchy, which offers two merits: (1) a unified representation of categorical and numeric values, and (2) facilitation of representing and measuring semantic similarity between categorical values.
A distance hierarchy [33] consists of concept nodes, links, and weights as shown in Fig. 1. The similarity between concepts (or points) is measured by the weight of the path between the concepts (or points). Specifically, a point X in a distance
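The idea of measuring similarity by path weight can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the hierarchy below (the intermediate nodes `Juice` and `Carbonated`, the root `Drink`, and all link weights of 1.0) is hypothetical and not taken from Fig. 1, the function names are invented for this example, and only concept nodes are handled, not arbitrary points on a link.

```python
# Hypothetical hierarchy: child -> (parent, weight of the link to the parent).
PARENT = {
    "Juice": ("Drink", 1.0),
    "Carbonated": ("Drink", 1.0),
    "Apple_Juice": ("Juice", 1.0),
    "Orange_Juice": ("Juice", 1.0),
    "Coke": ("Carbonated", 1.0),
}


def weights_to_ancestors(node):
    """Map `node` and each of its ancestors to the accumulated link weight from `node`."""
    dist = {node: 0.0}
    total = 0.0
    while node in PARENT:
        parent, weight = PARENT[node]
        total += weight
        dist[parent] = total
        node = parent
    return dist


def hierarchy_distance(a, b):
    """Weight of the path between concepts `a` and `b` in the hierarchy."""
    up_a = weights_to_ancestors(a)
    up_b = weights_to_ancestors(b)
    common = set(up_a) & set(up_b)
    # With positive link weights, the lowest common ancestor gives the minimum.
    return min(up_a[n] + up_b[n] for n in common)


for pair in [("Apple_Juice", "Orange_Juice"), ("Apple_Juice", "Coke")]:
    print(pair, hierarchy_distance(*pair))
```

Under these assumed weights, the two juices are separated by a shorter path than a juice and Coke, which is the ordering that 1-of-k coding cannot express.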
Performance measures
Several metrics are used to measure the quality of projection, including mean squared error, within-between ratio, entropy and Sammon stress.
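As an illustration of one of these measures, the sketch below computes Sammon stress in its standard form, the sum over pairs of (d*_ij − d_ij)² / d*_ij divided by the sum of d*_ij, where d*_ij is the distance in the original space and d_ij the distance in the projection. The exact variant used in the paper is not shown in this excerpt, and Euclidean distance is an assumption here; for mixed-type data the original-space distance would presumably involve the distance hierarchy. The function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sammon_stress(original, projected):
    """Standard Sammon stress between original points and their projections."""
    if len(original) != len(projected):
        raise ValueError("original and projected must have the same length")
    scale = 0.0   # sum of original pairwise distances
    error = 0.0   # sum of normalized squared distance differences
    for i, j in combinations(range(len(original)), 2):
        d_orig = euclidean(original[i], original[j])
        if d_orig == 0.0:
            continue  # coincident points contribute nothing and would divide by zero
        d_proj = euclidean(projected[i], projected[j])
        scale += d_orig
        error += (d_orig - d_proj) ** 2 / d_orig
    if scale == 0.0:
        raise ValueError("all original points coincide")
    return error / scale


original = [(0, 0, 0), (3, 4, 0), (6, 8, 0)]
faithful = [(0, 0), (3, 4), (6, 8)]   # preserves every pairwise distance
collapsed = [(0,), (0,), (0,)]        # maps every point to the same place

print(sammon_stress(original, faithful))
print(sammon_stress(original, collapsed))
```

A projection that preserves all pairwise distances has zero stress, and larger values indicate greater distortion.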
Conclusions
In this paper, a growing SOM for processing mixed-type data is proposed. The contribution is two-fold. First, a value representation scheme, the distance hierarchy, is integrated into the SOM so that semantic similarity inherent in categorical values can be considered during training. The scheme also unifies the representation of numeric, ordinal and nominal values. Second, a new neuron-insertion method is devised which prevents generating redundant neurons during growing and also helps produce better
Acknowledgement
The work is supported by the National Science Council, Taiwan, under Grant NSC98-2410-H-224-010-MY2.
Wei-Shen Tai received his MS degree from the Department of Information Management, Da-Yeh University, Chang-Hua, Taiwan, in 2001. He has been a doctoral student in the Department of Information Management, National Yunlin University of Science and Technology, Taiwan, since 2005. His research interests include data mining, machine learning, fuzzy set theory and application, multiple criteria decision-making, and decision support systems.
References (42)
- et al. Externally growing self-organizing maps and its application to database visualization and exploration. Applied Soft Computing (2006)
- et al. Quality-oriented optimization of wave soldering process by using self-organizing maps. Applied Soft Computing (2011)
- SOMSE: a semantic map based meta-search engine for the purpose of web information customization. Applied Soft Computing (2011)
- et al. Surveillance and human–computer interaction applications of self-growing models. Applied Soft Computing (2011)
- et al. The use of computational intelligence in intrusion detection systems: a review. Applied Soft Computing (2010)
- et al. A self-organizing map-based initialization for hybrid training of feedforward neural networks. Applied Soft Computing (2011)
- et al. Text mining with emergent self organizing maps and multi-dimensional scaling: a comparative study on domestic violence. Applied Soft Computing (2011)
- et al. Deriving operating policies for multi-objective reservoir systems: application of self-learning genetic algorithm. Applied Soft Computing (2010)
- et al. Combined use of principal component analysis and self organisation map for condition monitoring in pickling process. Applied Soft Computing (2011)
- et al. Study of SOM-based intelligent multi-controller for real-time scheduling. Applied Soft Computing (2011)
- Dynamic topology learning with the probabilistic self-organizing graph. Neurocomputing
- Snap-drift neural network for self-organisation and sequence learning. Neural Networks
- Growing hierarchical tree SOM: an unsupervised neural network with dynamic topology. Neural Networks
- Class structure visualization with semi-supervised growing self-organizing maps. Neurocomputing
- Growing cell structures – a self-organizing network for unsupervised and supervised learning. Neural Networks
- A neural network approach. Neural Networks
- Local and global mappings of topology representing networks. Information Sciences
- Information Visualization in Data Mining and Knowledge Discovery. Data Mining Know It All
- The self-organizing map. Proceedings of the IEEE
- The self-organizing maps: background, theories, extensions and applications. Computational Intelligence: A Compendium
Cited by (16)
Spark-GHSOM: Growing Hierarchical Self-Organizing Map for large scale mixed attribute datasets
2019, Information Sciences
Citation excerpt: This inevitably leads to a loss in precision. Tai and Hsu [46] address the problem of clustering mixed attributes datasets by devising a distance measure which considers information embedded in concept hierarchies, to properly find similarities between the data instances and neurons. Alternatively, SCM models [13] have been proposed to address symbol strings clustering by extracting a lattice of nodes on a 2D map.
Integration of growing self-organizing map and bee colony optimization algorithm for part clustering
2018, Computers and Industrial Engineering
A directed batch growing approach to enhance the topology preservation of self-organizing map
2017, Applied Soft Computing Journal
Citation excerpt: In the GSOM, the weight vector adaptation and neuron insertion steps are similar to IGG but the connections between adjacent neurons are always kept established. Another type of growing SOM for processing mixed-type data is proposed by Hsu et al. [22]. The method named GMix-SOM which design to deal with categorical values in the learning phase.
Adaptive Resonance Theory-based Clustering for Handling Mixed Data
2022, Proceedings of the International Joint Conference on Neural Networks
VaBank: Visual Analytics for Banking Transactions
2020, Proceedings of the International Conference on Information Visualisation
Chung-Chian Hsu received the MS and PhD degrees in computer science from Northwestern University, Evanston IL, USA, in 1988 and 1992, respectively. He joined the Department of Information Management at National Yunlin University of Science and Technology, Taiwan, in 1993. He was the Chairman of the Department from 2000 to 2003. He is currently a professor at the Department. Since 2002, he has also been the director of the Information Systems Division at the Testing Center for Technological and Vocational Education, Taiwan. His research interests include data mining, machine learning, pattern recognition, information retrieval, and decision support systems.