Applied Soft Computing

Volume 12, Issue 9, September 2012, Pages 2856-2866
Growing Self-Organizing Map with cross insert for mixed-type data clustering

https://doi.org/10.1016/j.asoc.2012.04.004

Abstract

The Self-Organizing Map (SOM) is effective at visualizing high-dimensional data and therefore has numerous applications in visualized clustering. Many growing SOMs have been proposed to overcome the fixed map size of conventional SOMs. However, most growing SOMs lack a robust way to process mixed-type data, which may include numeric, ordinal and categorical values in a dataset. Moreover, the growing scheme affects the quality of the resultant maps. In this paper, we propose a Growing Mixed-type SOM (GMixSOM), which combines a value representation mechanism, the distance hierarchy, with a novel growing scheme to tackle the problem of analyzing mixed-type data and to improve the quality of the projection map. Experimental results on synthetic and real-world datasets demonstrate that the proposed mechanism is feasible and that the growing scheme yields better projection maps than the existing method.

Highlights

► Semantic similarity inherent in categorical values can be considered during training.
► The scheme also unifies the representation of numeric, ordinal and nominal values.
► A new neuron-insertion method is devised which prevents the generation of redundant neurons during growing.
► When applied to data cluster analysis, GMixSOM helps obtain better clustering results.
► Similarity inherent in categorical values is reflected in the projection maps.

Introduction

Large amounts of data are produced daily by millions of transactions and activities in the real world. By means of data mining, one can extract valuable patterns and important clues from massive data for decision making. Hence, data visualization, as part of the data exploration process, has become a beneficial component in data analysis and knowledge discovery [1]. Generally, relationships among high-dimensional data cannot be observed directly by humans unless the data are projected into a low-dimensional space that presents those relationships [1], [2].

The Self-Organizing Map (SOM) [3] is capable of projecting high-dimensional data into a low-dimensional representation space while preserving the topological order in the data. In recent years, many researchers have successfully applied SOM to analyze various data in real-world applications [4], for example, database visualization and exploration [5], finding optimal process parameters [6], Web information customization [7], surveillance and human–computer interaction [8], and intrusion detection [9]. Furthermore, several hybrid algorithms have taken advantage of the SOM to improve overall performance by combining it with a variety of techniques such as fuzzy neighborhood [10], support vector machine and Naïve Bayes [11], genetic algorithms [12], principal component analysis [13], multiple scheduling rules [14], mixture of Gaussians [15], and learning vector quantization [16].

As a variant of the SOM, growing SOMs [17], [18], [19], [20] were proposed to overcome the constraint of a fixed map size in conventional SOMs. The map can grow from a small number of initial neurons to a large map by inserting neurons during training. A dynamic map offers a flexible structure instead of being confined to a predetermined size. Nevertheless, the growing SOMs were proposed in the context of numeric data and lack a robust scheme for processing mixed numeric and categorical data. In most cases, 1-of-k coding is adopted, which converts a nominal attribute into a set of binary numeric attributes.

Unfortunately, an inappropriate transformation can result in loss of information and lead to undesired results. Let us take 1-of-k coding as an example. Assume the drink attribute has a domain of three values {Apple_Juice, Orange_Juice, Coke}. The coding converts the drink attribute to a vector of three binary attributes, say, 〈AppleJ, OrangeJ, Coke〉. The value Apple_Juice is thus encoded by 〈1, 0, 0〉 while Coke is encoded by 〈0, 0, 1〉. Under the transformed representation, the similarity or distance between any two of the three values is the same. However, Apple_Juice is intuitively more similar to Orange_Juice than to Coke. The semantics inherent in the nominal values are not preserved by the coding scheme. If such a scheme is adopted, SOM will fail to reflect the correct topological order implied by the nominal values.
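The loss of semantics under 1-of-k coding can be checked numerically. The following is a minimal sketch using the drink example above; the helper names `one_of_k` and `euclidean` are illustrative and do not come from the paper, and Euclidean distance is assumed as the dissimilarity measure.

```python
import math

# Domain of the nominal "drink" attribute from the example above.
DRINKS = ["Apple_Juice", "Orange_Juice", "Coke"]


def one_of_k(value, domain):
    """Encode a nominal value as a binary indicator vector (1-of-k coding)."""
    return [1 if value == v else 0 for v in domain]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


codes = {d: one_of_k(d, DRINKS) for d in DRINKS}

# Distance for every unordered pair of distinct drink values.
pairwise = {
    (a, b): euclidean(codes[a], codes[b])
    for i, a in enumerate(DRINKS)
    for b in DRINKS[i + 1:]
}

for pair, dist in pairwise.items():
    print(pair, round(dist, 4))
```

Every pair ends up at the same distance, so a map trained on these vectors has no basis for placing Apple_Juice closer to Orange_Juice than to Coke.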

In this paper, an extended growing SOM model, called Growing Mixed-type SOM (GMixSOM), is proposed which can process mixed-type data with consideration of the semantics inherent in categorical values. Consequently, the map is able to present a topological order reflecting the semantics of categorical values. The paper is structured as follows. In Section 2, SOM, growing SOM and distance hierarchy are briefly reviewed and the border effect is discussed. In Section 3, the process of GMixSOM and cross insertion are elaborated. In Section 4, experiments on synthetic and real-world datasets are conducted to verify the performance of GMixSOM for mixed-type data. Finally, conclusions are stated in Section 5.

Section snippets

Data visualization

Data visualization facilitates direct observation of the relationships among high-dimensional data. Via data visualization, an analyst can identify all of the promising data partitions which are candidates for further analysis [1]. Dimension reduction, such as Principal Component Analysis (PCA) and Factor Analysis (FA), is a common solution for data visualization. Additionally, principal curves, Multidimensional Scaling (MDS) and SOM are regarded as a branch of dimension reduction for processing

Distance hierarchy for value similarity

The extended model includes a data structure, the distance hierarchy, which offers two merits: (1) a unified representation of categorical and numeric values, and (2) facilitation of representing and measuring semantic similarity between categorical values.

A distance hierarchy [33] consists of concept nodes, links, and weights as shown in Fig. 1. The similarity between concepts (or points) is measured by the weight of the path between the concepts (or points). Specifically, a point X in a distance
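As a rough illustration of measuring similarity by path weight, the sketch below builds a small weighted concept tree and sums the link weights along the path between two concept nodes. It is a simplification under stated assumptions: the hierarchy, the intermediate concepts `Juice` and `Carbonated`, the root `Any`, and the unit weights are all hypothetical, and only concept nodes are handled, not the points positioned along links that the paper goes on to define.

```python
# Hypothetical distance hierarchy: child -> (parent, weight of the link).
# Concept names and unit weights are illustrative, not taken from the paper.
PARENT = {
    "Juice": ("Any", 1.0),
    "Carbonated": ("Any", 1.0),
    "Apple_Juice": ("Juice", 1.0),
    "Orange_Juice": ("Juice", 1.0),
    "Coke": ("Carbonated", 1.0),
}


def path_to_root(node):
    """Map each ancestor of node (and node itself) to its accumulated path weight."""
    weights = {node: 0.0}
    total = 0.0
    while node in PARENT:
        node, link_weight = PARENT[node]
        total += link_weight
        weights[node] = total
    return weights


def hierarchy_distance(a, b):
    """Total weight of the path between concepts a and b in the tree.

    With positive link weights, the cheapest shared ancestor is the lowest
    common ancestor, so the minimum over shared ancestors is the path weight.
    """
    up_a = path_to_root(a)
    up_b = path_to_root(b)
    shared = set(up_a) & set(up_b)
    return min(up_a[n] + up_b[n] for n in shared)


print(hierarchy_distance("Apple_Juice", "Orange_Juice"))
print(hierarchy_distance("Apple_Juice", "Coke"))
```

In this toy hierarchy the two juices meet at `Juice` while a juice and Coke only meet at the root, so the juices come out closer to each other, which is the ordering that 1-of-k coding cannot express.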

Performance measures

Several metrics are used to measure the quality of projection, including mean squared error, within-between ratio, entropy and Sammon stress.
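Of these metrics, Sammon stress has a widely used standard form: the squared differences between original and projected pairwise distances, each divided by the original distance, summed and normalized by the sum of original distances. The sketch below implements that standard form in plain Python. Whether the paper uses exactly this variant is not visible in this excerpt, and the function names and the choice of Euclidean distance in both spaces are assumptions.

```python
import math


def pairwise_distances(points):
    """Full matrix of Euclidean distances between all points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def sammon_stress(high, low):
    """Standard Sammon stress between original points and their projection.

    high: points in the original space; low: the same points, in the same
    order, in the projected space. Pairs whose original distance is zero are
    skipped to avoid division by zero (an assumption of this sketch).
    """
    if len(high) != len(low):
        raise ValueError("high and low must contain the same number of points")
    d_high = pairwise_distances(high)
    d_low = pairwise_distances(low)
    total = 0.0
    error = 0.0
    for i in range(len(high)):
        for j in range(i + 1, len(high)):
            if d_high[i][j] == 0.0:
                continue
            total += d_high[i][j]
            error += (d_high[i][j] - d_low[i][j]) ** 2 / d_high[i][j]
    if total == 0.0:
        raise ValueError("all original points coincide; stress is undefined")
    return error / total


original = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
faithful = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # preserves every distance
collapsed = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]  # maps everything to one point

print(sammon_stress(original, faithful))
print(sammon_stress(original, collapsed))
```

A projection that preserves all pairwise distances gives a stress of zero, and lower values indicate a more faithful projection.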

Conclusions

In this paper, a growing SOM for processing mixed-type data is proposed. The contribution is two-fold. First, a value representation scheme, the distance hierarchy, is integrated into the SOM so that semantic similarity inherent in categorical values can be considered during training. The scheme also unifies the representation of numeric, ordinal and nominal values. Second, a new neuron-insertion method is devised which prevents the generation of redundant neurons during growing and also helps produce better

Acknowledgement

The work is supported by National Science Council, Taiwan under Grant NSC98-2410-H-224-010-MY2.

Wei-Shen Tai received his MS degree from the Department of Information Management, Da-Yeh University, Chang-Hua, Taiwan, in 2001. He has been a doctoral student in the Department of Information Management, National Yunlin University of Science and Technology, Taiwan, since 2005. His research interests include data mining, machine learning, fuzzy set theory and application, multiple criteria decision-making, and decision support systems.

References (42)

  • E. López-Rubio et al. Dynamic topology learning with the probabilistic self-organizing graph. Neurocomputing (2011)
  • D. Palmer-Brown et al. Snap-drift neural network for self-organisation and sequence learning. Neural Networks (2011)
  • A. Forti et al. Growing hierarchical tree SOM: an unsupervised neural network with dynamic topology. Neural Networks (2006)
  • A. Hsu et al. Class structure visualization with semi-supervised growing self-organizing maps. Neurocomputing (2008)
  • B. Fritzke. Growing cell structures – a self-organizing network for unsupervised and supervised learning. Neural Networks (1994)
  • K.L. Du et al. A neural network approach. Neural Networks (2010)
  • A. Vathy-Fogarassy et al. Local and global mappings of topology representing networks. Information Sciences (2009)
  • U. Fayyad et al. Information Visualization in Data Mining and Knowledge Discovery (2001)
  • S. Chakrabarti et al. Data Mining Know It All (2009)
  • T. Kohonen. The self-organizing map. Proceedings of the IEEE (1990)
  • H. Yin. The self-organizing maps: background, theories, extensions and applications. Computational Intelligence: A Compendium (2008)

Cited by (16)

  • Spark-GHSOM: Growing Hierarchical Self-Organizing Map for large scale mixed attribute datasets. Information Sciences (2019)

    Citation excerpt: This inevitably leads to a loss in precision. Tai and Hsu [46] address the problem of clustering mixed attributes datasets by devising a distance measure which considers information embedded in concept hierarchies, to properly find similarities between the data instances and neurons. Alternatively, SCM models [13] have been proposed to address symbol strings clustering by extracting a lattice of nodes on a 2D map.

  • A directed batch growing approach to enhance the topology preservation of self-organizing map. Applied Soft Computing Journal (2017)

    Citation excerpt: In the GSOM, the weight vector adaptation and neuron insertion steps are similar to IGG but the connections between adjacent neurons are always kept established. Another type of growing SOM for processing mixed-type data is proposed by Hsu et al. [22]. The method named GMix-SOM which design to deal with categorical values in the learning phase.

  • Adaptive Resonance Theory-based Clustering for Handling Mixed Data. Proceedings of the International Joint Conference on Neural Networks (2022)
  • VaBank: Visual Analytics for Banking Transactions. Proceedings of the International Conference on Information Visualisation (2020)

Chung-Chian Hsu received the MS and PhD degrees in computer science from Northwestern University, Evanston IL, USA, in 1988 and 1992, respectively. He joined the Department of Information Management at National Yunlin University of Science and Technology, Taiwan, in 1993. He was the Chairman of the Department from 2000 to 2003. He is currently a professor at the Department. Since 2002, he has also been the director of the Information Systems Division at the Testing Center for Technological and Vocational Education, Taiwan. His research interests include data mining, machine learning, pattern recognition, information retrieval, and decision support systems.
