A powerful hybrid clustering method based on modified stem cells and Fuzzy C-means algorithms

https://doi.org/10.1016/j.engappai.2013.03.002Get rights and content

Highlights

  • • We modified the stem cells algorithm (SCA) and used FCM for data clustering.

  • • This algorithm, not only helps the FCM clustering escape from local optima.

  • • This algorithm overcomes the shortcoming of the slow convergence speed of the SCA.

  • • The performance of SC-FCM is compared with the six clustering methods on six datasets.

  • • Experiment results indicate the superiority of the SC-FCM algorithm.

Abstract

One of the simple techniques for Data Clustering is based on Fuzzy C-means (FCM) clustering which describes the belongingness of each data to a cluster by a fuzzy membership function instead of a crisp value. However, the results of fuzzy clustering depend highly on the initial state selection and there is also a high risk for getting the best results when the datasets are large. In this paper, we present a hybrid algorithm based on FCM and modified stem cells algorithms, we called it SC-FCM algorithm, for optimum clustering of a dataset into K clusters. The experimental results obtained by using the new algorithm on different well-known datasets compared with those obtained by K-means algorithm, FCM, Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), Artificial Bee Colony (ABC) Algorithm demonstrate the better performance of the new algorithm.

Introduction

Data clustering is an important tool in many applications such as data mining (Busygin et al., 2008, Kuo et al., 2011), statistical data analysis (Aitkin and Aitkin, 2011), math programming (Masmoudi et al., 2010), image segmentation (Taherdangkoo et al., 2010) etc. Many approaches have been proposed for data clustering in general and specific applications. One of the most important and applicable clustering algorithms is the K-means algorithm. The algorithm is an unsupervised algorithm that tries to put N data points into K clusters (i.e. C1,C2,....,Ck) with randomly initializing K data points (K data points as a set of primary centers). In general, the resulted clusters should satisfy some conditions as follows:i=1KCi=Swhere S is the entire dataset. Moreover, there should be at least one point in each cluster, so that:Ck|k=1,...,KΦwhere Ф is an empty set. There are not also any data point jointly existing in two clusters which is expressed as follows:CkCj=Φ|kj

In K-means algorithm, the clusters are formed such that the existing data in each cluster should have the minimum Euclidean distance to the center of that cluster. The Euclidean distance between xiwhich is ith data point and Zjwhich is the center of Cj, is defined as:E=xiZjwhere . is the norm of vectors difference. Then, the centers of clusters are recomputed based on the fact that a center is the best represent of data points of a cluster. This process is repeated iteratively and ended when the centers do not change, namely the algorithm converges.

However, the algorithm has some limitations when the datasets are large and it may not work properly. For instance, bad selection of K data points as the initial centers of the clusters may causes that the algorithm does not converge at all. Another fundamental problem is that if the entire set of data points is large, the convergence to local minimums may be long and eventually the response obtained after several repeats may not be the optima response. To overcome these drawbacks, many clustering algorithms have been introduced recently. For instance, the Genetic algorithms have been widely employed for data clustering (Laszlo and Mukherjee, 2007). However, many parameters should be fixed in these algorithms and the convergence is not guaranteed.

Another unsupervised algorithm that attracted the scientists attention is Fuzzy C-mean (FCM) for data clustering (Bezdek et al., 1984, Jain and Dubes, 1988). The algorithm works according to minimizing the objective function that is defined as:Jm(U,V,X)=i=1Nj=1Cuijmd2(xi,vj)

In above equation,U is a fuzzy membership function matrix with N rows and C columns in which N is the number of data points and C is the number of clusters. The element uijin theithrow andjthcolumn in U, represents the degree of belongingness of the ith data point to the jth cluster, V={v1,v2,...,vj...,vC} is the set of centers of clusters and X={x1,x2,...,xi....,xN} is the matrix of input data points, m is the fuzzy index that governs the influence of membership grades (m is any real number greater than 1, in general, m is set to 2) andd2(xi,vj)is squared norm distance, which is used for measuring the similarity betweenxi and vj.d2(xi,vj)=xivj2

In Eq.(5), uij is a fuzzy membership function whose value is between the interval of [0, 1] and defined as follow:uij=1k=1C(d(xi,vj)d(xi,vk))2m1j=1Cuij=1,i=1,...,N

For the centers of clusters, vj, we have:vj=i=1N(uij)mxii=1N(uij)mj=1,...,C1<m<Ν

The process of FCM algorithm begins by initializing the centers vectors. Then uij are computed using Eq. (7). Next, the centers are recomputed using Eq. (8) then the process is repeated with new centers until |vjt+1vjt|<ε and |uijt+1uijt|<ε for all i and j. ε is a small value (ε is a termination criterion between 0 and 1) and t shows the iteration. Although, the FCM algorithm is efficient; but it is a very sensitive to the selection of initial values and having long convergence in the case of large datasets. To overcome these drawbacks, many clustering algorithms have been introduced recently. For instance, the Genetic algorithms have been widely employed for data clustering (Gan et al., 2009). This paper (Gan et al., 2009), introduced the responses that are expressed as a string of bits and the algorithm starts using a population consisting of a series of initial responses and obtains ideal responses and continues its process until achieving a convergence to a global minimum. It should be noted that the optimization of FCM using Genetic algorithm increases only the speed of reaching final response and improves a little the performance of FCM, although it solves somehow the problem of local minimum convergence.

Data clustering approaches have been also extended by the algorithms based on the social behavior of ants known as ant colony optimization (ACO) algorithms (Kanade and Hall, 2007). Kanade and Hall (2007)) introduced an algorithm by the concept of using the increase of pheromone evaporation rate on data closer to the clusters′ centers and achieved responses far better than previous algorithms. Some researchers have used recently particle swarm optimization (PSO) algorithm for data clustering to overcome the FCM problems (Biswal et al., 2009). For instance, in works (Biswal et al., 2009), data clustering was achieved by optimizing FCM based on PSO. The method demonstrated a performance much better than previous algorithms. In this field, the artificial bee colony (ABC) algorithm have been employed for data clustering (Zhang et al., 2010, Yan et al., 2012).

In this paper, we improve based on new optimization algorithm by integrating FCM algorithm in it in order to improve the performance of data clustering such as high accuracy and high speed of convergence.

Section snippets

Stem cells algorithm

Recently, stem cells algorithm has been introduced as one of the newest optimization algorithms for optimizing numerical functions (Taherdangkoo et al., 2012a, Taherdangkoo et al., 2012b, Taherdangkoo et al., 2013). This algorithm was introduced and implemented on the basis of stem cells behavior in the human body (at the time of entering into the body). In overall, finding injured and weak organs is the main goal of this algorithm. Each of these organs is an optimum solution for solving all

Modified stem cells algorithm

As mentioned in the previous section regarding the main algorithm of stem cells, self-renewal process is done either similarly or mutually. Although this process has some advantages in numerical functions, but by this method can lead to difficulties in applying to multi-objective functions or data classification and also convergence process slows down and the time to reach the optimum solution or ideal classification may increase up. In this work, we significantly improve the algorithm′s

Proposed method for data analysis

In order to obtain a classification with minimum error, the exact centers of clusters are needed. According to this fact that most of the classification algorithms selected the centers of clusters randomly; reaching to a convergence in classification of datasets seems to be important and is the major problem for all the clustering methods, because if datasets are very large the accuracy in reaching to an ideal response decreases. The FCM algorithm is faster than the other algorithms in data

Experimental results

In order to evaluate the performance of the new data clustering algorithm, a set of well-known datasets has been used. They are known as Vowel, Iris, Crude Oil, Control Chart, Wood Defects, Wine, Nomao, University, and Seeds (Blake et al., 1998). Table 1 shows the characteristics of each dataset including the number of objects, features and classes. We have also compared the obtained results using the proposed algorithm with those using standard K-mean algorithm, FCM algorithm, ACDE, GCUK, and

Discussion

Many clustering methods are composed of assumptions that should be taken into consideration prior to the implementation, for instance information about the clusters structure. Therefore the structure and performance of the presented clustering algorithms and comparison of them with the proposed algorithm are discussed in this section thoroughly.

K-means algorithm′s structure is very simple and is initiated with a bunch of data randomly selection as a centers. As observable in the figures of this

Conclusion

In this paper, we have presented a novel method for data clustering based on one of the newest optimization algorithms, i.e. modified Stem Cells algorithm and on using the principles of FCM algorithm. The important part of the proposed method is the ability of automatic inspection of the optimum number of clusters in a large-scale dataset. Therefore some new large datasets have been used for evaluating the performance of the proposed method (SC-FCM) and comparing with other clustering

References (21)

There are more references available in the full text version of this article.

Cited by (29)

  • A data clustering approach based on universal gravity rule

    2015, Engineering Applications of Artificial Intelligence
    Citation Excerpt :

    However, the initial state may cause the algorithm to trap into a local optimum, therewith affecting the quality of the final solution. Many studies have been made to overcome this drawback of the K-means algorithm, particularly by using fuzzy set theory and evolutionary algorithms (Taherdangkoo and Bagheri, 2013; Niknam and Amiri, 2010). Fuzzy C-means (FCM) algorithm is an improvement over K-means clustering.

  • Stochastic numerical treatment for thin film flow of third grade fluid using unsupervised neural networks

    2015, Journal of the Taiwan Institute of Chemical Engineers
    Citation Excerpt :

    Discrete and continuous versions of the algorithms have been developed along with many variants, and these are broadly used as expert systems for constrained and unconstrained optimization problems. A few examples in which PSO methods have performed exceedingly well include prediction of flammability limit temperatures from molecular structures [41], reconfiguration of shipboard power system [42], estimation of product quality with high reliability [43], powerful hybrid clustering and fuzzy C-mean algorithm [44], and automatic detection of erythemato-squamous diseases [45] etc. Keeping in view the strengths of the PSO and SQP algorithms, the intention is to use these schemes along with their hybrid approach referred to as PSO–SQP, for training the weights of the neural networks.

View all citing articles on Scopus
View full text