Engineering Applications of Artificial Intelligence
A powerful hybrid clustering method based on modified stem cells and Fuzzy C-means algorithms
Introduction
Data clustering is an important tool in many applications such as data mining (Busygin et al., 2008, Kuo et al., 2011), statistical data analysis (Aitkin and Aitkin, 2011), math programming (Masmoudi et al., 2010), image segmentation (Taherdangkoo et al., 2010) etc. Many approaches have been proposed for data clustering in general and specific applications. One of the most important and applicable clustering algorithms is the K-means algorithm. The algorithm is an unsupervised algorithm that tries to put N data points into K clusters (i.e. ) with randomly initializing K data points (K data points as a set of primary centers). In general, the resulted clusters should satisfy some conditions as follows:where S is the entire dataset. Moreover, there should be at least one point in each cluster, so that:where Ф is an empty set. There are not also any data point jointly existing in two clusters which is expressed as follows:
In K-means algorithm, the clusters are formed such that the existing data in each cluster should have the minimum Euclidean distance to the center of that cluster. The Euclidean distance between which is data point and which is the center of , is defined as:where is the norm of vectors difference. Then, the centers of clusters are recomputed based on the fact that a center is the best represent of data points of a cluster. This process is repeated iteratively and ended when the centers do not change, namely the algorithm converges.
However, the algorithm has some limitations when the datasets are large and it may not work properly. For instance, bad selection of K data points as the initial centers of the clusters may causes that the algorithm does not converge at all. Another fundamental problem is that if the entire set of data points is large, the convergence to local minimums may be long and eventually the response obtained after several repeats may not be the optima response. To overcome these drawbacks, many clustering algorithms have been introduced recently. For instance, the Genetic algorithms have been widely employed for data clustering (Laszlo and Mukherjee, 2007). However, many parameters should be fixed in these algorithms and the convergence is not guaranteed.
Another unsupervised algorithm that attracted the scientists attention is Fuzzy C-mean (FCM) for data clustering (Bezdek et al., 1984, Jain and Dubes, 1988). The algorithm works according to minimizing the objective function that is defined as:
In above equation, is a fuzzy membership function matrix with N rows and C columns in which N is the number of data points and C is the number of clusters. The element in therow andcolumn in , represents the degree of belongingness of the data point to the cluster, is the set of centers of clusters and is the matrix of input data points, m is the fuzzy index that governs the influence of membership grades ( is any real number greater than 1, in general, m is set to 2) andis squared norm distance, which is used for measuring the similarity between and .
In Eq.(5), is a fuzzy membership function whose value is between the interval of [0, 1] and defined as follow:
For the centers of clusters, , we have:
The process of FCM algorithm begins by initializing the centers vectors. Then are computed using Eq. (7). Next, the centers are recomputed using Eq. (8) then the process is repeated with new centers until and for all i and j. ε is a small value (ε is a termination criterion between 0 and 1) and t shows the iteration. Although, the FCM algorithm is efficient; but it is a very sensitive to the selection of initial values and having long convergence in the case of large datasets. To overcome these drawbacks, many clustering algorithms have been introduced recently. For instance, the Genetic algorithms have been widely employed for data clustering (Gan et al., 2009). This paper (Gan et al., 2009), introduced the responses that are expressed as a string of bits and the algorithm starts using a population consisting of a series of initial responses and obtains ideal responses and continues its process until achieving a convergence to a global minimum. It should be noted that the optimization of FCM using Genetic algorithm increases only the speed of reaching final response and improves a little the performance of FCM, although it solves somehow the problem of local minimum convergence.
Data clustering approaches have been also extended by the algorithms based on the social behavior of ants known as ant colony optimization (ACO) algorithms (Kanade and Hall, 2007). Kanade and Hall (2007)) introduced an algorithm by the concept of using the increase of pheromone evaporation rate on data closer to the clusters′ centers and achieved responses far better than previous algorithms. Some researchers have used recently particle swarm optimization (PSO) algorithm for data clustering to overcome the FCM problems (Biswal et al., 2009). For instance, in works (Biswal et al., 2009), data clustering was achieved by optimizing FCM based on PSO. The method demonstrated a performance much better than previous algorithms. In this field, the artificial bee colony (ABC) algorithm have been employed for data clustering (Zhang et al., 2010, Yan et al., 2012).
In this paper, we improve based on new optimization algorithm by integrating FCM algorithm in it in order to improve the performance of data clustering such as high accuracy and high speed of convergence.
Section snippets
Stem cells algorithm
Recently, stem cells algorithm has been introduced as one of the newest optimization algorithms for optimizing numerical functions (Taherdangkoo et al., 2012a, Taherdangkoo et al., 2012b, Taherdangkoo et al., 2013). This algorithm was introduced and implemented on the basis of stem cells behavior in the human body (at the time of entering into the body). In overall, finding injured and weak organs is the main goal of this algorithm. Each of these organs is an optimum solution for solving all
Modified stem cells algorithm
As mentioned in the previous section regarding the main algorithm of stem cells, self-renewal process is done either similarly or mutually. Although this process has some advantages in numerical functions, but by this method can lead to difficulties in applying to multi-objective functions or data classification and also convergence process slows down and the time to reach the optimum solution or ideal classification may increase up. In this work, we significantly improve the algorithm′s
Proposed method for data analysis
In order to obtain a classification with minimum error, the exact centers of clusters are needed. According to this fact that most of the classification algorithms selected the centers of clusters randomly; reaching to a convergence in classification of datasets seems to be important and is the major problem for all the clustering methods, because if datasets are very large the accuracy in reaching to an ideal response decreases. The FCM algorithm is faster than the other algorithms in data
Experimental results
In order to evaluate the performance of the new data clustering algorithm, a set of well-known datasets has been used. They are known as Vowel, Iris, Crude Oil, Control Chart, Wood Defects, Wine, Nomao, University, and Seeds (Blake et al., 1998). Table 1 shows the characteristics of each dataset including the number of objects, features and classes. We have also compared the obtained results using the proposed algorithm with those using standard K-mean algorithm, FCM algorithm, ACDE, GCUK, and
Discussion
Many clustering methods are composed of assumptions that should be taken into consideration prior to the implementation, for instance information about the clusters structure. Therefore the structure and performance of the presented clustering algorithms and comparison of them with the proposed algorithm are discussed in this section thoroughly.
K-means algorithm′s structure is very simple and is initiated with a bunch of data randomly selection as a centers. As observable in the figures of this
Conclusion
In this paper, we have presented a novel method for data clustering based on one of the newest optimization algorithms, i.e. modified Stem Cells algorithm and on using the principles of FCM algorithm. The important part of the proposed method is the ability of automatic inspection of the optimum number of clusters in a large-scale dataset. Therefore some new large datasets have been used for evaluating the performance of the proposed method (SC-FCM) and comparing with other clustering
References (21)
- et al.
Genetic clustering for automatic evolution of clusters and application to image classification
Pattern Recognition
(2002) - et al.
FCM: The Fuzzy C-means clustering algorithm
Comput. Geosci.
(1984) - et al.
Biclustering in data mining
Comput. Oper. Res.
(2008) - et al.
A genetic fuzzy k-modes algorithm for clustering categorical data
Expert Syst. Appl. I
(2009) - et al.
A genetic algorithm that exchanges neighboring centers for K-means clustering
Pattern Recognition Lett.
(2007) - et al.
A new approach for data clustering using hybrid artificial bee colony algorithm
Neurocomputing
(2012) - et al.
An artificial bee colony approach for clustering
Expert Syst. Appl.
(2010) - et al.
Theories of data analysis and statistical inference
In: Statistical Modeling of the National Assessment of Educational Progress. Springer, New York, pp.
(2011) - et al.
A point symmetry based clustering technique for automatic evolution of clusters
IEEE Trans. Knowl. Data Eng.
(2008) - et al.
Power quality disturbance classification using Fuzzy C-means algorithm and adaptive particle swarm optimization
IEEE Trans. Ind. Electron.
(2009)
Cited by (29)
A data clustering approach based on universal gravity rule
2015, Engineering Applications of Artificial IntelligenceCitation Excerpt :However, the initial state may cause the algorithm to trap into a local optimum, therewith affecting the quality of the final solution. Many studies have been made to overcome this drawback of the K-means algorithm, particularly by using fuzzy set theory and evolutionary algorithms (Taherdangkoo and Bagheri, 2013; Niknam and Amiri, 2010). Fuzzy C-means (FCM) algorithm is an improvement over K-means clustering.
A dynamic shuffled differential evolution algorithm for data clustering
2015, NeurocomputingStochastic numerical treatment for thin film flow of third grade fluid using unsupervised neural networks
2015, Journal of the Taiwan Institute of Chemical EngineersCitation Excerpt :Discrete and continuous versions of the algorithms have been developed along with many variants, and these are broadly used as expert systems for constrained and unconstrained optimization problems. A few examples in which PSO methods have performed exceedingly well include prediction of flammability limit temperatures from molecular structures [41], reconfiguration of shipboard power system [42], estimation of product quality with high reliability [43], powerful hybrid clustering and fuzzy C-mean algorithm [44], and automatic detection of erythemato-squamous diseases [45] etc. Keeping in view the strengths of the PSO and SQP algorithms, the intention is to use these schemes along with their hybrid approach referred to as PSO–SQP, for training the weights of the neural networks.
A new systematic firefly algorithm for forecasting the durability of reinforced recycled aggregate concrete
2022, Frontiers of Structural and Civil Engineering