A data clustering approach based on universal gravity rule
Introduction
Generally, there is no commonly agreed definition of data clustering, but most researchers describe a cluster through its internal homogeneity and external separation, i.e., data points in the same cluster should be similar (or related) to each other, and different from (or unrelated to) the data points in other clusters. Both similarity and dissimilarity should be examinable in a clear and meaningful way. Clustering algorithms are usually unsupervised, and they are mostly used in the fields of machine learning, data mining, pattern recognition, image analysis and bioinformatics (Jain et al., 1999, Hammouda, 2011).
The common core of all clustering algorithms is to find the representatives of clusters, i.e., the cluster centroids for compact clusters. A clustering algorithm decides which cluster each input data point belongs to (i.e., which centroid it is closest to). Some clustering techniques partition the data points into a given number of clusters, e.g., K-means (Hartigan and Wong, 1978) and fuzzy C-means (Bezdek et al., 1984). In other cases the number of clusters is not known a priori. Such an algorithm starts by finding the largest cluster first, then finds the second, and so on (Yager and Filev, 1994, Chiu, 1995, Wu and Yang, 2002).
The well-known K-means algorithm is one of the most widely used clustering algorithms due to its efficiency and simplicity. It measures the distance between the clusters' representatives (centroids) and the data points to partition the data into K clusters. In most cases, the Euclidean distance is used as the dissimilarity measure. To find the best positions of the representatives, the K-means algorithm minimizes a cost function of the data variations around the centroids. However, the initial state may cause the algorithm to become trapped in a local optimum, thereby affecting the quality of the final solution. Many studies have tried to overcome this drawback of the K-means algorithm, particularly by using fuzzy set theory and evolutionary algorithms (Taherdangkoo and Bagheri, 2013, Niknam and Amiri, 2010).
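As an illustrative sketch of the alternating assignment/update procedure just described (not the exact implementation compared in this paper; the function name and parameters are ours):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means: minimizes the sum of squared Euclidean
    distances between points and their nearest centroid."""
    rng = np.random.default_rng(seed)
    # Random initial centroids; a poor draw here can trap the
    # algorithm in a local optimum, as noted in the text.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```

The cost function decreases monotonically under these two steps, but only to a local optimum that depends on the initial draw.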
The fuzzy C-means (FCM) algorithm is an improvement over K-means clustering. In this algorithm, each data point belongs to every cluster with a degree of membership. Similar to the K-means algorithm, FCM relies on minimizing a cost function of the dissimilarity measure to find the centroids. FCM uses the concepts of fuzzy set theory to handle the uncertainty associated with the data to be clustered (Gustafson and Kessel, 1979, Pal et al., 2005, Zarandi et al., 2009).
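A minimal sketch of one FCM iteration, using the standard membership and centroid update equations with fuzzifier m; the helper name and the small eps guard are our additions:

```python
import numpy as np

def fcm_step(X, centroids, m=2.0, eps=1e-9):
    """One iteration of fuzzy C-means: update memberships from the
    current centroids, then recompute centroids as membership-weighted
    means (the standard FCM update equations)."""
    # Distances from every point to every centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + eps
    # Membership of point i in cluster j: rows sum to 1, closer
    # centroids receive higher membership.
    inv = d ** (-2.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)
    # Centroids as fuzzy (membership**m)-weighted means of the data.
    W = U ** m
    new_centroids = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, new_centroids
```

Iterating this step until the memberships stabilize minimizes the fuzzy cost function.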
The mountain clustering algorithm calculates a mountain function (density function) at every possible position in the data space, and chooses the position with the greatest density value as the center of the first cluster. Then, it removes the effect of the first cluster from the mountain function and finds the second cluster center (Moertini, 2002). This process is repeated until the desired number of clusters is found. Subtractive clustering is similar to mountain clustering, except that it evaluates the density function only at the data point positions, which reduces the number of calculations significantly (Kim et al., 2005). This means that the computation depends on the problem size instead of the problem dimension.
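The density function of subtractive clustering, evaluated at the data points themselves, can be sketched as below; the neighbourhood radius ra and the function name are illustrative choices, following the Gaussian form popularized by Chiu (1995):

```python
import numpy as np

def subtractive_density(X, ra=1.0):
    """Density (potential) of each data point: points surrounded by
    many close neighbours score high. The highest-scoring point is
    taken as the first cluster center; its influence is then
    subtracted before selecting the next center."""
    # Squared pairwise distances between all data points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Gaussian kernel with radius ra/2, summed over all neighbours.
    return np.exp(-d2 / (ra / 2.0) ** 2).sum(axis=1)
```

Because the density is computed only at the n data points, the cost scales with n rather than with a grid over the feature space.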
Each clustering algorithm has its advantages and disadvantages in clustering different types of data points. Therefore, many clustering methods have been proposed to rectify the disadvantages of these algorithms. Also, some researchers have tried to propose new algorithms inspired by nature. In this paper, a nature-inspired clustering algorithm is introduced by employing the Newtonian law of gravity. Gravity-based clustering algorithms are not new; their history dates back to the 1970s. In the following, related works are first reviewed and then the proposed work is described.
Recently, more and more attention has been focused on using nature-inspired algorithms to solve clustering problems (Nanda and Panda, 2014). Moreover, many clustering algorithms have been built by hybridizing different types of evolutionary algorithms with the K-means algorithm to overcome its disadvantages. Evolutionary algorithms, such as the genetic algorithm (GA) (Maulik and Bandyopadhyay, 2000), ant colony optimization (ACO) (Kashef and Nezamabadi-pour, 2014), particle swarm optimization (PSO) (Niknam and Amiri, 2010), and the gravitational search algorithm (GSA) (Rashedi and Nezamabadi-Pour, 2013, Dowlatshahi and Nezamabadi-pour, 2014), are usually nature inspired. Some of these methods are reviewed in Yazdani et al. (2014).
Various applications of the gravity theory in solving problems of different areas have been considered including image edge detection (Sun et al., 2007, Deregeh and Nezamabadi-pour, 2014), data classification (Shafigh et al., 2013, Rezaei and Nezamabadi-pour, 2015), optimization (Soleimanpour-Moghadam et al., 2014), and data clustering (Sanchez et al., 2014, Wright, 1977, Yung and Lai, 1998). Gravity-based clustering algorithms simulate the process of the attraction and merging of objects by their gravity forces. To realize data clustering, these algorithms consider each data point as an object and assign a mass to it.
The first version of gravitational clustering algorithm was proposed by Wright (1977). This algorithm uses a Markovian model for the gravitational clustering. It is an incremental algorithm that updates the position of each data point in each iteration. How the objects are joined is determined by the continuous motion of all objects in the system according to gravitational forces. In this method, objects do not converge together but rather converge to “equilibrium” positions (Wright, 1977).
Yung and Lai (1998) present a Markovian model of clustering based on gravitational concepts. The model is used for color image segmentation in the RGB color space. In the clustering process, each pixel is considered as an object with unit mass. All objects apply gravitational force to each other. The gravitational attraction between two objects i and j is defined as

F_ij = G · (m_i · m_j / ‖x_j − x_i‖²) · (x_j − x_i) / ‖x_j − x_i‖,    (1)

where x_i and x_j are the D-dimensional position vectors of objects i and j, respectively, m_i and m_j are their masses, G is the universal gravitational constant, and ‖·‖ is the Euclidean norm. This force causes the objects (data points) to move in the space. Two objects that move to the same location are merged into a new object whose mass is the sum of the two merged masses.
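The pairwise attraction described above can be transcribed directly; G and the masses are free parameters here, and the small eps guard against division by zero is our addition:

```python
import numpy as np

def gravity_force(xi, xj, mi=1.0, mj=1.0, G=1.0, eps=1e-12):
    """Gravitational force exerted on object i by object j:
    magnitude G*mi*mj/r**2, directed along the unit vector from
    i toward j."""
    d = xj - xi
    r = np.linalg.norm(d) + eps  # eps avoids division by zero
    return G * mi * mj * d / r ** 3
```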
A clustering method based on the notion of an attraction force between each pair of data points has been presented by Kundu (1999). The clusters are formed by allowing each data point to move slowly under the resultant effect of all forces exerted on it, and by merging two data points when they become close to each other. When two or more data points (clusters) are merged, the sum of their masses becomes the mass of the resulting cluster; therefore, the mass of each cluster equals the number of its data instances (Kundu, 1999).
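The merge step described here can be sketched as follows; the summed mass comes from the text, while taking the merged position as the mass-weighted mean is a common choice we assume for illustration:

```python
import numpy as np

def merge_close(points, masses, merge_dist=0.1):
    """Merge any pair of objects closer than merge_dist: the merged
    object carries the summed mass and the mass-weighted position,
    so a cluster's mass equals the number of original points in it."""
    pts, ms = list(points), list(masses)
    i = 0
    while i < len(pts):
        j = i + 1
        while j < len(pts):
            if np.linalg.norm(pts[i] - pts[j]) < merge_dist:
                total = ms[i] + ms[j]
                # Center of mass of the merged pair.
                pts[i] = (ms[i] * pts[i] + ms[j] * pts[j]) / total
                ms[i] = total
                pts.pop(j); ms.pop(j)
            else:
                j += 1
        i += 1
    return np.array(pts), np.array(ms)
```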
In the work done by Gomez et al. (2004), a modified version of Newton's law of gravity is used. This modification simplifies the calculation of the gravitational force, where Eq. (2) is used for moving a data point according to the gravitational law,

x(t+1) = x(t) + G · d / ‖d‖³,    (2)

where d = y − x is the vector from the data point x toward another point y, x(t) is the position of the data point at time t, and ‖·‖ is the Euclidean norm. The value of the universal gravitational constant, G, is reduced after each iteration by a decreasing function such as G(t+1) = (1 − ΔG) · G(t), which serves to eliminate the big crunch effect, i.e., it acts as a mechanism that prevents the process from ending up with only one cluster. Merging is governed by a rough estimate of the maximum distance between closest points, on the order of n^(−1/D) for n data points in the D-dimensional [0,1] Euclidean space (Gomez et al., 2004).
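The move-and-decay mechanism can be sketched as below; the step form d/‖d‖³ (unit masses) and the linear decay factor follow our reading of the simplified law, and the names are ours:

```python
import numpy as np

def gomez_move(x, others, G):
    """Move data point x one step under the simplified gravitational
    law: each other point y contributes a pull d/||d||**3, d = y - x."""
    step = np.zeros_like(x)
    for y in others:
        d = y - x
        step += d / (np.linalg.norm(d) ** 3 + 1e-12)
    return x + G * step

def decay_G(G, dG=0.01):
    """Shrink G each iteration so the system does not suffer the
    'big crunch' in which all points collapse into one cluster."""
    return (1.0 - dG) * G
```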
Long and Jin (2006) proposed a simplified gravitational clustering (SGC) method for multi-prototype learning based on minimum classification error. It simulates the movement of objects according to the gravity force and checks for possible merging. The gravitational clustering is simplified by ignoring velocity and multi-force attraction. A pair of objects i and j is merged into a single object whose mass and position follow the center-of-mass rule:

m = m_i + m_j,   x = (m_i · x_i + m_j · x_j) / (m_i + m_j).    (3)
The approach proposed by Zhang is based on a simplified version of the Newtonian gravitational force and the Newtonian motion of objects, as shown in Eqs. (4) and (5), respectively (Zhang and Hongshan, 2010):

F_i = Σ_{j≠i} G · m_i · m_j · (x_j − x_i) / ‖x_j − x_i‖³,    (4)

x_i(t + Δt) = x_i(t) + V_i(t) · Δt,    (5)

where Δt is a small discrete time interval and V_i is the velocity of object i.
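The discrete motion update can be sketched directly; here force is assumed to be the net force already computed for the object, and the names are ours:

```python
import numpy as np

def motion_step(x, v, force, mass=1.0, dt=0.1):
    """Simplified Newtonian motion over a small interval dt:
    update the velocity from the net force (a = F/m), then the
    position from the updated velocity."""
    v = v + (force / mass) * dt
    x = x + v * dt
    return x, v
```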
The value of the universal gravitational constant, G, is reduced at each iteration (Zhang and Hongshan, 2010). In the GRIN algorithm, an incremental hierarchical clustering technique based on the gravity theory is presented to construct clustering dendrograms. Being incremental, the algorithm operates on the abstraction of the distribution of data instances generated by its previous run. The mass of a cluster is the number of its data instances. The GRIN algorithm works in two phases, an initial phase and an incremental phase, and in both phases it invokes a gravitational agglomerative hierarchical clustering algorithm (Chen et al., 2005).
Ilc and Dobnikar (2012) presented a gravity-based method for clustering, known as GSOM, in which each data point is viewed as a mass object. In GSOM, the gravitational idea is integrated with the self-organizing map (SOM), taking the connections between neurons into account (Ilc and Dobnikar, 2012).
Rashedi et al. (2009) proposed a stochastic population-based metaheuristic, called the Gravitational Search Algorithm (GSA), based on the Newtonian law of gravity and the laws of motion. Originally, GSA was designed for solving continuous optimization problems. Like most metaheuristics, GSA is flexible and has been applied to various problems, including clustering (Hatamlou et al., 2012).
Rashedi and Nezamabadi-Pour (2013) proposed a clustering algorithm based on the theory of gravity. In this gravitational clustering, two objects that are closer to each other than a predefined threshold are merged to construct a larger cluster. A "travel" operator moves each pixel to a new position in the feature space, and a "merging" operator decides whether or not to merge. This may cause a drawback typical of sequential clustering, i.e., the final clustering depends heavily on the presentation order of the samples. Finally, an "escaping" operator, inspired by the concept of escape velocity, is introduced to eliminate the unpleasant effect of the travel operator. The authors apply the proposed gravitational clustering to the image segmentation problem (Rashedi and Nezamabadi-Pour, 2013).
Some of the major challenges for clustering algorithms are the ability to deal with noise and outliers, imbalanced groups, and the sensitivity to the initial positions of the cluster centroids. The aim of this paper is to propose a nature-inspired clustering algorithm able to overcome problems such as outliers (or noisy data), imbalanced clusters, overlapped clusters, and sensitivity to the initial positions of the cluster centroids. To the best of our knowledge, the data points in the existing gravity-based algorithms are considered as movable objects, allowed to move around the feature space under the influence of the Newtonian law of gravity and to merge if they come close enough to each other. In the proposed clustering method, we instead consider the data points as fixed celestial objects with unit mass that apply a gravity force to movable objects and change their positions in the feature space. The aim is to find the best position of each cluster centroid (cluster representative), where each centroid is modeled by a movable agent with unit mass. The centroids move around the feature space under the influence of the gravity force exerted by the celestial objects. One can expect the centroids to stop at the optimum positions.
The rest of the paper is organized as follows. In the next section, the basic concepts of clustering and Newton's law of universal gravitation are reviewed and their properties are discussed. Section 3 is devoted to describing the proposed gravity algorithm. A comparative experimental study is given in Section 4. Finally, we conclude the paper in Section 5.
Clustering
Clustering is the task of unsupervised classification of data points into groups (clusters). Data points are conventionally represented as multidimensional vectors, where each dimension is a single feature. Clustering algorithms are mainly used to identify groups of similar items within a universe of data. The grouping can be performed in a number of ways. The output clustering can be hard (i.e., a partition of the data into groups) or fuzzy, where each data point has a variable degree of membership in each cluster.
Basic idea
It is assumed that each data point x_i is located in a D-dimensional space, where D is the number of features. The clusters are compact, and a point representative (centroid) is used to present each cluster. The main idea of the proposed algorithm is to consider a movable gravity object (agent) as the centroid of a cluster and each data point as a fixed gravity object. In this gravity system, the fixed objects apply the gravity force to the agents and change their positions in the feature space.
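The basic idea can be sketched as follows. This is our illustrative reading, not the authors' exact update rule: each centroid is repeatedly displaced by the resultant gravitational pull of all fixed unit-mass data points, with a step-size cap added by us for numerical stability:

```python
import numpy as np

def move_centroids(X, centroids, G=0.05, steps=400, max_step=0.1):
    """Data points X are fixed unit-mass objects; each movable
    centroid is pulled by the resultant gravitational force of all
    data points and drifts toward a dense region of the data."""
    C = centroids.astype(float).copy()
    for _ in range(steps):
        for j in range(len(C)):
            d = X - C[j]                          # vectors to fixed points
            r = np.linalg.norm(d, axis=1) + 1e-9  # guard against r = 0
            # Resultant force of all unit-mass data points on centroid j.
            force = (G * d / (r ** 3)[:, None]).sum(axis=0)
            n = np.linalg.norm(force)
            if n > max_step:                      # cap the displacement
                force *= max_step / n
            C[j] = C[j] + force
    return C
```

In this simplified dynamics a centroid started outside the data is pulled into the data region and hovers near a mass concentration, which matches the intuition that centroids should come to rest at dense positions.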
Datasets
The performance of the proposed algorithm is evaluated on different problems in the fields of clustering and classification, and the results are compared with those of well-known methods. To see how the proposed algorithm handles noisy data, outliers, and imbalanced datasets, we apply it to three clustering datasets in 2- and 3-dimensional feature spaces for visual inspection and comparison. The datasets used in the visual comparison are shown in Fig. 3 (Wu and Yang, 2002).
Conclusions
In this paper, a new nature-inspired clustering algorithm based on the Newtonian law of gravity was presented. In order to evaluate the performance of the gravity-based clustering algorithm, experiments were performed using 12 datasets from the UCI machine learning repository in the fields of clustering and classification. Furthermore, some experiments were performed on three visual datasets to see how it performs. The proposed algorithm has been compared with a number of well-known alternative clustering methods.
Acknowledgments
The authors give kind respect and special thanks to the anonymous reviewers for their useful advice. Moreover, the authors would like to express their gratitude towards Prof. Hadi Sadoghi-Yazdi for providing valuable help and Prof. Malihe M. Farasngi, Miss Elham Ghazizadeh and Miss Bahareh Nikpour for proofreading the manuscript.
References (52)
- A prototype classifier based on gravitational search algorithm, Appl. Soft Comput. J. (2012)
- FCM: the fuzzy c-means clustering algorithm, Comput. Geosci. (1984)
- A statistics-based approach to control the quality of subclusters in incremental gravitational clustering, Pattern Recognit. (2005)
- A prototype classification method and its use in a hybrid solution for multiclass pattern recognition, Pattern Recognit. (2006)
- GGSA: a grouping gravitational search algorithm for data clustering, Eng. Appl. Artif. Intell. (2014)
- Efficient stochastic algorithms for document clustering, Inf. Sci. (2013)
- Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Inf. Sci. (2010)
- A combined approach for clustering based on K-means and gravitational search algorithms, Swarm Evol. Comput. (2012)
- Generation of a clustering ensemble based on a gravitational self-organising map, Neurocomputing (2012)
- An advanced ACO algorithm for feature subset selection, Neurocomputing (2015)
- A kernel-based subtractive clustering method, Pattern Recognit. Lett.
- Gravitational clustering: a new approach based on the spatial distribution of the points, Pattern Recognit.
- Genetic algorithm-based clustering technique, Pattern Recognit.
- A survey on nature inspired metaheuristic algorithms for partitional clustering, Swarm Evol. Comput.
- An efficient hybrid approach based on PSO, ACO and k-means for cluster analysis, Appl. Soft Comput. J.
- A stochastic gravitational approach to feature based color image segmentation, Eng. Appl. Artif. Intell.
- GSA: a gravitational search algorithm, Inf. Sci.
- Using gravitational search algorithm in prototype generation for nearest neighbor classification, Neurocomputing
- Fuzzy granular gravitational clustering algorithm for multivariate data, Inf. Sci.
- Gravitation based classification, Inf. Sci.
- A quantum inspired gravitational search algorithm for numerical function optimization, Inf. Sci.
- A novel approach for edge detection based on the theory of universal gravity, Pattern Recognit.
- A powerful hybrid clustering method based on modified stem cells and Fuzzy C-means algorithms, Eng. Appl. Artif. Intell.
- Gravitational clustering, Pattern Recognit.
- Alternative c-means clustering algorithms, Pattern Recognit.
- A gravitational search algorithm for multimodal optimization, Swarm Evol. Comput.