Geodesic distance based fuzzy c-medoid clustering – searching for central points in graphs and high dimensional data
Introduction
Cluster is a group of objects that are more similar to one another than to members of other clusters. In metric spaces, similarity is often defined by means of a distance norm. Distance can be measured among the data vectors themselves, or as a distance from a prototypical object of the cluster. In case of high-dimensional data, classical Euclidean distance-based methods like k-means, fuzzy c-means or fuzzy c-medoid do not perform well and it is difficult to design and validate cluster prototypes, that are able to represent the distribution of the data. One approach to handle this problem is to assume, that the data of interest lie on an embedded non-linear manifold within the higher-dimensional space.
In brief, our main idea is to reveal the hidden, complex structure of high dimensional data by constructing and clustering the k-nn graph of the data. Our concept can be seen in Fig. 1. Shortest path distances among the objects (nodes) can be considered as the approximation of geodesic distances defined on the low dimensional manifold within the higher-dimensional space. Centrality measures of the nodes of the weighted undirected k-nn graph can be used to initialize the cluster centers. The classical fuzzy c-medoid algorithm is able to find clusters using the obtained initial values and distance matrix. The clustered graph (distance matrix) is visualized by a distance-preserving mapping to give more insight to the problem.
It should be noted, that the proposed shortest-path based approach is parallel to the concept of transitive closure. Therefore, our algorithm is capable to cluster networks or even more complex and abstract objects based their on partially known pairwise similarities.
The rest of the paper is organized as follows. In the next section, we give a short review of the most relevant literature. We present the algorithm in Section 3.1. The details of the graph construction are given in Section 3.2. Section 3.3 clarifies the connection between geodesic distances and transitive similarities. In Section 3.4 we discuss the most important aspects of the utilized centrality measures, while the details are given in Appendix A. Appendix B presents some cluster validity indices used to prove that it is worth using the proposed techniques. Section 4 demonstrates that the proposed approach works well on benchmark data sets and high dimensional problems, like document clustering. Comparison with other graph-based clustering techniques and two real-life examples are also presented. Section 5 concludes our work.
Section snippets
Literature review
Clustering is a very important research topic in machine learning and data mining. The classical k-means algorithm is capable to solve most of the practical segmentation problems [1]. Although k-means is an almost fifty years old method, the existence of its several variants, like the k-medoid [2] or the fuzzy c-means [3] algorithm prove the high relevance of this approach. Cluster prototypes may be vectors of the same dimension as the data objects, but they can also be defined as
Geodesic distance based fuzzy c-medoid clustering and its initialization based on graph centrality measures
As it is shown in Fig. 1, the key idea of the proposed algorithm is that the shortest path distances of the k-nn graph can be considered as geodesic distances on the low dimensional manifold of the high-dimensional feature space. In this section, we detail how the fuzzy c-medoid algorithm is able to find clusters using this distance matrix, how the k-nn graph should be formed and how the clusters can be initialized based on graph centrality measures.
Experimental results
In this section examples are given to present the applicability, parameter sensitivity, and efficiency of the proposed algorithm. Six different data sets are used: a three-dimensional synthetic data set to illustrate how a lower dimensional manifold can be clustered; “wine” and “iris” data sets as well-known clustering problems; the co-authors data set, which presents a real-life network; a geographical data set, in which a real life logistical problem is defined and solved; and finally, a
Conclusions
We developed an algorithm for mining central objects in graphs and high dimensional data. The modified fuzzy c-medoid clustering algorithm uses geodesic distance measure, and selects potential cluster centers among the set of most central objects. The proposed approach is able to handle data that lie on a low dimensional manifold of a high dimensional feature space, since clustering uses a distance measure, which reflects the true embedded manifold. We proposed a very simple but powerful
Acknowledgements
The research has been supported by the European Union and the Hungarian Republic through the TÁMOP-4.2.2.C-11/1/KONV-2012-0004 – National Research Center for Development and Market Introduction of Advanced Information and Communication Technologies. The research of János Abonyi was realized in the frames of TÁMOP-4.2.4.A/2-11-1-2012-0001 “National Excellence Program Elaborating and Operating an Inland Student and Researcher Personal Support System”.
References (40)
- et al.
FCM: the fuzzy c-means clustering algorithm
Comput. Geosci.
(1984) - et al.
Modified Gath–Geva clustering for fuzzy segmentation of multivariate time-series
Fuzzy Sets in Knowledge Discovery
Fuzzy Sets Syst.
(2005) - et al.
Graph based k-means clustering
Signal Process.
(2012) - et al.
Relational duals of the c-means clustering algorithms
Pattern Recognit.
(1989) - et al.
Nerf c-means: non-euclidean relational fuzzy clustering
Pattern Recognit.
(1994) - et al.
Topology representing networks
Neural Netw.
(1994) - et al.
The new k-windows algorithm for improving the k-means clustering algorithm
J. Complex.
(2002) Some methods for classification and analysis of multivariate observations
- et al.
Clustering by means of medoids
- et al.
Fuzzy clustering with fuzzy covariance matrix
Unsupervised optimal fuzzy clustering
IEEE Trans. Pattern Anal. Mach. Intell.
Modified Gath–Geva fuzzy clustering for identification of Takagi–Sugeno fuzzy models
IEEE Trans. Syst. Man Cybern.
Mining scale-free networks using geodesic clustering
Random walks for image segmentation
IEEE Trans. Pattern Anal. Mach. Intell.
Many random walks are faster than one
Multi-agent random walks for local clustering on graphs
Soft geodesic kernel k-means
Geodesic k-means clustering
Geodesic distance and MST-based image segmentation
Unsupervised clustering on multi-components datasets: applications on images and astrophysics data
Cited by (14)
A review and proposal of (fuzzy) clustering for nonlinearly separable data
2019, International Journal of Approximate ReasoningCitation Excerpt :For this reason, they also suggest to use the fuzzy c-medoids (FcMed) algorithm [53]. In their proposal, deeply investigated in [54], an extension of FcMed involving the geodesic distance is considered. It represents the fuzzy counterpart of [47].
Cluster-scaled principal component analysis
2022, Wiley Interdisciplinary Reviews: Computational StatisticsA fuzzy clustering algorithm based on hybrid surrogate model
2022, Journal of Intelligent and Fuzzy SystemsParallel Transitive Closure Algorithm for Heterogeneous Architecture
2021, Jisuanji Gongcheng/Computer EngineeringReduction of relative degree by optimal control and sensor placement
2020, SpringerBriefs in Computer Science