Geodesic distance based fuzzy c-medoid clustering – searching for central points in graphs and high dimensional data

doi:10.1016/j.fss.2015.06.022

Fuzzy Sets and Systems

Volume 286, 1 March 2016, Pages 157-172

https://doi.org/10.1016/j.fss.2015.06.022 Get rights and content

Abstract

Clustering high dimensional data and identifying central nodes in a graph are complex and computationally expensive tasks. We utilize k-nn graph of high dimensional data as efficient representation of the hidden structure of the clustering problem. Initial cluster centers are determined by graph centrality measures. Cluster centers are fine-tuned by minimizing fuzzy-weighted geodesic distances. The shortest-path based representation is parallel to the concept of transitive closure. Therefore, our algorithm is capable to cluster networks or even more complex and abstract objects based on their partially known pairwise similarities.

The algorithm is proven to be effective to identify senior researchers in a co-author network, central cities in topographical data, and clusters of documents represented by high dimensional feature vectors.

Introduction

Cluster is a group of objects that are more similar to one another than to members of other clusters. In metric spaces, similarity is often defined by means of a distance norm. Distance can be measured among the data vectors themselves, or as a distance from a prototypical object of the cluster. In case of high-dimensional data, classical Euclidean distance-based methods like k-means, fuzzy c-means or fuzzy c-medoid do not perform well and it is difficult to design and validate cluster prototypes, that are able to represent the distribution of the data. One approach to handle this problem is to assume, that the data of interest lie on an embedded non-linear manifold within the higher-dimensional space.

In brief, our main idea is to reveal the hidden, complex structure of high dimensional data by constructing and clustering the k-nn graph of the data. Our concept can be seen in Fig. 1. Shortest path distances among the objects (nodes) can be considered as the approximation of geodesic distances defined on the low dimensional manifold within the higher-dimensional space. Centrality measures of the nodes of the weighted undirected k-nn graph can be used to initialize the cluster centers. The classical fuzzy c-medoid algorithm is able to find clusters using the obtained initial values and distance matrix. The clustered graph (distance matrix) is visualized by a distance-preserving mapping to give more insight to the problem.

It should be noted, that the proposed shortest-path based approach is parallel to the concept of transitive closure. Therefore, our algorithm is capable to cluster networks or even more complex and abstract objects based their on partially known pairwise similarities.

The rest of the paper is organized as follows. In the next section, we give a short review of the most relevant literature. We present the algorithm in Section 3.1. The details of the graph construction are given in Section 3.2. Section 3.3 clarifies the connection between geodesic distances and transitive similarities. In Section 3.4 we discuss the most important aspects of the utilized centrality measures, while the details are given in Appendix A. Appendix B presents some cluster validity indices used to prove that it is worth using the proposed techniques. Section 4 demonstrates that the proposed approach works well on benchmark data sets and high dimensional problems, like document clustering. Comparison with other graph-based clustering techniques and two real-life examples are also presented. Section 5 concludes our work.

Section snippets

Literature review

Clustering is a very important research topic in machine learning and data mining. The classical k-means algorithm is capable to solve most of the practical segmentation problems [1]. Although k-means is an almost fifty years old method, the existence of its several variants, like the k-medoid [2] or the fuzzy c-means [3] algorithm prove the high relevance of this approach. Cluster prototypes may be vectors of the same dimension as the data objects, but they can also be defined as

Geodesic distance based fuzzy c-medoid clustering and its initialization based on graph centrality measures

As it is shown in Fig. 1, the key idea of the proposed algorithm is that the shortest path distances of the k-nn graph can be considered as geodesic distances on the low dimensional manifold of the high-dimensional feature space. In this section, we detail how the fuzzy c-medoid algorithm is able to find clusters using this distance matrix, how the k-nn graph should be formed and how the clusters can be initialized based on graph centrality measures.

Experimental results

In this section examples are given to present the applicability, parameter sensitivity, and efficiency of the proposed algorithm. Six different data sets are used: a three-dimensional synthetic data set to illustrate how a lower dimensional manifold can be clustered; “wine” and “iris” data sets as well-known clustering problems; the co-authors data set, which presents a real-life network; a geographical data set, in which a real life logistical problem is defined and solved; and finally, a

Conclusions

We developed an algorithm for mining central objects in graphs and high dimensional data. The modified fuzzy c-medoid clustering algorithm uses geodesic distance measure, and selects potential cluster centers among the set of most central objects. The proposed approach is able to handle data that lie on a low dimensional manifold of a high dimensional feature space, since clustering uses a distance measure, which reflects the true embedded manifold. We proposed a very simple but powerful

Acknowledgements

The research has been supported by the European Union and the Hungarian Republic through the TÁMOP-4.2.2.C-11/1/KONV-2012-0004 – National Research Center for Development and Market Introduction of Advanced Information and Communication Technologies. The research of János Abonyi was realized in the frames of TÁMOP-4.2.4.A/2-11-1-2012-0001 “National Excellence Program Elaborating and Operating an Inland Student and Researcher Personal Support System”.

References (40)

J.C. Bezdek et al.
FCM: the fuzzy c-means clustering algorithm
Comput. Geosci.
(1984)
J. Abonyi et al.
Modified Gath–Geva clustering for fuzzy segmentation of multivariate time-series
Fuzzy Sets in Knowledge Discovery
Fuzzy Sets Syst.
(2005)
L. Galluccio et al.
Graph based k-means clustering
Signal Process.
(2012)
R.J. Hathaway et al.
Relational duals of the c-means clustering algorithms
Pattern Recognit.
(1989)
R.J. Hathaway et al.
Nerf c-means: non-euclidean relational fuzzy clustering
Pattern Recognit.
(1994)
T. Martinetz et al.
Topology representing networks
Neural Netw.
(1994)
M.N. Vrahatis et al.
The new k-windows algorithm for improving the k-means clustering algorithm
J. Complex.
(2002)
J. MacQueen
Some methods for classification and analysis of multivariate observations
L. Kaufman et al.
Clustering by means of medoids
D. Gustafson et al.
Fuzzy clustering with fuzzy covariance matrix

I. Gath et al.

Unsupervised optimal fuzzy clustering

IEEE Trans. Pattern Anal. Mach. Intell.

(1989)

J. Abonyi et al.

Modified Gath–Geva fuzzy clustering for identification of Takagi–Sugeno fuzzy models

IEEE Trans. Syst. Man Cybern.

(2002)

A.Y. Wu et al.

Mining scale-free networks using geodesic clustering

L. Grady

Random walks for image segmentation

IEEE Trans. Pattern Anal. Mach. Intell.

(2006)

N. Alon et al.

Many random walks are faster than one

M. Alamgir et al.

Multi-agent random walks for local clustering on graphs

J. Kim et al.

Soft geodesic kernel k-means

N. Asgharbeygi et al.

Geodesic k-means clustering

G. Economou et al.

Geodesic distance and MST-based image segmentation

L. Galluccio et al.

Unsupervised clustering on multi-components datasets: applications on images and astrophysics data

Cited by (14)

A review and proposal of (fuzzy) clustering for nonlinearly separable data
2019, International Journal of Approximate Reasoning
Citation Excerpt :
For this reason, they also suggest to use the fuzzy c-medoids (FcMed) algorithm [53]. In their proposal, deeply investigated in [54], an extension of FcMed involving the geodesic distance is considered. It represents the fuzzy counterpart of [47].
In many practical situations data may be characterized by nonlinearly separable clusters. Classical (hard or fuzzy) clustering algorithms produce a partition of objects by computing the Euclidean distance. As such, they are based on the linearity assumption and, therefore, do not identify properly clusters characterized by nonlinear structures. To overcome this limitation, several approaches can be followed: density-, kernel-, graph- or manifold-based clustering. A review of these approaches is offered and some new fuzzy manifold-based clustering algorithms, involving the so-called geodesic distance, are proposed. The effectiveness of such algorithms is shown by synthetic, benchmark and real data.
Goal-oriented possibilistic fuzzy C-Medoid clustering of human mobility patterns—Illustrative application for the Taxicab trips-based enrichment of public transport services
2022, PLoS ONE
Cluster-scaled principal component analysis
2022, Wiley Interdisciplinary Reviews: Computational Statistics
A fuzzy clustering algorithm based on hybrid surrogate model
2022, Journal of Intelligent and Fuzzy Systems
Parallel Transitive Closure Algorithm for Heterogeneous Architecture
2021, Jisuanji Gongcheng/Computer Engineering
Reduction of relative degree by optimal control and sensor placement
2020, SpringerBriefs in Computer Science

View all citing articles on Scopus

View full text

Geodesic distance based fuzzy c-medoid clustering – searching for central points in graphs and high dimensional data

Abstract

Introduction

Section snippets

Literature review

Geodesic distance based fuzzy c-medoid clustering and its initialization based on graph centrality measures

Experimental results

Conclusions

Acknowledgements

Comput. Geosci.

Fuzzy Sets Syst.

Signal Process.

Pattern Recognit.

Pattern Recognit.

Neural Netw.

J. Complex.

Some methods for classification and analysis of multivariate observations

Clustering by means of medoids

Fuzzy clustering with fuzzy covariance matrix

Unsupervised optimal fuzzy clustering

IEEE Trans. Pattern Anal. Mach. Intell.

Modified Gath–Geva fuzzy clustering for identification of Takagi–Sugeno fuzzy models

IEEE Trans. Syst. Man Cybern.

Mining scale-free networks using geodesic clustering

Random walks for image segmentation

IEEE Trans. Pattern Anal. Mach. Intell.

Many random walks are faster than one

Multi-agent random walks for local clustering on graphs

Soft geodesic kernel k-means

Geodesic k-means clustering

Geodesic distance and MST-based image segmentation

Unsupervised clustering on multi-components datasets: applications on images and astrophysics data