Pattern Recognition

Volume 42, Issue 11, November 2009, Pages 2551-2556

A fast k-means clustering algorithm using cluster center displacement

https://doi.org/10.1016/j.patcog.2009.02.014

Abstract

In this paper, we present a fast k-means clustering algorithm (FKMCUCD) that uses the displacements of cluster centers to reject unlikely candidates for a data point. The computing time of our proposed algorithm increases linearly with the data dimension d, whereas the computational complexity of the major available kd-tree based algorithms increases exponentially with d. Theoretical analysis shows that our method can reduce the computational complexity of full search by a factor SF that is independent of the vector dimension. The experimental results show that, compared to full search, our proposed method reduces the computational complexity by a factor of 1.37–4.39 on a data set drawn from six real images. Compared with the filtering algorithm, which is among the best available algorithms for k-means clustering, our algorithm can effectively reduce the computing time. It is noted that our proposed algorithm generates the same clusters as those produced by hard k-means clustering. The advantage of our method is more pronounced for larger data sets of higher dimension.

Introduction

Data clustering is used frequently in a number of applications, such as vector quantization (VQ) [1], [2], [3], [4], pattern recognition [5], knowledge discovery [6], speaker recognition [7], fault detection [8], and web/data mining [9]. Among clustering formulations that minimize an objective function, k-means clustering is perhaps the most widely used and studied [10]. The k-means clustering algorithm, also called the generalized Lloyd algorithm (GLA), is a special case of the generalized hard clustering scheme in which point representatives are adopted and the Euclidean distance is used to measure the dissimilarity between a vector X and its cluster representative C.

The k-means clustering algorithm iteratively performs the partition step and the new-cluster-center generation step until convergence. An iterative process with extensive computations is usually required to generate a set of cluster representatives. Huang and Harris developed the DSBS method, which uses a splitting technique to double the number of clusters, to reduce the computing time of generating a set of cluster centers [11]. Lai and Lue developed a fast codebook generation algorithm (FCGA) to speed up the process of center generation [12]. The FCGA modified the fast search principles developed in Ref. [13] and used triangular-inequality elimination criteria to generate cluster centers. Kanungo et al. [10] developed a filtering algorithm based on a kd-tree to speed up the generation of new cluster centers. Organizing data points in a kd-tree for k-means clustering was also used by Pelleg and Moore [14]. All of the above-mentioned methods use the geometrical information of data points to speed up the partition step of k-means clustering, but none of them passes information from one iteration to the next.

After some iterations of the GLA, most of the centers are expected to have converged to their final positions, and the majority of data points then have only a few candidates for their closest centers. In this paper, we exploit this characteristic to reduce the computational complexity of k-means clustering. The algorithm first classifies cluster centers into static and active groups, and then uses the information of center displacements to reject impossible candidates in the partition step. It is noted that the algorithm developed here obtains the same set of cluster centers as that produced by the GLA (k-means clustering). This paper is organized as follows. Section 2 describes the k-means clustering algorithm. Section 3 presents the algorithm developed in this paper, together with some theoretical analysis. Experimental results are presented in Section 4 and concluding remarks are given in Section 5.


k-Means clustering algorithm

The k-means clustering algorithm partitions data points into $k$ clusters $S_i$ ($i = 1, 2, \ldots, k$), where each cluster $S_i$ is associated with a representative (cluster center) $C_i$. Denote the set of data points as $S = \{X\}$. Let $d(X, Y)$ be the distortion between any two vectors $X$ and $Y$; in this paper, $d(X, Y)$ is defined as the Euclidean distance between $X$ and $Y$.

The major process of the GLA (k-means clustering) is mapping a given set of representative vectors into an improved one through partitioning data points. It repeats this mapping, alternating the partition step and the center-update step, until the set of cluster centers converges.
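For concreteness, here is a minimal NumPy sketch of that loop. It is a plain illustration of the GLA as described above; the function and variable names are ours, not code from the paper.

```python
import numpy as np

def gla(points, k, max_iter=100, seed=0):
    """Plain GLA / k-means: alternate the partition and center-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Partition step: assign each point to its nearest center
        # (squared Euclidean distance yields the same nearest center).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its cluster;
        # a center with an empty cluster is left where it is.
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):  # convergence: no center moved
            break
        centers = new_centers
    return centers, labels
```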

Fast k-means clustering using center displacement

In the clustering process, some cluster centers stop moving after a few iterations, while others take much longer to stabilize. Let the $i$th cluster centers in the current and previous partitions be denoted as $C_i$ and $C_i'$, respectively, and denote the displacement between $C_i$ and $C_i'$ as $D_i$; that is, $D_i = C_i - C_i'$. If $D_i = 0$, the vector $C_i$ is defined as a static cluster center; otherwise it is called an active cluster center. The cluster associated with an active center is likewise called an active cluster.
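The exact rejection tests of FKMCUCD are developed in the full text. As a rough illustration of how center displacements can prune candidates, the sketch below applies a triangle-inequality bound of the kind used by displacement-aware k-means variants; the specific upper/lower bounds u and l, and all names, are our own construction, not the paper's.

```python
import numpy as np

def partition_with_displacements(points, centers, prev_centers, labels, u, l):
    """One partition step that skips points whose nearest center cannot change.

    u[i]: upper bound on the distance from points[i] to its assigned center.
    l[i]: lower bound on its distance to every other center.
    Both are carried over from the previous iteration.
    """
    # Displacement of each center since the previous iteration.
    D = np.linalg.norm(centers - prev_centers, axis=1)
    if D.max() == 0.0:            # every center is static: partition unchanged
        return labels, u, l
    d_max = D.max()
    for i in range(len(points)):
        a = labels[i]
        u[i] += D[a]              # assigned center moved by at most D[a]
        l[i] -= d_max             # any other center moved by at most d_max
        if u[i] <= l[i]:          # triangle inequality: assignment cannot change
            continue              # all other candidates rejected, no distances computed
        # Otherwise fall back to an exact search over all centers.
        dist = np.linalg.norm(points[i] - centers, axis=1)
        labels[i] = int(dist.argmin())
        u[i] = dist[labels[i]]
        l[i] = np.partition(dist, 1)[1]   # second-smallest distance
    return labels, u, l
```

When most centers have become static, d_max is small, the test u[i] <= l[i] succeeds for most points, and the expensive exact search is avoided; the surviving points are exactly those whose assignment could actually have changed, so the final partition matches the GLA's.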

Experimental results

To evaluate the performance of the proposed algorithm, several sets of synthetic data and a set of real images have been used. In the first example, the data set contains about 100,000 points obtained from six real images: “Peppers,” “Lena,” “Baboon,” “Parrot,” “Airplane,” and “Island.” Each data point is an image block of 4×4 pixels, so the data points have dimension 16.
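As an illustration of how such a data set can be formed, the sketch below splits a grayscale image (assumed to be a 2-D NumPy array) into non-overlapping 4×4 blocks and flattens each block into a 16-dimensional point; the function name is ours.

```python
import numpy as np

def image_to_block_vectors(img, block=4):
    """Split a grayscale image into non-overlapping block x block tiles
    and flatten each tile into one data point of dimension block*block."""
    h, w = img.shape
    h, w = h - h % block, w - w % block       # crop to a multiple of the block size
    tiles = (img[:h, :w]
             .reshape(h // block, block, w // block, block)
             .swapaxes(1, 2)                  # -> (rows, cols, block, block)
             .reshape(-1, block * block))     # one 16-d vector per 4x4 block
    return tiles.astype(float)
```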

Conclusions

In this paper, we develop FKMCUCD to speed up k-means clustering by using the information of center displacements between two successive partition processes. The computing time of our proposed algorithm grows linearly with the data dimension d, in contrast to kd-tree based algorithms, whose complexity grows exponentially with d. Compared to the filtering algorithm, which is among the best available algorithms for k-means clustering, our algorithm can effectively reduce the computing time.



Cited by (69)

  • K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data

    2023, Information Sciences
    Citation Excerpt :

The reduction in computational time was achieved through the use of the center displacement method and an erasure test at the second stage of the standard algorithm. The proposed algorithm was three times faster than the FKMCUCD [118], the ancestor method. Lee and Lin [122], in another of their variant algorithms, tried to accelerate the computation by combining their previous work [121] with that of Lai, Huang, and Liaw [118], adding a norm-product test for identifying impossible candidates to be deleted.

  • r-Reference points based k-means algorithm

    2022, Information Sciences
    Citation Excerpt :

    The filtering algorithm determines the practical efficiency by checking whether the algorithm runs faster as the separation between clusters increases. Lai et al. [10] proposed a fast k-means clustering (FKMC) algorithm which uses the displacements of cluster centers to remove unlikely candidates for a data point. The running time of the algorithm increases linearly with the number of data dimensions, while the computation time of major available K-D-tree based algorithms increases exponentially with the number of data dimensions.

  • Efficiency of corporate debt financing based on machine learning and convolutional neural network

    2021, Microprocessors and Microsystems
    Citation Excerpt :

Risk is determined by a combination of the strength and conditions of various elements [3]. The impact of these factors has a certain ambiguity [4]. An assessor can assess the risk by calculating its value with the traditional method [5].

  • Two improved k-means algorithms

    2018, Applied Soft Computing Journal
    Citation Excerpt :

    In this study, about a half of the data in each dataset are randomly selected as the training samples, and the remaining data as the testing samples. In the first experiment, tri-level k-means, bi-layer k-means, traditional k-means, fast k-means clustering algorithm using cluster center displacement (FKMCUCD) [23], fast global k-means (FGKM) [24], k-means++ [25], and k-means* algorithms [26] are used to classify the data in a dataset. The Precision, Recall, and F-measure in Eqs. (5)–(7) are employed to evaluate the accuracy of classification.


About the Author—JIM Z.C. LAI received his Ph.D. in Engineering from the University of California, Los Angeles, in 1986. He then worked as a research engineer at GM and a lead project engineer at Rockwell International. He joined the Department of Information Engineering and Computer Science in 1988 as an associate professor. From 1994 to 2003, he was a full professor in the same department. Since 2004, he has been a full professor in the Department of Computer Science, National Taiwan Ocean University. His current interests are in image processing, multimedia, and network security.

About the Author—TSUNG-JEN HUANG received his B.S. and M.S. degrees in Information Engineering and Computer Science from Feng-Chia University, Taiwan, in 1992 and 1994, respectively. From 1994 to 2001, he was a Ph.D. student in the Department of Computer Science at National Chiao Tung University, Hsinchu, Taiwan. Since 2001, he has been an engineer with the Industrial Technology Research Institute, Hsinchu, Taiwan. He is also currently a Ph.D. student in Computer Science at National Taiwan Ocean University, Keelung, Taiwan. His current interests are in image processing, multimedia, and clustering algorithms.

About the Author—YI-CHING LIAW received his B.S., M.S., and Ph.D. degrees in Information Engineering and Computer Science from Feng-Chia University, Taiwan, in 1992, 1994, and 2004, respectively. From 1999 to 2004, he was an engineer with the Industrial Technology Research Institute, Hsinchu, Taiwan. Since 2005, he has been an assistant professor in the Department of Computer Science and Information Engineering, Nanhua University, Chiayi, Taiwan. His current interests are in image processing and fast algorithms.
