New methods for the initialisation of clusters

https://doi.org/10.1016/0167-8655(95)00119-0

Abstract

One of the most widely used clustering techniques is the k-means algorithm. Solutions obtained from this technique depend on the initialisation of the cluster centres. In this article, two initialisation methods are developed. These methods are particularly suited to problems involving very large data sets. The methods have been applied to different data sets, and good results were obtained.

Cited by (62)

  • The application of spatial domain in optimum initialization for clustering image data using particle swarm optimization

    2021, Expert Systems with Applications
    Citation Excerpt :

    Their main idea is that if the initial centers are selected close to the dense areas of the feature space, then these initial centers are more likely to be close to the final centroids (Tran, Wehrens, & Buydens, 2003; Yang & Luo, 2005). Based on this concept, various techniques have been proposed, including dividing the data into subsets and selecting samples with more neighbors (dense points) in each subset (Moh’d B & Roberts, 1996), and selecting the samples with maximum frequency in the feature space (Aliwy & Aljanabi, 2017). Density-based methods pay close attention to the distribution and density of the data, but their main disadvantage is that they are time-consuming, especially for imagery data.

  • How much can k-means be improved by using better initialization and repeats?

    2019, Pattern Recognition
    Citation Excerpt :

    There are three common approaches for this: The first approach divides the space by a regular grid, and counts the frequency of the points in every bucket [76]. The density of a point is then inherited from the bucket it is in.

  • Parallel implementation of Kaufman's initialization for clustering large remote sensing images on clouds

    2017, Computers, Environment and Urban Systems
    Citation Excerpt :

    Then, it chooses the next centroid, a point that is farthest from the nearest centroid. AlDaoud and Roberts (1996) proposed a density-based clustering initialization method which partitions the data uniformly into N cells. From each of these cells, a number of centroids are chosen randomly until K centroids are obtained; the number of centroids is proportional to the number of objects in each cell.

  • A novel approach for initializing the spherical K-means clustering algorithm

    2015, Simulation Modelling Practice and Theory
    Citation Excerpt :

    The third family, density estimation methods, includes the Kaufman initialization, which is also described earlier; KR’s main drawback (as for Kaufman & Rousseeuw in [32]) is its high computational complexity. Another method, by Al-Daoud and Roberts [2], relies on dividing the space R^d into M smaller subspaces, each of which is spanned by a proportion of the data points, and the seeds are distributed evenly across these subspaces. For each subspace, initial seeds are chosen randomly. The method is sensitive to the number of subspaces M, which has to be compatible with k somehow, or else it would affect the density estimation.
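
The citing excerpts above describe a cell-based, density-aware seeding: partition the data space into cells, count the points in each cell, and draw random seeds from each cell in proportion to its count. The following is a minimal sketch of that idea as the excerpts describe it, not the article's own implementation. The function names `grid_cells` and `proportional_seeds`, the parameter `bins_per_dim`, the use of a regular grid over the bounding box, and the largest-remainder rounding are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def grid_cells(points, bins_per_dim):
    """Group points into the cells of a regular grid over their bounding box."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    cells = defaultdict(list)
    for p in points:
        key = []
        for d in range(dims):
            # Fall back to width 1.0 when a dimension has zero extent.
            width = (highs[d] - lows[d]) / bins_per_dim or 1.0
            # Clamp so the maximum value lands in the last bin.
            key.append(min(int((p[d] - lows[d]) / width), bins_per_dim - 1))
        cells[tuple(key)].append(p)
    return cells


def proportional_seeds(points, k, bins_per_dim=4, rng=None):
    """Pick k seeds; each cell contributes seeds in proportion to its point count."""
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= number of points")
    rng = rng or random.Random()
    cells = grid_cells(points, bins_per_dim)

    # Integer share of the k seeds per cell, plus the remainder lost to rounding.
    quota = {c: k * len(m) // n for c, m in cells.items()}
    remainder = {c: k * len(m) % n for c, m in cells.items()}

    # Assumption: leftover seeds go to the cells with the largest remainders.
    leftover = k - sum(quota.values())
    for c in sorted(cells, key=lambda c: remainder[c], reverse=True)[:leftover]:
        quota[c] += 1

    seeds = []
    for c, members in cells.items():
        # Random choice without replacement inside each cell.
        seeds.extend(rng.sample(members, quota[c]))
    return seeds
```

Calling `proportional_seeds(points, k=10, bins_per_dim=2)` on a list of coordinate tuples returns ten of the input points, with denser cells contributing more of them. As the last excerpt notes, the result is sensitive to the number of cells relative to k: with many cells and a small k, most cells receive no seed at all.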
