2018 | OriginalPaper | Chapter

3. Algorithms of Combinatorial Cluster Analysis

Authors: Sławomir T. Wierzchoń, Mieczysław A. Kłopotek

Published in: Modern Algorithms of Cluster Analysis

Publisher: Springer International Publishing

Abstract

While Chap. 2 presented the broadness of the spectrum of clustering methods, this chapter focusses on clustering of objects embedded in some metric space and gives an insight into the potential variations of clustering algorithms driven by peculiarities of the data at hand. It concentrates essentially on the k-means clustering algorithm and its variants, which serve as an illustration of the kinds of issues the algorithms need to accommodate. After presenting the basic variants of k-means, implementational issues are discussed, including initialisation and efficiency (e.g. k-means++). Then accommodations to special forms of data are presented, e.g. text data (spherical k-means), data with clusters separated by complicated boundary lines (kernel k-means), or data residing in non-Euclidean spaces (k-medoids, k-modes). Subsequently, non-sharp separation areas between clusters are discussed: first the EM algorithm is presented, then fuzzy k-means, together with fuzzified variants of the aforementioned algorithms. Application of k-means presumes a priori knowledge of the number of clusters; to overcome this limitation, the so-called affinity propagation algorithm is presented. Finally, approaches are discussed for handling data with an atypical relation between the number of objects and the number of dimensions, that is, data residing (predominantly) in a narrow section of the feature space (projective clustering, random projection, k q-flats, manifold clustering), data sets that are too large (subsampling), and clusterings changing over time. The latter issue is closely connected to clustering under partial pre-labelling of objects. Methods for simultaneous clustering of both objects and their features (co-clustering) are also briefly outlined.
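As a quick illustration of the chapter's central algorithm, the following is a minimal sketch (our own, in Python/NumPy; function and parameter names are illustrative) of the basic Lloyd iteration: alternately assign each object to its nearest centre and move each centre to the gravity centre of its cluster, until the split stabilises. Uniform sampling of k data points is only one of the initialisation options discussed in the chapter.

import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    # X: (m, n) data matrix; returns cluster centres and labels
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # assignment step: nearest centre in squared Euclidean distance
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # the split has stabilised
        labels = new_labels
        # update step: each centre becomes the gravity centre of its cluster;
        # an empty cluster keeps its previous centre
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres, labels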


Footnotes
1
Though the algorithm was presented in a report dated July 31st, 1957, it was not officially published until 1982. A similar algorithm was published earlier in a paper by J. Max, Quantizing for minimum distortion. IEEE Trans. on Info. Th. \(\mathbf{50}\)(5), 937–994, 1960. For this reason it is also called the Lloyd-Max algorithm, or Max-Lloyd algorithm.
 
2
We mentioned this property in Sect. 2.6.
 
3
See also Q. Du, M. Emelianenko, and L. Ju, Convergence of the Lloyd algorithm for computing centroidal Voronoi tessellations, SIAM J. on Numerical Analysis, \(\mathbf{44}\)(1), 102–119, 2006; M. Emelianenko, L. Ju, and A. Rand, Nondegeneracy and weak global convergence of the Lloyd algorithm in \(\mathbb {R}^d\), SIAM J. on Numerical Analysis, \(\mathbf{46}\)(3), 1423–1441, 2009.
 
4
It is a mixture (a hybrid) of worst-case and average-case analyses. It allows one to estimate realistically the complexity of an algorithm in practical settings. The research in this area was initiated by D. Spielman and S.-H. Teng with the paper “Smoothed analysis of algorithms: why the simplex algorithm usually takes polynomial time”, Proc. of the 33rd Annual ACM Symposium on Theory of Computing, ACM 2001, pp. 296–305. We recommend the “Mini course on smoothed analysis” available at http://roeglin.org/publications/SmoothedAnalysis.pdf.
 
5
Pollard in [395] considers a slightly different perspective on the k-means algorithm. He assumes that a probability density function f is defined over \(\mathbb {R}^n\) and that a set C of k or fewer points \(\varvec{\mu }_i\), \(i=1,\dots , k\) is sought that minimises the function \(\Phi (C, f)=\int \min _{\varvec{\mu }\in C} ||\mathbf{x}-\varvec{\mu }||_2^2 f(\mathbf{x}) d\mathbf{x}\). The k-means algorithm then serves as a means of clustering a finite sample of m elements from the distribution f. He asks whether the sequence of sets \(C_m\) of cluster centres of samples of size m converges, as m grows to infinity, to the set C of “true” cluster centres. He proves that it does so if there exists a unique such C and if we have at our disposal an algorithm computing the true optimum of Eq. (3.1) for any finite sample. This optimisation problem was, however, shown to be NP-hard.
 
6
Cluster gravity centres are also called cluster prototypes, class centres, or simply prototypes.
 
7
We treat here \(J_p\) as a continuous function of cluster centre coordinates.
 
8
The number \(\vartheta (m, k)\) is defined on p. 34.
 
9
Note that the gravity centre of the second cluster would make the diagram more complex.
 
10
rand denotes a random number generator sampling from the uniform distribution on the interval [0, 1].
 
11
Let us illustrate the complexity issue following the ideas of Inaba et al. [269]. The k cluster centres generated by the k-means algorithm create a Voronoi tessellation of the n-dimensional space into k regions, each of which is separated from each of its neighbours by an \((n-1)\)-dimensional hyperplane passing through the middle of the line segment connecting the two centres and orthogonal to this segment. As there are k regions, each has at most \(k-1\) neighbours, so at most \(k(k-1)/2\) such separating hyperplanes exist. Assume that no \(n+1\) of the m data points are hypercoplanar. In \(\mathbb {R}^n\), n non-hypercoplanar points are needed to define a hyperplane \(\pi \) (splitting the space into two halves). Such a hyperplane assigns the remaining points to either of its sides, marking them, say, with + and −. The set of hyperplanes preserving the “side” of these other points shall be called “associated” with the n points selected. The “associated” hyperplanes may be thought of as “slight” changes of position of the hyperplane \(\pi \) containing the selected points. It is easily checked that these “associated” hyperplanes can assign the n selected points any +/− labelling, that is, \(2^n\) different labellings of the n selected points agreeing with the labelling that \(\pi \) induces on the remaining points. As there are m data points, we can select at most \(\binom{m}{n}\) distinct hyperplanes containing exactly n of the points, yielding at most \(2^n\binom{m}{n}\) different labellings. Out of these one picks the mentioned \(k(k-1)/2\) hyperplanes separating the Voronoi regions. So the choice is restricted to \(\binom{2^n\binom{m}{n}}{k(k-1)/2}\ll \left( 2m\right) ^{nk(k-1)/2}\) possibilities, and this is an upper bound on the number of iterations of the k-means algorithm or its variants. The number is prohibitively big, but for low k and large m it is still orders of magnitude lower than \(k^{m-k}\), that is, than checking every split of m data points into k subsets. Actually the number of distinct splits of m data points into k subsets is much larger, but we use this expression as an illustrative lower bound.
However, the expected run time has been shown to be polynomial in the number of elements in the sample, see D. Arthur, B. Manthey, and H. Röglin. Smoothed Analysis of the k-Means Method. J. ACM 58 (5), 2011, Article 19 (October 2011), 31 pages.
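As a numeric aside of our own (not from the text), the gap between the iteration bound \(\left( 2m\right) ^{nk(k-1)/2}\) and the exhaustive-search figure \(k^{m-k}\) can be made tangible by comparing orders of magnitude for a few small parameter settings:

from math import log10

def log10_bounds(m, n, k):
    # base-10 logarithms of the iteration bound (2m)^(n k (k-1)/2)
    # and of the illustrative exhaustive-search figure k^(m-k)
    iteration_bound = n * k * (k - 1) / 2 * log10(2 * m)
    split_bound = (m - k) * log10(k)
    return iteration_bound, split_bound

for m, n, k in [(100, 2, 3), (1000, 2, 3), (10000, 5, 10)]:
    it_b, sp_b = log10_bounds(m, n, k)
    print(f"m={m}, n={n}, k={k}: iterations < 10^{it_b:.0f}, splits > 10^{sp_b:.0f}")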
 
12
See Hruschka, E.R., et al. A survey of evolutionary algorithms for clustering. IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 39(2), 2009, 133–155; A. Abraham, S. Das, S. Roy. Swarm intelligence algorithms for data clustering. In: Soft Computing for Knowledge Discovery and Data Mining. Springer US, 2008, pp. 279–313.
 
13
If \(U^t\) denotes the split matrix obtained in the t-th iteration, then the corresponding split is called a stabilised one if \(U^{t+1}=U^t\).
 
14
If we replace the Euclidean distance with the Minkowski distance \(d_l\), then the elements of the set X should be picked with probability \(p(\mathbf{x}) = u^l(\mathbf{x})/\sum _{\mathbf{x}' \in X} u^l(\mathbf{x}')\). Consult [30, p. 98].
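The sampling rule of this footnote can be realised directly: for every data point compute its Minkowski \(d_l\) distance \(u(\mathbf{x})\) to the nearest centre chosen so far and draw the next centre with probability proportional to \(u^l(\mathbf{x})\). The sketch below is our own illustration of a single such draw (names are ours); it assumes \(u(\mathbf{x})\) is this nearest-centre distance, as in the k-means++ style seeding the footnote refers to.

import numpy as np

def next_centre_index(X, centres, l=2, seed=None):
    # X: (m, n) data matrix; centres: centres chosen so far; l: Minkowski exponent
    rng = np.random.default_rng(seed)
    C = np.asarray(centres, dtype=float)
    # u(x): Minkowski d_l distance of x to its nearest already chosen centre
    u = np.min(np.sum(np.abs(X[:, None, :] - C[None, :, :]) ** l, axis=2) ** (1.0 / l), axis=1)
    p = u ** l                       # p(x) = u^l(x) / sum over x' of u^l(x')
    p = p / p.sum()
    return rng.choice(len(X), p=p)   # index of the next centre in X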
 
15
A reader interested in this type of solution is encouraged to consult the tutorial https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.
 
16
Of course, bad initialisations will be corrected to a reasonable extent by the k-means algorithm proper.
 
18
Read more in Sect. 3.15.2 about semi-supervised clustering.
 
19
The plain k-means algorithm assumes \(w(\mathbf{x}_i)=1\).
 
20
Dhillon et al. [149] suggest, for example, using weighted kernel k-means to discover clusters in graphs analogous to those obtained via normalised cuts. They optimise a weighted version of the cost function, that is
$$ J_2^\Phi (U) = \sum _{i=1}^m \min _{1\le j \le k} w(\mathbf{x}_i) \Vert \Phi (\mathbf{x}_i) - \varvec{\mu }_j^\Phi \Vert ^2 $$
where
$$ \varvec{\mu }_j^\Phi = \frac{1}{\sum _{\mathbf{x}_i \in C_j} w(\mathbf{x}_i)} \sum _{\mathbf{x}_i \in C_j} w(\mathbf{x}_i)\Phi (\mathbf{x}_i) $$
The application of kernel k-means liberates one from the necessity of computing the eigenvalues and eigenvectors needed in spectral clustering of graphs.
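Note that \(\Vert \Phi (\mathbf{x}_i) - \varvec{\mu }_j^\Phi \Vert ^2\) never has to be computed in feature space explicitly: expanding the square gives \(K_{ii} - \frac{2}{s_j}\sum _{\mathbf{x}_r \in C_j} w(\mathbf{x}_r) K_{ir} + \frac{1}{s_j^2}\sum _{\mathbf{x}_r, \mathbf{x}_s \in C_j} w(\mathbf{x}_r) w(\mathbf{x}_s) K_{rs}\), where \(s_j = \sum _{\mathbf{x}_r \in C_j} w(\mathbf{x}_r)\) and \(K_{ir}\) are kernel values. A possible assignment step based on this identity is sketched below (our own illustration; a precomputed kernel matrix K, weight vector w and current labels are assumed as inputs, and clusters are assumed non-empty).

import numpy as np

def assign_weighted_kernel_kmeans(K, w, labels, k):
    # one assignment step of weighted kernel k-means
    # K: (m, m) kernel matrix, w: (m,) weights, labels: current labels in {0, ..., k-1}
    m = len(w)
    dist = np.empty((m, k))
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        wj = w[idx]
        sj = wj.sum()
        cross = K[:, idx] @ wj / sj                      # sum_r w_r K(x_i, x_r) / s_j
        within = wj @ K[np.ix_(idx, idx)] @ wj / sj**2   # constant per cluster
        dist[:, j] = np.diag(K) - 2.0 * cross + within
    # the factor w(x_i) scales all k distances of object i equally, so argmin is unaffected
    return dist.argmin(axis=1)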
 
21
See e.g. N. Mladenović, J. Brimberg, P. Hansen, J.A. Moreno-Pérez. The p-median problem: A survey of meta-heuristic approaches. European J. of Op. Res., 179(3), 2007, 927–939.
 
22
More precisely, it is a so-called “hidden feature” or “hidden variable”.
 
23
One of initialisation methods for the EM algorithm, estimating mixture parameters, consists in applying the k-means algorithm.
 
24
See J.J. Verbeek, N. Vlassis, B. Kröse: “Efficient greedy learning of Gaussian mixture models”, Neural computation, 15(2), 2003, pp. 469–485.
 
25
G. Celeux, D. Chauveau, J. Diebolt: Stochastic versions of the EM algorithm: an experimental study in the mixture case. J. of Stat. Computation and Simulation, 55(4), 1996, pp. 287–314.
 
26
A “crisp” partition reflects the traditional view of splitting a set: each element is assigned to exactly one cluster. A “fuzzy” partition allows each element to be assigned to “some degree” to several clusters.
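As a small illustration of our own: for three objects and two clusters a crisp partition matrix has a single 1 in each row, whereas a fuzzy partition spreads each row's unit mass over the clusters, e.g.
$$ U_{\mathrm{crisp}} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad U_{\mathrm{fuzzy}} = \begin{pmatrix} 0.9 & 0.1 \\ 0.7 & 0.3 \\ 0.2 & 0.8 \end{pmatrix} $$
where each row of the fuzzy matrix sums to one.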
 
27
In other words, we do not allow for the existence of other (unknown) classes, to which objects could partially or completely belong.
 
28
Dunn [166] recommended \(\alpha =2\) on the grounds of the capability of the algorithm to reconstruct well separated clusters.
 
29
This problem is discussed in depth in Sect. 3.3.5.1.
 
30
T. Kohonen. Self-Organization and Associative Memory, Springer-Verlag, Berlin, Heidelberg, 1988.
 
31
K. Jajuga, \(L_1\)-norm based fuzzy clustering. Fuzzy Sets and Systems 39, 43–50, 1991. See also P.J.F. Groenen, U. Kaymak and J. van Rosmalen: Fuzzy clustering with Minkowski distance functions, Chap. 3 in [140].
 
32
Its role is similar to that of the fuzziness exponent \(\alpha \). Usually we assume \(r=n-1\).
 
33
R.N. Dave. Boundary detection through fuzzy clustering. In IEEE International Conf. on Fuzzy Systems, pp. 127–134, San Diego, USA, 1992.
 
34
See J.C. Bezdek, C. Coray, R. Gunderson and J. Watson. Detection and characterization of cluster substructure: I. Linear structure: Fuzzy c-lines. SIAM J. Appl. Math. 40(2), 339–357, 1981; Part II Fuzzy c-varieties and convex combinations thereof, SIAM J. Appl. Math. 40(2), 358–372, 1981.
 
35
Note that \(S_j\) is a symmetric matrix of dimension \(n \times n\). Hence it has n (not necessarily distinct) real-valued eigenvalues. See Sect. B.3.
 
36
See e.g. F. Klawonn, R. Kruse, and H. Timm. Fuzzy shell cluster analysis. Courses and Lectures—International Centre for Mechanical Sciences, Springer 1997, pp. 105–120; http://public.fh-wolfenbuettel.de/~klawonn/Papers/klawonnetaludine97.pdf.
 
37
In fact the formulas below remain valid for any kernel function matching the condition \(K(\mathbf{x}, \mathbf{x})=1\).
 
38
If \(\mathsf{K}(\mathbf{x}_i, \varvec{\mu }_l)=1\) for some index l then, upon determining \(u_{il}\), we proceed analogously as when applying Eq. (3.59).
 
39
This problem has been analysed in the paper by H. Timm, C. Borgelt, C. Doring, and R. Kruse, “An extension to possibilistic fuzzy cluster analysis”, Fuzzy Sets Syst., 147(1), 2004, pp. 3–16.
 
40
We recommend also the paper X. Wu, B. Wu, J. Sun and H. Fu, “Unsupervised possibilistic fuzzy clustering”, J. of Information & Computational Science, 7(5), 2010, pp. 1075–1080.
 
41
Formally it is the squared distance.
 
42
Its authors call those prototypes exemplars (of data groups) [191].
 
43
The interested reader is recommended to get acquainted with the Ph.D. Thesis [165].
 
44
In geometry, a flat is a subset of n-dimensional Euclidean space that is congruent to a Euclidean subspace of lower dimension. A point is a 0-flat, a line is a 1-flat, and a plane is a 2-flat, because they are congruent with 0-, 1-, and 2-dimensional Euclidean subspaces, respectively.
 
45
For degenerate sets of points the number may be even larger.
 
46
Still another approach, called subspace clustering, was proposed by Timmerman et al. [460]. In this approach, the coordinates of cluster centres are no longer required to lie in a single q-flat, but rather have two components, an “outer” one (in a \(q_b\)-flat) and an “inner” one (in a \(q_w\)-flat), which are orthogonal. So not only are subspaces sought in which clustering can be performed better, but a different subspace may be detected for each cluster. Regrettably, this method is equivalent, in terms of the clusters found, to reduced k-means clustering based on a \((q_w+q_b)\)-flat, while additionally introducing ambiguities in identifying both flats.
 
47
This sensitivity was already dealt with for classical k-means by Cuesta-Albertos et al. [128]: at each step, after cluster assignment was performed, the \(\alpha \cdot 100\%\) of elements most distant from their cluster centres were removed and the cluster centres were recomputed.
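One possible reading of this trimming step in code (our own sketch; names and the quantile-based cut are illustrative): after assignment, roughly the \(\alpha \cdot 100\%\) of points farthest from their centres are set aside and the centres are recomputed from the remaining points only.

import numpy as np

def trimmed_update(X, centres, alpha=0.05):
    # one assignment step followed by a trimmed centre update
    d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    nearest = d[np.arange(len(X)), labels]
    # keep (approximately) the (1 - alpha) fraction of points closest to their centres
    keep = nearest <= np.quantile(nearest, 1.0 - alpha)
    new_centres = np.array([X[keep & (labels == j)].mean(axis=0)
                            if np.any(keep & (labels == j)) else centres[j]
                            for j in range(len(centres))])
    return new_centres, labels, keep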
 
48
The algorithm is of broader applicability than described here. It extends the Bregman Hard Clustering algorithm, which itself is an extension of the k-means algorithm, obtained by replacing the Euclidean distance in the cost function with the more general Bregman divergence, discussed in Sect. 2.2.1.3.
 
49
W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemp. Math., pages 189–206. Amer. Math. Soc., Providence, RI, 1984.
 
Metadata
Title
Algorithms of Combinatorial Cluster Analysis
Authors
Sławomir T. Wierzchoń
Mieczysław A. Kłopotek
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-69308-8_3
