Skip to main content
Erschienen in: Journal of Classification 2/2021

24.07.2020

Initializing k-means Clustering by Bootstrap and Data Depth

verfasst von: Aurora Torrente, Juan Romo

Erschienen in: Journal of Classification | Ausgabe 2/2021

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The k-means algorithm is widely used in various research fields because of its fast convergence to the cost function minima; however, it frequently gets stuck in local optima as it is sensitive to initial conditions. This paper explores a simple, computationally feasible method, which provides k-means with a set of initial seeds to cluster datasets of arbitrary dimensions. Our technique consists of two stages: firstly, we use the original data space to obtain a set of prototypes (cluster centers) by applying k-means to bootstrap replications of the data and, secondly, we cluster the space of centers, which has tighter (thus easier to separate) groups, and search the deepest point in each assembled cluster using a depth notion. We test this method with simulated and real data, compare it with commonly used k-means initialization algorithms, and show that it is feasible and more efficient than previous proposals in many situations.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Aloise, D., Damasceno, N. C., Mladenović, N., & Pinheiro, D. N. (2017). On strategies to fix degenerate k-means solutions. Journal of Classification, 34(2), 165–190.MathSciNetMATHCrossRef Aloise, D., Damasceno, N. C., Mladenović, N., & Pinheiro, D. N. (2017). On strategies to fix degenerate k-means solutions. Journal of Classification, 34(2), 165–190.MathSciNetMATHCrossRef
Zurück zum Zitat Arcones, M. A., & Giné, E. (1992). On the bootstrap of M-estimators and other statistical functionals. In R. Lepage, & L. Billard (Eds.) Exploring the limits of the bootstrap (pp. 13–47). New York: Wiley. Arcones, M. A., & Giné, E. (1992). On the bootstrap of M-estimators and other statistical functionals. In R. Lepage, & L. Billard (Eds.) Exploring the limits of the bootstrap (pp. 13–47). New York: Wiley.
Zurück zum Zitat Arthur, D., & Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1027–1035). Arthur, D., & Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1027–1035).
Zurück zum Zitat Äyrämö, S., Kärkkäinen, T., & Majava, K. (2007). Robust refinement of initial prototypes for partitioning-based clustering algorithms. In C.H. Skiadas (Ed.) Recent advances in stochastic modelling and data analysis (pp. 473–482). Crete: World Scientific. Äyrämö, S., Kärkkäinen, T., & Majava, K. (2007). Robust refinement of initial prototypes for partitioning-based clustering algorithms. In C.H. Skiadas (Ed.) Recent advances in stochastic modelling and data analysis (pp. 473–482). Crete: World Scientific.
Zurück zum Zitat Bradley, P. S., & Fayyad, U. (1998). Refining initial points for k-means clustering. In Proceedings of the 15th International Conference of Machine Learning (pp. 91–99). Bradley, P. S., & Fayyad, U. (1998). Refining initial points for k-means clustering. In Proceedings of the 15th International Conference of Machine Learning (pp. 91–99).
Zurück zum Zitat Brusco, M. J. (2004). Clustering binary data in the presence of masking variables. Psychological Methods, 9(4), 510–523.CrossRef Brusco, M. J. (2004). Clustering binary data in the presence of masking variables. Psychological Methods, 9(4), 510–523.CrossRef
Zurück zum Zitat Celebi, M. E. (2011). Improving the performance of k-means for color quantization. Image and Vision Computing, 29(4), 260–271.CrossRef Celebi, M. E. (2011). Improving the performance of k-means for color quantization. Image and Vision Computing, 29(4), 260–271.CrossRef
Zurück zum Zitat Celebi, M. E., Kingravi, H. A., & Vela, P. A. (2013). A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1), 200–210.CrossRef Celebi, M. E., Kingravi, H. A., & Vela, P. A. (2013). A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1), 200–210.CrossRef
Zurück zum Zitat Dolnicar, S., & Leisch, F. (2001). Behavioral market segmentation of binary guest survey data with bagged clustering. In B.H.H.K. Dorffner G (Ed.) Artificial neural networks ICANN 2001, volume 2130 of lecture notes in computer science (pp. 111–118). Berlin: Springer. Dolnicar, S., & Leisch, F. (2001). Behavioral market segmentation of binary guest survey data with bagged clustering. In B.H.H.K. Dorffner G (Ed.) Artificial neural networks ICANN 2001, volume 2130 of lecture notes in computer science (pp. 111–118). Berlin: Springer.
Zurück zum Zitat Dudoit, S., & Fridlyand, J. (2003). Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9), 1090–1099.CrossRef Dudoit, S., & Fridlyand, J. (2003). Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9), 1090–1099.CrossRef
Zurück zum Zitat El Agha, M., & Ashour, W. M. (2012). Efficient and fast initialization algorithm for k-means clustering. International Journal of Intelligent Systems and Applications, 1, 21–31.CrossRef El Agha, M., & Ashour, W. M. (2012). Efficient and fast initialization algorithm for k-means clustering. International Journal of Intelligent Systems and Applications, 1, 21–31.CrossRef
Zurück zum Zitat Forgy, E. (1965). Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 21, 768–780. Forgy, E. (1965). Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 21, 768–780.
Zurück zum Zitat Garey, M. R., & Johnson, D. S. (1979). Computers and intractability: a guide to the theory of NP-completeness. New York: Freeman.MATH Garey, M. R., & Johnson, D. S. (1979). Computers and intractability: a guide to the theory of NP-completeness. New York: Freeman.MATH
Zurück zum Zitat Hand, D. J., & Krzanowski, W. J. (2004). Optimising k-means clustering results with standard software packages. Computational Statistics & Data Analysis, 49, 969–973.MathSciNetMATHCrossRef Hand, D. J., & Krzanowski, W. J. (2004). Optimising k-means clustering results with standard software packages. Computational Statistics & Data Analysis, 49, 969–973.MathSciNetMATHCrossRef
Zurück zum Zitat Hofmans, J., Ceulemans, E., Steinley, D., & Van Mechelen, I. (2015). On the added value of bootstrap analysis for k-means clustering. Journal of Classification, 32(2), 268–284.MathSciNetMATHCrossRef Hofmans, J., Ceulemans, E., Steinley, D., & Van Mechelen, I. (2015). On the added value of bootstrap analysis for k-means clustering. Journal of Classification, 32(2), 268–284.MathSciNetMATHCrossRef
Zurück zum Zitat Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–198.MATHCrossRef Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–198.MATHCrossRef
Zurück zum Zitat Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.CrossRef Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.CrossRef
Zurück zum Zitat Jörnsten, R. (2004). Clustering and classification based on the L1-data depth. Journal of Multivariate Analysis, 90, 67–89.MathSciNetMATHCrossRef Jörnsten, R. (2004). Clustering and classification based on the L1-data depth. Journal of Multivariate Analysis, 90, 67–89.MathSciNetMATHCrossRef
Zurück zum Zitat Jörnsten, R., Vardi, Y., & Zhang, C. H. (2002). A robust clustering method and visualization tool based on data depth. In Y. Dodge (Ed.) Statistical data analysis based on the L1–norm and related methods (pp. 313–366). Birkhäuser Verla: Basel. Jörnsten, R., Vardi, Y., & Zhang, C. H. (2002). A robust clustering method and visualization tool based on data depth. In Y. Dodge (Ed.) Statistical data analysis based on the L1–norm and related methods (pp. 313–366). Birkhäuser Verla: Basel.
Zurück zum Zitat Katsavounidis, I., Kuo, C., & Zhang, Z. (1994). A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1, 144–146.CrossRef Katsavounidis, I., Kuo, C., & Zhang, Z. (1994). A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1, 144–146.CrossRef
Zurück zum Zitat Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: an introduction to cluster analysis. New York: Wiley.MATHCrossRef Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: an introduction to cluster analysis. New York: Wiley.MATHCrossRef
Zurück zum Zitat Kerr, M. K., & Churchill, G. A. (2001). Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proceedings of the National Academy of Sciences of the United States of America, 98(16), 8961–8965.MATHCrossRef Kerr, M. K., & Churchill, G. A. (2001). Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proceedings of the National Academy of Sciences of the United States of America, 98(16), 8961–8965.MATHCrossRef
Zurück zum Zitat Khan, S. S., & Ahmad, A. (2004). Cluster center initialization algorithm for k-means clustering. Pattern Recognition Letters, 25(22), 1293–1302.CrossRef Khan, S. S., & Ahmad, A. (2004). Cluster center initialization algorithm for k-means clustering. Pattern Recognition Letters, 25(22), 1293–1302.CrossRef
Zurück zum Zitat Liao, H., Jihjai, X., Sun, W., Dai, J., & Yu, S. (2014). Adaptative initialization method based on spatial local information for k-means algorithm. Mathematical Problems in Engineering. Article ID 761468, 11 pp. Liao, H., Jihjai, X., Sun, W., Dai, J., & Yu, S. (2014). Adaptative initialization method based on spatial local information for k-means algorithm. Mathematical Problems in Engineering. Article ID 761468, 11 pp.
Zurück zum Zitat López-Pintado, S., & Romo, J. (2009). On the concept of depth for functional data. Journal of the American Statistical Association, 104(486), 718–734.MathSciNetMATHCrossRef López-Pintado, S., & Romo, J. (2009). On the concept of depth for functional data. Journal of the American Statistical Association, 104(486), 718–734.MathSciNetMATHCrossRef
Zurück zum Zitat MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium in Mathematics and Probability (pp. 281–297). MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium in Mathematics and Probability (pp. 281–297).
Zurück zum Zitat Mahajan, M., Nimnhorkar, P., & Varadarajan, K. (2012). The planar k- means problem is NP-hard. Theoretical Computer Science, 442, 13–21.MathSciNetMATHCrossRef Mahajan, M., Nimnhorkar, P., & Varadarajan, K. (2012). The planar k- means problem is NP-hard. Theoretical Computer Science, 442, 13–21.MathSciNetMATHCrossRef
Zurück zum Zitat Maitra, R., & Melnykov, V. (2010). Simulating data to study performance of finite mixture modeling and clustering algorithms. Journal of Computational and Graphical Statistics, 19(2), 354–376.MathSciNetCrossRef Maitra, R., & Melnykov, V. (2010). Simulating data to study performance of finite mixture modeling and clustering algorithms. Journal of Computational and Graphical Statistics, 19(2), 354–376.MathSciNetCrossRef
Zurück zum Zitat Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.CrossRef Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.CrossRef
Zurück zum Zitat Milligan, G. W., & Cooper, M. C. (1988). A study of standardization of variables in cluster analysis. Journal of Classification, 5, 181–204.MathSciNetCrossRef Milligan, G. W., & Cooper, M. C. (1988). A study of standardization of variables in cluster analysis. Journal of Classification, 5, 181–204.MathSciNetCrossRef
Zurück zum Zitat Milligan, G. W., Soon, S. C., & Sokol, L. M. (1983). The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Transactions of Pattern Analysis and Machine Intelligence, PAMI-5 (1), 40–47.CrossRef Milligan, G. W., Soon, S. C., & Sokol, L. M. (1983). The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Transactions of Pattern Analysis and Machine Intelligence, PAMI-5 (1), 40–47.CrossRef
Zurück zum Zitat Peña, J. M., Lozano, J. A., & Larrañaga, P. (1999). An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20, 1027–1040.CrossRef Peña, J. M., Lozano, J. A., & Larrañaga, P. (1999). An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20, 1027–1040.CrossRef
Zurück zum Zitat Pillai, K. C. S. (1954). On some distribution problems in multivariate analysis. Mimeograph Series No. 88, Institute of Statistics, University of North Carolina. Pillai, K. C. S. (1954). On some distribution problems in multivariate analysis. Mimeograph Series No. 88, Institute of Statistics, University of North Carolina.
Zurück zum Zitat Punj, G., & Stewart, D. W. (1983). Cluster analysis in marketing research: review and suggestions for application. Journal of Marketing Research, 20(2), 134–148.CrossRef Punj, G., & Stewart, D. W. (1983). Cluster analysis in marketing research: review and suggestions for application. Journal of Marketing Research, 20(2), 134–148.CrossRef
Zurück zum Zitat Reddy, D., Mishra, D., & Jana, P. K. (2011). MST-based cluster initialization for k-means. Advances in Computer Science and Information Technology 2011, Part I, Communications in Computer and Information Science, 131, 329–338. Reddy, D., Mishra, D., & Jana, P. K. (2011). MST-based cluster initialization for k-means. Advances in Computer Science and Information Technology 2011, Part I, Communications in Computer and Information Science, 131, 329–338.
Zurück zum Zitat Redmonds, S. J., & Heneghan, C. (2007). A method for initialising the k-means clustering algorithm using kd-trees. Pattern Recognition Letters, 28(8), 965–973.CrossRef Redmonds, S. J., & Heneghan, C. (2007). A method for initialising the k-means clustering algorithm using kd-trees. Pattern Recognition Letters, 28(8), 965–973.CrossRef
Zurück zum Zitat Selim, S. Z., & Ismail, M. A. (1984). k-means type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis, 6(1), 81–87.MATHCrossRef Selim, S. Z., & Ismail, M. A. (1984). k-means type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis, 6(1), 81–87.MATHCrossRef
Zurück zum Zitat Steinley, D. (2003). Local optima in k-means clustering: what you don’t know may hurt you. Psychological Methods, 8(3), 294–304.CrossRef Steinley, D. (2003). Local optima in k-means clustering: what you don’t know may hurt you. Psychological Methods, 8(3), 294–304.CrossRef
Zurück zum Zitat Steinley, D. (2004). Properties of the Hubert-Arabie adjusted rand index. Psychological Methods, 9(3), 386–396.CrossRef Steinley, D. (2004). Properties of the Hubert-Arabie adjusted rand index. Psychological Methods, 9(3), 386–396.CrossRef
Zurück zum Zitat Steinley, D. (2006). Profiling local optima in k-means clustering: developing a diagnostic technique. Psychological Methods, 11(2), 178–192.CrossRef Steinley, D. (2006). Profiling local optima in k-means clustering: developing a diagnostic technique. Psychological Methods, 11(2), 178–192.CrossRef
Zurück zum Zitat Steinley, D., & Brusco, M. J. (2007). Initializing k-means batch clustering: a critical evaluation of several techniques. Journal of Classification, 24, 99–121.MathSciNetMATHCrossRef Steinley, D., & Brusco, M. J. (2007). Initializing k-means batch clustering: a critical evaluation of several techniques. Journal of Classification, 24, 99–121.MathSciNetMATHCrossRef
Zurück zum Zitat Su, T., & Dy, J. G. (2007). In search of deterministic methods for initializing k-means and Gaussian mixture clustering. Intelligent Data Analysis, 11, 319–338.CrossRef Su, T., & Dy, J. G. (2007). In search of deterministic methods for initializing k-means and Gaussian mixture clustering. Intelligent Data Analysis, 11, 319–338.CrossRef
Zurück zum Zitat Tou, J. T., & González, R. C. (1974). Pattern recognition principles. Massachusetts: Addison Wesley.MATH Tou, J. T., & González, R. C. (1974). Pattern recognition principles. Massachusetts: Addison Wesley.MATH
Zurück zum Zitat Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians (pp. 523–531). Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians (pp. 523–531).
Zurück zum Zitat Vardi, Y., & Zhang, C. H. (2000). The multivariate L1-median and associated data depth. Proceedings of the National Academy of Sciences of the United States of America, 97, 1423–1426.MathSciNetMATHCrossRef Vardi, Y., & Zhang, C. H. (2000). The multivariate L1-median and associated data depth. Proceedings of the National Academy of Sciences of the United States of America, 97, 1423–1426.MathSciNetMATHCrossRef
Zurück zum Zitat Vega-Pons, S., & Ruiz-Schulcloper, J. (2011). A survey of clustering ensemble algorithms. International Journal of Patter Recognition and Artificial Intelligence, 25(3), 337–372.MathSciNetCrossRef Vega-Pons, S., & Ruiz-Schulcloper, J. (2011). A survey of clustering ensemble algorithms. International Journal of Patter Recognition and Artificial Intelligence, 25(3), 337–372.MathSciNetCrossRef
Zurück zum Zitat Ward, J. J. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–244.MathSciNetCrossRef Ward, J. J. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–244.MathSciNetCrossRef
Metadaten
Titel
Initializing k-means Clustering by Bootstrap and Data Depth
verfasst von
Aurora Torrente
Juan Romo
Publikationsdatum
24.07.2020
Verlag
Springer US
Erschienen in
Journal of Classification / Ausgabe 2/2021
Print ISSN: 0176-4268
Elektronische ISSN: 1432-1343
DOI
https://doi.org/10.1007/s00357-020-09372-3

Weitere Artikel der Ausgabe 2/2021

Journal of Classification 2/2021 Zur Ausgabe

Premium Partner