2014 | OriginalPaper | Chapter

Robustness and Stability Analysis of Factor PD-Clustering on Large Social Data Sets

Abstract

Factor clustering methods have been proposed for clustering large data sets. Among them, factor probabilistic distance clustering (FPDC) shows interesting performance. The method consists of two main steps: a Tucker3 decomposition of the distance array, followed by probabilistic distance (PD) clustering on the resulting factors. The aim of this paper is to apply FPDC to behavioral and social data sets of large dimensions in order to obtain homogeneous and well-separated clusters of individuals, and to evaluate the stability and robustness of the method on such data. Stability refers to the invariance of the results across iterations of the method. Robustness refers to the sensitivity of the method to errors in the data. Both characteristics are evaluated using bootstrap resampling.
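
The bootstrap evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' FPDC implementation: the Tucker3 step is omitted, plain PD-clustering is run on the raw data (memberships inversely proportional to the distance from each center), and agreement between partitions is measured with the adjusted Rand index, which is an assumption made here rather than something the abstract specifies. All function names (`memberships`, `init_centers`, `pd_clustering`, `adjusted_rand_index`, `bootstrap_stability`), the seeding strategy, and the toy data are hypothetical.

```python
import numpy as np

def memberships(X, centers):
    """PD-clustering memberships: p_k(x) is proportional to 1 / d_k(x)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # guard against division by zero
    inv = 1.0 / d
    return inv / inv.sum(axis=1, keepdims=True), d

def init_centers(X, k, rng):
    """Illustrative seeding (assumes at least k distinct points): maximin seeds,
    then one averaging step so that no center sits exactly on a data point."""
    chosen = [int(rng.integers(len(X)))]
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2).min(axis=1)
        chosen.append(int(d.argmax()))
    seeds = X[chosen]
    nearest = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2).argmin(axis=1)
    return np.vstack([X[nearest == j].mean(axis=0) for j in range(k)])

def pd_clustering(X, k, n_iter=200, tol=1e-8, seed=0):
    """Plain PD-clustering; returns hard labels and the final centers."""
    rng = np.random.default_rng(seed)
    centers = init_centers(X, k, rng)
    for _ in range(n_iter):
        p, d = memberships(X, centers)
        u = p ** 2 / d  # weights used in the center update
        new_centers = (u.T @ X) / u.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return memberships(X, centers)[0].argmax(axis=1), centers

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same points."""
    a = np.unique(a, return_inverse=True)[1]
    b = np.unique(b, return_inverse=True)[1]
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(table, (a, b), 1)  # contingency table of the two partitions
    comb2 = lambda x: x * (x - 1) / 2.0
    sum_ij = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()
    sum_b = comb2(table.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(len(a))
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def bootstrap_stability(X, k, n_boot=50, seed=0):
    """Cluster each bootstrap sample, assign all original points to the nearest
    bootstrap center, and compare with the partition of the full data."""
    rng = np.random.default_rng(seed)
    ref_labels, _ = pd_clustering(X, k, seed=seed)
    scores = []
    for b in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        _, centers_b = pd_clustering(X[idx], k, seed=seed + b + 1)
        labels_b = memberships(X, centers_b)[0].argmax(axis=1)
        scores.append(adjusted_rand_index(ref_labels, labels_b))
    return np.array(scores)

# Toy data: two well-separated Gaussian groups (purely illustrative).
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 0.5, size=(50, 2)),
               data_rng.normal(10.0, 0.5, size=(50, 2))])
scores = bootstrap_stability(X, k=2, n_boot=20)
print("mean agreement with the full-data partition:", scores.mean())
```

Scores close to 1 across replicates suggest a stable partition. A robustness check in the same spirit could perturb `X` (for example by adding noise or outliers) before resampling and compare the resulting scores.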

Metadata
Title
Robustness and Stability Analysis of Factor PD-Clustering on Large Social Data Sets
Authors
Cristina Tortora
Marina Marino
Copyright Year
2014
DOI
https://doi.org/10.1007/978-3-319-06692-9_29