2014 | OriginalPaper | Book chapter

Robustness and Stability Analysis of Factor PD-Clustering on Large Social Data Sets

Authors: Cristina Tortora, Marina Marino

Published in: Analysis and Modeling of Complex Data in Behavioral and Social Sciences

Publisher: Springer International Publishing

Abstract

Factor clustering methods have been proposed for clustering large data sets. Among them, factor probabilistic distance clustering (FPDC) has shown promising performance. The method consists of two main steps: a Tucker3 decomposition of the distance array, followed by probabilistic distance (PD) clustering on the resulting factors. This paper applies FPDC to large behavioral and social data sets in order to obtain homogeneous, well-separated clusters of individuals, and evaluates the stability and robustness of the method on such data. Stability refers to the invariance of the results across iterations of the method; robustness refers to the sensitivity of the method to errors in the data. Both properties are evaluated using bootstrap resampling.
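
The clustering and resampling ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and it omits the Tucker3 step entirely: it runs plain PD-clustering, which assumes that membership probability times distance to the center is constant across clusters for each point, and then estimates stability by refitting on bootstrap samples and comparing partitions with the Rand index. The function names (`memberships`, `pd_clustering`, `rand_index`, `bootstrap_stability`), the random initialization inside the data's bounding box, and the choice of the Rand index as the agreement measure are all assumptions made for illustration.

```python
import numpy as np

def memberships(X, centers):
    """PD-clustering memberships: p_ik * d_ik is constant over k, so p_ik ~ 1/d_ik."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    inv = 1.0 / np.maximum(d, 1e-12)  # guard against zero distances
    return inv / inv.sum(axis=1, keepdims=True)

def pd_clustering(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain PD-clustering on X (no Tucker3 step); returns the k centers."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Assumption: start from random points in the bounding box of the data.
    centers = lo + rng.random((k, X.shape[1])) * (hi - lo)
    for _ in range(n_iter):
        d = np.maximum(
            np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), 1e-12
        )
        p = memberships(X, centers)
        u = p**2 / d  # weights of the center update
        new_centers = (u.T @ X) / u.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers

def rand_index(a, b):
    """Share of point pairs on which two partitions agree (label-invariant)."""
    a, b = np.asarray(a), np.asarray(b)
    iu = np.triu_indices(len(a), k=1)
    same_a = (a[:, None] == a[None, :])[iu]
    same_b = (b[:, None] == b[None, :])[iu]
    return float(np.mean(same_a == same_b))

def bootstrap_stability(X, k, n_boot=20, seed=0):
    """Mean Rand index between a reference partition and bootstrap refits."""
    rng = np.random.default_rng(seed)
    ref = memberships(X, pd_clustering(X, k, seed=seed)).argmax(axis=1)
    scores = []
    for b in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))  # resample with replacement
        centers_b = pd_clustering(X[idx], k, seed=seed + b + 1)
        labels_b = memberships(X, centers_b).argmax(axis=1)
        scores.append(rand_index(ref, labels_b))
    return float(np.mean(scores))
```

Calling `bootstrap_stability(X, k)` on an `(n, d)` array returns a value between 0 and 1, with values near 1 suggesting that the partition changes little under resampling. A robustness check in the spirit of the abstract could follow the same pattern by perturbing `X` before refitting, though the perturbation scheme used in the chapter is not given here.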

Metadata
Title
Robustness and Stability Analysis of Factor PD-Clustering on Large Social Data Sets
Authors
Cristina Tortora
Marina Marino
Copyright year
2014
DOI
https://doi.org/10.1007/978-3-319-06692-9_29