Published in: Advances in Data Analysis and Classification 3/2013

1 September 2013 | Regular Article

Model-based clustering of probability density functions

Authors: Angela Montanari, Daniela G. Calò

Abstract

Complex data, in which each statistical unit under study is described not by a single observation (or vector variable) but by a unit-specific sample of several or even many observations, are becoming increasingly common. Reducing these sample data to summary statistics, such as the average or the median, discards most of the inherent information about variability, skewness or multi-modality. Full information is preserved only if each unit is described by a whole distribution. This new kind of data, known as “distribution-valued data”, requires the development of adequate statistical methods. This paper presents a method for grouping a set of probability density functions (pdfs) into homogeneous clusters when the pdfs have to be estimated nonparametrically from the unit-specific data. Since elements belonging to the same cluster are naturally thought of as samples from the same probability model, the idea is to tackle the clustering problem by defining and estimating a proper mixture model on the space of pdfs. Model building is challenging here because the domain space is infinite-dimensional and has a non-Euclidean geometry. By adopting a wavelet-based representation for the elements of the space, the task is accomplished using mixture models for hyper-spherical data. The proposed solution is illustrated through a simulation experiment and on two real data sets.
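
To see why pdfs can be treated as hyper-spherical data, note that the square root of any pdf has unit L2 norm, so its coefficients in an orthonormal basis form a unit vector. The Python sketch below illustrates this idea under loudly stated assumptions and is not the authors' implementation: a piecewise-constant (histogram) basis stands in for the wavelet basis, hard-assignment spherical k-means stands in for the mixture model for hyper-spherical data, and all function names, sample sizes and Beta parameters are hypothetical choices made for illustration.

```python
import numpy as np

def sqrt_density_coords(sample, n_bins=16, support=(0.0, 1.0)):
    """Coefficients of sqrt(pdf) in an orthonormal piecewise-constant basis.

    Assumption: a histogram basis replaces the wavelet basis of the abstract.
    With bin probabilities p_j, the coefficients are sqrt(p_j).
    """
    counts, _ = np.histogram(sample, bins=n_bins, range=support)
    probs = counts / counts.sum()
    return np.sqrt(probs)  # unit Euclidean norm, because probs sums to 1

def spherical_kmeans(X, k, n_iter=20):
    """Hard clustering of unit vectors by cosine similarity.

    Assumption: a simplified stand-in for a mixture model on the hypersphere.
    """
    # Deterministic farthest-first start: begin with the first unit, then add
    # the unit whose largest cosine to the chosen centers is smallest.
    centers = [X[0]]
    while len(centers) < k:
        sims = np.max(X @ np.array(centers).T, axis=1)
        centers.append(X[np.argmin(sims)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                total = members.sum(axis=0)
                centers[j] = total / np.linalg.norm(total)  # back onto the sphere
    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers

# Hypothetical data: 20 units, each observed through its own sample of 500 values.
rng = np.random.default_rng(0)
units = [rng.beta(2, 5, size=500) for _ in range(10)]
units += [rng.beta(5, 2, size=500) for _ in range(10)]

X = np.array([sqrt_density_coords(u) for u in units])
labels, centers = spherical_kmeans(X, k=2)
print(np.linalg.norm(X, axis=1))  # every unit lies on the unit hypersphere
print(labels)
```

The inner product of two such coefficient vectors is a discretised Bhattacharyya coefficient between the two estimated densities, which is why cosine similarity is a natural measure of closeness in this representation.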

Metadata
Title
Model-based clustering of probability density functions
Authors
Angela Montanari
Daniela G. Calò
Publication date
1 September 2013
Publisher
Springer Berlin Heidelberg
Published in
Advances in Data Analysis and Classification, Issue 3/2013
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-013-0140-8
