Published in: Advances in Data Analysis and Classification 2/2015

1 June 2015 | Regular Article

Basic statistics for distributional symbolic variables: a new metric-based approach

Authors: Antonio Irpino, Rosanna Verde


Abstract

In data mining, a group of measurements is commonly described by summary statistics or by an empirical distribution function. Symbolic data analysis (SDA) addresses data of this kind, allowing the description and analysis of conceptual data or of macrodata that summarize classical data. Within the conceptual framework of SDA, this paper presents new basic statistics for distribution-valued variables, i.e., variables whose realizations are distributions. The proposed measures extend classical univariate (mean, variance, standard deviation) and bivariate (covariance and correlation) statistics to distribution-valued variables, taking into account the nature and the variability of such data. The novel statistics are based on a distance between distributions: the \(\ell_2\) Wasserstein distance. A comparison with other univariate and bivariate statistics proposed in the literature highlights some relevant properties of the new ones. An application to a clinical dataset shows the main differences in the interpretation of results.
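To make the idea of metric-based statistics concrete, the following is a minimal Python sketch, not the paper's own definitions (which the abstract does not reproduce). It relies on two standard facts about one-dimensional distributions: the squared \(\ell_2\) Wasserstein distance equals the integral over (0, 1) of the squared difference between the two quantile functions, and the corresponding Fréchet mean has the average of the quantile functions as its quantile function. All function names, the grid size, and the choice of variance as the mean squared distance to the Fréchet mean are assumptions made for illustration.

```python
import numpy as np


def quantile_grid(sample, n_levels=1000):
    """Evaluate the empirical quantile function of `sample` on a midpoint grid of (0, 1)."""
    levels = (np.arange(n_levels) + 0.5) / n_levels
    return np.quantile(np.asarray(sample, dtype=float), levels)


def wasserstein_l2_squared(q1, q2):
    """Approximate the squared l2 Wasserstein distance between two 1D distributions.

    q1 and q2 are quantile functions sampled on the same grid; the integral
    of (Q1(t) - Q2(t))^2 over (0, 1) is approximated by the grid average.
    """
    return float(np.mean((q1 - q2) ** 2))


def frechet_mean(quantiles):
    """Pointwise average of quantile functions (the l2 Wasserstein mean in 1D)."""
    return np.mean(quantiles, axis=0)


def wasserstein_variance(quantiles):
    """Mean squared l2 Wasserstein distance to the Frechet mean (illustrative definition)."""
    mean_q = frechet_mean(quantiles)
    return float(np.mean([wasserstein_l2_squared(q, mean_q) for q in quantiles]))


if __name__ == "__main__":
    # Hypothetical data: each "observation" is itself a sample of measurements.
    rng = np.random.default_rng(0)
    units = [rng.normal(loc=m, scale=s, size=500) for m, s in [(0, 1), (2, 1), (1, 3)]]
    q = np.array([quantile_grid(u) for u in units])
    print("squared distance, unit 0 vs unit 1:", wasserstein_l2_squared(q[0], q[1]))
    print("variance of the distributional variable:", wasserstein_variance(q))
```

In this sketch, two distributions that differ only by a shift of 2 are at a squared distance of about 4, while distributions with the same mean but different spread or shape are still at a positive distance, which is the sense in which such statistics take the variability within each observation into account.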

Footnotes
1. We merged the three tables describing the three histogram variables, which are presented in different sections of the book.
Metadata
Title
Basic statistics for distributional symbolic variables: a new metric-based approach
Authors
Antonio Irpino
Rosanna Verde
Publication date
1 June 2015
Publisher
Springer Berlin Heidelberg
Published in
Advances in Data Analysis and Classification / Issue 2/2015
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-014-0176-4