Published in: Journal of Classification 2/2022

18.03.2022

Batch Self-Organizing Maps for Distributional Data with an Automatic Weighting of Variables and Components

Authors: Francisco de A. T. de Carvalho, Antonio Irpino, Rosanna Verde, Antonio Balzanella


Abstract

This paper presents a batch self-organizing map algorithm for data described by distributional-valued variables (DBSOM). Such variables take as values probability or frequency distributions on a numeric support. Given the nature of the data, the loss function is based on the L2 Wasserstein distance, which is one of the most widely used metrics for comparing distributions in distributional data analysis. In addition, to account for the different contributions of the variables, four adaptive versions of the DBSOM algorithm are proposed. Relevance weights are learned automatically, one for each distributional-valued variable, in an additional step of the algorithm. Since the L2 Wasserstein metric allows the distance to be decomposed into two components, one related to the means and one related to the size and shape of the distributions, relevance weights are also learned automatically for each of the two components, to emphasize the importance of the different characteristics of the distributions, related to their moments, for the distance value. The proposed algorithms are supported by applications to real distributional-valued data sets.
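
The decomposition mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each distribution is given as an empirical sample, that both samples have the same size, and that the quantile functions are the sorted samples. The function names `squared_w2` and `decompose_w2` are hypothetical. Under these assumptions, the squared L2 Wasserstein distance is the mean squared difference between sorted values, and it splits into a squared difference of means plus the squared distance between the centered samples.

```python
from statistics import fmean


def squared_w2(x, y):
    """Squared L2 Wasserstein distance between two equal-size empirical samples.

    Assumption of this sketch: the quantile function of each sample is the
    step function defined by its sorted values.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes samples of equal size")
    xs, ys = sorted(x), sorted(y)
    return fmean([(a - b) ** 2 for a, b in zip(xs, ys)])


def decompose_w2(x, y):
    """Split the squared distance into a mean part and a size/shape part."""
    mx, my = fmean(x), fmean(y)
    # Component related to the means (location).
    location = (mx - my) ** 2
    # Component related to size and shape: distance between centered samples.
    shape = squared_w2([a - mx for a in x], [b - my for b in y])
    return location, shape


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
total = squared_w2(x, y)
location, shape = decompose_w2(x, y)
print(total, location, shape)  # the two components add up to the total
```

Because the centered quantile functions average to zero, the cross term vanishes and the two components sum exactly to the total. This is what makes it possible to attach a separate relevance weight to each component.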

Footnotes
1
We remark that g_mj is a distributional datum because the corresponding quantile function Q_mj is a weighted average quantile function. For further details, see Irpino and Verde (2015).
 
3
See the ?? for the proof.
 
References
Altun, K, Barshan, B, & Tunçel, O (2010). Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition, 43(10), 3605–3620.
Badran, F, Yacoub, M, & Thiria, S (2005). Self-organizing maps and unsupervised classification. In G Dreyfus (Ed.) Neural Networks: methodology and applications (pp. 379–442). Singapore: Springer.
Bao, C, Peng, H, He, D, & Wang, J (2018). Adaptive fuzzy c-means clustering algorithm for interval data type based on interval-dividing technique. Pattern Analysis and Applications, 21, 803–812.
Barshan, B, & Yuksek, M C (2014). Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units. The Computer Journal, 57(11), 1649–1667.
Barshan, B, & Yurtman, A (2016). Investigating inter-subject and inter-activity variations in activity recognition using wearable motion sensors. The Computer Journal, 59(9), 1345–1362.
Bock, H H (2002). Clustering algorithms and Kohonen maps for symbolic data. J Jpn Soc Comp Statist, 15, 1–13.
Bock, H H, & Diday, E. (2000). Analysis of symbolic data exploratory methods for extracting statistical information from complex data. Berlin: Springer.
Cabanes, G, Bennani, Y, Destenay, R, & Hardy, A (2013). A new topological clustering algorithm for interval data. Pattern Recognition, 46, 3030–3039.
Campello, R J G B, & Hruschka, E R (2006). A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets and Systems, 157(21), 2858–2875.
de Carvalho, F A T, & De Souza, R M C R (2010). Unsupervised pattern recognition models for mixed feature–type symbolic data. Pattern Recognition Letters, 31, 430–443.
de Carvalho, F A T, & Lechevallier, Y (2009). Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recognition, 42(7), 1223–1236.
de Carvalho, F A T, Bertrand, P, & Simões, E C (2016). Batch SOM algorithms for interval-valued data with automatic weighting of the variables. Neurocomputing, 182, 66–81.
Diday, E, & Govaert, G (1977). Classification automatique avec distances adaptatives. RAIRO Informatique Computer Science, 11(4), 329–349.
Diday, E, & Simon, J C (1976). Clustering analysis. In K Fu (Ed.) Digital pattern classification (pp. 47–94). Berlin: Springer.
D’Urso, P, & Giovanni, L D (2011). Midpoint radius self-organizing maps for interval-valued data with telecommunications application. Applied Soft Computing, 11, 3877–3886.
Friedman, J H, & Meulman, J J (2004). Clustering objects on subsets of attributes. Journal of the Royal Statistical Society: Series B, 66, 815–849.
Gibbs, A L, & Su, F E (2002). On choosing and bounding probability metrics. International Statistical Review, 70(3), 419–435.
Hajjar, C, & Hamdan, H (2011a). Self-organizing map based on city-block distance for interval-valued data. In Complex Systems Design and Management - CSDM, (Vol. 2011 pp. 281–292).
Hajjar, C, & Hamdan, H (2011b). Self-organizing map based on Hausdorff distance for interval-valued data. In IEEE International Conference on Systems, Man, and Cybernetics - SMC, (Vol. 2011 pp. 1747–1752).
Hajjar, C, & Hamdan, H (2011c). Self-organizing map based on l2 distance for interval-valued data. In 6th IEEE International Symposium on Applied Computational Intelligence and Informatics - SACI, (Vol. 2011 pp. 317–322).
Hajjar, C, & Hamdan, H (2013). Interval data clustering using self-organizing maps based on adaptive Mahalanobis distances. Neural Networks, 46, 124–132.
Huang, J Z, Ng, M K, Rong, H, & Li, Z (2005). Automated variable weighting in k-means type clustering. IEEE Trans Pattern Anal Mach Intell, 27(5), 657–668.
Hubert, L, & Arabie, P (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Irpino, A. (2018). HistDAWass: Histogram data analysis using Wasserstein distance. R package version 1.0.1.
Irpino, A, & Romano, E. (2007). Optimal histogram representation of large data sets: Fisher vs piecewise linear approximation. Revue des Nouvelles Technologies de l’Information RNTI-E-9, 99–110.
Irpino, A, & Verde, R (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In B Vea (Ed.) Data Science and Classification (pp. 185–192). Berlin: Springer.
Irpino, A, & Verde, R (2015). Basic statistics for distributional symbolic variables: A new metric-based approach. Advances in Data Analysis and Classification, 9(2), 143–175.
Irpino, A, Verde, R, & de Carvalho, F A T (2014). Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. Expert Systems with Applications, 41(7), 3351–3366.
Irpino, A, Verde, R, & de Carvalho, F A T (2017). Fuzzy clustering of distributional data with automatic weighting of variable components. Information Sciences, 406-407, 248–268.
Kim, J, & Billard, L (2011). A polythetic clustering process and cluster validity indexes for histogram-valued objects. Computational Statistics and Data Analysis, 55(7), 2250–2262.
Kim, J, & Billard, L (2013). Dissimilarity measures for histogram-valued observations. Communications in Statistics - Theory and Methods, 42(2), 283–303.
Kohonen, T (2013). Essentials of the self-organizing map. Neural Networks, 37(1), 52–65.
Kohonen, T. (2014). MATLAB Implementations and Applications of the Self-Organizing Map. Helsinki: Unigrafia Oy.
Korenjak-Černe, S, & Batagelj, V. (2002). Symbolic data analysis approach to clustering large datasets, (pp. 319–327). Berlin: Springer.
Manning, C, Raghavan, P, & Schütze, H. (2008). Introduction to information retrieval. New York: Cambridge University Press.
Meila, M (2007). Comparing clusterings – an information based distance. Journal of Multivariate Analysis, 98(5), 873–895.
Modha, D S, & Spangler, W S (2003). Feature weighting in k-means clustering. Machine Learning, 52(3), 217–237.
Mount, N J, & Weaver, D (2011). Self-organizing maps and boundary effects: Quantifying the benefits of torus wrapping for mapping som trajectories. Pattern Analysis and Applications, 14(2), 139–148.
Rousseeuw, P J (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Rüschendorf, L. (2001). Wasserstein metric. In Encyclopedia of Mathematics. Springer.
Terada, T, & Yadohisa, H (2010). Non-hierarchical clustering for distribution-valued data. In Y Lechevallier G Saporta (Eds.) Proceedings of COMPSTAT, (Vol. 2010 pp. 1653–1660). Berlin: Springer.
Verde, R, & Irpino, A (2008a). Comparing histogram data using a Mahalanobis-Wasserstein distance. In P Brito (Ed.) Proceedings of COMPSTAT 2008, Compstat, (Vol. 2008 pp. 77–89). Heidelberg: Springer.
Verde, R, & Irpino, A (2008b). Dynamic clustering of histogram data: Using the right metric. In B Pea (Ed.) Selected contributions in data analysis and classification (pp. 123–134). Berlin: Springer.
Verde, R, & Irpino, A. (2018). Multiple factor analysis of distributional data. Statistica Applicata: Italian Journal of Applied Statistics. To appear.
Verde, R, Irpino, A, & Lechevallier, Y (2006). Dynamic clustering of histograms using Wasserstein metric. In A Rizzi M Vichi (Eds.) Proceedings of COMPSTAT 2006, Compstat, (Vol. 2006 pp. 869–876). Heidelberg: Physica Verlag.
Vesanto, J, Himberg, J, Alhoniemi, E, & Parhankangas, J. (1999). Self-organizing map in Matlab: the SOM toolbox. In Proceedings of the Matlab DSP Conference (pp. 35–40).
Vrac, M, Billard, L, Diday, E, & Chedin, A (2012). Copula analysis of mixture models. Computational Statistics, 27, 427–457.
Zhang, L, Bing, Z, & Zhang, L (2015). A hybrid clustering algorithm based on missing attribute interval estimation for incomplete data. Pattern Analysis and Applications, 18, 377–384.
Metadata
Title
Batch Self-Organizing Maps for Distributional Data with an Automatic Weighting of Variables and Components
Authors
Francisco de A. T. de Carvalho
Antonio Irpino
Rosanna Verde
Antonio Balzanella
Publication date
18.03.2022
Publisher
Springer US
Published in
Journal of Classification / Issue 2/2022
Print ISSN: 0176-4268
Electronic ISSN: 1432-1343
DOI
https://doi.org/10.1007/s00357-022-09411-1
