Information Sciences

Volumes 406–407, September 2017, Pages 248–268

Fuzzy clustering of distributional data with automatic weighting of variable components

https://doi.org/10.1016/j.ins.2017.04.040

Abstract

Distributional data, expressed as realizations of distributional variables, are a new type of data arising from several sources. In this paper, we present new fuzzy c-means algorithms for data described by distributional variables. The algorithms use the L2 Wasserstein distance between distributions as a dissimilarity measure. Usually, in fuzzy c-means, all variables are considered equally important in the clustering task. However, some variables may be more or less important, or even irrelevant, for this task. Considering a decomposition of the squared L2 Wasserstein distance, and using the notion of adaptive distance, we propose algorithms that automatically compute relevance weights associated with the variables, as well as with their components, either for the whole dataset or cluster-wise. Relevance weights express the importance of each variable, or of each component, in the clustering process, and also act as a variable selection method. Using artificial and real-world data, we observed that the algorithms with automatic weighting of variables (or components) are better able to take into account the cluster structure of the data.

Introduction

One of the requirements of the current big-data age is the need to represent groups of data by summaries that lose as little information as possible (i.e., data reduction). This is usually done by synthesizing groups of data with a set of characteristic values of their distribution (e.g., the mean, the standard deviation, higher moments, or quantiles). When a set of data is observed with respect to a numerical variable, it is usual to describe it by an empirical distribution or by the estimate of the theoretical distribution that best fits the data. In these cases, each set of observations is described by a distribution-valued datum, namely, a value observed for a distributional variable. Distributional variables (DV) were first introduced in the context of Symbolic Data Analysis (SDA) [3] as particular set-valued variables, namely, modal variables having numeric support. For example, a histogram variable is a particular type of DV whose values are histograms. More generally, we call a distributional variable any attribute whose values are one-dimensional probability or frequency distributions on a numeric support.
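As a minimal illustration of this kind of data reduction, the following Python sketch (illustrative only; the helper name is hypothetical and this is not code from the paper) summarizes a group of raw numeric observations by its empirical quantile function sampled on a regular grid, one common finite representation of a distributional value:

import numpy as np

def to_quantile_representation(values, n_points=100):
    # Summarize a batch of numeric observations by its empirical quantile
    # function evaluated on a regular grid of probability levels.
    probs = np.linspace(0.0, 1.0, n_points)
    return np.quantile(np.asarray(values, dtype=float), probs)

# Example: one distributional value obtained by aggregating the raw
# measurements of a group of individuals (e.g., for privacy reasons).
raw_measurements = np.random.default_rng(0).normal(loc=170.0, scale=8.0, size=500)
y_kj = to_quantile_representation(raw_measurements)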

Such data arise in many practical situations. For example, to preserve individuals’ privacy, transactions or measurements of bank customers or hospital patients are released only after aggregation. In wireless sensor networks, where energy limitations constrain the communication of data, suitable syntheses of the sensed data are necessary. Official statistical institutes collect data about territorial units or administrations and release them as histograms. A practical example of this kind of data is given in Section 4.2, where we consider an application on sex-age population pyramids. A sex-age population pyramid is a common way to jointly represent the distribution of sex and age of people living in a given administrative unit (for instance, a town, region, or country). Each country is represented by two histograms describing the age distributions of the male and the female population, respectively. The two distributions are drawn back to back, yielding a graph that resembles a pyramid. Its shape changes according to the age distribution of the population, which in turn reflects the development of a country. Fig. 1 shows the age pyramid of the world in 2014. In this example, the object “World” is described by two DV, namely “Age distribution of male population” and “Age distribution of female population”. Finally, images or multimedia objects may be described by distributions of their features.

So far, many methods have been proposed for estimating distributions from sets of classical data, while few methods have been developed for analyzing objects described by distributions. Among exploratory tools, clustering is a typical tool for unsupervised learning: it aims at discovering patterns in data by aggregating a set of objects into clusters such that the objects belonging to a cluster are more similar to each other than to those belonging to other clusters. Clustering algorithms can be divided into hierarchical (agglomerative or divisive) and partitioning methods. Clustering methods can also be distinguished into hard and fuzzy methods. In hard partitioning methods, each object belongs to a single cluster, while in fuzzy clustering [1] a fuzzy partition of the data is obtained by allowing an object to belong to several clusters according to membership degrees. Such methods are particularly suitable when data are described by distributions that, for example, partially overlap, so that they do not induce a strict partition of the object description space.

Generally, in clustering methods all the variables contribute to the clustering process with the same importance. However, in most applications some variables are more discriminative of the clusters than others, so that the clusters may be characterized by some variables more than by others. To take the different roles of the variables into account, several strategies have been adopted in clustering methods. A first strategy consists in associating weights with the variables in advance, according to background knowledge: after fixing the weight of each variable, a clustering procedure is performed to obtain a partition of the objects of the dataset. A second strategy consists in adding a step to the algorithm in order to compute the variable weights automatically. To tackle this issue, Diday and Govaert [11] proposed the use of adaptive distances, in which the weights satisfy a product-to-one constraint. The use of adaptive distances in clustering introduces a weighting step in the optimization process, where a set of weights is obtained by minimizing a global criterion. Such relevance weights are associated with each variable (for all the clusters, or for each cluster) and measure the importance of the variables in the clustering process. Further, the weights can be used to perform variable selection in the clustering process; see [34].
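To make the adaptive-distance idea concrete, recall the classical product-to-one scheme (a standard result from the adaptive-distance literature, sketched here; the criteria actually used in this paper are derived in Section 3). If $\Delta_j$ denotes the within-cluster dispersion along the $j$th variable, minimizing $\sum_{j=1}^{P} \lambda_j \Delta_j$ under the constraints $\lambda_j > 0$ and $\prod_{j=1}^{P} \lambda_j = 1$ yields the closed-form weights

$$\lambda_j = \frac{\left( \prod_{h=1}^{P} \Delta_h \right)^{1/P}}{\Delta_j},$$

so a variable along which the clusters are more compact (smaller $\Delta_j$) receives a larger relevance weight.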

For the hierarchical clustering of distributional data (DD), some methods have been proposed by Irpino and Verde [22]; a generalization of the Dynamic Clustering (DC) algorithm [12] has been proposed in [37], [39] as a partitioning algorithm. The DC algorithm, a generalization of c-means, is a two-step algorithm: it alternates a representation and an allocation step such that a within-cluster homogeneity criterion is minimized. Other approaches are those of Terada and Yadohisa [36], who propose a c-means clustering method using empirical joint distributions, and of Vrac et al. [40], where a DC algorithm based on copula analysis is proposed. All the cited methods rely on the choice of a suitable dissimilarity between DD. More recently, the use of Wasserstein distances [35] has been investigated and a new set of statistical indexes has been proposed [23] for DV.

In the framework of SDA, several DC algorithms including adaptive distances have been proposed for interval-valued data [6], [7], [8], for histogram-valued data [26], [27], and for modal symbolic data [18], [28], [29]. More recently, a DC algorithm using adaptive distances based on the L2 Wasserstein metric has been proposed [24]. In the framework of c-means on classical scalar data with automatic estimation of relevance weights, a second approach was proposed in [19], where a sum-to-one constraint on the weights is considered. However, that algorithm requires a smoothing parameter for the relevance weights to be fixed in advance (by the user or after cross-validation).

This paper extends [24], [39] by proposing fuzzy c-means clustering algorithms for DD, and improves on [10], where a fuzzy c-means for DD was proposed. Using a decomposition of the L2 Wasserstein distance [21] and considering the variability measures introduced in [38], the variability of a DV can be decomposed into two independent components: one related to the variability of the means (or expectations) of the DD, and a second related to their differences in size and shape. In this paper, adaptive distances take both components of the variability into account, globally and cluster-wise. The improvements with respect to De Carvalho et al. [10] concern two new automatic weighting schemes for the components of the DV. In particular, in [10] the weight of each component is sought independently of the others; in that case, the algorithms may assign a relevant contribution to a component even when it plays a minor role with respect to the others. Here, we propose solutions to this side effect. The proposed fuzzy clustering algorithm alternates three steps until a stationary value of a homogeneity criterion is reached: i) an allocation step, which computes the memberships of the objects to the clusters; ii) a weighting step, which computes the weights for each variable and/or each component; iii) a representation step, which computes the cluster prototypes.
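The following self-contained Python sketch illustrates this alternating scheme in the simplest setting: one product-to-one relevance weight per variable, estimated globally, with distributions represented by quantile functions sampled on a common grid (so the squared L2 Wasserstein distance reduces to a squared Euclidean distance between quantile vectors, and prototypes to membership-weighted averages of quantile functions). It is an illustrative sketch, not the authors’ implementation, and it omits the component-wise and cluster-wise weighting schemes developed in Section 3:

import numpy as np

def sq_wasserstein(data, prototypes):
    # Squared L2 Wasserstein distances, per variable, between each datum and
    # each prototype. Distributions are represented by quantile functions on
    # a common grid: data has shape (N, P, Q), prototypes (C, P, Q).
    # Returns an array of shape (C, N, P).
    diff = prototypes[:, None, :, :] - data[None, :, :, :]
    return np.mean(diff ** 2, axis=-1)  # approximates the integral over [0, 1]

def fuzzy_cmeans_wasserstein(data, C, m=2.0, max_iter=100, tol=1e-9, seed=0):
    N = data.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((C, N))
    U /= U.sum(axis=0)                       # random initial fuzzy partition
    prev = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # Representation step: under the L2 Wasserstein distance, prototype
        # quantile functions are membership-weighted averages of the data.
        G = np.einsum('ik,kpq->ipq', Um, data) / Um.sum(axis=1)[:, None, None]
        # Weighting step: product-to-one relevance weights per variable
        # (geometric mean of the dispersions divided by each dispersion).
        D = sq_wasserstein(data, G)          # (C, N, P)
        Delta = np.maximum(np.einsum('ik,ikp->p', Um, D), 1e-12)
        lam = np.exp(np.log(Delta).mean()) / Delta
        # Allocation step: standard fuzzy c-means membership update using
        # the adaptive (weighted) distance.
        d2 = np.maximum(D @ lam, 1e-12)      # (C, N)
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)
        J = np.sum((U ** m) * d2)            # homogeneity criterion
        if prev - J < tol:                   # stop at a stationary value
            break
        prev = J
    return U, lam, G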

Besides extending the methods discussed above, we also propose two new variants of the algorithms that take into consideration two new sets of constraints.

Applications on simulated and real-world data show that the proposed settings are better able to identify the most important components of the variables in fuzzy clustering, even in the presence of variables that do not discriminate the cluster structure of the data.

The remainder of the paper is organized as follows. Section 2 introduces distributional data and the L2 Wasserstein distance between distributions. Section 3 details the proposed algorithms by defining: the objective function that is minimized by each algorithm, the relevance weights for the variables or their components, and the derived adaptive distances for DD. In Section 4, the proposed algorithms are tested on synthetic datasets and on a real-world one. Section 5 concludes the paper.

Section snippets

Distribution-valued data and Wasserstein distance

A Distributional Variable (DV) takes as values one-dimensional probability (empirical or theoretical, parametric or non-parametric) density functions. We assume that a set of N objects is described by P DV. In the remainder of the paper we use the following notation: objects are indexed by k (with k=1,…,N); the P variables are denoted by Yj (with j=1,…,P); the kth distributional datum of variable Yj is denoted by ykj; clusters are indexed by i (with i=1,…,C); and the kth object
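For reference, if $F$ and $G$ are the cumulative distribution functions of two distributional values, with quantile functions $F^{-1}$ and $G^{-1}$, the squared L2 Wasserstein distance can be written as

$$d_W^2(F, G) = \int_0^1 \left( F^{-1}(t) - G^{-1}(t) \right)^2 \, dt,$$

and it decomposes into a mean component and a dispersion (size and shape) component,

$$d_W^2(F, G) = (\mu_F - \mu_G)^2 + \int_0^1 \left( F_c^{-1}(t) - G_c^{-1}(t) \right)^2 dt,$$

where $\mu_F$, $\mu_G$ are the means and $F_c$, $G_c$ denote the corresponding centered distributions. These are standard results (see [21], [35]) recalled here for convenience.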

Fuzzy c-means with adaptive Wasserstein distances

Fuzzy c-means is a prototype-based clustering method, namely, clusters are represented by centers (or prototypes) and the membership degree of an object to a cluster depends on its distance from the cluster prototype compared with the distances from the other prototypes. Prototypes and memberships are obtained by the minimization of a within-cluster dispersion criterion that depends on a distance chosen for comparing objects and prototypes.
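For completeness, in standard fuzzy c-means the membership degree $u_{ik}$ of object $k$ to cluster $i$ follows the well-known update rule (with fuzzifier $m > 1$, prototypes $g_i$, and $C$ clusters):

$$u_{ik} = \left[ \sum_{h=1}^{C} \left( \frac{d^2(y_k, g_i)}{d^2(y_k, g_h)} \right)^{\frac{1}{m-1}} \right]^{-1};$$

the adaptive variants studied in this paper replace $d^2$ with the weighted Wasserstein distances defined in Section 3.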

A representative or prototype of each fuzzy cluster i (i=

Applications

In this section, two applications are performed, on synthetic datasets and on a real-world one. Three synthetic datasets are generated according to three different clustering structures for unimodal distributions, and several datasets of DD having one, two, or three modes are further used to investigate the sensitivity of the proposed algorithms to multimodal distributions. In these cases, since the a priori membership of each object is known, the algorithm results are validated by external

Conclusions

The paper presented an extension of fuzzy c-means algorithms to data described by distributional variables. The fuzzy c-means algorithm has been extended to compute the relevance of each distributional variable, or of its components, in order to also take non-spherical clusters into consideration. We presented an automatic weighting system, related to the determinant of the within-cluster covariance matrix, leading to a set of six product-to-one constraints for the relevance weights.

Acknowledgments

The authors are grateful to the anonymous referees for their careful revision, valuable suggestions, and comments which improved this paper. The Brazilian author would like to thank FACEPE, Research Agency from the State of Pernambuco, Brazil (APV-0012-1.03/15), and CNPq, National Council for Scientific and Technological Development, Brazil (303187/2013-1, 470997/2013-3), for their financial support.

References (42)

  • J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, 1981.
  • J.C. Bezdek et al., Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, 1999.
  • H.H. Bock et al., Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data, 2000.
  • F.A.T. De Carvalho et al., Dynamic clustering of interval-valued data based on adaptive quadratic distances, IEEE Trans. Syst. Man Cybern. Part A, 2009.
  • F.A.T. De Carvalho et al., Partitional fuzzy clustering methods based on adaptive quadratic distances, Fuzzy Sets Syst., 2006.
  • F.A.T. De Carvalho et al., Fuzzy clustering of distribution-valued data using an adaptive L2 Wasserstein distance, in: 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2015.
  • E. Diday, G. Govaert, Classification automatique avec distances adaptatives, R.A.I.R.O. Inf. Comput. Sci., 1977.
  • E. Diday et al., Clustering analysis.
  • A.L. Gibbs et al., On choosing and bounding probability metrics, Int. Stat. Rev., 2002.
  • W. Gilchrist, Statistical Modelling with Quantile Functions, 2000.
  • C.R. Givens et al., A class of Wasserstein metrics for probability distributions, Michigan Math. J., 1984.