Information Sciences

Volumes 406–407, September 2017, Pages 248–268

Fuzzy clustering of distributional data with automatic weighting of variable components

https://doi.org/10.1016/j.ins.2017.04.040

Abstract

Distributional data, expressed as realizations of distributional variables, are a new type of data arising from several sources. In this paper, we present new fuzzy c-means algorithms for data described by distributional variables. The algorithms use the L2 Wasserstein distance between distributions as a dissimilarity measure. Usually, in fuzzy c-means, all variables are considered equally important in the clustering task. However, some variables may be more or less important, or even irrelevant, for this task. Considering a decomposition of the squared L2 Wasserstein distance, and using the notion of adaptive distance, we propose algorithms that automatically compute relevance weights associated with the variables, as well as with their components, either for the whole dataset or cluster-wise. Relevance weights express the importance of each variable, or of each component, in the clustering process, and also act as a variable selection method. Using artificial and real-world data, we observed that the algorithms with automatic weighting of variables (or components) are better able to take into account the cluster structure of the data.

Introduction

One of the requirements of the current big-data age is the need to represent groups of data by summaries that lose as little information as possible (i.e., data reduction). This is usually done by synthesizing groups of data with a set of characteristic values of their distribution (e.g., the mean, the standard deviation, higher moments, or quantiles). When a set of data is observed with respect to a numerical variable, it is usual to describe it by an empirical distribution or by the estimate of the theoretical distribution that best fits the data. In these cases, each set of observations is described by a distribution-valued datum, namely, a value observed for a distributional variable. Distributional variables (DV) were first introduced in the context of Symbolic Data Analysis (SDA) [3] as particular set-valued variables, namely, modal variables having numeric support. For example, a histogram variable is a particular type of DV whose values are histograms. More generally, we call a distributional variable any attribute whose values are one-dimensional probability or frequency distributions on a numeric support.
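As a minimal illustration of this kind of data reduction, the following Python sketch (illustrative only; the helper name is hypothetical and this is not code from the paper) summarizes a group of raw numeric observations by its empirical quantile function sampled on a regular grid, one common finite representation of a distributional value:

import numpy as np

def to_quantile_representation(values, n_points=100):
    # Summarize a batch of numeric observations by its empirical quantile
    # function evaluated on a regular grid of probability levels.
    probs = np.linspace(0.0, 1.0, n_points)
    return np.quantile(np.asarray(values, dtype=float), probs)

# Example: one distributional value obtained by aggregating the raw
# measurements of a group of individuals (e.g., for privacy reasons).
raw_measurements = np.random.default_rng(0).normal(loc=170.0, scale=8.0, size=500)
y_kj = to_quantile_representation(raw_measurements)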

Such data arise in many practical situations. For example, to preserve individuals’ privacy, transactions or measurements of bank customers or hospital patients are released only after aggregation. In wireless sensor networks, where energy limitations constrain the communication of data, suitable syntheses of the sensed data are necessary. Official statistical institutes collect data about territorial units or administrations and release them as histograms. A practical example of this kind of data is given in Section 4.2, where we consider an application on sex-age population pyramids. A sex-age population pyramid is a common way to jointly represent the distribution of sex and age of people living in a given administrative unit (for instance, a town, region, or country). Each country is represented by two histograms describing the age distributions of the male and the female population, respectively. The two distributions are drawn back to back, yielding a graph that resembles a pyramid. Its shape changes according to the age distribution of the population, which in turn reflects the development of a country. Fig. 1 shows the age pyramid of the world in 2014. In this example, the object “World” is described by two DV, namely “Age distribution of male population” and “Age distribution of female population”. Finally, images or multimedia objects may be described by distributions of their features.

So far, many methods have been proposed for estimating distributions from sets of classical data, while few methods have been developed for analyzing objects described by distributions. Among exploratory tools, clustering is a typical tool for unsupervised learning: it aims at discovering patterns in data by aggregating a set of objects into clusters such that the objects belonging to a cluster are more similar to each other than to those belonging to other clusters. Clustering algorithms can be divided into hierarchical (agglomerative or divisive) and partitioning methods. Clustering methods can also be distinguished into hard and fuzzy methods. In hard partitioning methods, each object belongs to a single cluster, while in fuzzy clustering [1] a fuzzy partition of the data is obtained by allowing an object to belong to several clusters according to membership degrees. Such methods are particularly suitable when data are described by distributions that, for example, partially overlap, so that they do not induce a strict partition of the object description space.

Generally, in clustering methods all the variables contribute to the clustering process with the same importance. However, in most applications some variables are more discriminative of the clusters than others, so that the clusters may be characterized by some variables more than by others. To take the different roles of the variables into account, several strategies have been adopted in clustering methods. A first strategy consists in associating weights with the variables in advance, according to background knowledge: after fixing the weight of each variable, a clustering procedure is performed to obtain a partition of the objects of the dataset. A second strategy consists in adding a step to the algorithm in order to compute the variable weights automatically. To tackle this issue, Diday and Govaert [11] proposed the use of adaptive distances, in which the weights satisfy a product-to-one constraint. The use of adaptive distances in clustering introduces a weighting step in the optimization process, where a set of weights is obtained by minimizing a global criterion. Such relevance weights are associated with each variable (for all the clusters, or for each cluster) and measure the importance of the variables in the clustering process. Further, the weights can be used to perform variable selection in the clustering process; see [34].
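To make the adaptive-distance idea concrete, recall the classical product-to-one scheme (a standard result from the adaptive-distance literature, sketched here; the criteria actually used in this paper are derived in Section 3). If $\Delta_j$ denotes the within-cluster dispersion along the $j$th variable, minimizing $\sum_{j=1}^{P} \lambda_j \Delta_j$ under the constraints $\lambda_j > 0$ and $\prod_{j=1}^{P} \lambda_j = 1$ yields the closed-form weights

$$\lambda_j = \frac{\left( \prod_{h=1}^{P} \Delta_h \right)^{1/P}}{\Delta_j},$$

so a variable along which the clusters are more compact (smaller $\Delta_j$) receives a larger relevance weight.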

For the hierarchical clustering of distributional data (DD), some methods have been proposed by Irpino and Verde [22]; a generalization of the Dynamic Clustering (DC) algorithm [12] has been proposed in [37], [39] as a partitioning algorithm. The DC algorithm, a generalization of c-means, is a two-step algorithm: it alternates a representation and an allocation step such that a within-cluster homogeneity criterion is minimized. Other approaches are those of Terada and Yadohisa [36], who propose a c-means clustering method using empirical joint distributions, and of Vrac et al. [40], where a DC algorithm based on copula analysis is proposed. All the cited methods rely on the choice of a suitable dissimilarity between DD. More recently, the use of Wasserstein distances [35] has been investigated and a new set of statistical indexes has been proposed [23] for DV.

In the framework of SDA, several DC algorithms including adaptive distances have been proposed for interval-valued data [6], [7], [8], for histogram-valued data [26], [27], and for modal symbolic data [18], [28], [29]. More recently, a DC algorithm using adaptive distances based on the L2 Wasserstein metric has been proposed [24]. In the framework of c-means on classical scalar data with automatic estimation of relevance weights, a second approach was proposed in [19], where a sum-to-one constraint on the weights is considered. However, that algorithm requires a smoothing parameter for the relevance weights to be fixed in advance (by the user or after cross-validation).

This paper extends [24], [39] by proposing fuzzy c-means clustering algorithms for DD, and improves on [10], where a fuzzy c-means for DD was proposed. Using a decomposition of the L2 Wasserstein distance [21] and considering the variability measures introduced in [38], the variability of a DV can be decomposed into two independent components: one related to the variability of the means (or expectations) of the DD, and a second related to their differences in size and shape. In this paper, adaptive distances take both components of the variability into account, globally and cluster-wise. The improvements with respect to De Carvalho et al. [10] concern two new automatic weighting schemes for the components of the DV. In particular, in [10] the weight of each component is sought independently of the others; in that case, the algorithms may assign a relevant contribution to a component even when it plays a minor role with respect to the others. Here, we propose solutions to this side effect. The proposed fuzzy clustering algorithm alternates three steps until a stationary value of a homogeneity criterion is reached: i) an allocation step, which computes the memberships of the objects to the clusters; ii) a weighting step, which computes the weights for each variable and/or each component; iii) a representation step, which computes the cluster prototypes.
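The following self-contained Python sketch illustrates this alternating scheme in the simplest setting: one product-to-one relevance weight per variable, estimated globally, with distributions represented by quantile functions sampled on a common grid (so the squared L2 Wasserstein distance reduces to a squared Euclidean distance between quantile vectors, and prototypes to membership-weighted averages of quantile functions). It is an illustrative sketch, not the authors’ implementation, and it omits the component-wise and cluster-wise weighting schemes developed in Section 3:

import numpy as np

def sq_wasserstein(data, prototypes):
    # Squared L2 Wasserstein distances, per variable, between each datum and
    # each prototype. Distributions are represented by quantile functions on
    # a common grid: data has shape (N, P, Q), prototypes (C, P, Q).
    # Returns an array of shape (C, N, P).
    diff = prototypes[:, None, :, :] - data[None, :, :, :]
    return np.mean(diff ** 2, axis=-1)  # approximates the integral over [0, 1]

def fuzzy_cmeans_wasserstein(data, C, m=2.0, max_iter=100, tol=1e-9, seed=0):
    N = data.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((C, N))
    U /= U.sum(axis=0)                       # random initial fuzzy partition
    prev = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # Representation step: under the L2 Wasserstein distance, prototype
        # quantile functions are membership-weighted averages of the data.
        G = np.einsum('ik,kpq->ipq', Um, data) / Um.sum(axis=1)[:, None, None]
        # Weighting step: product-to-one relevance weights per variable
        # (geometric mean of the dispersions divided by each dispersion).
        D = sq_wasserstein(data, G)          # (C, N, P)
        Delta = np.maximum(np.einsum('ik,ikp->p', Um, D), 1e-12)
        lam = np.exp(np.log(Delta).mean()) / Delta
        # Allocation step: standard fuzzy c-means membership update using
        # the adaptive (weighted) distance.
        d2 = np.maximum(D @ lam, 1e-12)      # (C, N)
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)
        J = np.sum((U ** m) * d2)            # homogeneity criterion
        if prev - J < tol:                   # stop at a stationary value
            break
        prev = J
    return U, lam, G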

Besides extending the methods discussed above, we also propose two new variants of the algorithms that take into consideration two new sets of constraints.

Applications on simulated and real-world data show that the proposed settings are better able to identify the most important components of the variables in fuzzy clustering, even in the presence of variables that do not discriminate the cluster structure of the data.

The remainder of the paper is organized as follows. Section 2 introduces distributional data and the L2 Wasserstein distance between distributions. Section 3 details the proposed algorithms by defining: the objective function that is minimized by each algorithm, the relevance weights for the variables or their components, and the derived adaptive distances for DD. In Section 4, the proposed algorithms are tested on synthetic datasets and on a real-world one. Section 5 concludes the paper.

Section snippets

Distribution-valued data and Wasserstein distance

A Distributional Variable (DV) takes as values one-dimensional probability (empirical or theoretical, parametric or non-parametric) density functions. We assume that a set of N objects is described by P DV. In the remainder of the paper we use the following notation: objects are indexed by k (with k=1,…,N); the P variables are denoted by Yj (with j=1,…,P); the kth distributional datum of variable Yj is denoted by ykj; clusters are indexed by i (with i=1,…,C); and the kth object
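For reference, if $F$ and $G$ are the cumulative distribution functions of two distributional values, with quantile functions $F^{-1}$ and $G^{-1}$, the squared L2 Wasserstein distance can be written as

$$d_W^2(F, G) = \int_0^1 \left( F^{-1}(t) - G^{-1}(t) \right)^2 \, dt,$$

and it decomposes into a mean component and a dispersion (size and shape) component,

$$d_W^2(F, G) = (\mu_F - \mu_G)^2 + \int_0^1 \left( F_c^{-1}(t) - G_c^{-1}(t) \right)^2 dt,$$

where $\mu_F$, $\mu_G$ are the means and $F_c$, $G_c$ denote the corresponding centered distributions. These are standard results (see [21], [35]) recalled here for convenience.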

Fuzzy c-means with adaptive Wasserstein distances

Fuzzy c-means is a prototype-based clustering method, namely, clusters are represented by centers (or prototypes) and the membership degree of an object to a cluster depends on its distance from the cluster prototype compared with the distances from the other prototypes. Prototypes and memberships are obtained by the minimization of a within-cluster dispersion criterion that depends on a distance chosen for comparing objects and prototypes.
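For completeness, in standard fuzzy c-means the membership degree $u_{ik}$ of object $k$ to cluster $i$ follows the well-known update rule (with fuzzifier $m > 1$, prototypes $g_i$, and $C$ clusters):

$$u_{ik} = \left[ \sum_{h=1}^{C} \left( \frac{d^2(y_k, g_i)}{d^2(y_k, g_h)} \right)^{\frac{1}{m-1}} \right]^{-1};$$

the adaptive variants studied in this paper replace $d^2$ with the weighted Wasserstein distances defined in Section 3.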

A representative or prototype of each fuzzy cluster i (i=

Applications

In this section, two applications are performed, on synthetic datasets and on a real-world one. Three synthetic datasets are generated according to three different clustering structures for unimodal distributions, and several datasets of DD having one, two, or three modes are further used to investigate the sensitivity of the proposed algorithms to multimodal distributions. In these cases, since the a priori membership of each object is known, the algorithm results are validated by external

Conclusions

The paper presented an extension of fuzzy c-means algorithms to data described by distributional variables. The fuzzy c-means algorithm has been extended to compute the relevance of each distributional variable, or of its components, in order to also take non-spherical clusters into consideration. We presented an automatic weighting system, related to the determinant of the within-cluster covariance matrix, leading to a set of six product-to-one constraints for the relevance weights.

Acknowledgments

The authors are grateful to the anonymous referees for their careful revision, valuable suggestions, and comments which improved this paper. The Brazilian author would like to thank FACEPE, Research Agency from the State of Pernambuco, Brazil (APV-0012-1.03/15), and CNPq, National Council for Scientific and Technological Development, Brazil (303187/2013-1, 470997/2013-3), for their financial support.

References (42)

  • J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, 1981.
  • J.C. Bezdek et al., Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, 1999.
  • H.H. Bock et al., Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data, 2000.
  • F.A.T. De Carvalho et al., Dynamic clustering of interval-valued data based on adaptive quadratic distances, IEEE Trans. Syst. Man Cybern. Part A, 2009.
  • F.A.T. De Carvalho et al., Partitional fuzzy clustering methods based on adaptive quadratic distances, Fuzzy Sets Syst., 2006.
  • F.A.T. De Carvalho et al., Fuzzy clustering of distribution-valued data using an adaptive L2 Wasserstein distance, in: 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2015.
  • E. Diday, G. Govaert, Classification automatique avec distances adaptatives, R.A.I.R.O. Inf. Comput. Sci., 1977.
  • E. Diday et al., Clustering analysis.
  • A.L. Gibbs et al., On choosing and bounding probability metrics, Int. Stat. Rev., 2002.
  • W. Gilchrist, Statistical Modelling with Quantile Functions, 2000.
  • C.R. Givens et al., A class of Wasserstein metrics for probability distributions, Michigan Math. J., 1984.