1 Introduction

The conventional data mining process begins with the representation of objects based on raw data from a dataset. More refined data mining activities use results from previous data mining activities to improve the quality of the results. Examples of such secondary data mining techniques include ensembles of classifiers and stacked regression, where the results of previous classifications/predictions are combined to produce more accurate results. The combination of the results can be based on a predetermined formulation or on further machine learning techniques. This paper describes a novel set of integrated secondary data mining approaches to clustering. Clustering is one of the most frequently used unsupervised data mining techniques for grouping similar objects. It is used at various stages of data mining, from preliminary exploration of a new dataset and identification of outliers to sophisticated analysis for decision making. The proposals presented in this paper enhance conventional clustering techniques for granular, network, and temporal data.

Granular computing is an emerging area of research that provides an ability to create innovative representations of objects. Such representations can facilitate the development of new algorithms (Bargiela and Pedrycz 2002; Pedrycz and Kreinovich 2008; Yao 2010a; Zadeh 1979). In granular computing, a granule represents an object associated with a set of information. For example, a customer with certain purchasing patterns could represent an information granule. A granule can include a collection of finer granules. For example, a customer granule could include many visits, which are finer granules. A visit in turn can include the purchase of a number of products, which are even finer granules. This results in a hierarchy of customers–visits–products. Profiles of customers created by clustering should also include the profiles of visits that these customers make. The profiling of visits should in turn include profiles of customers. Similarly, profiles of products should be both influenced by, and should influence, profiles of customers and visits. Similar granular hierarchies exist in other datasets. For example, at yelp.com a business has multiple reviewers, while a reviewer reviews multiple businesses. The set of business granules is thus connected to the set of reviewer granules. We describe an iterative clustering technique that iterates back and forth through a granular hierarchy to obtain a stable set of profiles of objects at all levels of the hierarchy.

Similar interdependency can also be observed in a networked environment, where objects such as phone users are connected to other phone users. In such a case, the profile of a phone user should include the profiles of other users created by the same clustering process. These dependencies are applicable to any social network. This paper presents a recursive clustering technique for such networked environments. The recursive clustering developed for a networked environment can also be extended to temporal databases, where profiles of daily patterns of a quantity should include profiles of previous daily patterns, and in some cases profiles of future daily patterns. For example, let us assume that we need to profile a stock based on its volatility. The volatility in a stock price today should take into account volatility of stock prices in the immediate past and immediate future. One can look at such a daily pattern as an object that is connected to the past and future daily patterns. Extension of the recursive clustering for a simple network into temporal networks will be applicable in such a case.

This paper illustrates the versatility of the meta-clustering algorithm for different types of granular connections, namely, two-way hierarchical connections between businesses and reviewers at yelp.com, a social network of mobile phone users, and a temporal network of daily stock patterns. The proposed meta-clustering approach can be applied to both regular and fuzzy clustering. This fact is demonstrated through regular and fuzzy meta-clustering of the yelp.com dataset.

2 Review of literature

2.1 Review of clustering and fuzzy clustering

This section reviews conventional clustering with the help of a popular algorithm called k-means (MacQueen 1967). Let \(X=\{\mathbf {x}_{1},\ldots ,\mathbf {x}_{n}\}\) be a finite set of objects, and we assume that the objects are represented by m-dimensional vectors. A clustering scheme groups n objects into k clusters \(C=\{\mathbf {c}_{1},\ldots ,\mathbf {c}_{k}\}\). Here, C is the set of clusters. Each of the clusters \(\mathbf {c}_i\) is represented by an m-dimensional vector, which is the centroid or mean vector for that cluster. Each cluster centroid \(\mathbf {c}_i\) is also associated with the set of objects assigned to the ith cluster. We will use \(\mathbf {c}_i\) for both the centroid vector and the set representation of the ith cluster, depending on the context.

2.1.1 Clustering using k-means

k-means clustering is one of the most popular statistical clustering techniques (Hartigan and Wong 1979; MacQueen 1967). The objective is to assign n objects to k clusters. The process begins by randomly choosing k objects as the centroids of the k clusters. The objects are assigned to one of the k clusters based on the minimum value of the distance \(d(\mathbf {x}_l,\mathbf {c}_i)\) between the object vector \(\mathbf {x}_l\) and the cluster vector \(\mathbf {c}_i\).

After the assignment of all the objects to various clusters, the new centroid vectors of the clusters are calculated as:

$$\begin{aligned} \mathbf {c}_i= \frac{\sum _{\mathbf {x}_l\in \mathbf {c}_i}\mathbf {x}_l}{\mid \mathbf {C}_i \mid }, \quad \mathrm{where} \quad 1 \le i \le k. \end{aligned}$$

Here, \(\mid \mathbf {C}_i \mid \) is the cardinality of cluster \(\mathbf {C}_i\). The process stops when the centroids of the clusters stabilize.
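The k-means procedure just described can be sketched in Python. This is a minimal illustration with our own helper names; random initial centroids are drawn with a fixed seed for reproducibility:

```python
import math
import random

def distance(x, y):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(X, k, max_iter=100, seed=0):
    """Group the objects in X into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(X, k)]  # k random objects as initial centroids
    labels = None
    for _ in range(max_iter):
        # assign each object to the cluster with the nearest centroid
        new_labels = [min(range(k), key=lambda i: distance(x, centroids[i]))
                      for x in X]
        if new_labels == labels:
            break  # assignments (and hence centroids) have stabilized
        labels = new_labels
        # recompute each centroid as the mean of its assigned objects
        for i in range(k):
            members = [x for x, lab in zip(X, labels) if lab == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

On well-separated data the loop typically stabilizes within a handful of iterations.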

Several cluster validity indices have been proposed to evaluate cluster quality obtained by different clustering algorithms. An excellent summary of various validity measures can be found in Halkidi et al. (2002). Many of the cluster validity measures are functions of the sum of within-cluster scatter to between-cluster separation. The scatter within the ith cluster, denoted by \(S_i\), and the distance between cluster \(\mathbf {c}_i\) and \(\mathbf {c}_j\), denoted by \(d_{ij}\), are defined as follows:

$$\begin{aligned} S_{i}=\frac{1}{\mid \mathbf {C}_i\mid }\sum _{\mathbf {x}\in \mathbf {c}_i}\mathrm{distance}(\mathbf {x},\mathbf {c}_i) \end{aligned}$$
(1)
$$\begin{aligned} d_{ij}= \mathrm{distance}(\mathbf {c}_i , \mathbf {c}_j ) \end{aligned}$$
(2)

where \(\mathbf {c}_i\) is the center of the ith cluster, \(|\mathbf {C}_i|\) is the number of objects in \(\mathbf {C}_i\), and \(\mathrm{distance}(\mathbf {x},\mathbf {y})\) is the distance between two vectors. Depending upon the application, we can choose any distance function. Two popular choices are the Euclidean distance and the inverse of the cosine similarity function. This study uses Euclidean distance. However, it would also be interesting to experiment with other distance measures, including the Mahalanobis distance, which is particularly useful when the dataset represents only a sample of the universe.

We can sum up the within-cluster scatter for all clusters in a clustering scheme C as:

$$\begin{aligned} S(C) = \sum _{i = 1}^k S_{i} \end{aligned}$$
(3)

Similarly, between-cluster distance for a clustering scheme can be summed as:

$$\begin{aligned} D(C) = \sum _{i = 1}^k\sum _{j = 1}^k d_{ij} \end{aligned}$$
(4)

It is advisable to plot both of these measures for the datasets under study. Usually, the within-cluster scatter starts rising rapidly, while the between-cluster distance starts falling rapidly, when the number of clusters falls below a certain value. The knee of the curves (Lingras et al. 2014) can be used as the range for determining an appropriate number of clusters.
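A sketch of Eqs. (1)–(4), computed from any crisp clustering scheme (the helper names are ours); plotting these two quantities against k exposes the knee:

```python
import math

def distance(x, y):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def within_scatter(X, labels, centroids):
    """S(C) of Eq. (3): sum over clusters of the average
    member-to-centroid distance S_i of Eq. (1)."""
    total = 0.0
    for i, c in enumerate(centroids):
        members = [x for x, lab in zip(X, labels) if lab == i]
        if members:
            total += sum(distance(x, c) for x in members) / len(members)
    return total

def between_distance(centroids):
    """D(C) of Eq. (4): sum of all pairwise centroid distances d_ij, Eq. (2)."""
    return sum(distance(ci, cj) for ci in centroids for cj in centroids)
```

The knee would then be read off a plot of `within_scatter` and `between_distance` over a range of k values.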

2.1.2 Fuzzy c-means clustering

Conventional clustering assigns various objects to precisely one cluster. A fuzzy generalization of clustering uses a fuzzy membership function to describe the degree of membership of an object to a given cluster, ranging from 0 to 1, where a greater value signifies a greater degree of membership (Coppi and D’Urso 2002, 2003; Coppi et al. 2012; D’Urso and De Giovanni 2014; D’Urso et al. 2014). There is a stipulation that the sum of fuzzy memberships of an object to all the clusters must be equal to 1.

The algorithm was first proposed by Dunn (1973). Subsequently, a modification was proposed by Bezdek (1981). The fuzzy c-means (FCM) algorithm is based on minimization of the following objective function:

$$\begin{aligned} \sum _{i=1}^n {\sum _{j=1}^k u_{ij}^m \; d(\mathbf {x_i},\mathbf {c_j})}, \quad 1< m < \infty \end{aligned}$$
(5)

where n is the number of objects, each represented by a vector as before. The parameter m is any real number greater than 1, \(u_{ij}\) is the degree of membership of the ith object \((\mathbf {x_i})\) in cluster j, and \(d(\mathbf {x_i},\mathbf {c_j})\) is the Euclidean distance between the object and the cluster center \(\mathbf {c_j}\).

The degrees of membership, given by a matrix \(\mathbf {u}\), may be lower for objects on the edge of a cluster than for objects near its center. However, the sum of these memberships for any given object \(\mathbf {x}_i\) is defined to be 1:

$$\begin{aligned} \sum _{j=1}^k u_{ij} = 1 \quad \forall i \end{aligned}$$
(6)

The centroid of a fuzzy cluster is the weighted average of all objects, where the weight of each object is its degree of membership in the cluster:

$$\begin{aligned} \mathbf {c_j} = \frac{\sum _{i=1}^n u_{ij}^m \mathbf {x_i}}{\sum _{i=1}^n u_{ij}^m} \end{aligned}$$
(7)

FCM is an iterative algorithm that terminates if

$$\begin{aligned} \mathrm{max} \left( \left| u_{ij}^{t+1} - u_{ij}^t \right| \right) < \delta \end{aligned}$$
(8)

where \( \delta \) is a termination criterion between 0 and 1, and t is the iteration step.
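A minimal sketch of the FCM loop. The membership update below is the standard one from the literature (it is not stated explicitly above), we take d to be the squared Euclidean distance, the initial centroids are simply the first k objects, and all identifiers are ours:

```python
def sq_dist(x, c):
    # squared Euclidean distance, used as d(x_i, c_j) in the objective
    return sum((a - b) ** 2 for a, b in zip(x, c))

def fuzzy_c_means(X, k, m=2.0, delta=1e-5, max_iter=200):
    """Returns (centroids, u) where u[i][j] is the membership of object i in cluster j."""
    n, dim = len(X), len(X[0])
    cents = [list(X[i]) for i in range(k)]   # deterministic start: first k objects
    u = [[1.0 / k] * k for _ in range(n)]
    for _ in range(max_iter):
        # standard membership update; each row of u sums to 1 (Eq. (6))
        new_u = []
        for i in range(n):
            d = [max(sq_dist(X[i], cents[j]), 1e-12) for j in range(k)]
            new_u.append([1.0 / sum((d[j] / d[l]) ** (1.0 / (m - 1.0))
                                    for l in range(k)) for j in range(k)])
        # termination test of Eq. (8)
        done = max(abs(new_u[i][j] - u[i][j])
                   for i in range(n) for j in range(k)) < delta
        u = new_u
        # centroid update of Eq. (7): membership-weighted means
        for j in range(k):
            w = [u[i][j] ** m for i in range(n)]
            cents[j] = [sum(w[i] * X[i][t] for i in range(n)) / sum(w)
                        for t in range(dim)]
        if done:
            break
    return cents, u
```

The small floor on the distances guards against a division by zero when an object coincides with a centroid.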

2.2 Review of granular computing

Granular computing encompasses multiple levels or layers of granularity in thinking, problem solving, and information processing (Yao 2010a). A granular set can be represented in the form of a five-tuple, (U, D, L, H, J), where U is the universe of the problem discussed, D describes all the elements in U, L and H are operators in opposite directions, and J constrains L and H (Li 2009). Initially, the mainstream of granular computing research focused on fuzzy sets, rough sets, interval analysis and cluster analysis (Pedrycz and Kreinovich 2008; Yao 2007a, 2008b). In the context of fuzzy sets, Zadeh introduced the notion of information granulation in 1979 (Zadeh 1979). In 1982, Pawlak introduced the theory of rough sets using partitions induced by equivalence relations, which can be considered a specific type of granulation. Zadeh further elaborated on information granulation and its central role in human reasoning (Zadeh 1997), which provided new insights into granular computing. A more elaborate formulation of granular computing followed Zadeh’s paper in the form of a book by early pioneers Bargiela and Pedrycz. They provided an elegant pyramidal information processing paradigm for granular computing (Bargiela and Pedrycz 2002).

Artificial intelligence (AI), fuzzy sets, and rough sets continue to be the primary motivations behind developments in granular computing (Yao 2008b, c). A comprehensive study of AI perspectives on granular computing is provided by Yao (2011). For example, some of the central concepts and strategies of granular computing can be seen in sub-areas of artificial intelligence, such as concept formation, categorization, learning, abstraction and reformulation, computer vision and image processing, planning, and hierarchical problem solving (Yao 2011). An overview of the current research in granular computing can be found in a handbook and a number of edited books (Pedrycz 2005; Pedrycz and Chen 2011; Pedrycz and Kreinovich 2008; Yao 2010a, 2009).

Yao’s research expanded the granular computing perspective, with investigations of concrete theoretical foundations and resulting models that led to a general triarchic theory. The triarchic theory of granular computing provides the philosophical basis, followed by algorithmic developments leading to practical computational techniques (Yao 2007b, 2008a, b, 2009, 2010b). Yao’s proposal provides a conceptual framework and an architecture for developing granular computing as an independent area of research as opposed to a collection of related theories such as fuzzy and rough set theories. Each granular structure is a hierarchical construct that provides a multilevel representation and understanding of a problem. Such a structure forms the basis of the meta-clustering algorithm proposed in this paper. A family of hierarchical structures can help us understand a problem from multiple points of view (Yao 2009). The granular computing triangle consists of three essential elements of granular computing: philosophy, methodology/algorithms, and computational aspects. This new paradigm of granular computing will help us become better problem solvers by designing and implementing intelligent systems based on granular reasoning, representation, and processing (Yao 2010a).

Yao’s investigations on granular computing also provide an insight into, a structured understanding of, and structured methods for solving, real-world problems from multiple views and at multiple levels of abstraction within each view (Chen and Yao 2008; Luo and Yao 2011, 2010a, 2007b, 2009). Yao (2002, 2003a, b) demonstrated the potential value of granular computing in intelligent systems, as well as a test bed for playing with ideas from granular computing. Hoeber used a granular approach for studying retrieval support through visualization in web information retrieval support systems (Hoeber 2008). Yan et al. demonstrated the effectiveness of a granular information retrieval system by implementing and evaluating a prototype system (Yan et al. 2011). Xie et al. applied granular computing to a conceptual biology research supporting platform (Xie et al. 2008).

Gacek and Pedrycz (2015) developed a comprehensive and systematic approach to show an emergence of granular information of higher type, which is used to implement granular interval prototypes. They discussed a way of forming granular data in the context of representation of time series. Pedrycz and Bargiela (2012) considered a concept of granular prototypes that generalizes the numeric representation of the clusters and helps to capture more details about the data structure. Using a granulation–degranulation scheme, they designed granular prototypes that reflect the structure of the data to a higher extent than their numeric counterparts.

2.3 Review of simultaneous and meta-clustering

Researchers have found it advantageous to cluster multiple objects from a dataset at the same time. Representing a dataset as a matrix, data miners normally consider rows to be the objects and columns to be the attributes of these objects. A good example of such a dataset is a document collection or corpus having n rows and m columns. Each row in the table corresponds to a document. Each column corresponds to a keyword. A cell in the ith row and jth column is the frequency of \(\mathrm{keyword}_j\) in document \(\mathrm{doc}_i\). However, one can easily transpose this view and say that it is a collection of keywords that shows how often the keyword occurs in various documents in the collection given by the columns in the table. Slonim and Tishby (2000) proposed a two-stage clustering method for this application. El-Yaniv and Souroujon (2001) extended this two-stage approach with an iterative version, where the resulting document clustering could be used to re-cluster the words and the process would continue. More generically, double clustering can be viewed as a dimensionality reduction technique that replaces the columns by groups of the columns. Castellano et al. (2002) further generalized double clustering using fuzzy set theory. Caruana et al. (2006) showed that the use of meta-clustering, meaning clusters of clusters, can make it easier for users to see more meaningful groupings. Ramirez-Cano et al. (2010) took meta-clustering to three levels for grouping players in a game based on three different criteria: skills, preferences while playing the game, and relationships with other players.
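The dimensionality-reduction view of double clustering can be made concrete with a small sketch: given a clustering of the keyword columns, each document row is re-represented by summed frequencies per keyword cluster, after which the documents themselves are clustered in the reduced space. The function name is ours, not from the original two-stage method:

```python
def reduce_columns(doc_term, col_labels, k):
    """Replace each document's per-keyword frequencies by summed
    frequencies per keyword cluster (double clustering viewed as
    dimensionality reduction).  col_labels[j] is the cluster of column j."""
    reduced = []
    for row in doc_term:
        grouped = [0] * k
        for freq, lab in zip(row, col_labels):
            grouped[lab] += freq
        reduced.append(grouped)
    return reduced
```

In the iterative version, the transpose of the matrix is reduced in the same way using the resulting document clusters, and the two steps alternate.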

A generalization of meta-clustering can be found in bi-clustering. Bi-clustering was first introduced by Mirkin (1996). It was extended to tri-clustering and then more generally to n-clustering (Ignatov et al. 2012, 2013; Gnatyshak et al. 2012, 2013). These efforts were based on Formal Concept Analysis (FCA). In bi-clustering, clustering is done row-wise and column-wise simultaneously to determine the intersecting regions. The extension of the matrix to a three-dimensional cuboid gives rise to tri-clustering. Tri-clustering was shown to be useful for a number of applications, including Bibsonomy and a crime dataset. Ignatov et al. (2013) described how tri-clustering can efficiently handle large connected social graphs involving three sets of objects.

3 Generalized and unified granular view of meta-clustering

The review of literature in the previous section shows that meta-clustering is an important and useful research direction. The majority of the research consists of analyzing matrices or cuboids of related objects. In this section, we briefly review a generalized view of meta-clustering that unifies the conventional clustering from static information in a dataset with dynamic information that is generated through simultaneous clustering of a related set of objects. Static information is the original data used for the clustering. Dynamic information represents the relation of each of the static objects with the cluster result obtained. We call this information dynamic because it changes with the cluster results in each iteration.

In granular computing, a granule represents an object associated with a set of information. For example, a customer with certain purchasing patterns could represent an information granule. A granule can include a collection of finer granules. For example, a customer granule could include finer granules containing information for many individual visits. A visit, in turn, can include the purchase of a number of products, which are even finer granules. This results in a hierarchy of customers–visits–products. Profiles of customers created by clustering should also include the profiles of visits that these customers make. The profiling of visits should in turn include profiles of customers. Similarly, profiles of products should be both influenced by, and should influence, profiles of customers and visits. Lingras et al. (2014) described how such a hierarchy can be clustered iteratively with the help of static information and the dynamically changing profiles of customers and products throughout the meta-clustering process. This approach unified the conventional static clustering with the simultaneous meta-clustering such as double clustering using granular computing.

The proposed granular meta-clustering does not require the presence of a matrix or cuboid consisting of multiple sets of objects. Similar interdependency can also be observed in a networked environment, where objects such as phone users are connected to other phone users within the same dataset. In such a case, the profile of a phone user should include the profiles of other users created by the same clustering process. These dependencies are applicable to any social network. Lingras and Rathinavel (2012) proposed a recursive clustering technique for such networked environments.

The recursive clustering method developed for a networked environment can also be extended to temporal databases, where profiles of daily patterns of a quantity should include profiles of previous daily patterns, and in some cases profiles of future daily patterns. For example, let us assume that we need to profile a stock based on its volatility. The volatility in a stock price today should take into account volatility of stock prices in the immediate past and immediate future. One can look at such a daily pattern as an object that is connected to the past and future daily patterns with different connection weights. The weight of the connection will decrease based on temporal differences between the daily patterns. Lingras and Haider (2014b) enhanced the recursive clustering, originally developed for a simple network, for applications in weighted networks in general and temporal networks in particular. The basic recursive meta-clustering algorithm is shown in Fig. 1. Figure 2 shows a simple illustration of the process using data similar to that used in Sect. 4. However, for ease of demonstration, only a two-dimensional dummy dataset is used in this illustration.
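The temporal connections can be sketched in a few lines. The text above only states that weights decrease with the temporal difference, so the reciprocal decay and the horizon used below are our own illustrative assumptions, not the authors' choices:

```python
def temporal_weight(t_i, t_j, horizon=5):
    """Weight of the connection between the daily patterns of days t_i
    and t_j: decays with the temporal gap, zero beyond the horizon.
    (Illustrative choice; any decreasing function could be used.)"""
    gap = abs(t_i - t_j)
    return 1.0 / (1.0 + gap) if 0 < gap <= horizon else 0.0

def dynamic_part(day, day_labels, k, horizon=5):
    """Weighted, normalized distribution over the k cluster labels of the
    neighbouring days; day_labels maps each day to its cluster label."""
    weights = [0.0] * k
    for t, lab in day_labels.items():
        weights[lab] += temporal_weight(day, t, horizon)
    total = sum(weights)
    return [w / total for w in weights] if total else weights
```

A daily pattern's dynamic representation thus gives nearby days more influence than distant ones, and both past and future days contribute.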

Fig. 1
figure 1

Flow of basic recursive meta-clustering

Fig. 2
figure 2

Simple illustration of recursive meta-clustering using dummy data

The integrated meta-clustering of hierarchical, network, and temporal data shows that the use of granular computing can help us visualize the profiling problem in a broader context and combine conventional clustering with emerging simultaneous clustering approaches such as double clustering, bi-clustering, tri-clustering, and n-clustering. Moreover, the proposed approach is general enough to work with one, two, three, or n sets of objects.

The following sections describe the applications of this meta-clustering algorithm for different types of granular connections:

  • two-way hierarchical connections between businesses and reviewers on yelp.com

  • social network of mobile phone users

  • temporal network of daily stock patterns

The proposed meta-clustering approach can be applied to both regular and soft clustering. This fact is demonstrated through regular and fuzzy meta-clustering of the yelp.com dataset.

4 Meta-clustering in granular hierarchy using yelp.com data

Yelp.com is an online review and recommendation community. Yelp was founded in 2004, is available in 30 countries worldwide, and currently has over 140 million unique monthly visitors. Yelp provides value to consumers by allowing users to research written reviews, ratings, business details such as business hours and whether or not a business has free WiFi, as well as pictures of the business and its products posted by other users. Yelp also provides a social platform for its users, allowing them to create events and lists of recommended businesses to share and comment on, and to message and become friends with other users.

In Spring 2013, Yelp released a large set of data covering the entire Phoenix Metropolitan Area (PMA) as part of the Yelp Dataset Challenge. The Yelp Dataset Challenge was open-ended and aimed at finding innovative uses for the data Yelp collects. Yelp posed potential questions to answer, such as “What time of day is a restaurant busy, based on its reviews?”, “What makes a review funny or cool?”, “Which business is likely to be reviewed next by a user?”, and more. Yelp encouraged submissions in any form that entrants felt conveyed the appeal of their project, which would later be judged for one of ten cash prizes. The data covering the PMA came as four separate files, one each for businesses, check-ins, users, and reviews. Business information included each business’s unique ID, name, the neighborhoods it is located within, full localized address, city, state, latitude, longitude, average star rating out of five (rounded to half stars) from reviewers, categories, and a variable indicating whether the business is still active. Reviews contained the business ID of the business being reviewed, the ID of the reviewer writing the review, the number of stars the reviewer gave the business out of five (rounded to the half star), the text of the written review, the date the review was given, and the number of votes other users have given the review in the categories “Funny”, “Cool”, and “Useful”. Reviewer data contained the unique user ID, first name, number of reviews given, average stars rated (as a floating point average over all of their reviews), and the total number of votes their reviews have received in the three categories previously mentioned. Finally, check-in data contained information about which business each check-in related to, and the total number of people who had checked in to the business on the mobile Yelp app or webpage for each hour of each day of the week (168 categories: 24 for each day of the week).

4.1 Algorithm: meta-clustering in granular hierarchy

In this section, we use k-means and fuzzy c-means clustering algorithms to group similar objects. We ensure that our clusters are as compact as possible and well separated from each other. Since the k-means algorithm depends on randomly selected initial centroids of the clusters, we apply the algorithm multiple times and choose a clustering scheme that has the most compact clusters. Cluster compactness and manual inspection of cluster centroids were used to determine the optimal number of clusters.
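The restart strategy can be sketched generically. In this sketch, `cluster_fn` stands for any seeded clustering routine (such as k-means) and `scatter_fn` for a compactness measure such as the within-cluster scatter S(C); both names are ours:

```python
def best_of_restarts(X, k, cluster_fn, scatter_fn, runs=10):
    """Run a randomized clustering several times and keep the scheme
    with the most compact clusters (smallest within-cluster scatter)."""
    best = None
    for seed in range(runs):
        centroids, labels = cluster_fn(X, k, seed=seed)
        scatter = scatter_fn(X, labels, centroids)
        if best is None or scatter < best[0]:
            best = (scatter, centroids, labels)
    return best[1], best[2]
```

Because the compactness criterion is passed in, the same wrapper works for any clustering algorithm that depends on random initialization.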

Table 1 Static and dynamic parts of reviewer data
Table 2 Static and dynamic parts of business data

On yelp.com, a business is reviewed by many reviewers and a reviewer reviews many businesses, creating a bi-directional graph. Our clustering of reviewers uses profiles derived from the clustering of businesses, and vice versa. Let \(R= \{ \mathbf {r}_1, \mathbf {r}_2, \ldots , \mathbf {r}_{nr}\}\) be the set of reviewers and \(B= \{ \mathbf {b}_1, \mathbf {b}_2, \ldots , \mathbf {b}_{nb}\}\) be the set of businesses. Here, \(nr=43{,}873\) is the number of reviewers and \(nb=11{,}537\) is the number of businesses in the yelp.com dataset. Furthermore, let \(RC= \{ \mathbf {rc}_1, \mathbf {rc}_2, \ldots , \mathbf {rc}_{kr}\}\) be the clustering scheme of reviewers and \(BC= \{ \mathbf {bc}_1, \mathbf {bc}_2, \ldots , \mathbf {bc}_{kb}\}\) be the clustering scheme of businesses. After studying the compactness of the clusters and the resulting centroids, the number of reviewer clusters was set to \(kr= 7\). The knee of the curve for the within-cluster scatter showed that the scatter starts rising rapidly when \(kr\) falls below 11, and the increase intensifies below \(kr=7\). Similarly, the number of business clusters based on the knee of the curve for the within-cluster scatter was set to 7, i.e., \(kb=7\). The resulting cluster centroids were manually studied to ensure that they were sufficiently apart from each other. For the k-means clustering, setting the number of clusters to 7 helped group some of the outlying objects into separate clusters. As we will show, using c-means clustering allows objects to belong to multiple clusters, resulting in more moderate centroids. For example, we had a regular cluster with only two prolific reviewers. They were outliers compared to the rest of the reviewers. We do not see such a cluster with only two reviewers when using fuzzy c-means clustering.

The reviewer \(\mathbf {r}_j\) is represented by a static data part \(\mathbf {sr}_j\) and a dynamic data part \(\mathbf {dr}_j\), i.e., \(\mathbf {r}_j= (\mathbf {sr}_j,\mathbf {dr}_j) \), as shown in Table 1. Here, \(\mathbf {sr}_j\) is the data extracted from the raw dataset, such as the types of reviews (total, *, **, ***, ****, *****, votes). The dynamic part \(\mathbf {dr}_j\) is derived from the clustering of businesses. We represent \(\mathbf {dr}_j= (m_{j1}, m_{j2}, \ldots , m_{jkb}) \). For regular clustering, \(m_{ji}\) is the normalized count of the businesses reviewed by \(\mathbf {r}_j\) that fall in business cluster \(\mathbf {bc_i}\). For fuzzy clustering, \(m_{ji}\) is the average membership in \(\mathbf {bc_i}\) of the businesses reviewed by \(\mathbf {r}_j\).
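The two variants of the dynamic part \(\mathbf {dr}_j\) can be sketched as follows. The identifiers are ours: `business_cluster` maps each business to its crisp cluster index, and `business_membership` maps each business to its fuzzy membership vector:

```python
def dynamic_part_regular(reviewed, business_cluster, kb):
    """m_ji for regular clustering: normalized count of the reviewer's
    businesses falling in each of the kb business clusters."""
    counts = [0] * kb
    for b in reviewed:
        counts[business_cluster[b]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def dynamic_part_fuzzy(reviewed, business_membership, kb):
    """m_ji for fuzzy clustering: average membership of the reviewer's
    businesses in each business cluster."""
    n = len(reviewed)
    return [sum(business_membership[b][i] for b in reviewed) / n
            for i in range(kb)]
```

The dynamic part of a business is computed symmetrically from the clusters (or memberships) of its reviewers.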

Similarly, the business \(\mathbf {b}_i\) is represented by a static data part \(\mathbf {sb}_i\) and a dynamic data part \(\mathbf {db}_i\), i.e., \(\mathbf {b}_i= (\mathbf {sb}_i,\mathbf {db}_i) \), as shown in Table 2. The static part \(\mathbf {sb}_i\) represents the number of reviews (total, *, **, ***, ****, *****) received by the business from the raw dataset, while the dynamic part \(\mathbf {db}_i\) is derived from the clustering of reviewers. We represent \(\mathbf {db}_i= (m_{i1},m_{i2},\ldots ,m_{ikr}) \). For regular clustering, \(m_{ij}\) is the normalized count of the reviewers of business \(\mathbf {b}_i\) that fall in reviewer cluster \(\mathbf {rc_j}\). For fuzzy c-means clustering, \(m_{ij}\) is the average membership in \(\mathbf {rc_j}\) of the reviewers of business \(\mathbf {b}_i\). Since we do not have any clustering results at the beginning of the iterative process, we first cluster businesses based solely on the static part. The subsequent clustering of both businesses and reviewers uses the static and dynamic representations. The iterations stop when the values of the dynamic parts \(\mathbf {dr}_j\) and \(\mathbf {db}_i\) stabilize for all objects. Figure 3 provides the formal description of the proposed iterative meta-clustering algorithm.

Fig. 3
figure 3

Regular and fuzzy meta-clustering algorithms in a granular hierarchy
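A compact sketch of the crisp version of the iterative algorithm formalized in Fig. 3. Here `cluster` stands for any crisp clustering routine returning one label per object (e.g., k-means), and all identifiers are ours:

```python
def normalize(row):
    # scale a row of counts so that it sums to 1 (left unchanged if empty)
    s = sum(row)
    return [v / s for v in row] if s else row

def granular_meta_clustering(SR, SB, reviews, kr, kb, cluster, max_rounds=50):
    """Iterative meta-clustering of reviewers and businesses.
    SR[j] / SB[i]: static vectors; reviews: (reviewer, business) pairs;
    returns (reviewer_labels, business_labels)."""
    nr, nb = len(SR), len(SB)
    dr = [[0.0] * kb for _ in range(nr)]
    db = [[0.0] * kr for _ in range(nb)]
    b_labels = cluster(SB, kb)           # first pass: static part only
    r_labels = None
    for _ in range(max_rounds):
        prev = (dr, db)
        # dynamic part of each reviewer from the current business clusters
        dr = [[0.0] * kb for _ in range(nr)]
        for r, b in reviews:
            dr[r][b_labels[b]] += 1.0
        dr = [normalize(row) for row in dr]
        r_labels = cluster([SR[j] + dr[j] for j in range(nr)], kr)
        # dynamic part of each business from the current reviewer clusters
        db = [[0.0] * kr for _ in range(nb)]
        for r, b in reviews:
            db[b][r_labels[r]] += 1.0
        db = [normalize(row) for row in db]
        b_labels = cluster([SB[i] + db[i] for i in range(nb)], kb)
        if (dr, db) == prev:
            break                        # dynamic parts have stabilized
    return r_labels, b_labels
```

The fuzzy variant replaces the normalized counts by average memberships, as described above.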

Table 3 Centroids from regular static clustering of business data
Table 4 Centroids from fuzzy static clustering of business data
Table 5 Centroids from regular static clustering of reviewer data
Table 6 Centroids from fuzzy static clustering of reviewer data

4.2 Comparison of regular and fuzzy static clustering on yelp.com

Table 3 shows the results of clustering applied to the static information about the businesses. We can describe the resulting static profiles of business clusters as:

\(\mathbf {sbc}_1\): Sparsely but very well rated. Fewest number of reviews, mostly five stars.

\(\mathbf {sbc}_2\): Sparsely and lowly rated. Fewest number of reviews, mostly one and two stars.

\(\mathbf {sbc}_3\): Well rated. Modest number of reviews, mostly five and four stars.

\(\mathbf {sbc}_4\): Ambivalently rated. Modest number of evenly spread reviews.

\(\mathbf {sbc}_5\): Reasonably rated. Modest number of reviews, mostly four and five stars.

\(\mathbf {sbc}_6\): Well rated. Large number of reviews, mostly four and five stars with noticeable three stars.

\(\mathbf {sbc}_7\): Reasonably rated. Largest number of reviews, mostly four and five stars with noticeable three stars.

Table 4 shows the results of the corresponding fuzzy clustering based on the static information about the businesses. One of the major advantages of fuzzy clustering is that businesses can belong to multiple clusters. Another interesting feature is that the resulting centroids tend to be less extreme and better separated, and the cluster sizes are more uniformly distributed. The overall clustering profiles match the regular clustering and are given below:

\(\mathbf {sbcf}_1\): Sparsely but very well rated. Fewest number of reviews, mostly five stars.

\(\mathbf {sbcf}_2\): Sparsely and lowly rated. Few reviews, majority are one and two stars.

\(\mathbf {sbcf}_3\): Well rated. Modest number of reviews, mostly five and four stars.

\(\mathbf {sbcf}_4\): Reasonably rated. Modest number of reviews, mostly four and five stars.

\(\mathbf {sbcf}_5\): Ambivalently rated. Modest number of evenly spread reviews.

\(\mathbf {sbcf}_6\): Reasonably rated. Large number of reviews, mostly four and five stars with noticeable three stars.

\(\mathbf {sbcf}_7\): Reasonably rated. Largest number of reviews, mostly four and five stars with noticeable three stars.

Table 5 shows the results of clustering applied to the static information about the reviewers. We can describe the resulting static profiles of reviewer clusters as:

\(\mathbf {src}_1\) :

Infrequent and hard: Very few and mostly one and two star reviews.

\(\mathbf {src}_2\) :

Infrequent and soft: Very few and mostly five and four star reviews.

\(\mathbf {src}_3\) :

Infrequent and very soft: Very few and almost exclusively five star reviews.

\(\mathbf {src}_4\) :

Infrequent and balanced: Very few and mostly five star reviews, with noticeable two and three star reviews as well.

\(\mathbf {src}_5\) :

Somewhat prolific and balanced: Modest number of reviews and votes, mostly four, five, and three stars.

\(\mathbf {src}_6\) :

Prolific and balanced: Large number of reviews and votes; reviews are mostly four, three, and five stars.

\(\mathbf {src}_7\) :

Extremely prolific and balanced: This group of two is essentially an outlier with a large number of reviews and votes; these users should be treated separately as extremely prolific reviewers.

Table 6 shows the results of fuzzy clustering applied to the static information about the reviewers. The moderating effect of fuzzy c-means is more pronounced for the reviewer dataset. The last two clusters \(\mathbf {src_6}\) and \(\mathbf {src_7}\) consisted of a total of 77 reviewers with extremely high values for total reviews and votes. The corresponding fuzzy clusters, \(\mathbf {srcf_6}\) and \(\mathbf {srcf_7}\), have more moderate centroids and represent more than 14,000 reviewers. The outlying reviewers have essentially been absorbed into the cluster \(\mathbf {src_5}\). These moderate profiles are possible because these reviewers can belong to multiple clusters. Due to the more pronounced effect of fuzzy clustering, the reviewer fuzzy profiles are somewhat different from their regular clustering counterparts and can be described as:

\(\mathbf {srcf}_1\) :

Infrequent and hardest: Very few and mostly one star reviews.

\(\mathbf {srcf}_2\) :

Infrequent and soft: Very few and mostly five star reviews.

\(\mathbf {srcf}_3\) :

Infrequent and very soft: Very few and almost exclusively five star reviews.

\(\mathbf {srcf}_4\) :

Infrequent and hard: Very few and mostly two star reviews.

\(\mathbf {srcf}_5\) :

Infrequent middle of the road: Very few and mostly three star reviews.

\(\mathbf {srcf}_6\) :

Frequent and somewhat soft: Modest number of reviews and votes.

\(\mathbf {srcf}_7\) :

Prolific and balanced: Large number of evenly spread reviews.

In summary, the comparison of regular and fuzzy clustering profiles shows that fuzzy clustering allows objects to belong to multiple clusters. This assignment to multiple clusters leads to centroids that are well separated and less extreme. Moreover, the objects are more uniformly distributed among all the clusters. However, these profiles provide rather limited information. For example, \(\mathbf {sbcf}_1\) is labeled as Sparsely but very well rated, which means the businesses in this cluster have good reviews. But we do not know the type of reviewers. If the reviewers were soft (meaning they tend to grant restaurants inflated ratings), then these high ratings should be somewhat discounted. In the following section, we use our proposed meta-clustering approach to provide more informative meta-profiles.

4.3 Regular meta-clustering in a granular hierarchy

The meta-clustering algorithm was implemented using a UNIX bash script that iteratively called an R program for clustering and a Python program for creating the dynamic representations. k-means clustering was run with up to 1000 iterations and 10 restarts to obtain compact clusters. The dynamic representation seemed to stabilize after 21 iterations of indirectly recursive meta-clustering. On a high-performance computing cluster provided by ace-net.ca, the meta-clustering took only one minute.
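
To make the loop concrete, the following is a minimal sketch of one round of the parallel scheme in a single language (hypothetical names such as `meta_round`; the actual pipeline used a bash script driving separate R and Python programs):

```python
# One round of parallel meta-clustering: cluster each granule type on
# [static | dynamic] features, then rebuild each side's dynamic part
# from the other side's cluster labels. Names are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def meta_round(static_b, static_r, b_to_r, r_to_b, dyn_b, dyn_r, kb, kr):
    lab_b = KMeans(n_clusters=kb, n_init=10, max_iter=1000).fit_predict(
        np.hstack([static_b, dyn_b]))
    lab_r = KMeans(n_clusters=kr, n_init=10, max_iter=1000).fit_predict(
        np.hstack([static_r, dyn_r]))

    def distribution(neighbor_lists, labels, k):
        # normalized count of neighbors falling into each cluster
        D = np.zeros((len(neighbor_lists), k))
        for i, nbrs in enumerate(neighbor_lists):
            counts = np.bincount(labels[nbrs], minlength=k)
            if counts.sum():
                D[i] = counts / counts.sum()
        return D

    # businesses' dynamic part comes from reviewer labels, and vice versa
    return (lab_b, lab_r,
            distribution(b_to_r, lab_r, kr),
            distribution(r_to_b, lab_b, kb))
```

Repeating such rounds until the dynamic parts stop changing corresponds to the 21 iterations reported above.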

Table 7 Centroids from regular meta-clustering of business data
Table 8 Centroids from regular meta-clustering of reviewer data

The parallel meta-clustering, described in Fig. 3, created the meta-centroids for various categories of businesses and reviewers. The business clusters obtained from the regular meta-clustering algorithm are shown in Table 7. The meta-centroids of the reviewer clusters obtained from the regular meta-clustering algorithm are shown in Table 8. The last column in each table shows the size of each cluster. The resulting business profiles are more refined than those of the conventional clustering process, as they use associations with the profiles of the reviewers that reviewed the business. We can describe these enhanced profiles obtained from regular meta-clustering as follows:

\(\mathbf {bc}_1\) :

Sparsely but very well rated by softies: Fewest number of reviews, mostly five stars, most coming from \(rc_3\) and \(rc_5\), which tend to give mostly four and five star reviews.

\(\mathbf {bc}_2\) :

Sparsely and lowly rated by both hard and soft groups: Fewest number of reviews, mostly one and two stars, most coming from \(rc_1\) (gives one star reviews) and \(rc_5\) (gives four star reviews).

\(\mathbf {bc}_3\) :

Well rated by softies: Modest number of reviews, mostly five and four stars, most coming from \(rc_5\) and \(rc_3\), which tend to give mostly four and five star reviews.

\(\mathbf {bc}_4\) :

Ambivalently rated even by softies: Modest number of evenly spread reviews, most coming from \(rc_5\), which tends to give mostly four star reviews.

\(\mathbf {bc}_5\) :

Reasonably rated by mostly softies: Modest number of reviews, mostly four and five stars, most coming from \(rc_5\), which tends to give mostly four star reviews.

\(\mathbf {bc}_6\) :

Well rated by balanced reviewers: Large number of reviews, mostly four and five stars with noticeable three stars, most coming from \(rc_5\) (gives mostly four stars) and \(rc_4\) (capable of giving two and three stars).

\(\mathbf {bc}_7\) :

Reasonably rated by many softies: Largest number of reviews, mostly four and five stars with noticeable three stars, most coming from \(rc_5\) (gives mostly four stars) and \(rc_2\) (gives mostly five and four stars).

The association of reviewer information with a business cluster is inversely applicable to the reviewer profiles, which are refined using the profiles of the businesses that are reviewed by these reviewers. These augmented reviewer profiles can be described as:

\(\mathbf {rc}_1\) :

Infrequent and hard, cover the spectrum: Very few and mostly one and two star reviews. The reviews are spread evenly across most business clusters.

\(\mathbf {rc}_2\) :

Infrequent and soft, cover popular and favorites: Very few and mostly five and four star reviews. Almost all reviews are for \(bc_7\), which has a very large number of four and five star reviews.

\(\mathbf {rc}_3\) :

Infrequent and very soft, cover the spectrum: Very few and almost exclusively five star reviews, spread evenly across most business clusters.

\(\mathbf {rc}_4\) :

Infrequent and balanced, cover favorite businesses: Very few and mostly five star reviews, with noticeable two and three star reviews as well. Most reviews are for \(bc_6\), which has a large number of four and five star reviews.

\(\mathbf {rc}_5\) :

Somewhat prolific and balanced, cover almost all the spectrum: Modest number of reviews and votes, mostly four, five, and three stars. They do not have too many reviews for \(bc_1\) and \(bc_2\) (which do not receive too many reviews).

\(\mathbf {rc}_6\) :

Prolific and balanced, cover popular places: Large number of reviews and votes; reviews are mostly four, three, and five stars. They do not have too many reviews for \(bc_1\) and \(bc_2\) (which do not receive too many reviews).

\(\mathbf {rc}_7\) :

Extremely prolific and balanced, do not cover most and least popular: This group of two is essentially an outlier with a large number of reviews and votes, and these users should be treated separately as prolific reviewers. Their reviews are mostly four, three, and five stars. They do not have too many reviews for \(bc_1\) and \(bc_2\) (which do not receive too many reviews), or for \(bc_7\), which receives the most reviews.

4.4 Fuzzy meta-clustering in a granular hierarchy

Table 9 Centroids from fuzzy meta-clustering of real business data
Table 10 Centroids from fuzzy meta-clustering of real reviewer data

The business clusters obtained from the fuzzy meta-clustering algorithm are shown in Table 9. The meta-centroids of the reviewer clusters obtained from the fuzzy meta-clustering algorithm are shown in Table 10. The last column in each table shows the size of each cluster. One of the first differences between regular and fuzzy meta-clustering is that the businesses are somewhat more uniformly distributed among the seven clusters. The last two clusters, with a large number of reviews, are larger in size and have more moderate values for the total number of reviews. The resulting business meta-profiles are somewhat different from the corresponding regular meta-clustering profiles. We can describe these enhanced profiles as follows:

\(\mathbf {bcf}_1\) :

Sparsely but very well rated by softies: Fewest number of reviews, mostly five stars, most coming from the softies in \(rcf_2\).

\(\mathbf {bcf}_2\) :

Sparsely and lowly rated by both hard and soft groups: Few reviews, the majority being one and two stars, with very few from the somewhat hard reviewers \(rcf_4\) and \(rcf_5\).

\(\mathbf {bcf}_3\) :

Well rated by softies: Modest number of reviews, mostly five and four stars, with noticeable reviews from the softest reviewers \(rcf_2\).

\(\mathbf {bcf}_4\) :

Reasonably rated by everyone: Modest number of reviews, mostly four and five stars, from an evenly spread spectrum of reviewers.

\(\mathbf {bcf}_5\) :

Ambivalently rated by everyone: Modest number of evenly spread reviews from an evenly spread spectrum of reviewers.

\(\mathbf {bcf}_6\) :

Reasonably rated by hard reviewers: Large number of reviews, mostly four and five stars with noticeable three stars, with a few coming from the hardest reviewers \(rcf_1\), \(rcf_4\), and \(rcf_5\).

\(\mathbf {bcf}_7\) :

Reasonably rated by everyone: Largest number of reviews, mostly four and five stars with noticeable three stars, from an evenly spread spectrum of reviewers.

Similar to regular meta-clustering, the association of reviewer information with a business cluster is inversely applied to the reviewer profiles. The resulting fuzzy meta-profiles are refined using the profiles of the businesses that are reviewed by these reviewers. The moderating effect of fuzzy c-means is more pronounced for the reviewer dataset. As with the clustering of the static representations above, the last two clusters \(\mathbf {rc_6}\) and \(\mathbf {rc_7}\) consisted of a total of 80 reviewers with extremely high values for total reviews and votes. The corresponding fuzzy clusters, \(\mathbf {rcf_6}\) and \(\mathbf {rcf_7}\), have more moderate centroids and represent more than 14,000 reviewers. The outlying reviewers have essentially been absorbed into the cluster \(\mathbf {rc_5}\). These moderate profiles are possible because these reviewers can belong to multiple clusters. The augmented reviewer fuzzy meta-profiles are somewhat different from their regular clustering counterparts and can be described as:

\(\mathbf {rcf}_1\) :

Infrequent and hardest, cover low rated businesses: Very few and mostly one star reviews, with noticeable reviews for \(bcf_2\).

\(\mathbf {rcf}_2\) :

Infrequent and soft, cover popular and favorites: Very few and mostly five star reviews, with noticeable reviews for the well-reviewed business clusters \(bcf_1\) and \(bcf_3\).

\(\mathbf {rcf}_3\) :

Infrequent and very soft, cover the spectrum: Very few and almost exclusively five star reviews, spread evenly across clusters \(bcf_3\) to \(bcf_7\).

\(\mathbf {rcf}_4\) :

Infrequent and hard, cover favorite businesses: Very few and mostly two star reviews, more of them for the well-reviewed business cluster \(bcf_6\) and the evenly reviewed business cluster \(bcf_5\).

\(\mathbf {rcf}_5\) :

Infrequent middle of the road, cover popular places: Very few and mostly three star reviews, more of them for the well-covered business clusters \(bcf_6\) and \(bcf_7\).

\(\mathbf {rcf}_6\) :

Frequent and somewhat soft, cover popular places: Modest number of reviews and votes, mostly five and four stars, more of them for the well-covered business clusters \(bcf_6\) and \(bcf_7\).

\(\mathbf {rcf}_7\) :

Prolific and balanced, cover popular places: Large number of evenly spread reviews, more of them for the well-covered business clusters \(bcf_6\) and \(bcf_7\).

4.5 Comparison of static and meta-regular/fuzzy clustering

In this section, we first analyzed experimental results of regular and fuzzy clustering on the static data from the database. Regular clustering is especially important for identifying initial clustering parameters such as the number of clusters. However, in most real-life applications the clusters are not very well defined and the boundaries between them are often fuzzy. Therefore, in most practical applications it may be necessary to refine the clustering using fuzzy set theory. The application of fuzzy c-means to our static representation showed that fuzzy clustering refines the results in two aspects:

  1. Fuzzy memberships of the clusters allow an object to be defined as a member of multiple clusters with varying membership.

  2. Fuzzy centroids tend to be more moderate because a large number of objects contribute to the centroid calculations to a varying degree.
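
Both effects can be reproduced with a small self-contained fuzzy c-means sketch (standard FCM updates with fuzzifier m = 2; this is an illustration, not the exact implementation used in the experiments):

```python
# Standard fuzzy c-means updates: fractional memberships per object
# and centroids computed from membership-weighted contributions.
import numpy as np

def fuzzy_cmeans(X, k, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)        # memberships sum to 1
    for _ in range(n_iter):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]    # fuzzy centroids
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)
    return U, C
```

Because every object contributes to every centroid to a varying degree, the centroids are pulled toward the bulk of the data and come out less extreme than crisp k-means centroids.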

In parallel meta-clustering, the profiles of reviewers and businesses influence each other during the clustering process itself. The influence of the business profiles on the reviewer representation, and vice versa, can be controlled using different weighting strategies. As we were primarily interested in the static representations of both the businesses and the reviewers, we chose to emphasize this aspect in our weighting strategy. As a result, centroids from the static clustering are reasonably similar to the static parts of the meta-clustered centroids. This means that the addition of the dynamic part does not radically change the original profiles or object memberships. However, subtle changes to the object memberships, especially when using fuzzy meta-clustering, help us better describe the profiles of individual objects. The meta-clustering thus enhances these profiles.
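
The weighting idea can be sketched as scaling the dynamic columns before clustering so that the static attributes dominate the distance computation (the weight value below is illustrative, not the setting used in the experiments):

```python
# Weighted concatenation of static and dynamic parts; a smaller w
# de-emphasizes the dynamic part in the clustering distance.
import numpy as np

def weighted_repr(static, dynamic, w=0.25):
    return np.hstack([static, w * dynamic])
```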

5 Meta-clustering in granular networks using mobile phone data

In a networked environment, information granules can depend on each other for a more complete representation. For example, in a mobile phone network, granules such as phone users are connected to other phone users. In such a case, the profile of a phone user should include the profiles of other users created by the same clustering process. The information granule in the proposed meta-clustering algorithm is represented by static and dynamic portions \((s_j, d_j)\). The static portion \(s_j\) consists of attributes derived from the primary source of data, such as the average duration of a phone call, the number of phone calls, the type of phone calls, and temporal information about the phone calls. The dynamic portion \(d_j\) consists of the profiles of the destination phone users. These profiles of other phone users need to be recursively derived from the clustering process itself. For the first meta-clustering iteration, we use only the static representation. From the second iteration onwards, the dynamic part is derived from the cluster memberships of the previous iteration. As the clustering scheme evolves, the dynamic part of the information granule keeps changing with every iteration. The meta-clustering process stops when the values of the dynamic representation \(d_j\) stabilize for all objects. While the proposed recursive meta-clustering is applicable to any social network, its effectiveness is demonstrated here for a network of phone users. Mobile phone call mining is an emerging area of research. Phone call records of a group of customers are a rich source of information for user modeling. Calling patterns can be useful for suggesting appropriate modifications to a customer's calling plan. The proposed integrated analysis of all the users in a granular network can help phone companies better manage their resources as well as maximize utilization and profits.

Fig. 4
figure 4

Meta-clustering algorithm in a granular network

5.1 Algorithm: meta-clustering in granular networks

We will describe our proposal using phone numbers as our objects. However, the algorithm can be applied to any network of objects. Let \(pn_j\) be the jth phone number, and let us represent \(pn_j\) by a static data part \(s_j\) and a dynamic data part \(d_j\), i.e., \(pn_j= (s_j,d_j) \). If k is the number of clusters, then \(d_j= (m_{j1},m_{j2},\ldots ,m_{jk}) \), where \(m_{jk}\) is the normalized count of the phone numbers called by \(pn_j\) that fall in the kth cluster from the previous iteration. The normalization divides each count by the sum of all the counts \(m_{j1},\ldots ,m_{jk}\). As the clustering scheme evolves, \(d_j\) keeps changing with every iteration. For the ith iteration, let us represent the corresponding quantities by a superscript i.

Then,

$$\begin{aligned} s_j^i = s_j^{i-1} = s_j^0 \end{aligned}$$
(9)
$$\begin{aligned} d_j^i= (m_{j1}^{i-1},m_{j2}^{i-1},\ldots ,m_{jk}^{i-1}) \end{aligned}$$
(10)

where \(m_{jk}^{i-1}\) is the normalized count of destination numbers belonging to the kth cluster called by \(pn_j\) in \( (i-1)\)th iteration.

$$\begin{aligned} pn_j^i = (s_j^i,d_j^i) = (s_j^0,d_j^i) \end{aligned}$$
(11)
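
A direct reading of Eq. (10) in code, with a hypothetical helper name: the dynamic part of \(pn_j\) is the normalized per-cluster count of the numbers it calls, computed from the previous iteration's labels:

```python
# Compute the dynamic part d_j^i from the (i-1)th iteration's
# cluster labels, as in Eq. (10). Names are illustrative.
import numpy as np

def dynamic_part(called, prev_labels, k):
    """called: indices of the numbers that pn_j calls;
    prev_labels: cluster labels from iteration i-1."""
    counts = np.bincount(prev_labels[called], minlength=k).astype(float)
    total = counts.sum()
    return counts / total if total else counts
```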

Figure 4 provides a more formal and concise description of the proposed recursive meta-clustering algorithm. The following is a brief explanation of the application of the algorithm to the mobile phone call data set.

We first normalize the data by dividing the values of each of the seven attributes by the mean value for that attribute. This ensures that all seven attributes are more or less equally weighted. Determining the appropriate number of clusters is one of the major challenges in any clustering analysis. This issue is resolved by applying k-means clustering to the normalized phone number data for different values of k. We used the minimization of within-cluster scatter and the maximization of between-cluster separation as competing objectives, quantified by the Davies–Bouldin (DB) index (Davies and Bouldin 1979), to determine the optimum number of phone number clusters. We found the optimum number of phone number clusters to be five.
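
This model-selection step can be sketched as follows (a minimal sketch using scikit-learn's Davies–Bouldin implementation; lower DB values indicate compact, well-separated clusters):

```python
# Mean-normalize the attributes, then pick the k that minimizes
# the Davies-Bouldin index over a candidate range.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def best_k(X, k_range=range(2, 11)):
    Xn = X / X.mean(axis=0)          # divide each attribute by its mean
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(Xn)
        scores[k] = davies_bouldin_score(Xn, labels)
    return min(scores, key=scores.get), scores
```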

As the first step in our recursive meta-clustering algorithm, we apply the k-means clustering algorithm to the phone numbers with seven static attributes (as described in the next section). After the initial clustering with the static representation involving seven attributes, we introduce five additional columns to the raw data set. These five attributes hold the number of destination phone calls that fall into each of the five clusters. Let us elucidate this further. For an origin phone number, we have a set of destination phone numbers. It is possible that some of the destination phone numbers belong to cluster 1. Similarly, we will have destination phone numbers belonging to clusters 2, 3, 4 and 5. We count the number of such destination phone numbers belonging to clusters 1, 2, 3, 4 and 5 for each of the origin phone numbers. This new information is first normalized by its sum and entered in the five new columns we added to the raw data set. We perform the next iteration of clustering using the new augmented data set. The process is repeated until the five columns corresponding to the normalized cluster memberships stabilize.
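
The whole procedure can be sketched as a single loop (hypothetical names; the augmented matrix is re-clustered until the normalized membership columns stop changing):

```python
# Recursive meta-clustering over a network of phone numbers:
# augment the static attributes with normalized destination-cluster
# counts and re-cluster until the dynamic columns stabilize.
import numpy as np
from sklearn.cluster import KMeans

def recursive_meta(static, callees, k=5, tol=1e-3, max_rounds=100):
    n = len(static)
    dyn = np.zeros((n, k))           # first round uses only the static part
    labels = None
    for _ in range(max_rounds):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(
            np.hstack([static, dyn]))
        new = np.zeros((n, k))
        for j, dest in enumerate(callees):
            counts = np.bincount(labels[dest], minlength=k)
            if counts.sum():
                new[j] = counts / counts.sum()
        if np.abs(new - dyn).max() < tol:    # memberships stabilized
            return labels, new
        dyn = new
    return labels, dyn
```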

5.2 Mobile phone network data

This section describes data preparation applied to the original data set provided by Eagle (2010) followed by the design of the experiment.

The objective of the present study is to use recursive clustering to converge to a set of user profiles. The data set comprises 182,208 phone call records collected from about 102 users over a period of 9 months. Based on a previous study by Lingras and Butz (2011), we chose the following variables to represent a phone call:

  1. Average duration of phone calls

  2. Average number of weekend/weekday calls

  3. Average number of daytime/night-time calls

  4. Average number of outgoing/incoming calls

  5. Average number of SMS

  6. Average number of voice calls

  7. Average number of long duration calls.

Table 11 The cluster centers and sizes corresponding to the static part of the data at the 2000th iteration
Table 12 The summary of static profiles

5.3 Results of meta-clustering in granular networks

The clustering results can be analyzed in two parts: static and dynamic. The static results are derived from the matrix \(D^0\), while the dynamic results are derived from the information added with each iteration, i.e., from \([m_1^i | m_2^i |\cdots | m_k^i]\).

5.3.1 Static results

The static results correspond to the clustering analysis based on the static part of the data as described earlier. These results are derived from the centers and the sizes of the clusters. The centers of the clusters for the static variables and their sizes are tabulated in Table 11. From the results of the recursive clustering, we can make the following observations.

In each iteration, the cluster sizes were approximately 5, 24, 11, 17 and 34, with slight variations of plus or minus 4. Even though the assignment of clusters to these sizes changes from iteration to iteration, the pattern of the above-mentioned cluster sizes does not change. This indicates that the clustering is fairly strong and that the group of numbers in a particular cluster tends to move together into a different cluster in a different iteration.

Profile of Cluster 1: These users make a low number of calls, with low average duration, few weekend calls, the most daytime calls, the most outgoing calls, the fewest SMS calls, the most voice calls, and the fewest long duration calls.

Profile of Cluster 2: This cluster is made up of phone numbers that make the highest number of calls, with low average duration, few weekend calls, the fewest daytime calls, a low number of outgoing calls, a moderate number of SMS calls, and a moderate number of voice calls.

Profile of Cluster 3: This cluster is made up of phone numbers that make the fewest calls, with low average duration, few weekend calls, a moderate number of daytime calls, a moderate number of outgoing calls, a moderate number of SMS calls, and a high number of voice calls.

Profile of Cluster 4: This cluster is made up of phone numbers that make a moderate number of calls, with the lowest average duration, a high number of weekend calls, a moderate number of daytime calls, the fewest outgoing calls, the most SMS calls, the fewest voice calls, and the fewest long duration calls.

Profile of Cluster 5: This cluster is made up of phone numbers that make a moderate number of calls, with the highest average duration, a high number of weekend calls, few daytime calls, a high number of outgoing calls, few SMS calls, a high number of voice calls, and the highest number of long duration calls.

The above inferences have been summarized in Table 12.

5.3.2 Dynamic results

The dynamic results correspond to the clustering analysis based on the dynamic part of the data as described earlier. These results are derived from the centers and sizes of the clusters and are given below.

The result of the clustering is an augmented table of the seven static attributes from the raw data and the five normalized \(m_{j,k}^{i}\) that provide the dynamic attributes. The normalization is such that the \(m_{j,k}^i\) gives a measure of the probability of the destination numbers belonging to a particular cluster (1, 2, 3, 4 or 5) for each of our 91 phone numbers.

The observations from the resulting five dynamic attributes are as follows:

  1. The phone numbers with a close social circle have high \(m_{j,k}^i\) values, while the phone numbers with a more diverse social group have low \(m_{j,k}^i\) values.

  2. Most of the numbers have a very low probability that their calls will belong to a particular cluster formed by the given 91 phone numbers. This is an expected result because the given data set is not a closed group of phone numbers: most phone calls are made to phone numbers that fall outside the given 91 phone numbers.

  3. However, we see spikes at some phone numbers. These are phone numbers that make and receive a good number of calls from phone numbers belonging to a cluster composed of phone numbers from the data set. The biggest spike is the second phone number, which shows a probability of 1 for its calls being from a single cluster. The corresponding phone logs for this number were checked, and indeed all of its calls are made to a particular cluster.

  4. Overall, the basic structure of the graph does not change much over the iterations.

The cluster centers for the dynamic part of the data can also be analyzed in a similar fashion. The cluster centers at the end of the 2000 iterations are given in Table 13.

Table 13 The cluster centers corresponding to the dynamic part of the data at the 2000th iteration

From Table 13, it can be seen that the centers show some differentiation between the clusters even for a recursive clustering scheme. Each row gives us the cluster center for each cluster and each column gives us a dimension. The following general remarks can be made with regard to the above table.

Sociability or Row-wise Analysis Since the rows are cluster centers, each value in a row is the value of the cluster center along a particular dimension. If the center of cluster l is high with respect to a particular dimension \(m_{j,k}^i\), it means that phone numbers belonging to cluster l frequently communicate with the phone numbers of cluster k. Extending this idea, if the values along most dimensions are high for cluster l (i.e., most of the values along the row are high), then cluster l is a very social cluster that is in contact with most of the clusters. It should be noted that if the value of a cluster center along dimension l (i.e., \(m_{j,l}^i\)) is very high and the value along dimension k (i.e., \(m_{j,k}^i\)) is very low, it only means that the data corresponding to this cluster comprise a majority of high values along dimension l and a minority of high values along dimension k. It does not categorically exclude the existence of high values along dimension k in the data set. Hence, the row-wise values reflect a comparative profile of the calls belonging to the cluster.

Popularity or Column-wise Analysis The popularity values (column-wise values) indicate how important a dimension (i.e., \(m_{j,l}^i\)) is to the cluster centers of all the clusters. As seen in Table 13, \(m_{j,2}^i\) (i.e., the column of cluster 2) has high values for clusters 1, 2 and 4. This means that \(m_{j,2}^i\) is an important dimension for clusters 1, 2 and 4. However, all values along the column \(m_{j,1}^i\) are high. This means that \(m_{j,1}^i\) is an important dimension for all the clusters, making cluster 1 a very popular and important cluster for all other clusters.
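
The row-wise and column-wise readings can be sketched on an illustrative centroid matrix (the matrix below is made up for illustration; it is not the values of Table 13):

```python
# Rows are cluster centers, columns are the dynamic dimensions m_k.
# Row means gauge sociability; column means gauge popularity.
import numpy as np

M = np.array([[0.5, 0.3, 0.2, 0.0, 0.4],
              [0.4, 0.4, 0.3, 0.0, 0.3],
              [0.3, 0.2, 0.3, 0.1, 0.2],
              [0.4, 0.5, 0.2, 0.0, 0.1],
              [0.6, 0.1, 0.0, 0.0, 0.5]])

sociability = M.mean(axis=1)    # how broadly each cluster calls out
popularity = M.mean(axis=0)     # how heavily each cluster is called
most_popular = int(popularity.argmax())
least_popular = int(popularity.argmin())
```

In this made-up matrix, the first column is uniformly high and the fourth is nearly zero, mirroring the most and least popular clusters discussed below.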

Please note that sociability and popularity are independent of each other. Hence, it is possible that a cluster l is social with cluster k but is not popular with cluster k. The results of the clustering as shown in Table 13 in terms of sociability and popularity are summarized below.

  • Cluster 1 is the most popular cluster because the dimension \(m_{j,1}^i\) has none of its values close to 0. Cluster 1 is also a very social cluster as its row-wise values are high.

  • Cluster 2 is very popular among the phone numbers from the clusters 1, 2 and 4. Cluster 2 is also fairly social except with phone numbers from cluster 4.

  • Cluster 3 is very popular with all the clusters except cluster 5. Cluster 3 is also social with all clusters.

  • Cluster 4 is the least popular cluster because the dimension \(m_{j,4}^i\) is nearly 0 for all clusters. Also, cluster 4 phone numbers are social with all the clusters except their own cluster. These phone numbers indicate people who are very selective of the people they communicate with.

  • Cluster 5 phone numbers are the least social phone numbers. They socialize only with themselves and with phone numbers from cluster 1. But they are popular phone numbers and all clusters communicate with them.

From the above observations, we can conclude that while certain phone users tend to concentrate their destination numbers to a particular group of people (who fall within the same cluster because of their inherent calling behavior), others are more diversely networked. We can also distinguish between their sociability and popularity characteristics which can help to build a sophisticated model of the social network represented by the data set.

6 Meta-clustering in granular temporal environment

This paper proposes a novel recursive approach to temporal clustering in a granular environment. An information granule represents an object. For example, a stock in a financial market may be interesting to a trader if it is volatile on a given day. In that case, we can represent the stock using the price distribution during the day. In such a granular temporal environment, a daily pattern is connected to historical and future daily patterns. Traditionally, clustering of granules is done in isolation without any information on clustering of the connected granules. Such a clustering will only allow us to create the profile of a stock based on daily volatility. However, a trader will typically want to know how long the stock has been volatile to figure out where the stock is in terms of its volatility cycle. The proposed recursive temporal meta-clustering algorithm enhances the representation of a daily pattern with the clustering information of the daily patterns of the same stock from recent history. The clustering of such an enhanced representation is iterative. Each iteration uses results of the previous clustering of historical temporal patterns, until a stable clustering of all the patterns is achieved. These repeated applications of clustering are called meta-clustering because we use clustering information from the previous iteration to modify the representation of the granules. We demonstrate the application of the proposed approach using daily patterns of a set of 223 financial instruments (stocks/bonds/commodities) tracked over a period of 121 days. The resulting meta-profiles augment the daily volatility profile with historical volatility for the same financial instrument.

6.1 Algorithm: meta-clustering in granular temporal environment

The algorithm for the proposed recursive temporal meta-clustering is represented in Fig. 5. The primary objective of the temporal meta-clustering is to recursively provide a historical perspective of the clustering. For example, let us consider the daily patterns of a number of stocks that are being traded in a financial market. We want to create a profile of the daily pattern of volatility using clustering. However, profiling a stock on a single day's price pattern will not tell the trader whether the stock is in an early or late stage of an unusual interest in the market. Therefore, we want to use the volatility profiles of the recent history of a stock for creating a volatility profile of the stock on a given day. However, these historical volatility profiles require the clustering of the stock's patterns for the recent days in the same clustering algorithm, leading to recursive profiling.

The two key steps in the algorithm are the creation of the static and dynamic parts. A daily pattern of a stock is naturally connected to the daily patterns of the same stock from previous days. It is fair to assume that sustained activity in a stock does not last for more than 2 weeks (10 trading days). Based on this assumption, we can create a graph where each daily pattern is connected to the daily patterns of the same stock from the previous 10 days. The representation of a daily pattern thus includes data from that day (obtained statically from the database). This static part consists of the natural logarithms of five percentile values, the 10th, 25th, 50th, 75th and 90th percentiles, as described in the data processing section. The choice of the natural logarithm is based on the Black-Scholes index, which also uses a natural logarithm to moderate the variation in the values. The historical volatility of the same stock over the last ten trading days constitutes the dynamic part of the representation of a daily pattern. More specifically, the dynamic part uses the volatility rankings of the last ten trading days for the same stock, based on the meta-clustering information. To have 10 days of history available in the representation of a daily pattern, our dataset consists of patterns from the 11th trading day onwards.
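The linkage between daily patterns described above can be sketched as follows. This is our own minimal illustration, with hypothetical function and field names, assuming each pattern is identified by a (stock, day) pair:

```python
from collections import defaultdict

def build_history_links(records, window=10):
    """Link each (stock, day) pattern to the same stock's previous `window` days.

    `records` is a list of (stock_id, day_index) pairs. Returns a dict mapping
    each pattern to the list of its historical patterns. Patterns with fewer
    than `window` days of history are dropped, matching the paper's use of
    data from the 11th trading day onwards.
    """
    by_stock = defaultdict(list)
    for stock, day in sorted(records):
        by_stock[stock].append(day)

    links = {}
    for stock, days in by_stock.items():
        for i, day in enumerate(days):
            if i >= window:  # only patterns with a full window of history
                links[(stock, day)] = [(stock, d) for d in days[i - window:i]]
    return links

# Example: one stock with 12 trading days -> only days 11 and 12 get links
recs = [("XYZ", d) for d in range(1, 13)]
links = build_history_links(recs)
```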

Fig. 5

Recursive temporal meta-clustering algorithm in a granular environment

6.2 Financial market data processing

Volatility of a financial data series is an important indicator used by traders. The fluctuation in prices creates trading opportunities. Volatility is a measure of the variation in the price of a financial instrument over time. The distribution of prices during the day can provide an elaborate description of price fluctuations. A trader finds a daily price pattern interesting when it is volatile; the higher the fluctuations in prices, the more volatile the pattern. Nobel Prize-winning research introduced the Black-Scholes index to quantify the volatility of a pattern. We can segment daily patterns based on the values of the Black-Scholes index. This segmentation is essentially a clustering of a one-dimensional representation (the Black-Scholes index) of the daily pattern. The Black-Scholes index is a single concise index for identifying volatility in a daily pattern. However, a complete distribution of prices during the day can provide more elaborate information on the volatility. While a distribution consisting of the frequencies of different prices is not a concise description for a single day, it can be a very useful representation of daily patterns for clustering based on volatility. Lingras and Haider (2014a) described how to create a rough ensemble of clusterings using both of these representations. In this paper, we use only the daily price distribution to demonstrate recursive temporal meta-clustering. However, the proposed approach can use either of the two representations, or even an ensemble of the two clustering methods.

Following Lingras and Haider (2014a), we use five percentile values, the 10th, 25th, 50th, 75th and 90th percentiles, to represent the price distribution: 10 % of the prices are below the 10th percentile value, 25 % of the prices are below the 25th percentile value, and so on. Our dataset contains average prices at 10-minute intervals for 223 instruments transacted on 121 days, comprising a total of 27,012 records. Each daily pattern has 39 intervals. This dataset is used to create a five-dimensional pattern representing the 10th, 25th, 50th, 75th and 90th percentile values of the prices. The prices are normalized by the opening price so that a commodity selling for $100 has the same pattern as one selling for $10. Afterwards, the natural logarithms of the five percentile values are calculated.
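The construction described above can be sketched as follows. This is a minimal numpy illustration with a function name of our own; the scaling applied in a later step of the paper is omitted here:

```python
import numpy as np

def static_part(day_prices):
    """Five-percentile static representation of one daily price pattern.

    `day_prices`: the 39 ten-minute average prices for one instrument/day.
    Prices are normalized by the opening price, then the natural logarithms
    of the 10th, 25th, 50th, 75th and 90th percentiles are taken.
    """
    prices = np.asarray(day_prices, dtype=float)
    normalized = prices / prices[0]                # opening-price normalization
    pct = np.percentile(normalized, [10, 25, 50, 75, 90])
    return np.log(pct)

# A perfectly flat pattern (no volatility) maps to five values near zero
flat = static_part([100.0] * 39)
```

Because of the normalization, a $100 instrument and a $10 instrument with proportionally identical fluctuations produce the same five-dimensional vector.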

The representation is clustered 10 times using the k-means clustering algorithm. The optimal number of clusters is found by plotting the Davies–Bouldin (DB) index and the within-cluster scatter. The aim is to use the number of clusters corresponding to the lowest DB index and the knee of the within-cluster scatter curve. Based on these two criteria, we choose five as a reasonable number of clusters (Lingras and Haider 2014a).
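A minimal sketch of the Davies–Bouldin index used in this model selection (our own implementation; libraries such as scikit-learn provide an equivalent `davies_bouldin_score`):

```python
import numpy as np

def davies_bouldin(X, labels, centroids):
    """Davies-Bouldin index: lower values mean tighter, better-separated clusters."""
    X, centroids = np.asarray(X, float), np.asarray(centroids, float)
    k = len(centroids)
    # average distance of each cluster's members to its centroid (scatter)
    s = np.array([np.linalg.norm(X[labels == i] - centroids[i], axis=1).mean()
                  for i in range(k)])
    db = 0.0
    for i in range(k):
        # worst similarity ratio of cluster i against every other cluster
        db += max((s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i)
    return db / k

labels = np.array([0, 0, 1, 1])
separated = davies_bouldin([[0, 0], [0.5, 0], [10, 10], [10.5, 10]],
                           labels, [[0.25, 0], [10.25, 10]])
overlapping = davies_bouldin([[0, 0], [0.5, 0], [1, 1], [1.5, 1]],
                             labels, [[0.25, 0], [1.25, 1]])
```

Well-separated clusters yield a much lower index than overlapping ones, which is why the number of clusters with the lowest DB index is preferred.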

Since the natural logarithm values are very small, they are multiplied by 100. This weighting ensures that the small values of the static part of our representation are not dominated by the large values of the dynamic part. Examples of the static parts of some of the daily patterns are shown in Table 14.

Table 14 Static part of percentile data

To create the first dynamic part, we cluster the daily patterns using these static parts. The resulting five clusters are ranked by their volatility; higher values of the 90th percentile suggest higher volatility. The cluster with the lowest volatility is ranked 1, the cluster with the next lowest centroid value is ranked 2, and so on. The ranked clusters, with the centroid values of the percentiles from the static part, are shown in Table 15.

Table 15 Ranked clusters for percentile data after first iteration
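The ranking step can be sketched as follows, assuming the 90th-percentile value is the last component of each centroid (the function name is ours):

```python
import numpy as np

def rank_clusters_by_volatility(centroids):
    """Rank cluster centroids by their 90th-percentile component (last column).

    Returns a dict mapping cluster index -> rank, where rank 1 is the least
    volatile cluster (lowest 90th-percentile centroid value).
    """
    order = np.argsort([c[-1] for c in centroids])   # ascending volatility
    return {int(cluster): rank + 1 for rank, cluster in enumerate(order)}

# Three toy centroids; the middle one has the highest 90th-percentile value
ranks = rank_clusters_by_volatility([[0.1, 0.3], [0.2, 0.9], [0.15, 0.5]])
```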

The dynamic part of a daily pattern is created by assuming that the daily pattern is related to the last ten daily patterns of the same stock. We use the last ten volatility rankings for the same stock to make up the dynamic part of the representation of the daily pattern. The dynamic part puts the volatility of a stock in historical perspective. We form the dynamic part by taking the percentile-value representation of each of the m = 10 historical days and assigning it the rank of the closest static-part cluster. Examples of dynamic parts created after the first clustering with static parts are shown for some of the daily patterns in Table 16.

Table 16 Dynamic part after first iteration
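The nearest-centroid rank assignment can be sketched as follows (our own minimal illustration; the function name is hypothetical):

```python
import numpy as np

def dynamic_part(history_vectors, centroids, ranks):
    """Build the dynamic part of one daily pattern.

    Each historical percentile vector is assigned the volatility rank of the
    nearest static-part cluster centroid, giving a history of rankings for
    the same stock.
    """
    centroids = np.asarray(centroids, dtype=float)
    out = []
    for v in history_vectors:
        nearest = int(np.argmin(np.linalg.norm(centroids - np.asarray(v, float),
                                               axis=1)))
        out.append(ranks[nearest])
    return out

# Two toy centroids ranked 1 (calm) and 2 (volatile); two historical days
dp = dynamic_part([[0.1, 0.0], [0.9, 1.0]],
                  [[0.0, 0.0], [1.0, 1.0]],
                  {0: 1, 1: 2})
```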

The static and dynamic parts of Tables 14 and 16 are concatenated, as shown in Table 17, for the next step of the clustering. The profiles resulting from clustering the concatenated representation with 15 attributes (5 percentiles and the ranks of the last 10 days) are shown in Table 18. After every clustering, the dynamic part is updated and the clustering is repeated until the dynamic part converges or until the maximum number of iterations is reached.

Table 17 Concatenated static part (SP) and dynamic part (DP) after first iteration
Table 18 Cluster centers after clustering with concatenated profile
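The overall iterative process can be sketched as follows. This is our own schematic with hypothetical function names; for simplicity it checks for exact convergence of the dynamic parts, whereas the experiments below accept a high but imperfect match:

```python
def meta_cluster(static_parts, history, cluster_fn, rank_fn, max_iter=65):
    """Sketch of the recursive temporal meta-clustering loop.

    static_parts: {pattern_id: list of five log-percentile values}
    history:      {pattern_id: ids of the same stock's previous patterns}
    cluster_fn:   takes a list of vectors, returns (labels, centroids)
    rank_fn:      takes centroids, returns {cluster_index: volatility rank}
    """
    ids = list(static_parts)
    dynamic = {i: [] for i in ids}            # first pass uses static parts only
    for _ in range(max_iter):
        vectors = [static_parts[i] + dynamic[i] for i in ids]
        labels, centroids = cluster_fn(vectors)
        ranks = rank_fn(centroids)
        rank_of = {i: ranks[labels[n]] for n, i in enumerate(ids)}
        # rebuild each pattern's dynamic part from its history's new rankings
        new_dynamic = {i: [rank_of[h] for h in history[i] if h in rank_of]
                       for i in ids}
        converged = new_dynamic == dynamic    # dynamic parts stopped changing
        dynamic = new_dynamic
        if converged:
            break
    return labels, dynamic

# Toy instantiation: 1-D static parts, two fixed clusters split at 0.5
def toy_cluster(vectors):
    labels = [0 if v[0] < 0.5 else 1 for v in vectors]
    return labels, [[0.1], [0.9]]

labels, dynamic = meta_cluster(
    {"a": [0.1], "b": [0.9], "c": [0.2]},
    {"a": [], "b": ["a"], "c": ["b"]},
    toy_cluster, lambda centroids: {0: 1, 1: 2})
```

In the toy run, the first iteration clusters static parts alone, the second clusters the concatenated representations, and the loop stops as soon as the dynamic parts repeat.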

6.3 Results of granular temporal meta-clustering

The recursive temporal meta-clustering process described in the previous section was executed for a maximum of 65 iterations. The rounded values of the dynamic parts were compared to test for convergence. While we did not get a 100 % match between the dynamic parts of two consecutive iterations, a maximum match of 98 % was found between the results of iterations 9 and 10.

Based on these observations on the convergence, we will use the cluster results of the 9th iteration as the final clustering. The final cluster centroids and their ranks are shown in Table 19.

Table 19 Final ranked centers for percentile data

Now that we have the static and dynamic profiles for the days of all the financial instruments (stocks), we can create the meta-profiles of each cluster as follows:

Cluster C4 (Rank 1): Least volatile The stocks in this cluster are neither volatile today nor have they shown any volatility over the last 2 weeks (10 trading days).

Cluster C3 (Rank 2): Low volatility today, but volatile over last week The stocks in this cluster are not volatile today. However, they were volatile last week (5 trading days). The volatility in these stocks may be subsiding, and it may be relatively safe to sell them.

Cluster C2 (Rank 3): Moderate volatility today and last week, but volatile 2 weeks ago The stocks in this cluster are somewhat volatile today. They have not shown much volatility over the last week (5 trading days) either. However, they were quite active 2 weeks ago. The volatility in these stocks has definitely subsided, and it may be better to sell them, as they are unlikely to rise in the near future.

Cluster C5 (Rank 4): Moderate volatility today, but volatile for last 2 weeks The stocks in this cluster are moderately volatile today. However, they were volatile over the last 2 weeks (10 trading days). The volatility in these stocks seems to have come to a screeching halt. It may be a good idea to study the news on these stocks and trade accordingly.

Cluster C1 (Rank 5): High volatility today, but not volatile for last 2 weeks The stocks in this cluster are attracting interest from traders. They may be in the early phase of activity and may represent potential buying opportunities.

In conventional clustering, we can only represent the daily volatility of a stock. Profiles created using only a single day's volatility do not provide a convenient historical perspective on the volatility of a given stock. Since a daily pattern of a stock is naturally related to the daily patterns of the same stock from previous days, a historical cluster representation based on this relation may be more convenient. Thus, the resulting meta-profiles created here on the basis of such relations describe not only the current volatility of a stock but also whether the stock is at the beginning or the end of a volatility cycle. This may allow traders to look at stocks differently, leading to more informed decisions.

7 Computational requirements for the meta-clustering algorithm

The primary objective of the proposed meta-clustering algorithm is to generate semantically more meaningful profiles based on the connections between granules. It is necessary to strike a balance between reliable and useful profiles and computational efficiency. The proposed meta-clustering algorithm has inherent opportunities for parallel processing. Therefore, while it will require significant computational resources, these can be distributed among multiple processors, resulting in a reasonable chronological time requirement. In this section, we discuss the computational requirements and describe how the algorithm can be parallelized. The implementation of parallel meta-clustering is a separate research topic in itself and is being investigated as part of our ongoing research.

The problem of obtaining an optimal clustering scheme is NP-hard. Let us assume that there are n objects that need to be grouped into k clusters. Each object can be assigned to any one of the k clusters, resulting in \(k \times k \times \cdots \times k = k^n\) possible clustering schemes. The clustering scheme that provides minimum scatter within clusters and maximum separation between clusters will then be selected as the optimal one. Therefore, finding the optimal clustering scheme will require \(O(k^n)\) calculations of cluster quality. Each calculation of cluster quality requires \(O(n^2)\) distance calculations.
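The exponential count can be verified directly for small cases, since each of the n objects independently chooses one of the k clusters:

```python
from itertools import product

def count_assignments(n, k):
    """Enumerate every way of assigning n objects to k clusters: k ** n."""
    return sum(1 for _ in product(range(k), repeat=n))

# 4 objects, 2 clusters -> 2 ** 4 = 16 assignments (including assignments
# that leave a cluster empty)
schemes = count_assignments(4, 2)
```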

One of the popular iterative clustering algorithms, k-means, has been found to provide reasonable clustering schemes. Each iteration in k-means requires \(O(k \times n)\) distance calculations. Therefore, k-means time requirements are \(O(k \times n \times iter)\), where iter is the number of iterations. However, the clustering scheme resulting from k-means depends on the initial choice of cluster centers. Typically, one needs to apply k-means multiple times and choose a clustering scheme that provides minimum scatter within clusters and maximum separation between clusters. Similar observations can be made about the fuzzy c-means algorithm.

The proposed meta-clustering algorithm uses multiple applications of a conventional clustering algorithm such as the k-means or fuzzy c-means. In addition, the resulting clustering schemes will be used to create the dynamic representations for each object. The creation of a dynamic representation will require \(O(n_1 \times n_2)\) computations, where \(n_1\) is the number of objects from one granular level (e.g., customers) and \(n_2\) is the number of objects from the other granular level (e.g., products). One set of experiments reported in this paper used 11,537 businesses and 43,837 reviewers. These experiments used a linear application of the clustering algorithms. The linear implementations will require significant chronological time when the values of \(n_1\) and \(n_2\) are of the order of millions or billions. It is possible to reduce the chronological time in a distributed environment as follows:

  1. Apply the k-means or fuzzy c-means algorithm in parallel on multiple nodes and choose the clustering scheme with the best quality.

  2. Although the proposed meta-clustering algorithm normally waits for the clustering of objects from granular level 1 to create dynamic representations for objects from granular level 2 before grouping the objects from level 2, it is possible to run the clustering at both levels in parallel. The clustering at level 1 will use the currently available clustering scheme from level 2. If the clustering scheme from level 2 has not changed, then the clustering at level 1 will wait for a new clustering scheme from level 2. This parallel processing scheme can be implemented through message passing.

  3. The creation of dynamic profiles involves sorting and searching lists. There are many parallel implementations of sorting and searching that can be used to speed up these computations.
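The first of these strategies can be sketched with thread-based restarts. This is our own illustration with hypothetical names; a real deployment would distribute the restarts across processes or cluster nodes rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def best_of_parallel_restarts(run_once, n_restarts=8, workers=4):
    """Run independent clustering restarts in parallel; keep the best result.

    run_once(seed) must return (quality, clustering), where lower quality
    means tighter, better-separated clusters (e.g. within-cluster scatter).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_once, range(n_restarts)))
    return min(results, key=lambda r: r[0])

# Toy stand-in for one k-means run whose outcome depends on the seed
def toy_run(seed):
    quality = random.Random(seed).uniform(0, 1)
    return quality, {"seed": seed}

best_quality, best_scheme = best_of_parallel_restarts(toy_run)
```

Since the restarts are fully independent, this pattern parallelizes with no coordination beyond the final reduction that selects the minimum-quality scheme.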

8 Summary and conclusions

Granular computing provides more meaningful representations of objects that can lead to interesting knowledge discoveries. This paper describes how representing objects and their relationships with each other as a granular network can be used to create recursive meta-profiles through a meta-clustering process. These meta-profiles recursively refer to each other. The approach is demonstrated using a granular hierarchy of businesses that are reviewed by multiple reviewers on the website yelp.com, a social network of mobile phone users, and daily trading patterns of stocks that are linked to each other through the time dimension. In addition, the meta-clustering process is shown to be applicable to both regular and fuzzy clustering.

The meta-clusters created for a granular hierarchy are represented using meta-profiles. The meta-profiles for yelp.com help us understand the businesses in relation to their associated reviewers, and vice versa. The profiles created using regular and fuzzy clustering are somewhat different due to the different object-assignment behavior of the two algorithms. Fuzzy clustering may provide more refined results, as it allows an object to be assigned to multiple clusters.

Meta-profiles for a mobile phone network represent different types of mobile phone users with respect to the nature of their phone calls and the groups of users they communicate with. This profiling could be valuable in deciding on appropriate phone packages for certain user segments.

The meta-profile created for the financial time series data is a comprehensive historical representation of similar daily patterns, using the daily patterns of the same stock from previous days, in a granular environment. The derived profiles show both the nature of the current volatility of a stock and its volatility over the previous 2 weeks, which can help investors make trading decisions.

A study of the computational effort required to implement recursive meta-clustering and parallel processing of the proposed technique are two main streams for further research.