An adaptive minimum spanning tree test for detecting irregularly-shaped spatial clusters

doi:10.1016/j.csda.2015.03.008

Computational Statistics & Data Analysis

Volume 89, September 2015, Pages 134-146

Abstract

Clustering methodologies based on the minimum spanning tree (MST) have been widely discussed due to their simplicity and efficiency in signaling irregular clusters. However, most MST-based clustering methods estimate the most likely cluster based on the maximum likelihood ratio over the subtrees that result from removing edges of the MST. They can only estimate one cluster even if multiple clusters are actually present in the study region. To overcome this limitation, we propose an adaptive MST (AMST) method to detect irregularly-shaped clusters. The basic idea is to first determine the best number of partitions of the study region using a validity index and then to determine the significance of the candidate clusters. Comparisons with both the static and dynamic MST methods favor the proposed method.

Introduction

The detection of disease clusters in space is an important public health problem that involves multiple statistical testing. A great number of tests have been proposed to handle this problem. According to Besag and Newell (1991), tests for spatial clusters can generally be classified into two categories, focused tests and general tests, depending on whether an assumption about cluster locations is made. Song and Kulldorff (2006) further divided the general tests into two groups, tests for global clustering and tests for localized clusters, depending on their ability to identify the locations of potential clusters. The global clustering tests look for evidence of whether clusters are present anywhere in the study region, without concern for their locations (Whittemore et al., 1987; Besag and Newell, 1991; Tango, 1995). Tests for localized clusters, on the other hand, are concerned not only with testing statistical significance but also with detecting the locations of clusters (Turnbull et al., 1990; Kulldorff, 1997).

Among these tests, the spatial scan test proposed by Kulldorff (1997) is conceptually intuitive and has been widely discussed as a tool for detecting localized clusters. The conventional scan test first computes the likelihood ratio statistic over circular scanning windows of varying radius, and then determines the area with the maximum likelihood ratio score. The area giving the maximum likelihood ratio is estimated as the most likely cluster. The spatial scan test has been shown to be very powerful when the real cluster has a circular shape. In practice, however, the shape of a potential cluster is unknown and is generally irregular.
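As a minimal sketch of the likelihood ratio computation described above, the following Python code evaluates the standard Poisson log-likelihood ratio for a candidate zone and picks the zone with the largest score. The function names `poisson_llr` and `most_likely_cluster`, the toy counts, and the hand-listed candidate zones are illustrative assumptions and are not taken from the paper; a real scan test would generate the zones from circular windows and assess significance by Monte Carlo replication.

```python
import math

def poisson_llr(cases_in, pop_in, total_cases, total_pop):
    """Poisson log-likelihood ratio for one candidate zone.

    The expected count under the null hypothesis of a constant rate is
    the total number of cases times the zone's share of the population.
    Only zones with an excess of cases receive a positive score.
    """
    expected = total_cases * pop_in / total_pop
    if cases_in <= expected:
        return 0.0
    cases_out = total_cases - cases_in
    llr = cases_in * math.log(cases_in / expected)
    if cases_out > 0:
        llr += cases_out * math.log(cases_out / (total_cases - expected))
    return llr

def most_likely_cluster(zones, cases, pops):
    """Return the candidate zone (a list of cell indices) with the largest LLR."""
    total_cases, total_pop = sum(cases), sum(pops)
    return max(
        zones,
        key=lambda zone: poisson_llr(
            sum(cases[i] for i in zone),
            sum(pops[i] for i in zone),
            total_cases,
            total_pop,
        ),
    )

# Toy data (hypothetical): five cells with equal populations.
cases = [2, 3, 12, 10, 1]
pops = [1000, 1000, 1000, 1000, 1000]
zones = [[0, 1], [2, 3], [3, 4], [2, 3, 4]]
best_zone = most_likely_cluster(zones, cases, pops)
```

Because the maximum is taken over all candidate zones, this procedure returns exactly one zone, regardless of how many clusters are present.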

The use of a scanning window of a rigid geometry has been viewed as a limitation of the spatial scan test. To deal with this limitation, some modifications have been made to the spatial scan test, including the search for elongated, elliptical, and irregular shapes. For example, Neill and Moore (2004) and Kulldorff et al. (2006) proposed the spatial scan statistics for elongated and elliptical-shaped clusters. However, these methods seem to be only slightly less constraining than the spatial scan tests based on circular-shaped clusters. The clusters with less-compact shapes will be overlooked by these methods.

In order to detect arbitrarily shaped clusters, various procedures have been developed. Examples include the upper level set (ULS) method proposed by Patil and Taillie (2004), the simulated annealing (SA) approach proposed by Duczmal and Assunção (2004), and the $k$-nearest-neighbor method proposed by Tango and Takahashi (2005). The ULS method is simple and fast to implement. However, it has been shown to perform poorly compared with other clustering methods. The SA method has the drawback that it requires the setting of awkward tuning parameters. These parameters are not easily interrelated, and there is little guidance on how to set them. The $k$-nearest-neighbor method extends the regular scanning window of the conventional spatial scan test to flexible shapes by using $k$-nearest neighbors. Each cell, together with at most its $k$ nearest neighbors, is considered as a candidate cluster when determining the set of candidate clusters. Clearly, this method suffers from high computational complexity, as the number of candidate clusters increases exponentially with $k$. This makes it inefficient for large values of $k$. In general, no single method can handle all types of clustering problems, owing to arbitrary shapes, unbalanced background population sizes, and variable densities.

Another effective clustering approach is the use of graph-based tests, especially those based on the minimum spanning tree (MST), owing to its intuitive and effective data representation. Clustering using the MST was initiated by Zahn (1971) and later discussed by Maravalle and Simeone (1995) and Maravalle et al. (1997). It has since been successfully employed in many different settings, such as image processing (Wang et al., 2009), pattern recognition (Paivinen, 2005), biological data analysis (Xu et al., 2002), and public health surveillance (Assunção et al., 2006).
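The general idea of MST-based clustering can be sketched as follows: build the MST, remove its longest edges, and read off the connected components as clusters. This is only a generic illustration in the spirit of Zahn (1971), assuming point data with Euclidean edge lengths; it is not the SMST, DMST, or AMST procedure, and the names `mst_edges` and `cut_longest_edges` and the toy points are hypothetical.

```python
import math

def find(parent, a):
    """Union-find root lookup with path halving."""
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def mst_edges(points):
    """Kruskal's algorithm on the complete Euclidean graph of `points`."""
    n = len(points)
    candidates = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))
    tree = []
    for w, i, j in candidates:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:  # the edge joins two different components
            parent[ri] = rj
            tree.append((w, i, j))
    return tree  # n - 1 edges, in order of increasing length

def cut_longest_edges(points, n_clusters):
    """Drop the n_clusters - 1 longest MST edges and return the components."""
    tree = mst_edges(points)
    kept = tree[: len(tree) - (n_clusters - 1)]
    parent = list(range(len(points)))
    for _, i, j in kept:
        parent[find(parent, i)] = find(parent, j)
    groups = {}
    for v in range(len(points)):
        groups.setdefault(find(parent, v), []).append(v)
    return sorted(groups.values())

# Toy data (hypothetical): two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
clusters = cut_longest_edges(points, n_clusters=3)
```

In this sketch the number of clusters is supplied by hand. Choosing it from the data is the step that a validity index is meant to handle.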

In the context of public health surveillance, Assunção et al. (2006) proposed two types of MST-based tests for fast detection of arbitrarily shaped disease clusters: the static MST (SMST) and the dynamic MST (DMST). They showed that the SMST includes the ULS method as a special case. Compared to the SMST, the DMST has larger detection power and is thus suggested for practical use. Note that the MST-based clustering method enjoys two good properties. First, it avoids the specification of tuning parameters. Second, the MST is an extremely economical way to represent the spatial structure of a graph and thus can quickly identify clusters without constraining their possible shapes. Owing to these properties, the MST-based method is more computationally efficient than the SA and $k$-nearest-neighbor methods.

The MST-based methods determine the most likely cluster by maximizing the likelihood ratio over a set of candidate clusters. Thus, both the SMST and DMST methods can only return a single estimated cluster. In practice, however, multiple clusters can be present, which is the more general situation. This limitation restricts the application of the SMST and DMST methods to clustering analysis. To partly fill this gap, we propose an adaptive MST (AMST) approach, aimed at automatically and simultaneously identifying the locations of clusters with arbitrary shapes. The key is to define a validity index that takes both the compactness and the isolation of the data into consideration to help determine the best partition of an MST. Based on the clustering results from the validity index, one can then evaluate the statistical significance of these candidate clusters.

The rest of this paper is organized as follows. In Section 2, we briefly review the SMST and DMST methods. In Section 3, we present the AMST method. In Section 4, we compare the detection power and identification capability based on simulations. Both the case with a single cluster and the case with multiple clusters are considered. In Section 5, two simulation examples are provided to illustrate the use of the proposed method. Some concluding remarks are given in Section 6.

Section snippets

Notation

We begin by defining some notation. The whole study region contains $I$ connected locations. For each location $i$, denote by $x_{i}$ the number of cases observed and by $n_{i}$ its population size. Also, define $X = \sum_{i} x_{i}$ and $N = \sum_{i} n_{i}$ as the total number of cases and the total population size over the whole study region, respectively. A Poisson model is often assumed for the data, namely, $x_{i} \sim \mathrm{Poisson}(n_{i} λ_{i})$, where $λ_{i}$ is the disease rate in cell $i$, interpreted as the number of cases per unit population. Assuming

Adaptive minimum spanning tree test

In the above MST-based methods, the cutting of each edge leads to two connected components. The repeated application of this cutting procedure leads to different cluster candidates that are represented by the connected components. For example, after the removal of two relatively long edges in the MST in Fig. 1(a), three candidate clusters are generated as shown in Fig. 1(b). The candidate cluster with the maximum likelihood ratio is estimated as the most likely cluster. Clearly, both the SMST

Simulation setting

In this section, we carried out extensive simulation studies to compare the AMST with the existing SMST and DMST methods. Two measures of performance based on power and Dice Similarity Coefficient (DSC) (Dice, 1945) were considered in the simulations in order to evaluate the efficiency of these methods. The power measures the probability of rejecting the null hypothesis for a test when the alternative hypothesis is actually true. In particular, the power associated with a test when a cluster is

Illustrative examples

To illustrate the practical use of the AMST approach for scanning spatial clusters, we use the population structure from the previous examples to generate one set of simulated counts for each county. Once again, we set $λ_{0} = 2.4$ per 10,000 persons where there is no cluster present, and $λ_{C} = 2 λ_{0}$ in the presence of clusters. In the simulation, two examples were designed. In the first example, we consider Case 1 with one true cluster assumed. This cluster consists of 6 counties, as shown in Table 1. We

Conclusion

The graphical method based on MST is a common tool used for detecting clusters of disease over a geographic region due to its graphical intuition and simplicity. The recent SMST and DMST methods can automatically and simultaneously estimate clusters with arbitrary shapes. However, both the SMST and DMST methods estimate the most likely cluster based on the maximum likelihood ratio of the resulting candidate clusters after the removal of edges from the MST. Therefore, they might not be efficient

Acknowledgments

The authors are grateful to the editor and two anonymous referees who have provided insightful comments that resulted in significant improvement in this paper. Lianjie Shu’s work was supported in part by FDCT/002/2013/A, MYRG090(Y1-L2)-FBA13-SLJ, and MYRG096(Y1-L2)-FBA12-SLJ. Yan Su’s work was supported in part by FDCT/060/2014/A2 and FDCT/115/2012/A.

References (29)

L. Duczmal et al. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Comput. Statist. Data Anal. (2004)

G. Kerr et al. Techniques for clustering gene expression data. Comput. Biol. Med. (2008)

M. Maravalle et al. Clustering on trees. Comput. Statist. Data Anal. (1997)

N. Paivinen. Clustering with a minimum spanning tree of scale-free-like structure. Pattern Recognit. Lett. (2005)

R. Assunção et al. Fast detection of arbitrarily shaped disease clusters. Stat. Med. (2006)

J. Besag et al. The detection of clusters of rare diseases. J. Roy. Statist. Soc. Ser. A (1991)

L.R. Dice. Measures of the amount of ecologic association between species. Ecology (1945)

V.M.K. Goura et al. The 2nd International Conference on Biotechnology and Food Science. Vol. 7 (2011)

A.K. Jain et al. Algorithms for Clustering (1988)

M. Kulldorff. A spatial scan statistic. Comm. Statist. Theory Methods (1997)

M. Kulldorff. Prospective time periodic geographical disease surveillance using a scan statistic. J. Roy. Statist. Soc. Ser. A (2001)

M. Kulldorff et al. An elliptic spatial scan statistic. Stat. Med. (2006)

J. Lin et al. Minimum spanning tree based spatial outlier mining and its applications

M. Maravalle et al. A spanning tree heuristic for regional clustering. Comm. Statist. Theory Methods (1995)
