An adaptive minimum spanning tree test for detecting irregularly-shaped spatial clusters
Introduction
The detection of disease clusters in space is an important public health problem which has to deal with multiple statistical testing problems. A great number of tests have been proposed to handle this problem. According to Besag and Newell (1991), tests for spatial clusters could be generally classified into two categories: focused tests and general tests, depending on whether the assumption about cluster locations is made or not. Song and Kulldorff (2006) further divided the general tests into two groups: tests for global clustering and tests for localized clusters, depending on the ability to identify the locations of potential clusters. The global clustering tests look for evidence to determine whether there are clusters present over the whole study region without concern about cluster locations (Whittemore et al., 1987, Besag and Newell, 1991, Tango, 1995). On the other hand, tests for localized clusters are concerned with not only testing their statistical significance, but also detecting the location of clusters (Turnbull et al., 1990, Kulldorff, 1997).
Among these tests, the spatial scan test proposed by Kulldorff (1997) is conceptually intuitive and has been widely discussed as a tool for detecting localized clusters. The conventional scan test is to first compute the likelihood ratio statistics based on a circular scanning window with varying radius, and then to determine the area with the maximum likelihood ratio score. The area giving the maximum likelihood ratio is estimated as the most likely cluster. The spatial scan statistic test has been proved to be very powerful when the real cluster has a circular shape. However, the shape of the potential cluster is generally unknown in practice and the shape is generally irregular.
The use of a scanning window of a rigid geometry has been viewed as a limitation of the spatial scan test. To deal with this limitation, some modifications have been made to the spatial scan test, including the search for elongated, elliptical, and irregular shapes. For example, Neill and Moore (2004) and Kulldorff et al. (2006) proposed the spatial scan statistics for elongated and elliptical-shaped clusters. However, these methods seem to be only slightly less constraining than the spatial scan tests based on circular-shaped clusters. The clusters with less-compact shapes will be overlooked by these methods.
In order to detect arbitrarily shaped clusters, various procedures have been developed. A sample of research in this respect includes the upper level set (ULS) method proposed by Patil and Taillie (2004), the simulated annealing (SA) approach proposed by Duczmal and Assunção (2004), and the -nearest-neighbor method proposed by Tango and Takahashi (2005). The ULS is simple and fast to implement. However, it is shown not to have good performance as compared with other clustering methods. The SA method has the drawback that it requires the setting of awkward tuning parameters. These parameters are not easily interrelated, and there is lack of guidance to do this. The -nearest-neighbor method extends the regular scanning window in the conventional spatial scan test to the flexible shape by using -nearest neighbors. Each cell with its -nearest neighbors at most is considered as a candidate cluster when determining the set of candidate clusters. Clearly, this method suffers from the computation complexity issue as the number of candidate clusters increases exponentially with the value of . This makes it inefficient for large value of . In general, there is no versatile method that can handle all types of clustering problem due to the arbitrary shape, unbalanced background population size, and variable densities.
Another effective clustering method consists of using the graph-based tests, especially the minimum spanning tree (MST), owing to its intuitive and effective data representation. Cluster design using MST was initiated by Zahn (1971) and later discussed by Maravalle and Simeone (1995) and Maravalle et al. (1997). It has now been successfully employed in many different settings, such as in image processing (Wang et al., 2009), pattern recognition (Paivinen, 2005), and biological data analysis (Xu et al., 2002), and public health surveillance (Assunção et al., 2006).
In the context of public health surveillance, Assunção et al. (2006) recently proposed two types of MST-based tests for fast detection of arbitrarily shaped disease clusters: the static MST (SMST) and the dynamic MST (DMST). They showed that the SMST includes the ULS method as a special case. Compared to the SMST, the DMST has larger detection power and is thus suggested for practical use. Note that the MST-based clustering method enjoys two good properties. First, it gets away from the specification of tuning parameters. Second, MST is an extremely economical way to represent the spatial structure of a graph and thus has the capacity of quickly identifying clusters without constraining the possible shapes. Due to these good properties, the MST-based method is more computationally efficient as compared to the SA and -nearest-neighbor methods.
The MST-based methods determine the most likely cluster using the maximization of the likelihood ratio over a set of candidate clusters. Thus, both the SMST and DMST methods can only return/estimate a single cluster. In addition to a single cluster present, multiple clusters can also appear, which are more general in practice. This limitation would restrict the applications of SMST and DMST for clustering analysis. To meet the gap in part, we propose an adaptive MST (AMST) approach, aimed at automatically and simultaneously identifying the locations of clusters with arbitrary shapes. The key is to define a validity index taking both compactness and isolation of data into consideration to help determine best partition of a MST. Based on the clustering results from validity index, one can then evaluate the statistical significance of these candidate clusters.
The rest of this paper will be organized as follows. In Section 2, we briefly review the SMST and DMST methods. In Section 3, we present the AMST method. In Section 4, we compare the detection power and identification capability based on simulations. Both the case with a single cluster and the case with multiple clusters are considered. In Section 5, two simulation examples are provided to illustrate the use of the proposed method. Some concluding remarks are given in Section 5.
Section snippets
Notation
We begin by defining some notation. A whole study region contains connected locations. For each location , denote as the number of cases observed, and as its population size. Also, define and as the total number of cases and total population size over all the study region, respectively. A Poisson model is often assumed for the data, namely, Poisson (), where is the disease rate in cell , which is interpreted as the number of cases per unit population. Assuming
Adaptive minimum spanning tree test
In the above MST-based methods, the cutting of each edge leads to two connected components. The repeated application of this cutting procedure leads to different cluster candidates that are represented by the connected components. For example, after the removal of two relatively long edges in the MST in Fig. 1(a), three candidate clusters are generated as shown in Fig. 1(b). The candidate cluster with the maximum likelihood ratio is estimated as the most likely cluster. Clearly, both the SMST
Simulation setting
In this section, we carried out extensive simulation studies to compare the AMST with the existing SMST and DMST methods. Two measures of performance based on power and Dice Similarity Coefficient (DSC) (Dice, 1945) were considered in the simulations in order to evaluate the efficiency of these methods. The power measures the probability of rejecting the null hypothesis for a test when the alternative hypothesis is actually true. In particular, the power associated with a test when a cluster is
Illustrative examples
To illustrate the practical use of the AMST approach for scanning spatial clusters, we use the population structure from the previous examples to generate one set of simulated count at each county. Once again, we set per 10,000 persons where there is no cluster present, and in the presence of clusters. In the simulation, two examples were designed. In the first example, we consider Case 1 with one true cluster assumed. This cluster consists of 6 counties, as shown in Table 1. We
Conclusion
The graphical method based on MST is a common tool used for detecting clusters of disease over a geographic region due to its graphical intuition and simplicity. The recent SMST and DMST methods can automatically and simultaneously estimate clusters with arbitrary shapes. However, both the SMST and DMST methods estimate the most likely cluster based on the maximum likelihood ratio of the resulting candidate clusters after the removal of edges from the MST. Therefore, they might not be efficient
Acknowledgments
The authors are grateful to the editor and two anonymous referees who have provided insightful comments that resulted in significant improvement in this paper. Lianjie Shu’s work was supported in part by FDCT/002/2013/A, MYRG090(Y1-L2)-FBA13-SLJ, and MYRG096(Y1-L2)-FBA12-SLJ. Yan Su’s work was supported in part by FDCT/060/2014/A2 and FDCT/115/2012/A.
References (29)
- et al.
A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters
Comput. Statist. Data Anal.
(2004) - et al.
Techniques for clustering gene expression data
Comput. Biol. Med.
(2008) - et al.
Clustering on trees
Comput. Statist. Data Anal.
(1997) Clustering with a minimum spanning tree of scale-free-like structure
Pattern Recognit. Lett.
(2005)- et al.
Fast detection of arbitrarily shaped disease clusters
Stat. Med.
(2006) - et al.
The detection of clusters of rare diseases
J. Roy. Statist. Soc. Ser. A
(1991) Measures of the amount of ecologic association between species
Ecology
(1945)- et al.
The 2nd International Conference on Biotechnology and Food Science. Vol. 7
(2011) - et al.
Algorithms for Clustering
(1988) A spatial scan statistic
Comm. Statist. Theory Methods
(1997)
Prospective time periodic geographical disease surveillance using a scan statistic
J. Roy. Statist. Soc. Ser. A
An elliptic spatial scan statistic
Stat. Med.
Minimum spanning tree based spatial outlier mining and its applications
A spanning tree euristic for regional clustering
Comm. Statist. Theory Methods
Cited by (15)
A novel hybridization approach to improve the critical distance clustering algorithm: Balancing speed and quality
2024, Expert Systems with ApplicationsA Bayesian spatial scan statistic for multinomial data
2024, Statistics and Probability LettersDiscovery of arbitrarily shaped significant clusters in spatial point data with noise
2021, Applied Soft ComputingCitation Excerpt :These methods are time-consuming and are usually not suitable for large spatial data, especially the FleXScan algorithm only works well for small cluster size (up to 30) [24]. Although the complexity can be reduced by employing the neighborhood graph search [26], Bayesian estimation [27] or the linear optimization model [28], the quality of the clustering results would be reduced and noises may be identified as clusters [29]. To improve the power of flexibly-shaped scanning methods, some intelligent optimization scanning methods were developed, such as the simulated annealing-based scan statistic (SAScan) [30], genetic optimization-based scan statistic (GAScan) [31], ant colony optimization-based scan statistic (AntScan) [23].
A novel data clustering algorithm based on gravity center methodology
2020, Expert Systems with ApplicationsCitation Excerpt :They used random projection to process high dimensional data which allows computationally effective hierarchical clustering using Baire metric. Zhou et al. (2015) proposed an MST based clustering algorithm called Adaptive Minimum Spanning Tree (AMST) to extract irregular shaped clusters. The general idea of the algorithm is first to determine the appropriate number of partitions in the region utilizing a validity index and then determining the significance of candidate clusters.
A new data clustering algorithm based on critical distance methodology
2019, Expert Systems with ApplicationsCitation Excerpt :The drawbacks of algorithms like GAs is that they necessarily include many parameters. Another type of clustering algorithm using an MST was presented by Zhou, Shu, and Su (2015), called the adaptive MST-based clustering algorithm (AMST). Their study focused on extracting clusters that have irregular shapes.
Modelling and application of fuzzy adaptive minimum spanning tree in tourism agglomeration area division
2018, Knowledge-Based SystemsCitation Excerpt :Additionally, to solve the limitations of traditional clustering that only circular clusters can be detected, Assuncao [14] proposed static MST (SMST) and dynamic MST (DMST) to detect the clusters of any shapes. To deal with the limitation that only one clustering solution can be returned, Zhou R [15] proposed adaptive MST (AMST) by evaluating the effectiveness of the clustering results, which returns a variety of clustering solutions to determine the optimal clustering. Based on previous studies, MST has been widely used in clustering analysis [16,17], image segmentation [18,19], density estimation [20], diversity assessment [21] and line optimization [22–24], which are related to biology [25–27], physics [28–30] and many other disciplines.