A Tabu search based clustering algorithm and its parallel implementation on Spark
Introduction
The purpose of a clustering process is to group a set of (abstract or physical) objects into multiple classes, so that the objects in each class (cluster) are similar according to certain rules or criteria. In general, a clustering algorithm builds clusters according to two interrelated criteria: objects in the same cluster should be as similar as possible, while objects in different clusters should be as dissimilar as possible, where the definition of similarity can be problem dependent. Clustering problems arise in applications ranging from data analytics to logistics, as documented in the surveys of Jain et al. [1], Berkhin [2], Grabmeier and Rudolph [3], Xu and Wunsch [4], and Jain [5]. The search for new clustering algorithms that best fit different applications has proved fundamentally important to many recent advances in biology, genetics, medicine, business, engineering, and social science, among others.
Data-driven decision making and the burgeoning demand for data analytics have inspired increasing numbers of scholars as well as practitioners to develop and apply clustering algorithms. Clustering is generally known as an unsupervised learning method (since no prior knowledge is provided that determines which objects should be grouped in a common cluster) and plays a crucial role in finding patterns and trends in datasets. The role of clustering is highlighted in Grabmeier and Rudolph [3], who propose a variety of criteria to evaluate the quality of clusters along with suggestions about how to select solution methodologies. From another perspective, Łuczak [6] proposes a hierarchical clustering approach for classification problems involving time series datasets. Although Łuczak’s method works well for classifying time series data, it cannot be applied directly to most other clustering problems without major modifications.
Data analytics based on clustering are especially pervasive in the public health arena. Glatman-Freedman et al. [7] discuss the use of near real-time spatiotemporal cluster analysis to devise strategies for combatting some enteric bacteria diseases. With the help of this type of analysis, the source of a disease can be identified in a timely manner to enable appropriate measures to be taken before it becomes widespread. Clustering applications also abound in the field of logistics, where spatial and other relational restrictions often exist to limit the choice of objects that can lie in a common cluster. To solve the problems encountered in such applications, Cao and Glover [8] present an algorithm based on Thiessen-polygons that proves highly effective in generating clusters that satisfy restrictions on balancing and connectivity. In this case, the entire service area for the logistics application is divided into several geographically connected and balanced subareas that account for geographic obstacles and constraints imposed by business logic, enabling a logistics service provider to offer better and more efficient services for customers.
Among various clustering approaches, the K-means clustering algorithm [9] is one of the most popular and widely applied. Modern data analytical solver packages, including open source packages such as R and Spark MLlib, include the K-means clustering algorithm among their offerings. Nevertheless, the K-means clustering algorithm exhibits some limitations that need to be addressed to solve clustering problems more effectively. Notably, the algorithm relies on an initial determination of a set of centroids in order to launch subsequent steps that assign objects to these centroids and thereby identify new centroids. The choice of these initial centroids has a major influence on the structure and quality of the final clusters produced, and hence an improper selection of these centroids can lead to unsatisfactory outcomes. Moreover, when starting from any given set of centroids, the K-means process for successively assigning objects and generating new centroids relies on a heuristic approximation of conditions necessary to achieve optimality, and the solution trajectory produced by this approach may not be particularly effective. Thus, researchers and practitioners have developed a variety of procedures in an attempt to overcome these limitations.
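The two-step iteration described above (assign each object to its nearest centroid, then recompute centroids) can be sketched in a few lines of plain Python. This is a minimal illustration of the baseline K-means procedure, not the implementation used in this paper; the random choice of initial centroids on the first line of the loop setup is exactly the sensitivity discussed above.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its members, for a fixed
    number of passes. Illustrative sketch only."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        for j, members in enumerate(clusters): # update step
            if members:                        # keep old centroid if empty
                centroids[j] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
cents, cls = kmeans(pts, 2)
```

On this toy dataset the two well-separated pairs of points end up in separate clusters regardless of which of the four points are drawn as initial centroids; on less separated data the initial draw matters far more, which motivates the initialization strategies discussed below.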
Xiong et al. [10] utilize the existing max-min distance algorithm as the foundation for improving the K-means approach, to overcome the great fluctuations in the final outcomes produced by generating initial centroids randomly. This approach accelerates the convergence of the K-means algorithm but does not perform well in terms of accuracy. Lin et al. [11] attempt to improve the K-means algorithm by integrating it with an approach based on particle swarm optimization and multiclass merging. Their experimental tests show that their approach yields better overall outcomes than the original K-means algorithm. In general, the solutions created by the K-means algorithm are influenced greatly by the initial solution settings. Poorly selected initial solutions lead to undesirable final solutions. To overcome this problem, Celebi et al. [12] and Wang and Bai [13] propose a variety of methods to generate better initial solutions rather than relying on randomly picked centroids. The goal of generating better initial solutions is also one of the motivations for our paper.
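A common theme of such initialization methods is to spread the initial centroids apart rather than draw them at random. A simple deterministic variant in the spirit of the max-min distance rule is farthest-first seeding: pick one point, then repeatedly pick the point farthest from all centroids chosen so far. The sketch below is illustrative and is not the specific procedure of Xiong et al. [10] or Celebi et al. [12].

```python
def farthest_first_centroids(points, k):
    """Farthest-first seeding: start from the first point, then greedily
    add the point whose minimum squared distance to the centroids chosen
    so far is largest. Illustrative sketch of max-min style seeding."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(d2(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids

cents = farthest_first_centroids([(0, 0), (1, 0), (10, 0), (10, 1)], 2)
```

Because the seeds are forced apart, runs started from them fluctuate far less than runs started from uniformly random centroids, at the cost of some sensitivity to outliers.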
Furthermore, like other heuristics, the K-means algorithm suffers from its susceptibility to become trapped in local optima. To address this issue, Cao et al. [14] propose, for a data analytics project, a Tabu Search algorithm that aims to produce tighter and more cohesive clusters by incorporating criteria for these characteristics in the objective function. The algorithm succeeds both in obtaining better global solutions and in producing clusters that are more cohesive, but the computation time is greater than that required by a well-implemented K-means method. To address the performance of K-means in solving large-scale problems, Bhimani et al. [15] propose three parallel computational models: shared memory, distributed memory, and GPU-based heterogeneous computing. Computational experiments on image datasets showed these parallel versions of K-means to be, on average, 30 times faster than the sequential counterpart. To overcome the well-known sensitivity of K-means to the initial solution, a set of candidate solutions is first generated in parallel and the best is selected to initiate the algorithm. This approach proved essential to maximize algorithm speedup. Xu and Cao [16] present a method for parallelizing a Tabu Search clustering algorithm utilizing a subspace partitioning principle. The computational results are appealing in terms of both solution quality and computing performance, but a gap remains in achieving ideal outcomes. The implementation utilizes a multi-core platform to run a multi-thread version of the Tabu Search clustering processes while employing a subspace partitioning principle to carry out the data transactions. However, the underlying algorithm structure is not compatible with a big data computational framework such as Spark.
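The escape mechanism that distinguishes Tabu Search from K-means can be illustrated on the clustering problem itself: moves reassign a point to another cluster, recently reversed moves are held in a short tabu list, and the best admissible move is taken even when it worsens the cost, while the best solution seen is tracked separately. The following is a minimal illustrative sketch under these assumptions, not the algorithm of Cao et al. [14] or the one proposed in this paper.

```python
from collections import deque
import itertools

def cluster_cost(points, assign, k):
    """Sum of squared distances from each point to its cluster mean."""
    cost = 0.0
    for j in range(k):
        members = [p for p, a in zip(points, assign) if a == j]
        if not members:
            continue
        mean = tuple(sum(xs) / len(members) for xs in zip(*members))
        cost += sum(sum((x - m) ** 2 for x, m in zip(p, mean))
                    for p in members)
    return cost

def tabu_cluster(points, k, iters=30, tenure=5):
    """Minimal Tabu Search over reassignment moves: evaluate every
    (point, cluster) move, skip moves on the tabu list, accept the best
    remaining move even if it worsens the cost, and remember the best
    assignment found. Illustrative sketch only."""
    n = len(points)
    assign = [i % k for i in range(n)]          # arbitrary initial assignment
    best, best_cost = assign[:], cluster_cost(points, assign, k)
    tabu = deque(maxlen=tenure)                 # short-term memory
    for _ in range(iters):
        candidates = []
        for i, j in itertools.product(range(n), range(k)):
            if assign[i] == j or (i, j) in tabu:
                continue
            trial = assign[:]
            trial[i] = j
            candidates.append((cluster_cost(points, trial, k), i, j, trial))
        if not candidates:
            break
        cost, i, j, trial = min(candidates)
        tabu.append((i, assign[i]))             # forbid moving i straight back
        assign = trial
        if cost < best_cost:
            best, best_cost = trial[:], cost
    return best, best_cost

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
best, best_cost = tabu_cluster(pts, 2)
```

Because the search always moves, a poor start such as the alternating assignment above does not trap it; the tabu list prevents immediate cycling back while the incumbent preserves the best clustering found. The full neighborhood evaluation in the inner loop is exactly the step that becomes expensive at scale, motivating the parallel implementation studied in this paper.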
With the existence of big data computing infrastructures including cloud computing, we are now increasingly able to analyze and process large volumes of data efficiently. However, a variety of classic optimization problems that require many iterations to solve have yet to benefit from these latest technologies. Recourse is sometimes made to heuristics that yield local optima of questionable quality. More ambitious metaheuristic based approaches yield better solutions but in some cases face a challenge to produce such solutions within a reasonable span of computation time. As more businesses move their IT services to centralized environments such as cloud platforms, it is necessary to develop and implement optimization algorithms more efficiently to accommodate the ever-increasing scale of practical problems. Consequently, we are motivated to explore the possibilities of utilizing big data computing infrastructures like Spark to solve large-scale optimization problems.
As one of a series of research projects, in this paper we address two major issues encountered in solving large-scale clustering problems, namely, the potentially poor quality of local optima obtained by simple clustering algorithms such as K-means, and the generally poor computational times produced by more complicated clustering algorithms. We propose a Tabu Search strategy to tackle the local optimality problem in conjunction with a Spark platform parallel processing implementation that makes it possible to handle large-scale problems more efficiently. The main contributions of the paper are:
- Design and implement the parallel mechanism for the algorithm to operate within a big data computing infrastructure.
- Develop a strategy for generating initial clustering solutions to provide stable and robust outcomes.
- Design the solution neighborhood structure and an associated candidate list selection strategy so that the solution procedure will be capable of effectively exploiting the MapReduce operations.
- Establish the merit and feasibility of applying metaheuristics such as Tabu Search within the Spark environment, thereby encouraging other researchers to explore the use of metaheuristics in big data environments to solve large-scale optimization problems.
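The third contribution hinges on expressing the centroid-driven assignment step as a map over data points followed by a reduce by cluster key, which is the pattern Spark parallelizes. The plain-Python simulation below shows that pattern; the function names `assign_map` and `reduce_by_key` are illustrative stand-ins for Spark's `map` and `reduceByKey` operations, not Spark's API.

```python
from collections import defaultdict

def assign_map(point, centroids):
    """Map step: emit (nearest-centroid index, (point, count=1))."""
    j = min(range(len(centroids)),
            key=lambda j: sum((a - b) ** 2
                              for a, b in zip(point, centroids[j])))
    return j, (point, 1)

def reduce_by_key(pairs):
    """Reduce step: per cluster key, sum coordinates and counts, then
    average to get the new centroid. Mirrors reduceByKey + mapValues."""
    sums = defaultdict(lambda: None)
    for j, (p, c) in pairs:
        if sums[j] is None:
            sums[j] = (list(p), c)
        else:
            acc, cnt = sums[j]
            for d, x in enumerate(p):
                acc[d] += x
            sums[j] = (acc, cnt + c)
    return {j: tuple(x / cnt for x in acc) for j, (acc, cnt) in sums.items()}

points = [(0.0, 0.0), (0.2, 0.0), (4.0, 4.0), (4.2, 4.0)]
centroids = [(0.0, 0.0), (4.0, 4.0)]
new_centroids = reduce_by_key(assign_map(p, centroids) for p in points)
```

Each map call needs only the point and the (small, broadcastable) centroid list, never the whole dataset, which is what makes a centroid-driven search amenable to this decomposition.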
The paper is organized as follows: Section 2 describes the clustering model, the Tabu Search based clustering algorithm, and its parallel implementation on Spark platform. Computational results are presented in Section 3. Finally, Section 4 concludes the paper with our findings and future research directions.
The model
In the following we represent similarity by a distance measure, and seek a collection of clusters that minimizes intra-cluster distance and maximizes inter-cluster distance. We refer to the objects to be clustered as data points and to the set of objects as a dataset. Consider a dataset of Np objects. Each data point in the dataset has k attributes, i.e., it is k-dimensional. A data point xt will be represented by a vector xt = (xt1, xt2, …, xtk). The underlying dataset then can be represented by X = {x1, x2, …, xNp}.
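Under these definitions, a standard centroid-based objective capturing the intra-cluster criterion can be written as follows (the symbols K, Cj, and cj for the number of clusters, the clusters, and their centroids are introduced here for illustration; the snippet above truncates before the paper's own formulation):

```latex
\min_{C_1,\dots,C_K} \; \sum_{j=1}^{K} \sum_{x_t \in C_j} \lVert x_t - c_j \rVert^2,
\qquad
c_j = \frac{1}{|C_j|} \sum_{x_t \in C_j} x_t
```

Minimizing this sum tightens each cluster around its centroid, which indirectly drives clusters apart.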
Experiments
We have implemented the algorithm presented in Section 2 on the Spark platform using a computational environment consisting of four machines, divided into one master node and three slave nodes. The hardware configuration of each machine is the same: Intel Core 2 2.2 GHz CPU, 2 GB RAM, 500 GB hard disk, 100 Mb/s Ethernet. The overall architecture is depicted as follows:
The datasets (Iris, Wine, Yeast, and Seeds) for the computational experiments are downloaded from the UCI open dataset
Conclusions
Our Tabu Search based clustering algorithm utilizes the centroid-driven orientation of the K-means algorithm under the guidance of a simple version of Tabu Search. Given that the non-centroid data points can be assigned to the proper clusters based on their distances to the centroids without knowing the individual “coordinates” or attributes of each data point, the centroid-driven strategy of our algorithm facilitates its parallel implementation in the Spark environment. One of the primary
Acknowledgements
We are indebted to two anonymous reviewers for insightful observations and suggestions that have helped to improve our paper. This work was partially supported by the China Intelligent Urbanization Co-Creation Center [grant number CIUC20150011].
References (24)
- Jain, Data clustering: 50 years beyond K-means, Pattern Recognit. Lett. (2010)
- Łuczak, Hierarchical clustering of time series data with parametric derivative dynamic time warping, Expert Syst. Appl. (2016)
- Glatman-Freedman et al., Near real-time space-time cluster analysis for detection of enteric disease outbreaks in a community setting, J. Infect. (2016)
- Celebi et al., A comparative study of efficient initialization methods for the K-means clustering algorithm, Expert Syst. Appl. (2013)
- Node-ejection chains for the vehicle routing problem: sequential and parallel algorithms, Parallel Comput. (2001)
- Implementation of a parallel algorithm based on a spark cloud computing platform, Algorithms (2015)
- Surrogate constraint normalization for the set covering problem, Eur. J. Oper. Res. (2010)
- Jain et al., Data clustering: a review, ACM Comput. Surv. (CSUR) (1999)
- Berkhin, Survey of Clustering Data Mining Techniques (2002)
- Grabmeier and Rudolph, Techniques of cluster algorithms in data mining, Data Min. Knowl. Discov. (2002)
- Xu and Wunsch, Survey of clustering algorithms, IEEE Trans. Neural Netw.
- Cao and Glover, Creating balanced and connected clusters for improved service delivery routes in logistics planning, J. Syst. Sci. Syst. Eng.