Applied Soft Computing

Volume 63, February 2018, Pages 97-109

A Tabu search based clustering algorithm and its parallel implementation on Spark

https://doi.org/10.1016/j.asoc.2017.11.038

Highlights

  • Our Tabu Search based clustering algorithm produces more accurate and stable solutions compared to the widely-applied Spark MLlib K-means algorithm.

  • In addition to these advantages, the parallelized Tabu Search based clustering algorithm achieves an acceleration rate similar to that of the K-means algorithm in Spark MLlib.

  • To our knowledge, this is the first Tabu Search based clustering algorithm that has been implemented on the Spark platform.

Abstract

The well-known K-means clustering algorithm has been employed widely in different application domains ranging from data analytics to logistics applications. However, the K-means algorithm can be affected by factors such as the initial choice of centroids and can readily become trapped in a local optimum. In this paper, we propose an improved K-means clustering algorithm that is augmented by a Tabu Search strategy, and which is better adapted to meet the needs of big data applications. Our design focuses on enhancements to take advantage of parallel processing based on the Spark framework. Computational experiments demonstrate the superiority of our parallel Tabu Search based clustering algorithm over a widely used version of the K-means approach embodied in the parallel Spark MLlib system, comparing the algorithms in terms of scalability, accuracy, and effectiveness.

Introduction

The purpose of a clustering process is to group a set of (abstract or physical) objects into multiple classes, so that the objects in each class (cluster) are similar according to certain rules or criteria. In general, a clustering algorithm builds clusters according to two interrelated criteria: objects placed in the same cluster should be as similar as possible, while objects placed in different clusters should be as dissimilar as possible, where the definition of similarity may be problem dependent. Clustering problems can be found in applications ranging from data analytics to logistics, as documented in the surveys of Jain et al. [1], Berkhin [2], Grabmeier and Rudolph [3], Xu and Wunsch [4], and Jain [5]. The search for new clustering algorithms that best fit the different applications has proved fundamentally important to many recent advances in the domains of biology, genetics, medicine, business, engineering, and social science, among others.

Data-driven decision making, as well as the burgeoning demand for data analytics, has inspired increasing numbers of scholars and practitioners to develop and apply clustering algorithms. Clustering is generally known as an unsupervised learning method (since no prior knowledge is provided that determines which objects should be grouped in a common cluster) and plays a crucial role in finding patterns and trends in datasets. The role of clustering is highlighted by Grabmeier and Rudolph [3], who propose a variety of criteria to evaluate the quality of clusters along with suggestions about how to select solution methodologies. From another perspective, Łuczak [6] proposes a hierarchical clustering approach for classification problems involving time series datasets. Although Łuczak’s method works well for classifying time series data, it cannot be applied directly to most other clustering problems without major modifications.

Data analytics based on clustering is especially pervasive in the public health arena. Glatman-Freedman et al. [7] discuss the use of near real-time spatiotemporal cluster analysis to devise strategies for combatting some enteric bacterial diseases. With the help of this type of analysis, the source of a disease can be identified in a timely manner to enable appropriate measures to be taken before it becomes widespread. Clustering applications also abound in the field of logistics, where spatial and other relational restrictions often exist to limit the choice of objects that can lie in a common cluster. To solve the problems encountered in such applications, Cao and Glover [8] present an algorithm based on Thiessen polygons that proves highly effective in generating clusters that satisfy restrictions on balancing and connectivity. In this case, the entire service area for the logistics application is divided into several geographically connected and balanced subareas that account for geographic obstacles and constraints imposed by business logic, enabling a logistics service provider to offer better and more efficient services for customers.

Among various clustering approaches, the K-means clustering algorithm [9] is one of the most popular and widely applied. Modern data analytical solver packages, including open source packages such as R and Spark MLlib, include the K-means clustering algorithm among their offerings. Nevertheless, the K-means clustering algorithm exhibits some limitations that need to be addressed to solve clustering problems more effectively. Notably, the algorithm relies on an initial determination of a set of centroids in order to launch subsequent steps that assign objects to these centroids and thereby identify new centroids. The choice of these initial centroids has a major influence on the structure and quality of the final clusters produced, and hence an improper selection of these centroids can lead to unsatisfactory outcomes. Moreover, when starting from any given set of centroids, the K-means process for successively assigning objects and generating new centroids relies on a heuristic approximation of conditions necessary to achieve optimality, and the solution trajectory produced by this approach may not be particularly effective. Thus, researchers and practitioners have developed a variety of procedures in an attempt to overcome these limitations.
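
For concreteness, the Spark MLlib K-means baseline referred to throughout this paper is typically invoked along the following lines (a minimal sketch using the RDD-based MLlib API; the input path, number of clusters, and iteration count are illustrative placeholders rather than the settings used in our experiments):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.{SparkConf, SparkContext}

    object MLlibKMeansBaseline {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("KMeansBaseline"))

        // Parse a whitespace-separated numeric dataset into an RDD of vectors
        // ("data/points.txt" is a placeholder path).
        val points = sc.textFile("data/points.txt")
          .map(_.trim)
          .filter(_.nonEmpty)
          .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
          .cache()

        // Train K-means with illustrative values: 3 clusters, 20 iterations.
        val model = KMeans.train(points, 3, 20)

        // Within Set Sum of Squared Errors: the usual K-means quality measure.
        println(s"WSSSE = ${model.computeCost(points)}")

        sc.stop()
      }
    }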

Xiong et al. [10] utilize the existing max-min distance algorithm as the foundation for improving the K-means approach, to overcome the great fluctuations in the final outcomes produced by generating initial centroids randomly. This approach accelerates the convergence of the K-means algorithm but does not perform well in terms of accuracy. Lin et al. [11] attempt to improve the K-means algorithm by integrating it with an approach based on particle swarm optimization and multiclass merging. Their experimental tests show that their approach yields better overall outcomes than the original K-means algorithm. In general, the solutions created by the K-means algorithm are influenced greatly by the initial solution settings: poorly selected initial solutions lead to undesirable final solutions. To overcome this problem, Celebi et al. [12] and Wang and Bai [13] propose a variety of methods to generate better initial solutions rather than relying on randomly picked centroids. The goal of generating better initial solutions is also one of the motivations for our paper.
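
To make the idea of principled initialization concrete, the sketch below selects initial centroids with a farthest-first (max-min distance) rule on a local array of points. It illustrates only the general strategy discussed above and is not the specific initialization of [10], [12], [13], or the one proposed later in this paper:

    // Farthest-first (max-min distance) seeding: start from one point and
    // repeatedly add the point farthest from the centroids chosen so far.
    def squaredDist(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

    def farthestFirstInit(points: Array[Array[Double]], k: Int): Array[Array[Double]] = {
      val centroids = scala.collection.mutable.ArrayBuffer(points.head)
      while (centroids.length < k) {
        // Choose the point whose distance to its nearest chosen centroid is largest.
        val next = points.maxBy(p => centroids.map(c => squaredDist(p, c)).min)
        centroids += next
      }
      centroids.toArray
    }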

Furthermore, like other heuristics, the K-means algorithm suffers from its susceptibility to become trapped in local optima. To address this issue, Cao et al. [14] propose, for a data analytics project, a Tabu Search algorithm that aims to produce tighter and more cohesive clusters by incorporating criteria for these characteristics in the objective function. The algorithm succeeds both in obtaining better global solutions and in producing clusters that are more cohesive, but its computation time is greater than that required by a well-implemented K-means method. To address the performance of K-means in solving large-scale problems, Bhimani et al. [15] propose three parallel computational models based on shared memory, distributed memory, and GPU-based heterogeneous computing. Computational experiments on image datasets show these parallel versions of K-means to be 30 times faster on average than their sequential counterpart. To overcome the well-known sensitivity of K-means to the initial solution, a set of candidate solutions is first generated in parallel and the best is selected to initiate the algorithm; this approach proved essential to maximizing the speedup of the algorithm. Xu and Cao [16] present a method for parallelizing a Tabu Search clustering algorithm utilizing a subspace partitioning principle. The computational results are appealing in terms of both solution quality and computing performance, but a gap remains in achieving ideal outcomes. The implementation utilizes a multi-core platform to run a multi-threaded version of the Tabu Search clustering processes while employing a subspace partitioning principle to carry out the data transactions. However, the underlying algorithm structure is not compatible with a big data computational framework such as Spark.
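
To fix ideas before the formal development in Section 2, a generic sequential Tabu Search skeleton for centroid-based clustering is sketched below. It shows only the common ingredients of such methods (a neighborhood of centroid-replacement moves, a tabu list with a fixed tenure, and a best-candidate acceptance rule); the neighborhood structure, candidate list strategy, and parallelization actually used in this paper are described in Section 2:

    import scala.collection.mutable
    import scala.util.Random

    // Generic Tabu Search skeleton for clustering (sequential, illustrative only).
    // A move replaces one centroid with a sampled data point; the indices of
    // recently used points stay tabu for a fixed number of iterations.
    object TabuClusteringSketch {
      def sqDist(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      def cost(points: Array[Array[Double]], centroids: Array[Array[Double]]): Double =
        points.map(p => centroids.map(c => sqDist(p, c)).min).sum

      def search(points: Array[Array[Double]],
                 initial: Array[Array[Double]],
                 iterations: Int,
                 tenure: Int): Array[Array[Double]] = {
        var current = initial
        var best = initial
        var bestCost = cost(points, best)
        val tabu = mutable.Queue.empty[Int] // indices of recently used data points

        for (_ <- 0 until iterations) {
          // Candidate moves: replace each centroid by one of a few sampled,
          // non-tabu data points (a simple candidate list strategy).
          val candidates = for {
            ci <- current.indices
            pi <- Random.shuffle(points.indices.toList).take(5)
            if !tabu.contains(pi)
          } yield (pi, current.updated(ci, points(pi)))

          if (candidates.nonEmpty) {
            val (pi, neighbor) = candidates.minBy { case (_, c) => cost(points, c) }
            current = neighbor
            val c = cost(points, current)
            if (c < bestCost) { best = current; bestCost = c }
            tabu.enqueue(pi)                         // forbid reusing this point...
            if (tabu.length > tenure) tabu.dequeue() // ...until its tenure expires
          }
        }
        best
      }
    }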

With the existence of big data computing infrastructures including cloud computing, we are now increasingly able to analyze and process large volumes of data efficiently. However, a variety of classic optimization problems that require many iterations to solve have yet to benefit from these latest technologies. Recourse is sometimes made to heuristics that yield local optima of questionable quality. More ambitious metaheuristic based approaches yield better solutions but in some cases face a challenge to produce such solutions within a reasonable span of computation time. As more businesses move their IT services to centralized environments such as cloud platforms, it is necessary to develop and implement optimization algorithms more efficiently to accommodate the ever-increasing scale of practical problems. Consequently, we are motivated to explore the possibilities of utilizing big data computing infrastructures like Spark to solve large-scale optimization problems.

As one of a series of research projects, in this paper we address two major issues encountered in solving large-scale clustering problems, namely, the potentially poor quality of local optima obtained by simple clustering algorithms such as K-means, and the generally poor computational times produced by more complicated clustering algorithms. We propose a Tabu Search strategy to tackle the local optimality problem in conjunction with a Spark platform parallel processing implementation that makes it possible to handle large-scale problems more efficiently. The main contributions of the paper are:

  • Design and implement the parallel mechanism for the algorithm to operate within a big data computing infrastructure.

  • Develop a strategy for generating initial clustering solutions to provide stable and robust outcomes.

  • Design the solution neighborhood structure and an associated candidate list selection strategy so that the solution procedure will be capable of effectively exploiting the MapReduce operations (a sketch of this idea follows the list).

  • Establish the merit and feasibility of applying metaheuristics such as Tabu Search within the Spark environment, thereby encouraging other researchers to explore the use of metaheuristics in big data environments to solve large-scale optimization problems.
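
As an illustration of the third contribution above, the following sketch shows how the cost of a single candidate solution (one set of centroids) can be evaluated in one pass over an RDD of data points by broadcasting the centroids and applying a map/reduce step. It is a simplified, assumption-level sketch of how such an evaluation can exploit Spark's operations, not the exact routine of our implementation, which is given in Section 2:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Evaluate one candidate (a set of centroids) in parallel: broadcast the
    // centroids, map each point to the distance to its nearest centroid, and
    // reduce to the total intra-cluster cost.
    def candidateCost(sc: SparkContext,
                      points: RDD[Array[Double]],
                      centroids: Array[Array[Double]]): Double = {
      val bc = sc.broadcast(centroids)
      points.map { p =>
        bc.value.map { c =>
          p.zip(c).map { case (x, y) => (x - y) * (x - y) }.sum
        }.min
      }.reduce(_ + _)
    }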

The paper is organized as follows: Section 2 describes the clustering model, the Tabu Search based clustering algorithm, and its parallel implementation on the Spark platform. Computational results are presented in Section 3. Finally, Section 4 concludes the paper with our findings and future research directions.

Section snippets

The model

In the following we represent similarity by a distance measure, and seek a collection of clusters that minimizes intra-cluster distance and maximizes inter-cluster distance. We refer to the objects to be clustered as data points and to the set of objects as a dataset. Consider a dataset of Np objects. Each data point in the dataset has k attributes, i.e., it is k-dimensional. A data point xt will be represented by a vector xt = (xt1, xt2, …, xtk). The underlying dataset then can be represented by X =
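
The snippet above is truncated. For orientation, a standard way to formalize the "minimize intra-cluster distance" criterion for a dataset partitioned into m clusters C_1, …, C_m with centroids c_1, …, c_m is the within-cluster sum of squared distances below. This is the conventional K-means-style objective and is given here only as an assumption-level illustration (the symbol m for the number of clusters is ours); the exact model used in this paper appears in the full text of Section 2.

    \min_{C_1,\dots,C_m} \sum_{j=1}^{m} \sum_{x_t \in C_j} \lVert x_t - c_j \rVert^2,
    \qquad c_j = \frac{1}{|C_j|} \sum_{x_t \in C_j} x_t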

Experiments

We have implemented the algorithm presented in Section 2 on the Spark platform using a computational environment consisting of four machines, divided into one master node and three slave nodes. The hardware configuration of each machine is the same: Intel Core 2 2.2 GHz CPU, 2 GB RAM, 500 GB hard disk, and 100 Mbit/s Ethernet. The overall architecture is depicted as follows:
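
For context, a Spark application can be pointed at a standalone cluster of this kind (one master, three workers) with a configuration along the following lines; the master URL, memory budget, and core count are assumptions chosen to match the hardware above rather than the exact settings of our experiments:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative driver configuration for a one-master, three-worker
    // standalone Spark cluster; all values below are assumptions.
    val conf = new SparkConf()
      .setAppName("TabuSearchClustering")
      .setMaster("spark://master:7077")    // hypothetical master host and port
      .set("spark.executor.memory", "1g")  // leaves headroom on the 2 GB nodes
      .set("spark.cores.max", "6")         // assumed total cores across workers
    val sc = new SparkContext(conf)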

The datasets (Iris, Wine, Yeast, and Seeds) for the computational experiments are downloaded from the UCI open dataset

Conclusions

Our Tabu Search based clustering algorithm utilizes the centroid-driven orientation of the K-means algorithm under the guidance of a simple version of Tabu Search. Given that the non-centroid data points can be assigned to the proper clusters based on their distances to the centroids without knowing the individual “coordinates” or attributes of each data point, the centroid-driven strategy of our algorithm facilitates its parallel implementation in the Spark environment. One of the primary

Acknowledgements

We are indebted to two anonymous reviewers for insightful observations and suggestions that have helped to improve our paper. This work was partially supported by the China Intelligent Urbanization Co-Creation Center [grant number CIUC20150011].

References (24)

  • R. Xu et al., Survey of clustering algorithms, IEEE Trans. Neural Netw. (2005)

  • B. Cao et al., Creating balanced and connected clusters for improved service delivery routes in logistics planning, J. Syst. Sci. Syst. Eng. (2010)

Cited by (35)

  • How to Use K-means for Big Data Clustering?

    2023, Pattern Recognition

    Citation Excerpt:

    Various metaheuristics or meta-ideas are usually built into the multi-start process to obtain optimal initializations. Some examples are genetic search [11,13], variable neighborhood search (VNS) [12,33], Greedy Randomized Adaptive Search Procedure (GRASP) [31], simulated annealing [34], tabu search [35], and others. Consequently, a topical task is to build such an algorithm that would allow organizing the global search process in big data conditions without using any known metaheuristics or meta-ideas.

  • A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects

    2022, Engineering Applications of Artificial Intelligence

    Citation Excerpt:

    Sung and Jin (2000) combined the tabu search algorithm with complimentary packing and releasing procedures for solving the clustering problem. Lu et al. (2018) proposed a tabu search-based clustering algorithm and its parallel implementation on Spark. Their design was adapted to alleviate the challenges associated with big data applications taking advantage of the parallel processing based on the Spark framework.

  • Multistart solution-based tabu search for the Set-Union Knapsack Problem

    2021, Applied Soft Computing

    Citation Excerpt:

    The tabu search technology [23] has been successfully applied to solve many difficult optimization problems. Although most studies rely on the popular and well-known attributed-based tabu search (ABTS) as exemplified by the studies of [21,22,24–26], recent studies indicated that the solution-based tabu search (SBTS) [27,28] is a highly competitive approach for solving several notoriously difficult binary optimization problems such as 0/1 multidimensional knapsack [29], multidemand multidimensional knapsack [30], minimum differential dispersion [31], and maximum min-sum dispersion [32] and obnoxious p-median [33]. Compared to the ABTS method, SBTS has the advantage of avoiding the use of tabu tenure and simplifying the determination of tabu status.

  • A simulated annealing algorithm with a dual perturbation method for clustering

    2021, Pattern Recognition

    Citation Excerpt:

    Metaheuristics serve as a different method to solve this local optima problem [19]. Some authors [20,21] used the local search heuristic while some others [22,23] applied the Tabu search method [24], and some others [25,26] used genetic algorithms. Nature-based metaheuristic methods have also been explored [27].