Published in: Journal of Big Data 1/2017

Open Access 01-12-2017 | Research

Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop

Authors: Chowdam Sreedhar, Nagulapally Kasiviswanath, Pakanti Chenna Reddy

Published in: Journal of Big Data | Issue 1/2017

Abstract

Big data has become popular for processing, storing and managing massive volumes of data. The clustering of datasets has become a challenging issue in the field of big data analytics. The K-means algorithm is best suited for finding similarities between entities based on distance measures and works well with small datasets. Existing clustering algorithms require scalable solutions to manage large datasets. This study presents two approaches to the clustering of large datasets using MapReduce. The first approach, K-Means Hadoop MapReduce (KM-HMR), focuses on the MapReduce implementation of standard K-means. The second approach enhances the quality of clusters to produce clusters with maximum inter-cluster and minimum intra-cluster distances for large datasets. The results of the proposed approaches show significant improvements in the efficiency of clustering in terms of execution times. Experiments conducted on standard K-means and the proposed solutions show that the KM-I2C approach is both effective and efficient.

Introduction

In recent years, datasets generated by machines have grown large in volume and have become globally distributed [1]. Big data can be described by various characteristics, namely volume, velocity, variety, veracity, value and volatility. Big data comprises datasets that are large and difficult to manage, acquire, store, analyse and visualize. A dataset can be defined as a collection of related sets of information that can have individual or multidimensional attributes. Large datasets involve volumes of data so massive that traditional database management systems cannot manage them. Big data technologies have become popular due to their ability to manage structured, unstructured and semi-structured data sources and formats [2] through the use of advanced data-intensive technologies. The increased size of datasets has boosted demand for efficient clustering techniques that satisfy memory use, document processing and execution time requirements. An issue related to big data concerns the grouping of objects such that data in the same group are more similar to one another than to those in other groups or clusters. Applications of big data are found in telecommunications, healthcare, bioinformatics, banking, marketing, biology, insurance, city planning, earthquake studies, web document classification and transport services.
Clustering is an important tool for data mining and knowledge discovery. The objective of clustering is to find meaningful groups of entities and to differentiate the clusters formed for a dataset. Traditional K-means clustering works well when applied to small datasets. Large datasets must be clustered such that every entity or data point in a cluster is similar to the other entities in the same cluster. Clustering can be applied across several disciplines [3]. The ability to automatically group similar items enables one to discover hidden similarities and key concepts while condensing a large amount of data into a few groups, which enables users to comprehend a large amount of data. Clusters can be classified as homogeneous or heterogeneous. In homogeneous clusters, all nodes have similar properties. Heterogeneous clusters are used in private data centres in which nodes have different characteristics and in which it may be difficult to distinguish between nodes [4].
Clustering methods require more precise definitions of observation and cluster similarity. When grouping is based on attributes, it is natural to employ familiar concepts of distance. A problem with this procedure arises when measuring distances between clusters comprising two or more observations. Unlike conventional statistical methods, most clustering algorithms do not rely on statistical distributions of the data and thus can be helpful when little prior knowledge exists about a certain issue [5]. McCallum et al. [6] described how the number of iterations can be reduced by partitioning a dataset into overlapping subsets and by only iterating over data objects within the overlapping areas.
When using flat file formats, data objects are represented as vectors in n-dimensional space; each vector describes an object characterized by n attributes, each of which has a single value. Almost all existing data analysis and data mining tools, such as clustering tools, inductive learning tools and statistical analysis tools, assume that the datasets to be analysed are represented in a structured file format. The well-known inductive learning environment [7] and a similar decision-tree-based rule induction algorithm [8], as well as conceptual clustering algorithms (e.g., COBWEB [9], AutoClass [10, 11], and ITERATE [12]) and statistical packages, make this assumption. Scalability is becoming a major concern when applying clustering algorithms as datasets increase in size; most are computationally expensive in terms of space and time. There is a need to manage such large volumes of data and to cluster them easily for data analytics while maximizing inter-cluster distances and minimizing intra-cluster distances. Furthermore, such algorithms should be efficient, scalable and highly accurate.
The K-means clustering algorithm is a popular unsupervised clustering technique used to identify similarities between objects based on distance measures and is suited to small datasets. Currently, datasets generated by sources such as Wikipedia, meteorological departments, telecommunications systems, and sensors are so large that traditional K-means clustering algorithms are no longer able to group related objects to develop meaningful insights. There is a need to enhance clustering algorithms to suit large datasets. Hadoop was designed to store datasets that can scale up to the petabyte level. Hence, the present work develops a clustering solution using Hadoop.
We apply the Hadoop MapReduce standard K-means clustering algorithm to manage large datasets and introduce a new metric for similarity measurements such that the distances between objects exhibit high levels of intra-cluster similarity and low levels of inter-cluster similarity.
The remainder of this study is organized as follows. Popular distance metrics and the need for Hadoop are discussed in “Background”. Approaches recently developed to improve clustering results are discussed in “Related work”. The proposed KM-HMR and KM-I2C algorithms are described in “Proposed work”. An extensive set of experiments on the datasets along with a comparison of existing and proposed solutions is presented in “Experimental evaluation”. Finally, conclusions and future work are discussed in “Conclusions”.

Background

Applications based on similarity measures include the following: data mining [13], image processing [14], and document clustering [15]. The most popular continuous distance metric is the Euclidean distance. The city-block (Manhattan) measure partitions data around a central median [16].
The following are popular distance metrics used in various clustering techniques:
a. Euclidean distance
b. Manhattan distance
c. Minkowski distance
d. Jaccard distance
Euclidean distance is defined as:
$$d\left( {i,j} \right) = \sqrt {(x_{i1} - x_{j1} )^{2} + (x_{i2} - x_{j2} )^{2} + \cdots + (x_{in} - x_{jn} )^{2} }$$
(1)
where \(i = (x_{i1}, x_{i2}, \ldots, x_{in})\) and \(j = (x_{j1}, x_{j2}, \ldots, x_{jn})\) are two n-dimensional data objects.
The Manhattan (or city-block) distance is defined as:
$$d\left( {i,j} \right) = \left| {(x_{i1} - x_{j1} )} \right| + \left| {(x_{i2} - x_{j2} )} \right| + \cdots + \left| {(x_{in} - x_{jn} )} \right|$$
(2)
where i and j are as defined in Eq. (1).
The Minkowski distance is defined as:
$$d\left( {i,j} \right) = \left( {\left| {x_{i1} - x_{j1} } \right|^{p} + \left| {x_{i2} - x_{j2} } \right|^{p} + \cdots + \left| {x_{in} - x_{jn} } \right|^{p} } \right)^{1/p}$$
(3)
where i and j are as defined in Eq. (1) and p is the order of the metric.
The Jaccard coefficient computes the similarity between two sets of concepts as follows:
$$J(U,A) = \frac{{\left| {U \cap A} \right|}}{{\left| {U \cup A} \right|}}$$
(4)
The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the two sets: \(\left| {U \cap A} \right|\) is the number of concepts in the intersection of U and A, and \(\left| {U \cup A} \right|\) is the number of concepts in their union. It measures the similarity of the two datasets U and A and is used to find binary differences between two or more objects; the corresponding Jaccard distance is obtained as 1 − J(U, A). A summary of the various distance metrics used in clustering is shown in Table 1.
Table 1
Summary of distance metrics

Distance metric | Equation | Explanation
Euclidean distance | \(D(i,j) = \sqrt{\sum\nolimits_{k=1}^{n} |x_{ik} - x_{jk}|^{2}}\) | The most widely used distance metric. Suitable for small datasets
Manhattan distance | \(D(i,j) = \sum\nolimits_{k=1}^{n} |x_{ik} - x_{jk}|\) | Distance based on absolute values. Partitions are measured around a median centre. Suitable for compact clusters
Minkowski distance | \(D(i,j) = \sqrt[p]{\sum\nolimits_{k=1}^{n} |x_{ik} - x_{jk}|^{p}}\) | Features with large values and variances tend to dominate other features. Suitable for numeric datasets
Jaccard distance | \(J(U,A) = \frac{|U \cap A|}{|U \cup A|}\) | Used as a measure of similarity. Generally applied to binary values to measure distances between objects
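To make the four metrics concrete, the following minimal Python sketch (not part of the original article) implements Eqs. (1)–(4) for plain numeric tuples and sets; the function names and sample values are illustrative assumptions.

```python
import math

def euclidean(x, y):
    # Eq. (1): square root of the summed squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # Eq. (2): sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, p=3):
    # Eq. (3): generalises Manhattan (p = 1) and Euclidean (p = 2)
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def jaccard_coefficient(u, a):
    # Eq. (4): size of the intersection divided by the size of the union
    u, a = set(u), set(a)
    return len(u & a) / len(u | a)

if __name__ == "__main__":
    i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 8.0)
    print(euclidean(i, j))                              # ~7.071
    print(manhattan(i, j))                              # 12.0
    print(minkowski(i, j, p=2))                         # same as Euclidean
    print(jaccard_coefficient({"a", "b"}, {"b", "c"}))  # ~0.333
```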
Hadoop was created by Doug Cutting in 2005 and has its origins in Apache Nutch, an open source Internet search engine. Apache Hadoop is an open source implementation of MapReduce, a framework designed for the in-depth analysis and processing of large volumes of data. Hadoop can analyse complex data of various formats, such as structured, semi-structured and unstructured data, across several machines. Examples of unstructured data include books, journals, documents and metadata. Structured data include quantitative facts such as numbers and dates. XML and JSON documents are examples of semi-structured data. Hadoop manages its own file system, the Hadoop distributed file system (HDFS), which replicates data to several nodes to ensure its availability. Large collections of unstructured data must be managed using a clustering algorithm with high levels of intra-cluster similarity and low levels of inter-cluster similarity to derive insights. Applications using unstructured data include, for instance, porphyry copper deposit datasets [17].
Hadoop and Spark serve the same purpose of processing large datasets. Spark is an open source platform that can execute map and reduce jobs several times faster than Hadoop, but the overall performance of a platform cannot be judged on a single criterion [18]. Hadoop offers the following advantages over Spark. Spark’s performance is mainly hindered by the fact that it executes more map and reduce tasks than Hadoop, and this can affect resources and node workloads when running clustering jobs. Spark uses a considerable amount of memory to shuffle files written onto disks because it does not manage its own file management system [19]. Hadoop may also be preferred over Spark for the following reasons: low-cost hardware, lower latency, and the capacity to manage large datasets with the help of the HDFS.
In this work, we use Hadoop instead of Spark to enhance overall performance levels over a reasonable execution period using low-cost machines whereby data operations and reporting requirements are static and where data are processed via batch processing.
Table 2 illustrates the benefits of Hadoop technologies relative to traditional relational database management systems. Hadoop is best known for offering two main features: HDFS and MapReduce. Other sub-projects of Hadoop technologies include the following:
Table 2
Need for Hadoop technology

Criterion | Traditional RDBMS | Hadoop MapReduce
Size | Up to gigabytes | Extends to petabytes
Processing | Real-time and batch | Batch
Updates | Reads and writes many times | Writes once and reads many times
Structure | Static schema | Dynamic schema
Integrity | High | Low
Scaling | Non-linear | Linear
  • Pig [20]: A high-level data flow language and execution framework for parallel computation.
  • Zookeeper [21]: A high-performance coordination service for distributed applications.
  • HCatalog [22]: A table and storage management service for Hadoop data.
  • Hive [23]: An SQL-like interface for data stored in Hadoop.
  • Avro [24]: A language-neutral data serialization system and language-independent scheme associated with read and write operations.
  • HBASE [25]: An open-source non-relational distributed database that runs on top of the HDFS and provides Hadoop with Bigtable-like capabilities.
  • Chukwa [26]: An open-source tool for monitoring the Hadoop cluster.
A MapReduce job splits a dataset into chunks of data such that map tasks process them in parallel. The framework sorts the outputs of the maps, which then serve as input to the reduce tasks. The inputs and outputs of a job are stored in a file system. The framework schedules tasks, monitors them and re-executes failed tasks. The HDFS is a distributed file system that uses a master/slave architecture and is capable of storing large volumes of data. MapReduce jobs are executed on a cluster of nodes and carry out large-scale data operations that are scalable and feasible to execute in a reasonable amount of time for complex or large datasets. MapReduce works in parallel across a cluster of computers, making it easy to scale to large datasets.
MapReduce programming is designed for computer clusters. MapReduce applications can process large datasets using several low-cost machines, collectively referred to as a cluster; the individual computers in a cluster are often referred to as nodes. MapReduce involves two broad computational stages (a map phase and a reduce phase) that are applied in sequence to large volumes of data. The map function is applied to each line of data and breaks the data into key-value pairs. A reducer function is then applied to all key-value pairs sharing the same key.
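To illustrate the map and reduce phases described above, the following Python sketch mimics a Hadoop Streaming-style mapper and reducer for one K-means assignment step. It is an illustrative sketch only: the centroid values, function names and in-memory grouping are assumptions, not the Hadoop API or the authors' implementation (a real job would read centroids from the HDFS or distributed cache and let Hadoop perform the shuffle and sort).

```python
# Hypothetical centroids; in a real job these would be read from a shared
# file in the HDFS or from the distributed cache.
CENTROIDS = {0: [1.0, 1.0], 1: [10.0, 10.0]}

def nearest_centroid(point):
    # Assign a point to the closest centroid (squared Euclidean distance).
    return min(CENTROIDS,
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, CENTROIDS[c])))

def mapper(lines):
    # Map phase: each input line is one data object; emit (cluster id, point).
    for line in lines:
        point = [float(v) for v in line.split(",")]
        yield nearest_centroid(point), point

def reducer(pairs):
    # Reduce phase: all points sharing a key (cluster id) arrive together;
    # recompute the centroid as the component-wise mean.
    groups = {}
    for cid, point in pairs:
        groups.setdefault(cid, []).append(point)
    for cid, points in groups.items():
        yield cid, [sum(col) / len(points) for col in zip(*points)]

if __name__ == "__main__":
    data = ["0.9,1.1", "1.2,0.8", "9.5,10.5"]
    print(list(reducer(mapper(data))))  # [(0, [1.05, 0.95]), (1, [9.5, 10.5])]
```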
Cordeiro et al. [27] focus on problems faced when clustering large datasets (e.g., obtaining efficient clusters with better execution times) and propose a parallel clustering (ParC) method for data partitioning that leverages data locality in MapReduce to cluster large datasets. When clustering large datasets, parallel processing is best suited to Hadoop MapReduce due to its capacity to divide large datasets into chunks, store them on worker nodes and apply the algorithm to the data where it is stored. Best of both Worlds (BoW) is an adaptive technique used to dynamically select the best way to minimize costs. The top ten eigenvectors of the Twitter graph, stored as an adjacency matrix, were considered, and experiments conducted using the proposed solutions revealed good results. ParC focuses on two issues: minimizing I/O costs and reducing network costs. In developing a good clustering algorithm, speedup and scale-up are two major issues that must be exploited through parallelism techniques to generate efficient clustering algorithms.
Parallel K-means clustering based on MapReduce [28] clusters large datasets by applying the K-means clustering algorithm via the Hadoop MapReduce framework. A fast K-means clustering algorithm implemented through MapReduce has also been proposed. The mean values of the data points in clusters are used to measure cluster similarity, and MapReduce computes the distances between data points simultaneously. The map function reads input datasets stored in the HDFS and assigns data objects or points to the closest centre, and the reduce function updates the calculated centres based on the Euclidean distance. The results were analysed for various dataset sizes, and better results were achieved as dataset size increased. The fast K-means algorithm is suited to large datasets, but a good clustering algorithm for large datasets should consider cluster efficiency and inter-cluster and intra-cluster measures in addition to execution time, and should scale up to meet input data size requirements.
Chen [29] proposed the use of MapReduce-based soft clustering for large datasets. A divide-and-conquer approach is used to split large amounts of data into chunks, and the MapReduce model allows each map and reduce phase to work independently of the other map and reduce phases, which run in parallel. Clustering quality metrics were not considered, and evaluations based on different workloads (dataset sizes) should also be conducted. To overcome this drawback, a two-phase K-means (TPK) method [30] that can efficiently process large volumes of data has been introduced.
Chu et al. [31] applied a big data parallel programming technique involving K-means clustering through the MapReduce framework. Xia et al. [32] proposed a parallel K-means optimization algorithm (Par3PKM) implemented through the MapReduce framework. As a distance metric used in Par3PKM, the Euclidean distance is employed to calculate distances between data points or objects and cluster centroids. Parallelization and optimization techniques can cluster large datasets through MapReduce. The parallel CoMK-means clustering algorithm [33] uses MapReduce to distribute input data across several slave nodes by using an HDFS to overcome large dataset clustering instabilities.
The clustering of large bioinformatics datasets [34] focuses on gene sequence analyses conducted via the MapReduce framework. The sizes of bioinformatics datasets range from a few gigabytes to several petabytes depending on the genes of living creatures and plants. For such large volumes of data, clustering plays a vital role in analysing a particular pattern and in finding similarities between two different genes, and parallelism serves as the only means of generating efficient clustering results. The expressed sequence tag (EST) [35] clustering tool groups sequences originating from the same gene into clusters so that they are similar to the cluster representative. A drawback of EST clustering is that, when finding similarities within the same cluster and between two different clusters in genetic datasets, good inter-cluster and intra-cluster measures must be used.
Fang et al. [36] used large-scale meteorological datasets and an MK-means algorithm applied through MapReduce to manage meteorological information stored on high-cost servers over several years. Optimal K-centroids were calculated using a standard similarity measure, the Euclidean distance. The performance of MK-means was investigated for datasets of different sizes, and stability and scalability were found to be the two major factors considered in MK-means. The MK-means clustering algorithm is limited in terms of the quality of clusters and the efficiency of its distance metric.
Tsai et al. [37] propose a MapReduce black hole (MRBH) for leveraging the strength of the black hole algorithm through the MapReduce framework to accelerate clustering processes.

Proposed work

The proposed work involves two approaches: K-means Hadoop MapReduce (KM-HMR) and K-means modified inter and intra clustering (KM-I2C). The following section describes the proposed algorithms for efficient clustering.

K-means Hadoop MapReduce (KM-HMR)

Algorithm 1 describes KM-HMR, a MapReduce implementation of K-means that can find clusters faster than standard K-means clustering methods. KM-HMR focuses on the MapReduce implementation of standard K-means, which serves as a framework for parallelization problems (e.g., clustering) that takes advantage of data locality and which involves two major phases: a map phase and a reduce phase. A MapReduce job splits a dataset into parts of a fixed size, referred to as chunks, each of which is consumed by a single map. The map phase calculates the distances between each object and each cluster centre and assigns each object to its nearest cluster. One map task is created for each input split, and the map function is executed for each record of the input split. The input split size is set to 128 MB and can also be set to the size of an HDFS block. The distance metric used in KM-HMR is the Euclidean distance. Objects that belong to the same cluster are sent to the reduce phase, which calculates the new cluster centroids for the next MapReduce job. Figure 1 depicts the overall flow of KM-HMR. Cluster centroids produced at the end of the initial iteration are stored in an old cluster file and are tested against the new cluster centroids in each iteration. When new cluster centroid values are obtained, they are written to a new file and the iteration counter is increased by one. This process is repeated until no more changes in cluster centroid values are found, a state referred to as convergence. The final output clusters are stored in a result file. Table 3 describes the notation used in Algorithm 1. (Algorithm 1 figure: https://static-content.springer.com/image/art%3A10.1186%2Fs40537-017-0087-2/MediaObjects/40537_2017_87_Figa_HTML.gif)
Table 3
Notations used in Algorithm 1

Notation | Description
It | Number of iterations
ic | Initial centroid
D | Dataset
k | Number of clusters
oc | Previous centroid values
nc | New cluster centroid values
Result | Final result
select() | Function for selecting data based on the k value
input() | Function for data file uploading
job.mapper() | Map function
job.reducer() | Reduce function
write() | Function for writing centroid values to a file
read() | Function for reading centroid values from a file
update() | Function for testing for updated centroid values
To improve the efficiency of the clustering stage, we applied the KM-HMR algorithm to the large datasets described in the “Datasets” section. Large datasets provided as input were divided into chunks and stored in the HDFS. We selected K data points or objects as initial cluster centres and updated the centroid of every cluster over several iterations. The map phase receives a sequence file containing the initial cluster centres in the form of key-value (k, v) pairs. In this phase, the distances between data objects and each cluster centre were computed using the Euclidean distance. The standard K-means clustering algorithm is suited to small datasets and structured data. When datasets are large in volume and contain unstructured data, data processing and result generation take a considerable amount of time; KM-HMR attempts to complete this processing within a reasonable execution time. The proposed KM-HMR achieves the above goal of processing large volumes of data in parallel with the MapReduce programming model.
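A minimal, single-machine sketch of the driver loop of Algorithm 1 is given below, using the notation of Table 3 (ic, oc, nc, It). The helper run_mapreduce_job() is a hypothetical stand-in for submitting one Hadoop MapReduce job; the tolerance and initialization choices are assumptions, not the authors' implementation.

```python
import random

def run_mapreduce_job(dataset, centroids):
    # Map: assign each object to its nearest centroid (Euclidean distance);
    # Reduce: recompute each cluster centroid as the mean of its members.
    clusters = {i: [] for i in range(len(centroids))}
    for obj in dataset:
        i = min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(obj, centroids[c])))
        clusters[i].append(obj)
    return [
        [sum(col) / len(members) for col in zip(*members)] if members else centroids[i]
        for i, members in clusters.items()
    ]

def km_hmr(dataset, k, tol=1e-6, max_it=100):
    oc = random.sample(dataset, k)           # ic: initial centroids
    for it in range(max_it):                 # It: iteration counter
        nc = run_mapreduce_job(dataset, oc)  # one MapReduce job per iteration
        moved = max(sum((a - b) ** 2 for a, b in zip(o, n)) for o, n in zip(oc, nc))
        oc = nc
        if moved < tol:                      # update(): convergence test
            break
    return oc                                # Result: final centroids

if __name__ == "__main__":
    data = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
    print(km_hmr(data, k=2))
```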

K-means modified inter and intra clustering (KM-I2C)

Techniques that cluster datasets using the K-means algorithm and estimate the number of clusters suffer from deficiencies in their cluster similarity measures when forming distinct clusters. A good clustering algorithm must produce high-quality clusters with high levels of intra-cluster similarity and low levels of inter-cluster similarity. The KM-I2C algorithm divides distinct clusters in datasets into sub-clusters based on inter-cluster and intra-cluster similarities. The clusters formed maximize inter-cluster distances and minimize intra-cluster distances. The quality of clustering results is dependent on similarity and implementation measures. The KM-I2C algorithm uses a set of ‘O’ data points in a dataset ‘D’ and an initial number of clusters ‘k’.
Algorithms 2 and 3 describe the algorithmic steps of the map and reduce phases of KM-I2C. (Algorithm figure: https://static-content.springer.com/image/art%3A10.1186%2Fs40537-017-0087-2/MediaObjects/40537_2017_87_Figb_HTML.gif)
The KM-I2C algorithm follows the same Hadoop MapReduce steps as KM-HMR. In KM-I2C, two distance measures are used as the similarity and dissimilarity measures: the inter-cluster distance (Ie) and the intra-cluster distance (Ia).
The new inter-cluster distance Ie is calculated as follows:
$$Ie = \frac{1}{2}\left( {\frac{{\sum\nolimits_{i = 1}^{O1} {\sum\nolimits_{j = 1}^{O2} {\left( {Ai - Bj} \right)^{2} } } }}{O1*O2}} \right)$$
(5)
where Ie is the inter-cluster distance, O1 and O2 are the numbers of data points in cluster A and cluster B (cluster 1 and cluster 2), respectively, and Ai and Bj denote the ith data point of cluster A and the jth data point of cluster B, respectively.
$$Ia = \frac{1}{2}\left( {\frac{{\sum\nolimits_{i,j = 1}^{O1 + O2} {\left( {Ai - Bj} \right)^{2} } }}{{\left( {O1 + O2} \right)*\left( {O1 + O2 - 1} \right)}}} \right)$$
(6)
where Ia is the intra-cluster distance computed over all pairs of the O1 + O2 data points of the merged clusters. (Algorithm figure: https://static-content.springer.com/image/art%3A10.1186%2Fs40537-017-0087-2/MediaObjects/40537_2017_87_Figc_HTML.gif) Algorithm 2 describes the map phase of the proposed KM-I2C algorithm and Algorithm 3 describes the reduce phase. The input dataset is stored locally in the HDFS as 〈k, v〉 pairs, where k is the key and v is the value of a given record in dataset D. The input dataset is distributed across several mappers. In the map phase, each map task reads a shared file containing all cluster centroids; this centroid file can typically be stored in the Hadoop distributed cache. Each map task reads a small number of input data points and assigns each data point to the cluster whose centroid is closest to it. The map phase then produces intermediate 〈key, value〉 pairs, which are grouped automatically by the Hadoop system in preparation for the reduce phase. The final output is stored in the nc (new centroid) file.
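The following Python sketch illustrates Eqs. (5) and (6), assuming that (Ai − Bj)² denotes the squared Euclidean distance between two multidimensional data points; the function names and sample clusters are illustrative assumptions, not the authors' code.

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two data points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inter_cluster(A, B):
    # Eq. (5): half the average squared distance between points of two clusters.
    o1, o2 = len(A), len(B)
    total = sum(sq_dist(a, b) for a in A for b in B)
    return 0.5 * total / (o1 * o2)

def intra_cluster(A, B):
    # Eq. (6): half the average squared distance over all ordered pairs
    # drawn from the merged cluster A ∪ B.
    merged = list(A) + list(B)
    n = len(merged)
    total = sum(sq_dist(merged[i], merged[j])
                for i in range(n) for j in range(n) if i != j)
    return 0.5 * total / (n * (n - 1))

if __name__ == "__main__":
    A = [[0.0, 0.0], [0.5, 0.5]]
    B = [[4.0, 4.0], [4.5, 4.5]]
    print(inter_cluster(A, B))  # large when the two clusters are well separated
    print(intra_cluster(A, B))  # smaller when the merged points are cohesive
```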

Experimental evaluation

The Hadoop cluster is a special parallel computational cluster that includes a master node and several slave nodes. For a given dataset, a file is split into numerous components equal to the block size set for the HDFS cluster (which is 64 MB by default). Several experiments were conducted on various datasets to evaluate the quality and scalability of our proposed algorithms. To evaluate the performance of the KM-I2C algorithm, we compared it to other algorithms (e.g., standard K-means and KM-HMR).
Experiments were conducted on a cluster of one master node acting as a NameNode and ten slave nodes acting as DataNodes, as described in Table 4. VMware virtual nodes running the CentOS 6.3 operating system were used.
Table 4
Experimental setup

Node | CPU | No. of cores | RAM (GB)
MN: master node | Intel i3 5005U | 4 | 8
SN_A: slave nodes 1–3 | Intel i3 5005U | 2 | 2
SN_B: slave nodes 4–6 | Intel i3 5005U | 2 | 2
SN_C: slave nodes 7–10 | Intel i3 5005U | 2 | 1
The master node acts as the NameNode and JobTracker and uses an Intel Core i3-5005U CPU at 2.66 GHz with 8 GB of RAM and four processor cores. Slave or worker nodes act as DataNodes and TaskTrackers with the same Intel Core i3-5005U CPU at 2.66 GHz, the RAM capacity specified in Table 4 and two processor cores. VMware Workstation was used to create the slave nodes as virtual machines. JDK 1.8.0 with the CentOS operating system was used in our experiments.
Key configuration of the Hadoop environment involves modifying the conf/masters file according to the specified number of slave nodes. The NameNode and JobTracker are configured in the conf/hadoop-site.xml file, and the necessary changes are made to other Hadoop configuration files such as hadoop-default.xml and hadoop-core.xml. Experiments were conducted with varying numbers of DataNodes and various block sizes to evaluate the performance of the proposed algorithms.

Datasets

Table 5 describes the datasets used in our experiments. Project Gutenberg (PG) consists of approximately 50,000 free ebooks downloaded from [38]. PG contains approximately 3000 English documents written by 142 authors and makes an effort to create and distribute ebooks of mostly public domain documents. Unstructured data refers to information that does not have a predefined data model and is not organized in a pre-defined manner. We selected this dataset because it has no clearly defined observations and variables (rows and columns); analysing such documents manually is an impossible task, and clustering makes it easier to group documents according to their similarities. We applied the implementation to a subset of the PG documents. All of the proposed algorithms were run with document sets of different sizes taken from this dataset to generate effective clustering results. Categories of focus include the following: audio, music, entertainment, children, education, comics, crafts, finance, health and markets. This dataset was used to investigate and explore documents in order to conduct a cluster analysis of unstructured data with better execution times.
Table 5
Datasets description

Dataset | Size (GB) | Description
DS_A: million song dataset | 300 | Collection of 53 audio features and metadata
DS_B: US climate reference network (USCRN) | 200 | Collected from 143 stations to maintain high-quality climate observations
DS_C: Project Gutenberg | 110 | Includes over 50,000 free ebooks
The US climate reference network (USCRN) [39] is a systematic and sustained network of climate monitoring stations with sites across the United States. Approximately 114 stations are equipped with high-quality devices that measure temperature, precipitation, soil conditions and wind speed. The purpose of this dataset is to obtain high-quality climatic observations for monitoring climatic patterns. The million song dataset (MSD) [40] is a collection of audio features and metadata available as an Amazon Public Dataset that can be attached to an Amazon EC2 virtual machine to run experiments. The MSD is an unstructured dataset that contains information on approximately 1 million songs with 53 features and is approximately 300 GB in size. We considered subsets of these datasets in our experiments. The number of input splits is determined by the total input file size divided by the split size, and the number of mappers is based on the number of input splits made. In the MapReduce framework, the InputFormat describes the input specification of a MapReduce job and determines the input splits; the number of map tasks is proportional to the number of input files and the total number of blocks of those files. The default split size is the HDFS block size. The split size can be controlled via the mapred.min.split.size parameter available through hadoop-site.xml. In this work, we used the HDFS block size of 128 MB as the split size.
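As a rough, back-of-the-envelope sketch of the relationship between input size, split size and the number of map tasks described above (the function name and the exact power-of-two units are assumptions, not Hadoop code):

```python
import math

GB = 1024 ** 3
MB = 1024 ** 2

def num_map_tasks(total_input_bytes, split_bytes):
    # The number of input splits (and hence map tasks) is roughly the total
    # input size divided by the split size, rounded up.
    return math.ceil(total_input_bytes / split_bytes)

# Approximate dataset sizes from Table 5 with a 128 MB split size.
for name, size_gb in [("DS_A (MSD)", 300), ("DS_B (USCRN)", 200), ("DS_C (PG)", 110)]:
    print(name, num_map_tasks(size_gb * GB, 128 * MB))
# DS_A: 2400 splits, DS_B: 1600 splits, DS_C: 880 splits
```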
Figure 2 compares the execution times of the proposed algorithms for document sets of the PG dataset. As the number of documents to be clustered increases, the time taken by standard K-means exceeds that of the proposed solutions. Parallel processing is essential when processing large volumes of data. In this experiment, we considered only a subset of the documents of the PG dataset. Table 6 shows the clustering results for Project Gutenberg. The proposed KM-HMR takes less time than the standard K-means algorithm because it processes large datasets across several machines, and the proposed KM-I2C takes less time than both standard K-means and KM-HMR. KM-HMR uses the Euclidean distance as a similarity measure, whereas KM-I2C uses inter- and intra-clustering measures to obtain efficient, high-quality clusters with high levels of intra-cluster similarity and low levels of inter-cluster similarity and to obtain valuable insights from large datasets.
Table 6
Clustering results (subcategories formed/total)

Broad categories | KM-I2C | KM-HMR | Standard K-means
Digital audio | 21/22 | 19/22 | 15/22
Space | 26/28 | 17/28 | 14/28
Education | 61/64 | 51/64 | 42/64
Entertainment | 50/52 | 40/52 | 30/52
One of the most complex problems associated with clustering is identifying the optimal number of clusters. Unfortunately, there is no general theoretical solution for finding the optimal number of clusters for a dataset. The silhouette coefficient (SC) [41, 42] is used to validate the optimal number of clusters and the degree of inter- and intra-cluster cohesiveness. The SC is a dimensionless quantity ranging from −1 to 1. Negative values are undesirable, and values closer to 1 denote an overall measure of goodness for a cluster. The SC value is obtained by the following steps:
1. For object d, calculate \(a_{d}\), the average distance to all other objects in its own cluster, using the inter- and intra-clustering equations used in the proposed work.
2. For object d and each cluster that does not contain it, calculate the average distance from d to all objects in that cluster; the smallest of these average distances is taken as \(b_{d}\).
3. For the dth object, the SC is given by:
$$S_{c}(d) = \frac{b_{d} - a_{d}}{\max(a_{d}, b_{d})}$$
(7)
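A minimal Python sketch of the silhouette computation in Eq. (7) is shown below. It uses plain Euclidean distances for \(a_{d}\) and \(b_{d}\), whereas the paper computes them with its inter- and intra-cluster measures, so this is an illustrative simplification; the function names and sample clusters are assumptions.

```python
import math

def dist(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def silhouette(point, own_cluster, other_clusters):
    # a_d: average distance from the point to the other members of its cluster.
    others = [p for p in own_cluster if p != point]
    a_d = sum(dist(point, p) for p in others) / len(others)
    # b_d: smallest average distance from the point to any other cluster.
    b_d = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    # Eq. (7): values near 1 indicate a well-clustered point.
    return (b_d - a_d) / max(a_d, b_d)

if __name__ == "__main__":
    c1 = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]]
    c2 = [[5.0, 5.0], [5.2, 4.9]]
    print(silhouette(c1[0], c1, [c2]))  # close to 1: well-clustered point
```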
Table 7 compares the performance of parallel K-means, standard K-means and the proposed KM-HMR and KM-I2C algorithms when applied to the USCRN dataset. Parallel K-means worked well with the datasets studied within a reasonable amount of time due to its inherent data parallelism. In TPK, an overhead resulted from splitting the algorithm into two phases. KM-I2C performed better in terms of execution time, as clusters formed in less time than with the other existing clustering techniques. The average times taken for each cluster indicate that, with parallelization, the proposed clustering algorithm was more efficient than the single-node standard K-means clustering method. Through the experiments, we found that the block size should depend on the input datasets employed: when the block size is too small, the coordination overhead involved can increase, affecting performance and limiting opportunities for parallelization.
Table 7
Comparisons of total execution periods (s)

Run | Parallel K-means | Standard K-means | KM-HMR | KM-I2C
1 | 4320.14 | 4502.18 | 4214.16 | 3876.23
2 | 3880.23 | 4025.06 | 3705.29 | 3518.34
3 | 3691.18 | 3907.12 | 3519.12 | 3412.49
Average | 3963.85 | 4144.78 | 3812.85 | 3602.35
Figures 3 and 4 show the execution times for five clusters on the datasets (DS_A, DS_B and DS_C) considered in this work for KM-HMR and KM-I2C, respectively. As the size of the datasets increases, the time used by KM-I2C improves relative to that used by the standard K-means and KM-HMR methods. The x-axis shows the number of virtual machines used and the y-axis shows the time taken to cluster the datasets in seconds. The results illustrate the superiority of the proposed algorithm over the traditional algorithm across different parameters. The execution time is another measure used to compare the two algorithms. We have improved the clustering response time with respect to Hadoop’s characteristics. The effect of the number of nodes on the clustering algorithms was also studied for the same set of data objects. Scalability was tested by increasing the number of nodes used for execution while maintaining the same datasets, and by keeping the same number of nodes while varying the dataset sizes.
Figures 5 and 6 show the execution times for a larger number of clusters (ten clusters for KM-HMR and KM-I2C, respectively) using a specified number of virtual machines. The more parallel processing was used, the less time was taken for algorithm execution. KM-I2C generated better execution times than the other algorithms. KM-HMR and KM-I2C are comparable in terms of execution time, with KM-HMR performing better than the standard K-means algorithm and KM-I2C generating better results than KM-HMR. This was expected, as KM-I2C uses inter- and intra-cluster distances rather than Euclidean distances. We also measured the execution times of the proposed clustering algorithms against those of the standard K-means clustering algorithm.
Figures 7 and 8 show the execution times for 15 clusters for KM-HMR and KM-I2C, respectively. The execution time is a key facet of a successful clustering technique. The proposed KM-HMR and KM-I2C algorithms are better suited to clustering large datasets via MapReduce than the standard K-means clustering algorithm. The effect of the dimensionality of the data on the time taken was also computed. To obtain more conclusive results from our comparisons, the number of processes used for sum computations was varied for each series of runs. To generate efficient results, the dependence of K-means on the number of clusters was evaluated for different cluster counts.

Conclusions

Clustering is a challenging issue that is heavily shaped by the data used and the problems considered. The proposed algorithms show improvements in terms of execution time. Hadoop can compute map and reduce jobs in parallel to cluster large datasets effectively and efficiently. The main objective of this work was to accelerate and scale up the clustering of large datasets to obtain efficient, high-quality clusters. The standard K-means method is the most popular clustering method due to its simplicity and reasonable execution efficiency when applied to small datasets. A clustering approach should, however, also maintain cluster efficiency when large datasets are involved by considering inter- and intra-cluster distances between data objects in a dataset. We introduced the KM-HMR algorithm to make use of parallelization through Hadoop and to obtain better execution times than those of the standard K-means approach. We then developed a novel KM-I2C algorithm by modifying the clustering distance metric, yielding a more efficient and effective method relative to other clustering techniques. Future work should enhance the performance of map and reduce jobs to suit large datasets. The performance of Hadoop can be enhanced by using multilevel queues for the efficient scheduling of jobs suitable for large datasets.

Authors’ contributions

CS prepared the manuscript, contributed the two major algorithms (KM-HMR and KM-I2C), conducted an extensive study of the literature, and introduced a new distance metric that produces clusters at a faster rate than other clustering techniques. NK contributed to the conception and design of the study. PCR contributed to the analysis and interpretation of the data. All authors read and approved the final manuscript.

Authors’ information

Sreedhar C is currently working as an Associate Professor in the Department of Computer Science and Engineering at G Pulla Reddy Engineering College, Kurnool, India. His research interests include big data and network communications. He has 13 years of teaching experience and has published 19 papers in various international journals and conferences.
Kasiviswanath is currently working as Professor and Head of the Department of Computer Science and Engineering at G Pulla Reddy Engineering College, Kurnool, India. He has over 20 years of teaching experience. His research areas include computer networks, security and big data. He has published around 25 papers in various international journals and conferences.
Chenna Reddy is currently working as a Professor in the Department of Computer Science and Engineering at Jawaharlal Nehru Technological University Anantapur, Anantapuram, India. He has served as Director and in various other capacities at the university level.

Acknowledgements

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Availability of data and materials

Data will not be shared at this moment, as the datasets are being used in an extension of this research work. Complete results and datasets of this research will be shared on GitHub.

Funding

Not applicable.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Literature
1. De Camargo RY, Goldchleger A, Kon F. InteGrade: a tool for executing parallel applications on a grid for opportunistic computing. In: Proceedings of the 23rd Brazilian symposium on computer networks. 2005.
2. Sreedhar C, Kasiviswanath N, Reddy PC. A survey on big data management and job scheduling. Int J Comput Appl. 2015;130(13):41–49.
4. Han J. Data mining: concepts and techniques. San Francisco: Morgan Kaufmann Publishers Inc.; 2005. ISBN 1558609016.
6. McCallum A, Nigam K, Ungar LH. Efficient clustering of high-dimensional datasets with application to reference matching. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining. 2000. p. 169–78.
7. Quinlan JR. Programs for machine learning. San Francisco: Morgan Kaufmann Publishers Inc.; 1993. ISBN 1-55860-238-0.
8. Domingos P. Linear-time rule induction. In: Proceedings of knowledge discovery and data mining (KDD-96), Portland, Oregon. 1996.
9. Fisher D. Knowledge acquisition via incremental conceptual clustering. Mach Learn J. 1987;2(2):139–72.
10. Cheeseman P, Stutz J. Bayesian classification (AutoClass): theory and results. In: Advances in knowledge discovery and data mining. 1996. p. 153–80. ISBN 0-262-56097-6.
11. Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R, editors. Cambridge: AAAI/MIT Press; 1996. p. 153–80.
12. Biswas G, Weinberg J, Li C. ITERATE: a conceptual clustering method for knowledge discovery in databases. In: Braunschweig B, Day R, editors. Innovative applications of artificial intelligence in the oil and gas industry. 1995.
13. Das G, Mannila H, Ronkainen P. Similarity of attributes by external probes. In: Proceedings of the fourth international conference on knowledge discovery and data mining (KDD’98). New York: AAAI Press; 1998. p. 23–9.
14. Ortega M, Rui Y, Chakrabarti K, Mehrotra S, Huang T. Supporting ranked boolean similarity queries in MARS. IEEE Trans Knowl Data Eng. 1998;10(6):905–25.
15. Dhillon IS, Modha DS. Concept decompositions for large sparse text data using clustering. Mach Learn. 2001;42(1):143–75.
16. Mao J, Jain AK. Self-organizing network for hyperellipsoidal clustering (HEC). Proc IEEE Neural Netw. 1994;5:2967–72.
17. Shanley RJ, Mahtab MA. Delineation and analysis of clusters in orientation data. J Int Assoc Math Geol. 1974;8(1):9–23.
19. Chaimov N, Malony A, Canon S, Iancu C, Ibrahim KZ, Srinivasan J. Scaling Spark on HPC systems. In: Proceedings of the 25th ACM international symposium on high-performance parallel and distributed computing. 2016. p. 97–110.
20. Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. 2008. p. 1099–110.
21. Hunt P, Konar M, Junqueira FP, Reed B. ZooKeeper: wait-free coordination for internet-scale systems. In: Proceedings of the USENIX annual technical conference. 2010.
22. Wadkar S, Siddalingaiah M. HCatalog and Hadoop in the enterprise. In: Pro Apache Hadoop. Berkeley: Apress; 2014.
23. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive: a warehousing solution over a map-reduce framework. J VLDB Endow. 2009;2(2):1626–9.
24. Vohra D. Apache Avro. In: Practical Hadoop ecosystem. 2016. p. 303–23.
25. Shastry K, Madhyastha S, Kumar S, Bresniker KM, Battas G. Transaction support for HBase. In: Proceedings of the international conference on management of data. 2014. p. 117–20.
26. Wadkar S, Siddalingaiah M. Apache Ambari. In: Pro Apache Hadoop. 2014. p. 399–401.
27. Ferreira Cordeiro RL, Traina Junior C, Machado Traina AJ, López J, Kang U, Faloutsos C. Clustering very large multi-dimensional datasets with MapReduce. In: Proceedings of KDD’11, ACM, California, August 21–24. 2011.
28. Zhao W, Ma H, He Q. Parallel K-means clustering based on MapReduce. In: CloudCom 2009, LNCS 5931. Berlin: Springer; 2009. p. 674–9.
29. Chen M. Soft clustering for very large data sets. Comput Sci Netw Secur J. 2017;17(11):102–8.
30. Pham DT, Dimov SS, Nguyen CD. A two-phase K-means algorithm for large datasets. Mech Eng Sci J. 2004;218(10):1269–73.
31. Chu C-T, Kim SK, Lin Y-A. Map-reduce for machine learning on multicore. In: Proceedings of the 20th annual conference on neural information processing systems (NIPS’06), Vancouver, Canada. 2006. p. 281–8.
32. Xia D, Wang B, Li H, Li Y, Zhang Z. A distributed spatial–temporal weighted model on MapReduce for short-term traffic flow forecasting. J Neurocomput. 2016;179(C):246–63.
33. Chao L, Yan Y, Tonny R. A parallel Cop-K means clustering algorithm based on MapReduce framework. Adv Intell Soft Comput J. 2011;123:93–102.
34. Wang C, Guo M, Liu Y. EST clustering in large dataset with MapReduce. In: Proceedings of pervasive computing, signal processing and applications. 2010.
35. Zhang LD, Yuan DJ, Zhang JW, Wang SP, Zhang QF. A new method for EST clustering. J Yi chuan xue bao = Acta Genetica Sinica. 2003.
37. Tsai CW, Hsieh CH, Chiang MC. Parallel black hole clustering based on MapReduce. In: Proceedings of the IEEE international conference on systems, man and cybernetics. 2015.
39. Palecki MA, Lawrimore JH, Leeper RD, Bell JE, Embler S, Casey N. US climate reference network processed data from USCRN database version 2. 2015.
40. Bertin-Mahieux T, Ellis DP, Whitman B, Lamere P. The million song dataset. In: Proceedings of the 12th international conference on music information retrieval (ISMIR 2011). 2011.
41. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65.
42. Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. New York: Wiley; 1990.
Metadata
Title
Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop
Authors
Chowdam Sreedhar
Nagulapally Kasiviswanath
Pakanti Chenna Reddy
Publication date
01-12-2017
Publisher
Springer International Publishing
Published in
Journal of Big Data / Issue 1/2017
Electronic ISSN: 2196-1115
DOI
https://doi.org/10.1186/s40537-017-0087-2
