Abstract

Data streams are continuously generated over time by Internet of Things (IoT) devices. The faster this data is analyzed, its hidden trends and patterns discovered, and new strategies created, the faster action can be taken, creating greater value for organizations. Density-based methods are a prominent class of techniques for clustering data streams. They can detect clusters of arbitrary shape, handle outliers, and do not need the number of clusters in advance. A density-based clustering algorithm is therefore a suitable choice for clustering IoT streams. Several density-based algorithms have recently been proposed for clustering data streams; however, density-based clustering in limited time remains a challenging issue. In this paper, we propose a density-based clustering algorithm for IoT streams. The method has a fast processing time, which makes it applicable to real-time IoT applications. Experimental results show that the proposed approach obtains high-quality results with low computation time on real and synthetic datasets.

1. Introduction

The use of RFID and conventional sensors as the basis of data collection in the Internet of Things (IoT) makes the volume of collected data extremely large. In many cases, communication and data transfer between objects are required to enable smart analytics. Such communication and transfer consume both bandwidth and energy, which are usually limited resources in real scenarios. Furthermore, the analytics required for such applications is often real-time and therefore requires methods that can provide real-time insights [1–3]. Data mining techniques are very useful for this kind of analytics. However, since the generated data arrives as a stream, we modify the multilayer data mining model for IoT from [4] into a multilayer data stream mining model for IoT. The model is illustrated in Figure 1.

Mining data streams is a relatively new area of research in the data mining community. It has become prominent in many applications such as monitoring environmental sensors, social network analysis, real-time detection of anomalies in computer network traffic, and web searches [5, 6].

Clustering is an important task in mining data streams [6]. However, the characteristics of data streams impose several requirements on clustering: it must work in limited memory and time, make a single pass over the evolving stream, and handle noisy data [7–9].

There are different methods for clustering data streams. Clustering methods group data according to the similarity among objects, where similarity is determined by distance or density [5]. Distance-based methods [10] tend to form only spherical clusters. Density-based methods [11], on the other hand, can detect clusters of any shape and are useful for identifying noise.

In the last few years, many proposals for extending density-based clustering to data streams have been presented [12]. Density-based data stream clustering algorithms are mainly grouped into density grid-based methods and density microclustering methods.

Density grid-based clustering [13] quantizes the data space into a number of density grids that form a grid structure on which all clustering operations are performed. The main advantage of this approach is its fast processing time, which is independent of the number of data points and depends only on the number of cells. However, despite their fast processing time, grid-based methods may produce clusters of lower quality and accuracy [5]. Examples of density grid-based clustering algorithms are D-Stream [14], MR-Stream [9], and ExCC [15].

In density-based microclustering [16], on the other hand, microclusters keep summary information about the data, and clustering is performed on this synopsis. A microcluster [10] is a temporal extension of the cluster feature (CF), that is, a summarization triple maintained about a cluster. Density-based microclustering methods keep a summary of clusters in microclusters and form the final clusters from them. They achieve better quality than grid-based methods but need more computation time. Density microclustering algorithms include DenStream [14], FlockStream [17], and SOStream [18].

To mitigate the high computation time of density microclustering methods, we propose a hybrid density-based method for clustering evolving data streams. The proposed method combines the advantages of density grid-based and microclustering methods. We refer to our algorithm as HDC-Stream (hybrid density-based clustering for data stream). HDC-Stream has three steps. In the first step, a new data point is either mapped to the grid or merged into an existing minicluster. A minicluster is a concept similar to a microcluster, but it is formed from a grid cell. The second step prunes miniclusters and grids at each pruning time. The last step forms the final clusters from the pruned miniclusters using a modified DBSCAN algorithm.

The main contributions of HDC-Stream are summarized as follows.
(1) Instead of searching the list of outlier microclusters to find a suitable one, HDC-Stream maps the new data point into a grid cell, which saves computation time. The comparisons needed to search the outlier microclusters are replaced by a single grid mapping.
(2) Instead of forming a new microcluster for a new data point that does not fit into any existing microcluster and may be the seed of an outlier, HDC-Stream maps the point to the grid and keeps it there until the grid density reaches a predefined threshold. At that point, the grid is converted to a minicluster.
(3) The experimental results show that HDC-Stream outperforms two well-known density microclustering and density grid-based clustering methods in terms of quality and execution time. Furthermore, they show that HDC-Stream obtains high-quality clusters even when noise is present.

The remainder of this paper is organized as follows: Section 2 surveys related work. Section 3 introduces basic definitions. In Section 4, we explain in detail the HDC-Stream algorithm. We analyze the HDC-Stream algorithm using synthetic and real datasets in Section 5. Section 6 discusses the advantages of the proposed method. We conclude the paper in Section 7.

2. Related Work

Clustering is an important task in data stream mining, and many clustering algorithms have recently been developed for data streams. These algorithms can generally be grouped into the following four main categories [5].

A partitioning-based clustering algorithm tries to find the best partitioning of the data points, in which intraclass similarity is maximal and interclass similarity is minimal. Two well-known extensions of k-means [19, 20] to data streams are STREAM [7] and CluStream [10]. Hierarchical clustering algorithms decompose data objects into a tree of clusters. BIRCH [10] and ClusTree [8] are examples of the hierarchical clustering family. Grid-based clustering is independent of the distribution of data objects: it partitions the data space into a number of cells, which form the grid. Grid-based clustering has a fast processing time since it does not depend on the number of data objects. D-Stream [14], MR-Stream [9], and ExCC [15] are grid-based clustering algorithms for data streams.

Density-based clustering algorithms have been developed to discover clusters with arbitrary shapes. They find clusters based on the dense areas of the data space. If two points are close enough and the region around them is dense, the two points are joined and contribute to the construction of a cluster. DBSCAN [21], OPTICS [22], and DENCLUE [23] are examples of this approach.

Due to the characteristics of data streams, traditional density-based clustering is not directly applicable. Recently, many density-based clustering algorithms have been extended to data streams. The main idea of these algorithms is to use a density-based method in the clustering process while overcoming the constraints imposed by the nature of data streams. Density-based stream clustering algorithms are categorized into two broad groups, density microclustering and density grid-based clustering algorithms. A comprehensive survey of density-based clustering algorithms for data streams is presented in [12].

DenStream [24] is a density microclustering algorithm for evolving data streams. The algorithm extends the microcluster concept [10] and introduces outlier and potential microclusters to distinguish between outliers and real data. It has an online and an offline phase: microclusters are formed in the online phase, and the offline phase performs macroclustering on the microclusters. FlockStream [17] is an extension of DenStream that uses a bioinspired model. It is based on the flocking model [25], in which agents are microclusters that work independently but form clusters together. It assigns an agent to each data point, which is mapped into a virtual space. Agents move within their predefined visibility range for a fixed time. If an agent meets another agent that is similar to it, the two join to form a cluster. FlockStream merges the online and offline phases since the agents form clusters at any time; however, searching for similar agents is a time-consuming process. SOStream (self-organizing density-based clustering over data stream) [18] detects structures within fast-evolving data streams by automatically adapting the threshold for density-based clustering. SOStream dynamically creates, merges, and removes clusters in an online manner. It uses competitive learning as introduced for SOMs (self-organizing maps) [26], which is time consuming for clustering data streams. Density microclustering methods are effective in terms of quality and can capture the evolution of clusters. However, they have a high computation time for finding suitable microclusters.

The other important category is the density grid-based method. D-Stream [27] is a density grid-based clustering algorithm in which data points are mapped to the corresponding grids and the grids are clustered based on their density. It adjusts the clusters in real time, captures the evolving behavior of data streams, and has techniques for handling outliers. MR-Stream [9] is another clustering algorithm, which can cluster data streams at multiple resolutions. The algorithm partitions the data space into cells and maintains a tree-like data structure that records the space partitioning. The tree keeps the data clustering at different resolutions, and each node holds summary information about its parent and children. The algorithm improves clustering performance by determining the right time to generate the clusters. Neither D-Stream nor MR-Stream works properly for high-dimensional data streams [12]. ExCC (exclusive and complete clustering) [15] is a density grid-based clustering algorithm for heterogeneous data streams. It maps the numerical attributes to the grid, and the categorical attributes are assigned granularities according to the distinct values in their respective domain sets. ExCC distinguishes fast and slow streams based on the average arrival time of the data points. It detects noise in the offline phase using a wait-and-watch policy. To detect real outliers, it keeps data points in a hold queue, which is maintained separately for each dimension. The hold queue strategy therefore needs more memory and processing time. Density grid-based clustering has lower quality since it depends on the granularity of the grid. On the other hand, it handles outliers effectively. Its computation time is high for high-dimensional data.

3. Basic Definitions of HDC-Stream

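Definition 1 ($\varepsilon$-neighborhood of a point). The neighborhood of a point lies within a radius of $\varepsilon$. The neighborhood of a point $p$ is denoted by $N_{\varepsilon}(p)$: $$N_{\varepsilon}(p) = \{q \in D \mid \operatorname{dist}(p, q) \le \varepsilon\},$$ where $D$ is the set of data points and $\operatorname{dist}(p, q)$ is the Euclidean distance between $p$ and $q$.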
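
As a minimal sketch of Definition 1, the following Python function collects the points whose Euclidean distance to a query point does not exceed the radius. The function name `eps_neighborhood` and the sample points are hypothetical and serve only as an illustration; the sketch assumes that the neighborhood includes its boundary.

```python
import math

def eps_neighborhood(points, p, eps):
    """Return all points q whose Euclidean distance to p is at most eps."""
    return [q for q in points if math.dist(p, q) <= eps]

# Hypothetical two-dimensional sample points.
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0)]
neighbors = eps_neighborhood(points, (0.0, 0.0), eps=1.0)
print(neighbors)
```

A linear scan such as this one costs one distance computation per stored point, which is the kind of cost a stream algorithm tries to avoid repeating for every arrival.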

Definition 2 (MinPts). MinPts is the minimum number of data points around a data point $p$ in the $\varepsilon$-neighborhood of $p$.

Definition 3 (data point weight value). Each data point in the data stream is assigned a weight that decreases over time. The initial weight of a data point is 1. The weight of a data point $x$ (with $d$ dimensions) at time $t_c$ is defined from its weight at time $t_p$ as follows ($t_c > t_p$): $$w(x, t_c) = w(x, t_p)\, f(t_c - t_p),$$ where $f$ is a fading function. The fading function [28] that we use in HDC-Stream is defined as $f(t) = 2^{-\lambda t}$, where $\lambda > 0$.
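
The decay of a data point's weight can be sketched as follows. The sketch assumes the exponential fading function $f(t) = 2^{-\lambda t}$ with $\lambda > 0$, a form commonly used in density-based stream clustering; the function names and the value 0.25 for the decay factor are illustrative assumptions.

```python
def fading(dt, lam):
    """Assumed exponential fading function f(dt) = 2^(-lam * dt), lam > 0."""
    return 2.0 ** (-lam * dt)

def decayed_weight(w_prev, t_prev, t_cur, lam):
    """Weight at t_cur derived from the weight at the earlier time t_prev."""
    return w_prev * fading(t_cur - t_prev, lam)

lam = 0.25                                            # illustrative decay factor
w4 = decayed_weight(1.0, t_prev=0, t_cur=4, lam=lam)  # initial weight is 1
w8 = decayed_weight(w4, t_prev=4, t_cur=8, lam=lam)
print(w4, w8)
```

Decaying in two stages gives the same weight as decaying once over the whole interval, so a weight only needs to be refreshed when it is actually read or updated.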

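Definition 4 (grid weight). For a grid $g$ at current time $t$, the grid weight is defined as the sum of the weights of the data points mapped to it: $$w(g, t) = \sum_{x \in g} w(x, t).$$

According to the work presented in [27], when a new data point is mapped to the grid at time $t_c$, we update the grid weight from its last updated value at time $t_p$ as follows: $$w(g, t_c) = 2^{-\lambda (t_c - t_p)}\, w(g, t_p) + 1,$$ where $\lambda > 0$ is the decay factor of the fading function.

The total weight of all the grids in the data space is $\sum_{g} w(g, t)$, which is less than $1/(1 - 2^{-\lambda})$. Moreover, we have $$\lim_{t \to \infty} \sum_{g} w(g, t) = \frac{1}{1 - 2^{-\lambda}}.$$

It means that the sum of all data points’ weights has an upper bound of $1/(1 - 2^{-\lambda})$. The number of grids equals $N = \prod_{i=1}^{d} p_i$, where every $i$th dimension is divided into $p_i$ partitions. Therefore, the average density of each grid is $1/(N(1 - 2^{-\lambda}))$.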
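
The following sketch illustrates how a grid weight can be maintained incrementally and why the total weight stays bounded. It assumes the fading function $2^{-\lambda t}$ and, for the bound, that exactly one data point arrives per time unit; the function name, the arrival times, and the decay factor are hypothetical.

```python
def update_grid_weight(w_prev, t_prev, t_cur, lam):
    """Decay the stored weight to t_cur, then add 1 for the newly mapped point."""
    return w_prev * 2.0 ** (-lam * (t_cur - t_prev)) + 1.0

lam = 0.25
arrivals = [0, 3, 4, 9]      # hypothetical arrival times of the points in one grid

# Incremental update: only the last weight and the last update time are stored.
w, t_last = 0.0, 0
for t in arrivals:
    w, t_last = update_grid_weight(w, t_last, t, lam), t

# Direct definition: sum of the decayed weights of all points in the grid.
direct = sum(2.0 ** (-lam * (t_last - t)) for t in arrivals)

# With one arrival per time unit, the total weight approaches 1 / (1 - 2^(-lam)).
bound = 1.0 / (1.0 - 2.0 ** (-lam))
total = 0.0
for _ in range(1000):
    total = total * 2.0 ** (-lam) + 1.0
print(w, direct, total, bound)
```

The incremental value agrees with the direct sum, so a grid never has to store the individual weights of its points.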

Definition 5 (core point). A core point is an object for which the overall weight of all data points in its $\varepsilon$-neighborhood is at least a given threshold value.

Definition 6 (dense grid). At time , for a grid , we call it a dense grid if .

Definition 7 (sparse grid). At time , for a grid , we call it a sparse grid if .

Because the overall weight cannot be more than , is a controlling threshold.

Definition 8 (minicluster (MIC)). A at time is defined as for a group of very close data points with timestamps as follows: where is an Euclidean distance between the center of minicluster and the data points in that grid cell.

Definition 9 (grid synopsis). Is a tuple where is the number of data points, is the last timestamp and is the grid weight.

Definition 10 (outlier weight threshold (OWT)). This threshold is considered for the sparse grids which do not receive any data for long. In fact, these grids do not have any chance to be converted to dense grids and consequently to . If the grid weight is less than this threshold, it can safely be deleted from the grid list (in the outlier buffer) [14]. If the last updated time of grid is , then, at current time , the outlier weight threshold is defined as follows ():

Definition 11 (pruning time). We check all MICs’ weights as well as the weights of all grid cells in a time we call it . is the minimum time for a in timestamp to be converted to an outlier in () which is described as follows:

Lemma 12.

Proof.

4. HDC-Stream Algorithm

HDC-Stream is a hybrid density-based clustering algorithm for evolving data streams. The overall architecture of the HDC-Stream algorithm is outlined in Algorithm 1. It has an online and an offline component. At each timestamp, the online component continuously reads a new data record from the stream and either adds it to an existing minicluster or maps it to the grid. At each pruning time, HDC-Stream removes real outliers. The offline component generates the final clusters on demand by the user. The procedure is divided into three steps, which are also illustrated in Figure 2.
(1) Merging or mapping (MM-Step): the new data point is added to an existing minicluster or mapped to the grid (lines 5–18 of Algorithm 1).
(2) Pruning grids and miniclusters (PGM-Step): the weights of the grid cells and of the miniclusters are periodically checked at pruning time. The period is defined by the minimum time for a minicluster to be converted to an outlier. Grids and miniclusters with weights below a threshold are discarded, and the memory space is released (lines 19–33 of Algorithm 1).
(3) Forming final clusters (FFC-Step): final clusters are formed from the miniclusters that remain after pruning. Each minicluster is clustered as a virtual point using a modified DBSCAN (lines 34–36 of Algorithm 1).

Input: a data stream, MinPts, , and
Output: arbitrary shape clusters
(1) 
(2)
(3)while not end of stream do
(4) Read data point from Data Stream
 {***** MM-Step *****}
(5)  Find the nearest mini-cluster MIC to the data point
(6)if distance   then
(7)   Merge to the MIC
(8)else
(9)   Map the new data point to the grid
(10)    
(11)    Update
(12)   if   ≥ MinPts and ≥     then
(13)    
(14)    
(15)    
(16)    Remove grid from the grid list
(17)   end if
(18)  end if
 {***** PGM-Step *****}
(19)  if   mod   then
(20)   for all grid   do
(21)    
(22)    if     then
(23)     Remove grid from the grid list
(24)    end if
(25)   end for
(26)   for all     do
(27)    if     then
(28)     Remove MIC from
(29)    end if
(30)   end for
(31)  end if
(32)  
(33) end while
{***** FFC-Step *****}
(34) if a clustering request has arrived then
(35)  Generate clusters using a modified DBSCAN
(36) end if

The steps are explained as follows.

4.1. MM-Step of HDC-Stream

When a new data point arrives (Figure 3), the following steps are performed.
(i) HDC-Stream finds the MIC nearest to the new data point.
(ii) If the distance from the new data point to the nearest MIC is less than the distance threshold, the point is added to that MIC.
(iii) Otherwise, the data point is mapped into the grid in the outlier buffer.
 (a) If the number of data points in the grid reaches MinPts, the grid weight is checked.
  (1) If the grid weight is higher than the dense grid threshold, a new MIC is formed from the data points in this grid.
  (2) The grid associated with the new MIC is removed from the grid list.
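
The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated in this paper: a minicluster is reduced to a center and a point count, grid cells are found by integer division of each coordinate by a hypothetical `cell_width`, the grid weight decays with an assumed fading function $2^{-\lambda t}$, and the names `mm_step`, `radius`, and `dense_threshold` are introduced only for this example.

```python
import math

def mm_step(point, t, miniclusters, grids, radius, min_pts, dense_threshold,
            cell_width, lam):
    """Merge `point` into the nearest minicluster or map it to its grid cell."""
    # (i) Find the nearest minicluster, if any exists.
    nearest = min(miniclusters, key=lambda m: math.dist(m["center"], point),
                  default=None)
    # (ii) Merge when the point is close enough to the nearest center.
    if nearest is not None and math.dist(nearest["center"], point) < radius:
        n = nearest["n"]
        nearest["center"] = tuple((c * n + x) / (n + 1)
                                  for c, x in zip(nearest["center"], point))
        nearest["n"] = n + 1
        return "merged"
    # (iii) Otherwise map the point to a grid cell in the outlier buffer.
    key = tuple(math.floor(x / cell_width) for x in point)
    cell = grids.setdefault(key, {"points": [], "weight": 0.0, "t": t})
    cell["weight"] = cell["weight"] * 2.0 ** (-lam * (t - cell["t"])) + 1.0
    cell["t"] = t
    cell["points"].append(point)
    # (a) A sufficiently populated and dense cell becomes a new minicluster.
    if len(cell["points"]) >= min_pts and cell["weight"] >= dense_threshold:
        pts = cell["points"]
        center = tuple(sum(xs) / len(pts) for xs in zip(*pts))
        miniclusters.append({"center": center, "n": len(pts)})
        del grids[key]                       # the grid leaves the grid list
        return "promoted"
    return "mapped"

miniclusters, grids = [], {}
stream = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (0.25, 0.15), (5.0, 5.0)]
results = [mm_step(p, t, miniclusters, grids, radius=0.5, min_pts=3,
                   dense_threshold=2.0, cell_width=1.0, lam=0.25)
           for t, p in enumerate(stream)]
print(results)
```

In this toy stream the first three points accumulate in one cell until it is promoted to a minicluster, the fourth point is merged into that minicluster, and the isolated last point stays in the grid without ever creating a minicluster of its own.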

4.2. PGM-Step of HDC-Stream

For each MIC, if no new point is added, its weight will gradually decay. Furthermore, some grids do not receive data points for a long time and become sporadic. Such MICs and grid cells should be removed from the minicluster list and the grid list, respectively. The decision to remove a grid or a minicluster is made by comparing its weight with a specified threshold. PGM-Step is therefore performed at each pruning time, as defined in Definition 11.
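
A minimal sketch of this pruning is given below. The thresholds are passed in as plain numbers (`grid_threshold` and `mic_threshold` are hypothetical parameter names) rather than computed from Definitions 10 and 11, and the weights are assumed to decay with the fading function $2^{-\lambda t}$.

```python
def decayed(weight, t_last, t_now, lam):
    """Weight at t_now of a summary last updated at t_last (assumed fading)."""
    return weight * 2.0 ** (-lam * (t_now - t_last))

def pgm_step(t_now, grids, miniclusters, grid_threshold, mic_threshold, lam):
    """Keep only the grids and miniclusters whose decayed weight is high enough."""
    kept_grids = {k: g for k, g in grids.items()
                  if decayed(g["weight"], g["t"], t_now, lam) >= grid_threshold}
    kept_mics = [m for m in miniclusters
                 if decayed(m["weight"], m["t"], t_now, lam) >= mic_threshold]
    return kept_grids, kept_mics

# Hypothetical state: each summary stores its weight and last update time.
grids = {(0, 0): {"weight": 2.0, "t": 9}, (5, 5): {"weight": 1.0, "t": 0}}
mics = [{"weight": 8.0, "t": 8}, {"weight": 4.0, "t": 0}]
grids, mics = pgm_step(10, grids, mics, grid_threshold=0.5,
                       mic_threshold=2.0, lam=0.5)
print(list(grids), mics)
```

Only the recently updated grid and minicluster survive; the two summaries that were last touched at time 0 have decayed below their thresholds and are released.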

4.3. FFC-Step of HDC-Stream

When a clustering request arrives, a variant of the DBSCAN algorithm is applied to the set of miniclusters maintained online to obtain the clustering result. Each minicluster is treated as a virtual point located at the center of the MIC and carrying the weight of the MIC. We adopt the concept of density connectivity from [21] to determine the final clusters: all density-connected MICs form a cluster. The variant of the DBSCAN algorithm takes two parameters.

Definition 13 (directly density-reachable). A is directly density-reachable from a with respect to and if . is the Euclidean distance between the centers of and .

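Definition 14 (density-reachable). A minicluster $MIC_p$ is density-reachable from a minicluster $MIC_q$, with respect to the two parameters of the modified DBSCAN, if there is a chain of miniclusters $MIC_1, \ldots, MIC_n$ such that $MIC_1 = MIC_q$, $MIC_n = MIC_p$, and each $MIC_{i+1}$ is directly density-reachable from $MIC_i$.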

Definition 15 (density-connected). A minicluster $MIC_p$ is density-connected to a minicluster $MIC_q$, with respect to the two parameters of the modified DBSCAN, if there is a minicluster $MIC_m$ such that both $MIC_p$ and $MIC_q$ are density-reachable from $MIC_m$.
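
To illustrate how density connectivity groups miniclusters into final clusters, the following sketch labels connected chains of minicluster centers. It is a deliberate simplification and not the modified DBSCAN used by HDC-Stream: every minicluster is treated as a core object, weights are ignored, and two miniclusters are taken to be directly density-reachable when their centers lie within a hypothetical distance `reach`.

```python
import math

def form_clusters(centers, reach):
    """Label miniclusters so that density-connected ones share a cluster id."""
    labels = [None] * len(centers)
    cluster_id = 0
    for start in range(len(centers)):
        if labels[start] is not None:
            continue
        labels[start] = cluster_id
        frontier = [start]
        while frontier:                      # expand along chains of close centers
            i = frontier.pop()
            for j in range(len(centers)):
                if labels[j] is None and math.dist(centers[i], centers[j]) <= reach:
                    labels[j] = cluster_id
                    frontier.append(j)
        cluster_id += 1
    return labels

# Hypothetical minicluster centers: a chain of three and one isolated center.
centers = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
labels = form_clusters(centers, reach=1.5)
print(labels)
```

The first and third centers are farther apart than `reach`, yet they end up in the same cluster because the second center links them, which is how chains of miniclusters can trace clusters of arbitrary shape.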

5. Experimental Evaluation

In this section, we evaluate HDC-Stream against two well-known existing methods, DenStream and D-Stream. We implemented HDC-Stream and the comparative methods in Java. All experiments were conducted on a 2.5 GHz machine with 4 GB of memory running Mac OS X. We first describe the datasets and the evaluation measures used, and then discuss detailed experiments on real and synthetic datasets.

5.1. Datasets

We evaluate the clustering quality, scalability, and sensitivity of the HDC-Stream algorithm on both real and synthetic datasets. We generated three synthetic datasets, DS1, DS2, and DS3, which are depicted in Figures 4(a), 4(b), and 4(c), respectively. DS1 has 10000 data points with 5% noise, DS2 has 10000 data points with 4% noise, and DS3 has 10000 data points with 5% noise. Finally, we generated an evolving data stream (EDS) by randomly selecting one of the datasets (DS1, DS2, and DS3) 10 times. In each iteration, the chosen dataset forms a 10000-point part of the data stream, so the total length of the evolving data stream is 100000.
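
The construction of the evolving data stream can be sketched as follows. The lists below are placeholders standing in for DS1, DS2, and DS3, and the function name and the random seed are assumptions made for the example.

```python
import random

def build_evolving_stream(datasets, rounds=10, seed=None):
    """Concatenate `rounds` randomly chosen datasets into one stream."""
    rng = random.Random(seed)
    stream = []
    for _ in range(rounds):
        stream.extend(rng.choice(datasets))
    return stream

# Placeholder datasets of 10000 points each, tagged with their source name.
ds1 = [("DS1", i) for i in range(10000)]
ds2 = [("DS2", i) for i in range(10000)]
ds3 = [("DS3", i) for i in range(10000)]
eds = build_evolving_stream([ds1, ds2, ds3], rounds=10, seed=0)
print(len(eds))
```

Because each round appends a whole dataset, the underlying distribution changes only at multiples of 10000 points, which is what makes the stream evolve over time.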

The real dataset is the KDD CUP99 Network Intrusion Detection dataset [29], of which all 34 continuous attributes out of the 42 available attributes are used. The dataset comes from the 1998 DARPA Intrusion Detection. It contains training data consisting of 7 weeks of network-based intrusions inserted in the normal data, and 2 weeks of network-based intrusions and normal data, for a total of 4,999,000 connection records described by 42 characteristics. KDD CUP99 has been used in [14, 17, 24, 27]; it is converted into a data stream by taking the input order of the data as the streaming order.

5.2. Evaluation Metrics

Cluster validity is an important issue in cluster analysis. Its objective is to assess the clustering results of the proposed algorithm by comparing them with those of existing well-known clustering algorithms. We adopt two popular measures, purity and normalized mutual information (NMI), to evaluate the quality of HDC-Stream.

5.2.1. Purity

The clustering quality is evaluated by the average purity of clusters, which is defined as follows: $$\text{purity} = \frac{\sum_{i=1}^{K} \frac{|C_i^d|}{|C_i|}}{K} \times 100\%,$$ where $K$ is the number of clusters, $|C_i^d|$ is the number of points with the dominant class label in cluster $i$, and $|C_i|$ is the number of points in cluster $i$. The purity is calculated only for the points arriving in a predefined window (the horizon), since the weight of points diminishes continuously.
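
As a small worked example of the average purity, the sketch below computes, for each cluster, the fraction of points carrying the dominant class label and averages these fractions over the clusters. The function name and the toy clusters are hypothetical, and every cluster is assumed to be nonempty.

```python
from collections import Counter

def average_purity(clusters):
    """Average over clusters of the dominant-label fraction, in percent."""
    fractions = []
    for labels in clusters:
        dominant_count = Counter(labels).most_common(1)[0][1]
        fractions.append(dominant_count / len(labels))
    return 100.0 * sum(fractions) / len(fractions)

# Each inner list holds the true class labels of the points in one cluster.
clusters = [["normal"] * 9 + ["neptune"], ["neptune"] * 4 + ["normal"] * 4]
purity = average_purity(clusters)
print(purity)
```

The first cluster is 90% pure and the second 50% pure, so the average purity is about 70%. Note that the average is taken over clusters, so a small impure cluster lowers the score as much as a large one.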

5.2.2. Normalized Mutual Information (NMI)

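The normalized mutual information (NMI) is a well-known information theoretic measure that assesses how similar two clusterings are. Given the true clustering $A$ and the grouping $B$ obtained by a clustering method, let $C$ be the confusion matrix whose element $C_{ij}$ is the number of records of cluster $i$ of $A$ that are also in cluster $j$ of $B$. The normalized mutual information, $\mathrm{NMI}(A, B)$, is defined as $$\mathrm{NMI}(A, B) = \frac{-2 \sum_{i=1}^{c_A} \sum_{j=1}^{c_B} C_{ij} \log\left(\frac{C_{ij} N}{C_{i\cdot} C_{\cdot j}}\right)}{\sum_{i=1}^{c_A} C_{i\cdot} \log\left(\frac{C_{i\cdot}}{N}\right) + \sum_{j=1}^{c_B} C_{\cdot j} \log\left(\frac{C_{\cdot j}}{N}\right)},$$ where $c_A$ ($c_B$) is the number of groups in the partition $A$ ($B$), $C_{i\cdot}$ ($C_{\cdot j}$) is the sum of the elements of $C$ in row $i$ (column $j$), and $N$ is the number of data points. If $A = B$, $\mathrm{NMI}(A, B) = 1$, and if $A$ and $B$ are completely different, $\mathrm{NMI}(A, B) = 0$.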
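
The NMI can be computed directly from the confusion matrix, as in the following sketch. It assumes the standard confusion-matrix form of the measure, in which the mutual information is normalized by the sum of the entropies of the two partitions; the function name and the two toy matrices are hypothetical. The sketch does not handle the degenerate case in which both partitions consist of a single group, where the denominator is zero.

```python
import math

def nmi(confusion):
    """Normalized mutual information computed from a confusion matrix."""
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    n = sum(row_sums)
    numerator = 0.0
    for i, row in enumerate(confusion):
        for j, c in enumerate(row):
            if c > 0:                        # empty cells contribute nothing
                numerator += c * math.log(c * n / (row_sums[i] * col_sums[j]))
    denominator = (sum(r * math.log(r / n) for r in row_sums if r > 0)
                   + sum(c * math.log(c / n) for c in col_sums if c > 0))
    return -2.0 * numerator / denominator

identical = nmi([[5, 0], [0, 5]])       # the two partitions coincide
independent = nmi([[5, 5], [5, 5]])     # the partitions carry no shared information
print(identical, independent)
```

The two toy matrices reproduce the limiting cases stated above: the value is 1 when the partitions coincide and 0 when they are statistically independent.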

The parameters of HDC-Stream adopt the following settings: decay factor , minimum number of points , and . The parameters for DenStream and D-Stream are chosen to be the same as those adopted in [24] and [14], respectively.

5.3. Evaluation of HDC-Stream on Synthetic Datasets

Figure 5 shows the purity results of HDC-Stream compared to DenStream and D-Stream on EDS data stream. In Figure 5(a), the stream speed is set to 2000 points per time unit and horizon . HDC-Stream shows a good clustering quality. Its clustering purity is higher than 97%. We also set the stream speed at 2000 points per time unit and horizon for EDS. Figure 5(b) shows similar results too. We conclude that HDC-Stream achieves much higher clustering quality than DenStream and D-Stream in two different horizons. For example, in horizon , time unit 50, HDC-Stream has 98% while DenStream and D-Stream have purity values as 82% and 78%, respectively.

The same is observed for normalized mutual information. Figure 6 shows the NMI values obtained by the three methods. We repeated the experiments with the same horizons and stream speed (Figures 6(a) and 6(b)). The results show a noticeably high NMI score for HDC-Stream: its value approaches 1 for both horizons. They also indicate that DenStream achieves better NMI than D-Stream.

We observed very good clustering quality for HDC-Stream, D-Stream, and DenStream when no noise is present in the dataset. In that case, purity values are always higher than 98% and all methods are insensitive to the horizon length.

5.4. Evaluation of HDC-Stream for Real Datasets

The comparison results among HDC-Stream and both DenStream and D-Stream on the Network Intrusion dataset are shown in Figure 7. The evaluation is defined based on the selected time units when the attacks happen on horizons 2 and 5, whereas the stream speed is 1000. For instance, in horizon and stream speed 1000, there are 99 teardrop attacks, 182 ipsweep attacks, 618 neptune attacks, and 4097 normal connections. HDC-Stream clearly outperforms DenStream and specifically D-Stream. The purity of HDC-Stream is always above 91%. For example, at time 55, the purity of HDC-Stream is about 95% which is higher than both DenStream (86%) and D-Stream (76%).

Figure 8 shows the normalized mutual information results on the Network Intrusion Detection dataset. The results were obtained by setting the horizon to 1 and 5, with a stream speed of 1000 (Figures 8(a) and 8(b)). The values of normalized mutual information for HDC-Stream approach 1 for both horizons. This shows that HDC-Stream detects the true class labels of the data more accurately than DenStream and D-Stream do.

5.5. Scalability Results
5.5.1. Execution Time

The execution time of HDC-Stream is influenced by the number of data points processed at each time unit, that is, the stream speed. Figure 9 shows the execution time in seconds on the Network Intrusion Detection dataset for HDC-Stream compared to DenStream and D-Stream, as the stream speed increases from 1000 to 10,000 data items.

DenStream has a higher processing time because its merging task is time consuming. HDC-Stream has a lower execution time than the other methods. The execution time of the other methods increases linearly with the stream speed.

5.5.2. Memory Usage

The memory usage of HDC-Stream is proportional to the total number of miniclusters and grids.

5.6. Sensitivity Analysis

An important parameter of HDC-Stream is the decay factor $\lambda$, which controls the importance of historical data. We test the quality of clustering for values of $\lambda$ ranging from 0.0078 to 1 (Figure 10). When $\lambda$ is too small or too large, the clustering quality becomes poor. For example, when $\lambda$ is very small, the purity is about 75%, and, when $\lambda$ is large, the points decay soon after their arrival, so only a small number of recent points contribute to the final results and the result is not very good. However, the quality of HDC-Stream is still higher than that of DenStream and D-Stream. The results show that if $\lambda$ varies from 0.0625 to 0.25, the clustering quality is good, stable, and always above 96%.

6. Discussion

We proposed a hybrid method for clustering evolving data streams that achieves high quality and low computation time compared to existing methods. The algorithm clusters data streams in three distinct steps. In existing methods such as DenStream, when a new data point arrives, time is spent searching two lists of microclusters, the potential and the outlier microclusters, to find a suitable microcluster. If none is found, DenStream forms a new microcluster for that data point, which may be the seed of an outlier and hence lowers the clustering quality. HDC-Stream, in contrast, searches only the potential list; if it cannot find a suitable minicluster, the data point is mapped to the grid, which serves as the outlier buffer. Using the grid in this way reduces the time complexity of merging. We implemented the grid list as a 2-3-4 tree, a balanced search tree in which search and update take time logarithmic in the size of the list.

Because we reduced the number of comparisons, the time complexity of merging into the minicluster list depends only on the number of MICs, which is smaller than the number of microclusters in DenStream, since that algorithm keeps two lists for potential and outlier microclusters. Furthermore, we increased the clustering quality by forming miniclusters from data points that are surely not outliers. When the grid density reaches the specified threshold, the data points inside that grid form a minicluster. Therefore, we do not need to form a minicluster for a newly arrived data point that cannot be placed in any existing minicluster. The quality is also increased since miniclusters are never formed from an outlier.

Finally, the evaluation results show that using a hybrid method for clustering evolving data streams improves the clustering quality and reduces the computation time.

7. Conclusion

In this paper, we proposed a hybrid density-based clustering algorithm for Internet of Things (IoT) streams. Our hybrid algorithm has three steps in which the new data point is either mapped to grid or merged to an existing minicluster, the outliers are removed, and finally arbitrary shape clusters are formed using miniclusters by a modified DBSCAN. Our method is a hybrid one, which uses density grid-based clustering and density microclustering to improve the computation time and quality. The evaluation results on synthetic and real datasets show that it has high quality with low computation time for merging. However, HDC-Stream is not suitable to be used in distributed environments.

Our future work will focus on the improvement of HDC-Stream as a distributed density-based data stream clustering algorithm.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research is supported by High Impact Research (HIR) Grant, University of Malaya, no. UM.C/625/HIR/MOHE/SC/13/2 from Ministry of Higher Education.