Clustering of time series data—a survey
Introduction
The goal of clustering is to identify structure in an unlabeled data set by objectively organizing data into homogeneous groups in which the within-group-object similarity is maximized and the between-group-object similarity is minimized. Clustering is necessary when no labeled data are available, regardless of whether the data are binary, categorical, numerical, interval, ordinal, relational, textual, spatial, temporal, spatio-temporal, image, multimedia, or mixtures of these types. Data are called static if all their feature values do not change with time, or change negligibly. The bulk of clustering analysis has been performed on static data, and most, if not all, clustering programs developed to date, whether standalone or part of a larger data analysis or data mining suite, work only with static data. Han and Kamber [1] classified clustering methods developed for handling various static data into five major categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. A brief description of each category follows.
Given a set of n unlabeled data tuples, a partitioning method constructs k (k ≤ n) partitions of the data, where each partition represents a cluster containing at least one object. The partition is crisp if each object belongs to exactly one cluster, or fuzzy if one object is allowed to belong to more than one cluster, to a different degree. Two renowned heuristic methods for crisp partitions are the k-means algorithm [2], in which each cluster is represented by the mean value of the objects in the cluster, and the k-medoids algorithm [3], in which each cluster is represented by its most centrally located object. The two counterparts for fuzzy partitions are the fuzzy c-means algorithm [4] and the fuzzy c-medoids algorithm [5]. These heuristic algorithms work well for finding spherical-shaped clusters and for small- to medium-sized data sets. To find clusters with non-spherical or other complex shapes, specially designed algorithms such as the Gustafson–Kessel and adaptive fuzzy clustering algorithms [6], or the density-based methods introduced below, are needed. Most genetic clustering methods implement the spirit of partitioning methods, especially the k-means algorithm [7], [8], the k-medoids algorithm [9], and the fuzzy c-means algorithm [10].
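As a concrete illustration of the partitioning approach (not code from the survey), a minimal Python sketch of Lloyd's heuristic for crisp k-means follows; the first-k-points initialization is a deliberate simplification, since practical implementations randomize or use smarter seeding:

```python
import math

def kmeans(points, k, iters=20):
    """Lloyd's heuristic for crisp k-means: alternate between assigning each
    point to its nearest center and moving each center to its cluster mean."""
    centers = list(points[:k])   # naive initialization (real code randomizes)
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: each center becomes the mean of its cluster's members.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centers, clusters = kmeans(pts, 2)
```

The fixed iteration budget stands in for a proper convergence test on cluster assignments.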
A hierarchical clustering method works by grouping data objects into a tree of clusters. There are generally two types of hierarchical clustering methods: agglomerative and divisive. Agglomerative methods start by placing each object in its own cluster and then merge clusters into larger and larger clusters, until all objects are in a single cluster or until certain termination conditions such as the desired number of clusters are satisfied. Divisive methods do just the opposite. A pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. For improving the clustering quality of hierarchical methods, there is a trend to integrate hierarchical clustering with other clustering techniques. Both Chameleon [11] and CURE [12] perform careful analysis of object “linkages” at each hierarchical partitioning whereas BIRCH [13] uses iterative relocation to refine the results obtained by hierarchical agglomeration.
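The agglomerative bottom-up process can be sketched as follows; this toy version uses single linkage (an illustrative choice, not one prescribed by the survey) and merges the closest pair of clusters until the desired number remains:

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering: start with each object in its own cluster and
    repeatedly merge the two closest clusters (single linkage: the distance
    between clusters is the distance between their closest members)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge is final: no later adjustment
    return clusters

groups = single_linkage([(0,), (1,), (10,), (11,), (50,)], 3)
```

The comment on the merge line reflects the weakness noted above: a pure hierarchical method never revisits a merge decision.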
The general idea of density-based methods such as DBSCAN [14] is to continue growing a cluster as long as the density (number of objects or data points) in the “neighborhood” exceeds some threshold. Rather than producing a clustering explicitly, OPTICS [15] computes an augmented cluster ordering for automatic and interactive cluster analysis. The ordering contains information that is equivalent to density-based clustering obtained from a wide range of parameter settings, thus overcoming the difficulty of selecting parameter values.
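The density-based growing idea behind DBSCAN can be sketched as follows (a simplified rendition, not the published algorithm; the parameter names eps and min_pts mirror the usual DBSCAN formulation):

```python
import math

def dbscan(points, eps, min_pts):
    """Simplified DBSCAN: a point is 'core' if its eps-neighborhood contains at
    least min_pts points; a cluster keeps growing while the density threshold
    is met, and points in no dense neighborhood are labeled noise (-1)."""
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]
    labels = {}                      # point index -> cluster id, or -1 for noise
    cid = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # noise, unless later absorbed as a border point
            continue
        labels[i] = cid
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j, -1) == -1:   # unvisited, or previously marked noise
                labels[j] = cid
                nb = neighbors(j)
                if len(nb) >= min_pts:    # j is also core: keep expanding
                    queue.extend(n for n in nb if labels.get(n, -1) == -1)
        cid += 1
    return labels

labels = dbscan([(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)], eps=1.5, min_pts=3)
```

The sensitivity of the result to eps and min_pts is exactly the parameter-selection difficulty that OPTICS was designed to sidestep.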
Grid-based methods quantize the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. A typical example of the grid-based approach is STING [16], which uses several levels of rectangular cells corresponding to different levels of resolution. Statistical information regarding the attributes in each cell is pre-computed and stored. A query process usually starts at a relatively high level of the hierarchical structure. For each cell in the current layer, a confidence interval is computed reflecting the cell's relevance to the given query. Irrelevant cells are removed from further consideration, and the query process continues to the next lower level for the relevant cells until the bottom layer is reached.
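The grid-based idea of pre-computing per-cell statistics and answering queries from the summaries alone can be illustrated with a minimal, STING-inspired sketch (a single-layer simplification; the cell size, the stored statistics, and the density query are illustrative assumptions):

```python
import statistics
from collections import defaultdict

def build_grid(points, cell_size):
    """Quantize 2-D points into square cells and pre-compute per-cell summary
    statistics, in the spirit of STING's lowest summary layer."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    summary = {key: {"count": len(pts),
                     "mean_x": statistics.fmean(p[0] for p in pts)}
               for key, pts in cells.items()}
    return cells, summary

def dense_cells(summary, min_count):
    """Answer a density query from the summaries alone: sparse (irrelevant)
    cells are pruned without ever touching the raw points."""
    return [key for key, s in summary.items() if s["count"] >= min_count]

cells, summary = build_grid([(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (5.5, 5.5)], 1.0)
```

In STING proper, several such layers at coarser resolutions sit above this one, and the query descends only into cells judged relevant at the layer above.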
Model-based methods assume a model for each of the clusters and attempt to best fit the data to the assumed model. There are two major approaches to model-based methods: the statistical approach and the neural network approach. An example of the statistical approach is AutoClass [17], which uses Bayesian statistical analysis to estimate the number of clusters. Two prominent methods of the neural network approach to clustering are competitive learning, including ART [18], and self-organizing feature maps [19].
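As an illustration of the statistical model-based approach (a sketch, not AutoClass itself), the following fits a two-component, one-dimensional Gaussian mixture by expectation-maximization; the min/max initialization and fixed iteration count are simplifying assumptions:

```python
import math

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture: the E-step computes each
    point's responsibility under the current model, the M-step refits the
    mixture weights, means, and variances from those responsibilities."""
    mu = [min(data), max(data)]          # crude but deterministic initialization
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    def pdf(x, m, s):
        return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weight[k] * pdf(x, mu[k], sigma[k]) for k in range(2)]
            tot = sum(p)
            resp.append([pk / tot for pk in p])
        # M-step: re-estimate the parameters of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)   # floor to avoid collapse
    return mu, sigma, weight

mu, sigma, weight = em_gmm_1d([0.0, 0.1, -0.1, 5.0, 5.1, 4.9])
```

A full model-based clusterer would, like AutoClass, also score candidate numbers of components rather than fixing two in advance.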
Unlike static data, a time series comprises feature values that change with time. Time series data are of interest because of their pervasiveness across areas ranging from science, engineering, business, finance, economics, and health care to government. Given a set of unlabeled time series, it is often desirable to determine groups of similar time series. These unlabeled time series could be monitoring data collected during different periods from a particular process or from more than one process. The process could be natural, biological, business, or engineered. Work devoted to the cluster analysis of time series is relatively scant compared with that focusing on static data; however, there seems to be a trend of increased activity.
This paper intends to introduce the basics of time series clustering and to provide an overview of the time series clustering work done so far. In the next section, the basics of time series clustering are presented. Details of the three major components required to perform time series clustering are given in three subsections: clustering algorithms in Section 2.1, data similarity/distance measurement in Section 2.2, and performance evaluation criteria in Section 2.3. Section 3 categorizes and surveys time series clustering works that have been published in the open literature. Several possible topics for future research are discussed in Section 4, and finally the paper is concluded. In Appendix A, the application areas reported are summarized with pointers to openly available time series data.
Basics of time series clustering
Just like static data clustering, time series clustering requires a clustering algorithm or procedure to form clusters given a set of unlabeled data objects, and the choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. As far as time series data are concerned, distinctions can be made as to whether the data are discrete-valued or real-valued, uniformly or non-uniformly sampled, univariate or multivariate, and whether data series are of equal or unequal length.
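Among the similarity/distance measures covered in Section 2.2, dynamic time warping (DTW) is a common choice when series are non-uniformly sampled or of unequal length, since it aligns series elastically rather than point-by-point; a minimal sketch of the classic quadratic-time recurrence:

```python
def dtw(a, b):
    """Dynamic time warping distance between two real-valued series of possibly
    unequal length, via the standard O(len(a)*len(b)) dynamic program."""
    INF = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = cost of the best warping path aligning a[:i] with b[:j].
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # A step may advance in a, in b, or in both (diagonal).
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Unlike Euclidean distance, dtw([1, 2, 3], [1, 2, 2, 3]) is defined despite the length mismatch, and is zero because the repeated 2 can be warped onto a single point.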
Major time series clustering approaches
This paper groups previously developed time series clustering methods into three major categories depending upon whether they work directly with raw data, indirectly with features extracted from the raw data, or indirectly with models built from the raw data. The essence of each study is summarized in this section. Studies using the clustering algorithms, similarity/dissimilarity measures, and evaluation criteria reviewed in Sections 2.1, 2.2, and 2.3, respectively, are italicized.
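The feature-based route in particular can be illustrated with a toy sketch: each raw series, whatever its length, is reduced to a fixed-length feature vector (here just mean and standard deviation, an illustrative choice), after which any static-data clustering algorithm applies; the greedy threshold clustering below is merely a stand-in for such an algorithm:

```python
import statistics

def features(series):
    """Map a raw series of arbitrary length to a fixed-length feature vector
    (mean and population standard deviation; real studies use richer features)."""
    return (statistics.fmean(series), statistics.pstdev(series))

def cluster_by_features(series_list, threshold):
    """Toy static-data clustering on the extracted features: greedily assign
    each series to the first cluster whose representative feature vector is
    within the threshold (Chebyshev distance), else open a new cluster."""
    reps, groups = [], []
    for s in series_list:
        f = features(s)
        for i, r in enumerate(reps):
            if max(abs(f[0] - r[0]), abs(f[1] - r[1])) <= threshold:
                groups[i].append(s)
                break
        else:
            reps.append(f)
            groups.append([s])
    return groups

groups = cluster_by_features([[0, 0, 1], [0, 1, 0], [10, 10, 11], [10, 11, 10]], threshold=1.0)
```

Note that the series need not be of equal length once they are summarized as feature vectors, which is one of the chief attractions of this category.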
Discussion
Among all the papers surveyed the studies of Ramoni et al. [50], [51] are the only two that assumed discrete-valued time series data. The work of Kumar et al. [23] is the only one that takes data error into account. Most studies address evenly sampled data while Möller-Levet et al. [22] are the only ones who consider unevenly sampled data. Note that some studies such as Maharaj [49] and Baragona [30] are restricted to stationary time series only whereas most others are not. None of the papers
Concluding remarks
In this paper we surveyed most recent studies on the subject of time series clustering. These studies are organized into three major categories depending upon whether they work directly with the original data (either in the time or frequency domain), indirectly with features extracted from the raw data, or indirectly with models built from the raw data. The basics of time series clustering, including the three key components of time series clustering studies are highlighted in this survey: the
References (65)
- et al., A massively parallel architecture for a self-organizing neural pattern recognition machine, Comput. Vision Graphics Image Process. (1987)
- On the Kullback–Leibler information divergence of locally stationary processes, Stochastic Process. Appl. (1996)
- Time–frequency clustering and discriminant analysis, Stat. Probab. Lett. (2003)
- et al., Using cluster analysis to classify time series, Physica D (1992)
- et al., On clustering fMRI time series, Neuroimage (1999)
- et al., Simultaneous grouping of parts and machines with an integrated fuzzy clustering method, Fuzzy Sets Syst. (2002)
- et al., Efficient clustering of large data sets, Pattern Recognition (2001)
- et al., Data Mining: Concepts and Techniques (2001)
- J. MacQueen, Some methods for classification and analysis of multivariate observations, in: L.M. LeCam, J. Neyman...
- et al., Finding Groups in Data: An Introduction to Cluster Analysis (1990)
- Pattern Recognition with Fuzzy Objective Function Algorithms
- Low-complexity fuzzy relational clustering algorithms for web mining, IEEE Trans. Fuzzy Systems
- A note on the Gustafson–Kessel and adaptive fuzzy clustering algorithms, IEEE Trans. Fuzzy Systems
- Genetic k-means algorithms, IEEE Trans. Syst. Man Cybernet. B: Cybernet.
- A genetic hard c-means clustering algorithm, Dyn. Continuous Discrete Impulsive Syst. Ser. B: Appl. Algorithms
- Clustering with a genetically optimized approach, IEEE Trans. Evolutionary Computat.
- Chameleon: hierarchical clustering using dynamic modeling, Computer (August)
- Bayesian classification (AutoClass): theory and results
- The self-organizing map, Proc. IEEE
- A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. Cybernet.
- A new correlation-based fuzzy logic clustering algorithm for fMRI, Magn. Resonance Med.
- Discrimination and clustering for multivariate time series, J. Amer. Stat. Assoc.
- Some new indexes of cluster validity, IEEE Trans. Syst. Man Cybernet. B: Cybernet.
- Performance evaluation of some clustering algorithms and validity indices, IEEE Trans. Pattern Anal. Mach. Intell.
About the Author—T. WARREN LIAO received his Ph.D. in Industrial Engineering from Lehigh University in 1990 and is currently a Professor with the Industrial Engineering Department, Louisiana State University. His research interests include soft computing, pattern recognition, data mining, and their applications in manufacturing. He has more than 50 refereed journal publications and was the guest editor for several journals including Journal of Intelligent Manufacturing, Computers and Industrial Engineering, Applied Soft Computing, and International Journal of Industrial Engineering.