Clustering of time series data—a survey
Introduction
The goal of clustering is to identify structure in an unlabeled data set by objectively organizing data into homogeneous groups in which the within-group-object similarity is maximized and the between-group-object similarity is minimized. Clustering is necessary when no labeled data are available, regardless of whether the data are binary, categorical, numerical, interval, ordinal, relational, textual, spatial, temporal, spatio-temporal, image, multimedia, or mixtures of these types. Data are called static if all their feature values do not change with time, or change negligibly. The bulk of clustering analysis has been performed on static data, and most, if not all, clustering programs developed to date, whether standalone or part of a larger data analysis or data mining suite, work only with static data. Han and Kamber [1] classified clustering methods developed for handling various static data into five major categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. A brief description of each category follows.
Given a set of n unlabeled data tuples, a partitioning method constructs k (k ≤ n) partitions of the data, where each partition represents a cluster containing at least one object. The partition is crisp if each object belongs to exactly one cluster, or fuzzy if one object is allowed to belong to more than one cluster, to a different degree. Two renowned heuristic methods for crisp partitions are the k-means algorithm [2], in which each cluster is represented by the mean value of the objects in the cluster, and the k-medoids algorithm [3], in which each cluster is represented by its most centrally located object. The two counterparts for fuzzy partitions are the fuzzy c-means algorithm [4] and the fuzzy c-medoids algorithm [5]. These heuristic algorithms work well for finding spherical-shaped clusters and for small- to medium-sized data sets. To find clusters with non-spherical or other complex shapes, specially designed algorithms such as the Gustafson–Kessel and adaptive fuzzy clustering algorithms [6], or the density-based methods introduced below, are needed. Most genetic clustering methods implement the spirit of partitioning methods, especially the k-means algorithm [7], [8], the k-medoids algorithm [9], and the fuzzy c-means algorithm [10].
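As a concrete illustration of the partitioning approach (not code from the survey), a minimal Python sketch of Lloyd's heuristic for crisp k-means follows; the first-k-points initialization is a deliberate simplification, since practical implementations randomize or use smarter seeding:

```python
import math

def kmeans(points, k, iters=20):
    """Lloyd's heuristic for crisp k-means: alternate between assigning each
    point to its nearest center and moving each center to its cluster mean."""
    centers = list(points[:k])   # naive initialization (real code randomizes)
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: each center becomes the mean of its cluster's members.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centers, clusters = kmeans(pts, 2)
```

The fixed iteration budget stands in for a proper convergence test on cluster assignments.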
A hierarchical clustering method works by grouping data objects into a tree of clusters. There are generally two types of hierarchical clustering methods: agglomerative and divisive. Agglomerative methods start by placing each object in its own cluster and then merge clusters into larger and larger clusters, until all objects are in a single cluster or until certain termination conditions such as the desired number of clusters are satisfied. Divisive methods do just the opposite. A pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. For improving the clustering quality of hierarchical methods, there is a trend to integrate hierarchical clustering with other clustering techniques. Both Chameleon [11] and CURE [12] perform careful analysis of object “linkages” at each hierarchical partitioning whereas BIRCH [13] uses iterative relocation to refine the results obtained by hierarchical agglomeration.
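The agglomerative bottom-up process can be sketched as follows; this toy version uses single linkage (an illustrative choice, not one prescribed by the survey) and merges the closest pair of clusters until the desired number remains:

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering: start with each object in its own cluster and
    repeatedly merge the two closest clusters (single linkage: the distance
    between clusters is the distance between their closest members)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge is final: no later adjustment
    return clusters

groups = single_linkage([(0,), (1,), (10,), (11,), (50,)], 3)
```

The comment on the merge line reflects the weakness noted above: a pure hierarchical method never revisits a merge decision.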
The general idea of density-based methods such as DBSCAN [14] is to continue growing a cluster as long as the density (number of objects or data points) in the “neighborhood” exceeds some threshold. Rather than producing a clustering explicitly, OPTICS [15] computes an augmented cluster ordering for automatic and interactive cluster analysis. The ordering contains information that is equivalent to density-based clustering obtained from a wide range of parameter settings, thus overcoming the difficulty of selecting parameter values.
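The density-based growing idea behind DBSCAN can be sketched as follows (a simplified rendition, not the published algorithm; the parameter names eps and min_pts mirror the usual DBSCAN formulation):

```python
import math

def dbscan(points, eps, min_pts):
    """Simplified DBSCAN: a point is 'core' if its eps-neighborhood contains at
    least min_pts points; a cluster keeps growing while the density threshold
    is met, and points in no dense neighborhood are labeled noise (-1)."""
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]
    labels = {}                      # point index -> cluster id, or -1 for noise
    cid = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # noise, unless later absorbed as a border point
            continue
        labels[i] = cid
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j, -1) == -1:   # unvisited, or previously marked noise
                labels[j] = cid
                nb = neighbors(j)
                if len(nb) >= min_pts:    # j is also core: keep expanding
                    queue.extend(n for n in nb if labels.get(n, -1) == -1)
        cid += 1
    return labels

labels = dbscan([(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)], eps=1.5, min_pts=3)
```

The sensitivity of the result to eps and min_pts is exactly the parameter-selection difficulty that OPTICS was designed to sidestep.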
Grid-based methods quantize the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. A typical example of the grid-based approach is STING [16], which uses several levels of rectangular cells corresponding to different levels of resolution. Statistical information regarding the attributes in each cell is pre-computed and stored. A query process usually starts at a relatively high level of the hierarchical structure. For each cell in the current layer, a confidence interval is computed reflecting the cell's relevance to the given query. Irrelevant cells are removed from further consideration, and the query process continues to the next lower level for the relevant cells until the bottom layer is reached.
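The grid-based idea of pre-computing per-cell statistics and answering queries from the summaries alone can be illustrated with a minimal, STING-inspired sketch (a single-layer simplification; the cell size, the stored statistics, and the density query are illustrative assumptions):

```python
import statistics
from collections import defaultdict

def build_grid(points, cell_size):
    """Quantize 2-D points into square cells and pre-compute per-cell summary
    statistics, in the spirit of STING's lowest summary layer."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    summary = {key: {"count": len(pts),
                     "mean_x": statistics.fmean(p[0] for p in pts)}
               for key, pts in cells.items()}
    return cells, summary

def dense_cells(summary, min_count):
    """Answer a density query from the summaries alone: sparse (irrelevant)
    cells are pruned without ever touching the raw points."""
    return [key for key, s in summary.items() if s["count"] >= min_count]

cells, summary = build_grid([(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (5.5, 5.5)], 1.0)
```

In STING proper, several such layers at coarser resolutions sit above this one, and the query descends only into cells judged relevant at the layer above.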
Model-based methods assume a model for each of the clusters and attempt to best fit the data to the assumed model. There are two major approaches to model-based methods: the statistical approach and the neural network approach. An example of the statistical approach is AutoClass [17], which uses Bayesian statistical analysis to estimate the number of clusters. Two prominent methods of the neural network approach to clustering are competitive learning, including ART [18], and self-organizing feature maps [19].
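As an illustration of the statistical model-based approach (a sketch, not AutoClass itself), the following fits a two-component, one-dimensional Gaussian mixture by expectation-maximization; the min/max initialization and fixed iteration count are simplifying assumptions:

```python
import math

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture: the E-step computes each
    point's responsibility under the current model, the M-step refits the
    mixture weights, means, and variances from those responsibilities."""
    mu = [min(data), max(data)]          # crude but deterministic initialization
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    def pdf(x, m, s):
        return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weight[k] * pdf(x, mu[k], sigma[k]) for k in range(2)]
            tot = sum(p)
            resp.append([pk / tot for pk in p])
        # M-step: re-estimate the parameters of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)   # floor to avoid collapse
    return mu, sigma, weight

mu, sigma, weight = em_gmm_1d([0.0, 0.1, -0.1, 5.0, 5.1, 4.9])
```

A full model-based clusterer would, like AutoClass, also score candidate numbers of components rather than fixing two in advance.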
Unlike static data, a time series comprises feature values that change with time. Time series data are of interest because of their pervasiveness across areas ranging from science, engineering, business, finance, economics, and health care to government. Given a set of unlabeled time series, it is often desirable to determine groups of similar time series. These unlabeled time series could be monitoring data collected during different periods from a particular process or from more than one process. The process could be natural, biological, business, or engineered. Work devoted to the cluster analysis of time series is relatively scant compared with that focusing on static data; however, there seems to be a trend of increased activity.
This paper intends to introduce the basics of time series clustering and to provide an overview of the time series clustering work done so far. In the next section, the basics of time series clustering are presented. Details of the three major components required to perform time series clustering are given in three subsections: clustering algorithms in Section 2.1, data similarity/distance measurement in Section 2.2, and performance evaluation criteria in Section 2.3. Section 3 categorizes and surveys time series clustering works that have been published in the open literature. Several possible topics for future research are discussed in Section 4, and finally the paper is concluded. In Appendix A, the application areas reported are summarized with pointers to openly available time series data.
Basics of time series clustering
Just like static data clustering, time series clustering requires a clustering algorithm or procedure to form clusters given a set of unlabeled data objects, and the choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. As far as time series data are concerned, distinctions can be made as to whether the data are discrete-valued or real-valued, uniformly or non-uniformly sampled, univariate or multivariate, and whether data series are of equal or unequal length.
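Among the similarity/distance measures covered in Section 2.2, dynamic time warping (DTW) is a common choice when series are non-uniformly sampled or of unequal length, since it aligns series elastically rather than point-by-point; a minimal sketch of the classic quadratic-time recurrence:

```python
def dtw(a, b):
    """Dynamic time warping distance between two real-valued series of possibly
    unequal length, via the standard O(len(a)*len(b)) dynamic program."""
    INF = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = cost of the best warping path aligning a[:i] with b[:j].
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # A step may advance in a, in b, or in both (diagonal).
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Unlike Euclidean distance, dtw([1, 2, 3], [1, 2, 2, 3]) is defined despite the length mismatch, and is zero because the repeated 2 can be warped onto a single point.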
Major time series clustering approaches
This paper groups previously developed time series clustering methods into three major categories depending upon whether they work directly with raw data, indirectly with features extracted from the raw data, or indirectly with models built from the raw data. The essence of each study is summarized in this section. Studies using the clustering algorithms, similarity/dissimilarity measures, and evaluation criteria reviewed in Sections 2.1, 2.2, and 2.3, respectively, are italicized.
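The feature-based route in particular can be illustrated with a toy sketch: each raw series, whatever its length, is reduced to a fixed-length feature vector (here just mean and standard deviation, an illustrative choice), after which any static-data clustering algorithm applies; the greedy threshold clustering below is merely a stand-in for such an algorithm:

```python
import statistics

def features(series):
    """Map a raw series of arbitrary length to a fixed-length feature vector
    (mean and population standard deviation; real studies use richer features)."""
    return (statistics.fmean(series), statistics.pstdev(series))

def cluster_by_features(series_list, threshold):
    """Toy static-data clustering on the extracted features: greedily assign
    each series to the first cluster whose representative feature vector is
    within the threshold (Chebyshev distance), else open a new cluster."""
    reps, groups = [], []
    for s in series_list:
        f = features(s)
        for i, r in enumerate(reps):
            if max(abs(f[0] - r[0]), abs(f[1] - r[1])) <= threshold:
                groups[i].append(s)
                break
        else:
            reps.append(f)
            groups.append([s])
    return groups

groups = cluster_by_features([[0, 0, 1], [0, 1, 0], [10, 10, 11], [10, 11, 10]], threshold=1.0)
```

Note that the series need not be of equal length once they are summarized as feature vectors, which is one of the chief attractions of this category.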
Discussion
Among all the papers surveyed the studies of Ramoni et al. [50], [51] are the only two that assumed discrete-valued time series data. The work of Kumar et al. [23] is the only one that takes data error into account. Most studies address evenly sampled data while Möller-Levet et al. [22] are the only ones who consider unevenly sampled data. Note that some studies such as Maharaj [49] and Baragona [30] are restricted to stationary time series only whereas most others are not. None of the papers
Concluding remarks
In this paper we surveyed most recent studies on the subject of time series clustering. These studies are organized into three major categories depending upon whether they work directly with the original data (either in the time or frequency domain), indirectly with features extracted from the raw data, or indirectly with models built from the raw data. The basics of time series clustering, including the three key components of time series clustering studies are highlighted in this survey: the
References (65)
- et al., A massively parallel architecture for a self-organizing neural pattern recognition machine, Comput. Vision Graphics Image Process. (1987)
- On the Kullback–Leibler information divergence of locally stationary processes, Stochastic Process. Appl. (1996)
- Time–frequency clustering and discriminant analysis, Stat. Probab. Lett. (2003)
- et al., Using cluster analysis to classify time series, Physica D (1992)
- et al., On clustering fMRI time series, Neuroimage (1999)
- et al., Simultaneous grouping of parts and machines with an integrated fuzzy clustering method, Fuzzy Sets Syst. (2002)
- et al., Efficient clustering of large data sets, Pattern Recognition (2001)
- et al., Data Mining: Concepts and Techniques (2001)
- J. MacQueen, Some methods for classification and analysis of multivariate observations, in: L.M. LeCam, J. Neyman...
- et al., Finding Groups in Data: An Introduction to Cluster Analysis (1990)
- Pattern Recognition with Fuzzy Objective Function Algorithms
- Low-complexity fuzzy relational clustering algorithms for web mining, IEEE Trans. Fuzzy Systems
- A note on the Gustafson–Kessel and adaptive fuzzy clustering algorithms, IEEE Trans. Fuzzy Systems
- Genetic k-means algorithms, IEEE Trans. Syst. Man Cybernet. B: Cybernet.
- A genetic hard c-means clustering algorithm, Dyn. Continuous Discrete Impulsive Syst. Ser. B: Appl. Algorithms
- Clustering with a genetically optimized approach, IEEE Trans. Evolutionary Computat.
- Chameleon: hierarchical clustering using dynamic modeling, Computer (August)
- Bayesian classification (AutoClass): theory and results
- The self-organizing map, Proc. IEEE
- A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. Cybernet.
- A new correlation-based fuzzy logic clustering algorithm for fMRI, Magn. Resonance Med.
- Discrimination and clustering for multivariate time series, J. Amer. Stat. Assoc.
- Some new indexes of cluster validity, IEEE Trans. Syst. Man Cybernet. B: Cybernet.
- Performance evaluation of some clustering algorithms and validity indices, IEEE Trans. Pattern Anal. Mach. Intell.
About the Author—T. WARREN LIAO received his Ph.D. in Industrial Engineering from Lehigh University in 1990 and is currently a Professor with the Industrial Engineering Department, Louisiana State University. His research interests include soft computing, pattern recognition, data mining, and their applications in manufacturing. He has more than 50 refereed journal publications and was the guest editor for several journals including Journal of Intelligent Manufacturing, Computers and Industrial Engineering, Applied Soft Computing, and International Journal of Industrial Engineering.