A periodogram-based metric for time series classification

https://doi.org/10.1016/j.csda.2005.04.012

Abstract

The statistical discrimination and clustering literature has studied the problem of identifying similarities in time series data. Some studies use non-parametric approaches to split a set of time series into clusters by looking at their Euclidean distances in the space of points. A new measure of distance between time series, based on the normalized periodogram, is proposed. Simulation results comparing this measure with other parametric and non-parametric metrics are provided. In particular, the classification of time series as stationary or non-stationary is discussed. The use of both hierarchical and non-hierarchical clustering algorithms is considered. An illustrative example with economic time series data is also presented.

Introduction

Classification and clustering of time series is becoming an important area of research in several fields, such as economics, marketing, business, finance, medicine, biology, physics, psychology, and zoology, among others. For example, in economics we may be interested in classifying the economic situation of a country by looking at some time series indicators, such as Gross National Product, investment expenditure, disposable income, unemployment rate or inflation rate. In medicine, a patient may be classified into different classes using the information from an electrocardiogram time series.

The problem of identifying similarities or dissimilarities in time series data has been studied in the discrimination and clustering literature (see for instance Johnson and Wichern, 1992). Some studies use non-parametric approaches for splitting a set of time series into clusters by looking at their Euclidean distances in the space of points. As pointed out by Galeano and Peña (2000), this metric has the important limitation of being invariant to transformations that modify the order of observations over time, and, therefore, it does not take into account the correlation structure of the time series. Piccolo (1990) introduced a metric for ARIMA models based on the autoregressive representation and applied this measure to the identification of similarities between industrial production series. Tong and Dabas (1990) investigated the affinity among some linear and non-linear fitted models by applying classical clustering techniques to the estimated residuals. Diggle and Fisher (1991) introduced a non-parametric approach to compare the spectra of two time series based on the underlying cumulative periodograms. Diggle and al Wasel (1997) developed inference methods in spectral analysis based on the likelihood ratio to compare replicated time series data. Kakizawa et al. (1998) proposed parametric models for discriminating and clustering multivariate time series, with applications to environmental data (for discriminant analysis for time series, see also Shumway and Unger, 1974, Shumway, 1982, Dargahi-Noubary and Laycock, 1981, Dargahi-Noubary, 1992, Zhang and Taniguchi, 1994). Maharaj (2000) used a test of hypothesis in the comparison of two stationary time series based on the autoregressive parameters and proposed a classification method using the p-value of this test as a measure of similarity. Maharaj (2002) compared two non-stationary time series using the evolutionary spectra approach in order to take into account structural changes over time.
Other related works on clustering of time series are by Bohte et al. (1980), Kosmelj and Batagelj (1990), Shaw and King (1992), Maharaj (1999) and Xiong and Yeung (2004).

In this paper, we propose a metric based on the normalized periodogram and we use it for time series classification. We provide simulation results comparing this metric to the one by Piccolo (1990) and the ones based on autocorrelation, partial autocorrelation and inverse autocorrelation coefficients. In particular, we discuss the classification of time series as stationary or as non-stationary.
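As a minimal sketch of such a periodogram-based distance, the following assumes the metric is the Euclidean distance between normalized periodogram ordinates at the Fourier frequencies (the paper also considers a version based on the logarithm of the normalized periodogram; the function names here are illustrative, not from the paper):

```python
import numpy as np

def normalized_periodogram(x):
    """Periodogram ordinates at the Fourier frequencies w_j = 2*pi*j/n,
    j = 1, ..., floor(n/2), normalized by the sample variance."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = n // 2
    # I(w_j) = |sum_t x_t exp(-i w_j t)|^2 / n, via the FFT of the centered series
    fft = np.fft.fft(x - x.mean())
    I = (np.abs(fft[1:m + 1]) ** 2) / n
    return I / np.var(x)  # normalization by gamma_hat_0, the sample variance

def periodogram_distance(x, y):
    """Euclidean distance between the normalized periodograms of x and y
    (both series assumed to have the same length)."""
    px, py = normalized_periodogram(x), normalized_periodogram(y)
    return np.sqrt(np.sum((px - py) ** 2))
```

The distance is zero for identical series, symmetric, and depends on the series only through their spectral shape, so it is not invariant to reorderings of the observations the way the plain Euclidean distance is.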

The remainder of the paper is organized as follows. In Section 2 we briefly discuss previous related methods for clustering time series and present our periodogram-based metrics. In Section 3 we discuss the methodology used for empirical classification of ARMA and ARIMA models, and in Section 4 we present results from various approaches. In Section 5 we present an illustrative example with economic time series data, identifying similarities among industrial production index series in the United States, and in Section 6 we summarize the paper and discuss possible future research.

Section snippets

Time series metrics

A fundamental problem in classification analysis of time series is the choice of a relevant metric. Let $X_t = (x_{1,t}, \ldots, x_{k,t})$ be a vector time series with components represented by autoregressive integrated moving average, or ARIMA$(p,d,q)$, models,
$$\phi_i(B)(1-B)^d x_{i,t} = \theta_i(B)\varepsilon_{i,t}, \qquad i = 1, \ldots, k,$$
where $\phi_i(B)$ is the autoregressive operator of order $p$ and $\theta_i(B)$ is the moving average operator of order $q$; $B$ is the back-shift operator and $(1-B)^d$ is the differencing operator of order $d$. The autoregressive and moving average
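As a small illustration of the differencing operator $(1-B)^d$, the sketch below uses a hypothetical random walk, an ARIMA$(0,1,0)$ process, and shows that one difference reduces it to its stationary (white-noise) part:

```python
import numpy as np

rng = np.random.default_rng(1)
eps = rng.standard_normal(400)  # unit-variance white noise

# ARIMA(0,1,0): a random walk, x_t = x_{t-1} + eps_t
x = np.cumsum(eps)

# applying the differencing operator (1 - B) once, i.e. d = 1,
# recovers the stationary white-noise series
dx = np.diff(x)
```

More generally, differencing an ARIMA$(p,d,q)$ series $d$ times yields a stationary ARMA$(p,q)$ series, which is what the classification task below exploits.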

Methodology of time series classification

In this section we use the previously discussed distances for time series classification, as follows:

Step 1: Find similarities or dissimilarities between every pair of time series in the data set. For each data set we compute a distance matrix over the k(k-1)/2 distinct pairs using the following metrics:

  • (i)

    Classical Euclidean (EUCL) distance, $d_{\mathrm{EUCL}}(x,y) = \sqrt{\sum_{t=1}^{n} (x_t - y_t)^2}$.

  • (ii)

    Piccolo's distance, $d_{\mathrm{PIC}}(x,y) = \sqrt{\sum_{j=1}^{\infty} (\pi_{j,x} - \pi_{j,y})^2}$. The application of this distance requires the fitting of an ARIMA model to the time series.
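Piccolo's distance can be sketched as follows: the AR(∞) weights $\pi_j$ are obtained from fitted ARMA coefficients by power-series division, and the distance is the Euclidean distance between the two truncated weight sequences. This sketch assumes the conventions $\phi(B) = 1 - \phi_1 B - \cdots$ and $\theta(B) = 1 + \theta_1 B + \cdots$; the truncation length `m` is an arbitrary illustrative choice:

```python
import numpy as np

def ar_pi_weights(phi, theta, m=50):
    """Truncated AR(inf) weights pi_1, ..., pi_m of an ARMA model,
    from pi(B) = phi(B) / theta(B), so that x_t = sum_j pi_j x_{t-j} + eps_t."""
    a = np.zeros(m + 1); a[0] = 1.0
    a[1:len(phi) + 1] = -np.asarray(phi)    # coefficients of phi(B)
    b = np.zeros(m + 1); b[0] = 1.0
    b[1:len(theta) + 1] = np.asarray(theta)  # coefficients of theta(B)
    # power-series division c(B) = a(B) / b(B): c_k = a_k - sum_{j>=1} b_j c_{k-j}
    c = np.zeros(m + 1)
    c[0] = 1.0
    for k in range(1, m + 1):
        c[k] = a[k] - np.sum(b[1:k + 1] * c[k - 1::-1])
    return -c[1:]  # pi(B) x_t = eps_t  =>  pi_j = -c_j

def piccolo_distance(model_x, model_y, m=50):
    """Euclidean distance between truncated AR(inf) weight sequences,
    where each model is a (phi, theta) pair of coefficient lists."""
    px = ar_pi_weights(*model_x, m=m)
    py = ar_pi_weights(*model_y, m=m)
    return np.sqrt(np.sum((px - py) ** 2))
```

For a pure AR(1) with $\phi_1 = 0.9$ the only nonzero weight is $\pi_1 = 0.9$; for an invertible MA model the weights decay geometrically, so the truncation error is small for moderate `m`.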

Simulation results

We simulated one thousand replications of each of the following six stationary [(a)–(f)] and six non-stationary [(g)–(l)] models. All series were generated with zero-mean, unit-variance white noise. The sample sizes were taken equal to 50, 100, 200, 500, 1000 and 10,000 observations:

Model (a): AR(1), with φ1=0.9;
Model (b): AR(2), with φ1=0.95 and φ2=-0.1;
Model (c): ARMA(1,1), with φ1=0.95 and θ1=0.1;
Model (d): ARMA(1,1), with φ1=-0.1 and θ1=-0.95;
Model (e): MA(1), with θ1=-0.9;
Model (f):
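The stationary models above can be simulated with a minimal ARMA recursion; the sketch below generates models (a), (d), and (e). The sign convention for the MA coefficients (added, not subtracted) is an assumption, as conventions differ across texts:

```python
import numpy as np

def simulate_arma(phi, theta, n, rng):
    """Simulate an ARMA(p, q) series with unit-variance Gaussian white noise,
    using the convention x_t = sum_i phi_i x_{t-i} + eps_t + sum_j theta_j eps_{t-j}."""
    p, q = len(phi), len(theta)
    eps = rng.standard_normal(n + q)  # extra q draws for the pre-sample MA terms
    x = np.zeros(n)
    for t in range(n):
        ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[q + t - 1 - j] for j in range(q))
        x[t] = ar + eps[q + t] + ma
    return x

rng = np.random.default_rng(0)
n = 500
series = {
    "a: AR(1)":     simulate_arma([0.9], [], n, rng),
    "d: ARMA(1,1)": simulate_arma([-0.1], [-0.95], n, rng),
    "e: MA(1)":     simulate_arma([], [-0.9], n, rng),
}
```

Non-stationary counterparts can then be obtained by cumulatively summing a stationary series, i.e. integrating it once.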

Application

As an illustrative example we use the Industrial Production (by Market Group) indices in the United States (source: http://www.economagic.com). The 20 time series indices (seasonally adjusted), with sample sizes of n=309, from January 1977 to September 2002, are reported in Table 3.

Before carrying out the clustering analysis, the series were transformed into differences of the logarithm, $\log x_t - \log x_{t-1}$, as shown in Fig. 2, in order to obtain the percentage increases from period to period. This gets rid of
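The transformation can be sketched as follows (the index values below are hypothetical, not taken from the actual data set):

```python
import numpy as np

# hypothetical monthly index values
index = np.array([100.0, 102.0, 101.5, 103.0, 105.1])

# differences of the logarithm: log(x_t) - log(x_{t-1}) = log(x_t / x_{t-1}),
# which approximates the period-to-period percentage increase for small changes
growth = np.diff(np.log(index))
```

For example, the first growth value is $\log(102/100) \approx 0.0198$, i.e. roughly a 2% increase. The transformation also removes any common multiplicative scale, so series measured in different units become comparable.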

Conclusions

In this paper, we have studied metrics based on different dependence measures to classify time series as stationary or non-stationary. Simulation results show that the metrics based on the logarithm of the normalized periodogram and the metric based on the autocorrelation coefficients can all empirically distinguish ARMA from ARIMA models with a high success rate, whereas this is not the case for the classical Euclidean distance or for the metric based on the autoregressive weights proposed by

Acknowledgements

This research was partially supported by a grant from the Fundação para a Ciência e Tecnologia (POCTI/FCT) and by a grant from the Fundação Calouste Gulbenkian. The third author acknowledges support from grant SEJ2004-03303 and from Fundación BBVA, Spain. Part of this work was completed during the visit of Jorge Caiado to the Department of Statistics, Universidad Carlos III de Madrid, Spain. The authors gratefully acknowledge the helpful comments and suggestions of the associate editor and an

References (34)

  • E.A. Maharaj

    Comparison and classification of stationary multivariate time series

    Pattern Recognition

    (1999)
  • E.A. Maharaj

    Comparison of non-stationary time series in the frequency domain

    Comput. Statist. Data Anal.

    (2002)
  • C.T. Shaw et al.

    Using cluster analysis to classify time series

    Physica D

    (1992)
  • F. Battaglia

    Inverse autocovariances and a measure of linear determinism for a stationary process

    J. Time Ser. Anal.

    (1983)
  • F. Battaglia

    Recursive estimation of the inverse correlation function

    Statistica

    (1986)
  • F. Battaglia

    On the estimation of the inverse correlation function

    J. Time Ser. Anal.

    (1988)
  • J. Beran et al.

    On unified model selection for stationary and nonstationary short and long memory autoregressive processes

    Biometrika

    (1998)
  • R.J. Bhansali

    Autoregressive and window estimates of the inverse autocorrelation function

    Biometrika

    (1980)
  • R.J. Bhansali

    A simulation study of autoregressive and window estimators of the inverse correlation function

    Appl. Statist.

    (1983)
  • Z.D. Bohte et al.

    Clustering of time series

    Proc. COMPSTAT

    (1980)
  • P.J. Brockwell et al.

    Time Series: Theory and Methods

    (1991)
  • C. Chatfield

    Inverse autocorrelations

    J. Roy. Statist. Soc. Ser. A

    (1979)
  • W.S. Cleveland

    The inverse autocorrelations of a time series and their applications

    Technometrics

    (1972)
  • G.R. Dargahi-Noubary

    Discrimination between Gaussian time series based on their spectral differences

    Commun. Statist. Theory Methods

    (1992)
  • G.R. Dargahi-Noubary et al.

    Spectral ratio discriminant and information theory

    J. Time Ser. Anal.

    (1981)
  • P.J. Diggle et al.

    Nonparametric comparison of cumulative periodograms

    Appl. Statist.

    (1991)
  • P.J. Diggle et al.

    Spectral analysis of replicated biomedical time series

    Appl. Statist.

    (1997)