Skip to main content

2004 | Buch

High Performance Discovery in Time Series

Techniques and Case Studies

verfasst von: Dennis Shasha, Yunyue Zhu

Verlag: Springer New York

Buchreihe : Monographs in Computer Science

insite
SUCHEN

Über dieses Buch

Overview and Goals Data arriving in time order (a data stream) arises in fields ranging from physics to finance to medicine to music, just to name a few. Often the data comes from sensors (in physics and medicine for example) whose data rates continue to improve dramati­ cally as sensor technology improves. Further, the number of sensors is increasing, so correlating data between sensors becomes ever more critical in orderto distill knowl­ edge from the data. On-line response is desirable in many applications (e.g., to aim a telescope at a burst of activity in a galaxy or to perform magnetic resonance-based real-time surgery). These factors - data size, bursts, correlation, and fast response­ motivate this book. Our goal is to help you design fast, scalable algorithms for the analysis of single or multiple time series. Not only will you find useful techniques and systems built from simple primi­ tives, but creative readers will find many other applications of these primitives and may see how to create new ones of their own. Our goal, then, is to help research mathematicians and computer scientists find new algorithms and to help working scientists and financial mathematicians design better, faster software.

Inhaltsverzeichnis

Frontmatter

Review of Techniques

Frontmatter
1. Time Series Preliminaries
Abstract
A time series is a sequence of recorded values. These values are usually real numbers recorded at regular intervals, such as yearly, monthly, weekly, daily, and hourly. Data recorded irregularly are often interpolated to form values at regular intervals before the time series is analyzed. We often represent a time series as just a vector of real numbers.
Dennis Shasha, Yunyue Zhu
2. Data Reduction and Transformation Techniques
Abstract
From a data mining point of view, time series data has two important characteristics:
1.
High Dimensional If we think of each time point of a time series as a dimension, a time series is a point in a very high dimensional space. A time series of length 1000 corresponds to a point in a 1000-dimensional space. Though a time series of length 1000 is very common in practice, processing in a 1000-dimensional space is extremely difficult even with modern computer systems.
 
2.
Temporal Order Fortunately, the consecutive values in a time series are related because of the temporal order of a time series. For example, for financial time series, the differences between consecutive values will be within some predictable threshold most of the time. This temporal relationship between nearby data points in a time series produces some redundancy, and such redundancy provides an opportunity for data reduction.
 
Dennis Shasha, Yunyue Zhu
3. Indexing Methods
Abstract
For our purposes, an index is a data structure that allows fast retrieval of time series that are close to a query time series. In index terms, because a time series is represented as a point in some space, our searches are similarity point searches in a single or multidimensional space. Of course, indexes have been extremely well studied and your favorite database text should describe B-trees and at least some multidimensional structures in great depth. Our intent is to focus on those indexes that are most useful to time series analysis as well as how to use them.
Dennis Shasha, Yunyue Zhu
4. Flexible Similarity Search
Abstract
There are many applications for similarity search in time series data of which the following are just a small sample.
1.
In finance, a trader may be interested in finding pairs of stocks that move similarly, perhaps with some lag.
 
2.
In music, a person may want to find a song that is similar to one that he can hum.
 
3.
In business management, spotting products with similar selling patterns can result in more efficient product management.
 
4.
In environmental science, by comparing the pollutant level in different sections of a river, scientists can have a better understanding of environmental changes.
 
Dennis Shasha, Yunyue Zhu

Case Studies

Frontmatter
5. StatStream
Summary
Consider the problem of monitoring tens of thousands of time series data streams in an online fashion and making decisions based on them. In addition to single stream statistics such as average and standard deviation, we also want to find the most highly correlated pairs of streams especially in a sliding window sense. A stock market trader might use such a tool to spot arbitrage opportunities. In this chapter, we propose efficient methods for solving this problem based on Discrete Fourier Transforms (see Chapter 2) and a three level time interval hierarchy. Extensive experiments on synthetic data and real world financial trading data show that our algorithm beats the direct computation approach by several orders of magnitude. It also improves on previous Fourier Transform approaches by allowing the efficient computation of time-delayed correlation over any size sliding window and any time delay. Correlation also lends itself to an efficient grid-based data structure. The result is the first algorithm that we know of to compute correlations over thousands of data streams in real time. The algorithm is incremental, has fixed response time, and can monitor the pairwise correlations of 10,000 streams on a single PC. The algorithm is embarrassingly parallelizable.
Dennis Shasha, Yunyue Zhu
6. Query by Humming
Summary
The goal of a Query by Humming system is to allow a user to find a song by humming part of the tune. No musical training is needed. The problem is still unsolved. Some systems have low retrieval precision because they rely on melodic contour information from the hum tune, which in turn relies on the error-prone note segmentation process. Some systems yield better precision when matching the melody directly from audio, but they are slow because of their extensive use of Dynamic Time Warping (DTW) (see Chapter 4). HumFinder [106, 107] improves both the retrieval precision and speed compared to previous approaches. We treat music as a time series and exploit and improve well-developed techniques from time series databases to indexing the music for fast similarity queries. We improve on existing DTW indexes technique by introducing the concept of envelope transforms, which gives a general guideline for extending existing dimensionality reduction methods to DTW indexes. The net result is high scalability. We test our system through experiments. Please read this approach as a case study of the techniques you have seen, not as a complete solution to this hard problem.
Dennis Shasha, Yunyue Zhu
7. Elastic Burst Detection
Summary
Burst detection is the activity of rinding abnormal aggregates in data streams. Such aggregates are based on sliding windows over data streams. In some applications, we want to monitor many sliding window sizes simultaneously and to report those windows with aggregates significantly different from other periods. We will present a general data structure and system called OmniBurst [104] for detecting interesting aggregates over such elastic windows in near linear time. We present applications of the algorithm to detecting Gamma Ray Bursts in large-scale astrophysical data. Detection of periods with high volumes of trading activities and high stock price volatility is also demonstrated using real time Trade and Quote (TAQ) data from the New York Stock Exchange (NYSE). Our algorithm filters out periods of non-bursts in linear time, so beats the quadratic direct computation approach (of testing all window sizes individually) by several orders of magnitude.
Dennis Shasha, Yunyue Zhu
8. A Call to Exploration
Abstract
Algorithmic improvements will be essential to time series analysis in the coming years. This might surprise those who marvel at the technology trends in which processor speed improvements of several orders of magnitude will occur in the coming decades. But there is no contradiction. Improvements in processors speed up existing algorithms, to be sure. But they also make detectors far more capable.
Dennis Shasha, Yunyue Zhu
Backmatter
Metadaten
Titel
High Performance Discovery in Time Series
verfasst von
Dennis Shasha
Yunyue Zhu
Copyright-Jahr
2004
Verlag
Springer New York
Electronic ISBN
978-1-4757-4046-2
Print ISBN
978-1-4419-1842-0
DOI
https://doi.org/10.1007/978-1-4757-4046-2