Abstract
Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms, including classification, clustering, motif discovery, anomaly detection, and so on. The difficulty of scaling a search to large datasets explains to a great extent why most academic work on time series data mining has plateaued at considering a few millions of time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine massive time series for the first time. We demonstrate the following unintuitive fact: in large datasets we can exactly search under Dynamic Time Warping (DTW) much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We explain how our ideas allow us to solve higher-level time series data mining problems such as motif discovery and clustering at scales that would otherwise be untenable. Moreover, we show how our ideas allow us to efficiently support the uniform scaling distance measure, a measure whose utility seems to be underappreciated, but which we demonstrate here. In addition to mining massive datasets with up to one trillion datapoints, we will show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower powered devices than are currently possible.
- Adams, N., Marquez, D., and Wakefield, G. 2005. Iterative deepening for melody alignment and retrieval. In Proceedings of ISMIR. 199--206.Google Scholar
- Alon, J., Athitsos, V., Yuan, Q., and Sclaroff, S. 2009. A unified framework for gesture recognition and spatiotemporal gesture segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 31, 9, 1685--1699. Google ScholarDigital Library
- Assent, I., Krieger, R., Afschari, F., and Seidl, T. 2008. The TS-Tree: Efficient time series search and retrieval. In Proceedings of EDBT. 252--63. Google ScholarDigital Library
- Bei, C. D. and Gray, R. M. 1985. An improvement of the minimum distortion encoding algorithm for vector quantization. IEEE Trans. Commun. 33, 10, 1132--1133.Google ScholarCross Ref
- Bragge, T., Tarvainen, M. P., and Karjalainen, P. A. 2004. High-Resolution QRS Detection Algorithm for Sparsely Sampled ECG Recordings. Department of Applied Physics Report, University of Kuopio.Google Scholar
- Chadwick, N. A., McMeekin, D. A., and Tan, T. 2011. Classifying eye and head movement artifacts in EEG Signals. In Proceedings of IEEE DEST. 285--291.Google Scholar
- Chan, T. F., Golub, G. H., and Leveque, R. J. 1983. Algorithms for computing the sample variance: Analysis and recommendations. Amer. Statist. 37, 242--247.Google ScholarCross Ref
- Chaovalitwongse, W. A., Sachdeo, R. C., Pardalos, P. M., Iasemidis, L. D., and Sackellares, J. C. 2005. Automated brain activity classifier. Epilepsia 46, 313.Google Scholar
- Chen, L. and Ng, R. 2004. On the marriage of LP-norms and edit distance. In Proceedings of VLDB. 792--803. Google ScholarDigital Library
- Chen, Y., Chen, G., Chen, K., and Ooi, B. C. 2009. Efficient processing of warping time series join of motion capture data. In Proceedings of ICDE. 1048--1059. Google ScholarDigital Library
- Cheng, D. Y., Gersho, A., Ramamurthi, B., and Shoham, Y. 1984. Fast search algorithms for vector quantization and pattern matching. In Proceedings of ICASSP. 372--375.Google Scholar
- Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., and Keogh, E. J. 2008. Querying and mining of time series data: Experimental comparison of representations and distance measures. J. VLDB 1, 2, 1542--1552. Google ScholarDigital Library
- Dupasquier, B. and Burschka, S. 2011. Data mining for hackers--Encrypted traffic mining. In Proceedings of the 28th Chaos Comm’ Congress.Google Scholar
- Fornés, A., Lladós, J., and Sanchez, G. 2007. Old handwritten musical symbol classification by a dynamic time warping based method. Graph. Recogn. 5046, 51--60.Google Scholar
- Fu, A., Keogh, E. J., Lau, L., Ratanamahatana, C., and Wong, R. 2008. Scaling and time warping in time series querying. VLDB J. 17, 4, 899--921. Google ScholarDigital Library
- Gillian, N., Knapp, R., and O’Modhrain, S. 2011. Recognition of multivariate temporal musical gestures using n-dimensional dynamic time warping. In Proceedings of the 11th International Conference on New Interfaces for Musical Expression.Google Scholar
- Goldberg, D. 1991. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23, 1. Google ScholarDigital Library
- Guitel, G. 1975. Histoire Comparée des Numérations Écrites. Flammarion, Paris. 566--574.Google Scholar
- Hsiao, M., West, K., and Vedatesh, G. 2005. Online context recognition in multisensor system using dynamic time warping. In Proceedings of ISSNIP. 283--288.Google Scholar
- Huber-Mörk, R., Zambanini, S., Zaharieva, M., and Kampel, M. 2011. Identification of ancient coins based on fusion of shape and local features. Mach. Vis. Appl. 22, 6, 983--994. Google ScholarDigital Library
- Jegou, H., Douze, M., Schmid, C., and Perez, P. 2010. Aggregating local descriptors into a compact image representation. In Proceedings of IEEE CVPR. 3304--3311.Google Scholar
- Kahveci, T. and Singh, A. K. 2004. Optimizing similarity search for arbitrary length time series queries. IEEE Trans. Knowl. Data Eng. 16, 4, 418--433. Google ScholarDigital Library
- Keogh, E. and Ratanamahatana, C. A. 2005. Exact indexing of dynamic time warping. Knowl. Inform. Syst. 7, 3, 358--386. Google ScholarDigital Library
- Keogh, E. J. and Kasetty, S. 2003. On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Mining Knowl. Discov. 7, 4, 349--371. Google ScholarDigital Library
- Keogh, E. J., Palpanas, T., Zordan, V. B., Gunopulos, D., and Cardle, M. 2004. Indexing large human-motion databases. In Proceedings of VLDB. 780--791. Google ScholarDigital Library
- Keogh, E. J., Wei, L., Xi, X., Vlachos, M., Lee, S. H., and Protopapas, P. 2009. Supporting exact indexing of arbitrarily rotated shapes and periodic time series under Euclidean and warping distance measures. VLDB J. 18, 3, 611--630. Google ScholarDigital Library
- Kim, S., Park, S., and Chu, W. 2001. An index-based approach for similarity search supporting time warping in large sequence databases. In Proceedings of ICDE. 607--614. Google ScholarDigital Library
- Laerhoven, K., Berlin, E., and Schiele, B. 2009. Enabling efficient time series analysis for wearable activity data. In Proceedings of ICMLA. 392--397. Google ScholarDigital Library
- Lim, S. H., Park, H., and Kim, S. W. 2007. Using multiple indexes for efficient subsequence matching in time-series databases. Inf. Sci. 177, 24, 5691--5706. Google ScholarDigital Library
- Ling, R. F. 1974. Comparison of several algorithms for computing sample means and variances. J. Amer. Statist. Assoc. 69, 348, 859--866.Google ScholarCross Ref
- Locke, D. P., Hillier, L. W., Warren, W. C., et al. 2011. Comparative and demographic analysis of orangutan genomes. Nature 469, 529--533.Google ScholarCross Ref
- Masek, W. J. and Paterson, M. S. 1980. A faster algorithm computing string edit distances. J. Comput. Syst. Sci. 20, 1, 18--31.Google ScholarCross Ref
- McNames, J. 2000. Rotated partial distance search for faster vector quantization encoding. IEEE Signal Proc. Lett. 7, 9, 244--246.Google ScholarCross Ref
- Mueen, A. and Keogh, E. J. 2010. Online discovery and maintenance of time series motifs. In Proceedings of KDD. 1089--1098. Google ScholarDigital Library
- Mueen, A., Keogh, E. J., Zhu, Q., Cash, S., Westover, M. B., and Shamlo, N. 2011. A disk-aware algorithm for time series motif discovery. Data Min. Knowl. Discov. 22, 1--2, 73--105. Google ScholarDigital Library
- Muller, M. 2009. Analysis and retrieval techniques for motion and music data. EUROGRAPHICS tutorial.Google Scholar
- Papapetrou, P., Athitsos, V., Potamias, M., Kollios, G., and Gunopulos, D. 2011. Embedding-based subsequence matching in time-series databases. ACM Trans. Datab. Syst. 36, 3, 174. Google ScholarDigital Library
- Pressly, W. 2008. TSPad: A Tablet-PC based application for annotation and collaboration on time series data. In Proceedings of ACM Southeast Regional Conference. 527--552. Google ScholarDigital Library
- Raghavendra, B., Bera, D., Bopardikar, A., and Narayanan, R. 2011. Cardiac arrhythmia detection using dynamic time warping of ECG beats in e-healthcare systems. In Proceedings of WOWMOM. 1--6. Google ScholarDigital Library
- Rebbapragada, U., Protopapas, P., Brodley, C., and Alcock, C. 2009. Finding anomalous periodic time series. Mach. Learn. 74, 3, 281--313. Google ScholarDigital Library
- Sakurai, Y., Yoshikawa, M., and Faloutsos, C. 2005. FTW: Fast similarity search under the time warping distance. In Proceedings of PODS. 326--337. Google ScholarDigital Library
- Sakurai, Y., Faloutsos, C., and Yamamuro, M. 2007. Stream monitoring under the time warping distance. In Proceedings of ICDE. 1046--1055.Google Scholar
- Shieh, J. and Keogh, E. J. 2008. iSAX: Indexing and mining terabyte sized time series. In Proceedings of KDD. 623--631. Google ScholarDigital Library
- Srikanthan, S., Kumar, A., and Gupta, R. 2011. Implementing the dynamic time warping algorithm in multithreaded environments for real time and unsupervised pattern discovery. In Proceedings of IEEE ICCCT. 394--398.Google Scholar
- Stiefmeier, T., Roggen, D., and Tröster, G. 2007. Gestures are strings: Efficient online gesture spotting and classification using string matching. In Proceedings of the ICST 2nd International Conference on Body Area Networks. Google ScholarDigital Library
- Vlachos, M., Hadjieleftheriou, M., Gunopulos, D., and Keogh, E. J. 2003. Indexing multi-dimensional time-series with support for multiple distance measures. In Proceedings of KDD. 216--225. Google ScholarDigital Library
- Whitney, C. R. 1997. Jeanne Calment, World’s elder, dies at 122. New York Times (8/5/97).Google Scholar
- Wobbrock, J. O., Wilson, A. D., and Li, Y. 2007. Gestures without libraries, toolkits or training: A $1 recognizer for user interface prototypes. In Proceedings of ACM UIST. 159--168. Google ScholarDigital Library
- Ye, L. and Keogh, E. J. 2009. Time series shapelets: A new primitive for data mining. In Proceedings of KDD. 947--956. Google ScholarDigital Library
- Yi, B., Jagadish, H., and Faloutsos, C. 1998. Efficient retrieval of similar time sequences under time warping. In Proceedings of ICDE. 201--208. Google ScholarDigital Library
Index Terms
- Addressing Big Data Time Series: Mining Trillions of Time Series Subsequences Under Dynamic Time Warping
Recommendations
Searching and mining trillions of time series subsequences under dynamic time warping
KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data miningMost time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. The difficulty of scaling search to large datasets ...
Approximate Similarity Search for Time Series Data Enhanced by Section Min-Hash
Similarity Search and ApplicationsAbstractDynamic Time Warping (DTW) is a well-known similarity measure between time series data. Although DTW can calculate the similarity between time series with different lengths, it is computationally expensive. Therefore, fast algorithms that ...
Speeding up similarity search under dynamic time warping by pruning unpromising alignments
Similarity search is the core procedure for several time series mining tasks. While different distance measures can be used for this purpose, there is clear evidence that the Dynamic Time Warping (DTW) is the most suitable distance function for a wide ...
Comments