Top

Published in:

16-11-2020 | Regular Paper

BestNeighbor: efficient evaluation of kNN queries on large time series databases

Authors: Oleksandra Levchenko, Boyan Kolev, Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas, Dennis Shasha, Patrick Valduriez

Published in: Knowledge and Information Systems | Issue 2/2021

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

This paper presents parallel solutions (developed based on two state-of-the-art algorithms iSAX and sketch) for evaluating k nearest neighbor queries on large databases of time series, compares them based on various measures of quality and time performance, and offers a tool that uses the characteristics of application data to determine which algorithm to choose for that application and how to set the parameters for that algorithm. Specifically, our experiments show that: (i) iSAX and its derivatives perform best in both time and quality when the time series can be characterized by a few low-frequency Fourier Coefficients, a regime where the iSAX pruning approach works well. (ii) iSAX performs significantly less well when high-frequency Fourier Coefficients have much of the energy of the time series. (iii) A random projection approach based on sketches by contrast is more or less independent of the frequency power spectrum. The experiments show the close relationship between pruning ratio and time for exact iSAX as well as between pruning ratio and the quality of approximate iSAX. Our toolkit analyzes typical time series of an application (i) to determine optimal segment sizes for iSAX and (ii) when to use Parallel Sketches instead of iSAX. Our algorithms have been implemented using Spark, evaluated over a cluster of nodes, and have been applied to both real and synthetic data. The results apply to any databases of numerical sequences, whether or not they relate to time.

previous article A relative position attention network for aspect-based sentiment analysis

next article Statistical model for reproducibility in ranking-based feature selection

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

http://www.grid5000.fr.

http://finance.yahoo.com.

Indexing time responses are not reported here since they mainly depend on the cluster size.

Achlioptas D (2003) Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J Comput Syst Sci 66(4):671–687MathSciNetCrossRef

Assent I, Krieger R, Afschari F, Seidl T (2008) The TS-tree: efficient time series search and retrieval. In: Proceedings of the international conference on extending database technology (EDBT), pp 252–263

Cai Y, Ng R (2004) Indexing spatio-temporal trajectories with Chebyshev polynomials. In: Proceedings of the international conference on management of data (SIGMOD). ACM, pp 599–610

Camerra A, Palpanas T, Shieh J, Keogh E (2010) iSAX 2.0: indexing and mining one billion time series. In: Proceedings of the 2010 IEEE international conference on data mining, ICDM ’10. pp 58–67

Camerra A, Shieh J, Palpanas T, Rakthanmanon T, Keogh EJ (2014) Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. Knowl Inf Syst (KAIS) 39:123–151 CrossRef

Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thiry-fourth annual ACM symposium on theory of computing (STOC), pp 380–388

Dasgupta S (1999) Learning mixtures of gaussians. In: Proceedings of the 40th annual symposium on foundations of computer science, FOCS ’99. p 634

Echihabi K, Zoumpatianos K, Palpanas T, Benbrahim H (2018) The lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. PVLDB 12(2):112–127

Echihabi K, Zoumpatianos K, Palpanas T, Benbrahim H (2019) Return of the lernaean hydra: experimental evaluation of data series approximate similarity search. PVLDB

10.

Esling P, Agon C (2012) Time-series data mining. ACM Comput Surv 45(1):12:1–12:34CrossRef

11.

Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. In: Proceedings of the international conference on management of data (SIGMOD), pp 419–429

12.

Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of the international conference on very large data bases (VLDB). pp 518–529

13.

Huijse P, Estévez PA, Protopapas P, Principe JC, Zegers P (2014) Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Comput Int Mag 9(3):27–39CrossRef

14.

Indyk P (2000) Stable distributions, pseudorandom generators, embeddings and data stream computation. In: 41st Annual symposium on foundations of computer science (FOCS), pp 189–197

15.

Indyk P, Koudas N, Muthukrishnan S (2000) Identifying representative trends in massive time series data sets using sketches. In: International conference on very large data bases (VLDB). pp 363–372

16.

I.R.I. for Seismology with Artificial Intelligence (2019) Seismic data access. http://ds.iris.edu/data/access/

17.

Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. In: Conference in modern analysis and probability, volume 26 of contemporary mathematics. pp 189–206

18.

Kashino K, Smith G, Murase H (1999) Time-series active search for quick retrieval of audio and video. In: ICASSP

19.

Keogh EJ (2002) Exact indexing of dynamic time warping. In: VLDB

20.

Kondylakis H, Dayan N, Zoumpatianos K, Palpanas T (2018) Coconut: a scalable bottom-up approach for building data series indexes. PVLDB 11(6):677–690

21.

Kushilevitz E, Ostrovsky R, Rabani Y (1998) Efficient search for approximate nearest neighbor in high dimensional spaces. In: Proceedings of the thirtieth annual ACM symposium on theory of computing (STOC). pp 614–623

22.

Levchenko O, Kolev B, Yagoubi DE, Shasha DE, Palpanas T, Valduriez P, Akbarinia R, Masseglia F (2019) Distributed algorithms to find similar time series. In: Machine learning and knowledge discovery in databases. ECML PKDD

23.

Levchenko O, Yagoubi DE, Akbarinia R, Masseglia F, Kolev B, Shasha DE (2018) Spark-parsketch: a massively distributed indexing of time series datasets. In: Proceedings of the 27th ACM international conference on information and knowledge management, CIKM 2018, Torino, Italy, October 22–26, 2018. pp 1951–1954

24.

Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: SIGMOD

25.

Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing sax: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144MathSciNetCrossRef

26.

Linardi M, Palpanas T (2018) ULISSE: ultra compact index for variable-length similarity search in data series. In: ICDE

27.

Linardi M, Palpanas T (2019) Scalable, variable-length similarity search in data series: The ulisse approach. PVLDB

28.

Linardi M, Zhu Y, Palpanas T, Keogh EJ (2018) Matrix profile X: VALMOD—scalable discovery of variable-length motifs in data series. In: SIGMOD

29.

Linardi M, Zhu Y, Palpanas T, Keogh EJ (2018) VALMOD: a suite for easy and exact detection of variable length motifs in data series. In: SIGMO

30.

Palpanas T (2015) Data series management: the road to big sequence analytics. SIGMOD Rec 44(2):47–52CrossRef

31.

Palpanas T (2020) Evolution of a data series index. CCIS 1197

32.

Peng B, Palpanas T, Fatourou P (2018) Paris: the next destination for fast data series indexing and query answering. IEEE BigData

33.

Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD

34.

Raza U, Camerra A, Murphy AL, Palpanas T, Picco GP (2015) Practical data prediction for real-world wireless sensor networks. IEEE Trans Knowl Data Eng 27:2231–2244CrossRef

35.

Vision SC (2019) Deep billion-scale indexing. http://sites.skoltech.ru/compvision/noimi

36.

Soldi S, Beckmann V, Baumgartner WH, Ponti G, Shrader CR, Lubinski P, Krimm HA, Mattana F, Tueller J (2014) Long-term variability of agn at hard x-rays. Astron Astrophys 563:A57CrossRef

37.

Southwest University (2019) Southwest University adult lifespan dataset (sald). http://fcon_1000.projects.nitrc.org/indi/retro/sald.html

38.

Schneider J, Vlachos M (2017) Scalable density-based clustering with quality guarantees using random projections. Data Min Knowl Discov 31(4):972–1005MathSciNetCrossRef

39.

Shasha D, Zhu Y (2004) High performance discovery in time series, techniques and case studies. Springer, BerlinCrossRef

40.

Shasha D (1999) Tuning time series queries in finance: case studies and recommendations. IEEE Data Eng Bull 22(2):40–46

41.

Shieh J, Keogh E (2008) isax: Indexing and mining terabyte sized time series. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’08. pp 623–631

42.

Shieh J, Keogh E (2009) iSAX: disk-aware mining and indexing of massive time series datasets. DMKD 19(1):24–57

43.

Soldi S, Beckmann V, Baumgartner WH, Ponti G, Shrader CR, Lubiński P, Krimm HA, Mattana F, Tueller J (2014) Long-term variability of agn at hard x-rays. A&A 563:A57CrossRef

44.

Yang W, Peng W, Jian P, Wei W, Sheng H (2013) A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB 6(10):793–804

45.

Wilkinson L, Anand A, Tuan DN (2011) Chirp: a new classifier based on composite hypercubes on iterated random projections. In: International conference on knowledge discovery and data mining (KDD). ACM, pp 6–14

46.

Yagoubi DE, Akbarinia R, Masseglia F, Palpanas T (2017) Dpisax: massively distributed partitioned isax. In: ICDM. pp 1135–1140

47.

Yagoubi DE, Akbarinia R, Masseglia F, Palpanas T (2020) Massively distributed time series indexing and querying. IEEE Trans Knowl Data Eng 32(1):108–120CrossRef

48.

Yagoubi DE, Akbarinia R, Masseglia F, Shasha DE (2017) Radiussketch: massively distributed indexing of time series. In: 2017 IEEE international conference on data science and advanced analytics, DSAA 2017, Tokyo, Japan, October 19–21, 2017. pp 262–271

49.

Ye L, Keogh EJ (2009) Time series shapelets: a new primitive for data mining. In: KDD

50.

Zoumpatianos K, Idreos S, Palpanas T (2014) Indexing for interactive exploration of big data series. In: Proceedings of the international conference on management of data (SIGMOD), SIGMOD ’14. pp 1555–1566

51.

Zoumpatianos K, Idreos S, Palpanas T (2016) ADS: the adaptive data series index. VLDB J 25(6):843–866CrossRef

52.

Zoumpatianos K, Palpanas T (2018) Data series management: fulfilling the need for big sequence analytics. In: ICDE

Title: BestNeighbor: efficient evaluation of kNN queries on large time series databases
Authors: Oleksandra Levchenko
Boyan Kolev
Djamel-Edine Yagoubi
Reza Akbarinia
Florent Masseglia
Themis Palpanas
Dennis Shasha
Patrick Valduriez
Publication date: 16-11-2020
Publisher: Springer London
Published in: Knowledge and Information Systems / Issue 2/2021
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-020-01518-4

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Other articles of this Issue 2/2021

A relative position attention network for aspect-based sentiment analysis

Partial multi-label learning with noisy side information

A repairing missing activities approach with succession relation for event logs

A graph grammar and -type tournament-based approach to detect conflicts of interest in a social network

Learning credible DNNs via incorporating prior knowledge and model local explanation

CANE: community-aware network embedding via adversarial training

Premium Partner