Skip to main content
Top
Published in: Knowledge and Information Systems 2/2021

16-11-2020 | Regular Paper

BestNeighbor: efficient evaluation of kNN queries on large time series databases

Authors: Oleksandra Levchenko, Boyan Kolev, Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas, Dennis Shasha, Patrick Valduriez

Published in: Knowledge and Information Systems | Issue 2/2021

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

This paper presents parallel solutions (developed based on two state-of-the-art algorithms iSAX and sketch) for evaluating k nearest neighbor queries on large databases of time series, compares them based on various measures of quality and time performance, and offers a tool that uses the characteristics of application data to determine which algorithm to choose for that application and how to set the parameters for that algorithm. Specifically, our experiments show that: (i) iSAX and its derivatives perform best in both time and quality when the time series can be characterized by a few low-frequency Fourier Coefficients, a regime where the iSAX pruning approach works well. (ii) iSAX performs significantly less well when high-frequency Fourier Coefficients have much of the energy of the time series. (iii) A random projection approach based on sketches by contrast is more or less independent of the frequency power spectrum. The experiments show the close relationship between pruning ratio and time for exact iSAX as well as between pruning ratio and the quality of approximate iSAX. Our toolkit analyzes typical time series of an application (i) to determine optimal segment sizes for iSAX and (ii) when to use Parallel Sketches instead of iSAX. Our algorithms have been implemented using Spark, evaluated over a cluster of nodes, and have been applied to both real and synthetic data. The results apply to any databases of numerical sequences, whether or not they relate to time.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Footnotes
3
Indexing time responses are not reported here since they mainly depend on the cluster size.
 
Literature
1.
go back to reference Achlioptas D (2003) Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J Comput Syst Sci 66(4):671–687MathSciNetCrossRef Achlioptas D (2003) Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J Comput Syst Sci 66(4):671–687MathSciNetCrossRef
2.
go back to reference Assent I, Krieger R, Afschari F, Seidl T (2008) The TS-tree: efficient time series search and retrieval. In: Proceedings of the international conference on extending database technology (EDBT), pp 252–263 Assent I, Krieger R, Afschari F, Seidl T (2008) The TS-tree: efficient time series search and retrieval. In: Proceedings of the international conference on extending database technology (EDBT), pp 252–263
3.
go back to reference Cai Y, Ng R (2004) Indexing spatio-temporal trajectories with Chebyshev polynomials. In: Proceedings of the international conference on management of data (SIGMOD). ACM, pp 599–610 Cai Y, Ng R (2004) Indexing spatio-temporal trajectories with Chebyshev polynomials. In: Proceedings of the international conference on management of data (SIGMOD). ACM, pp 599–610
4.
go back to reference Camerra A, Palpanas T, Shieh J, Keogh E (2010) iSAX 2.0: indexing and mining one billion time series. In: Proceedings of the 2010 IEEE international conference on data mining, ICDM ’10. pp 58–67 Camerra A, Palpanas T, Shieh J, Keogh E (2010) iSAX 2.0: indexing and mining one billion time series. In: Proceedings of the 2010 IEEE international conference on data mining, ICDM ’10. pp 58–67
5.
go back to reference Camerra A, Shieh J, Palpanas T, Rakthanmanon T, Keogh EJ (2014) Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. Knowl Inf Syst (KAIS) 39:123–151 CrossRef Camerra A, Shieh J, Palpanas T, Rakthanmanon T, Keogh EJ (2014) Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. Knowl Inf Syst (KAIS) 39:123–151 CrossRef
6.
go back to reference Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thiry-fourth annual ACM symposium on theory of computing (STOC), pp 380–388 Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thiry-fourth annual ACM symposium on theory of computing (STOC), pp 380–388
7.
go back to reference Dasgupta S (1999) Learning mixtures of gaussians. In: Proceedings of the 40th annual symposium on foundations of computer science, FOCS ’99. p 634 Dasgupta S (1999) Learning mixtures of gaussians. In: Proceedings of the 40th annual symposium on foundations of computer science, FOCS ’99. p 634
8.
go back to reference Echihabi K, Zoumpatianos K, Palpanas T, Benbrahim H (2018) The lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. PVLDB 12(2):112–127 Echihabi K, Zoumpatianos K, Palpanas T, Benbrahim H (2018) The lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. PVLDB 12(2):112–127
9.
go back to reference Echihabi K, Zoumpatianos K, Palpanas T, Benbrahim H (2019) Return of the lernaean hydra: experimental evaluation of data series approximate similarity search. PVLDB Echihabi K, Zoumpatianos K, Palpanas T, Benbrahim H (2019) Return of the lernaean hydra: experimental evaluation of data series approximate similarity search. PVLDB
10.
go back to reference Esling P, Agon C (2012) Time-series data mining. ACM Comput Surv 45(1):12:1–12:34CrossRef Esling P, Agon C (2012) Time-series data mining. ACM Comput Surv 45(1):12:1–12:34CrossRef
11.
go back to reference Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. In: Proceedings of the international conference on management of data (SIGMOD), pp 419–429 Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. In: Proceedings of the international conference on management of data (SIGMOD), pp 419–429
12.
go back to reference Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of the international conference on very large data bases (VLDB). pp 518–529 Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of the international conference on very large data bases (VLDB). pp 518–529
13.
go back to reference Huijse P, Estévez PA, Protopapas P, Principe JC, Zegers P (2014) Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Comput Int Mag 9(3):27–39CrossRef Huijse P, Estévez PA, Protopapas P, Principe JC, Zegers P (2014) Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Comput Int Mag 9(3):27–39CrossRef
14.
go back to reference Indyk P (2000) Stable distributions, pseudorandom generators, embeddings and data stream computation. In: 41st Annual symposium on foundations of computer science (FOCS), pp 189–197 Indyk P (2000) Stable distributions, pseudorandom generators, embeddings and data stream computation. In: 41st Annual symposium on foundations of computer science (FOCS), pp 189–197
15.
go back to reference Indyk P, Koudas N, Muthukrishnan S (2000) Identifying representative trends in massive time series data sets using sketches. In: International conference on very large data bases (VLDB). pp 363–372 Indyk P, Koudas N, Muthukrishnan S (2000) Identifying representative trends in massive time series data sets using sketches. In: International conference on very large data bases (VLDB). pp 363–372
17.
go back to reference Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. In: Conference in modern analysis and probability, volume 26 of contemporary mathematics. pp 189–206 Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. In: Conference in modern analysis and probability, volume 26 of contemporary mathematics. pp 189–206
18.
go back to reference Kashino K, Smith G, Murase H (1999) Time-series active search for quick retrieval of audio and video. In: ICASSP Kashino K, Smith G, Murase H (1999) Time-series active search for quick retrieval of audio and video. In: ICASSP
19.
go back to reference Keogh EJ (2002) Exact indexing of dynamic time warping. In: VLDB Keogh EJ (2002) Exact indexing of dynamic time warping. In: VLDB
20.
go back to reference Kondylakis H, Dayan N, Zoumpatianos K, Palpanas T (2018) Coconut: a scalable bottom-up approach for building data series indexes. PVLDB 11(6):677–690 Kondylakis H, Dayan N, Zoumpatianos K, Palpanas T (2018) Coconut: a scalable bottom-up approach for building data series indexes. PVLDB 11(6):677–690
21.
go back to reference Kushilevitz E, Ostrovsky R, Rabani Y (1998) Efficient search for approximate nearest neighbor in high dimensional spaces. In: Proceedings of the thirtieth annual ACM symposium on theory of computing (STOC). pp 614–623 Kushilevitz E, Ostrovsky R, Rabani Y (1998) Efficient search for approximate nearest neighbor in high dimensional spaces. In: Proceedings of the thirtieth annual ACM symposium on theory of computing (STOC). pp 614–623
22.
go back to reference Levchenko O, Kolev B, Yagoubi DE, Shasha DE, Palpanas T, Valduriez P, Akbarinia R, Masseglia F (2019) Distributed algorithms to find similar time series. In: Machine learning and knowledge discovery in databases. ECML PKDD Levchenko O, Kolev B, Yagoubi DE, Shasha DE, Palpanas T, Valduriez P, Akbarinia R, Masseglia F (2019) Distributed algorithms to find similar time series. In: Machine learning and knowledge discovery in databases. ECML PKDD
23.
go back to reference Levchenko O, Yagoubi DE, Akbarinia R, Masseglia F, Kolev B, Shasha DE (2018) Spark-parsketch: a massively distributed indexing of time series datasets. In: Proceedings of the 27th ACM international conference on information and knowledge management, CIKM 2018, Torino, Italy, October 22–26, 2018. pp 1951–1954 Levchenko O, Yagoubi DE, Akbarinia R, Masseglia F, Kolev B, Shasha DE (2018) Spark-parsketch: a massively distributed indexing of time series datasets. In: Proceedings of the 27th ACM international conference on information and knowledge management, CIKM 2018, Torino, Italy, October 22–26, 2018. pp 1951–1954
24.
go back to reference Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: SIGMOD Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: SIGMOD
25.
go back to reference Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing sax: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144MathSciNetCrossRef Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing sax: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144MathSciNetCrossRef
26.
go back to reference Linardi M, Palpanas T (2018) ULISSE: ultra compact index for variable-length similarity search in data series. In: ICDE Linardi M, Palpanas T (2018) ULISSE: ultra compact index for variable-length similarity search in data series. In: ICDE
27.
go back to reference Linardi M, Palpanas T (2019) Scalable, variable-length similarity search in data series: The ulisse approach. PVLDB Linardi M, Palpanas T (2019) Scalable, variable-length similarity search in data series: The ulisse approach. PVLDB
28.
go back to reference Linardi M, Zhu Y, Palpanas T, Keogh EJ (2018) Matrix profile X: VALMOD—scalable discovery of variable-length motifs in data series. In: SIGMOD Linardi M, Zhu Y, Palpanas T, Keogh EJ (2018) Matrix profile X: VALMOD—scalable discovery of variable-length motifs in data series. In: SIGMOD
29.
go back to reference Linardi M, Zhu Y, Palpanas T, Keogh EJ (2018) VALMOD: a suite for easy and exact detection of variable length motifs in data series. In: SIGMO Linardi M, Zhu Y, Palpanas T, Keogh EJ (2018) VALMOD: a suite for easy and exact detection of variable length motifs in data series. In: SIGMO
30.
go back to reference Palpanas T (2015) Data series management: the road to big sequence analytics. SIGMOD Rec 44(2):47–52CrossRef Palpanas T (2015) Data series management: the road to big sequence analytics. SIGMOD Rec 44(2):47–52CrossRef
31.
go back to reference Palpanas T (2020) Evolution of a data series index. CCIS 1197 Palpanas T (2020) Evolution of a data series index. CCIS 1197
32.
go back to reference Peng B, Palpanas T, Fatourou P (2018) Paris: the next destination for fast data series indexing and query answering. IEEE BigData Peng B, Palpanas T, Fatourou P (2018) Paris: the next destination for fast data series indexing and query answering. IEEE BigData
33.
go back to reference Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD
34.
go back to reference Raza U, Camerra A, Murphy AL, Palpanas T, Picco GP (2015) Practical data prediction for real-world wireless sensor networks. IEEE Trans Knowl Data Eng 27:2231–2244CrossRef Raza U, Camerra A, Murphy AL, Palpanas T, Picco GP (2015) Practical data prediction for real-world wireless sensor networks. IEEE Trans Knowl Data Eng 27:2231–2244CrossRef
36.
go back to reference Soldi S, Beckmann V, Baumgartner WH, Ponti G, Shrader CR, Lubinski P, Krimm HA, Mattana F, Tueller J (2014) Long-term variability of agn at hard x-rays. Astron Astrophys 563:A57CrossRef Soldi S, Beckmann V, Baumgartner WH, Ponti G, Shrader CR, Lubinski P, Krimm HA, Mattana F, Tueller J (2014) Long-term variability of agn at hard x-rays. Astron Astrophys 563:A57CrossRef
37.
go back to reference Southwest University (2019) Southwest University adult lifespan dataset (sald). http://fcon_1000.projects.nitrc.org/indi/retro/sald.html Southwest University (2019) Southwest University adult lifespan dataset (sald). http://​fcon_​1000.​projects.​nitrc.​org/​indi/​retro/​sald.​html
38.
go back to reference Schneider J, Vlachos M (2017) Scalable density-based clustering with quality guarantees using random projections. Data Min Knowl Discov 31(4):972–1005MathSciNetCrossRef Schneider J, Vlachos M (2017) Scalable density-based clustering with quality guarantees using random projections. Data Min Knowl Discov 31(4):972–1005MathSciNetCrossRef
39.
go back to reference Shasha D, Zhu Y (2004) High performance discovery in time series, techniques and case studies. Springer, BerlinCrossRef Shasha D, Zhu Y (2004) High performance discovery in time series, techniques and case studies. Springer, BerlinCrossRef
40.
go back to reference Shasha D (1999) Tuning time series queries in finance: case studies and recommendations. IEEE Data Eng Bull 22(2):40–46 Shasha D (1999) Tuning time series queries in finance: case studies and recommendations. IEEE Data Eng Bull 22(2):40–46
41.
go back to reference Shieh J, Keogh E (2008) isax: Indexing and mining terabyte sized time series. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’08. pp 623–631 Shieh J, Keogh E (2008) isax: Indexing and mining terabyte sized time series. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’08. pp 623–631
42.
go back to reference Shieh J, Keogh E (2009) iSAX: disk-aware mining and indexing of massive time series datasets. DMKD 19(1):24–57 Shieh J, Keogh E (2009) iSAX: disk-aware mining and indexing of massive time series datasets. DMKD 19(1):24–57
43.
go back to reference Soldi S, Beckmann V, Baumgartner WH, Ponti G, Shrader CR, Lubiński P, Krimm HA, Mattana F, Tueller J (2014) Long-term variability of agn at hard x-rays. A&A 563:A57CrossRef Soldi S, Beckmann V, Baumgartner WH, Ponti G, Shrader CR, Lubiński P, Krimm HA, Mattana F, Tueller J (2014) Long-term variability of agn at hard x-rays. A&A 563:A57CrossRef
44.
go back to reference Yang W, Peng W, Jian P, Wei W, Sheng H (2013) A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB 6(10):793–804 Yang W, Peng W, Jian P, Wei W, Sheng H (2013) A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB 6(10):793–804
45.
go back to reference Wilkinson L, Anand A, Tuan DN (2011) Chirp: a new classifier based on composite hypercubes on iterated random projections. In: International conference on knowledge discovery and data mining (KDD). ACM, pp 6–14 Wilkinson L, Anand A, Tuan DN (2011) Chirp: a new classifier based on composite hypercubes on iterated random projections. In: International conference on knowledge discovery and data mining (KDD). ACM, pp 6–14
46.
go back to reference Yagoubi DE, Akbarinia R, Masseglia F, Palpanas T (2017) Dpisax: massively distributed partitioned isax. In: ICDM. pp 1135–1140 Yagoubi DE, Akbarinia R, Masseglia F, Palpanas T (2017) Dpisax: massively distributed partitioned isax. In: ICDM. pp 1135–1140
47.
go back to reference Yagoubi DE, Akbarinia R, Masseglia F, Palpanas T (2020) Massively distributed time series indexing and querying. IEEE Trans Knowl Data Eng 32(1):108–120CrossRef Yagoubi DE, Akbarinia R, Masseglia F, Palpanas T (2020) Massively distributed time series indexing and querying. IEEE Trans Knowl Data Eng 32(1):108–120CrossRef
48.
go back to reference Yagoubi DE, Akbarinia R, Masseglia F, Shasha DE (2017) Radiussketch: massively distributed indexing of time series. In: 2017 IEEE international conference on data science and advanced analytics, DSAA 2017, Tokyo, Japan, October 19–21, 2017. pp 262–271 Yagoubi DE, Akbarinia R, Masseglia F, Shasha DE (2017) Radiussketch: massively distributed indexing of time series. In: 2017 IEEE international conference on data science and advanced analytics, DSAA 2017, Tokyo, Japan, October 19–21, 2017. pp 262–271
49.
go back to reference Ye L, Keogh EJ (2009) Time series shapelets: a new primitive for data mining. In: KDD Ye L, Keogh EJ (2009) Time series shapelets: a new primitive for data mining. In: KDD
50.
go back to reference Zoumpatianos K, Idreos S, Palpanas T (2014) Indexing for interactive exploration of big data series. In: Proceedings of the international conference on management of data (SIGMOD), SIGMOD ’14. pp 1555–1566 Zoumpatianos K, Idreos S, Palpanas T (2014) Indexing for interactive exploration of big data series. In: Proceedings of the international conference on management of data (SIGMOD), SIGMOD ’14. pp 1555–1566
51.
go back to reference Zoumpatianos K, Idreos S, Palpanas T (2016) ADS: the adaptive data series index. VLDB J 25(6):843–866CrossRef Zoumpatianos K, Idreos S, Palpanas T (2016) ADS: the adaptive data series index. VLDB J 25(6):843–866CrossRef
52.
go back to reference Zoumpatianos K, Palpanas T (2018) Data series management: fulfilling the need for big sequence analytics. In: ICDE Zoumpatianos K, Palpanas T (2018) Data series management: fulfilling the need for big sequence analytics. In: ICDE
Metadata
Title
BestNeighbor: efficient evaluation of kNN queries on large time series databases
Authors
Oleksandra Levchenko
Boyan Kolev
Djamel-Edine Yagoubi
Reza Akbarinia
Florent Masseglia
Themis Palpanas
Dennis Shasha
Patrick Valduriez
Publication date
16-11-2020
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 2/2021
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-020-01518-4

Other articles of this Issue 2/2021

Knowledge and Information Systems 2/2021 Go to the issue

Premium Partner