Skip to main content
Erschienen in: The VLDB Journal 6/2018

17.07.2018 | Regular Paper

Generating data series query workloads

verfasst von: Kostas Zoumpatianos, Yin Lou, Ioana Ileana, Themis Palpanas, Johannes Gehrke

Erschienen in: The VLDB Journal | Ausgabe 6/2018

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Data series (including time series) has attracted lots of interest in recent years. Most of the research has focused on how to efficiently support similarity or nearest neighbor queries over large data series collections (an important data mining task), and several data series summarization and indexing methods have been proposed in order to solve this problem. Up to this point, very little attention has been paid to properly evaluating such index structures, with most previous works relying solely on randomly selected data series to use as queries. In this work, we show that random workloads are inherently not suitable for the task at hand and we argue that there is a need for carefully generating query workloads. We define measures that capture the characteristics of queries, and we propose a method for generating workloads with the desired properties, that is, effectively evaluating and comparing data series summarizations and indexes. In our experimental evaluation, with carefully controlled query workloads, we shed light on key factors affecting the performance of nearest neighbor search in large data series collections. This is the first paper that introduces a method for quantifying hardness of data series queries, as well as the ability to generate queries of predefined hardness.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
Note that when these values are measured over time (usually at fixed time intervals), we call them time series. However, time series are just one special case of data series: A series can also be defined over other measures (e.g., mass in mass spectroscopy, position in genome sequences, angle in radial chemical profiles, etc.). For the rest of this paper, we will use the terms sequence, data series, and time series interchangeably.
 
3
This work (built on our preliminary version [46]) includes a more precise formal definition of the problem, a deeper analysis of previous workloads, a robust geometric solution for placing nearest neighbors at predefined distances from a query that removes earlier limitations, and an expanded experimental evaluation section.
 
4
In this work, we use the well-known FFT algorithm.
 
5
Informally, the effort is the amount of work that an index needs to perform. We formally define the notion of effort later in this section.
 
6
A similar definition has been proposed in the past [5].
 
7
We also use the same datasets in our experimental section.
 
8
ftp://ftp.ensembl.org/pub/release-42/
 
9
This algorithm iterates over all symbols in the DNA sequence and constructs the series as a cumulative sum, which increases by 2 for every appearance of the base “A,” by 1 for “G” and decreases by 1 and 2 for each appearance of “C” and “T,” respectively.
 
Literatur
1.
Zurück zum Zitat Agrawal, R., Faloutsos, C., Swami, A.: Efficient similarity search in sequence databases. In: FODO (1993)CrossRef Agrawal, R., Faloutsos, C., Swami, A.: Efficient similarity search in sequence databases. In: FODO (1993)CrossRef
2.
Zurück zum Zitat Assent, I., Krieger, R., Afschari, F., Seidl, T.: The ts-tree: Efficient time series search and retrieval. In: EDBT (2008) Assent, I., Krieger, R., Afschari, F., Seidl, T.: The ts-tree: Efficient time series search and retrieval. In: EDBT (2008)
4.
Zurück zum Zitat Bay, S.D., Kibler, D., Pazzani, M.J., Smyth, P.: The uci kdd archive of large data sets for data mining research and experimentation. In: SIGKDD Explorations (2000) Bay, S.D., Kibler, D., Pazzani, M.J., Smyth, P.: The uci kdd archive of large data sets for data mining research and experimentation. In: SIGKDD Explorations (2000)
5.
Zurück zum Zitat Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor" meaningful? In: ICDT (1999) Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor" meaningful? In: ICDT (1999)
6.
Zurück zum Zitat Camerra, A., Palpanas, T., Shieh, J., Keogh, E.: iSAX 2.0: Indexing and mining one billion time series. In: ICDM (2010) Camerra, A., Palpanas, T., Shieh, J., Keogh, E.: iSAX 2.0: Indexing and mining one billion time series. In: ICDM (2010)
7.
Zurück zum Zitat Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.: Beyond one billion time series: indexing and mining very large time series collections with isax2+. KAIS (2013) Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.: Beyond one billion time series: indexing and mining very large time series collections with isax2+. KAIS (2013)
8.
Zurück zum Zitat Chakrabarti, K., Keogh, E., Mehrotra, S., Pazzani, M.: Locally adaptive dimensionality reduction for indexing large time series databases. In: SIGMOD (2002) Chakrabarti, K., Keogh, E., Mehrotra, S., Pazzani, M.: Locally adaptive dimensionality reduction for indexing large time series databases. In: SIGMOD (2002)
9.
Zurück zum Zitat Chan, K.P., Fu, A.C.: Efficient time series matching by wavelets. In: ICDE (1999) Chan, K.P., Fu, A.C.: Efficient time series matching by wavelets. In: ICDE (1999)
10.
Zurück zum Zitat Chen, Q., Chen, L., Lian, X., Liu, Y., Yu, J.X.: Indexable pla for efficient similarity search. In: VLDB (2007) Chen, Q., Chen, L., Lian, X., Liu, Y., Yu, J.X.: Indexable pla for efficient similarity search. In: VLDB (2007)
12.
Zurück zum Zitat Dallachiesa, M., Nushi, B., Mirylenka, K., Palpanas, T.: Uncertain time-series similarity: Return to the basics. In: VLDB (2012) Dallachiesa, M., Nushi, B., Mirylenka, K., Palpanas, T.: Uncertain time-series similarity: Return to the basics. In: VLDB (2012)
13.
Zurück zum Zitat Dallachiesa, M., Palpanas, T., Ilyas, I.F.: Top-k nearest neighbor search in uncertain data series. In: VLDB (2015) Dallachiesa, M., Palpanas, T., Ilyas, I.F.: Top-k nearest neighbor search in uncertain data series. In: VLDB (2015)
15.
Zurück zum Zitat Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD (1994) Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD (1994)
16.
Zurück zum Zitat Fu, A.W., Leung, O.T., Keogh, E.J., Lin, J.: Finding time series discords based on haar transform. In: Advanced Data Mining and Applications, Second International Conference, ADMA 2006, Xi’an, China, August 14-16, 2006, Proceedings, pp. 31–41 (2006). https://doi.org/10.1007/11811305_3 CrossRef Fu, A.W., Leung, O.T., Keogh, E.J., Lin, J.: Finding time series discords based on haar transform. In: Advanced Data Mining and Applications, Second International Conference, ADMA 2006, Xi’an, China, August 14-16, 2006, Proceedings, pp. 31–41 (2006). https://​doi.​org/​10.​1007/​11811305_​3 CrossRef
17.
Zurück zum Zitat Goldin, D.Q., Kanellakis, P.C.: On similarity queries for time-series data: Constraint specification and implementation. In: Principles and Practice of Constraint Programming (1995)CrossRef Goldin, D.Q., Kanellakis, P.C.: On similarity queries for time-series data: Constraint specification and implementation. In: Principles and Practice of Constraint Programming (1995)CrossRef
18.
Zurück zum Zitat Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD (1984) Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD (1984)
19.
Zurück zum Zitat Huijse, P., Estévez, P.A., Protopapas, P., Principe, J.C., Zegers, P.: Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Comp. Int. Mag. 9(3), 27–39 (2014)CrossRef Huijse, P., Estévez, P.A., Protopapas, P., Principe, J.C., Zegers, P.: Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Comp. Int. Mag. 9(3), 27–39 (2014)CrossRef
20.
Zurück zum Zitat Kashino, K., Smith, G., Murase, H.: Time-series active search for quick retrieval of audio and video. In: ICASSP (1999) Kashino, K., Smith, G., Murase, H.: Time-series active search for quick retrieval of audio and video. In: ICASSP (1999)
21.
Zurück zum Zitat Kashyap, S., Karras, P.: Scalable knn search on vertically stored time series. In: KDD (2011) Kashyap, S., Karras, P.: Scalable knn search on vertically stored time series. In: KDD (2011)
22.
Zurück zum Zitat Keogh, E.: Machine learning in time series databases (and everything is a time series!). In: Tutorial at the AAAI International Conference on Artificial Intelligence, vol. 2 (2011) Keogh, E.: Machine learning in time series databases (and everything is a time series!). In: Tutorial at the AAAI International Conference on Artificial Intelligence, vol. 2 (2011)
23.
Zurück zum Zitat Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. KAIS 3 (2000)CrossRef Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. KAIS 3 (2000)CrossRef
24.
Zurück zum Zitat Keogh, E., Pazzani, M.: Scaling up dynamic time warping to massive datasets. In: PKDD (1999) Keogh, E., Pazzani, M.: Scaling up dynamic time warping to massive datasets. In: PKDD (1999)
25.
Zurück zum Zitat Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD (1997) Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD (1997)
26.
Zurück zum Zitat Kremer, H., Günnemann, S., Ivanescu, A.M., Assent, I., Seidl, T.: Efficient processing of multiple dtw queries in time series databases. In: SSDBM (2011) Kremer, H., Günnemann, S., Ivanescu, A.M., Assent, I., Seidl, T.: Efficient processing of multiple dtw queries in time series databases. In: SSDBM (2011)
27.
Zurück zum Zitat Li, C.S., Yu, P., Castelli, V.: Hierarchyscan: a hierarchical similarity search algorithm for databases of long sequences. In: ICDE (1996) Li, C.S., Yu, P., Castelli, V.: Hierarchyscan: a hierarchical similarity search algorithm for databases of long sequences. In: ICDE (1996)
28.
Zurück zum Zitat Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A symbolic representation of time series, with implications for streaming algorithms. In: DMKD (2003) Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A symbolic representation of time series, with implications for streaming algorithms. In: DMKD (2003)
29.
Zurück zum Zitat Lin, J., Keogh, E.J., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2), 107–144 (2007)MathSciNetCrossRef Lin, J., Keogh, E.J., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2), 107–144 (2007)MathSciNetCrossRef
30.
Zurück zum Zitat Lin, J., Khade, R., Li, Y.: Rotation-invariant similarity in time series using bag-of-patterns representation. J. Intell. Inf. Syst. 39(2), 287–315 (2012)CrossRef Lin, J., Khade, R., Li, Y.: Rotation-invariant similarity in time series using bag-of-patterns representation. J. Intell. Inf. Syst. 39(2), 287–315 (2012)CrossRef
32.
Zurück zum Zitat Rafiei, D., Mendelzon, A.: Similarity-based queries for time series data. In: SIGMOD (1997) Rafiei, D., Mendelzon, A.: Similarity-based queries for time series data. In: SIGMOD (1997)
33.
Zurück zum Zitat Rafiei, D., Mendelzon, A.: Efficient retrieval of similar time sequences using dft. In: ICDE (1998) Rafiei, D., Mendelzon, A.: Efficient retrieval of similar time sequences using dft. In: ICDE (1998)
34.
Zurück zum Zitat Rakthanmanon, T., Campana, B., Mueen, A., Batista, G., Westover, B., Zhu, Q., Zakaria, J., Keogh, E.: Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD (2012) Rakthanmanon, T., Campana, B., Mueen, A., Batista, G., Westover, B., Zhu, Q., Zakaria, J., Keogh, E.: Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD (2012)
36.
Zurück zum Zitat Ravi Kanth, K.V., Agrawal, D., Singh, A.: Dimensionality reduction for similarity searching in dynamic databases. In: SIGMOD (1998) Ravi Kanth, K.V., Agrawal, D., Singh, A.: Dimensionality reduction for similarity searching in dynamic databases. In: SIGMOD (1998)
37.
Zurück zum Zitat Schäfer, P., Högqvist, M.: Sfa: A symbolic fourier approximation and index for similarity search in high dimensional datasets. In: EDBT (2012) Schäfer, P., Högqvist, M.: Sfa: A symbolic fourier approximation and index for similarity search in high dimensional datasets. In: EDBT (2012)
38.
Zurück zum Zitat Shasha, D.: Tuning time series queries in finance: Case studies and recommendations. IEEE Data Eng. Bull. 22(2), 40–46 (1999) Shasha, D.: Tuning time series queries in finance: Case studies and recommendations. IEEE Data Eng. Bull. 22(2), 40–46 (1999)
39.
Zurück zum Zitat Shieh, J., Keogh, E.: isax: Indexing and mining terabyte sized time series. In: KDD (2008) Shieh, J., Keogh, E.: isax: Indexing and mining terabyte sized time series. In: KDD (2008)
40.
Zurück zum Zitat Wang, X., Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P., Keogh, E.: Experimental comparison of representation methods and distance measures for time series data. DMKD 26(2), 275–309 (2013)MathSciNetCrossRef Wang, X., Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P., Keogh, E.: Experimental comparison of representation methods and distance measures for time series data. DMKD 26(2), 275–309 (2013)MathSciNetCrossRef
41.
Zurück zum Zitat Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. In: VLDB (2013) Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. In: VLDB (2013)
42.
Zurück zum Zitat Ye, L., Keogh, E.J.: Time series shapelets: a new primitive for data mining. In: KDD (2009) Ye, L., Keogh, E.J.: Time series shapelets: a new primitive for data mining. In: KDD (2009)
43.
Zurück zum Zitat Yi, B.K., Jagadish, H., Faloutsos, C.: Efficient retrieval of similar time sequences under time warping. In: ICDE (1998) Yi, B.K., Jagadish, H., Faloutsos, C.: Efficient retrieval of similar time sequences under time warping. In: ICDE (1998)
44.
Zurück zum Zitat Zoumpatianos, K., Idreos, S., Palpanas, T.: Indexing for interactive exploration of big data series. In: SIGMOD (2014) Zoumpatianos, K., Idreos, S., Palpanas, T.: Indexing for interactive exploration of big data series. In: SIGMOD (2014)
45.
Zurück zum Zitat Zoumpatianos, K., Idreos, S., Palpanas, T.: Rinse: Interactive data series exploration. In: VLDB (2015)CrossRef Zoumpatianos, K., Idreos, S., Palpanas, T.: Rinse: Interactive data series exploration. In: VLDB (2015)CrossRef
46.
Zurück zum Zitat Zoumpatianos, K., Lou, Y., Palpanas, T., Gehrke, J.: Query workloads for data series indexes. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pp. 1603–1612 (2015) Zoumpatianos, K., Lou, Y., Palpanas, T., Gehrke, J.: Query workloads for data series indexes. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pp. 1603–1612 (2015)
Metadaten
Titel
Generating data series query workloads
verfasst von
Kostas Zoumpatianos
Yin Lou
Ioana Ileana
Themis Palpanas
Johannes Gehrke
Publikationsdatum
17.07.2018
Verlag
Springer Berlin Heidelberg
Erschienen in
The VLDB Journal / Ausgabe 6/2018
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-018-0513-x

Weitere Artikel der Ausgabe 6/2018

The VLDB Journal 6/2018 Zur Ausgabe