Skip to main content
Top
Published in: Knowledge and Information Systems 8/2020

06-03-2020 | Regular Paper

SUM-optimal histograms for approximate query processing

Authors: Meifan Zhang, Hongzhi Wang, Jianzhong Li, Hong Gao

Published in: Knowledge and Information Systems | Issue 8/2020

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

In this paper, we study the problem of the SUM query approximation with histograms. We define a new kind of histogram called the SUM-optimal histogram which can provide better estimation result for the SUM queries than the traditional equi-depth and V-optimal histograms. We propose three methods for the histogram construction. The first one is a dynamic programming method, and the other two are approximate methods. We use a greedy strategy to insert separators into a histogram and use the stochastic gradient descent method to improve the accuracy of separators. The experimental results indicate that our method can provide better estimations for the SUM queries than the equi-depth and V-optimal histograms.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Acharya J, Diakonikolas I, Hegde C, Li JZ, Schmidt L (2015) Fast and near-optimal algorithms for approximating distributions by histograms. In: Proceedings of the 34th ACM symposium on principles of database systems, PODS 2015, Melbourne, Victoria, Australia, May 31–June 4, 2015, pp 249–263 Acharya J, Diakonikolas I, Hegde C, Li JZ, Schmidt L (2015) Fast and near-optimal algorithms for approximating distributions by histograms. In: Proceedings of the 34th ACM symposium on principles of database systems, PODS 2015, Melbourne, Victoria, Australia, May 31–June 4, 2015, pp 249–263
2.
go back to reference Acharya S, Gibbons PB, Poosala V (2000) Congressional samples for approximate answering of group-by queries. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, May 16–18, 2000, Dallas, TX, USA, pp 487–498 Acharya S, Gibbons PB, Poosala V (2000) Congressional samples for approximate answering of group-by queries. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, May 16–18, 2000, Dallas, TX, USA, pp 487–498
3.
go back to reference Acharya S, Gibbons PB, Poosala V, Ramaswamy S (1999) The aqua approximate query answering system. In: SIGMOD 1999, proceedings ACM SIGMOD international conference on management of data, June 1–3, 1999, Philadelphia, PA, USA, pp 574–576 Acharya S, Gibbons PB, Poosala V, Ramaswamy S (1999) The aqua approximate query answering system. In: SIGMOD 1999, proceedings ACM SIGMOD international conference on management of data, June 1–3, 1999, Philadelphia, PA, USA, pp 574–576
4.
go back to reference Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) Blinkdb: queries with bounded errors and bounded response times on very large data. In: Eighth Eurosys conference 2013, EuroSys’13, Prague, Czech Republic, April 14–17, 2013, pp 29–42 Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) Blinkdb: queries with bounded errors and bounded response times on very large data. In: Eighth Eurosys conference 2013, EuroSys’13, Prague, Czech Republic, April 14–17, 2013, pp 29–42
5.
go back to reference Agrawal R, Swami AN (1995) A one-pass space-efficient algorithm for finding quantiles. In: COMAD Agrawal R, Swami AN (1995) A one-pass space-efficient algorithm for finding quantiles. In: COMAD
6.
go back to reference Buccafurri F, Furfaro F, Mazzeo GM, Saccà D (2011) A quad-tree based multiresolution approach for two-dimensional summary data. Inf Syst 36(7):1082–1103CrossRef Buccafurri F, Furfaro F, Mazzeo GM, Saccà D (2011) A quad-tree based multiresolution approach for two-dimensional summary data. Inf Syst 36(7):1082–1103CrossRef
7.
go back to reference Buccafurri F, Lax G, Saccà D, Pontieri L, Rosaci D (2008) Enhancing histograms by tree-like bucket indices. VLDB J 17(5):1041–1061CrossRef Buccafurri F, Lax G, Saccà D, Pontieri L, Rosaci D (2008) Enhancing histograms by tree-like bucket indices. VLDB J 17(5):1041–1061CrossRef
8.
go back to reference Chaiken R, Jenkins B, Larson PÅ, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. PVLDB 1(2):1265–1276 Chaiken R, Jenkins B, Larson PÅ, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. PVLDB 1(2):1265–1276
9.
go back to reference Chaudhuri S, Das G, Datar M, Motwani R, Narasayya VR (2001) Overcoming limitations of sampling for aggregation queries. In: Proceedings of the 17th international conference on data engineering, April 2–6, 2001, Heidelberg, Germany, pp 534–542 Chaudhuri S, Das G, Datar M, Motwani R, Narasayya VR (2001) Overcoming limitations of sampling for aggregation queries. In: Proceedings of the 17th international conference on data engineering, April 2–6, 2001, Heidelberg, Germany, pp 534–542
10.
go back to reference Chaudhuri S, Das G, Narasayya VR (2001) A robust, optimization-based approach for approximate answering of aggregate queries. In: Proceedings of the 2001 ACM SIGMOD international conference on management of data, Santa Barbara, CA, USA, May 21–24, 2001, pp 295–306 Chaudhuri S, Das G, Narasayya VR (2001) A robust, optimization-based approach for approximate answering of aggregate queries. In: Proceedings of the 2001 ACM SIGMOD international conference on management of data, Santa Barbara, CA, USA, May 21–24, 2001, pp 295–306
11.
go back to reference Chaudhuri S, Ding B, Kandula S (2017) Approximate query processing: no silver bullet. In: Proceedings of the 2017 ACM international conference on management of data, SIGMOD conference 2017, Chicago, IL, USA, May 14–19, 2017, pp 511–519 Chaudhuri S, Ding B, Kandula S (2017) Approximate query processing: no silver bullet. In: Proceedings of the 2017 ACM international conference on management of data, SIGMOD conference 2017, Chicago, IL, USA, May 14–19, 2017, pp 511–519
12.
go back to reference Chaudhuri S, Motwani R, Narasayya VR (1998) Random sampling for histogram construction: How much is enough? In: SIGMOD 1998, proceedings ACM SIGMOD international conference on management of data, June 2–4, 1998, Seattle, Washington, USA, pp 436–447 Chaudhuri S, Motwani R, Narasayya VR (1998) Random sampling for histogram construction: How much is enough? In: SIGMOD 1998, proceedings ACM SIGMOD international conference on management of data, June 2–4, 1998, Seattle, Washington, USA, pp 436–447
13.
go back to reference Cormode G, Garofalakis MN, Haas PJ, Jermaine C (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Found Trends Databases 4(1–3):1–294MATH Cormode G, Garofalakis MN, Haas PJ, Jermaine C (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Found Trends Databases 4(1–3):1–294MATH
14.
go back to reference Ding X, Liu P, Jin H (2019) Privacy-preserving multi-keyword top-\(k\) k similarity search over encrypted data. IEEE Trans Dependable Sec Comput 16(2):344–357CrossRef Ding X, Liu P, Jin H (2019) Privacy-preserving multi-keyword top-\(k\) k similarity search over encrypted data. IEEE Trans Dependable Sec Comput 16(2):344–357CrossRef
15.
go back to reference Ding X, Yang W, Choo K-KR, Wang X, Jin H (2019) Privacy preserving similarity joins using mapreduce. Inf Sci 493:20–33CrossRef Ding X, Yang W, Choo K-KR, Wang X, Jin H (2019) Privacy preserving similarity joins using mapreduce. Inf Sci 493:20–33CrossRef
16.
go back to reference Galakatos A, Crotty A, Zgraggen E, Binnig C, Kraska T (2017) Revisiting reuse for approximate query processing. PVLDB 10(10):1142–1153 Galakatos A, Crotty A, Zgraggen E, Binnig C, Kraska T (2017) Revisiting reuse for approximate query processing. PVLDB 10(10):1142–1153
17.
go back to reference Gibbons PB, Matias Y, Poosala V (2002) Fast incremental maintenance of approximate histograms. ACM Trans Database Syst 27(3):261–298CrossRef Gibbons PB, Matias Y, Poosala V (2002) Fast incremental maintenance of approximate histograms. ACM Trans Database Syst 27(3):261–298CrossRef
18.
go back to reference Gilbert AC, Guha S, Indyk P, Kotidis Y, Muthukrishnan S, Strauss M (2002) Fast, small-space algorithms for approximate histogram maintenance. In: STOC. ACM, New York, pp 389–398 Gilbert AC, Guha S, Indyk P, Kotidis Y, Muthukrishnan S, Strauss M (2002) Fast, small-space algorithms for approximate histogram maintenance. In: STOC. ACM, New York, pp 389–398
19.
go back to reference Greenwald M, Khanna S (2001) Space-efficient online computation of quantile summaries. In: Proceedings of the 2001 ACM SIGMOD international conference on management of data, Santa Barbara, CA, USA, May 21–24, 2001, pp 58–66 Greenwald M, Khanna S (2001) Space-efficient online computation of quantile summaries. In: Proceedings of the 2001 ACM SIGMOD international conference on management of data, Santa Barbara, CA, USA, May 21–24, 2001, pp 58–66
20.
go back to reference Guha S, Koudas N, Shim K (2006) Approximation and streaming algorithms for histogram construction problems. ACM Trans Database Syst 31(1):396–438CrossRef Guha S, Koudas N, Shim K (2006) Approximation and streaming algorithms for histogram construction problems. ACM Trans Database Syst 31(1):396–438CrossRef
21.
go back to reference Indyk P, Levi R, Rubinfeld R (2012) Approximating and testing \(k\)-histogram distributions in sub-linear time. In: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS 2012, Scottsdale, AZ, USA, May 20–24, 2012, pp 15–22 Indyk P, Levi R, Rubinfeld R (2012) Approximating and testing \(k\)-histogram distributions in sub-linear time. In: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS 2012, Scottsdale, AZ, USA, May 20–24, 2012, pp 15–22
22.
go back to reference Ioannidis YE, Poosala V (1995) Balancing histogram optimality and practicality for query result size estimation. In: Proceedings of the 1995 ACM SIGMOD international conference on management of data, San Jose, California, May 22–25, 1995, pp 233–244 Ioannidis YE, Poosala V (1995) Balancing histogram optimality and practicality for query result size estimation. In: Proceedings of the 1995 ACM SIGMOD international conference on management of data, San Jose, California, May 22–25, 1995, pp 233–244
23.
go back to reference Jagadish HV, Koudas N, Muthukrishnan S, Poosala V, Sevcik KC, Suel T (1998) Optimal histograms with quality guarantees. In: VLDB’98, proceedings of 24th international conference on very large data bases, August 24–27, 1998, New York City, NY, USA, pp 275–286 Jagadish HV, Koudas N, Muthukrishnan S, Poosala V, Sevcik KC, Suel T (1998) Optimal histograms with quality guarantees. In: VLDB’98, proceedings of 24th international conference on very large data bases, August 24–27, 1998, New York City, NY, USA, pp 275–286
24.
go back to reference Joseph AG, Bhatnagar S (2015) A stochastic approximation algorithm for quantile estimation. In: Neural information processing—22nd international conference, ICONIP 2015, Istanbul, Turkey, November 9–12, 2015, Proceedings, Part II, pp 311–319 Joseph AG, Bhatnagar S (2015) A stochastic approximation algorithm for quantile estimation. In: Neural information processing—22nd international conference, ICONIP 2015, Istanbul, Turkey, November 9–12, 2015, Proceedings, Part II, pp 311–319
25.
go back to reference Li K, Li G (2018) Approximate query processing: What is new and where to go? A survey on approximate query processing. Data Sci Eng 3(4):379–397CrossRef Li K, Li G (2018) Approximate query processing: What is new and where to go? A survey on approximate query processing. Data Sci Eng 3(4):379–397CrossRef
26.
go back to reference Ma Q, Triantafillou P (2019) Dbest: revisiting approximate query processing engines with machine learning models. In: Proceedings of the 2019 international conference on management of data, SIGMOD conference 2019, Amsterdam, The Netherlands, June 30–July 5, 2019, pp 1553–1570 Ma Q, Triantafillou P (2019) Dbest: revisiting approximate query processing engines with machine learning models. In: Proceedings of the 2019 international conference on management of data, SIGMOD conference 2019, Amsterdam, The Netherlands, June 30–July 5, 2019, pp 1553–1570
27.
go back to reference Melnik S, Gubarev A, Long JJ, Romer G, Shivakumar S, Tolton M, Vassilakis T (2011) Dremel: interactive analysis of web-scale datasets. Commun ACM 54(6):114–123CrossRef Melnik S, Gubarev A, Long JJ, Romer G, Shivakumar S, Tolton M, Vassilakis T (2011) Dremel: interactive analysis of web-scale datasets. Commun ACM 54(6):114–123CrossRef
29.
go back to reference Muthukrishnan S, Poosala V, Suel T (1999) On rectangular partitionings in two dimensions: algorithms, complexity, and applications. In: Database Theory—ICDT’99, 7th international conference, Jerusalem, Israel, January 10–12, 1999, Proceedings, pp 236–256 Muthukrishnan S, Poosala V, Suel T (1999) On rectangular partitionings in two dimensions: algorithms, complexity, and applications. In: Database Theory—ICDT’99, 7th international conference, Jerusalem, Israel, January 10–12, 1999, Proceedings, pp 236–256
30.
go back to reference Olma M, Papapetrou O, Appuswamy R, Ailamaki A (2019) Taster: self-tuning, elastic and online approximate query processing. In: 35th IEEE international conference on data engineering, ICDE 2019, Macao, China, April 8–11, 2019, pp 482–493 Olma M, Papapetrou O, Appuswamy R, Ailamaki A (2019) Taster: self-tuning, elastic and online approximate query processing. In: 35th IEEE international conference on data engineering, ICDE 2019, Macao, China, April 8–11, 2019, pp 482–493
31.
go back to reference Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig latin: a not-so-foreign language for data processing. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2008, Vancouver, BC, Canada, June 10–12, 2008, pp 1099–1110 Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig latin: a not-so-foreign language for data processing. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2008, Vancouver, BC, Canada, June 10–12, 2008, pp 1099–1110
32.
go back to reference Pearson K (1901) Mathematical contributions to the theory of evolution. X. Supplement to a memoir on skew variation. Philos Trans R Soc Lond 197(11):443–459MATH Pearson K (1901) Mathematical contributions to the theory of evolution. X. Supplement to a memoir on skew variation. Philos Trans R Soc Lond 197(11):443–459MATH
33.
go back to reference Peng J, Zhang D, Wang J, Pei J (2018) AQP++: connecting approximate query processing with aggregate precomputation for interactive analytics. In: Proceedings of the 2018 international conference on management of data, SIGMOD conference 2018, Houston, TX, USA, June 10–15, 2018, pp 1477–1492 Peng J, Zhang D, Wang J, Pei J (2018) AQP++: connecting approximate query processing with aggregate precomputation for interactive analytics. In: Proceedings of the 2018 international conference on management of data, SIGMOD conference 2018, Houston, TX, USA, June 10–15, 2018, pp 1477–1492
34.
go back to reference Piatetsky-Shapiro G, Connell C (1984) Accurate estimation of the number of tuples satisfying a condition. In: SIGMOD’84, proceedings of annual meeting, Boston, MA, June 18–21, 1984, pp 256–276 Piatetsky-Shapiro G, Connell C (1984) Accurate estimation of the number of tuples satisfying a condition. In: SIGMOD’84, proceedings of annual meeting, Boston, MA, June 18–21, 1984, pp 256–276
35.
go back to reference Poosala V, Ioannidis YE (1996) Estimation of query-result distribution and its application in parallel-join load balancing. In: VLDB’96, proceedings of 22nd international conference on very large data bases, September 3–6, 1996, Mumbai (Bombay), India, pp 448–459 Poosala V, Ioannidis YE (1996) Estimation of query-result distribution and its application in parallel-join load balancing. In: VLDB’96, proceedings of 22nd international conference on very large data bases, September 3–6, 1996, Mumbai (Bombay), India, pp 448–459
36.
go back to reference Poosala V, Ioannidis YE, Haas PJ, Shekita EJ (1996) Improved histograms for selectivity estimation of range predicates. In: Proceedings of the 1996 ACM SIGMOD international conference on management of data, Montreal, Quebec, Canada, June 4–6, 1996, pp 294–305 Poosala V, Ioannidis YE, Haas PJ, Shekita EJ (1996) Improved histograms for selectivity estimation of range predicates. In: Proceedings of the 1996 ACM SIGMOD international conference on management of data, Montreal, Quebec, Canada, June 4–6, 1996, pp 294–305
38.
go back to reference Shekelyan M, Dignös A, Gamper J (2017) Digithist: a histogram-based data summary with tight error bounds. PVLDB 10(11):1514–1525 Shekelyan M, Dignös A, Gamper J (2017) Digithist: a histogram-based data summary with tight error bounds. PVLDB 10(11):1514–1525
39.
go back to reference Sidirourgos L, Kersten ML, Boncz PA (2011) Sciborq: scientific data management with bounds on runtime and quality. In: CIDR 2011, 5th biennial conference on innovative data systems research, Asilomar, CA, USA, January 9–12, 2011, online proceedings, pp 296–301 Sidirourgos L, Kersten ML, Boncz PA (2011) Sciborq: scientific data management with bounds on runtime and quality. In: CIDR 2011, 5th biennial conference on innovative data systems research, Asilomar, CA, USA, January 9–12, 2011, online proceedings, pp 296–301
40.
go back to reference Song G, Wenwen Q, Liu X, Wang X (2018) Approximate calculation of window aggregate functions via global random sample. Data Sci Eng 3(1):40–51CrossRef Song G, Wenwen Q, Liu X, Wang X (2018) Approximate calculation of window aggregate functions via global random sample. Data Sci Eng 3(1):40–51CrossRef
41.
go back to reference To H, Chiang K, Shahabi C (2013) Entropy-based histograms for selectivity estimation. In: 22nd ACM international conference on information and knowledge management, CIKM’13, San Francisco, CA, USA, October 27–November 1, 2013, pp 1939–1948 To H, Chiang K, Shahabi C (2013) Entropy-based histograms for selectivity estimation. In: 22nd ACM international conference on information and knowledge management, CIKM’13, San Francisco, CA, USA, October 27–November 1, 2013, pp 1939–1948
42.
go back to reference Yildiz B, Büyüktanir T, Emekçi F (2016) Equi-depth histogram construction for big data with quality guarantees. CoRR arXiv:1606.05633 Yildiz B, Büyüktanir T, Emekçi F (2016) Equi-depth histogram construction for big data with quality guarantees. CoRR arXiv:​1606.​05633
Metadata
Title
SUM-optimal histograms for approximate query processing
Authors
Meifan Zhang
Hongzhi Wang
Jianzhong Li
Hong Gao
Publication date
06-03-2020
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 8/2020
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-020-01450-7

Other articles of this Issue 8/2020

Knowledge and Information Systems 8/2020 Go to the issue

Premium Partner