nach oben

The VLDB Journal

Erschienen in:

02.11.2019 | Special Issue Paper

EntropyDB: a probabilistic approach to approximate query processing

verfasst von: Laurel Orr, Magdalena Balazinska, Dan Suciu

Erschienen in: The VLDB Journal | Ausgabe 1/2020

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

We present, an interactive data exploration system that uses a probabilistic approach to generate a small, query-able summary of a dataset. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques, give two critical optimizations to improve preprocessing time and query execution time, and explore methods to reduce query error. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the USA and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can successfully answer queries faster than sampling while introducing, on average, no more error than sampling and can better distinguish between rare and nonexistent values. We also discuss extensions that can allow for data updates and linear queries over joins.

Vorheriger Artikel DB: bolt-on versioning for relational databases (extended version)

Nächster Artikel Adaptive partitioning and indexing for in situ query processing

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Using the MaxEnt principle will generate a probability distribution that is different from the tuple-independent distribution because the MaxEnt principle does not guarantee tuple independence.

We support continuous data types by bucketizing their active domains.

This is a standard data model in several applications, such as differential privacy [31].

We abuse notation here for readability. Technically, \(\alpha _{i} = \alpha _{a_i}\), \(\beta _{i} = \alpha _{b_i}\), and \(\gamma _{i} = \alpha _{c_i}\).

\(\mathcal {P}([k])\) is the power set of \(\{1, 2, \ldots , k\}\).

This can be found by calculating, for all attribute pairs, the Chi-squared value on the contingency table of \(A_{i_1}\) and \(A_{i_2}\) and sorting from highest to lowest Chi-squared value.

The maximum amount of memory used in experiments was approximately 40 GB, meaning a system this large is not required.

Pair 3 is the same for FlightsCoarse and FlightsFine

Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: ACM Sigmod Record, vol. 29, pp. 487–498. ACM (2000)

Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: The aqua approximate query answering system. In: ACM Sigmod Record, vol. 28, pp. 574–576. ACM (1999)

Agarwal, S., et al.: Blinkdb: queries with bounded errors and bounded response times on very large data. In: Proceedings of EuroSys’13, pp. 29–42 (2013)

Applegate, D.A., Calinescu, G., Johnson, D.S., Karloff, H., Ligett, K., Wang, J.: Compressing rectilinear pictures and minimizing access control lists. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms

Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 539–550 (2003)

Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D., Trevisan, L.: Counting distinct elements in a data stream. In: International Workshop on Randomization and Approximation Techniques in Computer Science, pp. 1–10. Springer (2002)

Behrisch, M., Bach, B., Henry Riche, N., Schreck, T., Fekete, J.-D.: Matrix reordering methods for table and network visualization. In: Computer Graphics Forum, vol. 35, pp. 693–716. Wiley Online Library (2016)

Bekker, J., Davis, J., Choi, A., Darwiche, A., Van den Broeck, G.: Tractable learning for complex probability queries. In: Advances in Neural Information Processing Systems, pp. 2242–2250 (2015)

Berger, A.L., Pietra, V.J.D., Pietra, S.A.D.: A maximum entropy approach to natural language processing. Comput. Linguist. 22(1), 39–71 (1996)

10.

Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015) CrossRef

11.

Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. VLDB J. Int. J. Very Large Data Bases 10(2–3), 199–223 (2001)CrossRef

12.

Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approximate answering of aggregate queries. ACM SIGMOD Rec. 30, 295–306 (2001)CrossRef

13.

Chaudhuri, S., Ding, B., Kandula, S.: Approximate query processing: No silver bullet. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 511–519. ACM (2017)

14.

Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), 462–467 (1968)MathSciNetCrossRef

15.

Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C., et al.: Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends® Databases 4(1–3), 1–294 (2011)MATH

16.

Crotty, A., Galakatos, A., Zgraggen, A., Binnig, C., Kraska, T.: The case for interactive data exploration accelerators (ideas). In: Proceedings of the Workshop on Human-In-the-Loop Data Analytics, p. 11. ACM (2016)

17.

Dalvi, N., Ré, C., Suciu, D.: Probabilistic databases: diamonds in the dirt. Commun. ACM 52(7), 86–94 (2009)CrossRef

18.

Deshpande, A., Garofalakis, M.N., Rastogi, R.: Independence is good: dependency-based histogram synopses for high-dimensional data. In: SIGMOD Conference (2001)

19.

Ding, B., Huang, S., Chaudhuri, S., Chakrabarti, K., Wang, C.: Sample+seek: approximating aggregates with distribution precision guarantee. In: Proceedings of SIGMOD, pp. 679–694 (2016)

20.

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., Roth, A.: Generalization in adaptive data analysis and holdout reuse. In: Advances in Neural Information Processing Systems, pp. 2350–2358 (2015)

21.

Galakatos, A., Crotty, A., Zgraggen, E., Binnig, C., Kraska, T.: Revisiting reuse for approximate query processing. Proc. VLDB Endow. 10(10), 1142–1153 (2017)CrossRef

22.

Hardt, M., Rothblum, G.N.: A multiplicative weights mechanism for privacy-preserving data analysis. In: 2010 51st Annual IEEE Symposium on, Foundations of Computer Science (FOCS), pp. 61–70. IEEE (2010)

23.

Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: ACM SIGMOD Record, vol. 26, pp. 171–182. ACM (1997)

24.

Hosangadi, A., Fallah, F., Kastner, R.: Factoring and eliminating common subexpressions in polynomial expressions. In: IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004 (2004)

25.

Jermaine, C., Arumugam, S., Pol, A., Dobra, A.: Scalable approximate query processing with the DBO engine. ACM Trans. Database Syst. (TODS) 33(4), 23 (2008)CrossRef

26.

http://www.transtats.bts.gov/

27.

Jetley, P. et al.: Massively parallel cosmological simulations with ChaNGa. In: Proceedings of IPDPS (2008)

28.

Jordan, M.: An introduction to probabilistic graphical models (2003). http://www.cs.cmu.edu/~lebanon/pub/book/. Accessed 10 Nov 2018

29.

Kandula, S., Shanbhag, A., Vitorovic, A., Olma, M., Grandl, R., Chaudhuri, S., Ding, B.: Quickr: lazily approximating complex adhoc queries in bigdata clusters. In: Proceedings of the 2016 International Conference on Management of Data, pp. 631–646. ACM (2016)

30.

Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P., Kemper, A.: Learned cardinalities: estimating correlated joins with deep learning (2018). arXiv preprint arXiv:1809.00677

31.

Li, C., et al.: Optimizing linear counting queries under differential privacy. In: Proceedings of PODS, pp. 123–134 (2010)

32.

Li, K., Li, G.: Approximate query processing: what is new and where to go? Data Sci. Eng. 3(4), 379–397 (2018)CrossRef

33.

Li, K., Zhang, Y., Li, G., Tao, W., Yan, Y.: Bounded approximate query processing. IEEE Trans. Knowl. Data Eng. (2018). https://doi.org/10.1109/TKDE.2018.2877362 CrossRef

34.

Mäkinen, E., Siirtola, H.: Reordering the reorderable matrix as an algorithmic problem. In: International Conference on Theory and Application of Diagrams, pp. 453–468. Springer (2000)

35.

Markl, V., et al.: Consistently estimating the selectivity of conjuncts of predicates. In: Proceedings of VLDB, pp. 373–384. VLDB Endowment (2005)

36.

Mozafari, B., Niu, N.: A handbook for building an approximate query engine. IEEE Data Eng. Bull. 38(3), 3–29 (2015)

37.

Murphy, K.: Undirected graphical models (2006). https://www.cs.ubc.ca/~murphyk/Teaching/CS340-Fall06/reading/ugm.pdf. Accessed 19 Nov 2018

38.

Orr, L., Balazinska, M., Suciu, D.: Probabilistic database summarization for interactive data exploration. Proc. VLDB Endow. 10(10), 1154–1165 (2017)CrossRef

39.

Ortiz, J., Balazinska, M., Gehrke, J., Keerthi, S.S.: Learning state representations for query optimization with deep reinforcement learning (2018). arXiv preprint arXiv:1803.08604

40.

Park, Y., Mozafari, B., Sorenson, J., Wang, J.: Verdictdb: universalizing approximate query processing. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1461–1476. ACM (2018)

41.

Peng, J., Zhang, D., Wang, J., Pei, J.: Aqp++: connecting approximate query processing with aggregate precomputation for interactive analytics. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1477–1492. ACM (2018)

42.

Ré, C., Suciu, D.: Understanding cardinality estimation using entropy maximization. ACM TODS 37(1), 6 (2012)CrossRef

43.

Suciu, D., Olteanu, D., Ré, C., Koch, C.: Probabilistic databases. Synth. Lect. Data Manag. 3(2), 1–180 (2011)CrossRef

44.

Teh, Y.W., Welling, M.: On improving the efficiency of the iterative proportional fitting procedure. In: AIStats (2003)

45.

Thirumuruganathan, S., Hasan, S., Koudas, N., Das, G.: Approximate query processing using deep generative models (2019). arXiv preprint arXiv:1903.10000

46.

Tzoumas, K., Deshpande, A., Jensen, C.S.: Efficiently adapting graphical models for selectivity estimation. VLDB J. 22(1), 3–27 (2013)CrossRef

47.

Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)CrossRef

48.

Wu, M., Jermaine, C.: A Bayesian method for guessing the extreme values in a data set? In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 471–482. VLDB Endowment (2007)

49.

Yang, E., Ravikumar, P., Allen, G.I., Liu, Z.: Graphical models via univariate exponential family distributions. J. Mach. Learn. Res. 16(1), 3813–3847 (2015)MathSciNetMATH

Titel: EntropyDB: a probabilistic approach to approximate query processing
verfasst von: Laurel Orr
Magdalena Balazinska
Dan Suciu
Publikationsdatum: 02.11.2019
Verlag: Springer Berlin Heidelberg
Erschienen in: The VLDB Journal / Ausgabe 1/2020
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI: https://doi.org/10.1007/s00778-019-00582-9

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 1/2020

Complex event recognition in the Big Data era: a survey

Adaptive partitioning and indexing for in situ query processing

Evaluating interactive data systems

Microblogs data management: a survey

Explaining Natural Language query results

LSM-based storage techniques: a survey