skip to main content
10.1145/2588555.2588579acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

The analytical bootstrap: a new method for fast error estimation in approximate query processing

Authors Info & Claims
Published:18 June 2014Publication History

ABSTRACT

Sampling is one of the most commonly used techniques in Approximate Query Processing (AQP)-an area of research that is now made more critical by the need for timely and cost-effective analytics over "Big Data". Assessing the quality (i.e., estimating the error) of approximate answers is essential for meaningful AQP, and the two main approaches used in the past to address this problem are based on either (i) analytic error quantification or (ii) the bootstrap method. The first approach is extremely efficient but lacks generality, whereas the second is quite general but suffers from its high computational overhead. In this paper, we introduce a probabilistic relational model for the bootstrap process, along with rigorous semantics and a unified error model, which bridges the gap between these two traditional approaches. Based on our probabilistic framework, we develop efficient algorithms to predict the distribution of the approximation results. These enable the computation of any bootstrap-based quality measure for a large class of SQL queries via a single-round evaluation of a slightly modified query. Extensive experiments on both synthetic and real-world datasets show that our method has superior prediction accuracy for bootstrap-based quality measures, and is several orders of magnitude faster than bootstrap.

References

  1. Conviva Inc. http://www.conviva.com/.Google ScholarGoogle Scholar
  2. MonetDB. http://www.monetdb.org/Home.Google ScholarGoogle Scholar
  3. The R Project. http://www.r-project.org/.Google ScholarGoogle Scholar
  4. TPC-H Benchmark. http://www.tpc.org/tpch/.Google ScholarGoogle Scholar
  5. Vertica Inc. http://www.vertica.com/.Google ScholarGoogle Scholar
  6. S. Acharya, P. B. Gibbons, et al. Join Synopses for Approximate Query Answering. In SIGMOD, pages 275--286, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Acharya, P. B. Gibbons, et al. The Aqua Approximate Query Answering System. In SIGMOD, pages 574--576, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Agarwal, H. Milner, et al. Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems. In SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Agarwal, B. Mozafari, et al. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In EuroSys, pages 29--42, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. L. Antova, T. Jansen, et al. Fast and Simple Relational Processing of Uncertain Data. In ICDE, pages 983--992, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. B. Babcock, S. Chaudhuri, et al. Dynamic Sample Selection for Approximate Query Processing. In SIGMOD, pages 539--550, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. B. Babcock, M. Datar, et al. Load Shedding for Aggregation Queries over Data Streams. In ICDE, page 350, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. P. J. Bickel and D. A. Freedman. Some Asymptotic Theory for the Bootstrap. The Annals of Statistics, 9(6):1196--1217, 1981.Google ScholarGoogle ScholarCross RefCross Ref
  14. G. Box, J. S. Hunter, et al. Statistics for Experimenters: Design, Innovation, Discovery. Wiley-Interscience, 2005.Google ScholarGoogle Scholar
  15. M. Charikar, S. Chaudhuri, et al. Towards Estimation Error Guarantees for Distinct Values. In PODS, pages 268--279, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Chaudhuri, G. Das, et al. Optimized stratified sampling for approximate query processing. TODS, 32(2):9, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Y. Cui, J. Widom, et al. Tracing the lineage of view data in a warehousing environment. TODS, 25(2):179--227, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDBJ, 16:523--544, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, 1993.Google ScholarGoogle ScholarCross RefCross Ref
  20. H. Garcia-Molina, J. D. Ullman, et al. Database systems - the complete book (2. ed.). Pearson Education, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. B. Glavic and G. Alonso. Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting. In ICDE, pages 174--185, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. M. Hellerstein, P. J. Haas, et al. Online Aggregation. In SIGMOD, pages 171--182, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Y. Hu, S. Sundara, et al. Estimating Aggregates in Time-Constrained Approximate Queries in Oracle. In EDBT, pages 1104--1107, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. C. Jermaine, S. Arumugam, et al. Scalable Approximate Query Processing with DBO Engine. In SIGMOD, pages 1--54, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. S. Joshi and C. Jermaine. Sampling-Based Estimators for Subset-Based Queries. VLDB J., 18(1):181--202, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. G. Karvounarakis and T. J. Green. Semiring-Annotated Data: Queries and Provenance? SIGMOD Record, 41(3):5--14, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. A. Kleiner, A. Talwalkar, et al. A General Bootstrap Performance Diagnostic. In KDD, pages 419--427, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. A. Kleiner, A. Talwalkar, et al. The Big Data Bootstrap. In ICML, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. P. G. Kolaitis and M. Y. Vardi. Conjunctive-Query Containment and Constraint Satisfaction. In PODS, pages 205--213, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. N. Laptev, K. Zeng, et al. Early Accurate Results for Advanced Analytics on MapReduce. PVLDB, 5(10):1028--1039, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. B. Mozafari and C. Zaniolo. Optimal Load Shedding with Aggregates and Mining Queries. In ICDE, pages 76--88, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  32. C. Olston, E. Bortnikov, et al. Interactive Analysis of Web-Scale Data. In CIDR, 2009.Google ScholarGoogle Scholar
  33. N. Pansare, V. R. Borkar, et al. Online Aggregation for Large MapReduce Jobs. PVLDB, 4(11):1135--1145, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. A. Pol and C. Jermaine. Relational Confidence Bounds Are Easy With The Bootstrap. In SIGMOD, pages 587--598, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. T. Rabl, M. Poess, et al. Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance. In SPEC, pages 361--372, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. C. Ré and D. Suciu. The Trichotomy of HAVING Queries on a Probabilistic Database. VLDBJ, 18(5):1091--1116, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. P. Sen, A. Deshpande, et al. Read-Once Functions and Query Evaluation in Probabilistic Databases. PVLDB, 3(1):1068--1079, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. M. M. Siddiqui and C. Butler. Asymptotic Joint Distribution of Linear Systematic Statistics from Multivariate Distributions. Journal of the American Statistical Association, 64(325):300--305, 1969.Google ScholarGoogle ScholarCross RefCross Ref
  39. T. T. L. Tran, Y. Diao, et al. Supporting User-Defined Functions on Uncertain Data. PVLDB, 6(6):469--480, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. T. T. L. Tran, L. Peng, et al. CLARO: Modeling and Processing Uncertain Data Streams. VLDBJ, (5):651--676, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, corrected edition, Nov. 2000.Google ScholarGoogle Scholar
  42. D. Z. Wang, E. Michelakis, et al. Bayesstore: Managing large, uncertain data repositories with probabilistic graphical models. PVLDB, 1(1):340--351, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. S. Wilhelm. tmvtnorm: A Package for the Truncated Multivariate Normal Distribution. sigma, 2:2.Google ScholarGoogle Scholar
  44. S. Wu, B. C. Ooi, et al. Continuous Sampling for Online Aggregation over Multiple Queries. In SIGMOD, pages 651--662, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. K. Zeng, S. Gao, et al. The Analytical Bootstrap: A New Method for Fast Error Estimation in Approximate Query Processing. Technical Report CSD #130028, UCLA, 2013.Google ScholarGoogle Scholar

Index Terms

  1. The analytical bootstrap: a new method for fast error estimation in approximate query processing

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
        June 2014
        1645 pages
        ISBN:9781450323765
        DOI:10.1145/2588555

        Copyright © 2014 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 June 2014

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        SIGMOD '14 Paper Acceptance Rate107of421submissions,25%Overall Acceptance Rate785of4,003submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader