research-article

The analytical bootstrap: a new method for fast error estimation in approximate query processing

Authors:
Kai Zeng

University of California, Los Angeles, Los Angeles, CA, USA

University of California, Los Angeles, Los Angeles, CA, USA
View Profile

,
Shi Gao

University of California, Los Angeles, Los Angeles, CA, USA

University of California, Los Angeles, Los Angeles, CA, USA
View Profile

,
Barzan Mozafari

University of Michigan, Ann Arbor, MI, USA

University of Michigan, Ann Arbor, MI, USA
View Profile

,
Carlo Zaniolo

University of California, Los Angeles, Los Angeles, CA, USA

University of California, Los Angeles, Los Angeles, CA, USA
View Profile

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of DataJune 2014Pages 277–288https://doi.org/10.1145/2588555.2588579

Published:18 June 2014Publication History

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 277–288

ABSTRACT

Sampling is one of the most commonly used techniques in Approximate Query Processing (AQP)-an area of research that is now made more critical by the need for timely and cost-effective analytics over "Big Data". Assessing the quality (i.e., estimating the error) of approximate answers is essential for meaningful AQP, and the two main approaches used in the past to address this problem are based on either (i) analytic error quantification or (ii) the bootstrap method. The first approach is extremely efficient but lacks generality, whereas the second is quite general but suffers from its high computational overhead. In this paper, we introduce a probabilistic relational model for the bootstrap process, along with rigorous semantics and a unified error model, which bridges the gap between these two traditional approaches. Based on our probabilistic framework, we develop efficient algorithms to predict the distribution of the approximation results. These enable the computation of any bootstrap-based quality measure for a large class of SQL queries via a single-round evaluation of a slightly modified query. Extensive experiments on both synthetic and real-world datasets show that our method has superior prediction accuracy for bootstrap-based quality measures, and is several orders of magnitude faster than bootstrap.

References

Conviva Inc. http://www.conviva.com/.Google Scholar
MonetDB. http://www.monetdb.org/Home.Google Scholar
The R Project. http://www.r-project.org/.Google Scholar
TPC-H Benchmark. http://www.tpc.org/tpch/.Google Scholar
Vertica Inc. http://www.vertica.com/.Google Scholar
S. Acharya, P. B. Gibbons, et al. Join Synopses for Approximate Query Answering. In SIGMOD, pages 275--286, 1999. Google ScholarDigital Library
S. Acharya, P. B. Gibbons, et al. The Aqua Approximate Query Answering System. In SIGMOD, pages 574--576, 1999. Google ScholarDigital Library
S. Agarwal, H. Milner, et al. Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems. In SIGMOD, 2014. Google ScholarDigital Library
S. Agarwal, B. Mozafari, et al. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In EuroSys, pages 29--42, 2013. Google ScholarDigital Library
L. Antova, T. Jansen, et al. Fast and Simple Relational Processing of Uncertain Data. In ICDE, pages 983--992, 2008. Google ScholarDigital Library
B. Babcock, S. Chaudhuri, et al. Dynamic Sample Selection for Approximate Query Processing. In SIGMOD, pages 539--550, 2003. Google ScholarDigital Library
B. Babcock, M. Datar, et al. Load Shedding for Aggregation Queries over Data Streams. In ICDE, page 350, 2004. Google ScholarDigital Library
P. J. Bickel and D. A. Freedman. Some Asymptotic Theory for the Bootstrap. The Annals of Statistics, 9(6):1196--1217, 1981.Google ScholarCross Ref
G. Box, J. S. Hunter, et al. Statistics for Experimenters: Design, Innovation, Discovery. Wiley-Interscience, 2005.Google Scholar
M. Charikar, S. Chaudhuri, et al. Towards Estimation Error Guarantees for Distinct Values. In PODS, pages 268--279, 2000. Google ScholarDigital Library
S. Chaudhuri, G. Das, et al. Optimized stratified sampling for approximate query processing. TODS, 32(2):9, 2007. Google ScholarDigital Library
Y. Cui, J. Widom, et al. Tracing the lineage of view data in a warehousing environment. TODS, 25(2):179--227, 2000. Google ScholarDigital Library
N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDBJ, 16:523--544, 2007. Google ScholarDigital Library
B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, 1993.Google ScholarCross Ref
H. Garcia-Molina, J. D. Ullman, et al. Database systems - the complete book (2. ed.). Pearson Education, 2009. Google ScholarDigital Library
B. Glavic and G. Alonso. Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting. In ICDE, pages 174--185, 2009. Google ScholarDigital Library
J. M. Hellerstein, P. J. Haas, et al. Online Aggregation. In SIGMOD, pages 171--182, 1997. Google ScholarDigital Library
Y. Hu, S. Sundara, et al. Estimating Aggregates in Time-Constrained Approximate Queries in Oracle. In EDBT, pages 1104--1107, 2009. Google ScholarDigital Library
C. Jermaine, S. Arumugam, et al. Scalable Approximate Query Processing with DBO Engine. In SIGMOD, pages 1--54, 2007. Google ScholarDigital Library
S. Joshi and C. Jermaine. Sampling-Based Estimators for Subset-Based Queries. VLDB J., 18(1):181--202, 2009. Google ScholarDigital Library
G. Karvounarakis and T. J. Green. Semiring-Annotated Data: Queries and Provenance? SIGMOD Record, 41(3):5--14, 2012. Google ScholarDigital Library
A. Kleiner, A. Talwalkar, et al. A General Bootstrap Performance Diagnostic. In KDD, pages 419--427, 2013. Google ScholarDigital Library
A. Kleiner, A. Talwalkar, et al. The Big Data Bootstrap. In ICML, 2012.Google ScholarDigital Library
P. G. Kolaitis and M. Y. Vardi. Conjunctive-Query Containment and Constraint Satisfaction. In PODS, pages 205--213, 1998. Google ScholarDigital Library
N. Laptev, K. Zeng, et al. Early Accurate Results for Advanced Analytics on MapReduce. PVLDB, 5(10):1028--1039, 2012. Google ScholarDigital Library
B. Mozafari and C. Zaniolo. Optimal Load Shedding with Aggregates and Mining Queries. In ICDE, pages 76--88, 2010.Google ScholarCross Ref
C. Olston, E. Bortnikov, et al. Interactive Analysis of Web-Scale Data. In CIDR, 2009.Google Scholar
N. Pansare, V. R. Borkar, et al. Online Aggregation for Large MapReduce Jobs. PVLDB, 4(11):1135--1145, 2011.Google ScholarDigital Library
A. Pol and C. Jermaine. Relational Confidence Bounds Are Easy With The Bootstrap. In SIGMOD, pages 587--598, 2005. Google ScholarDigital Library
T. Rabl, M. Poess, et al. Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance. In SPEC, pages 361--372, 2013. Google ScholarDigital Library
C. Ré and D. Suciu. The Trichotomy of HAVING Queries on a Probabilistic Database. VLDBJ, 18(5):1091--1116, 2009. Google ScholarDigital Library
P. Sen, A. Deshpande, et al. Read-Once Functions and Query Evaluation in Probabilistic Databases. PVLDB, 3(1):1068--1079, 2010. Google ScholarDigital Library
M. M. Siddiqui and C. Butler. Asymptotic Joint Distribution of Linear Systematic Statistics from Multivariate Distributions. Journal of the American Statistical Association, 64(325):300--305, 1969.Google ScholarCross Ref
T. T. L. Tran, Y. Diao, et al. Supporting User-Defined Functions on Uncertain Data. PVLDB, 6(6):469--480, 2013. Google ScholarDigital Library
T. T. L. Tran, L. Peng, et al. CLARO: Modeling and Processing Uncertain Data Streams. VLDBJ, (5):651--676, 2012. Google ScholarDigital Library
A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, corrected edition, Nov. 2000.Google Scholar
D. Z. Wang, E. Michelakis, et al. Bayesstore: Managing large, uncertain data repositories with probabilistic graphical models. PVLDB, 1(1):340--351, 2008. Google ScholarDigital Library
S. Wilhelm. tmvtnorm: A Package for the Truncated Multivariate Normal Distribution. sigma, 2:2.Google Scholar
S. Wu, B. C. Ooi, et al. Continuous Sampling for Online Aggregation over Multiple Queries. In SIGMOD, pages 651--662, 2010. Google ScholarDigital Library
K. Zeng, S. Gao, et al. The Analytical Bootstrap: A New Method for Fast Error Estimation in Approximate Query Processing. Technical Report CSD #130028, UCLA, 2013.Google Scholar

Index Terms

The analytical bootstrap: a new method for fast error estimation in approximate query processing
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

Knowing when you're wrong: building fast and reliable approximate query processing systems
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions.The quantity of data and limitations of disk and memory bandwidth often make it ...
Read More
ABS: a system for scalable approximate queries with accuracy guarantees
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Approximate Query Processing (AQP) based on sampling is critical for supporting timely and cost-effective analytics over big data. To be applied successfully, AQP must be accompanied by reliable estimates on the quality of sample-produced approximate ...
Read More
iOLAP: Managing Uncertainty for Efficient Incremental OLAP
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

The size of data and the complexity of analytics continue to grow along with the need for timely and cost-effective analysis. However, the growth of computation power cannot keep up with the growth of data. This calls for a paradigm shift from ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN:9781450323765
DOI:10.1145/2588555
General Chairs:
Curtis Dyreson
Utah State University, USA
,
Feifei Li
University of Utah, USA
,
Program Chair:
M. Tamer Özsu
University of Waterloo, Canada
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 June 2014
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
approximate query processing
bootstrap
error estimation
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '14 Paper Acceptance Rate107of421submissions,25%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 87
  Total Citations
  View Citations
- 714
  Total Downloads
- Downloads (Last 12 months)29
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

The analytical bootstrap: a new method for fast error estimation in approximate query processing

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Knowing when you're wrong: building fast and reliable approximate query processing systems

ABS: a system for scalable approximate queries with accuracy guarantees

iOLAP: Managing Uncertainty for Efficient Incremental OLAP