ABSTRACT
Approximate Query Processing (AQP) based on sampling is critical for supporting timely and cost-effective analytics over big data. To be applied successfully, AQP must be accompanied by reliable estimates on the quality of sample-produced approximate answers; the two main techniques used in the past for this purpose are (i) closed-form analytic error estimation, and (ii) the bootstrap method. Approach (i) is extremely efficient but lacks generality, whereas (ii) is general but suffers from high computational overhead. Our recently introduced Analytical Bootstrap method combines the strengths of both approaches and provides the basis for our ABS system, which will be demonstrated at the conference. The ABS system models bootstrap by a probabilistic relational model, and extends relational algebra with operations on probabilistic relations to predict the distributions of the AQP results. Thus, ABS entails a very fast computation of bootstrap-based quality measures for a general class of SQL queries, which is several orders of magnitude faster than the standard simulation-based bootstrap. In this demo, we will demonstrate the generality, automaticity, and ease of use of the ABS system, and its superior performance over the traditional approaches described above.
- Apache Hive Project. https://hive.apache.org/.Google Scholar
- Shark Project. http://shark.cs.berkeley.edu/.Google Scholar
- TPC-H Benchmark. http://www.tpc.org/tpch/.Google Scholar
- S. Acharya, P. B. Gibbons, et al. The Aqua Approximate Query Answering System. In SIGMOD, pages 574--576, 1999. Google ScholarDigital Library
- S. Agarwal, B. Mozafari, et al. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In EuroSys, pages 29--42, 2013. Google ScholarDigital Library
- B. Babcock, S. Chaudhuri, et al. Dynamic Sample Selection for Approximate Query Processing. In SIGMOD, pages 539--550, 2003. Google ScholarDigital Library
- B. Babcock, M. Datar, et al. Load Shedding for Aggregation Queries over Data Streams. In ICDE, page 350, 2004. Google ScholarDigital Library
- M. Charikar, S. Chaudhuri, et al. Towards Estimation Error Guarantees for Distinct Values. In PODS, pages 268--279, 2000. Google ScholarDigital Library
- S. Chaudhuri, G. Das, et al. Optimized stratified sampling for approximate query processing. TODS, 32(2):9, 2007. Google ScholarDigital Library
- B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, 1993.Google ScholarCross Ref
- J. M. Hellerstein, P. J. Haas, et al. Online Aggregation. In SIGMOD, pages 171--182, 1997. Google ScholarDigital Library
- Y. Hu, S. Sundara, et al. Estimating Aggregates in Time-Constrained Approximate Queries in Oracle. In EDBT, pages 1104--1107, 2009. Google ScholarDigital Library
- C. Jermaine, S. Arumugam, et al. Scalable Approximate Query Processing with DBO Engine. In SIGMOD, pages 1--54, 2007. Google ScholarDigital Library
- G. Karvounarakis and T. J. Green. Semiring-Annotated Data: Queries and Provenance? SIGMOD Record, 41(3):5--14, 2012. Google ScholarDigital Library
- A. Kleiner, A. Talwalkar, et al. A General Bootstrap Performance Diagnostic. In KDD, pages 419--427, 2013. Google ScholarDigital Library
- N. Laptev, K. Zeng, et al. Early Accurate Results for Advanced Analytics on MapReduce. PVLDB, 5(10):1028--1039, 2012. Google ScholarDigital Library
- B. Mozafari and C. Zaniolo. Optimal Load Shedding with Aggregates and Mining Queries. In ICDE, pages 76--88, 2010.Google ScholarCross Ref
- C. Olston, E. Bortnikov, et al. Interactive Analysis of Web-Scale Data. In CIDR, 2009.Google Scholar
- N. Pansare, V. R. Borkar, et al. Online Aggregation for Large MapReduce Jobs. PVLDB, 4(11):1135--1145, 2011.Google ScholarDigital Library
- A. Pol and C. Jermaine. Relational Confidence Bounds Are Easy With The Bootstrap. In SIGMOD, pages 587--598, 2005. Google ScholarDigital Library
- A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, corrected edition, Nov. 2000.Google Scholar
- S. Wu, B. C. Ooi, et al. Continuous Sampling for Online Aggregation over Multiple Queries. In SIGMOD, pages 651--662, 2010. Google ScholarDigital Library
- K. Zeng, S. Gao, et al. The Analytical Bootstrap: a New Method for Fast Error Estimation in Approximate Query Processing. In SIGMOD, 2014. Google ScholarDigital Library
Index Terms
- ABS: a system for scalable approximate queries with accuracy guarantees
Recommendations
The analytical bootstrap: a new method for fast error estimation in approximate query processing
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of DataSampling is one of the most commonly used techniques in Approximate Query Processing (AQP)-an area of research that is now made more critical by the need for timely and cost-effective analytics over "Big Data". Assessing the quality (i.e., estimating ...
Knowing when you're wrong: building fast and reliable approximate query processing systems
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of DataModern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions.The quantity of data and limitations of disk and memory bandwidth often make it ...
iOLAP: Managing Uncertainty for Efficient Incremental OLAP
SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataThe size of data and the complexity of analytics continue to grow along with the need for timely and cost-effective analysis. However, the growth of computation power cannot keep up with the growth of data. This calls for a paradigm shift from ...
Comments