research-article

Knowing when you're wrong: building fast and reliable approximate query processing systems

Authors:
Sameer Agarwal

University of California, Berkeley, Berkeley, CA, USA

University of California, Berkeley, Berkeley, CA, USA
View Profile

,
Henry Milner

University of California, Berkeley, Berkeley, CA, USA

University of California, Berkeley, Berkeley, CA, USA
View Profile

,
Ariel Kleiner

University of California, Berkeley, Berkeley, CA, USA

University of California, Berkeley, Berkeley, CA, USA
View Profile

,
Ameet Talwalkar

University of California, Berkeley, Berkeley, CA, USA

University of California, Berkeley, Berkeley, CA, USA
View Profile

,
Michael Jordan

University of California, Berkeley, Berkeley, CA, USA

University of California, Berkeley, Berkeley, CA, USA
View Profile

,
Samuel Madden

Massachusetts Institute of Technology, Boston, MA, USA

Massachusetts Institute of Technology, Boston, MA, USA
View Profile

,
Barzan Mozafari

University of Michigan, Ann Arbor, Ann Arbor, MI, USA

University of Michigan, Ann Arbor, Ann Arbor, MI, USA
View Profile

,
Ion Stoica

University of California, Berkeley, Berkeley, CA, USA

University of California, Berkeley, Berkeley, CA, USA
View Profile

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of DataJune 2014Pages 481–492https://doi.org/10.1145/2588555.2593667

Published:18 June 2014Publication History

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 481–492

ABSTRACT

Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions.The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at interactive speeds. However, it has been widely observed that many applications can tolerate some degree of inaccuracy. This is especially true for exploratory queries on data, where users are satisfied with "close-enough" answers if they can come quickly. A popular technique for speeding up queries at the cost of accuracy is to execute each query on a sample of data, rather than the whole dataset. To ensure that the returned result is not too inaccurate, past work on approximate query processing has used statistical techniques to estimate "error bars" on returned results. However, existing work in the sampling-based approximate query processing (S-AQP) community has not validated whether these techniques actually generate accurate error bars for real query workloads. In fact, we find that error bar estimation often fails on real world production workloads. Fortunately, it is possible to quickly and accurately diagnose the failure of error estimation for a query. In this paper, we show that it is possible to implement a query approximation pipeline that produces approximate answers and reliable error bars at interactive speeds.

References

AMPLab Big Data Benchmark. https://amplab.cs.berkeley.edu/benchmark/.Google Scholar
BlinkDB alpha 0.1.1. http://blinkdb.org.Google Scholar
Common Crawl Document Corpus. http://commoncrawl.org/.Google Scholar
Conviva Networks. http://www.conviva.com/.Google Scholar
Facebook Presto. http://prestodb.io/.Google Scholar
Intel Hadoop Benchmark. https://github.com/intel-hadoop/HiBench.Google Scholar
S. Acharya, P. B. Gibbons, and V. Poosala. Aqua: A fast decision support system using approximate query answers. In VLDB, 1999. Google ScholarDigital Library
S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In EuroSys, 2013. Google ScholarDigital Library
S. Agarwal, A. Panda, B. Mozafari, A. P. Iyer, S. Madden, and I. Stoica. Blink and It's Done: Interactive Queries on Very Large Data. PVLDB, 5(12), 2012. Google ScholarDigital Library
P. Agrawal, D. Kifer, and C. Olston. Scheduling shared scans of large data files. PVLDB, 2008. Google ScholarDigital Library
N. Bruno, S. Agarwal, S. Kandula, B. Shi, M.-C. Wu, and J. Zhou. Recurring job optimization in scope. In SIGMOD, 2012. Google ScholarDigital Library
A. J. Canty, A. C. Davison, D. V. Hinkley, and V. Ventura. Bootstrap diagnostics and remedies. Canadian Journal of Statistics, 34(1), 2006.Google Scholar
S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. TODS, 2007. Google ScholarDigital Library
T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce online. In NSDI, 2010. Google ScholarDigital Library
A. Dobra, C. Jermaine, F. Rusu, and F. Xu. Turbo-charging estimate convergence in dbo. PVLDB, 2(1), 2009. Google ScholarDigital Library
B. Efron. Bootstrap methods: another look at the jackknife. The annals of Statistics, 1979.Google Scholar
B. Efron and R. Tibshirani. An introduction to the bootstrap, volume 57. CRC press, 1993.Google Scholar
G. Giannikis, G. Alonso, and D. Kossmann. Shareddb: Killing one thousand queries with one stone. PVLDB, 2012. Google ScholarDigital Library
P. J. Haas and P. J. Haas. Hoeffding inequalities for join-selectivity estimation and online aggregation. In IBM Research Report RJ 10040, IBM Almaden Research, 1996.Google Scholar
P. Hall. On symmetric bootstrap confidence intervals. Journal of the Royal Statistical Society. Series B (Methodological), 50(1), 1988.Google ScholarCross Ref
J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997. Google ScholarDigital Library
A. Kleiner, A. Talwalkar, S. Agarwal, I. Stoica, and M. I. Jordan. A general bootstrap performance diagnostic. In KDD, 2013. Google ScholarDigital Library
N. Laptev, K. Zeng, and C. Zaniolo. Early Accurate Results for Advanced Analytics on MapReduce. PVLDB, 5(10), 2012. Google ScholarDigital Library
C. McDiarmid. Concentration. In Probabilistic methods for algorithmic discrete mathematics. Springer, 1998.Google ScholarCross Ref
R. B. Miller. Response time in man-computer conversational transactions. In Proceedings of the December 9-11, 1 968, fall joint computer conference, part I. ACM, 1968. Google ScholarDigital Library
B. Mozafari and C. Zaniolo. Optimal load shedding with aggregates and mining queries. In ICDE, 2010.Google ScholarCross Ref
F. Olken and D. Rotem. Simple random sampling from relational databases. In VLDB, 1986. Google ScholarDigital Library
N. C. Oza. Online bagging and boosting. In International Conference on Systems, Man, and Cybernetics, Special Session on Ensemble Methods for Extreme Environments, 2005.Google ScholarCross Ref
N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large mapreduce jobs. PVLDB, 4(11), 2011.Google Scholar
A. Pol and C. Jermaine. Relational confidence bounds are easy with the bootstrap. In In SIGMOD, 2005. Google ScholarDigital Library
C. Qin and F. Rusu. PF-OLA: a high-performance framework for parallel online aggregation. Distributed and Parallel Databases, 2013. Google ScholarDigital Library
J. Rice. Mathematical Statistics and Data Analysis. Brooks/Cole Pub. Co., 1988.Google Scholar
A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. PVLDB, 2(2), 2009. Google ScholarDigital Library
A. van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.Google Scholar
A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, 1996.Google Scholar
J. Wang et al. A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data. In SIGMOD, 2014. Google ScholarDigital Library
X. Wang, C. Olston, A. D. Sarma, and R. C. Burns. Coscan: cooperative scan sharing in the cloud. In SoCC, 2011. Google ScholarDigital Library
R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In SIGMOD Conference, 2013. Google ScholarDigital Library
M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI, 2012. Google ScholarDigital Library
K. Zeng et al. The Analytical Bootstrap: a New Method for Fast Error Estimation in Approximate Query Processing. In SIGMOD, 2014. Google ScholarDigital Library

Index Terms

Knowing when you're wrong: building fast and reliable approximate query processing systems
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

Demonstration of VerdictDB, the Platform-Independent AQP System
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

We demonstrate VerdictDB, the first platform-independent approximate query processing (AQP) system. Unlike existing AQP systems that are tightly-integrated into a specific database, VerdictDB operates at the driver-level, acting as a middleware between ...
Read More
The analytical bootstrap: a new method for fast error estimation in approximate query processing
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Sampling is one of the most commonly used techniques in Approximate Query Processing (AQP)-an area of research that is now made more critical by the need for timely and cost-effective analytics over "Big Data". Assessing the quality (i.e., estimating ...
Read More
Approximate Query Processing with Error Guarantees
Big-Data-Analytics in Astronomy, Science, and Engineering
Abstract
In recent years, with the increase of data and the sophistication of analysis requirements, query processing in databases has become more important. Recently, approximate query processing (AQP) was proposed for efficiently executing database ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN:9781450323765
DOI:10.1145/2588555
General Chairs:
Curtis Dyreson
Utah State University, USA
,
Feifei Li
University of Utah, USA
,
Program Chair:
M. Tamer Özsu
University of Waterloo, Canada
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 June 2014
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
approximate query processing
diagnostics
error estimation
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '14 Paper Acceptance Rate107of421submissions,25%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 135
  Total Citations
  View Citations
- 1,040
  Total Downloads
- Downloads (Last 12 months)52
- Downloads (Last 6 weeks)13
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Knowing when you're wrong: building fast and reliable approximate query processing systems

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Demonstration of VerdictDB, the Platform-Independent AQP System

The analytical bootstrap: a new method for fast error estimation in approximate query processing

Approximate Query Processing with Error Guarantees