ABSTRACT
Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions.The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at interactive speeds. However, it has been widely observed that many applications can tolerate some degree of inaccuracy. This is especially true for exploratory queries on data, where users are satisfied with "close-enough" answers if they can come quickly. A popular technique for speeding up queries at the cost of accuracy is to execute each query on a sample of data, rather than the whole dataset. To ensure that the returned result is not too inaccurate, past work on approximate query processing has used statistical techniques to estimate "error bars" on returned results. However, existing work in the sampling-based approximate query processing (S-AQP) community has not validated whether these techniques actually generate accurate error bars for real query workloads. In fact, we find that error bar estimation often fails on real world production workloads. Fortunately, it is possible to quickly and accurately diagnose the failure of error estimation for a query. In this paper, we show that it is possible to implement a query approximation pipeline that produces approximate answers and reliable error bars at interactive speeds.
- AMPLab Big Data Benchmark. https://amplab.cs.berkeley.edu/benchmark/.Google Scholar
- BlinkDB alpha 0.1.1. http://blinkdb.org.Google Scholar
- Common Crawl Document Corpus. http://commoncrawl.org/.Google Scholar
- Conviva Networks. http://www.conviva.com/.Google Scholar
- Facebook Presto. http://prestodb.io/.Google Scholar
- Intel Hadoop Benchmark. https://github.com/intel-hadoop/HiBench.Google Scholar
- S. Acharya, P. B. Gibbons, and V. Poosala. Aqua: A fast decision support system using approximate query answers. In VLDB, 1999. Google ScholarDigital Library
- S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In EuroSys, 2013. Google ScholarDigital Library
- S. Agarwal, A. Panda, B. Mozafari, A. P. Iyer, S. Madden, and I. Stoica. Blink and It's Done: Interactive Queries on Very Large Data. PVLDB, 5(12), 2012. Google ScholarDigital Library
- P. Agrawal, D. Kifer, and C. Olston. Scheduling shared scans of large data files. PVLDB, 2008. Google ScholarDigital Library
- N. Bruno, S. Agarwal, S. Kandula, B. Shi, M.-C. Wu, and J. Zhou. Recurring job optimization in scope. In SIGMOD, 2012. Google ScholarDigital Library
- A. J. Canty, A. C. Davison, D. V. Hinkley, and V. Ventura. Bootstrap diagnostics and remedies. Canadian Journal of Statistics, 34(1), 2006.Google Scholar
- S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. TODS, 2007. Google ScholarDigital Library
- T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce online. In NSDI, 2010. Google ScholarDigital Library
- A. Dobra, C. Jermaine, F. Rusu, and F. Xu. Turbo-charging estimate convergence in dbo. PVLDB, 2(1), 2009. Google ScholarDigital Library
- B. Efron. Bootstrap methods: another look at the jackknife. The annals of Statistics, 1979.Google Scholar
- B. Efron and R. Tibshirani. An introduction to the bootstrap, volume 57. CRC press, 1993.Google Scholar
- G. Giannikis, G. Alonso, and D. Kossmann. Shareddb: Killing one thousand queries with one stone. PVLDB, 2012. Google ScholarDigital Library
- P. J. Haas and P. J. Haas. Hoeffding inequalities for join-selectivity estimation and online aggregation. In IBM Research Report RJ 10040, IBM Almaden Research, 1996.Google Scholar
- P. Hall. On symmetric bootstrap confidence intervals. Journal of the Royal Statistical Society. Series B (Methodological), 50(1), 1988.Google ScholarCross Ref
- J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997. Google ScholarDigital Library
- A. Kleiner, A. Talwalkar, S. Agarwal, I. Stoica, and M. I. Jordan. A general bootstrap performance diagnostic. In KDD, 2013. Google ScholarDigital Library
- N. Laptev, K. Zeng, and C. Zaniolo. Early Accurate Results for Advanced Analytics on MapReduce. PVLDB, 5(10), 2012. Google ScholarDigital Library
- C. McDiarmid. Concentration. In Probabilistic methods for algorithmic discrete mathematics. Springer, 1998.Google ScholarCross Ref
- R. B. Miller. Response time in man-computer conversational transactions. In Proceedings of the December 9-11, 1 968, fall joint computer conference, part I. ACM, 1968. Google ScholarDigital Library
- B. Mozafari and C. Zaniolo. Optimal load shedding with aggregates and mining queries. In ICDE, 2010.Google ScholarCross Ref
- F. Olken and D. Rotem. Simple random sampling from relational databases. In VLDB, 1986. Google ScholarDigital Library
- N. C. Oza. Online bagging and boosting. In International Conference on Systems, Man, and Cybernetics, Special Session on Ensemble Methods for Extreme Environments, 2005.Google ScholarCross Ref
- N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large mapreduce jobs. PVLDB, 4(11), 2011.Google Scholar
- A. Pol and C. Jermaine. Relational confidence bounds are easy with the bootstrap. In In SIGMOD, 2005. Google ScholarDigital Library
- C. Qin and F. Rusu. PF-OLA: a high-performance framework for parallel online aggregation. Distributed and Parallel Databases, 2013. Google ScholarDigital Library
- J. Rice. Mathematical Statistics and Data Analysis. Brooks/Cole Pub. Co., 1988.Google Scholar
- A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. PVLDB, 2(2), 2009. Google ScholarDigital Library
- A. van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.Google Scholar
- A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, 1996.Google Scholar
- J. Wang et al. A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data. In SIGMOD, 2014. Google ScholarDigital Library
- X. Wang, C. Olston, A. D. Sarma, and R. C. Burns. Coscan: cooperative scan sharing in the cloud. In SoCC, 2011. Google ScholarDigital Library
- R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In SIGMOD Conference, 2013. Google ScholarDigital Library
- M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI, 2012. Google ScholarDigital Library
- K. Zeng et al. The Analytical Bootstrap: a New Method for Fast Error Estimation in Approximate Query Processing. In SIGMOD, 2014. Google ScholarDigital Library
Index Terms
- Knowing when you're wrong: building fast and reliable approximate query processing systems
Recommendations
Demonstration of VerdictDB, the Platform-Independent AQP System
SIGMOD '18: Proceedings of the 2018 International Conference on Management of DataWe demonstrate VerdictDB, the first platform-independent approximate query processing (AQP) system. Unlike existing AQP systems that are tightly-integrated into a specific database, VerdictDB operates at the driver-level, acting as a middleware between ...
The analytical bootstrap: a new method for fast error estimation in approximate query processing
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of DataSampling is one of the most commonly used techniques in Approximate Query Processing (AQP)-an area of research that is now made more critical by the need for timely and cost-effective analytics over "Big Data". Assessing the quality (i.e., estimating ...
Approximate Query Processing with Error Guarantees
Big-Data-Analytics in Astronomy, Science, and EngineeringAbstractIn recent years, with the increase of data and the sophistication of analysis requirements, query processing in databases has become more important. Recently, approximate query processing (AQP) was proposed for efficiently executing database ...
Comments