skip to main content
10.1145/2588555.2593667acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Knowing when you're wrong: building fast and reliable approximate query processing systems

Authors Info & Claims
Published:18 June 2014Publication History

ABSTRACT

Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions.The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at interactive speeds. However, it has been widely observed that many applications can tolerate some degree of inaccuracy. This is especially true for exploratory queries on data, where users are satisfied with "close-enough" answers if they can come quickly. A popular technique for speeding up queries at the cost of accuracy is to execute each query on a sample of data, rather than the whole dataset. To ensure that the returned result is not too inaccurate, past work on approximate query processing has used statistical techniques to estimate "error bars" on returned results. However, existing work in the sampling-based approximate query processing (S-AQP) community has not validated whether these techniques actually generate accurate error bars for real query workloads. In fact, we find that error bar estimation often fails on real world production workloads. Fortunately, it is possible to quickly and accurately diagnose the failure of error estimation for a query. In this paper, we show that it is possible to implement a query approximation pipeline that produces approximate answers and reliable error bars at interactive speeds.

References

  1. AMPLab Big Data Benchmark. https://amplab.cs.berkeley.edu/benchmark/.Google ScholarGoogle Scholar
  2. BlinkDB alpha 0.1.1. http://blinkdb.org.Google ScholarGoogle Scholar
  3. Common Crawl Document Corpus. http://commoncrawl.org/.Google ScholarGoogle Scholar
  4. Conviva Networks. http://www.conviva.com/.Google ScholarGoogle Scholar
  5. Facebook Presto. http://prestodb.io/.Google ScholarGoogle Scholar
  6. Intel Hadoop Benchmark. https://github.com/intel-hadoop/HiBench.Google ScholarGoogle Scholar
  7. S. Acharya, P. B. Gibbons, and V. Poosala. Aqua: A fast decision support system using approximate query answers. In VLDB, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In EuroSys, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Agarwal, A. Panda, B. Mozafari, A. P. Iyer, S. Madden, and I. Stoica. Blink and It's Done: Interactive Queries on Very Large Data. PVLDB, 5(12), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Agrawal, D. Kifer, and C. Olston. Scheduling shared scans of large data files. PVLDB, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. N. Bruno, S. Agarwal, S. Kandula, B. Shi, M.-C. Wu, and J. Zhou. Recurring job optimization in scope. In SIGMOD, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. A. J. Canty, A. C. Davison, D. V. Hinkley, and V. Ventura. Bootstrap diagnostics and remedies. Canadian Journal of Statistics, 34(1), 2006.Google ScholarGoogle Scholar
  13. S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. TODS, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce online. In NSDI, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Dobra, C. Jermaine, F. Rusu, and F. Xu. Turbo-charging estimate convergence in dbo. PVLDB, 2(1), 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. B. Efron. Bootstrap methods: another look at the jackknife. The annals of Statistics, 1979.Google ScholarGoogle Scholar
  17. B. Efron and R. Tibshirani. An introduction to the bootstrap, volume 57. CRC press, 1993.Google ScholarGoogle Scholar
  18. G. Giannikis, G. Alonso, and D. Kossmann. Shareddb: Killing one thousand queries with one stone. PVLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. P. J. Haas and P. J. Haas. Hoeffding inequalities for join-selectivity estimation and online aggregation. In IBM Research Report RJ 10040, IBM Almaden Research, 1996.Google ScholarGoogle Scholar
  20. P. Hall. On symmetric bootstrap confidence intervals. Journal of the Royal Statistical Society. Series B (Methodological), 50(1), 1988.Google ScholarGoogle ScholarCross RefCross Ref
  21. J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. A. Kleiner, A. Talwalkar, S. Agarwal, I. Stoica, and M. I. Jordan. A general bootstrap performance diagnostic. In KDD, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. N. Laptev, K. Zeng, and C. Zaniolo. Early Accurate Results for Advanced Analytics on MapReduce. PVLDB, 5(10), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. C. McDiarmid. Concentration. In Probabilistic methods for algorithmic discrete mathematics. Springer, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  25. R. B. Miller. Response time in man-computer conversational transactions. In Proceedings of the December 9-11, 1 968, fall joint computer conference, part I. ACM, 1968. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. B. Mozafari and C. Zaniolo. Optimal load shedding with aggregates and mining queries. In ICDE, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  27. F. Olken and D. Rotem. Simple random sampling from relational databases. In VLDB, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. N. C. Oza. Online bagging and boosting. In International Conference on Systems, Man, and Cybernetics, Special Session on Ensemble Methods for Extreme Environments, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  29. N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large mapreduce jobs. PVLDB, 4(11), 2011.Google ScholarGoogle Scholar
  30. A. Pol and C. Jermaine. Relational confidence bounds are easy with the bootstrap. In In SIGMOD, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. C. Qin and F. Rusu. PF-OLA: a high-performance framework for parallel online aggregation. Distributed and Parallel Databases, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. J. Rice. Mathematical Statistics and Data Analysis. Brooks/Cole Pub. Co., 1988.Google ScholarGoogle Scholar
  33. A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. PVLDB, 2(2), 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. A. van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.Google ScholarGoogle Scholar
  35. A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, 1996.Google ScholarGoogle Scholar
  36. J. Wang et al. A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data. In SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. X. Wang, C. Olston, A. D. Sarma, and R. C. Burns. Coscan: cooperative scan sharing in the cloud. In SoCC, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In SIGMOD Conference, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. K. Zeng et al. The Analytical Bootstrap: a New Method for Fast Error Estimation in Approximate Query Processing. In SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Knowing when you're wrong: building fast and reliable approximate query processing systems

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
        June 2014
        1645 pages
        ISBN:9781450323765
        DOI:10.1145/2588555

        Copyright © 2014 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 June 2014

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        SIGMOD '14 Paper Acceptance Rate107of421submissions,25%Overall Acceptance Rate785of4,003submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader