DOI: 10.1145/2465351.2465355

BlinkDB: queries with bounded errors and bounded response times on very large data

Published: 15 April 2013

ABSTRACT

In this paper, we present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. We evaluate BlinkDB against the well-known TPC-H benchmarks and a real-world analytic workload derived from Conviva Inc., a company that manages video distribution over the Internet. Our experiments on a 100-node cluster show that BlinkDB can answer queries on up to 17 TB of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.
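To make idea (1) concrete, here is a minimal Python sketch of stratified sampling in the spirit of BlinkDB, not its actual implementation: rather than sampling a table uniformly, which can miss rare groups entirely, it caps the number of rows drawn per group so that every group is represented in the sample. The schema and function names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, cap):
    """Draw up to `cap` rows from each group defined by column `key`.

    A uniform sample of a skewed table may contain no rows at all for a
    rare group; capping per-group counts keeps every group represented.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for group_rows in groups.values():
        sample.extend(random.sample(group_rows, min(cap, len(group_rows))))
    return sample

# Toy table, heavily skewed toward one city (hypothetical schema).
rows = [{"city": "SF", "latency": random.gauss(80, 10)} for _ in range(10_000)]
rows += [{"city": "NYC", "latency": random.gauss(95, 12)} for _ in range(50)]

sample = stratified_sample(rows, key="city", cap=100)
# Both cities survive, even though NYC is only 0.5% of the full table,
# so a GROUP BY city query on the sample can still estimate both groups.
print({c: sum(1 for r in sample if r["city"] == c) for c in ("SF", "NYC")})
```

In BlinkDB proper, the adaptive optimization framework decides which column sets to stratify on and how much storage to devote to each sample; the sketch above only shows why stratification matters for group-by workloads.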

References

  1. Apache Hadoop Distributed File System. http://hadoop.apache.org/hdfs/.
  2. Apache Hadoop MapReduce Project. http://hadoop.apache.org/mapreduce/.
  3. TPC-H Query Processing Benchmarks. http://www.tpc.org/tpch/.
  4. S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In ACM SIGMOD, May 2000.
  5. S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In ACM SIGMOD, June 1999.
  6. S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system. ACM SIGMOD Record, 28(2), 1999.
  7. S. Agarwal, S. Kandula, N. Bruno, M.-C. Wu, I. Stoica, and J. Zhou. Re-optimizing data parallel computing. In NSDI, 2012.
  8. G. Ananthanarayanan, S. Kandula, A. G. Greenberg, et al. Reining in the outliers in map-reduce clusters using Mantri. In OSDI, pages 265--278, 2010.
  9. B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In VLDB, 2003.
  10. S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. TODS, 2007.
  11. T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce online. In NSDI, 2010.
  12. G. Cormode. Sketch techniques for massive data. In Synopses for Massive Data: Samples, Histograms, Wavelets and Sketches. 2011.
  13. C. Engle, A. Lupher, R. Xin, M. Zaharia, et al. Shark: Fast data analysis using coarse-grained distributed memory. In SIGMOD, 2012.
  14. M. Garofalakis and P. Gibbons. Approximate query processing: Taming the terabytes. In VLDB, 2001. Tutorial.
  15. J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997.
  16. S. Lohr. Sampling: Design and Analysis. Thomson, 2009.
  17. S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. Commun. ACM, 54:114--123, June 2011.
  18. C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, and B. Reed. Interactive analysis of web-scale data. In CIDR, 2009.
  19. N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large MapReduce jobs. PVLDB, 4(11):1135--1145, 2011.
  20. C. Sapia. PROMISE: Predicting query behavior to enable predictive caching strategies for OLAP systems. In DaWaK, pages 224--233. Springer-Verlag, 2000.
  21. L. Sidirourgos, M. L. Kersten, and P. A. Boncz. SciBORQ: Scientific data management with Bounds On Runtime and Quality. In CIDR, 2011.
  22. A. Thusoo, J. S. Sarma, N. Jain, et al. Hive: A warehousing solution over a map-reduce framework. PVLDB, 2(2), 2009.
  23. S. Tirthapura and D. Woodruff. Optimal random sampling from distributed streams revisited. Distributed Computing, pages 283--297, 2011.
  24. J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In SIGMOD, 1999.
  25. M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.
  26. M. Zaharia, A. Konwinski, A. D. Joseph, et al. Improving MapReduce performance in heterogeneous environments. In OSDI, 2008.


Reviews

Mohamed Eltabakh

Aggregation queries over large-scale datasets are common in emerging data analytics applications such as log processing, clickstream analysis, and social network updates. These queries may process terabytes (TB) of data and can take a long time to complete, so a key challenge is how to execute them more efficiently. One approach is approximate processing, in which the system produces results faster but with lower accuracy. This paper proposes BlinkDB, a scalable query engine designed to support approximate processing of aggregation queries over very large datasets; the system is shown to query terabytes of data within a few seconds.

BlinkDB is built on top of the distributed Hadoop framework. The authors classify possible workloads into four categories, depending on whether future queries (or their components) can be predicted, and target one relatively flexible category called predictable query column sets. The core of BlinkDB is designed to produce faster results with an estimated error bound, given a response-time budget associated with the input query. The system leverages sampling techniques to create representative samples and operates on them, rather than on the raw data, to improve response time.

BlinkDB has two main components: the creation and maintenance of samples, and the runtime engine with sample selection. The first component addresses the types and sizes of samples to create; the authors present algorithms for creating stratified samples over single and multiple queries, and describe several optimizations for selecting a subset of columns on which to build the summaries. The second component selects the most appropriate sample and sample size to answer a given query; the runtime engine also estimates the error bound of the reported answer.

The system has been evaluated experimentally, and the results demonstrate its scalability and practicality in handling aggregation queries over large datasets. End users who depend on databases for managing their data will find this paper worth reading, as will researchers and scientists working in data management and the data mining community in general.

(Online Computing Reviews Service)
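As a companion to the review's description of the runtime engine, the following is a minimal sketch of the two statistical steps it mentions: choosing a sample size that should meet an error target, and annotating the approximate answer with an error bar. It uses textbook closed-form error estimation for a mean (standard error s/sqrt(n), hence a 95% error bar of roughly 1.96 * s/sqrt(n)), not BlinkDB's actual code; all function names and data below are hypothetical.

```python
import math
import random
import statistics

Z95 = 1.96  # normal critical value for a 95% confidence interval

def required_sample_size(pilot, target_error):
    """Smallest n whose predicted 95% error bar fits the target,
    inverting error ~= Z95 * s / sqrt(n); s comes from a small pilot."""
    s = statistics.stdev(pilot)
    return math.ceil((Z95 * s / target_error) ** 2)

def avg_with_error(sample):
    """Approximate AVG plus a closed-form 95% error bar."""
    mean = statistics.fmean(sample)
    half_width = Z95 * statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, half_width

# A million-row "table" of one numeric column (synthetic data).
population = [random.gauss(100, 20) for _ in range(1_000_000)]

pilot = random.sample(population, 200)
n = required_sample_size(pilot, target_error=0.5)
mean, err = avg_with_error(random.sample(population, n))
print(f"AVG ~= {mean:.2f} +/- {err:.2f} at 95% confidence (n = {n:,})")
```

BlinkDB itself supports both directions of this trade-off: a query can specify an error bound (pick the smallest sample predicted to meet it, as sketched above) or a response-time bound (pick the largest sample the deadline allows).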

Published in

            EuroSys '13: Proceedings of the 8th ACM European Conference on Computer Systems
            April 2013
            401 pages
ISBN: 9781450319942
DOI: 10.1145/2465351

            Copyright © 2013 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 15 April 2013


            Qualifiers

            • research-article

            Acceptance Rates

EuroSys '13 paper acceptance rate: 28 of 143 submissions, 20%. Overall acceptance rate: 241 of 1,308 submissions, 18%.
