research-article

BlinkDB: queries with bounded errors and bounded response times on very large data

Authors:
Sameer Agarwal (University of California, Berkeley)
Barzan Mozafari (Massachusetts Institute of Technology)
Aurojit Panda (University of California, Berkeley)
Henry Milner (University of California, Berkeley)
Samuel Madden (Massachusetts Institute of Technology)
Ion Stoica (Conviva Inc. and University of California, Berkeley)

EuroSys '13: Proceedings of the 8th ACM European Conference on Computer Systems, April 2013, Pages 29–42
https://doi.org/10.1145/2465351.2465355

Published: 15 April 2013

ABSTRACT

In this paper, we present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. We evaluate BlinkDB against the well-known TPC-H benchmarks and a real-world analytic workload derived from Conviva Inc., a company that manages video distribution over the Internet. Our experiments on a 100-node cluster show that BlinkDB can answer queries on up to 17 TB of data in less than 2 seconds (over 200× faster than Hive), within an error of 2–10%.
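
As an illustration of what a result "annotated with meaningful error bars" can look like, here is a minimal Python sketch. It is not BlinkDB's implementation: it assumes a uniform random sample (the paper uses stratified samples) and a normal-approximation confidence interval, and the function name approximate_mean and its parameters are hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, half_width). With z=1.96 the half width is an
    approximate 95% error bar under a normal approximation (an assumption
    of this sketch, not something taken from the paper).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    # Standard error of the sample mean.
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    # Finite population correction: the error shrinks to zero as the
    # sample approaches the full data set.
    fpc = math.sqrt((len(values) - sample_size) / (len(values) - 1))
    return estimate, z * std_err * fpc
```

Calling approximate_mean(values, 500) on a list of 10,000 numbers reads only 5% of the data and returns an estimate together with its error bar; passing sample_size=len(values) degenerates to the exact mean with a zero-width error bar.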

Index Terms

BlinkDB: queries with bounded errors and bounded response times on very large data
1. Information systems
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Reviews

Reviewer: Mohamed Eltabakh

Aggregation queries over large-scale datasets are very common in most emerging data analytics applications, such as log processing, clickstream analysis, and social network updates. These aggregation queries may process terabytes (TBs) of data, which can take a long time to complete, so a key challenge is how to support more efficient execution of these queries. One approach involves approximate processing, where the system generates results faster, but with lower accuracy.

This paper proposes a scalable query engine called BlinkDB, which is designed to support approximate processing of aggregation queries over very large datasets. The system is shown to query TBs of data within a few seconds. BlinkDB is built on top of the distributed Hadoop framework. The authors classified possible workloads into four categories depending on whether the future queries (or their components) can be predicted. From these, they selected one relatively flexible category called predictable query column sets.

The core of BlinkDB is designed to produce faster results with an estimated error bound, given a response-time budget associated with the input query. The system leverages sampling techniques to create representative samples and operates on them instead of the raw data to improve response time.

BlinkDB has two main components: the creation and maintenance of samples, and the runtime engine and sample selection. The first component addresses the types and sizes of samples to create. The authors present algorithms for creating stratified samples over single and multiple queries, and describe several optimizations for selecting a subset of columns on which to build the summaries. The second component focuses on selecting the most appropriate sample and sample size to answer a given query. The runtime engine also estimates the error bound of the reported query answer.

The system has been evaluated experimentally and the results demonstrate its scalability and practicality in handling aggregation queries over large datasets. End users who depend on databases for managing their data will find this paper worth reading, as will researchers and scientists working in the data management field, and the data mining community in general.

Online Computing Reviews Service
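
The two components named in the review (building stratified samples, then picking a sample at query time) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the algorithm from the paper: the per-group cap, the row weights, the linear cost model, and all function names (stratified_sample, estimate_group_counts, pick_sample_size) are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, cap, seed=0):
    """Keep at most `cap` rows for each distinct value of row[key].

    Every kept row is paired with weight = group_size / rows_kept, so
    weighted counts over the sample estimate counts over the full data.
    Rare groups (size <= cap) are kept whole and are never lost.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        kept = members if len(members) <= cap else rng.sample(members, cap)
        weight = len(members) / len(kept)
        sample.extend((row, weight) for row in kept)
    return sample


def estimate_group_counts(sample, key):
    """Estimate COUNT(*) ... GROUP BY key from a weighted sample."""
    counts = defaultdict(float)
    for row, weight in sample:
        counts[row[key]] += weight
    return dict(counts)


def pick_sample_size(available_sizes, rows_per_second, time_budget_s):
    """Pick the largest pre-built sample that fits the time budget.

    Assumes scan time grows linearly with sample size (a simplification
    made for this sketch). Falls back to the smallest sample if none fits.
    """
    affordable = [n for n in available_sizes
                  if n / rows_per_second <= time_budget_s]
    return max(affordable) if affordable else min(available_sizes)
```

Capping each group, rather than sampling every group at the same rate, is what keeps rare groups represented: with 1,000 rows for one city and 3 for another, a cap of 50 keeps all 3 rare rows, whereas a 5% uniform sample would likely miss them.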

Published in
EuroSys '13: Proceedings of the 8th ACM European Conference on Computer Systems
April 2013
401 pages
ISBN: 9781450319942
DOI: 10.1145/2465351
General Chairs: Zdenek Hanzálek (Czech Technical University Prague), Hermann Härtig (Technische Universität Dresden)
Program Chairs: Miguel Castro (Microsoft Research Cambridge), M. Frans Kaashoek (MIT)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
EuroSys '13 paper acceptance rate: 28 of 143 submissions, 20%
Overall acceptance rate: 241 of 1,308 submissions, 18%

Article Metrics
- Total Citations: 495
- Total Downloads: 2,061
- Downloads (Last 12 months): 271
- Downloads (Last 6 weeks): 42