research-article

Blink and it's done: interactive queries on very large data

Published: 01 August 2012

Abstract

In this demonstration, we present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA (selection, projection, join and aggregate) queries as supported by these systems. BlinkDB provides real-time answers along with statistical error guarantees, and can scale to petabytes of data and thousands of machines in a fault-tolerant manner. Our experiments using the TPC-H benchmark and on an anonymized real-world video content distribution workload from Conviva Inc. show that BlinkDB can execute a wide range of queries up to 150x faster than Hive on MapReduce and 10-150x faster than Shark (Hive on Spark) over tens of terabytes of data stored across 100 machines, all with an error of 2-10%.
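To illustrate the general idea of answering an aggregate query from a sample and attaching a statistical error bound, the following is a minimal Python sketch. It is not BlinkDB's implementation: the function name `approximate_mean`, the uniform sampling strategy, the synthetic data, and the normal-approximation confidence interval (z = 1.96 for roughly 95% confidence) are all assumptions made for illustration.

```python
import math
import random


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Hypothetical helper for illustration only. Returns a pair
    (estimate, half_width), where half_width is the half-width of a
    confidence interval based on the normal approximation.
    Requires sample_size >= 2.
    """
    rng = random.Random(seed)
    # Uniform sample without replacement (an assumed sampling strategy).
    sample = rng.sample(values, sample_size)
    n = len(sample)
    mean = sum(sample) / n
    # Unbiased sample variance.
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    # Standard error of the mean, scaled by the z-score.
    half_width = z * math.sqrt(variance / n)
    return mean, half_width


# Synthetic stand-in for a large table column.
population_rng = random.Random(42)
population = [population_rng.uniform(0, 100) for _ in range(200_000)]

estimate, half_width = approximate_mean(population, sample_size=2_000)
exact = sum(population) / len(population)
print(f"approx = {estimate:.2f} +/- {half_width:.2f}, exact = {exact:.2f}")
```

The sketch reads only 1% of the rows yet reports an interval around the estimate; a larger sample narrows the interval at the cost of more work, which is the accuracy versus response-time trade-off the abstract describes.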



      • Published in

        Proceedings of the VLDB Endowment  Volume 5, Issue 12
        August 2012
        340 pages

        Publisher

        VLDB Endowment

