research-article

Adaptive query processing on RAW data

Published: 01 August 2014
Abstract

Decades of research into optimizing database engines have given database systems impressive performance for large classes of workloads. This performance, however, comes at the cost of versatility: database systems operate efficiently only over loaded data, i.e., data converted from its original raw format into the system's internal format. At the same time, data volume continues to grow exponentially and data variety keeps increasing, with an escalating number of new formats. The consequence is a growing impedance mismatch between the structures that hold the data in raw files and the structures that query engines use for efficient processing. Ideally, the query engine would adapt itself seamlessly to the data and process queries efficiently regardless of input format, optimizing itself for each file and each query by leveraging information available at query time. Today's systems instead force data to adapt to the query engine during data loading.

This paper proposes adapting the query engine to the formats of raw data. It presents RAW, a prototype query engine that enables transparent querying of heterogeneous data sources. RAW employs Just-In-Time access paths, which efficiently couple heterogeneous raw files to the query engine and reduce the overheads of traditional general-purpose scan operators. Some overheads of accessing raw data directly are inherent and cannot be eliminated, such as converting the raw values. RAW therefore also uses column shreds, which ensure that these costs are paid only for the subsets of raw data strictly needed by a query. We use RAW in a real-world scenario and achieve a speedup of two orders of magnitude over the existing hand-written solution.
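
The idea behind column shreds can be illustrated with a small sketch. This is not RAW's implementation; it is a toy Python example that assumes a CSV input, and the sample data, the column names, and the function `scan_with_column_shreds` are all hypothetical. It converts the projected column only for rows that survive the filter, so the conversion cost is paid only for the values a query actually needs.

```python
import csv
import io

# Hypothetical sample data, invented for illustration (not from the paper).
RAW_CSV = "id,temp,energy\n1,20.5,100\n2,31.0,200\n3,35.2,300\n4,18.9,400\n"


def scan_with_column_shreds(raw_text, filter_col, predicate, project_col, convert):
    """Return converted `project_col` values for rows passing the filter.

    The filter column is converted for every row, because the predicate
    needs it. The projected column stays as raw text until a row
    qualifies, so its conversion is skipped for filtered-out rows.
    """
    reader = csv.reader(io.StringIO(raw_text))
    header = next(reader)
    f_idx = header.index(filter_col)
    p_idx = header.index(project_col)

    results = []
    conversions = 0  # counts conversions of the projected column only
    for fields in reader:
        if predicate(float(fields[f_idx])):
            results.append(convert(fields[p_idx]))
            conversions += 1
    return results, conversions


results, conversions = scan_with_column_shreds(
    RAW_CSV, "temp", lambda t: t > 30.0, "energy", int
)
print(results, conversions)
```

In this sketch only two of the four `energy` values are ever converted, because only two rows satisfy the predicate on `temp`. An eager scan would convert all four before filtering.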

Published in

Proceedings of the VLDB Endowment, Volume 7, Issue 12
August 2014
296 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
