Abstract
Database systems deliver impressive performance for large classes of workloads as the result of decades of research into optimizing database engines. High performance, however, is achieved at the cost of versatility. In particular, database systems only operate efficiently over loaded data, i.e., data converted from its original raw format into the system's internal data format. At the same time, data volume continues to increase exponentially and data grows increasingly varied, with an escalating number of new formats. The consequence is a growing impedance mismatch between the original structures holding the data in the raw files and the structures used by query engines for efficient processing. In an ideal scenario, the query engine would seamlessly adapt itself to the data and ensure efficient query processing regardless of the input data formats, optimizing itself for each file instance and each query by leveraging information available at query time. Today's systems, however, force data to adapt to the query engine during data loading.
This paper proposes adapting the query engine to the formats of raw data. It presents RAW, a prototype query engine which enables querying heterogeneous data sources transparently. RAW employs Just-In-Time access paths, which efficiently couple heterogeneous raw files to the query engine and reduce the overheads of traditional general-purpose scan operators. There are, however, inherent overheads in accessing raw data directly that cannot be eliminated, such as converting the raw values. Therefore, RAW also uses column shreds, ensuring that we pay these costs only for the subsets of raw data strictly needed by a query. We use RAW in a real-world scenario and achieve a speedup of two orders of magnitude over the existing hand-written solution.
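The idea behind column shreds, paying conversion costs only for the raw values a query actually needs, can be illustrated with a small sketch. This is not RAW's implementation. The comma-separated sample rows, the `CountingConverter` helper, and the two scan functions are hypothetical names introduced here, and the query is assumed to have the shape "return column b for rows where column a is below a threshold".

```python
from typing import Callable, List

# Hypothetical raw file contents: two text columns per row, "a,b".
RAW_ROWS = ["1,10.5", "7,20.25", "3,30.0", "9,40.75"]


class CountingConverter:
    """Wraps a conversion function and counts how many raw values it converts."""

    def __init__(self, fn: Callable[[str], float]) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, text: str) -> float:
        self.calls += 1
        return self.fn(text)


def eager_scan(rows: List[str], to_a, to_b, threshold: float) -> List[float]:
    """Convert every field of every row first, then apply the filter."""
    table = []
    for row in rows:
        a_txt, b_txt = row.split(",")
        table.append((to_a(a_txt), to_b(b_txt)))
    return [b for a, b in table if a < threshold]


def shredded_scan(rows: List[str], to_a, to_b, threshold: float) -> List[float]:
    """Convert column a for the filter, then column b only for qualifying rows."""
    fields = [row.split(",") for row in rows]
    qualifying = [i for i, f in enumerate(fields) if to_a(f[0]) < threshold]
    return [to_b(fields[i][1]) for i in qualifying]


def conversions(scan) -> int:
    """Run a scan with fresh counters and return the total number of conversions."""
    to_a, to_b = CountingConverter(int), CountingConverter(float)
    scan(RAW_ROWS, to_a, to_b, 5)
    return to_a.calls + to_b.calls
```

Both scans return the same answer. The difference is that the second one converts column b only for the rows that pass the filter, so its saving grows as the filter becomes more selective.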
Index Terms
- Adaptive query processing on RAW data