research-article

Adaptive query processing on RAW data

Authors:
Manos Karpathiotakis, EPFL, Switzerland
Miguel Branco, EPFL, Switzerland
Ioannis Alagiannis, EPFL, Switzerland
Anastasia Ailamaki, EPFL, Switzerland

Proceedings of the VLDB Endowment, Volume 7, Issue 12, pp. 1119–1130. https://doi.org/10.14778/2732977.2732986

Published: 01 August 2014

Abstract

Database systems deliver impressive performance for large classes of workloads as the result of decades of research into optimizing database engines. High performance, however, is achieved at the cost of versatility. In particular, database systems only operate efficiently over loaded data, i.e., data converted from its original raw format into the system's internal data format. At the same time, data volume continues to increase exponentially and data variety keeps growing, with an escalating number of new formats. The consequence is a growing impedance mismatch between the original structures holding the data in the raw files and the structures used by query engines for efficient processing. In an ideal scenario, the query engine would seamlessly adapt itself to the data and ensure efficient query processing regardless of the input data formats, optimizing itself to each instance of a file and of a query by leveraging information available at query time. Today's systems, however, force data to adapt to the query engine during data loading.

This paper proposes adapting the query engine to the formats of raw data. It presents RAW, a prototype query engine that enables querying heterogeneous data sources transparently. RAW employs Just-In-Time access paths, which efficiently couple heterogeneous raw files to the query engine and reduce the overheads of traditional general-purpose scan operators. There are, however, inherent overheads in accessing raw data directly that cannot be eliminated, such as converting the raw values. Therefore, RAW also uses column shreds, ensuring that we pay these costs only for the subsets of raw data strictly needed by a query. We use RAW in a real-world scenario and achieve a speedup of two orders of magnitude over the existing hand-written solution.
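
As a rough illustration of the column-shreds idea described above, the following Python sketch answers a simple filter query over a small comma-separated input. It is not RAW's implementation: the function names, the sample data, and the conversion counter are all assumptions made for this example. The eager variant converts every field of every row, while the shred variant converts the projected field only for rows that pass the predicate.

```python
# Hypothetical sample data: two columns (integer key, float value) kept as raw text.
RAW_CSV = "1,10.5\n7,20.25\n3,30.0\n9,40.75\n"


def query_eager(raw_text, threshold):
    """SELECT col1 WHERE col0 < threshold, converting every raw field up front."""
    conversions = 0
    results = []
    for line in raw_text.splitlines():
        fields = line.split(",")
        key = int(fields[0])
        value = float(fields[1])  # converted even if the row is filtered out
        conversions += 2
        if key < threshold:
            results.append(value)
    return results, conversions


def query_with_shreds(raw_text, threshold):
    """Same query, but col1 is converted only for rows that pass the filter."""
    conversions = 0
    results = []
    for line in raw_text.splitlines():
        fields = line.split(",")  # tokenize only; fields stay raw strings
        key = int(fields[0])      # the predicate column must always be converted
        conversions += 1
        if key < threshold:
            results.append(float(fields[1]))  # pay the cost only when needed
            conversions += 1
    return results, conversions


if __name__ == "__main__":
    print(query_eager(RAW_CSV, 5))
    print(query_with_shreds(RAW_CSV, 5))
```

On this input with a threshold of 5, both variants return the same two values, but the shred variant performs 6 conversions instead of 8. The gap widens as the predicate becomes more selective.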

Index Terms

Adaptive query processing on RAW data
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Published in
Proceedings of the VLDB Endowment, Volume 7, Issue 12
August 2014
296 pages
ISSN: 2150-8097
Editors: H. V. Jagadish (University of Michigan), Aoying Zhou (East China Normal University, China)
Publisher: VLDB Endowment