research-article

Fast queries over heterogeneous data through engine customization

Authors:
Manos Karpathiotakis, Ecole Polytechnique Fédérale de Lausanne
Ioannis Alagiannis, Ecole Polytechnique Fédérale de Lausanne
Anastasia Ailamaki, Ecole Polytechnique Fédérale de Lausanne and RAW Labs SA

Proceedings of the VLDB Endowment, Volume 9, Issue 12, pp. 972–983. https://doi.org/10.14778/2994509.2994516

Published: 01 August 2016

Abstract

Industry and academia are becoming increasingly data-driven and data-intensive, relying on the analysis of a wide variety of heterogeneous datasets to gain insights. The different data models and formats make it difficult to analyze a combination of diverse datasets. Serving all queries with a single, general-purpose query engine is slow. On the other hand, using a specialized engine for each heterogeneous dataset increases complexity: queries touching a combination of datasets require an integration layer over the different engines.

This paper presents a system design that natively supports heterogeneous data formats and also minimizes query execution times. For multi-format support, the design uses an expressive query algebra which enables operations over various data models. For minimal execution times, it uses a code generation mechanism to mimic the system and storage most appropriate to answer a query fast. We validate our design by building Proteus, a query engine which natively supports queries over CSV, JSON, and relational binary data, and which specializes itself to each query, dataset, and workload via code generation. Proteus outperforms state-of-the-art open-source and commercial systems on both synthetic and real-world workloads without being tied to a single data model or format, all while exposing users to a single query interface.
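
As a rough illustration of specializing an engine to a query and a data format through code generation, the following Python sketch builds the source text of a scan function tailored to one input format and one predicate, then compiles it with `exec`. It is not the Proteus implementation: the helper `generate_scan`, the sample data, and the choice of Python source generation are all assumptions made purely for illustration.

```python
import csv
import io
import json


def generate_scan(fmt, field, threshold):
    """Hypothetical helper: emit and compile a scan specialized to one
    format (CSV or line-delimited JSON) and one predicate (field > threshold)."""
    if fmt == "csv":
        # CSV values arrive as strings, so the generated code converts them.
        body = (
            "    reader = csv.DictReader(io.StringIO(text))\n"
            f"    return sum(1 for row in reader if float(row[{field!r}]) > {threshold!r})\n"
        )
    elif fmt == "json":
        # JSON values are already typed, so no conversion is generated.
        body = (
            "    records = (json.loads(line) for line in text.splitlines() if line.strip())\n"
            f"    return sum(1 for rec in records if rec[{field!r}] > {threshold!r})\n"
        )
    else:
        raise ValueError(f"unsupported format: {fmt}")

    # The field name and threshold are baked into the source as literals,
    # so the compiled function carries no per-row format or predicate dispatch.
    source = "def scan(text):\n" + body
    namespace = {"csv": csv, "io": io, "json": json}
    exec(source, namespace)
    return namespace["scan"]


# Illustrative data only (not from the paper).
csv_text = "id,price\n1,5.0\n2,12.5\n3,20.0\n"
json_text = (
    '{"id": 1, "price": 5.0}\n'
    '{"id": 2, "price": 12.5}\n'
    '{"id": 3, "price": 20.0}\n'
)

count_csv = generate_scan("csv", "price", 10.0)
count_json = generate_scan("json", "price", 10.0)
print(count_csv(csv_text), count_json(json_text))
```

Both generated functions answer the same logical query (count the records whose `price` exceeds 10) behind one interface, while each contains only the parsing logic its own format needs.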

Index Terms

Fast queries over heterogeneous data through engine customization
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Published in
Proceedings of the VLDB Endowment, Volume 9, Issue 12
August 2016
345 pages
ISSN: 2150-8097
Editors: Surajit Chaudhuri (Microsoft Research), Jayant Haritsa (I.I.Sc. Bangalore)
Publisher: VLDB Endowment