Abstract
Industry and academia are becoming ever more data-driven and data-intensive, relying on the analysis of a wide variety of heterogeneous datasets to gain insights. The different data models and formats make it significantly harder to analyze a combination of diverse datasets. Serving all queries with a single, general-purpose query engine is slow. On the other hand, using a specialized engine for each heterogeneous dataset increases complexity: queries touching a combination of datasets require an integration layer over the different engines.
This paper presents a system design that natively supports heterogeneous data formats and also minimizes query execution times. For multi-format support, the design uses an expressive query algebra which enables operations over various data models. For minimal execution times, it uses a code generation mechanism to mimic the system and storage most appropriate for answering a query fast. We validate our design by building Proteus, a query engine which natively supports queries over CSV, JSON, and relational binary data, and which specializes itself to each query, dataset, and workload via code generation. Proteus outperforms state-of-the-art open-source and commercial systems on both synthetic and real-world workloads without being tied to a single data model or format, all while offering users a single query interface.
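To make the idea of per-query, per-format specialization concrete, the following is a minimal sketch in Python. It is not Proteus's implementation, and the abstract does not describe the actual mechanism at this level of detail. The names `generate_scan` and `compile_query`, the line-delimited JSON layout, and the sample data are all assumptions made for illustration. The sketch emits source code for a filter-and-sum query tailored to one input format, compiles it at run time, and exposes the same calling interface for both formats.

```python
import csv
import io
import json


def generate_scan(fmt, field, threshold):
    """Return Python source for a function that sums `field` over all rows
    whose value exceeds `threshold`, specialized to one input format.
    (Hypothetical helper, for illustration only.)"""
    if fmt == "csv":
        # CSV values arrive as strings, so the generated code converts them.
        loop = "for row in csv.DictReader(io.StringIO(text)):"
        access = f"float(row[{field!r}])"
    elif fmt == "json":
        # Assumes one JSON object per line; values are already numeric.
        loop = ("for row in (json.loads(line) "
                "for line in text.splitlines() if line.strip()):")
        access = f"row[{field!r}]"
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return (
        "def query(text):\n"
        "    total = 0.0\n"
        f"    {loop}\n"
        f"        value = {access}\n"
        f"        if value > {threshold!r}:\n"
        "            total += value\n"
        "    return total\n"
    )


def compile_query(fmt, field, threshold):
    """Compile the generated source and return the specialized function."""
    namespace = {"csv": csv, "io": io, "json": json}
    exec(generate_scan(fmt, field, threshold), namespace)
    return namespace["query"]


# Hypothetical sample data: the same three records in two formats.
csv_text = "id,price\n1,5.0\n2,12.5\n3,20.0\n"
json_text = (
    '{"id": 1, "price": 5.0}\n'
    '{"id": 2, "price": 12.5}\n'
    '{"id": 3, "price": 20.0}\n'
)

csv_query = compile_query("csv", "price", 10)
json_query = compile_query("json", "price", 10)
print(csv_query(csv_text), json_query(json_text))
```

The field name and threshold are baked into the generated source, so the compiled function contains no format dispatch or generic field lookup logic at run time. A real engine would generate lower-level code and cover far more operators than this single filter-and-sum query.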
Index Terms
- Fast queries over heterogeneous data through engine customization