research-article

Fast queries over heterogeneous data through engine customization

Published: 01 August 2016

Abstract

Industry and academia are becoming increasingly data-driven and data-intensive, relying on the analysis of a wide variety of heterogeneous datasets to gain insights. The different data models and formats make it significantly harder to analyze a combination of diverse datasets. Serving all queries using a single, general-purpose query engine is slow. On the other hand, using a specialized engine for each heterogeneous dataset increases complexity: queries touching a combination of datasets require an integration layer over the different engines.

This paper presents a system design that natively supports heterogeneous data formats and also minimizes query execution times. For multi-format support, the design uses an expressive query algebra which enables operations over various data models. For minimal execution times, it uses a code generation mechanism to mimic the system and storage best suited to answering each query quickly. We validate our design by building Proteus, a query engine which natively supports queries over CSV, JSON, and relational binary data, and which specializes itself to each query, dataset, and workload via code generation. Proteus outperforms state-of-the-art open-source and commercial systems on both synthetic and real-world workloads without being tied to a single data model or format, all while giving users a single query interface.
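To make the idea of per-query, per-format specialization concrete, the following is a toy sketch in Python. It is not how Proteus is implemented; the function name `generate_scan_filter`, the sample data, and the use of Python's `exec` as a stand-in for a real code generator are all assumptions made for illustration. The sketch generates source code for a "count rows where a field exceeds a threshold" query, with the format-specific access path and the predicate constant baked directly into the generated function.

```python
import csv
import io
import json


def generate_scan_filter(fmt, field, threshold):
    """Return (query_function, generated_source) for one format and one predicate.

    Hypothetical helper: the format-specific reader and field access are
    chosen once, at generation time, instead of being checked per row.
    """
    if fmt == "csv":
        reader = "csv.DictReader(io.StringIO(text))"
        access = f"float(row[{field!r}])"  # CSV values arrive as strings
    elif fmt == "json":
        reader = "(json.loads(line) for line in text.splitlines() if line.strip())"
        access = f"row[{field!r}]"  # JSON numbers are already typed
    else:
        raise ValueError(f"unsupported format: {fmt}")

    source = (
        "def query(text):\n"
        "    count = 0\n"
        f"    for row in {reader}:\n"
        f"        if {access} > {threshold!r}:\n"
        "            count += 1\n"
        "    return count\n"
    )
    namespace = {"csv": csv, "io": io, "json": json}
    exec(source, namespace)  # compile the specialized function
    return namespace["query"], source


# Made-up sample data: the same three records in two formats.
csv_text = "id,price\n1,5.0\n2,12.5\n3,30.0\n"
json_text = (
    '{"id": 1, "price": 5.0}\n'
    '{"id": 2, "price": 12.5}\n'
    '{"id": 3, "price": 30.0}\n'
)

count_csv, src_csv = generate_scan_filter("csv", "price", 10.0)
count_json, src_json = generate_scan_filter("json", "price", 10.0)

print(src_csv)
print(count_csv(csv_text), count_json(json_text))
```

Both generated functions answer the same logical query through one interface, while each contains only the code path needed for its own format. A production engine would emit low-level code rather than Python source, but the division of work between generation time and execution time is the point of the sketch.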


Published in

Proceedings of the VLDB Endowment, Volume 9, Issue 12
August 2016
345 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
