research-article

Skipping-oriented partitioning for columnar layouts

Published: 01 November 2016

Abstract

As data volumes continue to grow, modern database systems increasingly rely on data skipping mechanisms to improve performance by avoiding access to irrelevant data. Recent work [39] proposed a fine-grained partitioning scheme that was shown to improve the opportunities for data skipping in row-oriented systems. Modern analytics and big data systems increasingly adopt columnar storage schemes, and in such systems, a row-based approach misses important opportunities for further improving data skipping. The flexibility of column-oriented organizations, however, comes with the additional cost of tuple reconstruction. In this paper, we develop Generalized Skipping-Oriented Partitioning (GSOP), a novel hybrid data skipping framework that takes into account these row-based and column-based tradeoffs. In contrast to previous column-oriented physical design work, GSOP considers the tradeoffs between horizontal data skipping and vertical partitioning jointly. Our experiments using two public benchmarks and a real-world workload show that GSOP can significantly reduce the amount of data scanned and improve end-to-end query response times over the state-of-the-art techniques.

References

  1. Apache Drill. https://drill.apache.org.
  2. Apache Parquet. http://parquet.apache.org.
  3. Big Data Benchmark. amplab.cs.berkeley.edu/benchmark.
  4. CasJobs. http://skyserver.sdss.org/casjobs/.
  5. Sloan Digital Sky Surveys. http://www.sdss.org.
  6. TPC-H. http://www.tpc.org/tpch.
  7. A. Ailamaki et al. Data page layouts for relational databases on deep memory hierarchies. VLDB Journal, 11(3):198--215, Nov. 2002.
  8. A. Gupta et al. Amazon Redshift and the case for simpler data warehouses. In SIGMOD, pages 1917--1923, 2015.
  9. A. Hall et al. Processing a trillion cells per mouse click. PVLDB, 5(11):1436--1446, 2012.
  10. A. Jindal et al. Trojan data layouts: Right shoes for a running elephant. In SOCC, pages 21:1--21:14, New York, NY, USA, 2011.
  11. A. Lamb et al. The Vertica analytic database: C-Store 7 years later. PVLDB, 5(12):1790--1801, 2012.
  12. D. Abadi, D. Myers, D. DeWitt, and S. Madden. Materialization strategies in a column-oriented DBMS. In ICDE, pages 466--475, April 2007.
  13. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994.
  14. I. Alagiannis, S. Idreos, and A. Ailamaki. H2O: A hands-free adaptive store. In SIGMOD, pages 1103--1114, New York, NY, USA, 2014. ACM.
  15. B. Bhattacharjee et al. Efficient query processing for multi-dimensionally clustered tables in DB2. In VLDB, pages 963--974, 2003.
  16. B. Dageville et al. The Snowflake elastic data warehouse. In SIGMOD, pages 215--226, 2016.
  17. P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, pages 225--237, 2005.
  18. C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3:48--57, 2010.
  19. D. Abadi et al. Integrating compression and execution in column-oriented database systems. In SIGMOD, pages 671--682, 2006.
  20. D. Abadi et al. The design and implementation of modern column-oriented database systems. Foundations and Trends in Databases, 5(3), 2013.
  21. D. Ślȩzak et al. Brighthouse: An analytic data warehouse for ad-hoc queries. PVLDB, 1(2):1337--1345, 2008.
  22. A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-oriented storage techniques for MapReduce. PVLDB, 4(7), 2011.
  23. S. Idreos, M. L. Kersten, and S. Manegold. Database cracking. In CIDR, pages 68--78, 2007.
  24. S. Idreos, M. L. Kersten, and S. Manegold. Self-organizing tuple reconstruction in column-stores. In SIGMOD, pages 297--308, 2009.
  25. A. Jindal, E. Palatinus, V. Pavlov, and J. Dittrich. A comparison of knives for bread slicing. PVLDB, 6(6):361--372, 2013.
  26. M. Armbrust et al. Spark SQL: relational data processing in Spark. In SIGMOD, pages 1383--1394, 2015.
  27. M. Grund et al. Hyrise: A main memory hybrid storage engine. PVLDB, 4(2):105--116, Nov. 2010.
  28. M. Stonebraker et al. C-store: A column-oriented DBMS. In VLDB, pages 553--564, 2005.
  29. M. Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 2--2, 2012.
  30. M. Zukowski et al. DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing. In DaMoN, pages 47--54, 2008.
  31. G. Moerkotte. Small materialized aggregates: A light weight index for data warehousing. In VLDB, pages 476--487, 1998.
  32. R. Hankins et al. Data morphing: An adaptive, cache-conscious storage technique. In VLDB, pages 417--428. VLDB Endowment, 2003.
  33. J. Rao, C. Zhang, N. Megiddo, and G. Lohman. Automating physical database design in a parallel database. In SIGMOD, pages 558--569, 2002.
  34. S. Agrawal et al. Automated selection of materialized views and indexes in SQL databases. In VLDB, pages 496--505, 2000.
  35. S. Agrawal et al. Integrating vertical and horizontal partitioning into automated physical database design. In SIGMOD, pages 359--370, 2004.
  36. S. Melnik et al. Dremel: interactive analysis of web-scale datasets. PVLDB, 3(1--2):330--339, 2010.
  37. S. Papadomanolakis et al. AutoPart: Automating schema design for large scientific databases using data partitioning. In SSDBM, pages 383--392, 2004.
  38. F. M. Schuhknecht, A. Jindal, and J. Dittrich. The uncracked pieces in database cracking. PVLDB, 7(2):97--108, Oct. 2013.
  39. L. Sun, M. J. Franklin, S. Krishnan, and R. S. Xin. Fine-grained partitioning for aggressive data skipping. In SIGMOD, pages 1115--1126, 2014.
  40. V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11):1080--1091, 2013.
  41. Y. He et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. In ICDE, pages 1199--1208, 2011.
  42. Y. Huai et al. Understanding insights into the basic structure and essential issues of table placement methods in clusters. PVLDB, 6(14), 2013.
  43. Y. Huai et al. Major technical advancements in Apache Hive. In SIGMOD, pages 1235--1246, 2014.
  44. Z. Liu et al. JSON data management: Supporting schema-less development in RDBMS. In SIGMOD, pages 1247--1258, New York, NY, USA, 2014.
  45. J. Zhou, N. Bruno, and W. Lin. Advanced partitioning techniques for massively distributed computation. In SIGMOD, pages 13--24, 2012.
  46. J. Zhou and K. Ross. A multi-resolution block storage model for database design. In IDEAS, pages 22--31, July 2003.

  • Published in

    Proceedings of the VLDB Endowment, Volume 10, Issue 4
    November 2016
    180 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment
