Abstract
As data volumes continue to grow, modern database systems increasingly rely on data skipping mechanisms to improve performance by avoiding access to irrelevant data. Recent work [39] proposed a fine-grained partitioning scheme that was shown to improve the opportunities for data skipping in row-oriented systems. Modern analytics and big data systems increasingly adopt columnar storage schemes, and in such systems, a row-based approach misses important opportunities for further improving data skipping. The flexibility of column-oriented organizations, however, comes with the additional cost of tuple reconstruction. In this paper, we develop Generalized Skipping-Oriented Partitioning (GSOP), a novel hybrid data skipping framework that takes into account these row-based and column-based tradeoffs. In contrast to previous column-oriented physical design work, GSOP considers the tradeoffs between horizontal data skipping and vertical partitioning jointly. Our experiments using two public benchmarks and a real-world workload show that GSOP can significantly reduce the amount of data scanned and improve end-to-end query response times over the state-of-the- art techniques.
- Apache Drill. https://drill.apache.org.Google Scholar
- Apache Parquet. http://parquet.apache.org.Google Scholar
- Big Data Benchmark. amplab.cs.berkeley.edu/benchmark.Google Scholar
- CasJobs. http://skyserver.sdss.org/casjobs/.Google Scholar
- Sloan Digital Sky Surveys. http://www.sdss.org.Google Scholar
- TPC-H. http://www.tpc.org/tpch.Google Scholar
- A. Ailamaki et al. Data page layouts for relational databases on deep memory hierarchies. VLDB Journal, 11(3):198--215, Nov. 2002. Google ScholarDigital Library
- A. Gupta et al. Amazon Redshift and the case for simpler data warehouses. In SIGMOD, pages 1917--1923, 2015. Google ScholarDigital Library
- A. Hall et al. Processing a trillion cells per mouse click. PVLDB, 5(11):1436--1446, 2012. Google ScholarDigital Library
- A. Jindal et al. Trojan data layouts: Right shoes for a running elephant. In SOCC, pages 21:1--21:14, New York, NY, USA, 2011. Google ScholarDigital Library
- A. Lamb et al. The Vertica analytic database: C-Store 7 years later. VLDB, 5(12):1790--1801, 2012. Google ScholarDigital Library
- D. Abadi, D. Myers, D. DeWitt, and S. Madden. Materialization strategies in a column-oriented dbms. In ICDE, pages 466--475, April 2007.Google ScholarCross Ref
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994. Google ScholarDigital Library
- I. Alagiannis, S. Idreos, and A. Ailamaki. H2O: A hands-free adaptive store. In SIGMOD, pages 1103--1114, New York, NY, USA, 2014. ACM. Google ScholarDigital Library
- B. Bhattacharjee et al. Efficient query processing for multi-dimensionally clustered tables in DB2. In VLDB, pages 963--974, 2003. Google ScholarDigital Library
- B. Dageville et al. The Snowflake elastic data warehouse. In SIGMOD, pages 215--226, 2016. Google ScholarDigital Library
- P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, pages 225--237, 2005.Google Scholar
- C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3:48--57, 2010. Google ScholarDigital Library
- D. Abadi et al. Integrating compression and execution in column-oriented database systems. In SIGMOD, SIGMOD, pages 671--682, 2006. Google ScholarDigital Library
- D. Abadi et al. The design and implementation of modern column-oriented database systems. Foundations and Trends in Databases, 5(3), 2013. Google ScholarDigital Library
- D. Ślȩzak et al. Brighthouse: An analytic data warehouse for ad-hoc queries. PVLDB, 1(2):1337--1345, 2008. Google ScholarDigital Library
- A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-oriented storage techniques for mapreduce. PVLDB, 4(7), 2011. Google ScholarDigital Library
- S. Idreos, M. L. Kersten, and S. Manegold. Database cracking. In CIDR, pages 68--78, 2007.Google Scholar
- S. Idreos, M. L. Kersten, and S. Manegold. Self-organizing tuple reconstruction in column-stores. In SIGMOD, pages 297--308, 2009. Google ScholarDigital Library
- A. Jindal, E. Palatinus, V. Pavlov, and J. Dittrich. A comparison of knives for bread slicing. PVLDB, 6(6):361--372, 2013. Google ScholarDigital Library
- M. Armbrust et al. Spark SQL: relational data processing in spark. In SIGMOD, pages 1383--1394, 2015. Google ScholarDigital Library
- M. Grund et al. Hyrise: A main memory hybrid storage engine. PVLDB, 4(2):105--116, Nov. 2010. Google ScholarDigital Library
- M. Stonebraker et al. C-store: A column-oriented DBMS. In VLDB, pages 553--564, 2005. Google ScholarDigital Library
- M. Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 2--2, 2012. Google ScholarDigital Library
- M. Zukowski el al. DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing. In DaMoN, pages 47--54, 2008. Google ScholarDigital Library
- G. Moerkotte. Small materialized aggregates: A light weight index for data warehousing. In VLDB, pages 476--487, 1998. Google ScholarDigital Library
- R. Hankins et al. Data morphing: An adaptive, cache-conscious storage technique. In VLDB, pages 417--428. VLDB Endowment, 2003. Google ScholarDigital Library
- J. Rao, C. Zhang, N. Megiddo, and G. Lohman. Automating physical database design in a parallel database. In SIGMOD, pages 558--569, 2002. Google ScholarDigital Library
- S. Agarwal et al. Automated selection of materialized views and indexes in SQL databases. In VLDB, pages 496--505, 2000. Google ScholarDigital Library
- S. Agrawal et al. Integrating vertical and horizontal partitioning into automated physical database design. In SIGMOD, pages 359--370, 2004. Google ScholarDigital Library
- S. Melnik et al. Dremel: interactive analysis of webale datasets. PVLDB, 3(1--2):330--339, 2010. Google ScholarDigital Library
- S. Papadomanolakis el al. AutoPart: Automating schema design for large scientific databases using data partitioning. In SSDBM, pages 383--392, 2004. Google ScholarDigital Library
- F. M. Schuhknecht, A. Jindal, and J. Dittrich. The uncracked pieces in database cracking. PVLDB, 7(2):97--108, Oct. 2013. Google ScholarDigital Library
- L. Sun, M. J. Franklin, S. Krishnan, and R. S. Xin. Fine-grained partitioning for aggressive data skipping. In SIGMOD, pages 1115--1126, 2014. Google ScholarDigital Library
- V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11):1080--1091, 2013. Google ScholarDigital Library
- Y. He et al. RCFile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems. In ICDE, pages 1199--1208, 2011. Google ScholarDigital Library
- Y. Huai et al. Understanding insights into the basic structure and essential issues of table placement methods in clusters. PVLDB, 6(14), 2013. Google ScholarDigital Library
- Yin Huai et al. Major technical advancements in Apache Hive. In SIGMOD, pages 1235--1246, 2014. Google ScholarDigital Library
- Z Liu et al. JSON data management: Supporting schema-less development in rdbms. In SIGMOD, pages 1247--1258, New York, NY, USA, 2014. Google ScholarDigital Library
- J. Zhou, N. Bruno, and W. Lin. Advanced partitioning techniques for massively distributed computation. In SIGMOD, pages 13--24, 2012. Google ScholarDigital Library
- J. Zhou and K. Ross. A multi-resolution block storage model for database design. In IDEAS, pages 22--31, July 2003.Google Scholar
Recommendations
Fine-grained partitioning for aggressive data skipping
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of DataModern query engines are increasingly being required to process enormous datasets in near real-time. While much can be done to speed up the data access, a promising technique is to reduce the need to access data through data skipping. By maintaining ...
Pando: Enhanced Data Skipping with Logical Data Partitioning
With enormous volumes of data, quickly retrieving data that is relevant to a query is essential for achieving high performance. Modern cloud-based database systems often partition the data into blocks and employ various techniques to skip irrelevant ...
A partitioning framework for aggressive data skipping
We propose to demonstrate a fine-grained partitioning framework that reorganizes the data tuples into small blocks at data loading time. The goal is to enable queries to maximally skip scanning data blocks. The partition framework consists of four steps:...
Comments