research-article

Skipping-oriented partitioning for columnar layouts

Authors:
Liwen Sun

University of California Berkeley

University of California Berkeley
View Profile

,
Michael J. Franklin

University of California Berkeley

University of California Berkeley
View Profile

,
Jiannan Wang

Simon Fraser University

Simon Fraser University
View Profile

,
Eugene Wu

Columbia University

Columbia University
View Profile

Proceedings of the VLDB Endowment Volume 10 Issue 4pp 421–432https://doi.org/10.14778/3025111.3025123

Published:01 November 2016Publication History

Proceedings of the VLDB Endowment

Abstract

As data volumes continue to grow, modern database systems increasingly rely on data skipping mechanisms to improve performance by avoiding access to irrelevant data. Recent work [39] proposed a fine-grained partitioning scheme that was shown to improve the opportunities for data skipping in row-oriented systems. Modern analytics and big data systems increasingly adopt columnar storage schemes, and in such systems, a row-based approach misses important opportunities for further improving data skipping. The flexibility of column-oriented organizations, however, comes with the additional cost of tuple reconstruction. In this paper, we develop Generalized Skipping-Oriented Partitioning (GSOP), a novel hybrid data skipping framework that takes into account these row-based and column-based tradeoffs. In contrast to previous column-oriented physical design work, GSOP considers the tradeoffs between horizontal data skipping and vertical partitioning jointly. Our experiments using two public benchmarks and a real-world workload show that GSOP can significantly reduce the amount of data scanned and improve end-to-end query response times over the state-of-the- art techniques.

References

Apache Drill. https://drill.apache.org.Google Scholar
Apache Parquet. http://parquet.apache.org.Google Scholar
Big Data Benchmark. amplab.cs.berkeley.edu/benchmark.Google Scholar
CasJobs. http://skyserver.sdss.org/casjobs/.Google Scholar
Sloan Digital Sky Surveys. http://www.sdss.org.Google Scholar
TPC-H. http://www.tpc.org/tpch.Google Scholar
A. Ailamaki et al. Data page layouts for relational databases on deep memory hierarchies. VLDB Journal, 11(3):198--215, Nov. 2002. Google ScholarDigital Library
A. Gupta et al. Amazon Redshift and the case for simpler data warehouses. In SIGMOD, pages 1917--1923, 2015. Google ScholarDigital Library
A. Hall et al. Processing a trillion cells per mouse click. PVLDB, 5(11):1436--1446, 2012. Google ScholarDigital Library
A. Jindal et al. Trojan data layouts: Right shoes for a running elephant. In SOCC, pages 21:1--21:14, New York, NY, USA, 2011. Google ScholarDigital Library
A. Lamb et al. The Vertica analytic database: C-Store 7 years later. VLDB, 5(12):1790--1801, 2012. Google ScholarDigital Library
D. Abadi, D. Myers, D. DeWitt, and S. Madden. Materialization strategies in a column-oriented dbms. In ICDE, pages 466--475, April 2007.Google ScholarCross Ref
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994. Google ScholarDigital Library
I. Alagiannis, S. Idreos, and A. Ailamaki. H2O: A hands-free adaptive store. In SIGMOD, pages 1103--1114, New York, NY, USA, 2014. ACM. Google ScholarDigital Library
B. Bhattacharjee et al. Efficient query processing for multi-dimensionally clustered tables in DB2. In VLDB, pages 963--974, 2003. Google ScholarDigital Library
B. Dageville et al. The Snowflake elastic data warehouse. In SIGMOD, pages 215--226, 2016. Google ScholarDigital Library
P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, pages 225--237, 2005.Google Scholar
C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3:48--57, 2010. Google ScholarDigital Library
D. Abadi et al. Integrating compression and execution in column-oriented database systems. In SIGMOD, SIGMOD, pages 671--682, 2006. Google ScholarDigital Library
D. Abadi et al. The design and implementation of modern column-oriented database systems. Foundations and Trends in Databases, 5(3), 2013. Google ScholarDigital Library
D. Ślȩzak et al. Brighthouse: An analytic data warehouse for ad-hoc queries. PVLDB, 1(2):1337--1345, 2008. Google ScholarDigital Library
A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-oriented storage techniques for mapreduce. PVLDB, 4(7), 2011. Google ScholarDigital Library
S. Idreos, M. L. Kersten, and S. Manegold. Database cracking. In CIDR, pages 68--78, 2007.Google Scholar
S. Idreos, M. L. Kersten, and S. Manegold. Self-organizing tuple reconstruction in column-stores. In SIGMOD, pages 297--308, 2009. Google ScholarDigital Library
A. Jindal, E. Palatinus, V. Pavlov, and J. Dittrich. A comparison of knives for bread slicing. PVLDB, 6(6):361--372, 2013. Google ScholarDigital Library
M. Armbrust et al. Spark SQL: relational data processing in spark. In SIGMOD, pages 1383--1394, 2015. Google ScholarDigital Library
M. Grund et al. Hyrise: A main memory hybrid storage engine. PVLDB, 4(2):105--116, Nov. 2010. Google ScholarDigital Library
M. Stonebraker et al. C-store: A column-oriented DBMS. In VLDB, pages 553--564, 2005. Google ScholarDigital Library
M. Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 2--2, 2012. Google ScholarDigital Library
M. Zukowski el al. DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing. In DaMoN, pages 47--54, 2008. Google ScholarDigital Library
G. Moerkotte. Small materialized aggregates: A light weight index for data warehousing. In VLDB, pages 476--487, 1998. Google ScholarDigital Library
R. Hankins et al. Data morphing: An adaptive, cache-conscious storage technique. In VLDB, pages 417--428. VLDB Endowment, 2003. Google ScholarDigital Library
J. Rao, C. Zhang, N. Megiddo, and G. Lohman. Automating physical database design in a parallel database. In SIGMOD, pages 558--569, 2002. Google ScholarDigital Library
S. Agarwal et al. Automated selection of materialized views and indexes in SQL databases. In VLDB, pages 496--505, 2000. Google ScholarDigital Library
S. Agrawal et al. Integrating vertical and horizontal partitioning into automated physical database design. In SIGMOD, pages 359--370, 2004. Google ScholarDigital Library
S. Melnik et al. Dremel: interactive analysis of webale datasets. PVLDB, 3(1--2):330--339, 2010. Google ScholarDigital Library
S. Papadomanolakis el al. AutoPart: Automating schema design for large scientific databases using data partitioning. In SSDBM, pages 383--392, 2004. Google ScholarDigital Library
F. M. Schuhknecht, A. Jindal, and J. Dittrich. The uncracked pieces in database cracking. PVLDB, 7(2):97--108, Oct. 2013. Google ScholarDigital Library
L. Sun, M. J. Franklin, S. Krishnan, and R. S. Xin. Fine-grained partitioning for aggressive data skipping. In SIGMOD, pages 1115--1126, 2014. Google ScholarDigital Library
V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11):1080--1091, 2013. Google ScholarDigital Library
Y. He et al. RCFile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems. In ICDE, pages 1199--1208, 2011. Google ScholarDigital Library
Y. Huai et al. Understanding insights into the basic structure and essential issues of table placement methods in clusters. PVLDB, 6(14), 2013. Google ScholarDigital Library
Yin Huai et al. Major technical advancements in Apache Hive. In SIGMOD, pages 1235--1246, 2014. Google ScholarDigital Library
Z Liu et al. JSON data management: Supporting schema-less development in rdbms. In SIGMOD, pages 1247--1258, New York, NY, USA, 2014. Google ScholarDigital Library
J. Zhou, N. Bruno, and W. Lin. Advanced partitioning techniques for massively distributed computation. In SIGMOD, pages 13--24, 2012. Google ScholarDigital Library
J. Zhou and K. Ross. A multi-resolution block storage model for database design. In IDEAS, pages 22--31, July 2003.Google Scholar

Recommendations

Fine-grained partitioning for aggressive data skipping
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Modern query engines are increasingly being required to process enormous datasets in near real-time. While much can be done to speed up the data access, a promising technique is to reduce the need to access data through data skipping. By maintaining ...
Read More
Pando: Enhanced Data Skipping with Logical Data Partitioning

With enormous volumes of data, quickly retrieving data that is relevant to a query is essential for achieving high performance. Modern cloud-based database systems often partition the data into blocks and employ various techniques to skip irrelevant ...
Read More
A partitioning framework for aggressive data skipping

We propose to demonstrate a fine-grained partitioning framework that reorganizes the data tuples into small blocks at data loading time. The goal is to enable queries to maximally skip scanning data blocks. The partition framework consists of four steps:...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 10, Issue 4
November 2016
180 pages
ISSN:2150-8097
Editors:
Peter Boncz
CWI
,
Ken Salem
University of Waterloo
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 November 2016
Published in pvldb Volume 10, Issue 4
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 18
  Total Citations
  View Citations
- 278
  Total Downloads
- Downloads (Last 12 months)33
- Downloads (Last 6 weeks)10
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Skipping-oriented partitioning for columnar layouts

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Fine-grained partitioning for aggressive data skipping

Pando: Enhanced Data Skipping with Logical Data Partitioning

A partitioning framework for aggressive data skipping

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Skipping-oriented partitioning for columnar layouts

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Fine-grained partitioning for aggressive data skipping

Pando: Enhanced Data Skipping with Logical Data Partitioning

A partitioning framework for aggressive data skipping

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media