Abstract
We propose to demonstrate a fine-grained partitioning framework that reorganizes data tuples into small blocks at data loading time, so that queries can skip as many data blocks as possible. The framework consists of four steps: (1) workload analysis, which extracts features from a query workload, (2) augmentation, which augments each data tuple with a feature vector, (3) reduce, which succinctly represents a set of data tuples using a set of feature vectors, and (4) partitioning, which runs a clustering algorithm over the feature vectors and uses the clustering result to guide the actual data partitioning. Our experiments show that these techniques improve query response time by 3-7x over traditional range partitioning, due to more effective data skipping.
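The four steps can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' implementation: the predicate names, the sample rows, and every function name are hypothetical; features are taken to be the most frequent filter predicates in the workload; and the clustering step is replaced by a simple greedy packing of sorted feature vectors, which stands in for whatever clustering algorithm the framework actually uses.

```python
from collections import Counter

# Hypothetical filter predicates that queries in the workload may use.
PREDICATES = {
    "price_lt_10": lambda r: r["price"] < 10,
    "qty_gt_5": lambda r: r["qty"] > 5,
    "region_eq_1": lambda r: r["region"] == 1,
}

def analyze_workload(workload, k):
    """Step 1: keep the k most frequent filter predicates as features."""
    freq = Counter(name for query in workload for name in query)
    return [name for name, _ in freq.most_common(k)]

def augment(rows, features):
    """Step 2: give each tuple a bit vector; bit i is 1 if it satisfies feature i."""
    return [tuple(int(PREDICATES[f](r)) for f in features) for r in rows]

def reduce_vectors(vectors):
    """Step 3: represent the tuples as distinct feature vectors with counts."""
    return Counter(vectors)

def partition(counts, min_block_size):
    """Step 4 (toy stand-in for clustering): pack sorted vectors into blocks.

    A block is closed once it holds at least min_block_size tuples.
    """
    assignment, block_id, filled = {}, 0, 0
    for vec in sorted(counts):
        assignment[vec] = block_id
        filled += counts[vec]
        if filled >= min_block_size:
            block_id, filled = block_id + 1, 0
    return assignment

def block_unions(assignment):
    """OR together the feature vectors assigned to each block."""
    unions = {}
    for vec, block in assignment.items():
        prev = unions.get(block, (0,) * len(vec))
        unions[block] = tuple(a | b for a, b in zip(prev, vec))
    return unions

def blocks_to_scan(unions, feature_index):
    """A block is skipped when no tuple in it satisfies the queried feature."""
    return sorted(b for b, u in unions.items() if u[feature_index])

# Made-up workload: each query is the list of predicates it filters on.
workload = [["price_lt_10"], ["price_lt_10", "qty_gt_5"],
            ["price_lt_10"], ["qty_gt_5"], ["region_eq_1"]]
# Made-up data tuples.
rows = [
    {"price": 5, "qty": 1, "region": 1}, {"price": 7, "qty": 2, "region": 2},
    {"price": 20, "qty": 9, "region": 1}, {"price": 30, "qty": 8, "region": 2},
    {"price": 50, "qty": 1, "region": 1}, {"price": 60, "qty": 2, "region": 2},
]

features = analyze_workload(workload, k=2)
counts = reduce_vectors(augment(rows, features))
unions = block_unions(partition(counts, min_block_size=2))
print(features, blocks_to_scan(unions, 0))
```

In this toy run, a query that filters on the first feature only needs to scan one of the three blocks, because the other two blocks contain no tuple that satisfies it. The reduce step matters for scale: the partitioning step works on the distinct feature vectors and their counts rather than on individual tuples.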
Index Terms
- A partitioning framework for aggressive data skipping