research-article

Fine-grained partitioning for aggressive data skipping

Authors:
Liwen Sun

UC Berkeley, Berkeley, CA, USA

UC Berkeley, Berkeley, CA, USA
View Profile

,
Michael J. Franklin

UC Berkeley, Berkeley, CA, USA

UC Berkeley, Berkeley, CA, USA
View Profile

,
Sanjay Krishnan

UC Berkeley, Berkeley, CA, USA

UC Berkeley, Berkeley, CA, USA
View Profile

,
Reynold S. Xin

Databricks Inc., Berkeley, CA, USA

Databricks Inc., Berkeley, CA, USA
View Profile

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of DataJune 2014Pages 1115–1126https://doi.org/10.1145/2588555.2610515

Published:18 June 2014Publication History

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 1115–1126

ABSTRACT

Modern query engines are increasingly being required to process enormous datasets in near real-time. While much can be done to speed up the data access, a promising technique is to reduce the need to access data through data skipping. By maintaining some metadata for each block of tuples, a query may skip a data block if the metadata indicates that the block does not contain relevant data. The effectiveness of data skipping, however, depends on how well the blocking scheme matches the query filters.

In this paper, we propose a fine-grained blocking technique that reorganizes the data tuples into blocks with a goal of enabling queries to skip blocks aggressively. We first extract representative filters in a workload as features using frequent itemset mining. Based on these features, each data tuple can be represented as a feature vector. We then formulate the blocking problem as a optimization problem on the feature vectors, called Balanced MaxSkip Partitioning, which we prove is NP-hard. To find an approximate solution efficiently, we adopt the bottom-up clustering framework. We prototyped our blocking techniques on Shark, an open-source data warehouse system. Our experiments on TPC-H and a real-world workload show that our blocking technique leads to 2-5x improvement in query response time over traditional range-based blocking techniques.

References

IBM Netezza. http://www.netezza.com.Google Scholar
Impala. https://github.com/cloudera/impala.Google Scholar
Oracle. http://docs.oracle.com/.Google Scholar
Postgres. http://www.postgresql.org/docs/.Google Scholar
H. Koga et al. Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing. Knowledge and Information Systems, 12(1), 2007. Google ScholarDigital Library
A. Hall et al. Processing a trillion cells per mouse click. PVLDB, 5(11):1436--1446 ,2012. Google ScholarDigital Library
A. Lamb et al. The Vertica analytic database: C-Store 7 years later. VLDB, 5(12):1790--1801, 2012. Google ScholarDigital Library
A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarDigital Library
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994. Google ScholarDigital Library
S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated selection of materialized views and indexes in SQL databases. In VLDB, pages 496--505,2000. Google ScholarDigital Library
S. Agrawal, V. Narasayya, and B. Yang. Integrating vertical and horizontal partitioning into automated physical database design. In SIGMOD, pages 359--370, 2004. Google ScholarDigital Library
K. Alexiou, D. Kossmann, and P.-A. Larson. Adaptive range filters for cold data: Avoiding trips to Siberia. PVLDB, 6 (14):1714--1725, 2013. Google ScholarDigital Library
D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of euclidean sum-of-squares clustering. Machine Learning, 75(2): 245--248, 2009. Google ScholarDigital Library
K. Aouiche, P.-E. Jouve, and J. Darmont. Clustering-based materialized view selection in data warehouses. In Advances in Databases and Information Systems, pages 81--95, 2006. Google ScholarDigital Library
B. Bhattacharjee et al. Efficient query processing for multi-dimensionally clustered tables in DB2. In VLDB, pages 963--974, 2003. Google ScholarDigital Library
E. Baralis, S. Paraboschi, and E. Teniente. Materialized views selection in a multidimensional database. In VLDB, volume 97, pages 156--165, 1997. Google ScholarDigital Library
C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3:48--57, 2010. Google ScholarDigital Library
M. R. Garey and D. S. Johnson. Computers and Intractability.1990.Google Scholar
H. Gupta and I. Mumick. Selection of views to materialize under a maintenance cost constraint. In ICDT, pages 453--470, 1999. Google ScholarDigital Library
V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In SIGMOD Record, volume 25, pages 205--216, 1996. Google ScholarDigital Library
S. Idreos, M. L. Kersten, and S. Manegold. Database cracking. In CIDR ,pages 68--78, 2007.Google Scholar
J. Han et al.Frequentpatternmining:currentstatusand future directions. DMKD, 15 (1):55--86, 2007. Google ScholarDigital Library
D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: a survey. IEEE TKDE 16(11),2004. Google ScholarDigital Library
G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning: applications in VLSI domain. IEEE Trans. on VLSI, (1):69--79, 1999. Google ScholarDigital Library
M. Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages2--2, 2012. Google ScholarDigital Library
S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE Trans. on Computational Biology and Bioinformatics, 1 (1): 24--45, 2004. Google ScholarDigital Library
P. Miettinen, T. Mielikainen, A. Gionis, G. Das, and H. Mannila. The discrete basis problem. IEEE TKDE, 20 (10):1348--1362, 2008. Google ScholarDigital Library
G. Moerkotte. Small materialized aggregates: A light weight index structure for data warehousing. In VLDB, pages 476--487, 1998. Google ScholarDigital Library
A. Rajaraman and J. D. Ullman. Mining of massive datasets. Cambridge, 2012. Google ScholarDigital Library
J. Rao, C. Zhang, N. Megiddo, and G. Lohman. Automating physical database design in a parallel database. In SIGMOD, pages 558--569, 2002. Google ScholarDigital Library
S. Agarwal et al. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, pages 29--42, 2013. Google ScholarDigital Library
S. Melnik et al. Dremel: interactive analysis of webale datasets. PVLDB, 3(1--2): 330--339, 2010. Google ScholarDigital Library
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE TPAMI, 22: 888--905, 1997. Google ScholarDigital Library
D. Slézak, J. Wróblewski, V. Eastwood, and P. Synak. Brighthouse: An analytic data warehouse for ad-hoc queries. PVLDB, 1(2):1337--1345, 2008. Google ScholarDigital Library
V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6 (11): 1080--1091, 2013. Google ScholarDigital Library
J. Ward. Hierarchical grouping to optimize an objective function. J. American statistical association, (301): 236--244.Google Scholar
R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale. In SIGMOD, pages 13--24, 2013. Google ScholarDigital Library
T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efficient data clustering method for very large databases. In SIGMOD, pages 103--114, 1996. Google ScholarDigital Library
J. Zhou, N. Bruno, and W. Lin. Advanced partitioning techniques for massively distributed computation. In SIGMOD, pages13--24, 2012. Google ScholarDigital Library

Index Terms

Fine-grained partitioning for aggressive data skipping
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data access methods

Recommendations

Processing Aggregate Queries with Materialized Views in Data Warehouse Environment

Materialized views, which are derived from base relations and stored in the database, offer opportunities for significant performance gain in query evaluation by providing quick access to the pre-computed data. A materialized view can be utilized in ...
Read More
What Can Partitioning Do for Your Data Warehouses and Data Marts?
IDEAS '00: Proceedings of the 2000 International Symposium on Database Engineering & Applications

Efficient query processing is a critical requirement for data warehousing systems as decision support applications often require minimum response times to answer complex, ad-hoc queries having aggregations, multi-ways joins over vast repositories of ...
Read More
A partitioning framework for aggressive data skipping

We propose to demonstrate a fine-grained partitioning framework that reorganizes the data tuples into small blocks at data loading time. The goal is to enable queries to maximally skip scanning data blocks. The partition framework consists of four steps:...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN:9781450323765
DOI:10.1145/2588555
General Chairs:
Curtis Dyreson
Utah State University, USA
,
Feifei Li
University of Utah, USA
,
Program Chair:
M. Tamer Özsu
University of Waterloo, Canada
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 June 2014
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
algorithms
data warehouse
partitioning
query processing
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '14 Paper Acceptance Rate107of421submissions,25%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 78
  Total Citations
  View Citations
- 787
  Total Downloads
- Downloads (Last 12 months)53
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Fine-grained partitioning for aggressive data skipping

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Processing Aggregate Queries with Materialized Views in Data Warehouse Environment

What Can Partitioning Do for Your Data Warehouses and Data Marts?

A partitioning framework for aggressive data skipping