skip to main content
10.1145/2588555.2610515acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Fine-grained partitioning for aggressive data skipping

Published:18 June 2014Publication History

ABSTRACT

Modern query engines are increasingly being required to process enormous datasets in near real-time. While much can be done to speed up the data access, a promising technique is to reduce the need to access data through data skipping. By maintaining some metadata for each block of tuples, a query may skip a data block if the metadata indicates that the block does not contain relevant data. The effectiveness of data skipping, however, depends on how well the blocking scheme matches the query filters.

In this paper, we propose a fine-grained blocking technique that reorganizes the data tuples into blocks with a goal of enabling queries to skip blocks aggressively. We first extract representative filters in a workload as features using frequent itemset mining. Based on these features, each data tuple can be represented as a feature vector. We then formulate the blocking problem as a optimization problem on the feature vectors, called Balanced MaxSkip Partitioning, which we prove is NP-hard. To find an approximate solution efficiently, we adopt the bottom-up clustering framework. We prototyped our blocking techniques on Shark, an open-source data warehouse system. Our experiments on TPC-H and a real-world workload show that our blocking technique leads to 2-5x improvement in query response time over traditional range-based blocking techniques.

References

  1. IBM Netezza. http://www.netezza.com.Google ScholarGoogle Scholar
  2. Impala. https://github.com/cloudera/impala.Google ScholarGoogle Scholar
  3. Oracle. http://docs.oracle.com/.Google ScholarGoogle Scholar
  4. Postgres. http://www.postgresql.org/docs/.Google ScholarGoogle Scholar
  5. H. Koga et al. Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing. Knowledge and Information Systems, 12(1), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Hall et al. Processing a trillion cells per mouse click. PVLDB, 5(11):1436--1446 ,2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Lamb et al. The Vertica analytic database: C-Store 7 years later. VLDB, 5(12):1790--1801, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated selection of materialized views and indexes in SQL databases. In VLDB, pages 496--505,2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Agrawal, V. Narasayya, and B. Yang. Integrating vertical and horizontal partitioning into automated physical database design. In SIGMOD, pages 359--370, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. K. Alexiou, D. Kossmann, and P.-A. Larson. Adaptive range filters for cold data: Avoiding trips to Siberia. PVLDB, 6 (14):1714--1725, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of euclidean sum-of-squares clustering. Machine Learning, 75(2): 245--248, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. K. Aouiche, P.-E. Jouve, and J. Darmont. Clustering-based materialized view selection in data warehouses. In Advances in Databases and Information Systems, pages 81--95, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. B. Bhattacharjee et al. Efficient query processing for multi-dimensionally clustered tables in DB2. In VLDB, pages 963--974, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. E. Baralis, S. Paraboschi, and E. Teniente. Materialized views selection in a multidimensional database. In VLDB, volume 97, pages 156--165, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3:48--57, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. M. R. Garey and D. S. Johnson. Computers and Intractability.1990.Google ScholarGoogle Scholar
  19. H. Gupta and I. Mumick. Selection of views to materialize under a maintenance cost constraint. In ICDT, pages 453--470, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In SIGMOD Record, volume 25, pages 205--216, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. S. Idreos, M. L. Kersten, and S. Manegold. Database cracking. In CIDR ,pages 68--78, 2007.Google ScholarGoogle Scholar
  22. J. Han et al.Frequentpatternmining:currentstatusand future directions. DMKD, 15 (1):55--86, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: a survey. IEEE TKDE 16(11),2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning: applications in VLSI domain. IEEE Trans. on VLSI, (1):69--79, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages2--2, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE Trans. on Computational Biology and Bioinformatics, 1 (1): 24--45, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. P. Miettinen, T. Mielikainen, A. Gionis, G. Das, and H. Mannila. The discrete basis problem. IEEE TKDE, 20 (10):1348--1362, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. G. Moerkotte. Small materialized aggregates: A light weight index structure for data warehousing. In VLDB, pages 476--487, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. A. Rajaraman and J. D. Ullman. Mining of massive datasets. Cambridge, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. J. Rao, C. Zhang, N. Megiddo, and G. Lohman. Automating physical database design in a parallel database. In SIGMOD, pages 558--569, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. S. Agarwal et al. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, pages 29--42, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. S. Melnik et al. Dremel: interactive analysis of webale datasets. PVLDB, 3(1--2): 330--339, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE TPAMI, 22: 888--905, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. D. Slézak, J. Wróblewski, V. Eastwood, and P. Synak. Brighthouse: An analytic data warehouse for ad-hoc queries. PVLDB, 1(2):1337--1345, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6 (11): 1080--1091, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. J. Ward. Hierarchical grouping to optimize an objective function. J. American statistical association, (301): 236--244.Google ScholarGoogle Scholar
  37. R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale. In SIGMOD, pages 13--24, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efficient data clustering method for very large databases. In SIGMOD, pages 103--114, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. J. Zhou, N. Bruno, and W. Lin. Advanced partitioning techniques for massively distributed computation. In SIGMOD, pages13--24, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Fine-grained partitioning for aggressive data skipping

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
      June 2014
      1645 pages
      ISBN:9781450323765
      DOI:10.1145/2588555

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 18 June 2014

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      SIGMOD '14 Paper Acceptance Rate107of421submissions,25%Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader