ABSTRACT
In recent years, Massively Parallel Processors (MPPs) have gained ground enabling vast amounts of data processing. In such environments, data is partitioned across multiple compute nodes, which results in dramatic performance improvements during parallel query execution. To evaluate certain relational operators in a query correctly, data sometimes needs to be re-partitioned (i.e., moved) across compute nodes. Since data movement operations are much more expensive than relational operations, it is crucial to design a suitable data partitioning strategy that minimizes the cost of such expensive data transfers. A good partitioning strategy strongly depends on how the parallel system would be used. In this paper we present a partitioning advisor that recommends the best partitioning design for an expected workload. Our tool recommends which tables should be replicated (i.e., copied into every compute node) and which ones should be distributed according to specific column(s) so that the cost of evaluating similar workloads is minimized. In contrast to previous work, our techniques are deeply integrated with the underlying parallel query optimizer, which results in more accurate recommendations in a shorter amount of time. Our experimental evaluation using a real MPP system, Microsoft SQL Server 2008 Parallel Data Warehouse, with both real and synthetic workloads shows the effectiveness of the proposed techniques and the importance of deep integration of the partitioning advisor with the underlying query optimizer.
- Aster Data. http://www.asterdata.com/.Google Scholar
- Greenplum. http://www.greenplum.com/.Google Scholar
- Netezza. http://www.netezza.com/.Google Scholar
- SQL Server 2008 R2 Parallel Data Warehouse. http://www.microsoft.com/sqlserver/2008/en/us/parallel-data-warehouse.aspx.Google Scholar
- The Vertica Analytic Database Technical Overview White Paper. http://www.vertica-solutions.com/files/10986/verticaarchitecturewhit%epaper.pdf.Google Scholar
- A. Abouzeid et.al. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., 2(1):922--933, 2009. Google ScholarDigital Library
- A. Kiran et.al. Two techniques for on-line index modification in shared nothing parallel databases. In SIGMOD, 1996. Google ScholarDigital Library
- S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated selection of materialized views and indexes in sql databases. In VLDB, pages 496--505, 2000. Google ScholarDigital Library
- S. Agrawal, V. R. Narasayya, and B. Yang. Integrating vertical and horizontal partitioning into automated physical database design. In SIGMOD, pages 359--370, 2004. Google ScholarDigital Library
- N. Bruno and R. V. Nehme. Configuration-parametric query optimization for physical design tuning. In SIGMOD, 2008. Google ScholarDigital Library
- C. Curino et.al. Schism: a workload-driven approach to database replication and partitioning. In VLDB, 2010. Google ScholarDigital Library
- S. Chaudhuri and V. R. Narasayya. An efficient cost-driven index selection tool for Microsoft SQL Server. In VLDB, 1997. Google ScholarDigital Library
- S. Chaudhuri and V. R. Narasayya. AutoAdmin 'What-if' index analysis utility. In SIGMOD, pages 367--378, 1998. Google ScholarDigital Library
- D. C. Zilio et.al. DB2 design advisor: Integrated automatic physical database design. In VLDB, pages 1087--1097, 2004. Google ScholarDigital Library
- D. DeWitt and J. Gray. Parallel database systems: the future of high perf. database systems. Com. ACM, 35(6):85--98, 1992. Google ScholarDigital Library
- G. Cong et.al. Distributed query evaluation with performance guarantees. In SIGMOD, pages 509--520, 2007. Google ScholarDigital Library
- Ganguly, Sumit et.al. Efficient and accurate cost models for parallel query optimization. In PODS, pages 172--181, 1996. Google ScholarDigital Library
- G. Graefe. The Cascades framework for query optimization. IEEE Data Eng. Bull., 18(3):19--29, 1995.Google Scholar
- G. Graefe and W. J. McKenna. The Volcano optimizer generator: Extensibility and efficient search. In ICDE, 1993. Google ScholarDigital Library
- Isard, Michael et.al. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, pages 59--72, 2007. Google ScholarDigital Library
- W. H. Kohler and K. Steiglitz. Characterization and theoretical comparison of branch-and-bound algorithms for permutation problems. J. ACM, 21(1):140--156, 1974. Google ScholarDigital Library
- T.-H. Lai and A. P. Sprague. Performance of parallel branch-and-bound algorithms. In ICPP, pages 194--201, 1985.Google ScholarCross Ref
- P. G. Selinger et.al. Access path selection in a relational database management system. In SIGMOD, pages 23--34, 1979. Google ScholarDigital Library
- R. Chaiken et.al. Scope: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow., 1(2):1265--1276, 2008. Google ScholarDigital Library
- J. Rao, C. Zhang, N. Megiddo, and G. M. Lohman. Automating physical database design in a parallel database. In SIGMOD, pages 558--569, 2002. Google ScholarDigital Library
- S. Ghandeharizadeh et.al. Hybrid-range partitioning strategy: A new declustering strategy for multiprocessor database machines. In VLDB, pages 481--492, 1990. Google ScholarDigital Library
- S. Papadomanolakis et.al. Efficient use of the query optimizer for automated database design. In VLDB, 2007. Google ScholarDigital Library
- M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. Map-Reduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64--71, 2010. Google ScholarDigital Library
- T. Stöhr et.al. Multi-dimensional database allocation for parallel data warehouses. In VLDB, 2000. Google ScholarDigital Library
- Y. Hung-Chih et.al. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD, 2007. Google ScholarDigital Library
- D. C. Zilio. Physical Database Design Decision Algorithms and Concurrent Reorganization for Parallel Database Systems. PhD thesis, University of Toronto, 1998. Google ScholarDigital Library
Index Terms
- Automated partitioning design in parallel database systems
Recommendations
Horizontal partitioning method for test verification in parallel database systems
In parallel database systems the partitioning methods considered in current researches are static. This research paper presents a partitioning method to divide the database relations into dynamic horizontal partitions. Every partition contains some ...
A load balancing approach for parallel database machines
PDP '95: Proceedings of the 3rd Euromicro Workshop on Parallel and Distributed ProcessingParallel database systems have become a major tool for high performance information processing. These systems require efficient load balancing approaches to partition each relation and to allocate them to the parallel architecture. If the database is ...
Comments