Abstract
Today's data deluge enables organizations to collect massive data, and analyze it with an ever-increasing number of concurrent queries. Traditional data warehouses (DW) face a challenging problem in executing this task, due to their query-centric model: each query is optimized and executed independently. This model results in high contention for resources. Thus, modern DW depart from the query-centric model to execution models involving sharing of common data and work. Our goal is to show when and how a DW should employ sharing. We evaluate experimentally two sharing methodologies, based on their original prototype systems, that exploit work sharing opportunities among concurrent queries at run-time: Simultaneous Pipelining (SP), which shares intermediate results of common sub-plans, and Global Query Plans (GQP), which build and evaluate a single query plan with shared operators.
First, after a short review of sharing methodologies, we show that SP and GQP are orthogonal techniques. SP can be applied to shared operators of a GQP, reducing response times by 20%-48% in workloads with numerous common sub-plans. Second, we corroborate previous results on the negative impact of SP on performance for cases of low concurrency. We attribute this behavior to a bottleneck caused by the push-based communication model of SP. We show that pull-based communication for SP eliminates the overhead of sharing altogether for low concurrency, and scales better on multi-core machines than push-based SP, further reducing response times by 82%-86% for high concurrency. Third, we perform an experimental analysis of SP, GQP and their combination, and show when each one is beneficial. We identify a trade-off between low and high concurrency. In the former case, traditional query-centric operators with SP perform better, while in the latter case, GQP with shared operators enhanced by SP give the best results.
- TPC-H Benchmark: Standard Specification, Revision 2.14.3.Google Scholar
- S. Arumugam et al. The DataPath system: a data-centric analytic processing engine for large data warehouses. In Proc. of the 2010 ACM SIGMOD Int'l Conf. on Management of Data, pages 519-530, 2010. Google Scholar
- G. Candea et al. A scalable, predictable join operator for highly concurrent data warehouses. Proc. of the VLDB Endowment, 2(1):277-288, 2009. Google Scholar
- G. Candea et al. Predictable performance and high query concurrency for data analytics. The Int'l Journal on Very Large Data Bases, 20(2):227-248, 2011. Google Scholar
- H.-T. Chou et al. An evaluation of buffer management strategies for relational database systems. In Proc. of the 11th Int'l Conf. on Very Large Data Bases, pages 127-141, 1985. Google Scholar
- J. Cieslewicz et al. Adaptive aggregation on chip multiprocessors. In Proc. of the 33rd Int'l Conf. on Very Large Data Bases, pages 339-350, 2007. Google Scholar
- L. Colby et al. Red brick vista™: aggregate computation and management. In Proc. of the 14th Int'l Conf. on Data Engineering, pages 174-177, 1998. Google Scholar
- C. Cook. Database Architecture: The Storage Engine, 2001. http://msdn.microsoft.com/library/aa902689(v=sql.80).aspx.Google Scholar
- N. N. Dalvi et al. Pipelining in multi-query optimization. In Proc. of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Databases, pages 59-70, 2001. Google Scholar
- J. Dean et al. MapReduce: Simplified data processing on large clusters. Communications ACM, 51(1):107-113, 2008. Google Scholar
- G. Giannikis et al. SharedDB: killing one thousand queries with one stone. Proc. of the VLDB Endowment, 5(6):526-537, 2012. Google Scholar
- S. Harizopoulos et al. A case for staged database systems. In Proc. of the 2003 Conf. on Innovative Data Systems Research, 2003.Google Scholar
- S. Harizopoulos et al. QPipe: a simultaneously pipelined relational query engine. In Proc. of the 2005 ACM SIGMOD Int'l Conf. on Management of Data, pages 383-394, 2005. Google Scholar
- R. Johnson et al. To share or not to share? In Proc. of the 33rd Int'l Conf. on Very Large Data Bases, pages 351-362, 2007. Google Scholar
- R. Johnson et al. Shore-MT: a scalable storage manager for the multicore era. In Proc. of the 12th Int'l Conf. on Extending Database Technology: Advances in Database Technology, pages 24-35, 2009. Google Scholar
- T. Johnson et al. 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. In Proc. of the 20th Int'l Conf. on Very Large Data Bases, pages 439-450, 1994. Google Scholar
- R. Kimball et al. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. John Wiley & Sons, Inc., 2nd edition, 2002. Google Scholar
- C. Lang et al. Increasing Buffer-Locality for Multiple Relational Table Scans through Grouping and Throttling. In Proc. of the 23rd Int'l Conf. on Data Engineering, pages 1136-1145, 2007.Google Scholar
- N. Megiddo et al. ARC: A Self-Tuning, Low Overhead Replacement Cache. In Proc. of the 2nd USENIX Conf. on File and Storage Technologies, pages 115-130, 2003. Google Scholar
- M. Mehta et al. Batch Scheduling in Parallel Database Systems. In Proc. of the 9th Int'l Conf. on Data Engineering, pages 400-410, 1993. Google Scholar
- P. O. Neil et al. Star Schema Benchmark. 2009.Google Scholar
- E. J. O'Neil et al. The LRU-K page replacement algorithm for database disk buffering. In Proc. of the 1993 ACM SIGMOD Int'l Conf. on Management of Data, pages 297-306, 1993. Google Scholar
- L. Qiao et al. Main-memory scan sharing for multicore cpus. Proc. of the VLDB Endowment, 1(1):610-621, 2008. Google Scholar
- N. Roussopoulos. View indexing in relational databases. ACM Trans. Database Syst., 7(2):258-290, 1982. Google Scholar
- P. Roy et al. Efficient and extensible algorithms for multi query optimization. In Proc. of the 2000 ACM SIGMOD Int'l Conf. on Management of Data, pages 249-260, 2000. Google Scholar
- P. Russom. High-Performance Data Warehousing. TDWI, 2012. http://tdwi.org/research/2012/10/tdwi-best-practices-report-high-performance-data-warehousing.aspx.Google Scholar
- T. K. Sellis. Multiple-query optimization. ACM Trans. Database Syst., 13(1):23-52, 1988. Google Scholar
- J. Shim et al. Dynamic Caching of Query Results for Decision Support Systems. In Proc. of the 11th Int'l Conf. on Scientific and Statistical Database Management, pages 254-263, 1999. Google Scholar
- P. Unterbrunner et al. Predictable performance for unpredictable workloads. Proc. of the VLDB Endowment, 2(1):706-717, 2009. Google Scholar
- M. Zukowski et al. Cooperative scans: dynamic bandwidth sharing in a DBMS. In Proc. of the 33rd Int'l Conf. on Very Large Data Bases, pages 723-734, 2007. Google Scholar
Index Terms
- Sharing data and work across concurrent analytical queries
Recommendations
Reactive and proactive sharing across concurrent analytical queries
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of DataToday an ever increasing amount of data is collected and analyzed by researchers, businesses, and scientists in data warehouses (DW). In addition to the data size, the number of users and applications querying data grows exponentially. The increasing ...
Lightweight annotations for controlling sharing in concurrent data structures
PLDI '09: Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and ImplementationSharC is a recently developed system for checking data-sharing in multithreaded programs. Programmers specify sharing rules (read-only, protected by a lock, etc.) for individual objects, and the SharC compiler enforces these rules using static and ...
Lightweight annotations for controlling sharing in concurrent data structures
PLDI '09SharC is a recently developed system for checking data-sharing in multithreaded programs. Programmers specify sharing rules (read-only, protected by a lock, etc.) for individual objects, and the SharC compiler enforces these rules using static and ...
Comments