Abstract
Non-uniform memory access (NUMA) architectures pose numerous performance challenges for main-memory column-stores in scaling up analytics on modern multi-socket multi-core servers. A NUMA-aware execution engine needs a strategy for data placement and task scheduling that prefers fast local memory accesses over remote memory accesses, and avoids an imbalance of resource utilization, both CPU and memory bandwidth, across sockets. State-of-the-art systems typically use a static strategy that always partitions data across sockets, and always allows inter-socket task stealing.
In this paper, we show that adapting data placement and task stealing to the workload can improve throughput by up to a factor of 4 compared to a static approach. We focus on highly concurrent workloads dominated by operators working on a single table or table group (copartitioned tables). Our adaptive data placement algorithm tracks the resource utilization of tasks, partitions of tables and table groups, and sockets. When a utilization imbalance across sockets is detected, the algorithm corrects it by moving or repartitioning tables. Also, inter-socket task stealing is dynamically disabled for memory-intensive tasks that could otherwise hurt performance.
- SAP HANA Platform SPS 11 Administration Guide, Dec. 2015. http://help.sap.com/hana_platform.Google Scholar
- SAP HANA Data Distribution Optimizer Administration Guide, Mar. 2016. http://help.sap.com/hana_options_dwf.Google Scholar
- TPC Benchmark H Rev. 2.17.1, 2016. http://www.tpc.org/.Google Scholar
- S. Agrawal et al. Database Tuning Advisor for Microsoft SQL Server 2005. In VLDB, pp. 1110--1121, 2004.Google ScholarCross Ref
- M. Albutiu et al. Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems. PVLDB, 5(10):1064--1075, 2012. Google ScholarDigital Library
- C. Balkesen et al. Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. PVLDB, 7(1):85--96, 2013. Google ScholarDigital Library
- S. Blagodurov et al. A Case for NUMA-aware Contention Management on Multicore Systems. In USENIX, 2011. Google ScholarDigital Library
- M. Dashti et al. Traffic management: A holistic approach to memory placement on NUMA systems. In ASPLOS, pp. 381--394, 2013. Google ScholarDigital Library
- R. Dementiev et al. Intel Performance Counter Monitor, Mar. 2016. https://software.intel.com/articles/intel-performance-counter-monitor.Google Scholar
- K. P. Eswaran. Placement of records in a file and file allocation in a computer network. In Information Processing, pp. 304--307, 1974.Google Scholar
- F. Färber et al. The SAP HANA Database --- An Architecture Overview. IEEE Data Eng. Bull., 35(1):28--33, 2012.Google Scholar
- J. Giceva et al. Deployment of Query Plans on Multicores. PVLDB, 8(3):233--244, 2014. Google ScholarDigital Library
- L. Golab et al. Distributed data placement to minimize communication costs via graph partitioning. In SSDBM, pp. 20:1--20:12, 2014. Google ScholarDigital Library
- T. Gubner. Achieving many-core scalability in Vectorwise. 2014. Master's thesis. TU Ilmenau.Google Scholar
- G. Hill and A. Ross. Reducing outer joins. The VLDB Journal, 18(3):599--610, Aug. 2008. Google ScholarDigital Library
- T. Kissinger et al. ERIS: A NUMA-Aware In-Memory Storage Engine for Analytical Workloads. ADMS, pp. 74--85, 2014.Google Scholar
- C. Lameter et al. NUMA (Non-Uniform Memory Access): An Overview. ACM Queue, 11(7):40:40--40:51, 2013. Google ScholarDigital Library
- H. Lang et al. Massively Parallel NUMA-aware Hash Joins. In IMDM, pp. 1--12, 2013.Google Scholar
- P.-A. Larson et al. Enhancements to SQL server column stores. In SIGMOD, pp. 1159--1168, 2013. Google ScholarDigital Library
- V. Leis et al. Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age. In SIGMOD, pp. 743--754, 2014. Google ScholarDigital Library
- C. Lemke et al. Speeding up queries in column stores: a case for compression. In DaWaK, pp. 117--129, 2010. Google ScholarDigital Library
- Y. Li et al. NUMA-aware algorithms: the case of data shuffling. In CIDR, 2013.Google Scholar
- N. Mukherjee et al. Distributed Architecture of Oracle Database In-memory. In PVLDB, volume 8, pp. 1630--1641, 2015. Google ScholarDigital Library
- I. Müller et al. Cache-Efficient Aggregation: Hashing Is Sorting. In SIGMOD, pp. 1123--1136, 2015. Google ScholarDigital Library
- M. T. Özsu et al. Principles of Distributed Database Systems, Third Edition. Springer, 2011. Google ScholarDigital Library
- D. Porobic et al. ATraPos: Adaptive transaction processing on hardware Islands. In ICDE, pp. 688--699, 2014.Google ScholarCross Ref
- I. Psaroudakis et al. Task Scheduling for Highly Concurrent Analytical and Transactional Main-Memory Workloads. In ADMS, pp. 36--45, 2013.Google Scholar
- I. Psaroudakis et al. Scaling Up Concurrent Main-memory Column-store Scans: Towards Adaptive NUMA-aware Data and Task Placement. PVLDB, 8(12):1442--1453, 2015. Google ScholarDigital Library
- I. Psaroudakis et al. Scaling Up Mixed Workloads: A Battle of Data Freshness, Flexibility, and Scheduling. In TPCTC, pp. 97--112, 2015.Google ScholarCross Ref
- V. Raman et al. DB2 with BLU Acceleration: So much more than just a column store. PVLDB, 6(11):1080--1091, 2013. Google ScholarDigital Library
- J. Rao et al. Automating Physical Database Design in a Parallel Database. In SIGMOD, pp. 558--569, 2002. Google ScholarDigital Library
- O. Steinau et al. Method for calculating distributed joins in main memory with minimal communicaton overhead. US Patent App. 11/018,697.Google Scholar
- F. Transier et al. Aggregation in parallel computation environments with shared memory, 2012. US Patent App. 12/978,194.Google Scholar
- V. Viswanathan et al. Intel Memory Latency Checker v3.0, Mar. 2016. https://software.intel.com/articles/intelr-memory-latency-checker.Google Scholar
- M. Wagle et al. NUMA-Aware Memory Management with In-Memory Databases. In TPCTC, pp. 45--60, 2015.Google Scholar
- T. Willhalm et al. Vectorizing database column scans with complex predicates. In ADMS, pp. 1--12, 2013.Google Scholar
- Y. Ye et al. Scalable aggregation on multicore processors. DaMoN, 2011. Google ScholarDigital Library
- E. Zamanian et al. Locality-aware Partitioning in Parallel Database Systems. In SIGMOD, pp. 17--30, 2015. Google ScholarDigital Library
- M. Zukowski et al. Vectorwise: Beyond Column Stores. IEEE Data Eng. Bull., 35(1):21--27, 2012.Google Scholar
Index Terms
- Adaptive NUMA-aware data placement and task scheduling for analytical workloads in main-memory column-stores
Recommendations
NUMA-aware scheduling and memory allocation for data-flow task-parallel applications
PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingDynamic task parallelism is a popular programming model on shared-memory systems. Compared to data parallel loop-based concurrency, it promises enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform ...
NUMA-aware scheduling and memory allocation for data-flow task-parallel applications
PPoPP '16Dynamic task parallelism is a popular programming model on shared-memory systems. Compared to data parallel loop-based concurrency, it promises enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform ...
Hardware-oblivious parallelism for in-memory column-stores
The multi-core architectures of today's computer systems make parallelism a necessity for performance critical applications. Writing such applications in a generic, hardware-oblivious manner is a challenging problem: Current database systems thus rely ...
Comments