Abstract
Modern server hardware is increasingly heterogeneous as hardware accelerators, such as GPUs, are used together with multicore CPUs to meet the computational demands of modern data analytics work-loads. Unfortunately, query parallelization techniques used by analytical database engines are designed for homogeneous multicore servers, where query plans are parallelized across CPUs to process data stored in cache coherent shared memory. Thus, these techniques are unable to fully exploit available heterogeneous hardware, where one needs to exploit task-parallelism of CPUs and data-parallelism of GPUs for processing data stored in a deep, non-cache-coherent memory hierarchy with widely varying access latencies and bandwidth.
In this paper, we introduce HetExchange-a parallel query execution framework that encapsulates the heterogeneous parallelism of modern multi-CPU-multi-GPU servers and enables the parallelization of (pre-)existing sequential relational operators. In contrast to the interpreted nature of traditional Exchange, HetExchange is designed to be used in conjunction with JIT compiled engines in order to allow a tight integration with the proposed operators and generation of efficient code for heterogeneous hardware. We validate the applicability and efficiency of our design by building a prototype that can operate over both CPUs and GPUs, and enables its operators to be parallelism- and data-location-agnostic. In doing so, we show that efficiently exploiting CPU-GPU parallelism can provide 2.8x and 6.4x improvement in performance compared to state-of-the-art CPU-based and GPU-based DBMS.
- MapD. https://www.mapd.com/.Google Scholar
- R. Appuswamy, M. Karpathiotakis, D. Porobic, and A. Ailamaki. The case for heterogeneous htap. In CIDR, 2017.Google Scholar
- E. Begoli, J. Camacho-Rodríguez, J. Hyde, M. J. Mior, and D. Lemire. Apache calcite: A foundational framework for optimized query processing over heterogeneous data sources. In SIGMOD, pages 221--230. ACM, 2018. Google ScholarDigital Library
- P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-Pipelining Query Execution. In CIDR, 2005.Google Scholar
- S. Breß. The Design and Implementation of CoGaDB: A Column-oriented GPU-accelerated DBMS. Datenbank-Spektrum, 14(3):199--209, 2014.Google ScholarCross Ref
- S. Breß, H. Funke, and J. Teubner. Robust Query Processing in Co-Processor-accelerated Databases. In SIGMOD, pages 1891--1906, 2016. Google ScholarDigital Library
- S. Breß, B. Köcher, H. Funke, S. Zeuch, T. Rabl, and V. Markl. Generating custom code for efficient query execution on heterogeneous processors. The VLDB Journal, 27(6):797--822, Dec. 2018. Google ScholarDigital Library
- J. Cheng, M. Grossman, and T. McKercher. Professional Cuda C Programming. John Wiley & Sons, 2014. Google ScholarDigital Library
- P. Chrysogelos, P. Sioulas, and A. Ailamaki. Hardware-conscious query processing in gpu-accelerated analytical engines. In CIDR, 2019.Google Scholar
- H. Funke, S. Breß, S. Noll, V. Markl, and J. Teubner. Pipelined query processing in coprocessor environments. In SIGMOD, pages 1603--1618, 2018. Google ScholarDigital Library
- G. Graefe. Encapsulation of parallelism in the volcano query processing system. In SIGMOD, pages 102--111, 1990. Google ScholarDigital Library
- G. Graefe and W. J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In ICDE. IEEE Computer Society, 1993. Google ScholarDigital Library
- K. O. W. Group. The OpenCL Specification. https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf.Google Scholar
- B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational Query Coprocessing on Graphics Processors. TODS, 34(4):21:1--21:39, 2009. Google ScholarDigital Library
- M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. PVLDB, 6(9):709--720, 2013. Google ScholarDigital Library
- T. Kaldewey, G. Lohman, R. Mueller, and P. Volk. Gpu join processing revisited. In Proceedings of the Eighth International Workshop on Data Management on New Hardware, pages 55--62. ACM, 2012. Google ScholarDigital Library
- T. Karnagel, D. Habich, and W. Lehner. Adaptive Work Placement for Query Processing on Heterogeneous Computing Resources. PVLDB, 10(7):733--744, 2017. Google ScholarDigital Library
- M. Karpathiotakis, I. Alagiannis, and A. Ailamaki. Fast Queries Over Heterogeneous Data Through Engine Customization. PVLDB, 9(12):972--983, 2016. Google ScholarDigital Library
- A. Kemper and T. Neumann. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In ICDE, 2011. Google ScholarDigital Library
- T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, and P. Boncz. Everything you always wanted to know about compiled and vectorized queries but were afraid to ask. PVLDB, 11(13):2209--2222, 2018. Google ScholarDigital Library
- C. Lameter. An overview of non-uniform memory access. Communications of the ACM, 56(9):59--54, 2013. Google ScholarDigital Library
- V. Leis, P. A. Boncz, A. Kemper, and T. Neumann. Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age. In SIGMOD, pages 743--754, 2014. Google ScholarDigital Library
- P. Menon, T. C. Mowry, and A. Pavlo. Relaxed operator fusion for in-memory databases: Making compilation, vectorization, and prefetching work together at last. PVLDB, 11(1):1--13, 2017. Google ScholarDigital Library
- T. Neumann. Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB, 4(9):539--550, 2011. Google ScholarDigital Library
- NVIDIA. Parallel Thread Execution ISA. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html.Google Scholar
- P. E. O'Neil, E. J. O'Neil, X. Chen, and S. Revilak. The Star Schema Benchmark and Augmented Fact Table Indexing. In TPCTC, pages 237--252, 2009. Google ScholarDigital Library
- S. Padmanabhan, T. Malkemus, R. C. Agarwal, and A. Jhingran. Block oriented processing of relational database operations in modern computer architectures. In ICDE, pages 567--574, 2001. Google ScholarDigital Library
- J. Paul, J. He, and B. He. GPL: A GPU-based Pipelined Query Processing Engine. In SIGMOD, pages 1935--1950, 2016. Google ScholarDigital Library
- H. Pirk, O. Moll, M. Zaharia, and S. Madden. Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware. PVLDB, 9(14):1707--1718, 2016. Google ScholarDigital Library
- R. Rui and Y. Tu. Fast Equi-Join Algorithms on GPUs: Design and Implementation. In SSDBM, pages 17:1--17:12, 2017. Google ScholarDigital Library
- P. Sioulas, P. Chrysogelos, M. Karpathiotakis, R. Appuswamy, and A. Ailamaki. Hardware-conscious Joins on GPUs. In ICDE, 2019.Google ScholarCross Ref
- J. Sompolski, M. Zukowski, and P. Boncz. Vectorization vs. compilation in query execution. In Proceedings of the Seventh International Workshop on Data Management on New Hardware, pages 33--40. ACM, 2011. Google ScholarDigital Library
- E. Stehle and H.-A. Jacobsen. A memory bandwidth-efficient hybrid radix sort on gpus. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 417--432. ACM, 2017. Google ScholarDigital Library
- L. Wang, M. Zhou, Z. Zhang, Y. Yang, A. Zhou, and D. Bitton. Elastic pipelining in an in-memory database cluster. In SIGMOD. ACM, 2016. Google ScholarDigital Library
- H. Wu, G. F. Diamos, S. Cadambi, and S. Yalamanchili. Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. In MICRO, pages 107--118, 2012. Google ScholarDigital Library
- Y. Yuan, R. Lee, and X. Zhang. The Yin and Yang of Processing Data Warehousing Queries on GPU Devices. PVLDB, 6(10):817--828, 2013. Google ScholarDigital Library
Recommendations
Evaluation of Rodinia Codes on Intel Xeon Phi
ISMS '13: Proceedings of the 2013 4th International Conference on Intelligent Systems, Modelling and SimulationHigh performance computing (HPC) is a niche area where various parallel benchmarks are constantly used to explore and evaluate the performance of Heterogeneous computing systems on the horizon. The Rodinia benchmark suite, a collection of parallel ...
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance ComputingThe graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...
MIC acceleration of short-range molecular dynamics simulations
COSMIC '13: Proceedings of the First International Workshop on Code OptimiSation for MultI and many CoresHeterogeneous systems containing accelerators such as GPUs or co-processors such as Intel MIC are becoming more prevalent due to their ability of exploiting large-scale parallelism in applications. In this paper, we have developed a hierarchical ...
Comments