
HetExchange: encapsulating heterogeneous CPU-GPU parallelism in JIT compiled engines

Authors:
Periklis Chrysogelos (EPFL)
Manos Karpathiotakis (Facebook)
Raja Appuswamy (EURECOM)
Anastasia Ailamaki (EPFL and RAW Labs SA)

Proceedings of the VLDB Endowment, Volume 12, Issue 5, pp 544–556, https://doi.org/10.14778/3303753.3303760

Published: 01 January 2019

Abstract

Modern server hardware is increasingly heterogeneous as hardware accelerators, such as GPUs, are used together with multicore CPUs to meet the computational demands of modern data analytics workloads. Unfortunately, query parallelization techniques used by analytical database engines are designed for homogeneous multicore servers, where query plans are parallelized across CPUs to process data stored in cache coherent shared memory. Thus, these techniques are unable to fully exploit available heterogeneous hardware, where one needs to exploit task-parallelism of CPUs and data-parallelism of GPUs for processing data stored in a deep, non-cache-coherent memory hierarchy with widely varying access latencies and bandwidth.

In this paper, we introduce HetExchange, a parallel query execution framework that encapsulates the heterogeneous parallelism of modern multi-CPU-multi-GPU servers and enables the parallelization of (pre-)existing sequential relational operators. In contrast to the interpreted nature of traditional Exchange, HetExchange is designed to be used in conjunction with JIT compiled engines in order to allow a tight integration with the proposed operators and generation of efficient code for heterogeneous hardware. We validate the applicability and efficiency of our design by building a prototype that can operate over both CPUs and GPUs, and enables its operators to be parallelism- and data-location-agnostic. In doing so, we show that efficiently exploiting CPU-GPU parallelism can provide 2.8x and 6.4x improvement in performance compared to state-of-the-art CPU-based and GPU-based DBMS.
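To make the idea of encapsulating parallelism around a sequential operator concrete, here is a minimal sketch of a generic exchange-style wrapper. It is not the authors' implementation: it is interpreted rather than JIT compiled, it uses CPU threads only as a stand-in for heterogeneous devices, and every name in it (`sequential_sum`, `drain`, `exchange_sum`) is hypothetical. It only illustrates that the operator itself can stay unaware of how its input was partitioned.

```python
import queue
import threading


def sequential_sum(blocks):
    """A sequential operator: sums every value it is handed.

    It knows nothing about threads or about where its input came from.
    """
    total = 0
    for block in blocks:
        total += sum(block)
    return total


def drain(q):
    """Yield blocks from a queue until the end-of-stream marker (None)."""
    while True:
        block = q.get()
        if block is None:
            return
        yield block


def exchange_sum(blocks, num_workers=4):
    """Hypothetical exchange-style wrapper (an assumption, not the paper's API).

    Routes blocks round-robin to worker threads, each running the
    unmodified sequential operator, then merges the partial results.
    """
    queues = [queue.Queue() for _ in range(num_workers)]
    partials = [0] * num_workers

    def worker(i):
        partials[i] = sequential_sum(drain(queues[i]))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()

    # Router: distribute input blocks round-robin across the workers.
    for n, block in enumerate(blocks):
        queues[n % num_workers].put(block)

    # Signal end-of-stream to every worker, then wait for them.
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()

    # Merge step: combine the per-worker partial results.
    return sum(partials)
```

Because of CPython's global interpreter lock, this sketch shows the structure of the approach (route, run unmodified operator instances, merge) rather than any speedup; routing policy and device placement are deliberately left out.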

Published in
Proceedings of the VLDB Endowment, Volume 12, Issue 5
January 2019
163 pages
ISSN: 2150-8097
Editors: Lei Chen (HKUST), Fatma Özcan (IBM Research - Almaden)
Publisher: VLDB Endowment
Qualifiers: research-article

