HetExchange: encapsulating heterogeneous CPU-GPU parallelism in JIT compiled engines

Published: 01 January 2019

Abstract

Modern server hardware is increasingly heterogeneous as hardware accelerators, such as GPUs, are used together with multicore CPUs to meet the computational demands of modern data analytics workloads. Unfortunately, query parallelization techniques used by analytical database engines are designed for homogeneous multicore servers, where query plans are parallelized across CPUs to process data stored in cache coherent shared memory. Thus, these techniques are unable to fully exploit available heterogeneous hardware, where one needs to exploit task-parallelism of CPUs and data-parallelism of GPUs for processing data stored in a deep, non-cache-coherent memory hierarchy with widely varying access latencies and bandwidth.

In this paper, we introduce HetExchange, a parallel query execution framework that encapsulates the heterogeneous parallelism of modern multi-CPU-multi-GPU servers and enables the parallelization of (pre-)existing sequential relational operators. In contrast to the interpreted nature of traditional Exchange, HetExchange is designed to be used in conjunction with JIT compiled engines in order to allow a tight integration with the proposed operators and generation of efficient code for heterogeneous hardware. We validate the applicability and efficiency of our design by building a prototype that can operate over both CPUs and GPUs, and enables its operators to be parallelism- and data-location-agnostic. In doing so, we show that efficiently exploiting CPU-GPU parallelism can provide 2.8x and 6.4x improvement in performance compared to state-of-the-art CPU-based and GPU-based DBMS.
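
To make the idea of encapsulated parallelism concrete, the sketch below shows a classic Exchange-style router in Python. It is an illustration only, not HetExchange's design or code: the names `sequential_sum_filter`, `exchange`, and `run_demo` are hypothetical, the worker labels `cpu0`, `cpu1`, and `gpu0` are placeholders, and every worker runs as an ordinary Python thread with no GPU execution or JIT compilation involved. What it does show is that the relational operator stays sequential and unaware of how many workers exist or where its data came from.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def sequential_sum_filter(blocks, threshold):
    """A sequential operator pipeline (filter, then sum).

    It has no knowledge of parallelism or of the device it runs on.
    """
    total = 0
    for block in blocks:
        total += sum(v for v in block if v > threshold)
    return total


def exchange(blocks, workers, operator, combine):
    """Exchange-style router (illustrative assumption, not the paper's API).

    Partitions blocks round-robin across workers, runs the unmodified
    sequential operator on each partition, and combines partial results.
    """
    partitions = {w: [] for w in workers}
    for worker, block in zip(cycle(workers), blocks):
        partitions[worker].append(block)
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(operator, partitions[w]) for w in workers]
        return combine(f.result() for f in futures)


def run_demo():
    data = list(range(1, 101))
    blocks = [data[i:i + 10] for i in range(0, len(data), 10)]
    # Hypothetical device labels; all of them are plain threads here.
    workers = ["cpu0", "cpu1", "gpu0"]
    parallel = exchange(
        blocks, workers, lambda bs: sequential_sum_filter(bs, 50), sum
    )
    serial = sequential_sum_filter(blocks, 50)
    return parallel, serial
```

Calling `run_demo()` returns the parallel and the serial result as a pair; the two agree because the router only changes how blocks are distributed, not what the operator computes.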

  • Published in

    Proceedings of the VLDB Endowment, Volume 12, Issue 5
    January 2019
    163 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment

    Qualifiers

    • research-article