survey

Query Processing on Heterogeneous CPU/GPU Systems

Authors:
Viktor Rosenfeld, Technische Universität Berlin, Berlin, Germany (ORCID: 0000-0002-6001-4442)
Sebastian Breß, Snowflake Inc., Berlin, Germany
Volker Markl, Technische Universität Berlin, Berlin, Germany

ACM Computing Surveys, Volume 55, Issue 1, Article No. 11, pp. 1–38. https://doi.org/10.1145/3485126

Published: 17 January 2022


Abstract

Due to their high computational power and internal memory bandwidth, graphics processing units (GPUs) have been extensively studied by the database systems research community. A heterogeneous query processing system that employs CPUs and GPUs at the same time has to solve many challenges, including how to distribute the workload across processors with different capabilities, how to overcome the data transfer bottleneck, and how to support implementations for multiple processors efficiently. In this survey, we devise a classification scheme to categorize techniques developed to address these challenges. Based on this scheme, we categorize query processing systems on heterogeneous CPU/GPU systems and identify open research problems.

Supplemental Material

rosenfeld.zip (94.1 KB): supplemental movie, appendix, image, and software files for Query Processing on Heterogeneous CPU/GPU Systems

[141] Yuan Yuan, Lee Rubao, and Zhang Xiaodong. 2013. The yin and yang of processing data warehousing queries on GPU devices. Proc. VLDB Endow. 6, 10 (2013), 817–828. DOI: DOI: https://doi.org/10.14778/2536206.2536210 Google ScholarCross Ref
[142] Zacharatou Eleni Tzirita, Doraiswamy Harish, Ailamaki Anastasia, Silva Cláudio T., and Freire Juliana. 2017. GPU rasterization for real-time spatial aggregation over arbitrary polygons. Proc. VLDB Endow 11, 3 (2017), 352–365. DOI: DOI: https://doi.org/10.14778/3157794.3157803 Google ScholarCross Ref
[143] Zeller Cyril, Fernando Randy, Wloka Matthias, and Harris Mark. 2004. Programming graphics hardware. In Proc. of Eurographics’04 Tutorials. Eurographics Association. DOI: DOI: https://doi.org/10.2312/egt.20041034Google Scholar
[144] Zhang Bowen, Shen Yanyan, Zhu Yanmin, and Yu Jiadi. 2018. A GPU-accelerated framework for processing trajectory queries. In Proc. of IEEE ICDE’18. 1037–1048. DOI: DOI: https://doi.org/10.1109/ICDE.2018.00097Google Scholar
[145] Zhang Feng, Yang Lin, Zhang Shuhao, He Bingsheng, Lu Wei, and Du Xiaoyong. 2020. FineStream: Fine-grained window-based stream processing on CPU-GPU integrated architectures. In Proc. of USENIX ATC’20. USENIX Association, 633–647. https://www.usenix.org/conference/atc20/presentation/zhang-feng. Google ScholarDigital Library
[146] Zhang Kai, Hu Jiayu, He Bingsheng, and Hua Bei. 2017. DIDO: Dynamic pipelines for in-memory key-value stores on coupled CPU-GPU architectures. In Proc. of IEEE ICDE’17. 671–682. DOI: DOI: https://doi.org/10.1109/ICDE.2017.120Google Scholar
[147] Zhang Kai, Wang Kaibo, Yuan Yuan, Guo Lei, Lee Rubao, and Zhang Xiaodong. 2015. Mega-KV: A case for GPUs to maximize the throughput of in-memory key-value stores. Proc. VLDB Endow 8, 11 (2015), 1226–1237. DOI: DOI: https://doi.org/10.14778/2809974.2809984 Google ScholarCross Ref
[148] Zukowski Marcin, Héman Sándor, Nes Niels, and Boncz Peter. 2006. Super-scalar RAM-CPU cache compression. In Proc. of IEEE ICDE’06. 59–59. DOI: DOI: https://doi.org/10.1109/ICDE.2006.150Google Scholar

Published in
ACM Computing Surveys Volume 55, Issue 1
January 2023
860 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3492451
Editor: Albert Zomaya, University of Sydney, Australia
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 17 January 2022
- Accepted: 1 August 2021
- Revised: 1 June 2021
- Received: 1 December 2020

Author Tags
Heterogeneous query processing
graphics processing units
dedicated GPUs
integrated GPUs
data transfer bottleneck
query processing models
Qualifiers
- survey
- Refereed

Article Metrics
- Total Citations: 11
- Total Downloads: 3,067
- Downloads (Last 12 months): 1,349
- Downloads (Last 6 weeks): 163