Abstract
Due to their high computational power and internal memory bandwidth, graphics processing units (GPUs) have been extensively studied by the database systems research community. A heterogeneous query processing system that employs CPUs and GPUs at the same time must solve many challenges, including how to distribute the workload across processors with different capabilities, how to overcome the data transfer bottleneck, and how to support implementations for multiple processors efficiently. In this survey, we devise a classification scheme to categorize the techniques developed to address these challenges. Based on this scheme, we categorize query processing systems on heterogeneous CPU/GPU systems and identify open research problems.
Index Terms
- Query Processing on Heterogeneous CPU/GPU Systems