ABSTRACT
Modern enterprise applications represent an emergent application arena that requires the processing of queries and computations over massive amounts of data. Large-scale, multi-GPU cluster systems potentially present a vehicle for major improvements in throughput and consequently overall performance. However, throughput improvement using GPUs is challenged by the distinctive memory and computational characteristics of Relational Algebra (RA) operators that are central to queries for answering business questions.
This paper introduces the design, implementation, and evaluation of Red Fox, a compiler and runtime infrastructure for executing relational queries on GPUs. Red Fox comprises i) a language front-end for LogiQL, a commercial query language, ii) an RA-to-GPU compiler, iii) optimized GPU implementations of RA operators, and iv) a supporting runtime. We report performance on the full set of industry-standard TPC-H queries on a single-node GPU. Compared with a commercial LogiQL system implementation optimized for a state-of-the-art CPU machine, Red Fox is on average 6.48x faster, including PCIe transfer time. We point out key bottlenecks, propose potential solutions, and analyze the GPU implementation of these queries. To the best of our knowledge, this is the first reported end-to-end compilation and execution infrastructure that supports the full set of TPC-H queries on commodity GPUs.
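To make the idea of an RA operator as a data-parallel primitive concrete, the following is a minimal sketch of a relational SELECT written as three bulk phases (predicate map, exclusive prefix sum, scatter). It is plain sequential Python for illustration only, and it is not claimed to be Red Fox's implementation; the function name `select`, the toy relation `lineitem`, and the predicate are all hypothetical.

```python
from itertools import accumulate

def select(rows, predicate):
    """Relational SELECT expressed as three data-parallel phases (illustrative only)."""
    # Phase 1 (map): every row evaluates the predicate independently.
    flags = [1 if predicate(row) else 0 for row in rows]
    # Phase 2 (exclusive prefix sum): assigns each matching row its output slot.
    offsets = [0] + list(accumulate(flags))[:-1] if flags else []
    # Phase 3 (scatter): matching rows write to distinct slots, so no conflicts arise.
    out = [None] * sum(flags)
    for row, flag, slot in zip(rows, flags, offsets):
        if flag:
            out[slot] = row
    return out

# Hypothetical toy relation: (order_key, quantity)
lineitem = [(1, 17), (2, 36), (3, 8), (4, 28)]
result = select(lineitem, lambda r: r[1] < 24)
```

Each phase touches every row independently or through a scan, which is the kind of structure that maps naturally onto many GPU threads; on a real device each phase would typically be a kernel or a library primitive rather than a Python loop.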
Red Fox: An Execution Environment for Relational Query Processing on GPUs
CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization