DOI: 10.1145/2581122.2544166
tutorial

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Published: 15 February 2014

ABSTRACT

Modern enterprise applications represent an emergent application arena that requires the processing of queries and computations over massive amounts of data. Large-scale, multi-GPU cluster systems potentially present a vehicle for major improvements in throughput and consequently overall performance. However, throughput improvement using GPUs is challenged by the distinctive memory and computational characteristics of Relational Algebra (RA) operators that are central to queries for answering business questions.

This paper presents the design, implementation, and evaluation of Red Fox, a compiler and runtime infrastructure for executing relational queries on GPUs. Red Fox comprises i) a language front-end for LogiQL, a commercial query language, ii) an RA-to-GPU compiler, iii) optimized GPU implementations of RA operators, and iv) a supporting runtime. We report performance on the full set of industry-standard TPC-H queries on a single-node GPU. Compared with a commercial LogiQL system implementation optimized for a state-of-the-art CPU machine, Red Fox is on average 6.48x faster, including PCIe transfer time. We point out key bottlenecks, propose potential solutions, and analyze the GPU implementation of these queries. To the best of our knowledge, this is the first reported end-to-end compilation and execution infrastructure that supports the full set of TPC-H queries on commodity GPUs.
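To make the memory behavior of RA operators concrete, the sketch below writes a relational SELECT as three data-parallel phases (flag, exclusive prefix sum, scatter), a pattern commonly used for filtering on GPUs. It is a CPU-side illustration in Python, not Red Fox's implementation: the function name `gpu_style_select` and the sample `(orderkey, quantity)` rows are assumptions made for this example. The point it shows is that the output size depends on the data, so each matching row must first learn its output position before anything can be written.

```python
from itertools import accumulate


def gpu_style_select(rows, predicate):
    """Relational SELECT as flag -> exclusive scan -> scatter (illustrative only)."""
    if not rows:
        return []

    # Phase 1 (map): on a GPU, one thread per row would evaluate the predicate.
    flags = [1 if predicate(row) else 0 for row in rows]

    # Phase 2 (exclusive prefix sum): offsets[i] is the output slot of row i
    # if it matches, i.e. the number of matching rows before position i.
    inclusive = list(accumulate(flags))
    offsets = [0] + inclusive[:-1]

    # The output size is only known after the scan completes.
    total = inclusive[-1]
    out = [None] * total

    # Phase 3 (scatter): matching rows are written to their computed slots.
    for i, row in enumerate(rows):
        if flags[i]:
            out[offsets[i]] = row
    return out


# Hypothetical (orderkey, quantity) rows, loosely modeled on a TPC-H style table.
rows = [(1, 17), (2, 36), (3, 8), (4, 28), (5, 24)]
selected = gpu_style_select(rows, lambda r: r[1] < 24)
print(selected)
```

The three phases are independent across rows except for the scan, which is why filtering on a GPU tends to be bound by memory traffic rather than arithmetic.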


      • Published in

        CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
        February 2014
        328 pages
        ISBN: 9781450326704
        DOI: 10.1145/2581122

        Copyright © 2014 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • tutorial
        • Research
        • Refereed limited

        Acceptance Rates

        CGO '14 Paper Acceptance Rate: 29 of 100 submissions, 29%
        Overall Acceptance Rate: 312 of 1,061 submissions, 29%
