tutorial

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Authors:
Haicheng Wu, Georgia Institute of Technology
Gregory Diamos, NVIDIA
Tim Sheard, Portland State University
Molham Aref, LogicBlox Inc.
Sean Baxter, NVIDIA
Michael Garland, NVIDIA
Sudhakar Yalamanchili, Georgia Institute of Technology

CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, February 2014, Pages 44–54
https://doi.org/10.1145/2581122.2544166

Published: 15 February 2014

ABSTRACT

Modern enterprise applications represent an emergent application arena that requires the processing of queries and computations over massive amounts of data. Large-scale, multi-GPU cluster systems potentially present a vehicle for major improvements in throughput and consequently overall performance. However, throughput improvement using GPUs is challenged by the distinctive memory and computational characteristics of Relational Algebra (RA) operators that are central to queries for answering business questions.

This paper introduces the design, implementation, and evaluation of Red Fox, a compiler and runtime infrastructure for executing relational queries on GPUs. Red Fox comprises i) a language front-end for LogiQL, a commercial query language, ii) an RA-to-GPU compiler, iii) optimized GPU implementations of RA operators, and iv) a supporting runtime. We report the performance on the full set of industry-standard TPC-H queries on a single-node GPU. Compared with a commercial LogiQL system implementation optimized for a state-of-the-art CPU machine, Red Fox is on average 6.48x faster, including PCIe transfer time. We point out key bottlenecks, propose potential solutions, and analyze the GPU implementation of these queries. To the best of our knowledge, this is the first reported end-to-end compilation and execution infrastructure that supports the full set of TPC-H queries on commodity GPUs.
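
For readers unfamiliar with the Relational Algebra (RA) operators the abstract refers to, the following is a minimal CPU-side sketch in Python of three of them: SELECT, PROJECT, and an equi-JOIN. It is an illustration only and not Red Fox's GPU implementation. The relation names, the column layout, and the choice of a hash-based join are assumptions made for brevity.

```python
from collections import defaultdict


def select(rows, predicate):
    """SELECT: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """PROJECT: keep only the listed column positions of each row."""
    return [tuple(row[c] for c in columns) for row in rows]


def join(left, right, left_key, right_key):
    """Equi-JOIN using a hash table built on the right input (an assumption,
    chosen for brevity; other join algorithms give the same result)."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Hypothetical toy relations:
#   orders(order_id, customer_id, total), customers(customer_id, name)
orders = [(1, 10, 250.0), (2, 11, 40.0), (3, 10, 75.0)]
customers = [(10, "Ada"), (11, "Grace")]

# "Names of customers with an order above 50": SELECT, then JOIN, then PROJECT.
large = select(orders, lambda row: row[2] > 50.0)
joined = join(large, customers, left_key=1, right_key=0)
result = project(joined, [0, 4])
print(result)
```

Each operator consumes and produces whole relations, which is what makes a query expressible as a plan of RA operators that a compiler can map onto data-parallel hardware.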


Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
2. Software and its engineering
  1. Software notations and tools
    1. Compilers


Published in

CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
February 2014
328 pages
ISBN: 9781450326704
DOI: 10.1145/2581122

Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPU
Relational Query Processing
Qualifiers
- tutorial
- Research
- Refereed limited
Acceptance Rates
CGO '14 Paper Acceptance Rate: 29 of 100 submissions, 29%
Overall Acceptance Rate: 312 of 1,061 submissions, 29%