ABSTRACT
We present Singe, a domain-specific language (DSL) compiler for combustion chemistry that leverages warp specialization to produce high-performance code for GPUs. Instead of relying on traditional GPU programming models that emphasize data-parallel computations, warp specialization allows compilers like Singe to partition computations into sub-computations, which are then assigned to different warps within a thread block. Fine-grained synchronization between warps is performed efficiently in hardware using producer-consumer named barriers. Partitioning computations using warp specialization allows Singe to handle irregularity in both data access patterns and computation efficiently. Furthermore, warp-specialized partitioning of computations allows Singe to fit extremely large working sets into on-chip memories. Finally, we describe the architecture and general compilation techniques necessary for constructing a warp-specializing compiler. We show that the warp-specialized code emitted by Singe is up to 3.75X faster than previously optimized data-parallel GPU kernels.
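To make the idea of warp specialization with producer-consumer named barriers concrete, here is a minimal hand-written CUDA sketch. It is not code emitted by Singe and not taken from the paper. It assumes a toy workload (doubling an array) with one producer warp that stages tiles into shared memory and one consumer warp that computes on them. The names `scale_tiles`, `bar_sync`, `bar_arrive`, `BAR_FULL`, and `BAR_EMPTY` are hypothetical; the only hardware features relied on are the PTX `bar.sync` and `bar.arrive` instructions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int WARP  = 32;
constexpr int BLOCK = 2 * WARP;  // one producer warp + one consumer warp
constexpr int BAR_FULL  = 1;     // hypothetical IDs; barrier 0 is left for __syncthreads()
constexpr int BAR_EMPTY = 2;

// Thin wrappers around the PTX named-barrier instructions.
__device__ __forceinline__ void bar_sync(int id, int nthreads) {
    asm volatile("bar.sync %0, %1;" :: "r"(id), "r"(nthreads) : "memory");
}
__device__ __forceinline__ void bar_arrive(int id, int nthreads) {
    asm volatile("bar.arrive %0, %1;" :: "r"(id), "r"(nthreads) : "memory");
}

// Toy kernel: warp 0 loads tiles into shared memory, warp 1 consumes them.
__global__ void scale_tiles(const float* in, float* out, int ntiles) {
    __shared__ float tile[WARP];
    const int warp = threadIdx.x / WARP;  // the branch below is uniform per warp
    const int lane = threadIdx.x % WARP;

    for (int t = 0; t < ntiles; ++t) {
        if (warp == 0) {
            // Producer: block until the consumer has released the buffer,
            // fill it, then signal "full" without waiting.
            bar_sync(BAR_EMPTY, BLOCK);
            tile[lane] = in[t * WARP + lane];
            bar_arrive(BAR_FULL, BLOCK);
        } else {
            // Consumer: signal "empty" without waiting, then block until
            // the producer has filled the buffer.
            bar_arrive(BAR_EMPTY, BLOCK);
            bar_sync(BAR_FULL, BLOCK);
            out[t * WARP + lane] = 2.0f * tile[lane];
        }
    }
}

int main() {
    const int ntiles = 4, n = ntiles * WARP;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

    scale_tiles<<<1, BLOCK>>>(in, out, ntiles);
    cudaDeviceSynchronize();

    // Report whether every element was doubled.
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (out[i] == 2.0f * in[i]);
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(in);
    cudaFree(out);
    return ok ? 0 : 1;
}
```

Each named barrier is told to expect 64 threads: the 32 that arrive without blocking plus the 32 that wait. The two warps therefore synchronize only with each other and only at the hand-off points, rather than through a block-wide `__syncthreads()`.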
Singe: leveraging warp specialization for high performance on GPUs. PPoPP '14.