ABSTRACT
In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on the synchronization model enable significant optimizations and performance improvements over a baseline approach. We evaluate these techniques in a production-level compiler and runtime for the CUDA programming model targeting modern CPUs. Applications tested with our tool often showed performance parity with the compiled C version of the application for single-thread performance. With modest coarse-grained multithreading typical of today's CPU architectures, an average of 3.4x speedup on 4 processors was observed across the test applications.
- A. Aiken and D. Gay. Barrier inference. In Proceedings of the 25th ACM Symposium on Principles of Programming Languages, pages 342--354, 1998. Google ScholarDigital Library
- H. A. Andrade and S. Kovner. Software synthesis from dataflow models for embedded software design in the G programming language and the LabVIEW development environment. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, pages 1705--1709, 1998.Google Scholar
- G. Bikshandi, J. G. Castanos, S. B. Kodali, V. K. Nandivada, I. Peshansky, V. A. Saraswat, S. Sur, P. Varma, and T. Wen. Efficient, portable implementation of asynchronous multi-place programs. In PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 271--282, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th International Conference on Machine Learning, pages 104--111, June 2008. Google ScholarDigital Library
- W.-Y. Chen. Building a source-to-source UPC-to-C translator. Master's thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA, 2004.Google Scholar
- F. Chow, S. Chan, R. Kennedy, S. ming Liu, R. Lo, and P. Tu. A new algorithm for partial redundancy elimination based on ssa form. In Proceedings of the ACM Conference on Programming Language Design and Implementation, pages 273--286, 1997. Google ScholarDigital Library
- J. H. Cowie, D. M. Nicol, and A. T. Ogielski. Modeling the global internet. Computing in Science and Engineering, 1(1):42--50, 1999. Google ScholarDigital Library
- P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20:23--53, 1991.Google ScholarCross Ref
- G. Gao, J. Amaral, J. Dehnert, and R. Towle. The SGI Pro64 compiler infrastructure. Tutorial. October 2000.Google Scholar
- Gregory Diamos. The design and implementation Ocelot's dynamic binary translator from PTX to multi-core x86. Technical Report GIT-CERCS-09-18, Georgia Institute of Technology, 2009.Google Scholar
- High Performance Fortran Forum. High Performance Fortran language specification, version 1.0. Technical Report CRPC-TR92225, Rice University, May 1993.Google Scholar
- Khronos OpenCL Working Group. The OpenCL Specification, May 2009.Google Scholar
- S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In Proceedings of 14th ACM Symposium on Principles and Practice of Parallel Programming, pages 101--110, 2008. Google ScholarDigital Library
- S.-W. Liao, Z. Du, G. Wu, and G.-Y. Lueh. Data and computation transformations for Brook streaming applications on multiprocessors. In Proceedings of the 4th International Symposium on Code Generation and Optimization, pages 196--207, March 2006. Google ScholarDigital Library
- E. P. Markatos and T. J. LeBlanc. Using processor affinity in loop scheduling on shared-memory multiprocessors. In Proceedings of the 1992 International Conference on Supercomputing, pages 104--113, July 1992. Google ScholarDigital Library
- J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40--53, 2008. Google ScholarDigital Library
- OpenMP Architecture Review Board. OpenMP application program interface, May 2005.Google Scholar
- S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. Kirk, and W. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM Symposium on Principles and Practice of Parallel Programming, February 2008. Google ScholarDigital Library
- M. Schatz, C. Trapnell, A. Delcher, and A. Varshney. High-throughput sequence alignment using graphics processing units. BMC Bioinformatics, 8(1):474, 2007.Google ScholarCross Ref
- J. Shirako, J. M. Zhao, V. K. Nandivada, and V. N. Sarkar. Chunking parallel loops in the presence of synchronization. In ICS '09: Proceedings of the 23rd international conference on Supercomputing, pages 181--192, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, and K. Schulten. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, September 2007.Google ScholarCross Ref
- S. S. Stone, J. P. Haldar, S. C. Tsao, W.-M. W. Hwu, Z.-P. Liang, and B. P. Sutton. Accelerating advanced MRI reconstructions on GPUs. In Proceedings of the ACM International Conference on Computing Frontiers, pages 261--272, 2008. Google ScholarDigital Library
- J. A. Stratton, S. S. Stone, and W. mei W. Hwu. MCUDA: An effective implementation of CUDA kernels for multi-core CPUs. In Proceedings of the 21st International Workshop on Languages and Compilers for Parallel Computing, pages 16--30, July 2008. Google ScholarDigital Library
- L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103--111, 1990. Google ScholarDigital Library
Index Terms
- Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
Recommendations
Massively LDPC Decoding on Multicore Architectures
Unlike usual VLSI approaches necessary for the computation of intensive Low-Density Parity-Check (LDPC) code decoders, this paper presents flexible software-based LDPC decoders. Algorithms and data structures suitable for parallel computing are proposed ...
A performance study of general-purpose applications on graphics processors using CUDA
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of ...
Designing fast architecture-sensitive tree search on modern multicore/many-core processors
In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units. There has been much work to exploit modern processor ...
Comments