research-article

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

Authors:
John A. Stratton

NVIDIA Corporation / University of Illinois at Urbana-Champaign, Champaign, IL, USA

NVIDIA Corporation / University of Illinois at Urbana-Champaign, Champaign, IL, USA
View Profile

,
Vinod Grover

NVIDIA Corporation, Santa Clara, CA, USA

NVIDIA Corporation, Santa Clara, CA, USA
View Profile

,
Jaydeep Marathe

NVIDIA Corporation, Santa Clara, CA, USA

NVIDIA Corporation, Santa Clara, CA, USA
View Profile

,
Bastiaan Aarts

NVIDIA Corporation, Santa Clara, CA, USA

NVIDIA Corporation, Santa Clara, CA, USA
View Profile

,
Mike Murphy

NVIDIA Corporation, Santa Clara, CA, USA

NVIDIA Corporation, Santa Clara, CA, USA
View Profile

,
Ziang Hu

NVIDIA Corporation, Santa Clara, CA, USA

NVIDIA Corporation, Santa Clara, CA, USA
View Profile

,
Wen-mei W. Hwu

University of Illinois at Urbana-Champaign, Urbana, IL, USA

University of Illinois at Urbana-Champaign, Urbana, IL, USA
View Profile

CGO '10: Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimizationApril 2010Pages 111–119https://doi.org/10.1145/1772954.1772971

Published:24 April 2010Publication History

CGO '10: Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization

Pages 111–119

ABSTRACT

In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on the synchronization model enable significant optimizations and performance improvements over a baseline approach. We evaluate these techniques in a production-level compiler and runtime for the CUDA programming model targeting modern CPUs. Applications tested with our tool often showed performance parity with the compiled C version of the application for single-thread performance. With modest coarse-grained multithreading typical of today's CPU architectures, an average of 3.4x speedup on 4 processors was observed across the test applications.

References

A. Aiken and D. Gay. Barrier inference. In Proceedings of the 25th ACM Symposium on Principles of Programming Languages, pages 342--354, 1998. Google ScholarDigital Library
H. A. Andrade and S. Kovner. Software synthesis from dataflow models for embedded software design in the G programming language and the LabVIEW development environment. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, pages 1705--1709, 1998.Google Scholar
G. Bikshandi, J. G. Castanos, S. B. Kodali, V. K. Nandivada, I. Peshansky, V. A. Saraswat, S. Sur, P. Varma, and T. Wen. Efficient, portable implementation of asynchronous multi-place programs. In PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 271--282, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th International Conference on Machine Learning, pages 104--111, June 2008. Google ScholarDigital Library
W.-Y. Chen. Building a source-to-source UPC-to-C translator. Master's thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA, 2004.Google Scholar
F. Chow, S. Chan, R. Kennedy, S. ming Liu, R. Lo, and P. Tu. A new algorithm for partial redundancy elimination based on ssa form. In Proceedings of the ACM Conference on Programming Language Design and Implementation, pages 273--286, 1997. Google ScholarDigital Library
J. H. Cowie, D. M. Nicol, and A. T. Ogielski. Modeling the global internet. Computing in Science and Engineering, 1(1):42--50, 1999. Google ScholarDigital Library
P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20:23--53, 1991.Google ScholarCross Ref
G. Gao, J. Amaral, J. Dehnert, and R. Towle. The SGI Pro64 compiler infrastructure. Tutorial. October 2000.Google Scholar
Gregory Diamos. The design and implementation Ocelot's dynamic binary translator from PTX to multi-core x86. Technical Report GIT-CERCS-09-18, Georgia Institute of Technology, 2009.Google Scholar
High Performance Fortran Forum. High Performance Fortran language specification, version 1.0. Technical Report CRPC-TR92225, Rice University, May 1993.Google Scholar
Khronos OpenCL Working Group. The OpenCL Specification, May 2009.Google Scholar
S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In Proceedings of 14th ACM Symposium on Principles and Practice of Parallel Programming, pages 101--110, 2008. Google ScholarDigital Library
S.-W. Liao, Z. Du, G. Wu, and G.-Y. Lueh. Data and computation transformations for Brook streaming applications on multiprocessors. In Proceedings of the 4th International Symposium on Code Generation and Optimization, pages 196--207, March 2006. Google ScholarDigital Library
E. P. Markatos and T. J. LeBlanc. Using processor affinity in loop scheduling on shared-memory multiprocessors. In Proceedings of the 1992 International Conference on Supercomputing, pages 104--113, July 1992. Google ScholarDigital Library
J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40--53, 2008. Google ScholarDigital Library
OpenMP Architecture Review Board. OpenMP application program interface, May 2005.Google Scholar
S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. Kirk, and W. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM Symposium on Principles and Practice of Parallel Programming, February 2008. Google ScholarDigital Library
M. Schatz, C. Trapnell, A. Delcher, and A. Varshney. High-throughput sequence alignment using graphics processing units. BMC Bioinformatics, 8(1):474, 2007.Google ScholarCross Ref
J. Shirako, J. M. Zhao, V. K. Nandivada, and V. N. Sarkar. Chunking parallel loops in the presence of synchronization. In ICS '09: Proceedings of the 23rd international conference on Supercomputing, pages 181--192, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, and K. Schulten. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, September 2007.Google ScholarCross Ref
S. S. Stone, J. P. Haldar, S. C. Tsao, W.-M. W. Hwu, Z.-P. Liang, and B. P. Sutton. Accelerating advanced MRI reconstructions on GPUs. In Proceedings of the ACM International Conference on Computing Frontiers, pages 261--272, 2008. Google ScholarDigital Library
J. A. Stratton, S. S. Stone, and W. mei W. Hwu. MCUDA: An effective implementation of CUDA kernels for multi-core CPUs. In Proceedings of the 21st International Workshop on Languages and Compilers for Parallel Computing, pages 16--30, July 2008. Google ScholarDigital Library
L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103--111, 1990. Google ScholarDigital Library

Index Terms

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Recommendations

Massively LDPC Decoding on Multicore Architectures

Unlike usual VLSI approaches necessary for the computation of intensive Low-Density Parity-Check (LDPC) code decoders, this paper presents flexible software-based LDPC decoders. Algorithms and data structures suitable for parallel computing are proposed ...
Read More
A performance study of general-purpose applications on graphics processors using CUDA

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of ...
Read More
Designing fast architecture-sensitive tree search on modern multicore/many-core processors

In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units. There has been much work to exploit modern processor ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CGO '10: Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
April 2010
300 pages
ISBN:9781605586359
DOI:10.1145/1772954
General Chairs:
Andreas Moshovos
University of Toronto
,
Greg Steffan
University of Toronto
,
Program Chairs:
Kim Hazelwood
University of Virginia
,
David Kaeli
Northeastern University
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 April 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
CPU
CUDA
SPMD
multicore
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate312of1,061submissions,29%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 61
  Total Citations
  View Citations
- 772
  Total Downloads
- Downloads (Last 12 months)21
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

CGO '10: Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization

ABSTRACT

References

Cited By

Index Terms

Recommendations

Massively LDPC Decoding on Multicore Architectures

A performance study of general-purpose applications on graphics processors using CUDA

Designing fast architecture-sensitive tree search on modern multicore/many-core processors