skip to main content
10.1145/1772954.1772971acmconferencesArticle/Chapter ViewAbstractPublication PagescgoConference Proceedingsconference-collections
research-article

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

Authors Info & Claims
Published:24 April 2010Publication History

ABSTRACT

In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on the synchronization model enable significant optimizations and performance improvements over a baseline approach. We evaluate these techniques in a production-level compiler and runtime for the CUDA programming model targeting modern CPUs. Applications tested with our tool often showed performance parity with the compiled C version of the application for single-thread performance. With modest coarse-grained multithreading typical of today's CPU architectures, an average of 3.4x speedup on 4 processors was observed across the test applications.

References

  1. A. Aiken and D. Gay. Barrier inference. In Proceedings of the 25th ACM Symposium on Principles of Programming Languages, pages 342--354, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. H. A. Andrade and S. Kovner. Software synthesis from dataflow models for embedded software design in the G programming language and the LabVIEW development environment. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, pages 1705--1709, 1998.Google ScholarGoogle Scholar
  3. G. Bikshandi, J. G. Castanos, S. B. Kodali, V. K. Nandivada, I. Peshansky, V. A. Saraswat, S. Sur, P. Varma, and T. Wen. Efficient, portable implementation of asynchronous multi-place programs. In PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 271--282, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th International Conference on Machine Learning, pages 104--111, June 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. W.-Y. Chen. Building a source-to-source UPC-to-C translator. Master's thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA, 2004.Google ScholarGoogle Scholar
  6. F. Chow, S. Chan, R. Kennedy, S. ming Liu, R. Lo, and P. Tu. A new algorithm for partial redundancy elimination based on ssa form. In Proceedings of the ACM Conference on Programming Language Design and Implementation, pages 273--286, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. H. Cowie, D. M. Nicol, and A. T. Ogielski. Modeling the global internet. Computing in Science and Engineering, 1(1):42--50, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20:23--53, 1991.Google ScholarGoogle ScholarCross RefCross Ref
  9. G. Gao, J. Amaral, J. Dehnert, and R. Towle. The SGI Pro64 compiler infrastructure. Tutorial. October 2000.Google ScholarGoogle Scholar
  10. Gregory Diamos. The design and implementation Ocelot's dynamic binary translator from PTX to multi-core x86. Technical Report GIT-CERCS-09-18, Georgia Institute of Technology, 2009.Google ScholarGoogle Scholar
  11. High Performance Fortran Forum. High Performance Fortran language specification, version 1.0. Technical Report CRPC-TR92225, Rice University, May 1993.Google ScholarGoogle Scholar
  12. Khronos OpenCL Working Group. The OpenCL Specification, May 2009.Google ScholarGoogle Scholar
  13. S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In Proceedings of 14th ACM Symposium on Principles and Practice of Parallel Programming, pages 101--110, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S.-W. Liao, Z. Du, G. Wu, and G.-Y. Lueh. Data and computation transformations for Brook streaming applications on multiprocessors. In Proceedings of the 4th International Symposium on Code Generation and Optimization, pages 196--207, March 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. E. P. Markatos and T. J. LeBlanc. Using processor affinity in loop scheduling on shared-memory multiprocessors. In Proceedings of the 1992 International Conference on Supercomputing, pages 104--113, July 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40--53, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. OpenMP Architecture Review Board. OpenMP application program interface, May 2005.Google ScholarGoogle Scholar
  18. S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. Kirk, and W. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM Symposium on Principles and Practice of Parallel Programming, February 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. M. Schatz, C. Trapnell, A. Delcher, and A. Varshney. High-throughput sequence alignment using graphics processing units. BMC Bioinformatics, 8(1):474, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  20. J. Shirako, J. M. Zhao, V. K. Nandivada, and V. N. Sarkar. Chunking parallel loops in the presence of synchronization. In ICS '09: Proceedings of the 23rd international conference on Supercomputing, pages 181--192, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, and K. Schulten. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, September 2007.Google ScholarGoogle ScholarCross RefCross Ref
  22. S. S. Stone, J. P. Haldar, S. C. Tsao, W.-M. W. Hwu, Z.-P. Liang, and B. P. Sutton. Accelerating advanced MRI reconstructions on GPUs. In Proceedings of the ACM International Conference on Computing Frontiers, pages 261--272, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. A. Stratton, S. S. Stone, and W. mei W. Hwu. MCUDA: An effective implementation of CUDA kernels for multi-core CPUs. In Proceedings of the 21st International Workshop on Languages and Compilers for Parallel Computing, pages 16--30, July 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103--111, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        CGO '10: Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
        April 2010
        300 pages
        ISBN:9781605586359
        DOI:10.1145/1772954

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 24 April 2010

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate312of1,061submissions,29%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader