Abstract
Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. In this paper, we discuss the language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We also outline the runtime support by which Copperhead programs interoperate with standard Python modules. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of hand-crafted, well-optimized CUDA code.
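To make the abstract's claim concrete, the following is a minimal sketch of what a Copperhead program looks like: an ordinary Python function, built from a data parallel primitive (map), that the runtime can compile to CUDA. It assumes the @cu decorator and copperhead module described in the paper; the surrounding NumPy setup is illustrative, not taken from the paper's text.

```python
# A hedged sketch of Copperhead usage; the @cu decorator marks a
# function written in the restricted Python subset so the runtime
# can compile it to CUDA. Array sizes and values are illustrative.
from copperhead import *
import numpy as np

@cu
def axpy(a, x, y):
    # A flat data parallel computation: elementwise a*x + y,
    # expressed by composing map with an elementwise lambda.
    return map(lambda xi, yi: a * xi + yi, x, y)

x = np.arange(100, dtype=np.float64)
y = np.arange(100, dtype=np.float64)
result = axpy(2.0, x, y)  # compiled and executed as data parallel code
```

Because axpy remains a valid Python function, it interoperates with standard Python modules such as NumPy, which is the interoperability property the abstract emphasizes.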