research-article

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Authors:
Jonathan Ragan-Kelley

Massachusetts Institute of Technology, Cambridge, MA, USA

Massachusetts Institute of Technology, Cambridge, MA, USA
View Profile

,
Connelly Barnes

Adobe, Cambridge, MA, USA

Adobe, Cambridge, MA, USA
View Profile

,
Andrew Adams

Massachusetts Institute of Technology, Cambridge, MA, USA

Massachusetts Institute of Technology, Cambridge, MA, USA
View Profile

,
Sylvain Paris

Adobe, Cambridge, MA, USA

Adobe, Cambridge, MA, USA
View Profile

,
Frédo Durand

Massachusetts Institute of Technology, Cambridge, MA, USA

Massachusetts Institute of Technology, Cambridge, MA, USA
View Profile

,
Saman Amarasinghe

Massachusetts Institute of Technology, Cambridge, MA, USA

Massachusetts Institute of Technology, Cambridge, MA, USA
View Profile

Authors Info & Claims

ACM SIGPLAN Notices Volume 48 Issue 6June 2013pp 519–530https://doi.org/10.1145/2499370.2462176

Published:16 June 2013Publication History

ACM SIGPLAN Notices

Abstract

Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values.

We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.

References

A. Adams, E. Talvala, S. H. Park, D. E. Jacobs, B. Ajdin, N. Gelfand, J. Dolson, D. Vaquero, J. Baek, M. Tico, H. P. A. Lensch, W. Matusik, K. Pulli, M. Horowitz, and M. Levoy. The Frankencamera: An experimental platform for computational photography. ACM Trans. Graph., 29(4), 2010. Google ScholarDigital Library
J. Ansel, C. Chan, Y. L.Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. In ACM Programming Language Design and Implementation, 2009. Google ScholarDigital Library
M. Aubry, S. Paris, S. W. Hasinoff, J. Kautz, and F. Durand. Fast and robust pyramid-based image processing. Technical Report MIT-CSAILTR- 2011-049, Massachusetts Institute of Technology, 2011.Google Scholar
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. In SIGGRAPH, 2004. Google ScholarDigital Library
J. Chen, S. Paris, and F. Durand. Real-time edge-aware image processing with the bilateral grid. ACM Trans. Graph., 26(3), 2007. Google ScholarDigital Library
CoreImage. Apple CoreImage programming guide, 2006.Google Scholar
J. L. T. Cornwall, L. Howes, P. H. J. Kelly, P. Parsonage, and B. Nicoletti. High-performance SIMT code generation in an active visual effects library. In Conf. on Computing Frontiers, 2009. Google ScholarDigital Library
C. Elliott. Functional image synthesis. In Proceedings of Bridges, 2001.Google Scholar
K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: programming the memory hierarchy. In ACM/IEEE conference on Supercomputing, 2006. Google ScholarDigital Library
M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, 2005. Google ScholarDigital Library
M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, C. Leger, A. A. Lamb, J. Wong, H. Hoffman, D. Z. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. In International Conf. on Architectural Support for Programming Languages and Operating Systems, 2002. Google ScholarDigital Library
P. L. Guernic, A. Benveniste, P. Bournai, and T. Gautier. Signal -- A data flow-oriented language for signal processing. IEEE Transactionsn on Acoustics, Speech and Signal Processing, 34(2):362--374, 1986.Google ScholarCross Ref
J. Holewinski, L. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In ICS, 2012. Google ScholarDigital Library
IPP. Intel Integrated Performance Primitives. http://software.intel.com/en-us/articles/intel-ipp/.Google Scholar
S. Kamil, C. Chan, L. Oliker, J. Shalf, , and S.Williams. An auto-tuning framework for parallel multicore stencil computations. In IPDPS, 2010.Google ScholarCross Ref
S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, 2007. Google ScholarDigital Library
J. Meng and K. Skadron. A performance study for iterative stencil loops on gpus with ghost zone optimizations. In IJPP, 2011.Google ScholarCross Ref
R. Moore. Interval Analysis. 1966.Google Scholar
A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-d blocking optimization for stencil computations on modern cpus and gpus. In Supercomputing, 2010. Google ScholarDigital Library
OpenMP. OpenMP. http://openmp.org/.Google Scholar
S. Paris, P. Kornprobst, J. Tumblin, and F. Durand. Bilateral filtering: Theory and applications. Foundations and Trends in Computer Graphics and Vision, 2009. Google ScholarDigital Library
S. Paris, S. W. Hasinoff, and J. Kautz. Local Laplacian filters: Edgeaware image processing with a Laplacian pyramid. ACM Trans. Graph., 30(4), 2011. Google ScholarDigital Library
PixelBender. Adobe PixelBender reference, 2010.Google Scholar
H. Printz. Automatic Mapping of Large Signal Processing Systems to a Parallel Machine. Ph.D. Thesis, Carnegie Mellon University, 1991. Google ScholarDigital Library
M. Puschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. In Proceedings of the IEEE, volume 93, 2005.Google ScholarCross Ref
J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph., 31(4), 2012. Google ScholarDigital Library
M. A. Shantzis. A model for efficient and flexible image computing. In ACM SIGGRAPH, 1994. Google ScholarDigital Library
Y. Tang, R. Chowdhury, B. Kuszmaul, C.-K. Luk, and C. Leiserson. The Pochoir stencil compiler. In SPAA, 2011. Google ScholarDigital Library
W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In International Conference on Compiler Construction, 2002. Google ScholarDigital Library
P.-S. Tseng. A Parallelizing Compiler for Disributed Memory Parallel Computers. PhD thesis, Carnegie Mellon University, 1989. Google ScholarDigital Library
X. Zhou, J.-P. Giacalone, M. J. Garzarán, R. H. Kuhn, Y. Ni, and D. Padua. Hierarchical overlapped tiling.Google Scholar

Index Terms

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation

Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. ...
Read More
Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
Highlights
- Generate parallel CUDA code from sequential C input code using a compiler-based tool for key operators in Geometric Multigrid.
Abstract
GPUs, with their high bandwidths and computational capabilities are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model ...
Read More
PolyMage: Automatic Optimization for Image Processing Pipelines
ASPLOS '15

This paper presents the design and implementation of PolyMage, a domain-specific language and compiler for image processing pipelines. An image processing pipeline can be viewed as a graph of interconnected stages which process images successively. Each ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGPLAN Notices Volume 48, Issue 6
PLDI '13
June 2013
515 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2499370
Issue’s Table of Contents
PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2013
546 pages
ISBN:9781450320146
DOI:10.1145/2491956
General Chair:
Hans-J. Boehm
HP Labs
,
Program Chair:
Cormac Flanagan
University of California, Santa Cruz
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 June 2013
Check for updates
Author Tags
autotuning
compiler
domain specific language
gpu
image processing
locality
optimization
parallelism
redundant computation
vectorization
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 873
  Total Citations
  View Citations
- 3,888
  Total Downloads
- Downloads (Last 12 months)594
- Downloads (Last 6 weeks)62
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

ACM SIGPLAN Notices

Abstract

References

Cited By

Index Terms

Recommendations

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

PolyMage: Automatic Optimization for Image Processing Pipelines