Abstract
Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values.
We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.
- A. Adams, E. Talvala, S. H. Park, D. E. Jacobs, B. Ajdin, N. Gelfand, J. Dolson, D. Vaquero, J. Baek, M. Tico, H. P. A. Lensch, W. Matusik, K. Pulli, M. Horowitz, and M. Levoy. The Frankencamera: An experimental platform for computational photography. ACM Trans. Graph., 29(4), 2010. Google ScholarDigital Library
- J. Ansel, C. Chan, Y. L.Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. In ACM Programming Language Design and Implementation, 2009. Google ScholarDigital Library
- M. Aubry, S. Paris, S. W. Hasinoff, J. Kautz, and F. Durand. Fast and robust pyramid-based image processing. Technical Report MIT-CSAILTR- 2011-049, Massachusetts Institute of Technology, 2011.Google Scholar
- I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. In SIGGRAPH, 2004. Google ScholarDigital Library
- J. Chen, S. Paris, and F. Durand. Real-time edge-aware image processing with the bilateral grid. ACM Trans. Graph., 26(3), 2007. Google ScholarDigital Library
- CoreImage. Apple CoreImage programming guide, 2006.Google Scholar
- J. L. T. Cornwall, L. Howes, P. H. J. Kelly, P. Parsonage, and B. Nicoletti. High-performance SIMT code generation in an active visual effects library. In Conf. on Computing Frontiers, 2009. Google ScholarDigital Library
- C. Elliott. Functional image synthesis. In Proceedings of Bridges, 2001.Google Scholar
- K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: programming the memory hierarchy. In ACM/IEEE conference on Supercomputing, 2006. Google ScholarDigital Library
- M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, 2005. Google ScholarDigital Library
- M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, C. Leger, A. A. Lamb, J. Wong, H. Hoffman, D. Z. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. In International Conf. on Architectural Support for Programming Languages and Operating Systems, 2002. Google ScholarDigital Library
- P. L. Guernic, A. Benveniste, P. Bournai, and T. Gautier. Signal -- A data flow-oriented language for signal processing. IEEE Transactionsn on Acoustics, Speech and Signal Processing, 34(2):362--374, 1986.Google ScholarCross Ref
- J. Holewinski, L. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In ICS, 2012. Google ScholarDigital Library
- IPP. Intel Integrated Performance Primitives. http://software.intel.com/en-us/articles/intel-ipp/.Google Scholar
- S. Kamil, C. Chan, L. Oliker, J. Shalf, , and S.Williams. An auto-tuning framework for parallel multicore stencil computations. In IPDPS, 2010.Google ScholarCross Ref
- S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, 2007. Google ScholarDigital Library
- J. Meng and K. Skadron. A performance study for iterative stencil loops on gpus with ghost zone optimizations. In IJPP, 2011.Google ScholarCross Ref
- R. Moore. Interval Analysis. 1966.Google Scholar
- A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-d blocking optimization for stencil computations on modern cpus and gpus. In Supercomputing, 2010. Google ScholarDigital Library
- OpenMP. OpenMP. http://openmp.org/.Google Scholar
- S. Paris, P. Kornprobst, J. Tumblin, and F. Durand. Bilateral filtering: Theory and applications. Foundations and Trends in Computer Graphics and Vision, 2009. Google ScholarDigital Library
- S. Paris, S. W. Hasinoff, and J. Kautz. Local Laplacian filters: Edgeaware image processing with a Laplacian pyramid. ACM Trans. Graph., 30(4), 2011. Google ScholarDigital Library
- PixelBender. Adobe PixelBender reference, 2010.Google Scholar
- H. Printz. Automatic Mapping of Large Signal Processing Systems to a Parallel Machine. Ph.D. Thesis, Carnegie Mellon University, 1991. Google ScholarDigital Library
- M. Puschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. In Proceedings of the IEEE, volume 93, 2005.Google ScholarCross Ref
- J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph., 31(4), 2012. Google ScholarDigital Library
- M. A. Shantzis. A model for efficient and flexible image computing. In ACM SIGGRAPH, 1994. Google ScholarDigital Library
- Y. Tang, R. Chowdhury, B. Kuszmaul, C.-K. Luk, and C. Leiserson. The Pochoir stencil compiler. In SPAA, 2011. Google ScholarDigital Library
- W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In International Conference on Compiler Construction, 2002. Google ScholarDigital Library
- P.-S. Tseng. A Parallelizing Compiler for Disributed Memory Parallel Computers. PhD thesis, Carnegie Mellon University, 1989. Google ScholarDigital Library
- X. Zhou, J.-P. Giacalone, M. J. Garzarán, R. H. Kuhn, Y. Ni, and D. Padua. Hierarchical overlapped tiling.Google Scholar
Index Terms
- Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
Recommendations
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and ImplementationImage processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. ...
Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
Highlights- Generate parallel CUDA code from sequential C input code using a compiler-based tool for key operators in Geometric Multigrid.
AbstractGPUs, with their high bandwidths and computational capabilities are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model ...
PolyMage: Automatic Optimization for Image Processing Pipelines
ASPLOS '15This paper presents the design and implementation of PolyMage, a domain-specific language and compiler for image processing pipelines. An image processing pipeline can be viewed as a graph of interconnected stages which process images successively. Each ...
Comments