Abstract
Loop tiling to exploit data locality and parallelism plays an essential role in a variety of general-purpose and domain-specific compilers. Affine transformations in polyhedral frameworks implement classical forms of rectangular and parallelogram tiling, but these lead to pipelined start with rather inefficient wavefront parallelism. Multiple extensions to polyhedral compilers evaluated sophisticated shapes such as trapezoid or diamond tiles, enabling concurrent start along the axes of the iteration space; yet these resort to custom schedulers and code generators insufficiently integrated within the general framework. One of these modified shapes referred to as overlapped tiling also lacks a unifying framework to reason about its composition with affine transformations; this prevents its application in general-purpose loop-nest optimizers and the fair comparison with other techniques. We revisit overlapped tiling, recasting it as an affine transformation on schedule trees composable with any affine scheduling algorithm. We demonstrate how to derive tighter tile shapes with less redundant computations. Our method models the traditional “scalene trapezoid” shapes and novel “right-rectangle” variants. It goes beyond the state of the art by avoiding the restriction to a domain-specific language or introducing post-pass rescheduling and custom code generation. We conduct experiments on the PolyMage benchmarks and iterated stencils, validating the effectiveness and applicability of our technique on both general-purpose multicores and GPU accelerators.
- Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. 2014. OpenTuner: An extensible framework for program autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). ACM, New York, NY, 303--316. DOI:https://doi.org/10.1145/2628071.2628092Google ScholarDigital Library
- Mathieu Aubry, Sylvain Paris, Samuel W. Hasinoff, Jan Kautz, and Frédo Durand. 2014. Fast local Laplacian filters: Theory and applications. ACM Transactions on Graphics 33, 5 (Sept. 2014), Article 167, 14 pages. DOI:https://doi.org/10.1145/2629645Google ScholarDigital Library
- David F. Bacon, Susan L. Graham, and Oliver J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Comput.ing Surveys 26, 4 (Dec. 1994), 345--420. DOI:https://doi.org/10.1145/197405.197406Google Scholar
- Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis (SC’12). IEEE, Los Alamitos, CA, Article 40, 11 pages. http://dl.acm.org/citation.cfm?id=2388996.2389051.Google ScholarDigital Library
- Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and optimizing time-iterated computations on periodic domains. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT ’14). ACM, New York, NY, 39--50. DOI:https://doi.org/10.1145/2628071.2628106Google ScholarDigital Library
- Uday Bondhugula, Vinayaka Bandishti, and Irshad Pananilath. 2017. Diamond tiling: Tiling techniques to maximize parallelism for stencil computations. IEEE Transactions on Parallel and Distributed Systems 28, 5 (Oct. 2017), 1285--1298. DOI:https://doi.org/10.1109/TPDS.2016.2615094Google ScholarDigital Library
- U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’08). ACM, New York, NY, 101--113. DOI:https://doi.org/10.1145/1375581.1375595Google ScholarDigital Library
- Peter J. Burt and Edward H. Adelson. 1983. A multiresolution spline with application to image mosaics. ACM Transactions on Graphics 2, 4 (Oct. 1983), 217--236. DOI:https://doi.org/10.1145/245.247Google ScholarDigital Library
- Jiawen Chen, Sylvain Paris, and Frédo Durand. 2007. Real-time edge-aware image processing with the bilateral grid. In ACM SIGGRAPH 2007 Papers (SIGGRAPH ’07). ACM, New York, NY, Article 103. DOI:https://doi.org/10.1145/1275808.1276506Google ScholarDigital Library
- Eddie C. Davis, Michelle Mills Strout, and Catherine Olschanowsky. 2018. Transforming loop chains via macro dataflow graphs. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO’18). ACM, New York, NY, 265--277. DOI:https://doi.org/10.1145/3168832Google ScholarDigital Library
- H. Eissfeller and S. M. Muller. 1990. The triangle method for saving startup time in parallel computers. In Proceedings of the 1990 5th Distributed Memory Computing Conference. 568--572.Google Scholar
- Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. International Journal of Parallel Programming 21, 6 (Dec. 1992), 389--420. DOI:https://doi.org/10.1007/BF01379404Google Scholar
- Pieter Ghysels and Wim Vanroose. 2015. Modeling the performance of geometric multigrid stencils on multicore computer architectures. SIAM Journal on Scientific Computing 37, 2 (2015), C194--C216.Google ScholarCross Ref
- T. Grosser, A. Cohen, J. Holewinski, P. Sadayappan, and S. Verdoolaege. 2014. Hybrid hexagonal/classical tiling for GPUs. In Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’14). ACM, New York, NY, Article 66, 10 pages. DOI:https://doi.org/10.1145/2544137.2544160Google Scholar
- T. Grosser, A. Cohen, P. H. J. Kelly, J. Ramanujam, P. Sadayappan, and S. Verdoolaege. 2013. Split tiling for GPUs: Automatic parallelization using trapezoidal tiles. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). ACM, New York, NY, 24--31. DOI:https://doi.org/10.1145/2458523.2458526Google Scholar
- Tobias Grosser, Sven Verdoolaege, and Albert Cohen. 2015. Polyhedral AST generation is more than scanning polyhedra. ACM Transactions on Programming Languages and Systems 37, 4 (July 2015), Article 12, 50 pages. DOI:https://doi.org/10.1145/2743016Google ScholarDigital Library
- T. Grosser, S. Verdoolaege, A. Cohen, and P. Sadayappan. 2014. The relation between diamond tiling and hexagonal tiling. Parallel Processing Letters 24, 3 (2014), 1441002.Google ScholarCross Ref
- Chris Harris and Mike Stephens. 1988. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, Vol. 15. 1--6.Google ScholarCross Ref
- J. Holewinski, L.-N. Pouchet, and P. Sadayappan. 2012. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS’12). ACM, New York, NY, 311--320. DOI:https://doi.org/10.1145/2304576.2304619Google Scholar
- DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Multi-level tiling: M for the price of one. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC’07). ACM, New York, NY, Article 51, 12 pages. DOI:https://doi.org/10.1145/1362622.1362691Google ScholarDigital Library
- S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. 2007. Effective automatic parallelization of stencil computations. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’07). ACM, New York, NY, 235--244. DOI:https://doi.org/10.1145/1250734.1250761Google Scholar
- Tareq M. Malas, Georg Hager, Hatem Ltaief, and David E. Keyes. 2017. Multidimensional intratile parallelization for memory-starved stencil computations. ACM Transactions on Parallel Computing 4, 3 (Dec. 2017), Article 12, 32 pages. DOI:https://doi.org/10.1145/3155290Google ScholarDigital Library
- Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically scheduling halide image processing pipelines. ACM Transactions on Graphics 35, 4 (July 2016), Article 83, 11 pages. DOI:https://doi.org/10.1145/2897824.2925952Google ScholarDigital Library
- Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic optimization for image processing pipelines. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15). ACM, New York, NY, 429--443. DOI:https://doi.org/10.1145/2694344.2694364Google ScholarDigital Library
- Daniel A. Orozco and Guang R. Gao. 2009. Mapping the FDTD application to many-core chip architectures. In Proceedings of the International Conference on Parallel Processing. 309--316.Google Scholar
- Irshad Pananilath, Aravind Acharya, Vinay Vasista, and Uday Bondhugula. 2015. An optimizing code generator for a class of Lattice-Boltzmann computations. ACM Transactions on Architecture and Code Optimization 12, 2 (May 2015), Article 14, 23 pages. DOI:https://doi.org/10.1145/2739047Google ScholarDigital Library
- Sylvain Paris, Samuel W. Hasinoff, and Jan Kautz. 2015. Local Laplacian filters: Edge-aware image processing with a Laplacian pyramid. Communications of the ACM 58, 3 (Feb. 2015), 81--91. DOI:https://doi.org/10.1145/2723694Google ScholarDigital Library
- Sylvain Paris, Pierre Kornprobst, Jack Tumblin, and Frédo Durand. 2009. Bilateral filtering: Theory and applications. Foundations and Trends® in Computer Graphics and Vision 4, 1 (2009), 1--73.Google Scholar
- William Pugh and David Wonnacott. 1994. Static analysis of upper and lower bounds on dependences and parallelism. ACM Transactions on Programming Languages and Systems 16, 4 (July 1994), 1248--1278. DOI:https://doi.org/10.1145/183432.183525Google ScholarDigital Library
- Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. 2012. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Transactions on Graphics 31, 4 (July 2012), Article 32, 12 pages. DOI:https://doi.org/10.1145/2185520.2185528Google ScholarDigital Library
- Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’13). ACM, New York, NY, 519--530. DOI:https://doi.org/10.1145/2491956.2462176Google ScholarDigital Library
- Fabrice Rastello and Yves Robert. 2002. Automatic partitioning of parallel loops with parallelepiped-shaped tiles. IEEE Transactions on Parallel and Distributed Systems 13, 5 (2002), 460--470. DOI:https://doi.org/10.1109/TPDS.2002.1003856Google ScholarDigital Library
- Sunil Shrestha, Guang R. Gao, Joseph Manzano, Andres Marquez, and John Feo. 2015. Locality aware concurrent start for stencil applications. In Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’15). IEEE, Los Alamitos, CA, 157--166. http://dl.acm.org/citation.cfm?id=2738600.2738620.Google ScholarDigital Library
- Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. 2011. Cache accurate time skewing in iterative stencil computations. In Proceedings of the 2011 International Conference on Parallel Processing (ICPP’11). IEEE, Los Alamitos, CA, 571--581. DOI:https://doi.org/10.1109/ICPP.2011.47Google ScholarDigital Library
- Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization 9, 4 (Jan. 2013), Article 54, 23 pages. DOI:https://doi.org/10.1145/2400682.2400713Google ScholarDigital Library
- Xing Zhou, María J. Garzarán, and David A. Padua. 2015. Optimal parallelogram selection for hierarchical tiling. ACM Transactions on Architecture and Code Optimization 11, 4 (Jan. 2015), Article 58, 23 pages. DOI:https://doi.org/10.1145/2687414Google ScholarDigital Library
- Xing Zhou, Jean-Pierre Giacalone, María Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical overlapped tiling. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, New York, NY, 207--218. DOI:https://doi.org/10.1145/2259016.2259044Google ScholarDigital Library
- Oleksandr Zinenko, Sven Verdoolaege, Chandan Reddy, Jun Shirako, Tobias Grosser, Vivek Sarkar, and Albert Cohen. 2018. Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling. In Proceedings of the 27th International Conference on Compiler Construction (CC’18). 3--13. DOI:https://doi.org/10.1145/3178372.3179507Google ScholarDigital Library
Index Terms
- Flextended Tiles: A Flexible Extension of Overlapped Tiles for Polyhedral Compilation
Recommendations
PLUTO+: near-complete modeling of affine transformations for parallelism and locality
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingAffine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler ...
PLUTO+: near-complete modeling of affine transformations for parallelism and locality
PPoPP '15Affine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler ...
A practical tile size selection model for affine loop nests
ICS '21: Proceedings of the ACM International Conference on SupercomputingLoop tiling for locality is an important transformation for general-purpose and domain-specific compilation as it allows programs to exploit the benefits of deep memory hierarchies. Most code generation tools with the infrastructure to perform automatic ...
Comments