Flextended Tiles: A Flexible Extension of Overlapped Tiles for Polyhedral Compilation

Authors:
Jie Zhao

INRIA 8 DI, École Normale Supérieure, Paris, France

INRIA 8 DI, École Normale Supérieure, Paris, France
View Profile

,
Albert Cohen

Google, Paris, France

Google, Paris, France

0000-0002-8866-5343
View Profile

ACM Transactions on Architecture and Code Optimization Volume 16 Issue 4Article No.: 47pp 1–25https://doi.org/10.1145/3369382

Published:17 December 2019Publication History

ACM Transactions on Architecture and Code Optimization

Abstract

Loop tiling to exploit data locality and parallelism plays an essential role in a variety of general-purpose and domain-specific compilers. Affine transformations in polyhedral frameworks implement classical forms of rectangular and parallelogram tiling, but these lead to pipelined start with rather inefficient wavefront parallelism. Multiple extensions to polyhedral compilers evaluated sophisticated shapes such as trapezoid or diamond tiles, enabling concurrent start along the axes of the iteration space; yet these resort to custom schedulers and code generators insufficiently integrated within the general framework. One of these modified shapes referred to as overlapped tiling also lacks a unifying framework to reason about its composition with affine transformations; this prevents its application in general-purpose loop-nest optimizers and the fair comparison with other techniques. We revisit overlapped tiling, recasting it as an affine transformation on schedule trees composable with any affine scheduling algorithm. We demonstrate how to derive tighter tile shapes with less redundant computations. Our method models the traditional “scalene trapezoid” shapes and novel “right-rectangle” variants. It goes beyond the state of the art by avoiding the restriction to a domain-specific language or introducing post-pass rescheduling and custom code generation. We conduct experiments on the PolyMage benchmarks and iterated stencils, validating the effectiveness and applicability of our technique on both general-purpose multicores and GPU accelerators.

References

Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. 2014. OpenTuner: An extensible framework for program autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). ACM, New York, NY, 303--316. DOI:https://doi.org/10.1145/2628071.2628092Google ScholarDigital Library
Mathieu Aubry, Sylvain Paris, Samuel W. Hasinoff, Jan Kautz, and Frédo Durand. 2014. Fast local Laplacian filters: Theory and applications. ACM Transactions on Graphics 33, 5 (Sept. 2014), Article 167, 14 pages. DOI:https://doi.org/10.1145/2629645Google ScholarDigital Library
David F. Bacon, Susan L. Graham, and Oliver J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Comput.ing Surveys 26, 4 (Dec. 1994), 345--420. DOI:https://doi.org/10.1145/197405.197406Google Scholar
Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis (SC’12). IEEE, Los Alamitos, CA, Article 40, 11 pages. http://dl.acm.org/citation.cfm?id=2388996.2389051.Google ScholarDigital Library
Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and optimizing time-iterated computations on periodic domains. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT ’14). ACM, New York, NY, 39--50. DOI:https://doi.org/10.1145/2628071.2628106Google ScholarDigital Library
Uday Bondhugula, Vinayaka Bandishti, and Irshad Pananilath. 2017. Diamond tiling: Tiling techniques to maximize parallelism for stencil computations. IEEE Transactions on Parallel and Distributed Systems 28, 5 (Oct. 2017), 1285--1298. DOI:https://doi.org/10.1109/TPDS.2016.2615094Google ScholarDigital Library
U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’08). ACM, New York, NY, 101--113. DOI:https://doi.org/10.1145/1375581.1375595Google ScholarDigital Library
Peter J. Burt and Edward H. Adelson. 1983. A multiresolution spline with application to image mosaics. ACM Transactions on Graphics 2, 4 (Oct. 1983), 217--236. DOI:https://doi.org/10.1145/245.247Google ScholarDigital Library
Jiawen Chen, Sylvain Paris, and Frédo Durand. 2007. Real-time edge-aware image processing with the bilateral grid. In ACM SIGGRAPH 2007 Papers (SIGGRAPH ’07). ACM, New York, NY, Article 103. DOI:https://doi.org/10.1145/1275808.1276506Google ScholarDigital Library
Eddie C. Davis, Michelle Mills Strout, and Catherine Olschanowsky. 2018. Transforming loop chains via macro dataflow graphs. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO’18). ACM, New York, NY, 265--277. DOI:https://doi.org/10.1145/3168832Google ScholarDigital Library
H. Eissfeller and S. M. Muller. 1990. The triangle method for saving startup time in parallel computers. In Proceedings of the 1990 5th Distributed Memory Computing Conference. 568--572.Google Scholar
Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. International Journal of Parallel Programming 21, 6 (Dec. 1992), 389--420. DOI:https://doi.org/10.1007/BF01379404Google Scholar
Pieter Ghysels and Wim Vanroose. 2015. Modeling the performance of geometric multigrid stencils on multicore computer architectures. SIAM Journal on Scientific Computing 37, 2 (2015), C194--C216.Google ScholarCross Ref
T. Grosser, A. Cohen, J. Holewinski, P. Sadayappan, and S. Verdoolaege. 2014. Hybrid hexagonal/classical tiling for GPUs. In Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’14). ACM, New York, NY, Article 66, 10 pages. DOI:https://doi.org/10.1145/2544137.2544160Google Scholar
T. Grosser, A. Cohen, P. H. J. Kelly, J. Ramanujam, P. Sadayappan, and S. Verdoolaege. 2013. Split tiling for GPUs: Automatic parallelization using trapezoidal tiles. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). ACM, New York, NY, 24--31. DOI:https://doi.org/10.1145/2458523.2458526Google Scholar
Tobias Grosser, Sven Verdoolaege, and Albert Cohen. 2015. Polyhedral AST generation is more than scanning polyhedra. ACM Transactions on Programming Languages and Systems 37, 4 (July 2015), Article 12, 50 pages. DOI:https://doi.org/10.1145/2743016Google ScholarDigital Library
T. Grosser, S. Verdoolaege, A. Cohen, and P. Sadayappan. 2014. The relation between diamond tiling and hexagonal tiling. Parallel Processing Letters 24, 3 (2014), 1441002.Google ScholarCross Ref
Chris Harris and Mike Stephens. 1988. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, Vol. 15. 1--6.Google ScholarCross Ref
J. Holewinski, L.-N. Pouchet, and P. Sadayappan. 2012. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS’12). ACM, New York, NY, 311--320. DOI:https://doi.org/10.1145/2304576.2304619Google Scholar
DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Multi-level tiling: M for the price of one. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC’07). ACM, New York, NY, Article 51, 12 pages. DOI:https://doi.org/10.1145/1362622.1362691Google ScholarDigital Library
S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. 2007. Effective automatic parallelization of stencil computations. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’07). ACM, New York, NY, 235--244. DOI:https://doi.org/10.1145/1250734.1250761Google Scholar
Tareq M. Malas, Georg Hager, Hatem Ltaief, and David E. Keyes. 2017. Multidimensional intratile parallelization for memory-starved stencil computations. ACM Transactions on Parallel Computing 4, 3 (Dec. 2017), Article 12, 32 pages. DOI:https://doi.org/10.1145/3155290Google ScholarDigital Library
Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically scheduling halide image processing pipelines. ACM Transactions on Graphics 35, 4 (July 2016), Article 83, 11 pages. DOI:https://doi.org/10.1145/2897824.2925952Google ScholarDigital Library
Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic optimization for image processing pipelines. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15). ACM, New York, NY, 429--443. DOI:https://doi.org/10.1145/2694344.2694364Google ScholarDigital Library
Daniel A. Orozco and Guang R. Gao. 2009. Mapping the FDTD application to many-core chip architectures. In Proceedings of the International Conference on Parallel Processing. 309--316.Google Scholar
Irshad Pananilath, Aravind Acharya, Vinay Vasista, and Uday Bondhugula. 2015. An optimizing code generator for a class of Lattice-Boltzmann computations. ACM Transactions on Architecture and Code Optimization 12, 2 (May 2015), Article 14, 23 pages. DOI:https://doi.org/10.1145/2739047Google ScholarDigital Library
Sylvain Paris, Samuel W. Hasinoff, and Jan Kautz. 2015. Local Laplacian filters: Edge-aware image processing with a Laplacian pyramid. Communications of the ACM 58, 3 (Feb. 2015), 81--91. DOI:https://doi.org/10.1145/2723694Google ScholarDigital Library
Sylvain Paris, Pierre Kornprobst, Jack Tumblin, and Frédo Durand. 2009. Bilateral filtering: Theory and applications. Foundations and Trends® in Computer Graphics and Vision 4, 1 (2009), 1--73.Google Scholar
William Pugh and David Wonnacott. 1994. Static analysis of upper and lower bounds on dependences and parallelism. ACM Transactions on Programming Languages and Systems 16, 4 (July 1994), 1248--1278. DOI:https://doi.org/10.1145/183432.183525Google ScholarDigital Library
Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. 2012. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Transactions on Graphics 31, 4 (July 2012), Article 32, 12 pages. DOI:https://doi.org/10.1145/2185520.2185528Google ScholarDigital Library
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’13). ACM, New York, NY, 519--530. DOI:https://doi.org/10.1145/2491956.2462176Google ScholarDigital Library
Fabrice Rastello and Yves Robert. 2002. Automatic partitioning of parallel loops with parallelepiped-shaped tiles. IEEE Transactions on Parallel and Distributed Systems 13, 5 (2002), 460--470. DOI:https://doi.org/10.1109/TPDS.2002.1003856Google ScholarDigital Library
Sunil Shrestha, Guang R. Gao, Joseph Manzano, Andres Marquez, and John Feo. 2015. Locality aware concurrent start for stencil applications. In Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’15). IEEE, Los Alamitos, CA, 157--166. http://dl.acm.org/citation.cfm?id=2738600.2738620.Google ScholarDigital Library
Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. 2011. Cache accurate time skewing in iterative stencil computations. In Proceedings of the 2011 International Conference on Parallel Processing (ICPP’11). IEEE, Los Alamitos, CA, 571--581. DOI:https://doi.org/10.1109/ICPP.2011.47Google ScholarDigital Library
Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization 9, 4 (Jan. 2013), Article 54, 23 pages. DOI:https://doi.org/10.1145/2400682.2400713Google ScholarDigital Library
Xing Zhou, María J. Garzarán, and David A. Padua. 2015. Optimal parallelogram selection for hierarchical tiling. ACM Transactions on Architecture and Code Optimization 11, 4 (Jan. 2015), Article 58, 23 pages. DOI:https://doi.org/10.1145/2687414Google ScholarDigital Library
Xing Zhou, Jean-Pierre Giacalone, María Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical overlapped tiling. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, New York, NY, 207--218. DOI:https://doi.org/10.1145/2259016.2259044Google ScholarDigital Library
Oleksandr Zinenko, Sven Verdoolaege, Chandan Reddy, Jun Shirako, Tobias Grosser, Vivek Sarkar, and Albert Cohen. 2018. Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling. In Proceedings of the 27th International Conference on Compiler Construction (CC’18). 3--13. DOI:https://doi.org/10.1145/3178372.3179507Google ScholarDigital Library

Index Terms

Flextended Tiles: A Flexible Extension of Overlapped Tiles for Polyhedral Compilation
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

PLUTO+: near-complete modeling of affine transformations for parallelism and locality
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Affine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler ...
Read More
PLUTO+: near-complete modeling of affine transformations for parallelism and locality
PPoPP '15

Affine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler ...
Read More
A practical tile size selection model for affine loop nests
ICS '21: Proceedings of the ACM International Conference on Supercomputing

Loop tiling for locality is an important transformation for general-purpose and domain-specific compilation as it allows programs to exploit the benefits of deep memory hierarchies. Most code generation tools with the infrastructure to perform automatic ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Architecture and Code Optimization Volume 16, Issue 4
December 2019
572 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3366460
Editor:
Koen De Bosschere
Ghent University, Belgium
Issue’s Table of Contents
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 17 December 2019
- Revised: 1 October 2019
- Accepted: 1 October 2019
- Received: 1 March 2019
Published in taco Volume 16, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Polyhedral compilation
automatic parallelization
loop tiling
stencil computations
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 12
  Total Citations
  View Citations
- 1,235
  Total Downloads
- Downloads (Last 12 months)211
- Downloads (Last 6 weeks)38
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format