skip to main content
research-article
Open Access

Flextended Tiles: A Flexible Extension of Overlapped Tiles for Polyhedral Compilation

Published:17 December 2019Publication History
Skip Abstract Section

Abstract

Loop tiling to exploit data locality and parallelism plays an essential role in a variety of general-purpose and domain-specific compilers. Affine transformations in polyhedral frameworks implement classical forms of rectangular and parallelogram tiling, but these lead to pipelined start with rather inefficient wavefront parallelism. Multiple extensions to polyhedral compilers evaluated sophisticated shapes such as trapezoid or diamond tiles, enabling concurrent start along the axes of the iteration space; yet these resort to custom schedulers and code generators insufficiently integrated within the general framework. One of these modified shapes referred to as overlapped tiling also lacks a unifying framework to reason about its composition with affine transformations; this prevents its application in general-purpose loop-nest optimizers and the fair comparison with other techniques. We revisit overlapped tiling, recasting it as an affine transformation on schedule trees composable with any affine scheduling algorithm. We demonstrate how to derive tighter tile shapes with less redundant computations. Our method models the traditional “scalene trapezoid” shapes and novel “right-rectangle” variants. It goes beyond the state of the art by avoiding the restriction to a domain-specific language or introducing post-pass rescheduling and custom code generation. We conduct experiments on the PolyMage benchmarks and iterated stencils, validating the effectiveness and applicability of our technique on both general-purpose multicores and GPU accelerators.

References

  1. Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. 2014. OpenTuner: An extensible framework for program autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). ACM, New York, NY, 303--316. DOI:https://doi.org/10.1145/2628071.2628092Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Mathieu Aubry, Sylvain Paris, Samuel W. Hasinoff, Jan Kautz, and Frédo Durand. 2014. Fast local Laplacian filters: Theory and applications. ACM Transactions on Graphics 33, 5 (Sept. 2014), Article 167, 14 pages. DOI:https://doi.org/10.1145/2629645Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. David F. Bacon, Susan L. Graham, and Oliver J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Comput.ing Surveys 26, 4 (Dec. 1994), 345--420. DOI:https://doi.org/10.1145/197405.197406Google ScholarGoogle Scholar
  4. Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis (SC’12). IEEE, Los Alamitos, CA, Article 40, 11 pages. http://dl.acm.org/citation.cfm?id=2388996.2389051.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and optimizing time-iterated computations on periodic domains. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT ’14). ACM, New York, NY, 39--50. DOI:https://doi.org/10.1145/2628071.2628106Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Uday Bondhugula, Vinayaka Bandishti, and Irshad Pananilath. 2017. Diamond tiling: Tiling techniques to maximize parallelism for stencil computations. IEEE Transactions on Parallel and Distributed Systems 28, 5 (Oct. 2017), 1285--1298. DOI:https://doi.org/10.1109/TPDS.2016.2615094Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’08). ACM, New York, NY, 101--113. DOI:https://doi.org/10.1145/1375581.1375595Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Peter J. Burt and Edward H. Adelson. 1983. A multiresolution spline with application to image mosaics. ACM Transactions on Graphics 2, 4 (Oct. 1983), 217--236. DOI:https://doi.org/10.1145/245.247Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Jiawen Chen, Sylvain Paris, and Frédo Durand. 2007. Real-time edge-aware image processing with the bilateral grid. In ACM SIGGRAPH 2007 Papers (SIGGRAPH ’07). ACM, New York, NY, Article 103. DOI:https://doi.org/10.1145/1275808.1276506Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Eddie C. Davis, Michelle Mills Strout, and Catherine Olschanowsky. 2018. Transforming loop chains via macro dataflow graphs. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO’18). ACM, New York, NY, 265--277. DOI:https://doi.org/10.1145/3168832Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. H. Eissfeller and S. M. Muller. 1990. The triangle method for saving startup time in parallel computers. In Proceedings of the 1990 5th Distributed Memory Computing Conference. 568--572.Google ScholarGoogle Scholar
  12. Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. International Journal of Parallel Programming 21, 6 (Dec. 1992), 389--420. DOI:https://doi.org/10.1007/BF01379404Google ScholarGoogle Scholar
  13. Pieter Ghysels and Wim Vanroose. 2015. Modeling the performance of geometric multigrid stencils on multicore computer architectures. SIAM Journal on Scientific Computing 37, 2 (2015), C194--C216.Google ScholarGoogle ScholarCross RefCross Ref
  14. T. Grosser, A. Cohen, J. Holewinski, P. Sadayappan, and S. Verdoolaege. 2014. Hybrid hexagonal/classical tiling for GPUs. In Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’14). ACM, New York, NY, Article 66, 10 pages. DOI:https://doi.org/10.1145/2544137.2544160Google ScholarGoogle Scholar
  15. T. Grosser, A. Cohen, P. H. J. Kelly, J. Ramanujam, P. Sadayappan, and S. Verdoolaege. 2013. Split tiling for GPUs: Automatic parallelization using trapezoidal tiles. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). ACM, New York, NY, 24--31. DOI:https://doi.org/10.1145/2458523.2458526Google ScholarGoogle Scholar
  16. Tobias Grosser, Sven Verdoolaege, and Albert Cohen. 2015. Polyhedral AST generation is more than scanning polyhedra. ACM Transactions on Programming Languages and Systems 37, 4 (July 2015), Article 12, 50 pages. DOI:https://doi.org/10.1145/2743016Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. T. Grosser, S. Verdoolaege, A. Cohen, and P. Sadayappan. 2014. The relation between diamond tiling and hexagonal tiling. Parallel Processing Letters 24, 3 (2014), 1441002.Google ScholarGoogle ScholarCross RefCross Ref
  18. Chris Harris and Mike Stephens. 1988. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, Vol. 15. 1--6.Google ScholarGoogle ScholarCross RefCross Ref
  19. J. Holewinski, L.-N. Pouchet, and P. Sadayappan. 2012. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS’12). ACM, New York, NY, 311--320. DOI:https://doi.org/10.1145/2304576.2304619Google ScholarGoogle Scholar
  20. DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Multi-level tiling: M for the price of one. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC’07). ACM, New York, NY, Article 51, 12 pages. DOI:https://doi.org/10.1145/1362622.1362691Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. 2007. Effective automatic parallelization of stencil computations. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’07). ACM, New York, NY, 235--244. DOI:https://doi.org/10.1145/1250734.1250761Google ScholarGoogle Scholar
  22. Tareq M. Malas, Georg Hager, Hatem Ltaief, and David E. Keyes. 2017. Multidimensional intratile parallelization for memory-starved stencil computations. ACM Transactions on Parallel Computing 4, 3 (Dec. 2017), Article 12, 32 pages. DOI:https://doi.org/10.1145/3155290Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically scheduling halide image processing pipelines. ACM Transactions on Graphics 35, 4 (July 2016), Article 83, 11 pages. DOI:https://doi.org/10.1145/2897824.2925952Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic optimization for image processing pipelines. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15). ACM, New York, NY, 429--443. DOI:https://doi.org/10.1145/2694344.2694364Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Daniel A. Orozco and Guang R. Gao. 2009. Mapping the FDTD application to many-core chip architectures. In Proceedings of the International Conference on Parallel Processing. 309--316.Google ScholarGoogle Scholar
  26. Irshad Pananilath, Aravind Acharya, Vinay Vasista, and Uday Bondhugula. 2015. An optimizing code generator for a class of Lattice-Boltzmann computations. ACM Transactions on Architecture and Code Optimization 12, 2 (May 2015), Article 14, 23 pages. DOI:https://doi.org/10.1145/2739047Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Sylvain Paris, Samuel W. Hasinoff, and Jan Kautz. 2015. Local Laplacian filters: Edge-aware image processing with a Laplacian pyramid. Communications of the ACM 58, 3 (Feb. 2015), 81--91. DOI:https://doi.org/10.1145/2723694Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Sylvain Paris, Pierre Kornprobst, Jack Tumblin, and Frédo Durand. 2009. Bilateral filtering: Theory and applications. Foundations and Trends® in Computer Graphics and Vision 4, 1 (2009), 1--73.Google ScholarGoogle Scholar
  29. William Pugh and David Wonnacott. 1994. Static analysis of upper and lower bounds on dependences and parallelism. ACM Transactions on Programming Languages and Systems 16, 4 (July 1994), 1248--1278. DOI:https://doi.org/10.1145/183432.183525Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. 2012. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Transactions on Graphics 31, 4 (July 2012), Article 32, 12 pages. DOI:https://doi.org/10.1145/2185520.2185528Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’13). ACM, New York, NY, 519--530. DOI:https://doi.org/10.1145/2491956.2462176Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Fabrice Rastello and Yves Robert. 2002. Automatic partitioning of parallel loops with parallelepiped-shaped tiles. IEEE Transactions on Parallel and Distributed Systems 13, 5 (2002), 460--470. DOI:https://doi.org/10.1109/TPDS.2002.1003856Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Sunil Shrestha, Guang R. Gao, Joseph Manzano, Andres Marquez, and John Feo. 2015. Locality aware concurrent start for stencil applications. In Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’15). IEEE, Los Alamitos, CA, 157--166. http://dl.acm.org/citation.cfm?id=2738600.2738620.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. 2011. Cache accurate time skewing in iterative stencil computations. In Proceedings of the 2011 International Conference on Parallel Processing (ICPP’11). IEEE, Los Alamitos, CA, 571--581. DOI:https://doi.org/10.1109/ICPP.2011.47Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization 9, 4 (Jan. 2013), Article 54, 23 pages. DOI:https://doi.org/10.1145/2400682.2400713Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Xing Zhou, María J. Garzarán, and David A. Padua. 2015. Optimal parallelogram selection for hierarchical tiling. ACM Transactions on Architecture and Code Optimization 11, 4 (Jan. 2015), Article 58, 23 pages. DOI:https://doi.org/10.1145/2687414Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Xing Zhou, Jean-Pierre Giacalone, María Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical overlapped tiling. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, New York, NY, 207--218. DOI:https://doi.org/10.1145/2259016.2259044Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Oleksandr Zinenko, Sven Verdoolaege, Chandan Reddy, Jun Shirako, Tobias Grosser, Vivek Sarkar, and Albert Cohen. 2018. Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling. In Proceedings of the 27th International Conference on Compiler Construction (CC’18). 3--13. DOI:https://doi.org/10.1145/3178372.3179507Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Flextended Tiles: A Flexible Extension of Overlapped Tiles for Polyhedral Compilation

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Architecture and Code Optimization
      ACM Transactions on Architecture and Code Optimization  Volume 16, Issue 4
      December 2019
      572 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/3366460
      Issue’s Table of Contents

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 17 December 2019
      • Revised: 1 October 2019
      • Accepted: 1 October 2019
      • Received: 1 March 2019
      Published in taco Volume 16, Issue 4

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    HTML Format

    View this article in HTML Format .

    View HTML Format