IOOpt: automatic derivation of I/O complexity bounds for affine programs

Authors:
Auguste Olivry

Université Grenoble Alpes, France / CNRS, France / Inria, France

Université Grenoble Alpes, France / CNRS, France / Inria, France
View Profile

,
Guillaume Iooss

Université Grenoble Alpes, France / CNRS, France / Inria, France

Université Grenoble Alpes, France / CNRS, France / Inria, France
View Profile

,
Nicolas Tollenaere

Université Grenoble Alpes, France / CNRS, France / Inria, France

Université Grenoble Alpes, France / CNRS, France / Inria, France
View Profile

,
Atanas Rountev

Ohio State University, USA

Ohio State University, USA
View Profile

,
P. Sadayappan

University of Utah, USA

University of Utah, USA
View Profile

,
Fabrice Rastello

Université Grenoble Alpes, France / CNRS, France / Inria, France

Université Grenoble Alpes, France / CNRS, France / Inria, France
View Profile

PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and ImplementationJune 2021Pages 1187–1202https://doi.org/10.1145/3453483.3454103

Published:18 June 2021Publication History

PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation

Pages 1187–1202

ABSTRACT

Evaluating the complexity of an algorithm is an important step when developing applications, as it impacts both its time and energy performance. Computational complexity, which is the number of dynamic operations regardless of the execution order, is easy to characterize for affine programs. Data movement (or, I/O) complexity is more complex to evaluate as it refers, when considering all possible valid schedules, to the minimum required number of I/O between a slow (e.g. main memory) and a fast (e.g. local scratchpad) storage location.

This paper presents IOOpt, a fully automated tool that automatically bounds the data movement of an affine (tilable) program. Given a tilable program described in a DSL, it automatically computes: 1. a lower bound of the I/O complexity as a symbolic expression of the cache size and program parameters; 2. an upper bound that allows one to assess the tightness of the lower bound; 3. a tiling recommendation (loop permutation and tile sizes) that matches the upper bound. For the lower bound algorithm which can be applied to any affine program, a substantial effort has been made to provide bounds that are as tight as possible for neural networks: In particular, it extends the previous work of Olivry et al. to handle multi-dimensional reductions and expose the constraints associated with small dimensions that are present in convolutions. For the upper bound algorithm that reasons on the tile band of the program (e.g. output of a polyhedral compiler such as PluTo), the algebraic computations involved have been tuned to behave well on tensor computations such as direct tensor contractions or direct convolutions. As a bonus, the upper bound algorithm that has been extended to multi-level cache can provide the programmer with a useful tiling recommendation.

We demonstrate the effectiveness of our tool by deriving the symbolic lower and upper bounds for several tensor contraction and convolution kernels. Then we evaluate numerically the tightness of our bound using the convolution layers of Yolo9000 and representative tensor contractions from the TCCG benchmark suite. Finally, we show the pertinence of our I/O complexity model by reporting the running time of the recommended tiled code for the convolution layers of Yolo9000.

References

Alok Aggarwal and Jeffrey S. Vitter. 1988. The Input/Output Complexity of Sorting and Related Problems. Commun. ACM, 31 (1988), 1116–1127. https://doi.org/10.1145/48529.48535 Google ScholarDigital Library
Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2011. Minimizing Communication in Numerical Linear Algebra. SIAM J. Matrix Analysis Applications, 32, 3 (2011), 866–901. https://doi.org/10.1137/090769156 Google ScholarCross Ref
Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2013. Graph Expansion and Communication Costs of Fast Matrix Multiplication. Journal of the ACM, 59, 6 (2013), Article 32, Jan., 23 pages. https://doi.org/10.1145/2395116.2395121 Google ScholarDigital Library
Wenlei Bao, Changwan Hong, Sudheer Chunduri, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, and P. Sadayappan. 2016. Static and Dynamic Frequency Scaling on Multicore CPUs. ACM Transactions on Architecture and Code Optimization, 13, 4 (2016), 26 pages. https://doi.org/10.1145/3011017 Google ScholarDigital Library
Wenlei Bao, Sriram Krishnamoorthy, Louis-Noël Pouchet, and P. Sadayappan. 2018. Analytical modeling of cache behavior for affine programs. Proceeding of the ACM on Programming Languages, 2, POPL (2018), 26 pages. https://doi.org/10.1145/3158120 Google ScholarDigital Library
Alexander I. Barvinok. 1994. A Polynomial Time Algorithm for Counting Integral Points in Polyhedra when the Dimension is Fixed. Mathematics of Operations Research, 19, 4 (1994), 769–779. https://doi.org/10.1287/moor.19.4.769 Google ScholarDigital Library
Cédric Bastoul, Albert Cohen, Sylvain Girbal, Saurabh Sharma, and Olivier Temam. 2003. Putting Polyhedral Loop Transformations to Work. In Languages and Compilers for Parallel Computing, 16th International Workshop (LCPC) (Lecture Notes in Computer Science, Vol. 2958). Springer, 209–225. https://doi.org/10.1007/978-3-540-24644-2_14 Google ScholarCross Ref
Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Program Optimization System. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). https://doi.org/10.1145/1375581.1375595 Google ScholarDigital Library
David Callahan, John Cocke, and Ken Kennedy. 1988. Estimating Interlock and Improving Balance for Pipelined Architectures. J. Parallel and Distrib. Comput., 5, 4 (1988), 334–358. https://doi.org/10.1016/0743-7315(88)90002-0 Google ScholarDigital Library
Siddhartha Chatterjee, Erin Parker, Philip J. Hanlon, and Alvin R. Lebeck. 2001. Exact Analysis of the Cache Behavior of Nested Loops. In Proceedings of the 2001 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, 286–297. https://doi.org/10.1145/378795.378859 Google ScholarDigital Library
Michael Christ, James Demmel, Nicholas Knight, Thomas Scanlon, and Katherine Yelick. 2013. Communication Lower Bounds and Optimal Algorithms for Programs that Reference Arrays – Part 1. arxiv:1308.0068v1.Google Scholar
James Demmel and Grace Dinh. 2018. Communication-Optimal Convolutional Neural Nets. arxiv:1802.06905v2.Google Scholar
James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou. 2012. Communication-optimal Parallel and Sequential QR and LU Factorizations. SIAM Journal on Scientific Computing, 34, 1 (2012), A206–A239. https://doi.org/10.1137/080731992 Google ScholarDigital Library
Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2015. On Characterizing the Data Access Complexity of Programs. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). 567–580. https://doi.org/10.1145/2676726.2677010 Google ScholarDigital Library
Naznin Fauzia, Venmugil Elango, Mahesh Ravishankar, J. Ramanujam, Fabrice Rastello, Atanas Rountev, Louis-Noël Pouchet, and P. Sadayappan. 2013. Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential. ACM Transaction on Architecture and Code Optimization, 10, 4 (2013), Dec., 29 pages. https://doi.org/10.1145/2541228.2555309 Google ScholarDigital Library
Paul Feautrier. 1988. Parametric Integer Programming. RAIRO Recherche Opérationnelle, 22, 3 (1988), 243–268.Google ScholarCross Ref
Jeanne Ferrante, Vivek Sarkar, and W. Thrash. 1991. On Estimating and Enhancing Cache Effectiveness. In Proceedings of the Languages and Compilers for Parallel Computing (LCPC), Fourth International Workshop (Lecture Notes in Computer Science, Vol. 589). Springer, 328–343. https://doi.org/10.1007/BFb0038674 Google ScholarCross Ref
Somnath Ghosh, Margaret Martonosi, and Sharad Malik. 1999. Cache Miss Equations: a Compiler Framework for Analyzing and Tuning Memory Behavior. ACM Transaction on Programming Languages and Systems, 21, 4 (1999), 703–746. https://doi.org/10.1145/325478.325479 Google ScholarDigital Library
Tobias Grosser, Armin Größ linger, and Christian Lengauer. 2012. Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation. Parallel Processing Letter, 22, 4 (2012), https://doi.org/10.1142/S0129626412500107 Google ScholarCross Ref
Jia-Wei Hong and H. T. Kung. 1981. I/O complexity: The Red-Blue Pebble Game. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC). ACM, 326–333. https://doi.org/10.1145/800076.802486 Google ScholarDigital Library
Intel. 2020. OneAPI Deep Neural Network Library (oneDNN). https://01.org/oneDNNGoogle Scholar
Dror Irony, Sivan Toledo, and Alexandre Tiskin. 2004. Communication Lower Bounds for Distributed-Memory Matrix Multiplication. J. Parallel and Distrib. Comput., 64, 9 (2004), 1017–1026. https://doi.org/10.1016/j.jpdc.2004.03.021 Google ScholarDigital Library
Grzegorz Kwasniewski, Marko Kabic, Maciej Besta, Joost VandeVondele, Raffaele Solcà, and Torsten Hoefler. 2019. Red-blue Pebbling Revisited: Near Optimal Parallel Matrix-Matrix Multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 24 pages. https://doi.org/10.1145/3295500.3356181 Google ScholarDigital Library
Rui Li, Aravind Sukumaran-Rajam, Richard Veras, Tze Meng Low, Fabrice Rastello, Atanas Rountev, and P. Sadayappan. 2019. Analytical Cache Modeling and Tilesize Optimization for Tensor Contractions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Michela Taufer, Pavan Balaji, and Antonio J. Peña (Eds.). ACM, 13 pages. https://doi.org/10.1145/3295500.3356218 Google ScholarDigital Library
Junyi Liu, John Wickerson, and George A. Constantinides. 2017. Tile Size Selection for Optimized Memory Reuse in High-level Synthesis. In 27th International Conference on Field Programmable Logic and Applications, FPL 2017, Ghent, Belgium, September 4-8, 2017, Marco D. Santambrogio, Diana Göhringer, Dirk Stroobandt, Nele Mentens, and Jari Nurmi (Eds.). IEEE, 1–8. https://doi.org/10.23919/FPL.2017.8056810 Google ScholarCross Ref
Sanyam Mehta, Gautham Beeraka, and Pen-Chung Yew. 2013. Tile Size Selection Revisited. ACM Trans. Archit. Code Optim., 10, 4 (2013), 35:1–35:27. https://doi.org/10.1145/2541228.2555292 Google ScholarDigital Library
Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán Roučka, Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, and Anthony Scopatz. 2017. SymPy: Symbolic Computing in Python. PeerJ Computer Science, 3 (2017), Jan., e103. issn:2376-5992 https://doi.org/10.7717/peerj-cs.103 Google ScholarCross Ref
Auguste Olivry, Julien Langou, Louis-Noël Pouchet, P. Sadayappan, and Fabrice Rastello. 2020. Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, 808––822. https://doi.org/10.1145/3385412.3385989 Google ScholarDigital Library
Nirmal Prajapati, Waruna Ranasinghe, Sanjay V. Rajopadhye, Rumen Andonov, Hristo Djidjev, and Tobias Grosser. 2017. Simple, Accurate, Analytical Time Modeling and Optimal Tile Size Selection for GPGPU Stencils. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Austin, TX, USA, February 4-8, 2017, Vivek Sarkar and Lawrence Rauchwerger (Eds.). ACM, 163–177. https://doi.org/10.1145/3018743.3018744 Google ScholarDigital Library
Desh Ranjan, John E. Savage, and Mohammad Zubair. 2011. Strong I/O Lower Bounds for Binomial and FFT Computation Graphs. In Computing and Combinatorics - 17th Annual International Conference, COCOON 2011, Dallas, TX, USA, August 14-16, 2011. Proceedings, Bin Fu and Ding-Zhu Du (Eds.) (Lecture Notes in Computer Science, Vol. 6842). Springer, 134–145. https://doi.org/10.1007/978-3-642-22685-4_12 Google ScholarCross Ref
Desh Ranjan, John E. Savage, and Mohammad Zubair. 2012. Upper and Lower I/O Bounds for Pebbling r-Pyramids. J. Discrete Algorithms, 14 (2012), 2–12. https://doi.org/10.1016/j.jda.2011.12.005 Google ScholarDigital Library
Joseph Redmon and Ali Farhadi. 2017. YOLO9000: Better, Faster, Stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 6517–6525. https://doi.org/10.1109/CVPR.2017.690 Google ScholarCross Ref
Lakshminarayanan Renganarayanan and Sanjay V. Rajopadhye. 2008. Positivity, Posynomials and Tile Size Selection. In Proceedings of the ACM/IEEE Conference on High Performance Computing (SC). IEEE/ACM, 55. https://doi.org/10.1109/SC.2008.5213293 Google ScholarCross Ref
John E. Savage. 1995. Extending the Hong-Kung Model to Memory Hierarchies. In Proceedings of the First Annual International Conference on Computing and Combinatorics (COCOON ’95). Springer-Verlag, Berlin, Heidelberg. 270–281. https://doi.org/10.1007/BFb0030842 Google ScholarCross Ref
Jun Shirako, Kamal Sharma, Naznin Fauzia, Louis-Noël Pouchet, J. Ramanujam, P. Sadayappan, and Vivek Sarkar. 2012. Analytical Bounds for Optimal Tile Size Selection. In Proceedings of Compiler Construction - 21st International Conference (CC) (Lecture Notes in Computer Science, Vol. 7210). Springer, 101–121. https://doi.org/10.1007/978-3-642-28652-0_6 Google ScholarDigital Library
Tyler Michael Smith, Bradley Lowery, Julien Langou, and Robert A. van de Geijn. 2019. A Tight I/O Lower Bound for Matrix Multiplication. arxiv:1702.02017v2.Google Scholar
Paul Springer and Paolo Bientinesi. 2016. Design of a High-Performance GEMM-like Tensor-Tensor Multiplication. arxiv:1607.00145.Google Scholar
Sven Verdoolaege. 2010. ISL: An Integer Set Library for the Polyhedral Model. In Mathematical Software–ICMS 2010. 299–302.Google ScholarDigital Library
Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral Parallel Code Generation for CUDA. ACM Transactions on Architecture and Code Optimization, 9, 4 (2013), Article 54, Jan., 23 pages. https://doi.org/10.1145/2400682.2400713 Google ScholarDigital Library
Sven Verdoolaege, Rachid Seghir, Kristof Beyls, Vincent Loechner, and Maurice Bruynooghe. 2007. Counting Integer Points in Parametric Polytopes Using Barvinok’s Rational Functions. Algorithmica, 48, 1 (2007), 37–66. https://doi.org/10.1007/s00453-006-1231-0 Google ScholarDigital Library
Andreas Wächter and Lorenz T. Biegler. 2006. On the Implementation of an Interior-Point Filter Line-Search Algorithm for large-scale nonlinear programming. Math. Program., 106, 1 (2006), 25–57. https://doi.org/10.1007/s10107-004-0559-y Google ScholarDigital Library
Jingling Xue. 2000. Loop Tiling for Parallelism (Kluwer International Series in Engineering and Computer Science, Vol. 575). Kluwer. https://doi.org/10.1007/978-1-4615-4337-4 Google ScholarDigital Library
Tomofumi Yuki, Lakshminarayanan Renganarayanan, Sanjay Rajopadhye, Charles Anderson, Alexandre E. Eichenberger, and Kevin O’Brien. 2010. Automatic Creation of Tile Size Selection Models. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO). ACM, 190–199. https://doi.org/10.1145/1772954.1772982 Google ScholarDigital Library

Index Terms

IOOpt: automatic derivation of I/O complexity bounds for affine programs
1. Software and its engineering
  1. Software organization and properties
    1. Software functional properties
      1. Formal methods
        Automated static analysis
2. Theory of computation
  1. Design and analysis of algorithms

Recommendations

Automated derivation of parametric data movement lower bounds for affine programs
PLDI 2020: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation

Researchers and practitioners have for long worked on improving the computational complexity of algorithms, focusing on reducing the number of operations needed to perform a computation. However the hardware trend nowadays clearly shows a higher ...
Read More
Non-affine Extensions to Polyhedral Code Generation
CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

This paper describes a loop transformation framework that extends a polyhedral representation of loop nests to represent and transform computations with non-affine index arrays in loop bounds and subscripts via a new interface between compile-time and ...
Read More
Monoparametric Tiling of Polyhedral Programs
Abstract
Tiling is a crucial program transformation, adjusting the ops-to-bytes balance of codes to improve locality. Like parallelism, it can be applied at multiple levels. Allowing tile sizes to be symbolic parameters at compile time has many benefits, ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation
June 2021
1341 pages
ISBN:9781450383912
DOI:10.1145/3453483
General Chair:
Stephen N. Freund
Williams College, USA
,
Program Chair:
Eran Yahav
Technion, Israel
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 June 2021
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Badges
- Artifacts Evaluated & Functional / v1.1
Author Tags
Compilation
Convolution
I/O complexity
Polyhedral model
Tensor Contraction
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate406of2,067submissions,20%
Upcoming Conference
PLDI '24

Sponsor:

sigplan

ACM SIGPLAN Conference on Programming Language Design and Implementation

June 24 - 28, 2024

Copenhagen , Denmark
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 2
  Total Citations
  View Citations
- 376
  Total Downloads
- Downloads (Last 12 months)138
- Downloads (Last 6 weeks)25
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

IOOpt: automatic derivation of I/O complexity bounds for affine programs

PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation

ABSTRACT

References

Cited By

Index Terms

Recommendations

Automated derivation of parametric data movement lower bounds for affine programs

Non-affine Extensions to Polyhedral Code Generation

Monoparametric Tiling of Polyhedral Programs