skip to main content
10.1145/3453483.3454103acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
research-article
Public Access
Artifacts Evaluated & Functional / v1.1

IOOpt: automatic derivation of I/O complexity bounds for affine programs

Published:18 June 2021Publication History

ABSTRACT

Evaluating the complexity of an algorithm is an important step when developing applications, as it impacts both its time and energy performance. Computational complexity, which is the number of dynamic operations regardless of the execution order, is easy to characterize for affine programs. Data movement (or, I/O) complexity is more complex to evaluate as it refers, when considering all possible valid schedules, to the minimum required number of I/O between a slow (e.g. main memory) and a fast (e.g. local scratchpad) storage location.

This paper presents IOOpt, a fully automated tool that automatically bounds the data movement of an affine (tilable) program. Given a tilable program described in a DSL, it automatically computes: 1. a lower bound of the I/O complexity as a symbolic expression of the cache size and program parameters; 2. an upper bound that allows one to assess the tightness of the lower bound; 3. a tiling recommendation (loop permutation and tile sizes) that matches the upper bound. For the lower bound algorithm which can be applied to any affine program, a substantial effort has been made to provide bounds that are as tight as possible for neural networks: In particular, it extends the previous work of Olivry et al. to handle multi-dimensional reductions and expose the constraints associated with small dimensions that are present in convolutions. For the upper bound algorithm that reasons on the tile band of the program (e.g. output of a polyhedral compiler such as PluTo), the algebraic computations involved have been tuned to behave well on tensor computations such as direct tensor contractions or direct convolutions. As a bonus, the upper bound algorithm that has been extended to multi-level cache can provide the programmer with a useful tiling recommendation.

We demonstrate the effectiveness of our tool by deriving the symbolic lower and upper bounds for several tensor contraction and convolution kernels. Then we evaluate numerically the tightness of our bound using the convolution layers of Yolo9000 and representative tensor contractions from the TCCG benchmark suite. Finally, we show the pertinence of our I/O complexity model by reporting the running time of the recommended tiled code for the convolution layers of Yolo9000.

References

  1. Alok Aggarwal and Jeffrey S. Vitter. 1988. The Input/Output Complexity of Sorting and Related Problems. Commun. ACM, 31 (1988), 1116–1127. https://doi.org/10.1145/48529.48535 Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2011. Minimizing Communication in Numerical Linear Algebra. SIAM J. Matrix Analysis Applications, 32, 3 (2011), 866–901. https://doi.org/10.1137/090769156 Google ScholarGoogle ScholarCross RefCross Ref
  3. Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2013. Graph Expansion and Communication Costs of Fast Matrix Multiplication. Journal of the ACM, 59, 6 (2013), Article 32, Jan., 23 pages. https://doi.org/10.1145/2395116.2395121 Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Wenlei Bao, Changwan Hong, Sudheer Chunduri, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, and P. Sadayappan. 2016. Static and Dynamic Frequency Scaling on Multicore CPUs. ACM Transactions on Architecture and Code Optimization, 13, 4 (2016), 26 pages. https://doi.org/10.1145/3011017 Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Wenlei Bao, Sriram Krishnamoorthy, Louis-Noël Pouchet, and P. Sadayappan. 2018. Analytical modeling of cache behavior for affine programs. Proceeding of the ACM on Programming Languages, 2, POPL (2018), 26 pages. https://doi.org/10.1145/3158120 Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Alexander I. Barvinok. 1994. A Polynomial Time Algorithm for Counting Integral Points in Polyhedra when the Dimension is Fixed. Mathematics of Operations Research, 19, 4 (1994), 769–779. https://doi.org/10.1287/moor.19.4.769 Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Cédric Bastoul, Albert Cohen, Sylvain Girbal, Saurabh Sharma, and Olivier Temam. 2003. Putting Polyhedral Loop Transformations to Work. In Languages and Compilers for Parallel Computing, 16th International Workshop (LCPC) (Lecture Notes in Computer Science, Vol. 2958). Springer, 209–225. https://doi.org/10.1007/978-3-540-24644-2_14 Google ScholarGoogle ScholarCross RefCross Ref
  8. Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Program Optimization System. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). https://doi.org/10.1145/1375581.1375595 Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. David Callahan, John Cocke, and Ken Kennedy. 1988. Estimating Interlock and Improving Balance for Pipelined Architectures. J. Parallel and Distrib. Comput., 5, 4 (1988), 334–358. https://doi.org/10.1016/0743-7315(88)90002-0 Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Siddhartha Chatterjee, Erin Parker, Philip J. Hanlon, and Alvin R. Lebeck. 2001. Exact Analysis of the Cache Behavior of Nested Loops. In Proceedings of the 2001 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, 286–297. https://doi.org/10.1145/378795.378859 Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Michael Christ, James Demmel, Nicholas Knight, Thomas Scanlon, and Katherine Yelick. 2013. Communication Lower Bounds and Optimal Algorithms for Programs that Reference Arrays – Part 1. arxiv:1308.0068v1.Google ScholarGoogle Scholar
  12. James Demmel and Grace Dinh. 2018. Communication-Optimal Convolutional Neural Nets. arxiv:1802.06905v2.Google ScholarGoogle Scholar
  13. James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou. 2012. Communication-optimal Parallel and Sequential QR and LU Factorizations. SIAM Journal on Scientific Computing, 34, 1 (2012), A206–A239. https://doi.org/10.1137/080731992 Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2015. On Characterizing the Data Access Complexity of Programs. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). 567–580. https://doi.org/10.1145/2676726.2677010 Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Naznin Fauzia, Venmugil Elango, Mahesh Ravishankar, J. Ramanujam, Fabrice Rastello, Atanas Rountev, Louis-Noël Pouchet, and P. Sadayappan. 2013. Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential. ACM Transaction on Architecture and Code Optimization, 10, 4 (2013), Dec., 29 pages. https://doi.org/10.1145/2541228.2555309 Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Paul Feautrier. 1988. Parametric Integer Programming. RAIRO Recherche Opérationnelle, 22, 3 (1988), 243–268.Google ScholarGoogle ScholarCross RefCross Ref
  17. Jeanne Ferrante, Vivek Sarkar, and W. Thrash. 1991. On Estimating and Enhancing Cache Effectiveness. In Proceedings of the Languages and Compilers for Parallel Computing (LCPC), Fourth International Workshop (Lecture Notes in Computer Science, Vol. 589). Springer, 328–343. https://doi.org/10.1007/BFb0038674 Google ScholarGoogle ScholarCross RefCross Ref
  18. Somnath Ghosh, Margaret Martonosi, and Sharad Malik. 1999. Cache Miss Equations: a Compiler Framework for Analyzing and Tuning Memory Behavior. ACM Transaction on Programming Languages and Systems, 21, 4 (1999), 703–746. https://doi.org/10.1145/325478.325479 Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Tobias Grosser, Armin Größ linger, and Christian Lengauer. 2012. Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation. Parallel Processing Letter, 22, 4 (2012), https://doi.org/10.1142/S0129626412500107 Google ScholarGoogle ScholarCross RefCross Ref
  20. Jia-Wei Hong and H. T. Kung. 1981. I/O complexity: The Red-Blue Pebble Game. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC). ACM, 326–333. https://doi.org/10.1145/800076.802486 Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Intel. 2020. OneAPI Deep Neural Network Library (oneDNN). https://01.org/oneDNNGoogle ScholarGoogle Scholar
  22. Dror Irony, Sivan Toledo, and Alexandre Tiskin. 2004. Communication Lower Bounds for Distributed-Memory Matrix Multiplication. J. Parallel and Distrib. Comput., 64, 9 (2004), 1017–1026. https://doi.org/10.1016/j.jpdc.2004.03.021 Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Grzegorz Kwasniewski, Marko Kabic, Maciej Besta, Joost VandeVondele, Raffaele Solcà, and Torsten Hoefler. 2019. Red-blue Pebbling Revisited: Near Optimal Parallel Matrix-Matrix Multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 24 pages. https://doi.org/10.1145/3295500.3356181 Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Rui Li, Aravind Sukumaran-Rajam, Richard Veras, Tze Meng Low, Fabrice Rastello, Atanas Rountev, and P. Sadayappan. 2019. Analytical Cache Modeling and Tilesize Optimization for Tensor Contractions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Michela Taufer, Pavan Balaji, and Antonio J. Peña (Eds.). ACM, 13 pages. https://doi.org/10.1145/3295500.3356218 Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Junyi Liu, John Wickerson, and George A. Constantinides. 2017. Tile Size Selection for Optimized Memory Reuse in High-level Synthesis. In 27th International Conference on Field Programmable Logic and Applications, FPL 2017, Ghent, Belgium, September 4-8, 2017, Marco D. Santambrogio, Diana Göhringer, Dirk Stroobandt, Nele Mentens, and Jari Nurmi (Eds.). IEEE, 1–8. https://doi.org/10.23919/FPL.2017.8056810 Google ScholarGoogle ScholarCross RefCross Ref
  26. Sanyam Mehta, Gautham Beeraka, and Pen-Chung Yew. 2013. Tile Size Selection Revisited. ACM Trans. Archit. Code Optim., 10, 4 (2013), 35:1–35:27. https://doi.org/10.1145/2541228.2555292 Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán Roučka, Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, and Anthony Scopatz. 2017. SymPy: Symbolic Computing in Python. PeerJ Computer Science, 3 (2017), Jan., e103. issn:2376-5992 https://doi.org/10.7717/peerj-cs.103 Google ScholarGoogle ScholarCross RefCross Ref
  28. Auguste Olivry, Julien Langou, Louis-Noël Pouchet, P. Sadayappan, and Fabrice Rastello. 2020. Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, 808––822. https://doi.org/10.1145/3385412.3385989 Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Nirmal Prajapati, Waruna Ranasinghe, Sanjay V. Rajopadhye, Rumen Andonov, Hristo Djidjev, and Tobias Grosser. 2017. Simple, Accurate, Analytical Time Modeling and Optimal Tile Size Selection for GPGPU Stencils. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Austin, TX, USA, February 4-8, 2017, Vivek Sarkar and Lawrence Rauchwerger (Eds.). ACM, 163–177. https://doi.org/10.1145/3018743.3018744 Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Desh Ranjan, John E. Savage, and Mohammad Zubair. 2011. Strong I/O Lower Bounds for Binomial and FFT Computation Graphs. In Computing and Combinatorics - 17th Annual International Conference, COCOON 2011, Dallas, TX, USA, August 14-16, 2011. Proceedings, Bin Fu and Ding-Zhu Du (Eds.) (Lecture Notes in Computer Science, Vol. 6842). Springer, 134–145. https://doi.org/10.1007/978-3-642-22685-4_12 Google ScholarGoogle ScholarCross RefCross Ref
  31. Desh Ranjan, John E. Savage, and Mohammad Zubair. 2012. Upper and Lower I/O Bounds for Pebbling r-Pyramids. J. Discrete Algorithms, 14 (2012), 2–12. https://doi.org/10.1016/j.jda.2011.12.005 Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Joseph Redmon and Ali Farhadi. 2017. YOLO9000: Better, Faster, Stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 6517–6525. https://doi.org/10.1109/CVPR.2017.690 Google ScholarGoogle ScholarCross RefCross Ref
  33. Lakshminarayanan Renganarayanan and Sanjay V. Rajopadhye. 2008. Positivity, Posynomials and Tile Size Selection. In Proceedings of the ACM/IEEE Conference on High Performance Computing (SC). IEEE/ACM, 55. https://doi.org/10.1109/SC.2008.5213293 Google ScholarGoogle ScholarCross RefCross Ref
  34. John E. Savage. 1995. Extending the Hong-Kung Model to Memory Hierarchies. In Proceedings of the First Annual International Conference on Computing and Combinatorics (COCOON ’95). Springer-Verlag, Berlin, Heidelberg. 270–281. https://doi.org/10.1007/BFb0030842 Google ScholarGoogle ScholarCross RefCross Ref
  35. Jun Shirako, Kamal Sharma, Naznin Fauzia, Louis-Noël Pouchet, J. Ramanujam, P. Sadayappan, and Vivek Sarkar. 2012. Analytical Bounds for Optimal Tile Size Selection. In Proceedings of Compiler Construction - 21st International Conference (CC) (Lecture Notes in Computer Science, Vol. 7210). Springer, 101–121. https://doi.org/10.1007/978-3-642-28652-0_6 Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Tyler Michael Smith, Bradley Lowery, Julien Langou, and Robert A. van de Geijn. 2019. A Tight I/O Lower Bound for Matrix Multiplication. arxiv:1702.02017v2.Google ScholarGoogle Scholar
  37. Paul Springer and Paolo Bientinesi. 2016. Design of a High-Performance GEMM-like Tensor-Tensor Multiplication. arxiv:1607.00145.Google ScholarGoogle Scholar
  38. Sven Verdoolaege. 2010. ISL: An Integer Set Library for the Polyhedral Model. In Mathematical Software–ICMS 2010. 299–302.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral Parallel Code Generation for CUDA. ACM Transactions on Architecture and Code Optimization, 9, 4 (2013), Article 54, Jan., 23 pages. https://doi.org/10.1145/2400682.2400713 Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Sven Verdoolaege, Rachid Seghir, Kristof Beyls, Vincent Loechner, and Maurice Bruynooghe. 2007. Counting Integer Points in Parametric Polytopes Using Barvinok’s Rational Functions. Algorithmica, 48, 1 (2007), 37–66. https://doi.org/10.1007/s00453-006-1231-0 Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Andreas Wächter and Lorenz T. Biegler. 2006. On the Implementation of an Interior-Point Filter Line-Search Algorithm for large-scale nonlinear programming. Math. Program., 106, 1 (2006), 25–57. https://doi.org/10.1007/s10107-004-0559-y Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Jingling Xue. 2000. Loop Tiling for Parallelism (Kluwer International Series in Engineering and Computer Science, Vol. 575). Kluwer. https://doi.org/10.1007/978-1-4615-4337-4 Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Tomofumi Yuki, Lakshminarayanan Renganarayanan, Sanjay Rajopadhye, Charles Anderson, Alexandre E. Eichenberger, and Kevin O’Brien. 2010. Automatic Creation of Tile Size Selection Models. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO). ACM, 190–199. https://doi.org/10.1145/1772954.1772982 Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. IOOpt: automatic derivation of I/O complexity bounds for affine programs

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation
        June 2021
        1341 pages
        ISBN:9781450383912
        DOI:10.1145/3453483

        Copyright © 2021 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 June 2021

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate406of2,067submissions,20%

        Upcoming Conference

        PLDI '24

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader