ABSTRACT
Evaluating the complexity of an algorithm is an important step when developing applications, as it impacts both its time and energy performance. Computational complexity, which is the number of dynamic operations regardless of the execution order, is easy to characterize for affine programs. Data movement (or, I/O) complexity is more complex to evaluate as it refers, when considering all possible valid schedules, to the minimum required number of I/O between a slow (e.g. main memory) and a fast (e.g. local scratchpad) storage location.
This paper presents IOOpt, a fully automated tool that automatically bounds the data movement of an affine (tilable) program. Given a tilable program described in a DSL, it automatically computes: 1. a lower bound of the I/O complexity as a symbolic expression of the cache size and program parameters; 2. an upper bound that allows one to assess the tightness of the lower bound; 3. a tiling recommendation (loop permutation and tile sizes) that matches the upper bound. For the lower bound algorithm which can be applied to any affine program, a substantial effort has been made to provide bounds that are as tight as possible for neural networks: In particular, it extends the previous work of Olivry et al. to handle multi-dimensional reductions and expose the constraints associated with small dimensions that are present in convolutions. For the upper bound algorithm that reasons on the tile band of the program (e.g. output of a polyhedral compiler such as PluTo), the algebraic computations involved have been tuned to behave well on tensor computations such as direct tensor contractions or direct convolutions. As a bonus, the upper bound algorithm that has been extended to multi-level cache can provide the programmer with a useful tiling recommendation.
We demonstrate the effectiveness of our tool by deriving the symbolic lower and upper bounds for several tensor contraction and convolution kernels. Then we evaluate numerically the tightness of our bound using the convolution layers of Yolo9000 and representative tensor contractions from the TCCG benchmark suite. Finally, we show the pertinence of our I/O complexity model by reporting the running time of the recommended tiled code for the convolution layers of Yolo9000.
- Alok Aggarwal and Jeffrey S. Vitter. 1988. The Input/Output Complexity of Sorting and Related Problems. Commun. ACM, 31 (1988), 1116–1127. https://doi.org/10.1145/48529.48535 Google ScholarDigital Library
- Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2011. Minimizing Communication in Numerical Linear Algebra. SIAM J. Matrix Analysis Applications, 32, 3 (2011), 866–901. https://doi.org/10.1137/090769156 Google ScholarCross Ref
- Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2013. Graph Expansion and Communication Costs of Fast Matrix Multiplication. Journal of the ACM, 59, 6 (2013), Article 32, Jan., 23 pages. https://doi.org/10.1145/2395116.2395121 Google ScholarDigital Library
- Wenlei Bao, Changwan Hong, Sudheer Chunduri, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, and P. Sadayappan. 2016. Static and Dynamic Frequency Scaling on Multicore CPUs. ACM Transactions on Architecture and Code Optimization, 13, 4 (2016), 26 pages. https://doi.org/10.1145/3011017 Google ScholarDigital Library
- Wenlei Bao, Sriram Krishnamoorthy, Louis-Noël Pouchet, and P. Sadayappan. 2018. Analytical modeling of cache behavior for affine programs. Proceeding of the ACM on Programming Languages, 2, POPL (2018), 26 pages. https://doi.org/10.1145/3158120 Google ScholarDigital Library
- Alexander I. Barvinok. 1994. A Polynomial Time Algorithm for Counting Integral Points in Polyhedra when the Dimension is Fixed. Mathematics of Operations Research, 19, 4 (1994), 769–779. https://doi.org/10.1287/moor.19.4.769 Google ScholarDigital Library
- Cédric Bastoul, Albert Cohen, Sylvain Girbal, Saurabh Sharma, and Olivier Temam. 2003. Putting Polyhedral Loop Transformations to Work. In Languages and Compilers for Parallel Computing, 16th International Workshop (LCPC) (Lecture Notes in Computer Science, Vol. 2958). Springer, 209–225. https://doi.org/10.1007/978-3-540-24644-2_14 Google ScholarCross Ref
- Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Program Optimization System. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). https://doi.org/10.1145/1375581.1375595 Google ScholarDigital Library
- David Callahan, John Cocke, and Ken Kennedy. 1988. Estimating Interlock and Improving Balance for Pipelined Architectures. J. Parallel and Distrib. Comput., 5, 4 (1988), 334–358. https://doi.org/10.1016/0743-7315(88)90002-0 Google ScholarDigital Library
- Siddhartha Chatterjee, Erin Parker, Philip J. Hanlon, and Alvin R. Lebeck. 2001. Exact Analysis of the Cache Behavior of Nested Loops. In Proceedings of the 2001 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, 286–297. https://doi.org/10.1145/378795.378859 Google ScholarDigital Library
- Michael Christ, James Demmel, Nicholas Knight, Thomas Scanlon, and Katherine Yelick. 2013. Communication Lower Bounds and Optimal Algorithms for Programs that Reference Arrays – Part 1. arxiv:1308.0068v1.Google Scholar
- James Demmel and Grace Dinh. 2018. Communication-Optimal Convolutional Neural Nets. arxiv:1802.06905v2.Google Scholar
- James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou. 2012. Communication-optimal Parallel and Sequential QR and LU Factorizations. SIAM Journal on Scientific Computing, 34, 1 (2012), A206–A239. https://doi.org/10.1137/080731992 Google ScholarDigital Library
- Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2015. On Characterizing the Data Access Complexity of Programs. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). 567–580. https://doi.org/10.1145/2676726.2677010 Google ScholarDigital Library
- Naznin Fauzia, Venmugil Elango, Mahesh Ravishankar, J. Ramanujam, Fabrice Rastello, Atanas Rountev, Louis-Noël Pouchet, and P. Sadayappan. 2013. Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential. ACM Transaction on Architecture and Code Optimization, 10, 4 (2013), Dec., 29 pages. https://doi.org/10.1145/2541228.2555309 Google ScholarDigital Library
- Paul Feautrier. 1988. Parametric Integer Programming. RAIRO Recherche Opérationnelle, 22, 3 (1988), 243–268.Google ScholarCross Ref
- Jeanne Ferrante, Vivek Sarkar, and W. Thrash. 1991. On Estimating and Enhancing Cache Effectiveness. In Proceedings of the Languages and Compilers for Parallel Computing (LCPC), Fourth International Workshop (Lecture Notes in Computer Science, Vol. 589). Springer, 328–343. https://doi.org/10.1007/BFb0038674 Google ScholarCross Ref
- Somnath Ghosh, Margaret Martonosi, and Sharad Malik. 1999. Cache Miss Equations: a Compiler Framework for Analyzing and Tuning Memory Behavior. ACM Transaction on Programming Languages and Systems, 21, 4 (1999), 703–746. https://doi.org/10.1145/325478.325479 Google ScholarDigital Library
- Tobias Grosser, Armin Größ linger, and Christian Lengauer. 2012. Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation. Parallel Processing Letter, 22, 4 (2012), https://doi.org/10.1142/S0129626412500107 Google ScholarCross Ref
- Jia-Wei Hong and H. T. Kung. 1981. I/O complexity: The Red-Blue Pebble Game. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC). ACM, 326–333. https://doi.org/10.1145/800076.802486 Google ScholarDigital Library
- Intel. 2020. OneAPI Deep Neural Network Library (oneDNN). https://01.org/oneDNNGoogle Scholar
- Dror Irony, Sivan Toledo, and Alexandre Tiskin. 2004. Communication Lower Bounds for Distributed-Memory Matrix Multiplication. J. Parallel and Distrib. Comput., 64, 9 (2004), 1017–1026. https://doi.org/10.1016/j.jpdc.2004.03.021 Google ScholarDigital Library
- Grzegorz Kwasniewski, Marko Kabic, Maciej Besta, Joost VandeVondele, Raffaele Solcà, and Torsten Hoefler. 2019. Red-blue Pebbling Revisited: Near Optimal Parallel Matrix-Matrix Multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 24 pages. https://doi.org/10.1145/3295500.3356181 Google ScholarDigital Library
- Rui Li, Aravind Sukumaran-Rajam, Richard Veras, Tze Meng Low, Fabrice Rastello, Atanas Rountev, and P. Sadayappan. 2019. Analytical Cache Modeling and Tilesize Optimization for Tensor Contractions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Michela Taufer, Pavan Balaji, and Antonio J. Peña (Eds.). ACM, 13 pages. https://doi.org/10.1145/3295500.3356218 Google ScholarDigital Library
- Junyi Liu, John Wickerson, and George A. Constantinides. 2017. Tile Size Selection for Optimized Memory Reuse in High-level Synthesis. In 27th International Conference on Field Programmable Logic and Applications, FPL 2017, Ghent, Belgium, September 4-8, 2017, Marco D. Santambrogio, Diana Göhringer, Dirk Stroobandt, Nele Mentens, and Jari Nurmi (Eds.). IEEE, 1–8. https://doi.org/10.23919/FPL.2017.8056810 Google ScholarCross Ref
- Sanyam Mehta, Gautham Beeraka, and Pen-Chung Yew. 2013. Tile Size Selection Revisited. ACM Trans. Archit. Code Optim., 10, 4 (2013), 35:1–35:27. https://doi.org/10.1145/2541228.2555292 Google ScholarDigital Library
- Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán Roučka, Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, and Anthony Scopatz. 2017. SymPy: Symbolic Computing in Python. PeerJ Computer Science, 3 (2017), Jan., e103. issn:2376-5992 https://doi.org/10.7717/peerj-cs.103 Google ScholarCross Ref
- Auguste Olivry, Julien Langou, Louis-Noël Pouchet, P. Sadayappan, and Fabrice Rastello. 2020. Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, 808––822. https://doi.org/10.1145/3385412.3385989 Google ScholarDigital Library
- Nirmal Prajapati, Waruna Ranasinghe, Sanjay V. Rajopadhye, Rumen Andonov, Hristo Djidjev, and Tobias Grosser. 2017. Simple, Accurate, Analytical Time Modeling and Optimal Tile Size Selection for GPGPU Stencils. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Austin, TX, USA, February 4-8, 2017, Vivek Sarkar and Lawrence Rauchwerger (Eds.). ACM, 163–177. https://doi.org/10.1145/3018743.3018744 Google ScholarDigital Library
- Desh Ranjan, John E. Savage, and Mohammad Zubair. 2011. Strong I/O Lower Bounds for Binomial and FFT Computation Graphs. In Computing and Combinatorics - 17th Annual International Conference, COCOON 2011, Dallas, TX, USA, August 14-16, 2011. Proceedings, Bin Fu and Ding-Zhu Du (Eds.) (Lecture Notes in Computer Science, Vol. 6842). Springer, 134–145. https://doi.org/10.1007/978-3-642-22685-4_12 Google ScholarCross Ref
- Desh Ranjan, John E. Savage, and Mohammad Zubair. 2012. Upper and Lower I/O Bounds for Pebbling r-Pyramids. J. Discrete Algorithms, 14 (2012), 2–12. https://doi.org/10.1016/j.jda.2011.12.005 Google ScholarDigital Library
- Joseph Redmon and Ali Farhadi. 2017. YOLO9000: Better, Faster, Stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 6517–6525. https://doi.org/10.1109/CVPR.2017.690 Google ScholarCross Ref
- Lakshminarayanan Renganarayanan and Sanjay V. Rajopadhye. 2008. Positivity, Posynomials and Tile Size Selection. In Proceedings of the ACM/IEEE Conference on High Performance Computing (SC). IEEE/ACM, 55. https://doi.org/10.1109/SC.2008.5213293 Google ScholarCross Ref
- John E. Savage. 1995. Extending the Hong-Kung Model to Memory Hierarchies. In Proceedings of the First Annual International Conference on Computing and Combinatorics (COCOON ’95). Springer-Verlag, Berlin, Heidelberg. 270–281. https://doi.org/10.1007/BFb0030842 Google ScholarCross Ref
- Jun Shirako, Kamal Sharma, Naznin Fauzia, Louis-Noël Pouchet, J. Ramanujam, P. Sadayappan, and Vivek Sarkar. 2012. Analytical Bounds for Optimal Tile Size Selection. In Proceedings of Compiler Construction - 21st International Conference (CC) (Lecture Notes in Computer Science, Vol. 7210). Springer, 101–121. https://doi.org/10.1007/978-3-642-28652-0_6 Google ScholarDigital Library
- Tyler Michael Smith, Bradley Lowery, Julien Langou, and Robert A. van de Geijn. 2019. A Tight I/O Lower Bound for Matrix Multiplication. arxiv:1702.02017v2.Google Scholar
- Paul Springer and Paolo Bientinesi. 2016. Design of a High-Performance GEMM-like Tensor-Tensor Multiplication. arxiv:1607.00145.Google Scholar
- Sven Verdoolaege. 2010. ISL: An Integer Set Library for the Polyhedral Model. In Mathematical Software–ICMS 2010. 299–302.Google ScholarDigital Library
- Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral Parallel Code Generation for CUDA. ACM Transactions on Architecture and Code Optimization, 9, 4 (2013), Article 54, Jan., 23 pages. https://doi.org/10.1145/2400682.2400713 Google ScholarDigital Library
- Sven Verdoolaege, Rachid Seghir, Kristof Beyls, Vincent Loechner, and Maurice Bruynooghe. 2007. Counting Integer Points in Parametric Polytopes Using Barvinok’s Rational Functions. Algorithmica, 48, 1 (2007), 37–66. https://doi.org/10.1007/s00453-006-1231-0 Google ScholarDigital Library
- Andreas Wächter and Lorenz T. Biegler. 2006. On the Implementation of an Interior-Point Filter Line-Search Algorithm for large-scale nonlinear programming. Math. Program., 106, 1 (2006), 25–57. https://doi.org/10.1007/s10107-004-0559-y Google ScholarDigital Library
- Jingling Xue. 2000. Loop Tiling for Parallelism (Kluwer International Series in Engineering and Computer Science, Vol. 575). Kluwer. https://doi.org/10.1007/978-1-4615-4337-4 Google ScholarDigital Library
- Tomofumi Yuki, Lakshminarayanan Renganarayanan, Sanjay Rajopadhye, Charles Anderson, Alexandre E. Eichenberger, and Kevin O’Brien. 2010. Automatic Creation of Tile Size Selection Models. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO). ACM, 190–199. https://doi.org/10.1145/1772954.1772982 Google ScholarDigital Library
Index Terms
- IOOpt: automatic derivation of I/O complexity bounds for affine programs
Recommendations
Automated derivation of parametric data movement lower bounds for affine programs
PLDI 2020: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and ImplementationResearchers and practitioners have for long worked on improving the computational complexity of algorithms, focusing on reducing the number of operations needed to perform a computation. However the hardware trend nowadays clearly shows a higher ...
Non-affine Extensions to Polyhedral Code Generation
CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and OptimizationThis paper describes a loop transformation framework that extends a polyhedral representation of loop nests to represent and transform computations with non-affine index arrays in loop bounds and subscripts via a new interface between compile-time and ...
Monoparametric Tiling of Polyhedral Programs
AbstractTiling is a crucial program transformation, adjusting the ops-to-bytes balance of codes to improve locality. Like parallelism, it can be applied at multiple levels. Allowing tile sizes to be symbolic parameters at compile time has many benefits, ...
Comments