skip to main content
10.1145/3295500.3356218acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

Analytical cache modeling and tilesize optimization for tensor contractions

Published:17 November 2019Publication History

ABSTRACT

Data movement between processor and memory hierarchy is a fundamental bottleneck that limits the performance of many applications on modern computer architectures. Tiling and loop permutation are key techniques for improving data locality. However, selecting effective tile-sizes and loop permutations is particularly challenging for tensor contractions due to the large number of loops. Even state-of-the-art compilers usually produce sub-optimal tile-sizes and loop permutations, as they rely on naïve cost models. In this paper we provide an analytical model based approach to multi-level tile size optimization and permutation selection for tensor contractions. Our experimental results show that this approach achieves comparable or better performance than state-of-the-art frameworks and libraries for tensor contractions.

References

  1. Cedric Bastoul. 2004. Code generation in the polyhedral model is easier than you think. In Proc. of the 13th International Conference on Parallel Architectures and Compilation Techniques. IEEE.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Pietro Belotti. 2009. Couenne: a user's manual. Technical Report. Technical report, Lehigh University.Google ScholarGoogle Scholar
  3. U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008. PLUTO: A Practical and Fully Automatic Polyhedral Program Optimization System. In Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI 08).Google ScholarGoogle Scholar
  4. Stephanie Coleman and Kathryn S. McKinley. 1995. Tile Size Selection Using Cache Organization and Data Layout. In Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation (PLDI '95). ACM, 279--290.Google ScholarGoogle Scholar
  5. T Daniel Crawford and Henry F Schaefer III. 2000. An introduction to coupled cluster theory for computational chemists. Reviews in computational chemistry (2000), 33--136.Google ScholarGoogle Scholar
  6. Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. I. One-dimensional time. International journal of parallel programming 21, 5 (1992), 313--347.Google ScholarGoogle Scholar
  7. Kazushige Goto and Robert A Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS) 34, 3 (2008), 12.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Tze Meng Low, Francisco D Igual, Tyler M Smith, and Enrique S Quintana-Orti. 2016. Analytical modeling is enough for high-performance BLIS. ACM Transactions on Mathematical Software (TOMS) 43, 2 (2016), 12.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Devin A Matthews. 2018. High-performance tensor contraction without transposition. SIAM Journal on Scientific Computing 40, 1 (2018), C1--C24.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Lakshminarayanan Renganarayana and Sanjay Rajopadhye. 2008. Positivity, Posynomials and Tile Size Selection. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC '08). IEEE Press, Article 55, 12 pages.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Jun Shirako, Kamal Sharma, Naznin Fauzia, Louis-Noël Pouchet, J Ramanujam, P Sadayappan, and Vivek Sarkar. 2012. Analytical bounds for optimal tile size selection. In International Conference on Compiler Construction. Springer, 101--121.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Tyler M Smith, Robert Van De Geijn, Mikhail Smelyanskiy, Jeff R Hammond, and Field G Van Zee. 2014. Anatomy of high-performance many-threaded matrix multiplication. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 1049--1059.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Paul Springer and Paolo Bientinesi. 2016. Design of a High-Performance GEMM-like Tensor-Tensor Multiplication. CoRR (2016). arXiv:cs.MS, cs.PF/1607.00145 http://arxiv.org/abs/1607.00145Google ScholarGoogle Scholar
  14. Paul Springer, Tong Su, and Paolo Bientinesi. 2017. HPTT: A High-Performance Tensor Transposition C++ Library. In Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY 2017). ACM, New York, NY, USA, 56--62. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Field G Van Zee and Robert A Van De Geijn. 2015. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software (TOMS) 41, 3 (2015), 14.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim. 9, 4 (Jan. 2013), 54:1--54:23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 2014. Intel math kernel library. In High-Performance Computing on the Intel® Xeon Phi™. Springer, 167--188.Google ScholarGoogle Scholar
  18. Tomofumi Yuki, Lakshminarayanan Renganarayanan, Sanjay Rajopadhye, Charles Anderson, Alexandre E. Eichenberger, and Kevin O'Brien. 2010. Automatic Creation of Tile Size Selection Models. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '10). ACM, 190--199.Google ScholarGoogle ScholarDigital LibraryDigital Library
  1. Analytical cache modeling and tilesize optimization for tensor contractions

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
      November 2019
      1921 pages
      ISBN:9781450362290
      DOI:10.1145/3295500

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 17 November 2019

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate1,516of6,373submissions,24%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader