research-article

Analytical cache modeling and tilesize optimization for tensor contractions

Authors:
Rui Li

University of Utah

University of Utah
View Profile

,
Aravind Sukumaran-Rajam

Ohio State University

Ohio State University
View Profile

,
Richard Veras

Louisiana State University

Louisiana State University
View Profile

,
Tze Meng Low

Carnegie Mellon University

Carnegie Mellon University
View Profile

,
Fabrice Rastello

Univ. Grenoble Alpes

Univ. Grenoble Alpes
View Profile

,
Atanas Rountev

Ohio State University

Ohio State University
View Profile

,
P. Sadayappan

University of Utah

University of Utah
View Profile

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and AnalysisNovember 2019Article No.: 74Pages 1–13https://doi.org/10.1145/3295500.3356218

Published:17 November 2019Publication History

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Pages 1–13

ABSTRACT

Data movement between processor and memory hierarchy is a fundamental bottleneck that limits the performance of many applications on modern computer architectures. Tiling and loop permutation are key techniques for improving data locality. However, selecting effective tile-sizes and loop permutations is particularly challenging for tensor contractions due to the large number of loops. Even state-of-the-art compilers usually produce sub-optimal tile-sizes and loop permutations, as they rely on naïve cost models. In this paper we provide an analytical model based approach to multi-level tile size optimization and permutation selection for tensor contractions. Our experimental results show that this approach achieves comparable or better performance than state-of-the-art frameworks and libraries for tensor contractions.

References

Cedric Bastoul. 2004. Code generation in the polyhedral model is easier than you think. In Proc. of the 13th International Conference on Parallel Architectures and Compilation Techniques. IEEE.Google ScholarDigital Library
Pietro Belotti. 2009. Couenne: a user's manual. Technical Report. Technical report, Lehigh University.Google Scholar
U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008. PLUTO: A Practical and Fully Automatic Polyhedral Program Optimization System. In Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI 08).Google Scholar
Stephanie Coleman and Kathryn S. McKinley. 1995. Tile Size Selection Using Cache Organization and Data Layout. In Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation (PLDI '95). ACM, 279--290.Google Scholar
T Daniel Crawford and Henry F Schaefer III. 2000. An introduction to coupled cluster theory for computational chemists. Reviews in computational chemistry (2000), 33--136.Google Scholar
Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. I. One-dimensional time. International journal of parallel programming 21, 5 (1992), 313--347.Google Scholar
Kazushige Goto and Robert A Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS) 34, 3 (2008), 12.Google ScholarDigital Library
Tze Meng Low, Francisco D Igual, Tyler M Smith, and Enrique S Quintana-Orti. 2016. Analytical modeling is enough for high-performance BLIS. ACM Transactions on Mathematical Software (TOMS) 43, 2 (2016), 12.Google ScholarDigital Library
Devin A Matthews. 2018. High-performance tensor contraction without transposition. SIAM Journal on Scientific Computing 40, 1 (2018), C1--C24.Google ScholarDigital Library
Lakshminarayanan Renganarayana and Sanjay Rajopadhye. 2008. Positivity, Posynomials and Tile Size Selection. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC '08). IEEE Press, Article 55, 12 pages.Google ScholarDigital Library
Jun Shirako, Kamal Sharma, Naznin Fauzia, Louis-Noël Pouchet, J Ramanujam, P Sadayappan, and Vivek Sarkar. 2012. Analytical bounds for optimal tile size selection. In International Conference on Compiler Construction. Springer, 101--121.Google ScholarDigital Library
Tyler M Smith, Robert Van De Geijn, Mikhail Smelyanskiy, Jeff R Hammond, and Field G Van Zee. 2014. Anatomy of high-performance many-threaded matrix multiplication. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 1049--1059.Google ScholarDigital Library
Paul Springer and Paolo Bientinesi. 2016. Design of a High-Performance GEMM-like Tensor-Tensor Multiplication. CoRR (2016). arXiv:cs.MS, cs.PF/1607.00145 http://arxiv.org/abs/1607.00145Google Scholar
Paul Springer, Tong Su, and Paolo Bientinesi. 2017. HPTT: A High-Performance Tensor Transposition C++ Library. In Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY 2017). ACM, New York, NY, USA, 56--62. Google ScholarDigital Library
Field G Van Zee and Robert A Van De Geijn. 2015. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software (TOMS) 41, 3 (2015), 14.Google ScholarDigital Library
Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim. 9, 4 (Jan. 2013), 54:1--54:23. Google ScholarDigital Library
Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 2014. Intel math kernel library. In High-Performance Computing on the Intel® Xeon Phi™. Springer, 167--188.Google Scholar
Tomofumi Yuki, Lakshminarayanan Renganarayanan, Sanjay Rajopadhye, Charles Anderson, Alexandre E. Eichenberger, and Kevin O'Brien. 2010. Automatic Creation of Tile Size Selection Models. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '10). ACM, 190--199.Google ScholarDigital Library

Analytical cache modeling and tilesize optimization for tensor contractions
1. Software and its engineering
  1. Software notations and tools

Recommendations

Analytical characterization and design space exploration for optimization of CNNs
ASPLOS '21: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems

Moving data through the memory hierarchy is a fundamental bottleneck that can limit the performance of core algorithms of machine learning, such as convolutional neural networks (CNNs). Loop-level optimization, including loop tiling and loop permutation,...
Read More
Performance modeling and optimization of parallel out-of-core tensor contractions
PPoPP '05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming

The Tensor Contraction Engine (TCE) is a domain-specific compiler for implementing complex tensor contraction expressions arising in quantum chemistry applications modeling electronic structure. This paper develops a performance model for tensor ...
Read More
Memory minimization for tensor contractions using integer linear programming
IPDPS'06: Proceedings of the 20th international conference on Parallel and distributed processing

This paper presents a technique for memory optimization for a class of computations that arises in the field of correlated electronic structure methods such as coupled cluster and configuration interaction methods in quantum chemistry. In this class of ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500
General Chair:
Michela Taufer,
Program Chairs:
Pavan Balaji,
Antonio J. Peña
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 17 November 2019
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
domain-specific compiler optimization
model-driven design-space exploration
performance modeling
tensor contraction
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,516of6,373submissions,24%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 17
  Total Citations
  View Citations
- 699
  Total Downloads
- Downloads (Last 12 months)100
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Analytical cache modeling and tilesize optimization for tensor contractions

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Recommendations

Analytical characterization and design space exploration for optimization of CNNs

Performance modeling and optimization of parallel out-of-core tensor contractions

Memory minimization for tensor contractions using integer linear programming

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Analytical cache modeling and tilesize optimization for tensor contractions

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Recommendations

Analytical characterization and design space exploration for optimization of CNNs

Performance modeling and optimization of parallel out-of-core tensor contractions

Memory minimization for tensor contractions using integer linear programming

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media