ABSTRACT
We develop from first principles an exact model of the behavior of loop nests executing in a memory hierarchy, using a nontraditional classification of misses that has the key property of composability. We use Presburger formulas to express the various kinds of misses as well as the state of the cache at the end of the loop nest, and we use existing tools to simplify these formulas and to count cache misses. The model is powerful enough to handle imperfect loop nests and various flavors of non-linear array layouts based on bit interleaving of array indices. We also indicate how to handle modest levels of associativity and how to perform limited symbolic analysis of cache behavior. The complexity of the formulas relates to the static structure of the loop nest rather than to its dynamic trip count, allowing our model to count cache misses efficiently by exploiting repetitive patterns of cache behavior. Validation against cache simulation confirms the exactness of our formulation. Our method can serve as the basis for a static performance predictor that guides program and data transformations to improve performance.
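To make two of the abstract's ingredients concrete, the sketch below (not the paper's implementation; function names and parameters are invented for illustration) shows (a) a bit-interleaved Morton/Z-order index, one flavor of the non-linear array layouts mentioned, and (b) counting a set of cache misses by brute-force enumeration of an arithmetic condition, where the paper would instead express the same set as a Presburger formula and count it symbolically.

```python
def morton(i, j, bits=8):
    """Bit-interleave the low `bits` bits of indices i and j into a single
    Morton (Z-order) index: ...i1 j1 i0 j0. This is the kind of
    bit-interleaved non-linear array layout the abstract refers to."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)  # i supplies the odd bit positions
        z |= ((j >> b) & 1) << (2 * b)      # j supplies the even bit positions
    return z

def count_cold_misses(n, block):
    """Cold misses for a loop `for i in range(n): touch A[i]` over a
    contiguous array: reference i misses iff it is the first touch of its
    cache block, i.e. i mod block == 0 with 0 <= i < n. The paper encodes
    such conditions as Presburger formulas and counts their solutions
    symbolically; here we simply enumerate."""
    return sum(1 for i in range(n) if i % block == 0)
```

For example, `count_cold_misses(100, 8)` yields 13, i.e. ceil(100/8) blocks touched, and `morton(3, 3)` yields 15 (binary 1111). Enumeration scales with the dynamic trip count, which is exactly the cost the symbolic formulation avoids: the formula's size depends only on the static loop structure.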
Index Terms
- Exact analysis of the cache behavior of nested loops