skip to main content
10.1145/378795.378859acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
Article

Exact analysis of the cache behavior of nested loops

Authors Info & Claims
Published:01 May 2001Publication History

ABSTRACT

We develop from first principles an exact model of the behavior of loop nests executing in a memory hicrarchy, by using a nontraditional classification of misses that has the key property of composability. We use Presburger formulas to express various kinds of misses as well as the state of the cache at the end of the loop nest. We use existing tools to simplify these formulas and to count cache misses. The model is powerful enough to handle imperfect loop nests and various flavors of non-linear array layouts based on bit interleaving of array indices. We also indicate how to handle modest levels of associativity, and how to perform limited symbolic analysis of cache behavior. The complexity of the formulas relates to the static structure of the loop nest rather than to its dynamic trip count, allowing our model to gain efficiency in counting cache misses by exploiting repetitive patterns of cache behavior. Validation against cache simulation confirms the exactness of our formulation. Our method can serve as the basis for a static performance predictor to guide program and data transformations to improve performance.

References

  1. 1.A. Agarwal, M. Horowitz, and J. Hennessy. An analytical cache model. ACM Trans. Comput. Syst., 7(2):184-215, May 1989.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. 2.N. Ahmed. Locality Enhancement of Imperfectly-nested Loop Nests. PhD thesis, Department of Computer Science, Cornell University, Aug. 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. 3.N. Ahmed, N. Mateev, and K. Pingali. Tiling imperfectly-nested loop nests. Technical Report TR2000-1782, Cornell University, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. 4.N. Ahmed and K. Pingali. Automatic generation of block-recursive codes. In Proceedings of Europar 2000, pages 125-134, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. 5.M. Alt, C. Ferdinand, F. Martin, and R. Wilhelm. Cache behavior prediction by abstract interpretation. In R. Cousot and D. A. Schmidt, editors, SAS'96, Static Analysis Symposium, volume 1145 of Lecture Notes in Computer Science, pages 51-66. Springer, September 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. 6.T. Amon, G. Borriello, T. Hu, and J. Liu. Symbolic timing verification of timing diagrams using Presburger formulas. In Proceedings of DAC 97, pages 226-231, Anaheim, CA, June 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. 7.T. Amon, G. Borriello, and J. Liu. Making complex timing relationships readable: Presburger formula simplification using don't cares. In Proceedings of DAC 98, pages 586-590, San Francisco, CA, June 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. 8.B. Boigelot and P. Wolper. An automata-theoretic approach to Presburger arithmetic. In A. Mycroft, editor, Proceedings of the Second International Symposium on Static Analysis (SAS '95), volume 983 of Lecture Notes in Computer Science, pages 1-18. Springer Verlag, Sept. 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. 9.A. Boudet and H. Comon. Diophantine equations, Presburger arithmetic and finite automata. In H. Kirchner, editor, Proc. Coll. on Trees in Algebra and Programming (CAAP'96), volume 1059 of Lecture Notes in Computer Science, pages 30-43. Springer Verlag, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. 10.P. Boulet and X. Redon. SPPoC: fonctionnemen et applications. Research Report 00-04, LIFL (Laboratoire de Recherche en Informatique de l'Universite des Sciences et Technologies de Lille), 2000. In French. Also see http://www.lifl.fr/west/sppoc/.]]Google ScholarGoogle Scholar
  11. 11.M. Brehob and R. Enbody. A mathematical model of locality and caching. Technical Report TR-MSU-CPS-96-TBD, Michigan State University, Nov. 1996.]]Google ScholarGoogle Scholar
  12. 12.G. C. Cascaval. Compile-Time Performance Prediction of Scientific Programs. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. 13.S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252-262, San Jose, CA, Oct. 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. 14.S. Chatterjee, V. V. Jain, A. R. Lebeck, S. Mundhra, and M. Thottethodi. Nonlinear array layouts for hierarchical memory systems. In Proceedings of the 1999 ACM International Conference on Supercomputing, pages 444-453, Rhodes, Greece, June 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. 15.S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222-231, Saint-Malo, France, June 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. 16.S. Chatterjee and S. Sen. Cache-efficient matrix transposition. In Proceedings of HPCA-6, pages 195-205, Toulouse, France, Jan. 2000.]]Google ScholarGoogle Scholar
  17. 17.M. Cierniak and W. Li. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation, pages 205-217, La Jolla, CA, June 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. 18.P. Clauss. Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: Applications to analyze and transform scientific programs. In Proceedings of International Conference on Supercomputing, pages 278-285, May 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. 19.S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation, pages 279-290, La Jolla, CA, June 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. 20.P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23-54, 1991.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. 21.J. D. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance with source code. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 206-216, Las Vegas, NV, June 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. 22.C. Fricker, O. Temam, and W. Jalby. Influence of cross-interference on blocked loops: A case study with matrix-vector multiply. ACM Trans. Prog. Lang. Syst., 17(4):561-575, July 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. 23.S. Ghosh. Cache Miss Equations: Compiler analysis framework for tuning memory behavior. PhD thesis, Department of Electrical Engineering, Princeton University, Nov. 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. 24.S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In Proceedings of the 1997 International Conference on Supercomputing, pages 317-324, Vienna, Austria, July 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. 25.S. Ghosh, M. Martonosi, and S. Malik. Precise miss analysis for program transformations with caches of arbitrary associativity. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 228-239, San Jose, CA, Oct. 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. 26.S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: A compiler framework for analyzing and tuning memory behavior. ACM Trans. Prog. Lang. Syst., 21(4):703-746, July 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. 27.P. J. Hanlon, D. Chung, S. Chatterjee, D. Genius, A. R. Lebeck, and E. Parker. The combinatorics of cache misses during matrix multiplication. J. Comput. Syst. Sci., 2000. To appear.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. 28.J. S. Harper, D. J. Kerbyson, and G. R. Nudd. Analytical modeling of set-associative cache behavior. IEEE Trans. Comput., 48(10):1009-1024, Oct. 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. 29.M. D. Hill, J. R. Larus, A. R. Lebeck, M. Talluri, and D. A. Wood. Wisconsin architectural research tool set. Computer Architecture News, 21(4):8-10, August 1993.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. 30.M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Trans. Comput., C-38(12):1612-1630, Dec. 1989.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. 31.J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, 1979.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. 32.M. T. Kandemir, A. N. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam. A linear algebra framework for automatic determination of optimal data layouts. IEEE Transactions on Parallel and Distributed Systems, 10(2):115-135, Feb. 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. 33.W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Calculator and Library, version 1.1.0, Nov. 1996.]]Google ScholarGoogle Scholar
  34. 34.W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Library Version 1.1.0 Interface Guide, Nov. 1996.]]Google ScholarGoogle Scholar
  35. 35.W. Kelly and W. Pugh. A framework for unifying reordering transformations. Technical Report CS-TR-3193, Department of Compute Science, University of Maryland, College Park, MD, Apr. 1993.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. 36.W. Kelly and W. Pugh. Finding legal reordering transformations using mappings. Technical Report CS-TR-3297, Department of Compute Science, University of Maryland, College Park, MD, June 1994.]]Google ScholarGoogle Scholar
  37. 37.R. E. Kessler and M. D. Hill. Page placement algorithms for large real-index caches. ACM Trans. Comput. Syst., 10(4):338-359, 1992.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. 38.M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63-74, Apr. 1991.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. 39.A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15-26, Oct. 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. 40.A. R. Lebeck and D. A. Wood. Active memory: A new abstraction for memory system simulation. ACM Transactions on Modeling and Computer Simulation, 7(1):42-77, Jan. 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. 41.A. W. Lim and M. S. Lam. Maximizing parallelism and minimizing synchronization with affine transforms. In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languagess, pages 201-214, Paris, France, Jan. 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. 42.V. Loechner. PolyLib: A Library for Manipulating Parameterized Polyhedra, Mar. 1999.]]Google ScholarGoogle Scholar
  43. 43.M. Martonosi, A. Gupta, and T. Anderson. Memspy: Analyzing memory system bottlenecks in programs. In SIGMETRICS92, pages 1-12, June 1992.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. 44.K. S. McKinley and O. Temam. Quantifying loop nest locality using SPEC'95 and the Perfect benchmarks. ACM Trans. Comput. Syst., 17(4):288-336, Nov. 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. 45.N. Mitchell, L. Carter, J. Ferrante, and K. H ogstedt. Quantifying the multi-level nature of tiling interactions. In Languages and Compilers for Parallel Computing: 10th Annual Workshop, LCPC'97, number 1366 in Lecture Notes in Computer Science, pages 1-15. Springer, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. 46.D. C. Oppen. A 2 2 2 pn upper bound on the complexity of Presburger arithmetic. J. Comput. Syst. Sci., 16(3):323-332, July 1978.]]Google ScholarGoogle ScholarCross RefCross Ref
  47. 47.Y. Paek, J. Hoeflinger, and D. Padua. Simplification of array access patterns for compiler optimizations. In Proceedings of ACM PLDI, volume 33, pages 60-71, May 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. 48.A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, Houston, TX, May 1989. Available as technical report CRPC-TR89009.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. 49.W. Pugh. Counting solutions to Presburger formulas: How and why. In Proceedings of the ACM SIGPLAN'94 Conference on Programming Language Design and Implementation, pages 121-134, Orlando, FL, June 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. 50.G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In Proceedings of the ACM SIGPLAN'98 Conference on Programming Language Design and Implementation, pages 38-49, Montreal, Canada, June 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. 51.G. Rivera and C.-W. Tseng. Eliminating conflict misses for high performance architectures. In Proceedings of the 1998 International Conference on Supercomputing, pages 353-360, Melbourne, Australia, July 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. 52.U. Sch~ning. Complexity of Presburger arithmetic with fixed quantifier dimension. Theory of Computing Systems, 30:423-428, 1997.]]Google ScholarGoogle ScholarCross RefCross Ref
  53. 53.N. Shibata, K. Okana, T. Higashino, and K. Taniguchi. A decision algorithm dor prenex form rational Presburger sentences based on combinatorial geometry. In Proceedings of the 2nd International Conference on Discrete Mathematics and Theoretical Computer Science and the 5th Australasian Theory Symposium (DMTCS'99+CATS'99), pages 344-359, Jan. 1999.]]Google ScholarGoogle Scholar
  54. 54.A. Srivastava and A. Eustace. ATOM: A system for building customized program analysis tools. In Proceedings of the ACM SIGPLAN'94 Conference on Programming Language Design and Implementation, pages 196-205, June 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. 55.The Stanford Compiler Group. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. http://suif.stanford.edu.]]Google ScholarGoogle Scholar
  56. 56.R. A. Sugumar and S. G. Abraham. Efficient simulation of multiple cache configurations using binomial trees. Technical Report CSE-TR-111-91, 1991.]]Google ScholarGoogle Scholar
  57. 57.D. Thiebaut and H. Stone. Footprints in the cache. ACM Trans. Comput. Syst., 5(4):305-329, Nov. 1987.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. 58.D. A. B. Weikle, S. A. McKee, and W. A. Wulf. Caches as filters: A new approach to cache analysis. In MASCOTS'98, Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, July 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. 59.D. A. B. Weikle, K. Skadron, S. A. McKee, and W. A. Wulf. Caches as filters: A unifying model for memory hierarchy analysis. Technical Report CS-2000-16, University of Virginia, June 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. 60.V. Weispfenning. Complexity and uniformity of elimination in Presburger arithmetic. In Proceedings of the 1997 International Symposium on Symbolic and Algebraic Computation, pages 48-53, Kihei, Maui, HI, July 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. 61.M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, pages 30-44, Toronto, Canada, June 1991.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. 62.M. J. Wolfe. More iteration space tiling. In Proceedings of Supercomputing'89, pages 655-664, Reno, NV, Nov. 1989.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. 63.D. A. Wood, M. D. Hill, and R. E. Kessler. A model for estimating trace-sample miss ratios. In Proceedings of ACM SIGMETRICS, May 1991.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. 64.H. Zhang and M. Martonosi. Mathematical cache miss analysis for pointer data structures. In Proceedings of the SIAM Conference on Parallel Processing for Scientific Computing, Portsmouth, VA, Mar. 2001. CD-ROM.]]Google ScholarGoogle Scholar

Index Terms

  1. Exact analysis of the cache behavior of nested loops

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      PLDI '01: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
      June 2001
      331 pages
      ISBN:1581134142
      DOI:10.1145/378795

      Copyright © 2001 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 May 2001

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      PLDI '01 Paper Acceptance Rate30of144submissions,21%Overall Acceptance Rate406of2,067submissions,20%

      Upcoming Conference

      PLDI '24

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader