Open Access

Improving data locality with loop transformations

Authors:
Kathryn S. McKinley, Computer Science Department, LGRC, University of Massachusetts, Amherst, MA
Steve Carr, Department of Computer Science, Michigan Technological University, Houghton, MI
Chau-Wen Tseng, Department of Computer Science, University of Maryland, College Park, MD

ACM Transactions on Programming Languages and Systems, Volume 18, Issue 4, pp 424–453, https://doi.org/10.1145/233561.233564

Published: 01 July 1996

Abstract

In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this article, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments illustrate that for kernels our model and algorithm can select and achieve the best loop structure for a nest. For over 30 complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.
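As a rough illustration of why loop order matters for spatial reuse of cache lines, the sketch below counts misses for two orderings of the same two-dimensional sweep. It is not the article's cost model: it is a toy direct-mapped cache simulation, and the function names (`count_misses`, `trace`), the row-major array layout, and the cache parameters are all assumptions chosen for the example.

```python
def count_misses(addresses, cache_bytes, line_bytes):
    """Count misses for a byte-address trace in a direct-mapped cache."""
    num_slots = cache_bytes // line_bytes
    resident = [None] * num_slots      # memory line currently held in each slot
    misses = 0
    for addr in addresses:
        line = addr // line_bytes      # memory line containing this byte
        slot = line % num_slots        # direct-mapped placement
        if resident[slot] != line:
            resident[slot] = line
            misses += 1
    return misses


def trace(n, elem_bytes, inner_walks_row):
    """Yield addresses for a full sweep over an n x n row-major array.

    inner_walks_row=True  : inner loop varies the column index (unit stride).
    inner_walks_row=False : inner loop varies the row index (stride of n elements).
    """
    for outer in range(n):
        for inner in range(n):
            i, j = (outer, inner) if inner_walks_row else (inner, outer)
            yield (i * n + j) * elem_bytes


# Hypothetical parameters: 64 x 64 array of 8-byte elements,
# 1 KB direct-mapped cache with 32-byte lines.
N, ELEM, CACHE, LINE = 64, 8, 1024, 32

misses_unit_stride = count_misses(trace(N, ELEM, True), CACHE, LINE)
misses_long_stride = count_misses(trace(N, ELEM, False), CACHE, LINE)
print(misses_unit_stride, misses_long_stride)
```

Under these assumed parameters the unit-stride order misses once per cache line, while the permuted order misses on every access, because each line is evicted before its neighboring elements are touched. Loop permutation, in this toy setting, amounts to choosing the first ordering over the second.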

Index Terms

Improving data locality with loop transformations
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Reviews

Reviewer: Noah S. Prywes

When using a processor with a cache memory of limited size, the order of memory references in a program can affect the overall computation time. It would be too difficult for a programmer to consider this effect when writing a program. Instead, the authors propose incorporating into the compiler the optimization needed to accelerate program execution for the cache memory of the processor on which the program will run. The optimization consists of reordering the loops in the program. The paper presents cost models to evaluate the impact of reordering the loops. These are presented for four loop reordering schemes: loop permutation, fusion, distribution, and reversal. A cost model for combining such schemes is also presented. Finally, results of experimental work are provided to assess the effectiveness of the loop reordering cost models. Experimental data are presented for a number of programs and for the respective cache memories and processors: a 64 KB direct-mapped cache with 32-byte lines for the HP 715/50; a 64 KB four-way cache with 128-byte lines for the RS/6000; and an 8 KB two-way cache with 32-byte lines for the i860. The programs were executed on the RS/6000 and HP 715/50 and were also simulated for the different cache memories of the RS/6000 and i860. The experimental results validate the cost model. The improvement in execution time is significant for only some of the programs studied, particularly with smaller cache memories. The paper reports on important concepts and experiments, not only for use in compiler construction (as suggested), but also for the design of cache memories. Its readers should already be familiar with the concepts of data dependence and temporal reuse.
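To make the three cache configurations in the review concrete, the following sketch derives the number of cache lines and sets from the stated size, line size, and associativity. The helper name `cache_geometry` is hypothetical, 1 KB is assumed to mean 1024 bytes, and "four-way" and "two-way" are assumed to mean set-associative organizations.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (total cache lines, number of sets) for a set-associative cache."""
    lines = size_bytes // line_bytes   # how many lines the cache holds
    sets = lines // ways               # lines are grouped into sets of `ways`
    return lines, sets


# Parameters as listed in the review: (size in bytes, line size, associativity).
# A direct-mapped cache is treated as one-way set-associative.
configs = {
    "HP 715/50": (64 * 1024, 32, 1),
    "RS/6000": (64 * 1024, 128, 4),
    "i860": (8 * 1024, 32, 2),
}

geometry = {name: cache_geometry(*cfg) for name, cfg in configs.items()}
for name, (lines, sets) in geometry.items():
    print(f"{name}: {lines} lines, {sets} sets")
```

The much smaller line count of the i860 configuration is consistent with the review's remark that improvements were most visible with smaller caches, since fewer lines leave less room for poorly ordered accesses to stay resident.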

Published in

ACM Transactions on Programming Languages and Systems Volume 18, Issue 4
July 1996
164 pages
ISSN:0164-0925
EISSN:1558-4593
DOI:10.1145/233561

Copyright © 1996 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Cache
compiler optimization
data locality
loop distribution
loop fusion
loop permutation
loop reversal
loop transformations
microprocessors
simulation
Article Metrics
- Total Citations: 479
- Total Downloads: 2,314
- Downloads (Last 12 months): 196
- Downloads (Last 6 weeks): 42