Abstract
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In the this article, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments illustrate that for kernels our model and algorithm can select and achieve the best loop structure for a nest. For over 30 complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve bacause benchmark programs typically have high hit rates even for small data caches; however, our optimizations significanty improved several programs.
- ABu-SUFAH, W. 1979. Improving the performance of virtual memory computers. Ph.D. thesis, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign. Google Scholar
- ALLEN, J. R. AND KENNEDY, K. 1984. Automatic loop interchange. In Proceedings of the SIC- PLAN '8~ Symposium on Compiler Construction. ACM, New York. Google Scholar
- ALLEN, J. R. AND KENNEDY, K. 1987. Automatic translation of Fortran programs to vector form. A CM Trans. Program. Lang. Syst. 9, 4 (Oct.), 491-542. Google Scholar
- BANERJEE, U. 1990. A theory of loop permutations. In Languages and Compilers for Parallel Computing, D. Gelernter, A. Nicolau, and D. Padua, Eds. The MIT Press, Cambridge, Mass., 54-74. Google Scholar
- CALLAHAN, D., CARR, S., AND KENNEDY, K. 1990. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation. ACM, New York. Google Scholar
- CALLAHAN, D., COCKE, J., AND KENNEDY, K. 1988. Estimating interlock and improving balance for pipelined machines. J. Parall. Distrib. Comput. 5, 4 (Aug.), 334-358. Google Scholar
- CARR, S. 1992. Memory-hierarchy management. Ph.D. thesis, Dept. of Computer Science, Rice Univ., Houston, Tex. Google Scholar
- CARR, S. AND KENNEDY, K. 1994a. Improving the ratio of memory operations to floating-point operations in loops. ACM Trans. Program. Lang. Syst. 16, 6 (Nov.), 1769-1810. Google Scholar
- CARR, S. AND KENNEDY, K. 1994b. Scalar replacement in the presence of conditional control flow. Softw. Prac. Exper. 2g, 1 (Jan.), 51-77. Google Scholar
- CARR, S. AND WU, Q. 1995. An analysis of loop permutation on the HP PA-RISC. Tech. Rep. TR95-03, Michigan Technological Univ., Houghton, Mich. Feb.Google Scholar
- COLEMAN, S. AND MCKINLEY, K. S. 1995. Tile size selection using cache organization and data layout. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation. ACM, New York. Google Scholar
- COOPER, K., HALL, M. W., HOOD, R. T., KENNEDY, K., MCKINLEY, K. S., MELLOR-CRUMMEY, J. M., TORCZON, L., AND WARREN, S. K. 1993. The ParaScope parallel programming environmerit. Proc. IEEE 81, 2 (Feb.), 244-263.Google Scholar
- COOPER, K., HALL, M. W., AND KENNEDY, K. 1993. A methodology for procedure cloning. Comput. Lang. 19, 2 (Feb.), 105-117.Google Scholar
- FERRANTE, J., SARKAR, V., AND THRASH, W. 1991. On estimating and enhancing cache effectiveness. In Languages and Compilers for Parallel Computing, Jth International Workshop, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, Eds. Springer-Verlag, Berlin, 328-343. Google Scholar
- CJANNON, D., JALBY, W., AND CJALLIVAN, K. 1988. Strategies for cache and local memory management by global program transformation. J. Parall. Distrib. Comput. 5, 5 (Oct.), 587-616. Google Scholar
- CJOFF, CJ., KENNEDY, K., AND TSENG, C.-W. 1991. Practical dependence testing. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation. ACM, New York. Google Scholar
- HALL, M. W., KENNEDY, K., AND MCKINLEY, K. S. 1991. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91. IEEE, New York. Google Scholar
- IRIGOIN, F. AND TRIOLET, R. 1988. Supernode partitioning. In Proceedings of the 15th Annual A CM Symposium on the Principles of Programming Languages. ACM, New York. Google Scholar
- KENNEDY, K. AND MCKINLEY, K. S. 1992. Optimizing for parallelism and data locality. In Proceedings of the 1992 ACM International Conference on Supercomputing. ACM, New York. Google Scholar
- KENNEDY, K. AND MCKINLEY, K. S. 1993. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Languages and Compilers for Parallel Computing, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, Eds. Springer-Verlag, Berlin, 30 Google Scholar
- KENNEDY, K., MCKINLEY, K. S., AND TSENG, C.-W. 1993. Analysis and transformation in an interactive parallel programming tool. Concurrency Pract. Exper. 5, 7 (Oct.), 575-602.Google Scholar
- KUCK, D., KUHN, R., PADUA, D., LEASURE, B., AND WOLFE, M. J. 1981. Dependence graphs and compiler optimizations. In Conference Record of the 8th Annual ACM Symposium on the Principles of Programming Languages. ACM, New York. Google Scholar
- LAM, M., ROTHBERG, E., AND WOLF, M. E. 1991. The cache performance and optimizations of blocked algorithms. In Proceedings of the Jth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York. Google Scholar
- LI, W. AND PINGALI, K. 1992. Access normalization: Loop restructuring for NUMA compilers. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York. Google Scholar
- MCKINLEY, K. S. 1992. Automatic and interactive parallelization. Ph.D. thesis, Dept. of Computer Science, Rice Univ., Houston, Tex. Google Scholar
- WARREN, J. 1984. A hierachical basis for reordering transformations. In Conference Record of the 11th Annual ACM Symposium on the Principles of Programming Languages. ACM, New York. Google Scholar
- WOLF, M. E. 1992. Improving locality and parallelism in nested loops. Ph.D. thesis, Dept. of Computer Science, Stanford Univ., Stanford, Calif. Google Scholar
- WOLF, M. E. AND LAM, M. 1991. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation. ACM, New York. Google Scholar
- WOLFE, M. J. 1986. Advanced loop interchanging. In Proceedings of the 1986 International Conference on Parallel Processing. CRC Press, Boca Raton, Fla.Google Scholar
- WOLFE, M. J. 1987. Iteration space tiling for memory hierarchies. In Proceedings of the 3rd SIAM Conference on Parallel Processing. SIAM, Philadelphia, Pa. Google Scholar
- WOLFE, M. J. 1991. The Tiny loop restructuring research tool. In Proceedings of the 1991 International Conference on Parallel Processing. CRC Press, Boca Raton, Fla.Google Scholar
Index Terms
- Improving data locality with loop transformations
Recommendations
Reshaping cache misses to improve row-buffer locality in multicore systems
PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniquesOptimizing cache locality has always been important since the emergence of caches, and numerous cache locality optimization schemes have been published in compiler literature. However, in modern architectures, cache locality is not the only factor that ...
Fusion of Loops for Parallelism and Locality
Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. Even when legal, fusion may introduce loop-carried dependences which prevent parallelism. In addition, performance ...
Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion
Because of the increasing gap between the speeds of processors and main memories, compilers must enhance the locality of applications to achieve high performance. Loop fusion enhances locality by fusing loops that access similar sets of data. Typically, ...
Comments