ABSTRACT
Applicable to arbitrary sequences and nests of loops, affine partitioning is a program transformation framework that unifies many previously proposed loop transformations, including unimodular transforms, fusion, fission, reindexing, scaling and statement reordering. Algorithms based on affine partitioning have been shown to be effective for parallelization and communication minimization. This paper presents algorithms that improve data locality using affine partitioning.
Blocking and array contraction are two important optimizations that have been shown to be useful for data locality. Blocking creates a set of inner loops so that data brought into the faster levels of the memory hierarchy can be reused. Array contraction reduces an array to a scalar variable and thereby reduces the number of memory operations executed and the memory footprint. Loop transforms are often necessary to make blocking and array contraction possible.
By bringing the full generality of affine partitioning to bear on the problem, our locality algorithm can find more contractable arrays than previously possible. This paper also generalizes the concept of blocking and shows that affine partitioning allows the benefits of blocking be realized in arbitrarily nested loops. Experimental results on a number of benchmarks and a complete multigrid application in aeronautics indicates that affine partitioning is effective in practice.
- 1.N. Ahmed, N. Mateev, and K. Pingali. Synthesizing transformations for locality enhancement of imperfectly-nested loop nests. In Proceedings of the 2000 ACM International Conference on Supercomputing, pages 141-152, May 2000.]] Google ScholarDigital Library
- 2.N. Ahmed, N. Mateev, and K. Pingali. Tiling imperfectly-nested loop nests. In Proceedings of Supercomputing 2000, November 2000.]] Google ScholarDigital Library
- 3.G. Aigner, A. Diwan, D. L. Heine, M. S. Lam, D. L. Moore, B. R. Murphy, and C. Sapuntzakis. An overview of the SUIF2 compiler infrastructure. Technical report, Stanford University, 2000.]]Google Scholar
- 4.D. Bacon, S. Graham, and O. Sharp. Compiler transformations for high-performance computing. Computing Surveys, 26(4):345-420, December 1994.]] Google ScholarDigital Library
- 5.D. Bailey and J. Barton. The NAS kernel benchmark program. Technical Report 86711, NASA, 1985.]]Google Scholar
- 6.S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252-262, October 1994.]] Google ScholarDigital Library
- 7.P. Feautrier. Some efficient solutions to the affine scheduling problem, part I, one dimensional time. International Journal of Parallel Programming, 21(5):313- 348, October 1992.]] Google ScholarDigital Library
- 8.G. R. Gao, R. Olsen, V. Sarkar, and R. Thekkath. Collective loop fusion for array contraction. In Proceedings of the Fifth Workshop on Languages and Compilers for Parallel Computing, pages 171-181, August 1992.]] Google ScholarDigital Library
- 9.G. H. Golub and C. F. Van Loan. Matrix Computations - 2nd Edition. The John Hopkins University Press, Baltimore, Maryland, 1989.]] Google ScholarDigital Library
- 10.A. Jameson. Solution of the Euler equations by amultigrid method. Applied Mathematics and Computation, 13:327-356, 1983.]]Google ScholarDigital Library
- 11.K. Kennedy and K. S. McKinley. Optimizing for parallelism and data locality. In Proceedings of the 1992 ACM International Conference on Supercomputing, pages 323-334, July 1992.]] Google ScholarDigital Library
- 12.K. Kennedy and K. S. McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of the Sixth Workshop on Languages and Compilers for Parallel Computing, pages 301-320. Springer-Verlag, August 1993.]] Google ScholarDigital Library
- 13.I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. In Proceedings of the ACM SIG- PLAN '97 Conference on Programming Language Design and Implementation, pages 346-357, June 1997.]] Google ScholarDigital Library
- 14.E. C. Lewis, C. Lin, and L. Snyder. The implementation and evaluation of fusion and array contraction in array languages. In Proceedings of the ACM SIGPLAN '98 Conference onProgramming Language Design and Implementation, pages 50-59, June 1998.]] Google ScholarDigital Library
- 15.S.-W. Liao. SUIF Explorer: an Interactive and Interprocedural Paralllelizer. PhD thesis, Stanford University, August 2000. Published as CSL-TR-00-807.]] Google ScholarDigital Library
- 16.A. W. Lim, G. I. Cheong, and M. S. Lam. An affine partitioning algorithm to maximize parallelism and minimize communication. In Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing, Rhodes, Greece, June 1999.]] Google ScholarDigital Library
- 17.A. W. Lim and M. S. Lam. Maximizing parallelism and minimizing synchronization with affine transforms. In Conference Record of the 24th ACM SIGPLAN- SIGACT Symposium on Principles of Programming Languages, Paris, France, January 1997.]] Google ScholarDigital Library
- 18.A. W. Lim and M. S. Lam. Maximizing parallelism and minimizing synchronization with affine partitions. Parallel Computing, 24(3-4):445-475, May 1998.]] Google ScholarDigital Library
- 19.W. Pugh and E. Rosser. Iteration space slicing for locality. In Proceedings of the Twelfth Workshop on Languages and Compilers for Parallel Computing. Springer-Verlag, August 1999.]] Google ScholarDigital Library
- 20.V. Sarkar and G. R. Gao. Optimization of array accesses by collective loop transformations. In Proceedings of the 1991 ACM International Conference on Supercomputing, pages 194-204, June 1991.]] Google ScholarDigital Library
- 21.A. Schrijver. Theory of Linear and Integer Programming. Wiley, Chichester, 1986.]] Google ScholarDigital Library
- 22.M. E. Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Stanford University, August 1992. Published as CSL-TR-92-538.]] Google ScholarDigital Library
- 23.M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference onProgramming Language Design and Implementation, pages 30-44, June 1991.]] Google ScholarDigital Library
- 24.M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Publishing Company, 1995.]] Google ScholarDigital Library
Index Terms
- Blocking and array contraction across arbitrarily nested loops using affine partitioning
Recommendations
Blocking and array contraction across arbitrarily nested loops using affine partitioning
Applicable to arbitrary sequences and nests of loops, affine partitioning is a program transformation framework that unifies many previously proposed loop transformations, including unimodular transforms, fusion, fission, reindexing, scaling and ...
Affine and unimodular transformations for non-uniform nested loops
ICCOMP'08: Proceedings of the 12th WSEAS international conference on ComputersPerformance improvement in the modern parallel machines needs not only to find sufficient parallelism in a program, but it is also important that we minimize the synchronization and communication overheads in the parallelized program. Parallelizing and ...
Affine-by-Statement Transformations of Imperfectly Nested Loops
IPPS '96: Proceedings of the 10th International Parallel Processing SymposiumA majority of loop restructuring techniques developed so far assume that loops are perfectly nested. The unimodular approach unifies three individual transformations -- loop interchange, skewing and reversal -- but is still limited to perfect loop ...
Comments