ABSTRACT
We present a unified approach to locality optimization that employs both data and control transformations. Data transformations include changing the array layout in memory. Control transformations involve changing the execution order of programs. We have developed new techniques for compiler optimizations for distributed shared-memory machines, although the same techniques can be used for sequential machines with a memory hierarchy.
Our compiler optimizations are based on an algebraic representation of data mappings and a new data locality model. We present a pure data transformation algorithm and an algorithm unifying data and control transformations. While there has been much work on control transformations, the opportunities for data transformations have been largely neglected. In fact, data transformations have the advantage of being applicable to programs that cannot be optimized with control transformations. The unified algorithm, which performs data and control transformations simultaneously, offers improvement over optimizations obtained by applying data and control transformations separately.
The experimental results using a set of applications on a parallel machine show that the new optimizations improve performance significantly. These results are further analyzed using locality metrics with instrumentation and simulation.
- 1.J. M. Anderson and M. Lam. Global optimizations for parallelism and locality on scalable parallel machines. Proceedings of ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, June 1993. Google ScholarDigital Library
- 2.V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer. An interactive environment for data partitioning and distribution. In Proc. 5th Distributed Memory Comput. Conf., April 1990.Google ScholarCross Ref
- 3.D. Bau, i. Kodukula, V. Kotlyar, K. Pingali, and P. StodghilI. Solving alignment using elementary linear algebra. Proceedings of the Seventh Annual Workshop on Languages and Compilers for Parallel Computing, 1994. Google ScholarDigital Library
- 4.R. Bianchini and T. J. LeBlanc. Software caching on cache-coherent multiprocessors. In Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing, pages 521-526, December 1992.Google ScholarDigital Library
- 5.William J. Bolosky, Robert P. Fitzgerald, and Michael L. Scott. Simple but effective techniques for NUMA memory management. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, pages 19- 31, Litchfield Park, AZ, December 1989. Google ScholarDigital Library
- 6.William J. Bolosky and Michael L. Scott. False sharing and its effect on shared memory performance. In Proceedings of fhe Fourth Usenix Symposium on Experiences with Distributed and Multiprocessor Systems, pages 57-71, San Diego, CA, September 1993. Also available as MSR-TR-93-1, Microsoft Research Laboratory, September 1993.Google Scholar
- 7.S. Carr, K. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality, in Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252-262, October 1994. Google ScholarDigital Library
- 8.J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. in Proceedings of the Thirteenth ACM Symposium on Operating .Systems Principles, pages 152-164, Pacific Grove, CA, October 1991. Google ScholarDigital Library
- 9.S. Chatterjee, J. R. Gilbert, R. Schreiber, and S. 71:ng. Automatic array alignment in data-parallel programs. Proceedings of ACM Symposium on Principles of Programming Languages, 20, 1993. Google ScholarDigital Library
- 10.Alan L. Cox and Robert J. Fowler. The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experiences with PLATINUM. In Proceed~!ngs of the Twelfth ACM Symposium on Operating Systems Principles, pages 32--44, Litchfield Park, AZ, December 1989. Google Scholar
- 11.M. Dubois, J. Skeppstedt, L. Ricciulli, K. Ramamurthy, and P. Stenstr6m. The detection and elimination of useless misses in multiprocessors. In Proceedings of the International Symposium on Computer Architecture, pages 88-97, San Diego, CA, May 1993. Google ScholarDigital Library
- 12.S. J. Eggers and T E. Jeremiassen. Eliminating false sharing. In Proc. 1991 International Conference on Parallel Processing, 1991.Google Scholar
- 13.C. Eisenbeis, W. Jalby, D. Windheiser, and F. Bodin. A strategy for array management in local memory. In Proc. 3th Annual Workshop on Languages and Compilers, August 1990.Google Scholar
- 14.J. Ferrante, V. Sarkar, and W. Thrash. On estimating and exchange cache effectiveness. In Proceeding.r of the Fourth Workshop on Languages and Compilers for Parallel Computing, August 1991. Google ScholarDigital Library
- 15.D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformations. Journal of Parallel and Distributed Computing, 5:587-616, 1988. Google ScholarDigital Library
- 16.M. Gupta and P. Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 3:179-193, 1992. Google ScholarDigital Library
- 17.J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990. Google ScholarDigital Library
- 18.D. Hudak and S. Abraham. Compiler techniques for data partitioning of sequentially iterated parallel loops. In Proc. ACM Int. Conf. Supercomputing, June 19913. Google ScholarDigital Library
- 19.K. Knobe, J. Lukas, and G. Steele. Data optimization: Allocation of arrays to reduce communication on SIMD machines. Journal of Parallel and Distributed Computing, 8:102-118, February 1990. Google ScholarDigital Library
- 20.J. Li and M. Chen. Index domain alignment: Minimizing cost of cross-referencing between distributed arrays. Technical report, Yale University, 1989.Google Scholar
- 21.K. Li and P. Hudak. Memory coherence in shared virtual memory systems. A CM Transactions on Computer Systems, November 1989. Google ScholarDigital Library
- 22.W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA compilers. A CM Transactions on Computer Systems, 11 (4):353-375, November 1993. Google ScholarDigital Library
- 23.W. Li and K. Pingali. A singular loop transformation framework based on non-singular matrices. International Journal of Parallel Programming, 22(2), April 1994. Google ScholarDigital Library
- 24.Wei Li. Compiler cache optimizations for banded matrix problems. In 9th ACM International Conference on Supercomputing, July 1995. Google ScholarDigital Library
- 25.D. J. Lilja. Cache coherence in large-scale sharedmemory multiprocessors: Issues and comparisons. Computing Surveys, 25(3):303-338, September 1993. Google ScholarDigital Library
- 26.A. Nagurney, C. F. Nicholson, and P. M. Bishop. Spatial price equilibrium models with discriminatory ad valorem tariffs: formulation and comparative computation using variational inequalities. In Recent Advances in Spatial Equilibrium Modeling: Methodology and Applications. Springer-Verlag, Heidelberg, 1995. forthcoming.Google Scholar
- 27.B. Nitzberg and V. Lo. Distributed shared memory: A survey of issues and algorithms. Computer, 24(8):52- 60, August 1991. Google ScholarDigital Library
- 28.A. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989. Google ScholarDigital Library
- 29.R. P. LaRowe, Jr. and C. S. Ellis. Experimental comparison of memory management policies for NUMA multiprocessors. A CM Transactions on Computer Systems, 9(4):319-363, November 1991. Google ScholarDigital Library
- 30.J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. IEEE Transactions on Parallel and Distributed Systems, 2, October 1991. Google ScholarDigital Library
- 31.Jack E. Veenstra and Robert J. Fowler. Mint: a front end for efficient simulation of shared-memory multiprocessors. Proceedings of the Second International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS '94), pages 201-207, January 1994. Google ScholarDigital Library
- 32.M. Wolf and M. Lam. A data locality optimizing algorithm. In Proc. ACM SIGPLAN 91 Conference on Programming Language Design and Implementation, pages 30-44, June 1991. Google ScholarDigital Library
Index Terms
- Unifying data and control transformations for distributed shared-memory machines
Recommendations
Unifying data and control transformations for distributed shared-memory machines
We present a unified approach to locality optimization that employs both data and control transformations. Data transformations include changing the array layout in memory. Control transformations involve changing the execution order of programs. We ...
Scalable directory architecture for distributed shared memory chip multiprocessors
Traditional Directory-based cache coherence protocol is far from optimal for large-scale cache coherent shared memory multiprocessors due to the increasing latency to access directories stored in DRAM memory. Instead of keeping directories in main ...
Comments