ABSTRACT
Reducing communication cost is crucial to achieving good performance on scalable parallel machines. This paper presents a new compiler algorithm for global analysis and optimization of communication in data-parallel programs. Our algorithm is distinct from existing approaches in that, rather than handling loop nests and array references one by one, it considers all communication operations in a procedure and their interactions under different placements before making a final decision on the placement of any communication. It exploits the flexibility resulting from this analysis to eliminate redundancy, reduce the number of messages, and reduce contention for cache and communication buffers, all in a unified framework. In contrast, single loop-nest analysis often retains redundant communication, and more aggressive dataflow analysis on array sections can generate too many messages or cause cache and buffer contention. The algorithm has been implemented in the IBM pHPF compiler for High Performance Fortran. With these optimizations, the number of messages per processor goes down by as much as a factor of nine for some HPF programs. We present performance results for the IBM SP2 and for a network of Sparc workstations (NOW) connected by a Myrinet switch. In many cases, the communication cost is reduced by a factor of two.
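To make the contrast between per-loop-nest placement and joint placement concrete, the sketch below is a toy model, not the pHPF algorithm itself. It assumes a hypothetical cost model in which communication requirements ("needs") for the same array placed at the same program point coalesce into a single message, and it compares placing each need in isolation at its use point against searching jointly over all legal placements. Every name in it (`needs`, `legal_sites`, `message_count`, `best_global_placement`) is invented for illustration.

```python
# Toy sketch of joint vs. isolated communication placement.
# NOT the paper's algorithm: a hypothetical cost model invented
# to illustrate why weighing all placements together can coalesce
# messages that one-at-a-time placement leaves separate.
from itertools import product

# Each need: at program point `use`, array `arr` section (lo, hi)
# must have been communicated at some point in (def, use].
# The section is carried only to suggest what a coalesced message
# would cover; the cost model below counts messages only.
needs = [
    {"arr": "A", "sec": (0, 4), "def": 0, "use": 3},
    {"arr": "A", "sec": (2, 8), "def": 0, "use": 5},
    {"arr": "B", "sec": (0, 4), "def": 2, "use": 5},
]

def legal_sites(n):
    """A need may be satisfied anywhere after the defining write
    of its array and no later than its use point."""
    return range(n["def"] + 1, n["use"] + 1)

def message_count(placement):
    """Cost model: needs of the same array placed at the same point
    coalesce into one message, so cost = distinct (site, array) pairs."""
    msgs = {(site, needs[i]["arr"]) for i, site in enumerate(placement)}
    return len(msgs)

def best_global_placement():
    """Enumerate joint placements of *all* needs and keep the cheapest,
    mirroring the idea of weighing interactions before committing."""
    candidates = product(*(legal_sites(n) for n in needs))
    best = min(candidates, key=message_count)
    return best, message_count(best)

# Baseline: place each need at its own use point, in isolation.
local = [n["use"] for n in needs]
print("local  placement:", local, "->", message_count(local), "messages")
placement, cost = best_global_placement()
print("global placement:", list(placement), "->", cost, "messages")
```

Running this prints 3 messages for the isolated placement but 2 for the joint one, because the two needs on array A can share a site and coalesce. Exhaustive enumeration as done here is exponential in the number of needs and is only meant to make the cost comparison visible; a real compiler would reach this kind of joint decision through dataflow analysis rather than brute-force search.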