ABSTRACT
Reducing communication cost is crucial to achieving good performance on scalable parallel machines. This paper presents a new compiler algorithm for global analysis and optimization of communication in data-parallel programs. Our algorithm is distinct from existing approaches in that, rather than handling loop nests and array references one by one, it considers all communication operations in a procedure and their interactions under different placements before making a final decision on the placement of any communication. It exploits the flexibility resulting from this analysis to eliminate redundancy, reduce the number of messages, and reduce contention for cache and communication buffers, all in a unified framework. In contrast, single loop-nest analysis often retains redundant communication, and more aggressive dataflow analysis on array sections can generate too many messages or cause cache and buffer contention. The algorithm has been implemented in the IBM pHPF compiler for High Performance Fortran. With these optimizations, the number of messages per processor goes down by as much as a factor of nine for some HPF programs. We present performance results for the IBM SP2 and for a network of Sparc workstations (NOW) connected by a Myrinet switch. In many cases, the communication cost is reduced by a factor of two.
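To make the contrast between per-loop-nest placement and joint placement concrete, the sketch below is a toy model, not the pHPF algorithm itself. It assumes a hypothetical cost model in which communication requirements ("needs") for the same array placed at the same program point coalesce into a single message, and it compares placing each need in isolation at its use point against searching jointly over all legal placements. Every name in it (`needs`, `legal_sites`, `message_count`, `best_global_placement`) is invented for illustration.

```python
# Toy sketch of joint vs. isolated communication placement.
# NOT the paper's algorithm: a hypothetical cost model invented
# to illustrate why weighing all placements together can coalesce
# messages that one-at-a-time placement leaves separate.
from itertools import product

# Each need: at program point `use`, array `arr` section (lo, hi)
# must have been communicated at some point in (def, use].
# The section is carried only to suggest what a coalesced message
# would cover; the cost model below counts messages only.
needs = [
    {"arr": "A", "sec": (0, 4), "def": 0, "use": 3},
    {"arr": "A", "sec": (2, 8), "def": 0, "use": 5},
    {"arr": "B", "sec": (0, 4), "def": 2, "use": 5},
]

def legal_sites(n):
    """A need may be satisfied anywhere after the defining write
    of its array and no later than its use point."""
    return range(n["def"] + 1, n["use"] + 1)

def message_count(placement):
    """Cost model: needs of the same array placed at the same point
    coalesce into one message, so cost = distinct (site, array) pairs."""
    msgs = {(site, needs[i]["arr"]) for i, site in enumerate(placement)}
    return len(msgs)

def best_global_placement():
    """Enumerate joint placements of *all* needs and keep the cheapest,
    mirroring the idea of weighing interactions before committing."""
    candidates = product(*(legal_sites(n) for n in needs))
    best = min(candidates, key=message_count)
    return best, message_count(best)

# Baseline: place each need at its own use point, in isolation.
local = [n["use"] for n in needs]
print("local  placement:", local, "->", message_count(local), "messages")
placement, cost = best_global_placement()
print("global placement:", list(placement), "->", cost, "messages")
```

Running this prints 3 messages for the isolated placement but 2 for the joint one, because the two needs on array A can share a site and coalesce. Exhaustive enumeration as done here is exponential in the number of needs and is only meant to make the cost comparison visible; a real compiler would reach this kind of joint decision through dataflow analysis rather than brute-force search.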