article
Open Access

Automatic data layout for distributed-memory machines

Published: 01 July 1998

Abstract

The goal of languages like Fortran D or High Performance Fortran (HPF) is to provide a simple yet efficient machine-independent parallel programming model. After the algorithm selection, the data layout choice is the key intellectual challenge in writing an efficient program in such languages. The performance of a data layout depends on the target compilation system, the target machine, the problem size, and the number of available processors. This makes the choice of a good layout extremely difficult for most users of such languages. If languages such as HPF are to find general acceptance, the need for data layout selection support has to be addressed. We believe that the appropriate way to provide the needed support is through a tool that generates data layout specifications automatically. This article discusses the design and implementation of a data layout selection tool that generates HPF-style data layout specifications automatically. Because layout is done in a tool that is not embedded in the target compiler and hence will be run only a few times during the tuning phase of an application, it can use techniques such as integer programming that may be considered too computationally expensive for inclusion in production compilers. The proposed framework for automatic data layout selection builds and examines search spaces of candidate data layouts. A candidate layout is an efficient layout for some part of the program. After the generation of search spaces, a single candidate layout is selected for each program part, resulting in a data layout for the entire program. A good overall data layout may require the remapping of arrays between program parts. A performance estimator based on a compiler model, an execution model, and a machine model is needed to predict the execution time of each candidate layout and the costs of possible remappings between candidate data layouts.
In the proposed framework, instances of NP-complete problems are solved during the construction of candidate layout search spaces and the final selection of candidate layouts from each search space. Rather than resorting to heuristics, the framework capitalizes on state-of-the-art 0-1 integer programming technology to compute optimal solutions to these NP-complete problems. A prototype data layout assistant tool based on our framework has been implemented as part of the D system currently under development at Rice University. The article reports preliminary experimental results, which indicate that the framework is efficient and produces data layouts of high quality.
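The selection step the abstract describes — choose exactly one candidate layout per program part so that the sum of estimated execution times plus inter-part remapping costs is minimized — can be illustrated with a small sketch. The cost numbers and two-candidate phases below are hypothetical, and the toy instance is solved by exhaustive enumeration of the 0-1 choices rather than by an actual integer programming solver, which the paper uses for realistic problem sizes.

```python
from itertools import product

# Hypothetical toy instance: exec_cost[p][l] is the estimated execution
# time of candidate layout l (e.g., 0 = block, 1 = cyclic) in phase p.
exec_cost = [
    [10.0, 14.0],   # phase 0
    [12.0,  6.0],   # phase 1
    [ 8.0,  9.0],   # phase 2
]
# remap_cost[a][b]: cost of remapping arrays when consecutive phases
# use layouts a and b (zero if the layout is unchanged).
remap_cost = [[0.0, 5.0],
              [5.0, 0.0]]

def total_cost(choice):
    """Total execution time of the chosen layouts plus remapping costs."""
    cost = sum(exec_cost[p][l] for p, l in enumerate(choice))
    cost += sum(remap_cost[a][b] for a, b in zip(choice, choice[1:]))
    return cost

# One 0-1 decision per (phase, candidate) pair; here we simply
# enumerate all assignments of one candidate per phase.
best = min(product(range(2), repeat=len(exec_cost)), key=total_cost)
print(best, total_cost(best))   # -> (1, 1, 1) 29.0
```

Note that the cheapest layout per phase in isolation (block, cyclic, block) loses here: its two remappings cost more than they save, which is exactly the trade-off the global 0-1 formulation captures.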



            Reviews

            Michael Wolfe

In High Performance Fortran and similar languages, the user is responsible for adding explicit data distribution directives. Since this is a major impediment to the widespread use of such languages, recent research has studied automatic data distribution. The effort described in this paper uses a five-step process: (1) program partitioning into phases; (2) candidate layout selection; (3) performance estimation; (4) data layout selection; and (5) code generation. The focus of the paper is on step (4). This approach allows for dynamic data redistribution between phases. In the general case, the problem is cast into a 0-1 integer programming framework and solved optimally. Since integer programming is potentially very slow, the authors propose its use in a separate tool rather than in the compiler itself. The paper is of some interest to parallel compiler researchers and developers. The results are promising, but too limited as yet to be of practical use; the sample implementation only handles one-dimensional distributions, and must know the array and machine sizes. Given the number of heuristics in performance estimation, both for data layout and for candidate layout selection, the utility of having an optimal solution for a very inexact problem is open to question.


• Published in

  ACM Transactions on Programming Languages and Systems, Volume 20, Issue 4
  July 1998
  248 pages
  ISSN: 0164-0925
  EISSN: 1558-4593
  DOI: 10.1145/291891

              Copyright © 1998 ACM


              Publisher

              Association for Computing Machinery

              New York, NY, United States

