The multiflow trace scheduling compiler

Lowney, P. Geoffrey; Freudenberger, Stefan M.; Karzes, Thomas J.; Lichtenstein, W. D.; Nix, Robert P.; O'Donnell, John S.; Ruttenberg, John C.

doi:10.1007/BF01205182

The multiflow trace scheduling compiler

Published: May 1993

Volume 7, pages 51–142, (1993)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

P. Geoffrey Lowney¹,
Stefan M. Freudenberger²,
Thomas J. Karzes³,
W. D. Lichtenstein⁴,
Robert P. Nix⁵,
John S. O'Donnell⁶ &
…
John C. Ruttenberg⁷

212 Accesses
235 Citations
12 Altmetric
Explore all metrics

Abstract

The Multiflow compiler uses the trace scheduling algorithm to find and exploit instruction-level parallelism beyond basic blocks. The compiler generates code for VLIW computers that issue up to 28 operations each cycle and maintain more than 50 operations in flight. At Multiflow the compiler generated code for eight different target machine architectures and compiled over 50 million lines of Fortran and C applications and systems code. The requirement of finding large amounts of parallelism in ordinary programs, the trace scheduling algorithm, and the many unique features of the Multiflow hardware placed novel demands on the compiler. New techniques in instruction scheduling, register allocation, memory-bank management, and intermediate-code optimizations were developed, as were refinements to reduce the overhead of trace scheduling. This article describes the Multiflow compiler and reports on the Multiflow practice and experience with compiling for instruction-level parallelism beyond basic blocks.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Aho, A.V., and Ullman, J.D. 1977.Principles of Compiler Design. Addison-Wesley, Reading, Mass.
Google Scholar
Aiken, A., and Nicolau, A. 1988. Optimal loop parallelization. InProc., SIGPLAN '88 Conf. on Programming Language Design and Implementation (Atlanta, June), pp. 308–317.
Allen, F.E., Carter, J.L., Fabri, J., Ferrante, J., Harrison, W.H., Loewner, P.G., and Trevillyan, L.H. 1980. The experimental compiling system.IBM J. Res. and Dev., 24, 6 (Nov.): 695–715.
Google Scholar
Allen, J.R., and Kennedy, K. 1984. Automatic loop interchange. InProc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal, June), pp. 233–246.
Allen, R., and Kennedy, K. 1987. Automatic translation of FORTRAN programs to vector form.ACM Trans. Programming Languages and Systems, 9, 4 (Oct.): 491–542.
Google Scholar
American National Standards. 1978. American National Standard Programming Language Fortran. Am. Nat. Standards Institute, New York.
Google Scholar
Banerjee, U. 1976. Data dependence in ordinary programs. M.S. thesis, Dept. of Comp. Sci. Rept. no. 76-837, Univ. of Ill., Urbana-Champaign, Ill.
Google Scholar
Banerjee, U. 1979. Speedup of ordinary programs. Ph.D. thesis, Dept. of Comp. Sci. Rept. no. 79–989, Univ. of Ill., Urbana-Champaign, Ill.
Google Scholar
Callahan, D., and Koblenz, B. 1991. Register allocation via hierarchical graph coloring. InProc., ACM SIGPLAN '91 Conf. on Programming Language Design and Implementation (Toronto, June), pp. 192–203.
Callahan, D., Carr, S., and Kennedy, K. 1990. Improving register allocation for subscripted variables. InProc., ACM SIGPLAN '90 Conf. on Programming Languge Design and Implementation (White Plains, N.Y., June), pp. 53–65.
Chaitin, G.J. 1982. Register allocation and spilling via graph coloring. InProc., ACM SIGPLAN '82 Symp. on Compiler Construction (Boston, June), pp. 98–105.
Chaitin, G.J., Auslander, M.A., Chandra, A.K., Cocke, J., Hopkins, M.E., and Markstein, P.W. 1981. Register allocation via coloring.Comput. Lang., 6, 1 (Jan.): 47–57.
Google Scholar
Chastain, M., Gostlin, G., Mankovich, J., and Wallach, S. 1988. The Convex C240 architecture. InProc., IEEE 1988 Supercomputing Conf., pp. 321–329.
Chow, F.C. 1983. A portable machine-independent global optimizer—Design and measurements. Tech. Rept. 83-254, Comp. Systems Lab., Stanford Univ., Stanford, Calif.
Google Scholar
Chow, F., and Hennessy, J. 1984. Register allocation by priority-based coloring. InProc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal, June), pp. 222–232.
Chow, F.C., and Hennessy, J.L. 1990. The priority-based coloring approach to register allocation.ACM Trans. Programming Languages and Systems, 12, 4 (Oct.): 501–536.
Google Scholar
Chow, F., Himelstein, M., Killian, E., and Weber, L. 1986. Engineering a RISC compiler system. InProc., IEEE Compcon.
Chow, F., Correll, S., Himelstein, M., Killian, E., and Weber, L. 1987. How many addressing modes are enough? InProc., Second Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Oct.), pp. 117–121.
Colwell, R.P., Nix, R.P., O'Donnell, J.J., Papworth, D.B., and Rodman, P.K. 1988. A VLIW architecture for a trace scheduling compiler.IEEE Trans. Comps., C-37, 8 (Aug.): 967–979.
Google Scholar
Colwell, R.P., Hall, W.E., Joshi, C.S., Papworth, D.B., Rodman, P.K., and Tornes, J.E. 1990. Architecture and implementation of a VLIW supercomputer. InProc., Supercomputing '90 (Nov.), pp. 910–919.
Dehnert, J.C., Hsu, P.Y.-T., and Bratt, J. P. 1989. Overlapped loop support in the Cydra 5. InProc., Third Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Apr.), pp. 26–38.
Dongarra, J.J. 1991. Performance of various computers using standard linear equations software. CS-89-85, Dept. of Comp. Sci., Univ. of Tenn., Knoxville, Tenn.
Google Scholar
Ellis, J.R. 1986.Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge, Mass.
Google Scholar
Ferrante, J., Ottenstein, K.J., and Warren, J.D. 1987. The program dependence graph and its use in optimization.ACM Trans. Programming Languages and Systems, 9, 3, (July): 319–349.
Google Scholar
Feldman, S.I. 1979. Implementation of a portable F77 compiler using modern tools. InProc., SIGPLAN Symp. on Compiler Construction (Aug.), pp. 98–106.
Fisher, J.A. 1979. The optimization of horizontal microcode within and beyond basic blocks: An application of processor scheduling with resources. Ph.D. thesis, New York Univ., New York.
Google Scholar
Fisher, J.A. 1981. Trace scheduling: A technique for global microcode compaction.IEEE Trans. Comps., C-30, 7 (July): 478–490.
Google Scholar
Fisher, J.A., Ellis, J.R., Ruttenberg, J.C. and Nicolau, A. 1984. Parallel processing: A smart compiler and a dumb machine. InProc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal, June), pp. 37–47.
Foster, C.C., and Riseman, E.M. 1972. Percolation of code to enhance parallel dispatching and execution.IEEE Trans. Comps., C-21, 12 (Dec): 1411–1415.
Google Scholar
Freudenberger, S.M., and Ruttenberg, J.C. 1992. Phase ordering of register allocation and instruction scheduling. InCode Generation—Concepts, Tools, Techniques (R. Giergerich and S.L. Graham, eds.), Springer-Verlag, London, pp. 146–170.
Google Scholar
Gibbons, P.B., and Muchnik, S.S. 1986. Efficient instruction scheduling for a pipelined architecture. InProc., SIGPLAN '86 Symp. on Compiler Construction (Palo Alto, Calif., June), pp. 11–16.
Gross, T. 1983. Code optimization of pipeline constraints. Tech. Rept. no. 83-255, Comp. Systems Lab., Stanford Univ., Stanford, Calif.
Google Scholar
Gross, T., and Ward, M. 1990. The suppression of compensation code. InAdvances in Languages and Compilers for Parallel Computing (A. Nicolau, D. Gelernter, T. Gross, and D. Padua, eds.), Pitman/MIT Press, pp. 260–273.
Hennessy, J.L., and Gross, T.R. 1983. Postpass code optimization of pipeline constraints.ACM Trans. Programming Languages and Systems, 5, 3 (July): 422–448.
Google Scholar
Hewlett-Packard. 1991.PA-RISC Procedure Calling Convention Reference Manual. Hewlett-Packard Laboratories, order no. 09740-90015.
Himelstein, M.I. 1991. Compiler tail ends. Tutorial notes,SIGPLAN '91 Conf. on Programming Language Design and Implementation.
IBM. 1992.IBM RS/6000 Assembler Language Reference, 2nd ed. IBM Corp. (Jan.).
Johnson, S.C. 1979. A tour through the portable C compiler. Bell Labs., Murray Hill, N.J.
Google Scholar
Joshi, C.S., Reger, B.A., and Feehrer, J.R. 1991. Memory system for a statically scheduled supercomputer. InProc., 1991 Internat. Conf. on Parallel Processing, vol. 1, pp. 196–203.
Google Scholar
Jouppi, N.P., and Wall, D.W. 1989. Available instruction-level parallelism for superscalar and superpipelined machines. InProc., Third Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Apr.), pp. 272–182.
Kane, G. 1987.MIPS R2000 RISC Architecture. Prentice-Hall, Englewood Cliffs, N.J.
Google Scholar
Kernighan, B.W., and Ritchie, D.M. 1978.The C Programming Language. Prentice-Hall, Englewood Cliffs, N.J.
Google Scholar
Knuth, D.E. 1981.The Art of Computer Programming, vol. 2, Seminumerical Algorithms, 2nd ed. Addison-Wesley, Reading, Mass.
Google Scholar
Kuck, D.J., Kuhn, R.H., Padua, D.A., Leasure, B., and Wolfe, M. 1981. Dependence graphs and compiler optimizations. InProc., Eighth Annual ACM Symp. on Principles of Programming Languages (Williamsburg, Va., Jan.), pp. 207–218.
Lam, M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. InProc., SIGPLAN '88 Conf. on Programming Design and Implementation (Atlanta, June 22–24), pp. 318–328.
Lam, M.S. 1989.A Systolic Array Optimizing Compiler. Kluwer Academic, Norwell, Mass.
Google Scholar
Linn, J.L. 1983. SRDAG compaction—A generalization of trace scheduling to increase the use of global context information. InProc., 16th Annual Microprogramming Workshop (Oct.), pp. 11–22.
Linn, J.L. 1988. Horizontal microcode compaction. InMicroprogramming and Firmware Engineering Methods (S. Habib, ed.), Van Nostrand Rheinhold, New York, pp. 381–431.
Google Scholar
Little, J.D.C. 1961. A proof of the queing formulaL = λW.Operations Research, 9, 3 (May): 383–387. (Cited in Coffman, E.G., Jr., and Denning, P.J. 1973.Operating Systems Theory. Prentice-Hall, Englewood Cliffs, N.J.)
Google Scholar
Lubeck, O.M. 1988. Supercomputer performance: The theory, practice, and results. LA-11204-MS, Los Alamos Nat. Lab., Los Alamos, N.M.
Google Scholar
McMahon, F.H. 1986. The Livermore Fortran kernels: A computer test of the numerical performance range. UCRL-53745, Lawrence Livermore Nat. Lab., Univ. of Calif., Livermore.
Google Scholar
Mercer, R. 1988. The CONVEX FORTRAN 5.0 compiler. InProc., Supercomputing '88 (St. Petersburg, Fla., Nov.), pp. 164–175.
Morel, E., and Renvoise, C. 1979. Global optimization by suppression of partial redundancies.CACM, 22, 2 (Feb.): 96–103.
Google Scholar
Nicolau, A. 1984. Parallelism, memory anti-aliasing and correctness for trace scheduling compilers. Ph.D. thesis, Yale Univ., New Haven, Conn.
Google Scholar
Nicolau, A., and Fisher, J. A. 1981. Using an oracle to measure parallelism in single instruction stream programs. InProc., 14th Annual Microprogramming Workshop (Oct.), pp. 171–182.
O'Brien, K., Hay, B., Minish, J., Schaffer, H., Schloss, B., Shepherd, A., and Zaleski, M. 1990. Advanced compiler technology for the RISC System/6000 architecture. InIBM RISC System/6000 Technology. IBM Corp., Austin, Tex., pp. 154–161.
Google Scholar
Odnert, D., Hansen, R., Dadoo, M., and Laventhal, M. 1991. Architecture and compiler enhancements for PA-RISC workstations. InProc., IEEE Compcon (spring), pp. 214–218.
Padua, D.A., and Wolfe, M.J. 1986. Advanced compiler optimizations for supercomputers.CACM, 29, 12 (Dec): 1184–1201.
Google Scholar
Pettis, K., and Hansen, R.C. 1990. Profile guided code positioning. InProc., ACM SGPLAN90 Conf. on Programming Language Design and Implementation (White Plains, N.Y., June), pp. 16–27.
Rau, B.R., and Glaeser, C.D. 1981. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. InProc., Fourteenth Annual Workshop on Microprogramming (Oct.), pp. 183–198.
Google Scholar
Rau, B.R., Yen, D.W.L., Yen, W., and Towle, R.A. 1989. The Cydra 5 departmental supercomputer: Design philosophies, decisions, and trade-offs.IEEE Computer, 22, 1 (Jan.): 12–34.
Google Scholar
Reif, J.H. 1977. Combinatorial aspects of symbolic program analysis. Tr-11-77, Center for Research in Computing Technology, Cambridge, Mass.
Google Scholar
Reif, J.H. 1980. Code motion.SIAM J. Computing, 9, 2 (May): 375–395.
Google Scholar
Reif, J.H., and Tarjan, R.E. 1982. Symbolic analysis in almost linear time.SIAM J. Computing, 11, 1 (Feb.): 81–93.
Google Scholar
Rodman, P. 1989. High performance FFTs for a VLIW architecture. InProc., 1989 Symp. on Computer Architecture and Digital Signal Processing (Hong Kong).
SPEC. 1989.SPEC Newsletter, 1, 1 (fall). (Published by Waterside Assoc, Fremont, Calif.)
Google Scholar
Thornton, J.E. 1970.Design of a Computer: The Control Data 6600. Scott, Foresman, Glenview, Ill.
Google Scholar
Tirumalai, P., Lee, M., and Schlansker, M. S. 1990. Parallelization of loops with exits on pipelined architectures. InProc., Supercomputing '90 (New York, Nov.), pp. 200–212.
Tjaden, G.S., and Flynn, M.J. 1970. Detection and parallel execution of independent instructions.IEEE Trans. Comps., C-19, 10 (Oct.): 889–895.
Google Scholar
Tomasulo, R.M. 1982. An efficient algorithm for exploiting multiple arithmetic units. InComputer Structures: Principles and Examples, McGraw-Hill, New York, pp. 293–305.
Google Scholar
Wall, D.W. 1991. Limits of instruction-level parallelism. InProc., Fourth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.), pp. 176–188.
Warren, H.S. 1990. Instruction scheduling for the IBM RISC System/6000 processor.IBM J. Res. and Dev., 34, 1, (Jan.): 85–92.
Google Scholar
Wolfe, M. 1989.Optimizing Supercompilers for Supercomputers. MIT Press, Cambridge, Mass.
Google Scholar

Download references

Author information

Authors and Affiliations

Digital Equipment Corporation, HLO2-3/J03, 77 Reed Road, 01749, Hudson, MA
P. Geoffrey Lowney
Hewlett-Packard Laboratories, 1501 Page Mill Rd., 94304, Palo Alto, CA
Stefan M. Freudenberger
D.E. Shaw and Co., 39th Floor, Tower 45, 120 West 45th St., 10036, New York, NY
Thomas J. Karzes
Thinking Machines Corp., 245 First St., 02138, Cambridge, MA
W. D. Lichtenstein
Digital Equipment Corporation, 85 Swanson Rd, 01719, Boxborough, MA
Robert P. Nix
Equator Technologies, 1738 26th Avenue East, 98112, Seattle, WA
John S. O'Donnell
Silicon Graphics, 31 Cherry St., 06460, Milford, CT
John C. Ruttenberg

Authors

P. Geoffrey Lowney
View author publications
You can also search for this author in PubMed Google Scholar
Stefan M. Freudenberger
View author publications
You can also search for this author in PubMed Google Scholar
Thomas J. Karzes
View author publications
You can also search for this author in PubMed Google Scholar
W. D. Lichtenstein
View author publications
You can also search for this author in PubMed Google Scholar
Robert P. Nix
View author publications
You can also search for this author in PubMed Google Scholar
John S. O'Donnell
View author publications
You can also search for this author in PubMed Google Scholar
John C. Ruttenberg
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

About this article

Cite this article

Lowney, P.G., Freudenberger, S.M., Karzes, T.J. et al. The multiflow trace scheduling compiler. J Supercomput 7, 51–142 (1993). https://doi.org/10.1007/BF01205182

Download citation

Received: 01 May 1992
Revised: 01 October 1992
Issue Date: May 1993
DOI: https://doi.org/10.1007/BF01205182

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

The multiflow trace scheduling compiler

Abstract

Access this article

Similar content being viewed by others

Static Compiler Analyses for Application-specific Optimization of Task-Parallel Runtime Systems

Compiler Optimizations for Parallel Programs

Topic 4: High-Performance Architectures and Compilers

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Keywords

Navigation

The multiflow trace scheduling compiler

Abstract

Access this article

Similar content being viewed by others

Static Compiler Analyses for Application-specific Optimization of Task-Parallel Runtime Systems

Compiler Optimizations for Parallel Programs

Topic 4: High-Performance Architectures and Compilers

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation