Skip to main content
Log in

The multiflow trace scheduling compiler

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

The Multiflow compiler uses the trace scheduling algorithm to find and exploit instruction-level parallelism beyond basic blocks. The compiler generates code for VLIW computers that issue up to 28 operations each cycle and maintain more than 50 operations in flight. At Multiflow the compiler generated code for eight different target machine architectures and compiled over 50 million lines of Fortran and C applications and systems code. The requirement of finding large amounts of parallelism in ordinary programs, the trace scheduling algorithm, and the many unique features of the Multiflow hardware placed novel demands on the compiler. New techniques in instruction scheduling, register allocation, memory-bank management, and intermediate-code optimizations were developed, as were refinements to reduce the overhead of trace scheduling. This article describes the Multiflow compiler and reports on the Multiflow practice and experience with compiling for instruction-level parallelism beyond basic blocks.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  • Aho, A.V., and Ullman, J.D. 1977.Principles of Compiler Design. Addison-Wesley, Reading, Mass.

    Google Scholar 

  • Aiken, A., and Nicolau, A. 1988. Optimal loop parallelization. InProc., SIGPLAN '88 Conf. on Programming Language Design and Implementation (Atlanta, June), pp. 308–317.

  • Allen, F.E., Carter, J.L., Fabri, J., Ferrante, J., Harrison, W.H., Loewner, P.G., and Trevillyan, L.H. 1980. The experimental compiling system.IBM J. Res. and Dev., 24, 6 (Nov.): 695–715.

    Google Scholar 

  • Allen, J.R., and Kennedy, K. 1984. Automatic loop interchange. InProc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal, June), pp. 233–246.

  • Allen, R., and Kennedy, K. 1987. Automatic translation of FORTRAN programs to vector form.ACM Trans. Programming Languages and Systems, 9, 4 (Oct.): 491–542.

    Google Scholar 

  • American National Standards. 1978. American National Standard Programming Language Fortran. Am. Nat. Standards Institute, New York.

    Google Scholar 

  • Banerjee, U. 1976. Data dependence in ordinary programs. M.S. thesis, Dept. of Comp. Sci. Rept. no. 76-837, Univ. of Ill., Urbana-Champaign, Ill.

    Google Scholar 

  • Banerjee, U. 1979. Speedup of ordinary programs. Ph.D. thesis, Dept. of Comp. Sci. Rept. no. 79–989, Univ. of Ill., Urbana-Champaign, Ill.

    Google Scholar 

  • Callahan, D., and Koblenz, B. 1991. Register allocation via hierarchical graph coloring. InProc., ACM SIGPLAN '91 Conf. on Programming Language Design and Implementation (Toronto, June), pp. 192–203.

  • Callahan, D., Carr, S., and Kennedy, K. 1990. Improving register allocation for subscripted variables. InProc., ACM SIGPLAN '90 Conf. on Programming Languge Design and Implementation (White Plains, N.Y., June), pp. 53–65.

  • Chaitin, G.J. 1982. Register allocation and spilling via graph coloring. InProc., ACM SIGPLAN '82 Symp. on Compiler Construction (Boston, June), pp. 98–105.

  • Chaitin, G.J., Auslander, M.A., Chandra, A.K., Cocke, J., Hopkins, M.E., and Markstein, P.W. 1981. Register allocation via coloring.Comput. Lang., 6, 1 (Jan.): 47–57.

    Google Scholar 

  • Chastain, M., Gostlin, G., Mankovich, J., and Wallach, S. 1988. The Convex C240 architecture. InProc., IEEE 1988 Supercomputing Conf., pp. 321–329.

  • Chow, F.C. 1983. A portable machine-independent global optimizer—Design and measurements. Tech. Rept. 83-254, Comp. Systems Lab., Stanford Univ., Stanford, Calif.

    Google Scholar 

  • Chow, F., and Hennessy, J. 1984. Register allocation by priority-based coloring. InProc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal, June), pp. 222–232.

  • Chow, F.C., and Hennessy, J.L. 1990. The priority-based coloring approach to register allocation.ACM Trans. Programming Languages and Systems, 12, 4 (Oct.): 501–536.

    Google Scholar 

  • Chow, F., Himelstein, M., Killian, E., and Weber, L. 1986. Engineering a RISC compiler system. InProc., IEEE Compcon.

  • Chow, F., Correll, S., Himelstein, M., Killian, E., and Weber, L. 1987. How many addressing modes are enough? InProc., Second Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Oct.), pp. 117–121.

  • Colwell, R.P., Nix, R.P., O'Donnell, J.J., Papworth, D.B., and Rodman, P.K. 1988. A VLIW architecture for a trace scheduling compiler.IEEE Trans. Comps., C-37, 8 (Aug.): 967–979.

    Google Scholar 

  • Colwell, R.P., Hall, W.E., Joshi, C.S., Papworth, D.B., Rodman, P.K., and Tornes, J.E. 1990. Architecture and implementation of a VLIW supercomputer. InProc., Supercomputing '90 (Nov.), pp. 910–919.

  • Dehnert, J.C., Hsu, P.Y.-T., and Bratt, J. P. 1989. Overlapped loop support in the Cydra 5. InProc., Third Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Apr.), pp. 26–38.

  • Dongarra, J.J. 1991. Performance of various computers using standard linear equations software. CS-89-85, Dept. of Comp. Sci., Univ. of Tenn., Knoxville, Tenn.

    Google Scholar 

  • Ellis, J.R. 1986.Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge, Mass.

    Google Scholar 

  • Ferrante, J., Ottenstein, K.J., and Warren, J.D. 1987. The program dependence graph and its use in optimization.ACM Trans. Programming Languages and Systems, 9, 3, (July): 319–349.

    Google Scholar 

  • Feldman, S.I. 1979. Implementation of a portable F77 compiler using modern tools. InProc., SIGPLAN Symp. on Compiler Construction (Aug.), pp. 98–106.

  • Fisher, J.A. 1979. The optimization of horizontal microcode within and beyond basic blocks: An application of processor scheduling with resources. Ph.D. thesis, New York Univ., New York.

    Google Scholar 

  • Fisher, J.A. 1981. Trace scheduling: A technique for global microcode compaction.IEEE Trans. Comps., C-30, 7 (July): 478–490.

    Google Scholar 

  • Fisher, J.A., Ellis, J.R., Ruttenberg, J.C. and Nicolau, A. 1984. Parallel processing: A smart compiler and a dumb machine. InProc., ACM SIGPLAN '84 Symp. on Compiler Construction (Montreal, June), pp. 37–47.

  • Foster, C.C., and Riseman, E.M. 1972. Percolation of code to enhance parallel dispatching and execution.IEEE Trans. Comps., C-21, 12 (Dec): 1411–1415.

    Google Scholar 

  • Freudenberger, S.M., and Ruttenberg, J.C. 1992. Phase ordering of register allocation and instruction scheduling. InCode Generation—Concepts, Tools, Techniques (R. Giergerich and S.L. Graham, eds.), Springer-Verlag, London, pp. 146–170.

    Google Scholar 

  • Gibbons, P.B., and Muchnik, S.S. 1986. Efficient instruction scheduling for a pipelined architecture. InProc., SIGPLAN '86 Symp. on Compiler Construction (Palo Alto, Calif., June), pp. 11–16.

  • Gross, T. 1983. Code optimization of pipeline constraints. Tech. Rept. no. 83-255, Comp. Systems Lab., Stanford Univ., Stanford, Calif.

    Google Scholar 

  • Gross, T., and Ward, M. 1990. The suppression of compensation code. InAdvances in Languages and Compilers for Parallel Computing (A. Nicolau, D. Gelernter, T. Gross, and D. Padua, eds.), Pitman/MIT Press, pp. 260–273.

  • Hennessy, J.L., and Gross, T.R. 1983. Postpass code optimization of pipeline constraints.ACM Trans. Programming Languages and Systems, 5, 3 (July): 422–448.

    Google Scholar 

  • Hewlett-Packard. 1991.PA-RISC Procedure Calling Convention Reference Manual. Hewlett-Packard Laboratories, order no. 09740-90015.

  • Himelstein, M.I. 1991. Compiler tail ends. Tutorial notes,SIGPLAN '91 Conf. on Programming Language Design and Implementation.

  • IBM. 1992.IBM RS/6000 Assembler Language Reference, 2nd ed. IBM Corp. (Jan.).

  • Johnson, S.C. 1979. A tour through the portable C compiler. Bell Labs., Murray Hill, N.J.

    Google Scholar 

  • Joshi, C.S., Reger, B.A., and Feehrer, J.R. 1991. Memory system for a statically scheduled supercomputer. InProc., 1991 Internat. Conf. on Parallel Processing, vol. 1, pp. 196–203.

    Google Scholar 

  • Jouppi, N.P., and Wall, D.W. 1989. Available instruction-level parallelism for superscalar and superpipelined machines. InProc., Third Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Apr.), pp. 272–182.

  • Kane, G. 1987.MIPS R2000 RISC Architecture. Prentice-Hall, Englewood Cliffs, N.J.

    Google Scholar 

  • Kernighan, B.W., and Ritchie, D.M. 1978.The C Programming Language. Prentice-Hall, Englewood Cliffs, N.J.

    Google Scholar 

  • Knuth, D.E. 1981.The Art of Computer Programming, vol. 2, Seminumerical Algorithms, 2nd ed. Addison-Wesley, Reading, Mass.

    Google Scholar 

  • Kuck, D.J., Kuhn, R.H., Padua, D.A., Leasure, B., and Wolfe, M. 1981. Dependence graphs and compiler optimizations. InProc., Eighth Annual ACM Symp. on Principles of Programming Languages (Williamsburg, Va., Jan.), pp. 207–218.

  • Lam, M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. InProc., SIGPLAN '88 Conf. on Programming Design and Implementation (Atlanta, June 22–24), pp. 318–328.

  • Lam, M.S. 1989.A Systolic Array Optimizing Compiler. Kluwer Academic, Norwell, Mass.

    Google Scholar 

  • Linn, J.L. 1983. SRDAG compaction—A generalization of trace scheduling to increase the use of global context information. InProc., 16th Annual Microprogramming Workshop (Oct.), pp. 11–22.

  • Linn, J.L. 1988. Horizontal microcode compaction. InMicroprogramming and Firmware Engineering Methods (S. Habib, ed.), Van Nostrand Rheinhold, New York, pp. 381–431.

    Google Scholar 

  • Little, J.D.C. 1961. A proof of the queing formulaL = λW.Operations Research, 9, 3 (May): 383–387. (Cited in Coffman, E.G., Jr., and Denning, P.J. 1973.Operating Systems Theory. Prentice-Hall, Englewood Cliffs, N.J.)

    Google Scholar 

  • Lubeck, O.M. 1988. Supercomputer performance: The theory, practice, and results. LA-11204-MS, Los Alamos Nat. Lab., Los Alamos, N.M.

    Google Scholar 

  • McMahon, F.H. 1986. The Livermore Fortran kernels: A computer test of the numerical performance range. UCRL-53745, Lawrence Livermore Nat. Lab., Univ. of Calif., Livermore.

    Google Scholar 

  • Mercer, R. 1988. The CONVEX FORTRAN 5.0 compiler. InProc., Supercomputing '88 (St. Petersburg, Fla., Nov.), pp. 164–175.

  • Morel, E., and Renvoise, C. 1979. Global optimization by suppression of partial redundancies.CACM, 22, 2 (Feb.): 96–103.

    Google Scholar 

  • Nicolau, A. 1984. Parallelism, memory anti-aliasing and correctness for trace scheduling compilers. Ph.D. thesis, Yale Univ., New Haven, Conn.

    Google Scholar 

  • Nicolau, A., and Fisher, J. A. 1981. Using an oracle to measure parallelism in single instruction stream programs. InProc., 14th Annual Microprogramming Workshop (Oct.), pp. 171–182.

  • O'Brien, K., Hay, B., Minish, J., Schaffer, H., Schloss, B., Shepherd, A., and Zaleski, M. 1990. Advanced compiler technology for the RISC System/6000 architecture. InIBM RISC System/6000 Technology. IBM Corp., Austin, Tex., pp. 154–161.

    Google Scholar 

  • Odnert, D., Hansen, R., Dadoo, M., and Laventhal, M. 1991. Architecture and compiler enhancements for PA-RISC workstations. InProc., IEEE Compcon (spring), pp. 214–218.

  • Padua, D.A., and Wolfe, M.J. 1986. Advanced compiler optimizations for supercomputers.CACM, 29, 12 (Dec): 1184–1201.

    Google Scholar 

  • Pettis, K., and Hansen, R.C. 1990. Profile guided code positioning. InProc., ACM SGPLAN90 Conf. on Programming Language Design and Implementation (White Plains, N.Y., June), pp. 16–27.

  • Rau, B.R., and Glaeser, C.D. 1981. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. InProc., Fourteenth Annual Workshop on Microprogramming (Oct.), pp. 183–198.

    Google Scholar 

  • Rau, B.R., Yen, D.W.L., Yen, W., and Towle, R.A. 1989. The Cydra 5 departmental supercomputer: Design philosophies, decisions, and trade-offs.IEEE Computer, 22, 1 (Jan.): 12–34.

    Google Scholar 

  • Reif, J.H. 1977. Combinatorial aspects of symbolic program analysis. Tr-11-77, Center for Research in Computing Technology, Cambridge, Mass.

    Google Scholar 

  • Reif, J.H. 1980. Code motion.SIAM J. Computing, 9, 2 (May): 375–395.

    Google Scholar 

  • Reif, J.H., and Tarjan, R.E. 1982. Symbolic analysis in almost linear time.SIAM J. Computing, 11, 1 (Feb.): 81–93.

    Google Scholar 

  • Rodman, P. 1989. High performance FFTs for a VLIW architecture. InProc., 1989 Symp. on Computer Architecture and Digital Signal Processing (Hong Kong).

  • SPEC. 1989.SPEC Newsletter, 1, 1 (fall). (Published by Waterside Assoc, Fremont, Calif.)

    Google Scholar 

  • Thornton, J.E. 1970.Design of a Computer: The Control Data 6600. Scott, Foresman, Glenview, Ill.

    Google Scholar 

  • Tirumalai, P., Lee, M., and Schlansker, M. S. 1990. Parallelization of loops with exits on pipelined architectures. InProc., Supercomputing '90 (New York, Nov.), pp. 200–212.

  • Tjaden, G.S., and Flynn, M.J. 1970. Detection and parallel execution of independent instructions.IEEE Trans. Comps., C-19, 10 (Oct.): 889–895.

    Google Scholar 

  • Tomasulo, R.M. 1982. An efficient algorithm for exploiting multiple arithmetic units. InComputer Structures: Principles and Examples, McGraw-Hill, New York, pp. 293–305.

    Google Scholar 

  • Wall, D.W. 1991. Limits of instruction-level parallelism. InProc., Fourth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.), pp. 176–188.

  • Warren, H.S. 1990. Instruction scheduling for the IBM RISC System/6000 processor.IBM J. Res. and Dev., 34, 1, (Jan.): 85–92.

    Google Scholar 

  • Wolfe, M. 1989.Optimizing Supercompilers for Supercomputers. MIT Press, Cambridge, Mass.

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Rights and permissions

Reprints and permissions

About this article

Cite this article

Lowney, P.G., Freudenberger, S.M., Karzes, T.J. et al. The multiflow trace scheduling compiler. J Supercomput 7, 51–142 (1993). https://doi.org/10.1007/BF01205182

Download citation

  • Received:

  • Revised:

  • Issue Date:

  • DOI: https://doi.org/10.1007/BF01205182

Keywords

Navigation