ABSTRACT
This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code.
This paper extends previous results of software pipelining in two ways: First, this paper shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement be obtained.
The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.
- 1.Aiken, A. and Nicolau, A. Perfect Pipelining: A New Loop Parallelization Technique. Comell University, Oct., 1987.Google Scholar
- 2.Ammratone, M., Bitz, F., Clune E., Kung H. T., Maulik, P., Ribas, H., Tseng, P., and Webb, J. Applications of Warp. Proc. Compcon Spring 87, San Francisco, Feb., 1987, pp. 272-275.Google Scholar
- 3.Annaratone, M., Bitz, F., Deutch, J., Hamey, L., Kung, H. T., Maulik P. C., Tseng, P., and Webb, J. A. Applications Experience on Warp. Proc. 1987 National Computer Cofference, AFIPS, Chicago, June, 1987, pp. 149-158.Google Scholar
- 4.Annaratone, M., Amould, E., Gross, T., Kung, H. T., Lain, M., Menzilcioglu, O. and Webb, I. A. "The Warp Computer: Architecture, Implementation and Performance". IEEE Transactions on Computers C-36, 12 (December 1987). Google ScholarDigital Library
- 5.Colwell, R. P., Nix, R. P., O' Donnell, J. I., Papworth, D. B., and Rodman, P. K. A VLIW Architecture for a Trace Scheduling Compiler. Proc. Second Intl. Conf. on Amhitecmral Support for Programming Languages and Operating Systems, Oct., 1987, pp. 180-192. Google ScholarCross Ref
- 6.Dantzig, G. B., Blatmer, W. O. and Rao, M. R. All Shortest Routes from a Fixed Origin in a Graph. Theory of Graphs, Rome, July, 1967, pp. 85-90.Google Scholar
- 7.E2xfioglu, Kemal. A Compilation Technique for Software Pipelining of Loops with Conditional Jumps. Proc. 20th Annual Workshop on Microprogramming, Dec., 1987. Google ScholarDigital Library
- 8.Ellis, John R. Bulldog: A Compiler for VLIW Architectures. Ph.D. Th., Yale University, 1985. Google ScholarDigital Library
- 9.Fisher, J. A. The Optimization of Horizontal Microcode Within and Beyond Basic Blocks: An Application of Processor Scheduling with Resources. Ph.D. Th., New York Univ., OeL 1979. Google ScholarDigital Library
- 10.Fisher, J. A. "Trace Scheduling: A Technique for Global Microcode Compaction". IEEE Trans. on Computers C-30, 7 (July 1981), 478-490.Google ScholarDigital Library
- 11.Fisher, J. A., Ellis, J. R., Ruttenberg, J. C. and Nicolau, A. Parallel Processing: A Smart Compiler and a Dumb Machine. Proc. ACM SIGPLAN '84 Syrup. on Compiler Construction, Montreal, Canada, June, 1984, pp. 37-47. Google ScholarDigital Library
- 12.Fisher, J. A., Landskov, D. and Shriver, B. D. Microcode Compaction: looking Backward and Looking Forward. Proc. 1981 National Computer Conference, 1981, pp. 95-102.Google ScholarDigital Library
- 13.Floyd, R. W. "Algorithm 97: Shortest Path". Comm. ACM 5, 6 (1962), 345. Google ScholarDigital Library
- 14.Garey, Michael R. and Johnson, David S. Computers and Intractability A Guide to the Theory of NP-Completeness. Freeman, 1979. Google ScholarDigital Library
- 15.Gross, T. and Lain, M. Compilation for a High-performance Systolic Array. Proc. ACM SIGPLAN 86 Symposium on Compiler Construction, June, 1986, pp. 27-3 8. Google ScholarDigital Library
- 16.Hsu, Peter. Highly Concurrent Scalar Processing. Ph.D. Th., University of Hlinois at Urbana-Champaign, 1986. Google ScholarDigital Library
- 17.Isoda, Sadahiro, Kobayashi, Yoshizumi, and Ishida, Tom. "Global Compaction of Horizontal Microprograms Based on the General/zed Data Dependency Graph". IEEE Trans. on Computers o32, 10 (October 1983), 922-933.Google ScholarDigital Library
- 18.Kuck, D. J., Kuhn, R. H., Padua, D. A., Leasure, B. and Wolfe, M. l)etxmdence Graphs and Compiler Optimizations. Proc. ACM Symposium on Principles of Programming Languages, January, 1981, pp. 207-218. Google ScholarDigital Library
- 19.Lah, I. and Atkin, E. Tree Compaction of Microprograms. Proc. 16th Annual Workshop on Microprogramming, Oct., 1982, pp. 23-33.Google Scholar
- 20.Lam, Monica. Compiler Optimizations for Asynchronous Systolic Array Programs. Proc. Fifteenth Annual ACM Symposium on Principles of Programming Languages, Jan., 1988. Google ScholarDigital Library
- 21.Lain, Monica. A Systolic Array Optimizing Compiler. Ph.D. Th., Carnegie Mellon University, May 1987.Google Scholar
- 22.Linn, Joseph L SRDAG Compaction - A Generalization of Trace Scheduling to Increase the Use of Global Context Information. Proc. 16th Annual Workshop on Microprogramming, 1983, pp. 11-2ZGoogle Scholar
- 23.McMahon, F. H. Lawrence Livermore National Laboratory FORTRAN Kernels: MFLOPS.Google Scholar
- 24.Patel, Janak H. and Davidson, Edward S. Improving the Throughput of a Pipeline by Insertion of Delays. Proc. 3rd Annual Symposium on Computer Architecture, Jan., 1976, pp. 159-164. Google ScholarDigital Library
- 25.Rau, B. R. and Glaeser, C. D. Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing. Proc. 14th Annual Workshop on Microprogramming, Oct, 1981, pp. 183-198. Google ScholarDigital Library
- 26.Su, B., Ding, S. and Jin, L. An Improvement of Trace Scheduling for Global Microcode Compaction. Proc. 17th Annual Workshop in Microprogramming, Dec., 1984, pp. 78-85. Google ScholarDigital Library
- 27.5u, B., Ding, $., Wang, I. and Xia, J. GURPR- A Method for Global Software Pipelining. Proc. 20th Annual Workshop on Microprogramming, Dec., 1987, pp. 88-96. Google ScholarDigital Library
- 28.Su, B., Ding, S. and Xia, J. URPR- An Extension of URCR for Software Pipeline. Proc. 19th Annual Workshop on Microprogramruing, OCt., 1986, pp. 104-108. Google ScholarDigital Library
- 29.Tarjan, R. E. "Depth first search and linear graph algorithms". SlAM J. Computing 1, 2 (1972), 146-160.Google Scholar
- 30.Touzeau, R. F. A Fortran Compiler for the FPS-164 Scientific Computer. Proc. ACM SIGPLAN '84 Syrup. on Compiler Construction, June, 1984, pp. 48-57. Google ScholarDigital Library
- 31.Weiss, S. and Smith, J. E. A Study of Scalar Compilation Techniques for Pipelined Supercomputers. Proc. Second Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct, 1987, pp. 105-109. Google ScholarCross Ref
- 32.Wood, Graham. Global Op 'tlmization of Microprograms Through Modular Control Congress. Proe. 12th Annual Workshop in Microprogramming, 1979, pp. 1-6. Google ScholarDigital Library
Index Terms
- Software pipelining: an effective scheduling technique for VLIW machines
Recommendations
Software pipelining: an effective scheduling technique for VLIW machines
Proceedings of the SIGPLAN '88 conference on Programming language design and implementationThis paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding ...
Software Pipelining Irregular Loops On the TMS320C6000 VLIW DSP Architecture
LCTES '01: Proceedings of the ACM SIGPLAN workshop on Languages, compilers and tools for embedded systemsThe TMS320C6000 architecture is a leading family of Digital Signal Processors (DSPs). To achieve peak performance, this VLIW architecture relies heavily on software pipelining. Traditionally, software pipelining has been restricted to regular (FOR) ...
Software Pipelining Irregular Loops On the TMS320C6000 VLIW DSP Architecture
OM '01: Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systemsThe TMS320C6000 architecture is a leading family of Digital Signal Processors (DSPs). To achieve peak performance, this VLIW architecture relies heavily on software pipelining. Traditionally, software pipelining has been restricted to regular (FOR) ...
Comments