ABSTRACT
Traditional list schedulers order instructions based on an optimistic estimate of the load latency imposed by the hardware and therefore cannot respond to variations in memory latency caused by cache hits and misses on non-blocking architectures. In contrast, balanced scheduling schedules instructions based on an estimate of the amount of instruction-level parallelism in the program. By scheduling independent instructions behind loads based on what the program can provide, rather than what the implementation stipulates in the best case (i.e., a cache hit), balanced scheduling can hide variations in memory latencies more effectively.
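To make the contrast concrete, here is a minimal sketch, in Python, of the two latency estimates. The dependence graph, the `FIXED_LOAD_LATENCY` value, and both weight functions are illustrative assumptions of ours, not the published balanced-scheduling algorithm, which computes load weights over the full dependence DAG of the scheduled region.

```python
# Toy dependence graph: each instruction maps to the set of instructions
# it depends on. "ld1" is a load; a1-a3 are ALU ops independent of it;
# "use" consumes the loaded value. (Hypothetical example, not from the paper.)
deps = {
    "ld1": set(),
    "a1": set(),
    "a2": set(),
    "a3": set(),
    "use": {"ld1"},
}

FIXED_LOAD_LATENCY = 2  # a traditional scheduler's optimistic cache-hit estimate


def list_weight(load):
    # Traditional list scheduling: charge every load the latency the
    # hardware promises in the best case, regardless of the surrounding code.
    return FIXED_LOAD_LATENCY


def balanced_weight(load):
    # Balanced scheduling, in spirit: size the load's latency estimate to the
    # parallelism the program offers, i.e. the instructions that neither
    # depend on the load nor feed it and so can be scheduled behind it.
    independent = [i for i in deps
                   if i != load and load not in deps[i] and i not in deps[load]]
    return 1 + len(independent)  # one issue cycle plus one per hidden instruction


print("list:", list_weight("ld1"))        # 2, whether the load hits or misses
print("balanced:", balanced_weight("ld1"))  # 4: a1, a2, a3 can hide a miss
```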
Since its success depends on the amount of instruction-level parallelism in the code, balanced scheduling should perform even better when more parallelism is available. In this study, we combine balanced scheduling with three compiler optimizations that increase instruction-level parallelism: loop unrolling, trace scheduling, and cache locality analysis. Using code generated for the DEC Alpha by the Multiflow compiler, we simulated a non-blocking processor architecture that closely models the Alpha 21164. Our results show that balanced scheduling benefits from all three optimizations, producing average speedups that range from 1.15 to 1.40 across the optimizations. More importantly, because of its ability to tolerate variations in load interlocks, it improves its advantage over traditional scheduling. Without the optimizations, balanced-scheduled code is, on average, 1.05 times faster than that generated by a traditional scheduler; with them, its lead increases to 1.18.
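As a sketch of why such optimizations feed balanced scheduling, the hypothetical code below unrolls a dot-product loop by four (our own illustration, not one of the simulated benchmarks). The rolled loop offers one load pair per iteration and little to place behind a miss; unrolling exposes four independent load/multiply chains whose latencies can overlap on a non-blocking machine.

```python
def rolled(a, b):
    # One multiply per iteration: each load has little independent work
    # behind it, so a cache miss stalls the pipeline.
    s = 0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s


def unrolled_by_4(a, b):
    # Assumes len(a) is a multiple of 4, for brevity. Four independent
    # accumulator chains give the scheduler more instructions to place
    # behind each load, which is exactly what balanced scheduling exploits.
    s0 = s1 = s2 = s3 = 0
    for i in range(0, len(a), 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    return s0 + s1 + s2 + s3
```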