ABSTRACT
Traditional list schedulers order instructions based on an optimistic estimate of the load latency imposed by the hardware and therefore cannot respond to variations in memory latency caused by cache hits and misses on non-blocking architectures. In contrast, balanced scheduling schedules instructions based on an estimate of the amount of instruction-level parallelism in the program. By scheduling independent instructions behind loads based on what the program can provide, rather than what the implementation stipulates in the best case (i.e., a cache hit), balanced scheduling can hide variations in memory latencies more effectively.
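To make the contrast concrete, here is a minimal sketch, in Python, of the two latency estimates. The dependence graph, the `FIXED_LOAD_LATENCY` value, and both weight functions are illustrative assumptions of ours, not the published balanced-scheduling algorithm, which computes load weights over the full dependence DAG of the scheduled region.

```python
# Toy dependence graph: each instruction maps to the set of instructions
# it depends on. "ld1" is a load; a1-a3 are ALU ops independent of it;
# "use" consumes the loaded value. (Hypothetical example, not from the paper.)
deps = {
    "ld1": set(),
    "a1": set(),
    "a2": set(),
    "a3": set(),
    "use": {"ld1"},
}

FIXED_LOAD_LATENCY = 2  # a traditional scheduler's optimistic cache-hit estimate


def list_weight(load):
    # Traditional list scheduling: charge every load the latency the
    # hardware promises in the best case, regardless of the surrounding code.
    return FIXED_LOAD_LATENCY


def balanced_weight(load):
    # Balanced scheduling, in spirit: size the load's latency estimate to the
    # parallelism the program offers, i.e. the instructions that neither
    # depend on the load nor feed it and so can be scheduled behind it.
    independent = [i for i in deps
                   if i != load and load not in deps[i] and i not in deps[load]]
    return 1 + len(independent)  # one issue cycle plus one per hidden instruction


print("list:", list_weight("ld1"))        # 2, whether the load hits or misses
print("balanced:", balanced_weight("ld1"))  # 4: a1, a2, a3 can hide a miss
```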
Since its success depends on the amount of instruction-level parallelism in the code, balanced scheduling should perform even better when more parallelism is available. In this study, we combine balanced scheduling with three compiler optimizations that increase instruction-level parallelism: loop unrolling, trace scheduling, and cache locality analysis. Using code generated for the DEC Alpha by the Multiflow compiler, we simulated a non-blocking processor architecture that closely models the Alpha 21164. Our results show that balanced scheduling benefits from all three optimizations, producing average speedups that range from 1.15 to 1.40 across the optimizations. More importantly, because of its ability to tolerate variations in load interlocks, it improves its advantage over traditional scheduling. Without the optimizations, balanced-scheduled code is, on average, 1.05 times faster than that generated by a traditional scheduler; with them, its lead increases to 1.18.
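As a sketch of why such optimizations feed balanced scheduling, the hypothetical code below unrolls a dot-product loop by four (our own illustration, not one of the simulated benchmarks). The rolled loop offers one load pair per iteration and little to place behind a miss; unrolling exposes four independent load/multiply chains whose latencies can overlap on a non-blocking machine.

```python
def rolled(a, b):
    # One multiply per iteration: each load has little independent work
    # behind it, so a cache miss stalls the pipeline.
    s = 0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s


def unrolled_by_4(a, b):
    # Assumes len(a) is a multiple of 4, for brevity. Four independent
    # accumulator chains give the scheduler more instructions to place
    # behind each load, which is exactly what balanced scheduling exploits.
    s0 = s1 = s2 = s3 = 0
    for i in range(0, len(a), 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    return s0 + s1 + s2 + s3
```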