ABSTRACT
Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, a third facet of locality that is frequently present in real-world programs, and describe how to effectively capture and exploit it in order to perform load value prediction. Temporal and spatial locality are attributes of storage locations, and describe the future likelihood of references to those locations or their close neighbors. In a similar vein, value locality describes the likelihood of the recurrence of a previously-seen value within a storage location. Modern processors already exploit value locality in a very restricted sense through the use of control speculation (i.e., branch prediction), which seeks to predict the future value of a single condition bit based on previously-seen values. Our work extends this to predict entire 32- and 64-bit register values based on previously-seen values. We find that, just as condition bits are fairly predictable on a per-static-branch basis, full register values being loaded from memory are frequently predictable as well. Furthermore, we show that simple microarchitectural enhancements to two modern microprocessor implementations (based on the PowerPC 620 and Alpha 21164) that enable load value prediction can effectively exploit value locality to collapse true dependencies, reduce average memory latency and bandwidth requirements, and provide measurable performance gains.
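As a rough illustration of what value locality means for a static load, the sketch below counts how often a load returns the same value it returned the last time the same instruction executed. It is a minimal software model under stated assumptions, not the hardware mechanism evaluated in the paper: the function name `last_value_hit_rate`, the `(pc, value)` trace format, and the sample trace are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def last_value_hit_rate(trace: Iterable[Tuple[int, int]]) -> float:
    """Return the fraction of dynamic loads whose value matches the value
    most recently loaded by the same static load (identified by its PC).

    Assumption: `trace` is a sequence of (load_pc, loaded_value) pairs.
    """
    last_seen: Dict[int, int] = {}  # load PC -> most recent loaded value
    hits = 0
    total = 0
    for pc, value in trace:
        total += 1
        # A "hit" means a last-value predictor would have guessed correctly.
        if last_seen.get(pc) == value:
            hits += 1
        last_seen[pc] = value
    return hits / total if total else 0.0


# Hypothetical trace: the load at 0x400 keeps returning 7 (high value
# locality); the load at 0x404 returns a different value every time.
trace = [(0x400, 7), (0x404, 1), (0x400, 7), (0x404, 2), (0x400, 7), (0x404, 3)]
rate = last_value_hit_rate(trace)
```

In this made-up trace only the repeated executions of the load at `0x400` count as hits, which mirrors the per-static-instruction view of predictability that the abstract draws from branch prediction.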