RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads

Authors:
Amir Yazdanbakhsh, Georgia Institute of Technology
Gennady Pekhimenko, Carnegie Mellon University
Bradley Thwaites, Georgia Institute of Technology
Hadi Esmaeilzadeh, Georgia Institute of Technology
Onur Mutlu, Carnegie Mellon University
Todd C. Mowry, Carnegie Mellon University

ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 4, Article No. 62, pp. 1–26. https://doi.org/10.1145/2836168

Published: 06 January 2016

Abstract

This article tackles two fundamental memory bottlenecks: limited off-chip bandwidth (the bandwidth wall) and long access latency (the memory wall). Our approach exploits the inherent error resilience of a wide range of applications. We introduce an approximation technique called Rollback-Free Value Prediction (RFVP). When certain safe-to-approximate load operations miss in the cache, RFVP predicts the requested values. However, RFVP does not check for or recover from load-value mispredictions, thereby avoiding the high cost of pipeline flushes and re-executions. RFVP mitigates the memory wall by enabling execution to continue without stalling for long-latency memory accesses. To mitigate the bandwidth wall, RFVP drops a fraction of the load requests that miss in the cache after predicting their values. Dropping these requests removes them from the memory system and so reduces memory bandwidth contention. The drop rate is a knob that controls the trade-off between performance/energy efficiency and output quality. Our extensive evaluations show that RFVP, when used in GPUs, yields significant performance improvement and energy reduction for a wide range of quality-loss levels. We also evaluate RFVP’s latency benefits for a single-core CPU. The results show performance improvement and energy reduction for a wide variety of applications with less than 1% loss in quality.
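The mechanism described in the abstract can be illustrated with a small software model. The following is a minimal sketch, not the article's hardware design: the class name `RFVPSketch`, the dictionary-based cache, and the last-value predictor are all assumptions made for illustration, since the abstract does not specify the predictor or how drop decisions are made. It shows only the two ideas stated above: a safe-to-approximate load that misses returns a predicted value that is never checked, and a fraction of those misses (the drop rate) is never sent to memory.

```python
import random


class RFVPSketch:
    """Toy model of rollback-free value prediction for one load stream.

    Illustrative only: the last-value predictor, the dict-based cache,
    and all names are assumptions, not the article's design.
    """

    def __init__(self, memory, drop_rate, seed=0):
        self.memory = memory          # backing store: address -> value
        self.cache = {}               # addresses currently cached
        self.drop_rate = drop_rate    # fraction of predicted misses to drop
        self.last_value = 0.0         # state of the stand-in predictor
        self.rng = random.Random(seed)
        self.memory_requests = 0      # requests that actually reach memory

    def load(self, addr, approximable):
        if addr in self.cache:
            # Cache hit: return the exact value, no prediction involved.
            value = self.cache[addr]
            self.last_value = value
            return value
        if not approximable:
            # Precise load: must wait for the real value from memory.
            return self._fetch(addr)
        # Safe-to-approximate miss: predict and continue. The prediction
        # is never verified, so there is nothing to roll back.
        predicted = self.last_value
        if self.rng.random() >= self.drop_rate:
            # Request not dropped: it still goes to memory, fills the
            # cache, and updates the predictor state.
            self._fetch(addr)
        # Otherwise the request is dropped and never reaches memory.
        return predicted

    def _fetch(self, addr):
        self.memory_requests += 1
        value = self.memory[addr]
        self.cache[addr] = value
        self.last_value = value
        return value
```

With `drop_rate=1.0` every approximable miss is served by the predictor and `memory_requests` stays at zero for those loads; with `drop_rate=0.0` every miss still fetches its data, so only the latency benefit remains. Values in between trade memory traffic against output quality, which is the knob the abstract describes.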

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data

Published in
ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 4
January 2016
848 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2836331
Editor: Koen De Bosschere, Ghent University
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 January 2016
- Accepted: 1 October 2015
- Revised: 1 August 2015
- Received: 1 June 2015
Author Tags
GPUs
Load value approximation
memory bandwidth
memory latency
value prediction
Qualifiers
- research-article
- Research
- Refereed