
RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads

Published: 06 January 2016

Abstract

This article tackles two fundamental memory bottlenecks: limited off-chip bandwidth (the bandwidth wall) and long access latency (the memory wall). Our approach exploits the inherent error resilience of a wide range of applications. We introduce an approximation technique called Rollback-Free Value Prediction (RFVP). When certain safe-to-approximate load operations miss in the cache, RFVP predicts the requested values. RFVP does not check for or recover from load-value mispredictions, thereby avoiding the high cost of pipeline flushes and re-executions. RFVP mitigates the memory wall by letting execution continue without stalling for long-latency memory accesses. To mitigate the bandwidth wall, RFVP drops a fraction of the load requests that miss in the cache after predicting their values. Dropped requests are removed from the system, which reduces memory bandwidth contention. The drop rate is a knob that controls the trade-off between performance/energy efficiency and output quality. Our extensive evaluations show that RFVP, when used in GPUs, yields significant performance improvement and energy reduction for a wide range of quality-loss levels. We also evaluate RFVP’s latency benefits for a single-core CPU. The results show performance improvement and energy reduction for a wide variety of applications with less than 1% loss in quality.
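The load path described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the article's hardware design: the class name `ToyRFVP`, the per-PC last-value predictor, the default prediction of 0, and the deterministic drop policy are all hypothetical choices made for illustration. The abstract does not specify which predictor or drop mechanism RFVP uses.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRFVP:
    """Toy model of a rollback-free load path (illustrative assumptions only)."""
    memory: dict              # address -> true value, stands in for off-chip memory
    drop_rate: float          # fraction of predicted misses whose request is dropped
    cache: dict = field(default_factory=dict)
    last_value: dict = field(default_factory=dict)  # assumed per-PC last-value predictor
    memory_requests: int = 0  # requests that actually reach memory
    predicted_misses: int = 0
    dropped: int = 0

    def _fetch(self, addr):
        """Send a request to memory and fill the cache."""
        self.memory_requests += 1
        value = self.memory[addr]
        self.cache[addr] = value
        return value

    def load(self, pc, addr, approximable):
        # Cache hits behave like ordinary loads.
        if addr in self.cache:
            return self.cache[addr]
        # Precise loads always wait for the real value.
        if not approximable:
            return self._fetch(addr)
        # Safe-to-approximate miss: predict, and never verify or roll back.
        prediction = self.last_value.get(pc, 0)
        self.predicted_misses += 1
        # Assumed deterministic policy: drop while the running drop ratio
        # is below the configured drop rate.
        if self.dropped < self.drop_rate * self.predicted_misses:
            self.dropped += 1  # request removed, no memory traffic
        else:
            # Request still issued; the fill trains the predictor.
            self.last_value[pc] = self._fetch(addr)
        # Execution continues with the prediction either way.
        return prediction
```

In this model, raising `drop_rate` lowers `memory_requests` (less bandwidth pressure) while giving the predictor fewer real values to learn from, which mirrors the performance versus quality knob the abstract describes. For example, eight approximable misses with `drop_rate=0.5` send only four requests to memory.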

    • Published in

      ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 4
      January 2016
      848 pages
      ISSN: 1544-3566
      EISSN: 1544-3973
      DOI: 10.1145/2836331

      Copyright © 2016 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 6 January 2016
      • Accepted: 1 October 2015
      • Revised: 1 August 2015
      • Received: 1 June 2015

      Qualifiers

      • research-article
      • Research
      • Refereed
