ABSTRACT
This paper shows how to exploit the inherent error resilience of a wide range of applications to mitigate the memory wall, the growing discrepancy between core and memory speed. We introduce a new microarchitecturally triggered approximation technique called rollback-free value prediction. This technique predicts the values of safe-to-approximate loads when they miss in the cache, without tracking mispredictions or paying for costly recovery from misspeculation. It thus mitigates the memory wall by allowing the core to continue computing instead of stalling on long-latency memory accesses. Our detailed study of the quality trade-offs shows that, on a modern out-of-order processor, an average 8% (up to 19%) performance improvement is achievable with an average 0.8% (up to 1.8%) quality loss on an approximable subset of SPEC CPU 2000/2006.
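To make the mechanism concrete, the following is a minimal behavioral sketch of rollback-free value prediction. It assumes a last-value predictor indexed by load PC and a toy fully associative cache; the paper's actual predictor organization and cache model may differ, and the class and method names here are illustrative only.

```python
class LastValuePredictor:
    """Predicts that a load will return the last value it produced."""
    def __init__(self):
        self.table = {}  # load PC -> last observed value

    def predict(self, pc):
        return self.table.get(pc, 0)  # cold entries predict 0

    def train(self, pc, value):
        self.table[pc] = value


class ApproxLoadUnit:
    """On a cache miss, a safe-to-approximate load returns the predicted
    value immediately instead of stalling. The value fetched from memory
    only trains the predictor -- there is no misprediction tracking and
    no rollback, which is the key difference from speculative value
    prediction."""
    def __init__(self, memory, cache_lines=4):
        self.memory = memory       # address -> value
        self.cache = {}            # toy fully associative cache
        self.cache_lines = cache_lines
        self.predictor = LastValuePredictor()

    def load(self, pc, addr, approximate):
        """Returns (value, used_prediction)."""
        if addr in self.cache:                      # hit: exact value
            value = self.cache[addr]
            self.predictor.train(pc, value)
            return value, False
        prediction = self.predictor.predict(pc)     # available at once
        fetched = self.memory[addr]                 # long-latency fetch
        if len(self.cache) >= self.cache_lines:     # FIFO eviction
            self.cache.pop(next(iter(self.cache)))
        self.cache[addr] = fetched
        self.predictor.train(pc, fetched)           # train off critical path
        if approximate:
            return prediction, True                 # core keeps computing
        return fetched, False                       # precise load must wait
```

The approximate path never verifies the prediction against the fetched value, so output quality, not rollback hardware, absorbs any error; this is why the technique is restricted to safe-to-approximate loads.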
Index Terms
- Rollback-free value prediction with approximate loads