ABSTRACT
Aggressive prefetching helps many applications tolerate memory latency, but it faces significant challenges in multi-core systems. The prefetchers of different cores on a chip multiprocessor (CMP) can interfere significantly with the prefetch and demand accesses of other cores. Because existing prefetcher throttling techniques do not address this prefetcher-caused inter-core interference, aggressive prefetching in multi-core systems can degrade performance significantly and waste bandwidth.
To make prefetching effective in CMPs, this paper proposes a low-cost mechanism to control prefetcher-caused inter-core interference by dynamically adjusting the aggressiveness of multiple cores' prefetchers in a coordinated fashion. Our solution consists of a hierarchy of prefetcher aggressiveness control structures that combine per-core (local) and prefetcher-caused inter-core (global) interference feedback to maximize the benefits of prefetching on each core while optimizing overall system performance. These structures improve system performance by 23% while reducing bus traffic by 17% compared to employing aggressive prefetching and improve system performance by 14% compared to a state-of-the-art prefetcher aggressiveness control technique on an eight-core system.
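The two-level control described above can be illustrated with a minimal sketch. It is not the paper's implementation: the aggressiveness scale, the threshold values, the `CoreFeedback` fields, and the rule that global feedback overrides the local proposal are all assumptions made for illustration, since the abstract gives no concrete parameters.

```python
from dataclasses import dataclass

# Hypothetical aggressiveness scale, from least to most aggressive.
MIN_LEVEL, MAX_LEVEL = 0, 4

# Hypothetical thresholds; the abstract does not state concrete values.
ACCURACY_HIGH = 0.75
ACCURACY_LOW = 0.40
INTERFERENCE_HIGH = 0.20


@dataclass
class CoreFeedback:
    accuracy: float      # fraction of this core's prefetches that were useful
    interference: float  # assumed measure of harm this core's prefetches cause other cores


def local_decision(level: int, fb: CoreFeedback) -> int:
    """Per-core (local) step: raise the level when accurate, lower it when inaccurate."""
    if fb.accuracy >= ACCURACY_HIGH:
        return min(level + 1, MAX_LEVEL)
    if fb.accuracy < ACCURACY_LOW:
        return max(level - 1, MIN_LEVEL)
    return level


def global_decision(level: int, proposed: int, fb: CoreFeedback) -> int:
    """Global step: accept the local proposal unless this core's prefetcher hurts other cores."""
    if fb.interference >= INTERFERENCE_HIGH:
        # Assumed override rule: throttle down from the current level.
        return max(level - 1, MIN_LEVEL)
    return proposed


def update_levels(levels, feedback):
    """Apply one control interval to every core's prefetcher."""
    return [
        global_decision(lvl, local_decision(lvl, fb), fb)
        for lvl, fb in zip(levels, feedback)
    ]


if __name__ == "__main__":
    levels = [2, 2, 2]
    feedback = [
        CoreFeedback(accuracy=0.90, interference=0.05),  # accurate, harmless
        CoreFeedback(accuracy=0.90, interference=0.50),  # accurate, but hurts other cores
        CoreFeedback(accuracy=0.20, interference=0.00),  # inaccurate
    ]
    print(update_levels(levels, feedback))
```

In this sketch the second core is throttled down even though its own prefetches are accurate, which is the case a purely local controller would miss.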
Index Terms
- Coordinated control of multiple prefetchers in multi-core systems