Abstract
Memory latency has become a major performance bottleneck in current microprocessors, and the problem worsens as the number of cores sharing the same memory controller grows. A common way to mitigate this problem is to implement cache hierarchies with large Last-Level Cache (LLC) organizations.
LLC memories are implemented with a high number of ways (e.g., 16) to reduce conflict misses. Caches have typically implemented the LRU algorithm to exploit temporal locality, but its performance diverges from the optimal as the number of ways increases. In addition, implementing a strict LRU algorithm is costly in terms of area and power.
This article focuses on a family of low-cost replacement strategies whose implementation cost scales with the number of ways while maintaining performance. The proposed strategies track the access order of only a few blocks, which cannot be replaced; the victim is selected at random among the remaining blocks, which exhibit poor locality. Although random selection generally improves performance, in some applications the scheme falls behind the LRU policy and degrades performance. This drawback can be overcome by adding a small victim cache to the large LLC.
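The core idea above (exact recency tracking for only a few blocks, random victim selection among the rest, and a small victim cache as a safety net) can be illustrated with a minimal sketch. This is not the paper's exact design: the class name, the `tracked` and `victim_slots` parameters, and the FIFO victim buffer are illustrative assumptions.

```python
import random

class SelectiveRandomSet:
    """Sketch of one cache set: recency order is tracked only for the
    first `tracked` positions (these blocks are protected); the victim
    is chosen at random among the remaining, poor-locality blocks.
    Illustrative only; names/parameters are not from the paper."""

    def __init__(self, ways=16, tracked=4, victim_slots=2):
        self.ways = ways
        self.tracked = tracked        # blocks with tracked recency (protected)
        self.victim_slots = victim_slots
        self.blocks = []              # tags; index 0 = most recently used
        self.victim_cache = []        # small FIFO victim buffer

    def access(self, tag):
        """Return True on a hit (set or victim cache), False on a miss."""
        if tag in self.blocks:
            # Hit: promote to the MRU position so it stays protected.
            self.blocks.remove(tag)
            self.blocks.insert(0, tag)
            return True
        if tag in self.victim_cache:
            # Victim-cache hit: bring the block back into the set.
            self.victim_cache.remove(tag)
            self._insert(tag)
            return True
        self._insert(tag)
        return False

    def _insert(self, tag):
        if len(self.blocks) >= self.ways:
            # Replacement candidates are the blocks outside the tracked
            # (recently used) region; pick one of them at random.
            candidates = self.blocks[self.tracked:] or self.blocks
            victim = random.choice(candidates)
            self.blocks.remove(victim)
            # Spill the victim into the small victim cache (FIFO).
            self.victim_cache.insert(0, victim)
            del self.victim_cache[self.victim_slots:]
        self.blocks.insert(0, tag)
```

In this sketch, only the `tracked` MRU positions need ordering state, so the replacement logic does not grow with associativity the way a full LRU stack does, and a block evicted by an unlucky random choice gets a second chance through the victim cache.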
Experimental results show that the best version of the family without a victim cache reduces MPKI by between 10% and 11% compared to a set of the most representative state-of-the-art algorithms, and by up to 22% with respect to LRU. The proposal with a victim cache improves speedup by 4% on average compared to LRU, and reduces dynamic energy by up to 8% on average. Finally, compared to the studied algorithms, the baseline algorithm of the family largely reduces hardware complexity.