ABSTRACT
Caches are essential to the performance of modern microprocessors. Much recent work on last-level caches has focused on exploiting reference locality to improve efficiency. However, value redundancy is another source of potential improvement. We find that many blocks in the working set of typical benchmark programs have the same values. We propose cache deduplication, a technique that effectively increases last-level cache capacity. Rather than exploiting specific value redundancy with compression, as in previous work, our scheme detects duplicate data blocks and stores only one copy of the data in a way that can be accessed through multiple physical addresses. We find that typical benchmarks exhibit significant value redundancy, far beyond the zero-content blocks one would expect in any program. Our deduplicated cache effectively increases capacity by an average of 112% compared to an 8MB last-level cache while reducing the physical area by 12.2%, yielding an average performance improvement of 15.2%.
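The core idea, one stored copy of a data block reachable through several physical addresses, can be illustrated with a small software model. The sketch below is not the hardware design evaluated in the paper. The class names `DedupCache` and `DataEntry`, the reference-counting scheme, and the use of the block contents as the lookup key are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataEntry:
    block: bytes   # the single stored copy of the data
    refcount: int  # number of addresses currently mapped to this copy


class DedupCache:
    """Toy model (hypothetical): a tag map points addresses at shared data."""

    def __init__(self):
        self.tags = {}  # physical address -> content key
        self.data = {}  # content key -> DataEntry

    def write(self, addr, block):
        # Drop any previous mapping for this address before installing a new one.
        self._release(addr)
        # Assumption: the block contents themselves serve as the key, so two
        # different blocks can never be mistaken for duplicates.
        key = bytes(block)
        entry = self.data.get(key)
        if entry is None:
            self.data[key] = DataEntry(key, 1)  # first copy of this value
        else:
            entry.refcount += 1                 # duplicate: share the copy
        self.tags[addr] = key

    def read(self, addr):
        key = self.tags.get(addr)
        return None if key is None else self.data[key].block

    def evict(self, addr):
        self._release(addr)

    def _release(self, addr):
        key = self.tags.pop(addr, None)
        if key is None:
            return
        entry = self.data[key]
        entry.refcount -= 1
        if entry.refcount == 0:
            # The last address using this value is gone; free the data copy.
            del self.data[key]

    def cached_addresses(self):
        return len(self.tags)

    def stored_blocks(self):
        return len(self.data)
```

In this model, the gap between `cached_addresses()` and `stored_blocks()` plays the role of the effective capacity gained from deduplication: every address beyond the first that maps to an existing value costs a tag but no additional data storage. A hardware design would need a bounded, fast way to find candidate duplicates, which this sketch does not attempt to model.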
Index Terms
- Last-level cache deduplication