DOI: 10.1145/2597652.2597655
research-article

Last-level cache deduplication

Published: 10 June 2014

ABSTRACT

Caches are essential to the performance of modern microprocessors. Much recent work on last-level caches has focused on exploiting reference locality to improve efficiency. However, value redundancy is another source of potential improvement. We find that many blocks in the working set of typical benchmark programs have the same values. We propose cache deduplication, which effectively increases last-level cache capacity. Rather than exploit specific value redundancy with compression, as in previous work, our scheme detects duplicate data blocks and stores only one copy of the data in a way that can be accessed through multiple physical addresses. We find that typical benchmarks exhibit significant value redundancy, far beyond the zero-content blocks one would expect in any program. Our deduplicated cache effectively increases capacity by an average of 112% compared to an 8MB last-level cache while reducing the physical area by 12.2%, yielding an average performance improvement of 15.2%.
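
The core idea above is that many physical addresses can share one stored copy of identical data. The Python sketch below is a hypothetical toy model, not the paper's design: it uses the exact block contents as a dictionary key in place of whatever hash-based duplicate detection the hardware uses, assumes 64-byte blocks, and omits replacement policy, write-back, and tag/data array sizing. The names `DedupStore`, `BLOCK_SIZE`, and `effective_capacity_ratio` are illustrative only.

```python
from collections import Counter

BLOCK_SIZE = 64  # bytes per cache block (assumed; a common LLC line size)


class DedupStore:
    """Toy model of a deduplicated cache: many tags, one shared data copy.

    Hypothetical sketch: the tag store maps each physical block address to
    the content it holds, and the data store keeps one entry per distinct
    content with a reference count. Replacement policy is omitted.
    """

    def __init__(self):
        self.tags = {}         # physical block address -> block content (bytes)
        self.data = Counter()  # distinct block content -> number of tags sharing it

    def write(self, addr: int, block: bytes) -> None:
        if len(block) != BLOCK_SIZE:
            raise ValueError("block must be exactly BLOCK_SIZE bytes")
        old = self.tags.get(addr)
        if old is not None:
            self._release(old)  # the address no longer references its old data
        self.tags[addr] = block
        self.data[block] += 1   # reuses the existing entry if the content is already stored

    def read(self, addr: int) -> bytes:
        return self.tags[addr]

    def evict(self, addr: int) -> None:
        self._release(self.tags.pop(addr))

    def _release(self, block: bytes) -> None:
        self.data[block] -= 1
        if self.data[block] == 0:
            del self.data[block]  # last reference gone: free the data entry

    def effective_capacity_ratio(self) -> float:
        """Addresses served per stored data entry (1.0 means no duplication)."""
        return len(self.tags) / len(self.data) if self.data else 1.0


# Usage: three of four addresses hold an all-zero block.
store = DedupStore()
zero = bytes(BLOCK_SIZE)
ones = b"\x01" * BLOCK_SIZE
store.write(0x1000, zero)
store.write(0x2000, zero)
store.write(0x3000, ones)
store.write(0x4000, zero)
print(len(store.tags), len(store.data), store.effective_capacity_ratio())  # 4 2 2.0
```

In this toy metric, a ratio of 2.0 means the same data storage serves twice as many addresses; the paper's measured capacity gain comes from its own hardware evaluation, which this sketch does not reproduce.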

    • Published in

      ACM Conferences
      ICS '14: Proceedings of the 28th ACM international conference on Supercomputing
      June 2014
      378 pages
      ISBN:9781450326421
      DOI:10.1145/2597652

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      ICS '14 paper acceptance rate: 34 of 160 submissions (21%)
      Overall acceptance rate: 584 of 2,055 submissions (28%)
