research-article

Last-level cache deduplication

Authors:
Yingying Tian, Texas A&M University, College Station, TX, USA
Samira M. Khan, Carnegie Mellon University, Intel Labs, Pittsburgh, PA, USA
Daniel A. Jiménez, Texas A&M University, College Station, TX, USA
Gabriel H. Loh, Advanced Micro Devices, Inc., Sunnyvale, CA, USA

ICS '14: Proceedings of the 28th ACM international conference on Supercomputing, June 2014, Pages 53–62, https://doi.org/10.1145/2597652.2597655

Published: 10 June 2014

ABSTRACT

Caches are essential to the performance of modern microprocessors. Much recent work on last-level caches has focused on exploiting reference locality to improve efficiency. However, value redundancy is another source of potential improvement. We find that many blocks in the working set of typical benchmark programs have the same values. We propose cache deduplication that effectively increases last-level cache capacity. Rather than exploit specific value redundancy with compression, as in previous work, our scheme detects duplicate data blocks and stores only one copy of the data in a way that can be accessed through multiple physical addresses. We find that typical benchmarks exhibit significant value redundancy, far beyond the zero-content blocks one would expect in any program. Our deduplicated cache effectively increases capacity by an average of 112% compared to an 8MB last-level cache while reducing the physical area by 12.2%, yielding an average performance improvement of 15.2%.
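
The core idea in the abstract, storing one copy of a duplicated block and reaching it through several physical addresses, can be illustrated with a small software model. The sketch below is not the hardware design from the paper: the class and method names (`DedupCache`, `DataEntry`, `insert`, `read`, `evict`) are hypothetical, block contents are used directly as the lookup key (real hardware would more plausibly use a hash followed by a full comparison), and replacement policy, capacity limits, and timing are ignored.

```python
from dataclasses import dataclass


@dataclass
class DataEntry:
    """One stored data block plus the number of address tags pointing at it."""
    data: bytes
    refcount: int = 0


class DedupCache:
    """Toy model (hypothetical): many address tags may share one data block."""

    def __init__(self) -> None:
        self.tags = {}    # physical block address -> content key
        self.blocks = {}  # content key -> DataEntry

    def insert(self, addr: int, data: bytes) -> None:
        # Overwriting an address first releases its old data block.
        if addr in self.tags:
            self.evict(addr)
        # Assumption: the full block content serves as the key.
        key = data
        entry = self.blocks.get(key)
        if entry is None:
            # First time this content is seen: store a single copy.
            entry = self.blocks[key] = DataEntry(data)
        entry.refcount += 1
        self.tags[addr] = key

    def read(self, addr: int):
        # Returns the block contents, or None on a miss.
        key = self.tags.get(addr)
        return None if key is None else self.blocks[key].data

    def evict(self, addr: int) -> None:
        # Drop the tag; free the data block only when no tag references it.
        key = self.tags.pop(addr)
        entry = self.blocks[key]
        entry.refcount -= 1
        if entry.refcount == 0:
            del self.blocks[key]

    def stored_blocks(self) -> int:
        return len(self.blocks)


if __name__ == "__main__":
    cache = DedupCache()
    zero_block = bytes(64)  # a 64-byte all-zero block (block size is an assumption)
    cache.insert(0x1000, zero_block)
    cache.insert(0x2000, zero_block)      # duplicate content, different address
    cache.insert(0x3000, b"\x01" * 64)    # distinct content
    print(len(cache.tags), "tags share", cache.stored_blocks(), "data blocks")
```

In this model three addresses occupy only two data blocks, which is the sense in which deduplication raises effective capacity: the number of reachable addresses can exceed the number of stored blocks. The reference count is what makes eviction safe, since a shared block is freed only after its last tag is removed.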

Index Terms

Hardware > Integrated circuits > Semiconductor memory > Dynamic memory

Published in
ICS '14: Proceedings of the 28th ACM international conference on Supercomputing
June 2014
378 pages
ISBN:9781450326421
DOI:10.1145/2597652
General Chairs:
Arndt Bode, Technische Universität München and Leibniz Rechenzentrum, Germany
Michael Gerndt, Technische Universität München, Germany
Program Chairs:
Per Stenström, Chalmers University of Technology, Sweden
Lawrence Rauchwerger, Texas A&M University, USA
Barton Miller, University of Wisconsin, USA
Martin Schulz, Lawrence Livermore National Laboratory, USA
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cache deduplication
last-level caches
Acceptance Rates
ICS '14 Paper Acceptance Rate: 34 of 160 submissions, 21%
Overall Acceptance Rate: 584 of 2,055 submissions, 28%