Jenga: Software-Defined Cache Hierarchies
ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture

Abstract
Caches are traditionally organized as a rigid hierarchy, with multiple levels of progressively larger and slower memories. Hierarchy allows a simple, fixed design to benefit a wide range of applications, since working sets settle at the smallest (i.e., fastest and most energy-efficient) level they fit in. However, rigid hierarchies also add overheads, because each level adds latency and energy even when it does not fit the working set. These overheads are expensive on emerging systems with heterogeneous memories, where the differences in latency and energy across levels are small. Significant gains are possible by specializing the hierarchy to applications.
We propose Jenga, a reconfigurable cache hierarchy that dynamically and transparently specializes itself to applications. Jenga builds virtual cache hierarchies out of heterogeneous, distributed cache banks using simple hardware mechanisms and an OS runtime. In contrast to prior techniques that trade energy and bandwidth for performance (e.g., dynamic bypassing or prefetching), Jenga eliminates accesses to unwanted cache levels. Jenga thus improves both performance and energy efficiency. On a 36-core chip with a 1 GB DRAM cache, Jenga improves energy-delay product over a combination of state-of-the-art techniques by 23% on average and by up to 85%.
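The core idea above — that each rigid level adds latency and energy even when it will miss, so the hierarchy should be composed per-application — can be illustrated with a toy model. This is not Jenga's actual placement algorithm or its real bank parameters; the level sizes, costs, and the crude fits-or-misses hit-rate model below are invented purely for illustration.

```python
# Toy model (illustrative only, not Jenga's algorithm): choose the virtual
# hierarchy that minimizes energy-delay product (EDP) for a given working set.
from dataclasses import dataclass

@dataclass
class Level:
    name: str
    size_mb: float
    latency: float   # cycles per access (assumed values)
    energy: float    # nJ per access (assumed values)

SRAM = Level("SRAM banks", 16, 20, 0.5)
DRAM = Level("DRAM cache", 1024, 120, 4.0)
MEM  = Level("main memory", float("inf"), 300, 20.0)

def hit_rate(level, working_set_mb):
    # Crude assumption: a level captures the working set only if it fits.
    return 1.0 if working_set_mb <= level.size_mb else 0.1

def avg_cost(hierarchy, working_set_mb):
    """Average latency and energy per access for a list of levels ending in
    main memory. Every level an access traverses adds its full cost."""
    lat = en = 0.0
    reach = 1.0  # fraction of accesses that reach this level
    for lvl in hierarchy:
        lat += reach * lvl.latency
        en += reach * lvl.energy
        reach *= 1.0 - hit_rate(lvl, working_set_mb)
    return lat, en

def best_hierarchy(working_set_mb):
    candidates = [[SRAM, MEM], [DRAM, MEM], [SRAM, DRAM, MEM]]
    def edp(h):
        lat, en = avg_cost(h, working_set_mb)
        return lat * en
    return min(candidates, key=edp)

# A small working set wants only the fast SRAM banks; a large one does better
# by skipping the SRAM level entirely (it would miss there anyway) and going
# straight to the DRAM cache -- the "eliminated accesses" the abstract refers to.
for ws in (8, 512):
    h = best_hierarchy(ws)
    print(f"{ws} MB -> " + " + ".join(l.name for l in h))
```

In this model the 512 MB working set selects a DRAM-cache-only virtual hierarchy, improving both latency and energy over a rigid SRAM-then-DRAM hierarchy, which mirrors the paper's argument that specialization beats a fixed design when inter-level cost differences are small.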