Jenga: Software-Defined Cache Hierarchies
ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture

Abstract
Caches are traditionally organized as a rigid hierarchy, with multiple levels of progressively larger and slower memories. Hierarchy allows a simple, fixed design to benefit a wide range of applications, since working sets settle at the smallest (i.e., fastest and most energy-efficient) level they fit in. However, rigid hierarchies also add overheads, because each level adds latency and energy even when it does not fit the working set. These overheads are expensive on emerging systems with heterogeneous memories, where the differences in latency and energy across levels are small. Significant gains are possible by specializing the hierarchy to applications.
We propose Jenga, a reconfigurable cache hierarchy that dynamically and transparently specializes itself to applications. Jenga builds virtual cache hierarchies out of heterogeneous, distributed cache banks using simple hardware mechanisms and an OS runtime. In contrast to prior techniques that trade energy and bandwidth for performance (e.g., dynamic bypassing or prefetching), Jenga eliminates accesses to unwanted cache levels. Jenga thus improves both performance and energy efficiency. On a 36-core chip with a 1 GB DRAM cache, Jenga improves energy-delay product over a combination of state-of-the-art techniques by 23% on average and by up to 85%.
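The core idea above — that each rigid level adds latency and energy even when it will miss, so the hierarchy should be composed per-application — can be illustrated with a toy model. This is not Jenga's actual placement algorithm or its real bank parameters; the level sizes, costs, and the crude fits-or-misses hit-rate model below are invented purely for illustration.

```python
# Toy model (illustrative only, not Jenga's algorithm): choose the virtual
# hierarchy that minimizes energy-delay product (EDP) for a given working set.
from dataclasses import dataclass

@dataclass
class Level:
    name: str
    size_mb: float
    latency: float   # cycles per access (assumed values)
    energy: float    # nJ per access (assumed values)

SRAM = Level("SRAM banks", 16, 20, 0.5)
DRAM = Level("DRAM cache", 1024, 120, 4.0)
MEM  = Level("main memory", float("inf"), 300, 20.0)

def hit_rate(level, working_set_mb):
    # Crude assumption: a level captures the working set only if it fits.
    return 1.0 if working_set_mb <= level.size_mb else 0.1

def avg_cost(hierarchy, working_set_mb):
    """Average latency and energy per access for a list of levels ending in
    main memory. Every level an access traverses adds its full cost."""
    lat = en = 0.0
    reach = 1.0  # fraction of accesses that reach this level
    for lvl in hierarchy:
        lat += reach * lvl.latency
        en += reach * lvl.energy
        reach *= 1.0 - hit_rate(lvl, working_set_mb)
    return lat, en

def best_hierarchy(working_set_mb):
    candidates = [[SRAM, MEM], [DRAM, MEM], [SRAM, DRAM, MEM]]
    def edp(h):
        lat, en = avg_cost(h, working_set_mb)
        return lat * en
    return min(candidates, key=edp)

# A small working set wants only the fast SRAM banks; a large one does better
# by skipping the SRAM level entirely (it would miss there anyway) and going
# straight to the DRAM cache -- the "eliminated accesses" the abstract refers to.
for ws in (8, 512):
    h = best_hierarchy(ws)
    print(f"{ws} MB -> " + " + ".join(l.name for l in h))
```

In this model the 512 MB working set selects a DRAM-cache-only virtual hierarchy, improving both latency and energy over a rigid SRAM-then-DRAM hierarchy, which mirrors the paper's argument that specialization beats a fixed design when inter-level cost differences are small.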