
Jenga: Software-Defined Cache Hierarchies

Published: 24 June 2017

Abstract

Caches are traditionally organized as a rigid hierarchy, with multiple levels of progressively larger and slower memories. Hierarchy allows a simple, fixed design to benefit a wide range of applications, since working sets settle at the smallest (i.e., fastest and most energy-efficient) level they fit in. However, rigid hierarchies also add overheads, because each level adds latency and energy even when it does not fit the working set. These overheads are expensive on emerging systems with heterogeneous memories, where the differences in latency and energy across levels are small. Significant gains are possible by specializing the hierarchy to applications.
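The overhead argument above is easy to make concrete with a back-of-the-envelope average-memory-access-time (AMAT) calculation. The sketch below uses illustrative latencies and hit rates (not figures from the paper): when a working set does not fit in an intermediate level, every miss still pays that level's lookup latency before going on to the next level.

```python
# Hypothetical per-level latencies in cycles, chosen for illustration only.
L1_LAT, L2_LAT, L3_LAT = 4, 12, 40
H1 = 0.6  # assumed L1 hit rate for some working set

def amat_with_l2(h2):
    """AMAT for a rigid L1 -> L2 -> L3 hierarchy with L2 hit rate h2."""
    return L1_LAT + (1 - H1) * (L2_LAT + (1 - h2) * L3_LAT)

def amat_without_l2():
    """Same system, but L1 misses go straight to L3 (L2 removed)."""
    return L1_LAT + (1 - H1) * L3_LAT

# If the working set does not fit in L2 (hit rate ~0), the L2 lookup is
# pure overhead: each L1 miss pays L2 latency and still accesses L3.
print(amat_with_l2(0.0))   # 4 + 0.4 * (12 + 40)
print(amat_without_l2())   # 4 + 0.4 * 40
```

With these made-up numbers, the unneeded L2 inflates AMAT from 20.0 to 24.8 cycles; the same reasoning applies to energy per access.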

We propose Jenga, a reconfigurable cache hierarchy that dynamically and transparently specializes itself to applications. Jenga builds virtual cache hierarchies out of heterogeneous, distributed cache banks using simple hardware mechanisms and an OS runtime. In contrast to prior techniques that trade energy and bandwidth for performance (e.g., dynamic bypassing or prefetching), Jenga eliminates accesses to unwanted cache levels. Jenga thus improves both performance and energy efficiency. On a 36-core chip with a 1 GB DRAM cache, Jenga improves energy-delay product over a combination of state-of-the-art techniques by 23% on average and by up to 85%.
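The core idea, building a virtual hierarchy out of distributed banks sized to an application's miss curve, can be sketched in a few lines. This is a toy model under assumed numbers, not the paper's actual allocation algorithm: banks are characterized by (capacity, latency), the app by a simple miss curve, and a greedy pass grows the virtual cache from the closest banks while doing so lowers average access latency.

```python
MEM_LAT = 120  # assumed latency to main memory, in cycles

# Available banks as (capacity in MB, access latency in cycles): two small
# nearby SRAM banks, a farther SRAM bank, and a large stacked-DRAM bank.
banks = [(1, 10), (1, 12), (8, 30), (512, 50)]

def miss_rate(capacity_mb):
    """Toy miss curve: this app's working set is roughly 8 MB."""
    return 0.05 if capacity_mb >= 8 else 0.60

def avg_latency(alloc):
    """Average access latency of a virtual cache built from `alloc` banks.
    Accesses spread across the allocated banks; misses go to memory."""
    cap = sum(c for c, _ in alloc)
    lat = sum(l for _, l in alloc) / len(alloc)  # mean bank distance
    return lat + miss_rate(cap) * MEM_LAT

# Greedily grow the virtual cache, closest banks first, while it pays off.
alloc, best = [], float("inf")
for bank in sorted(banks, key=lambda b: b[1]):
    cand = alloc + [bank]
    if avg_latency(cand) < best:
        alloc, best = cand, avg_latency(cand)

print(alloc, best)
```

On these numbers the greedy pass picks the (1 MB, 10-cycle) and (8 MB, 30-cycle) banks and stops: the huge DRAM bank would add latency without reducing misses, which mirrors how eliminating unwanted levels saves both time and energy.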



• Published in

  ACM SIGARCH Computer Architecture News, Volume 45, Issue 2 (ISCA'17), May 2017, 715 pages
  ISSN: 0163-5964
  DOI: 10.1145/3140659

  ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture, June 2017, 736 pages
  ISBN: 9781450348928
  DOI: 10.1145/3079856

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • tutorial
      • Research
      • Refereed limited
