ABSTRACT
This paper presents Distributed Cooperative Caching, a scalable and energy-efficient scheme for managing chip multiprocessor (CMP) cache resources. The proposed configuration is based on the Cooperative Caching framework [3] but is intended for large-scale CMPs. Both the centralized and the distributed configurations combine the benefits of private and shared caches. In our proposal, the Coherence Engine has been redesigned so that it can be partitioned, which eliminates the size constraints imposed by duplicating all tags. At the same time, a global replacement mechanism has been added to improve the usage of cache space. Our framework uses several Distributed Coherence Engines spread across all the nodes to improve scalability. This distribution balances network traffic more evenly over the entire chip, avoiding bottlenecks and increasing performance for a 32-core CMP by 21% over a traditional shared-memory configuration and by 57% over the Cooperative Caching scheme.
Furthermore, we have reduced the power consumption of the entire system by using a different tag allocation method and by reducing the number of tags compared on each request. For a 32-core CMP, the Distributed Cooperative Caching framework improves the power/performance relation (MIPS^3/W) by an average of 3.66x over a traditional shared-memory configuration and by 4.30x over Cooperative Caching.
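As a rough illustration of the MIPS^3/W metric quoted above, the following minimal sketch computes the figure of merit for two configurations and reports their ratio. The function names and all input numbers are hypothetical and are not taken from the paper; the sketch only shows how cubing the throughput weights performance more heavily than power.

```python
def mips3_per_watt(mips: float, watts: float) -> float:
    """Power/performance figure of merit: MIPS cubed divided by watts."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return mips ** 3 / watts


def relative_improvement(new: float, baseline: float) -> float:
    """Ratio of two figure-of-merit values (2.0 means twice as good)."""
    return new / baseline


# Hypothetical inputs chosen for easy arithmetic, not measured results.
baseline = mips3_per_watt(mips=1000.0, watts=50.0)
candidate = mips3_per_watt(mips=1200.0, watts=40.0)
ratio = relative_improvement(candidate, baseline)

print(f"{ratio:.2f}x")  # prints "2.16x"
```

With these made-up inputs, a 20% throughput gain alone contributes a factor of 1.2^3 = 1.728, and the drop from 50 W to 40 W contributes a further 1.25, giving 2.16 overall.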
References
- M. Acacio, J. Gonzalez, J. Garcia, and J. Duato. A new scalable directory architecture for large-scale multiprocessors. In HPCA '01: 7th International Symposium on High-Performance Computer Architecture, pages 97--106, January 2001.
- B. Beckmann, M. Marty, and D. Wood. ASR: Adaptive selective replication for CMP caches. In MICRO-39: 39th Annual IEEE/ACM International Symposium on Microarchitecture, December 2006.
- J. Chang and G. S. Sohi. Cooperative caching for chip multiprocessors. In ISCA '06: 33rd Annual International Symposium on Computer Architecture, pages 264--276, June 2006.
- J. Chang and G. S. Sohi. Cooperative cache partitioning for chip multiprocessors. In ICS '07: 21st Annual International Conference on Supercomputing, pages 242--252, June 2007.
- Z. Chishti, M. Powell, and T. Vijaykumar. Distance associativity for high-performance energy-efficient non-uniform cache architectures. In MICRO-36: 36th Annual IEEE/ACM International Symposium on Microarchitecture, pages 55--66, December 2003.
- J. Davis, J. Laudon, and K. Olukotun. Maximizing CMP throughput with mediocre cores. In PACT '05: 14th International Conference on Parallel Architectures and Compilation Techniques, pages 51--62, September 2005.
- J. Dorsey, S. Searles, M. Ciraula, S. Johnson, N. Bujanos, D. Wu, M. Braganza, S. Meyers, E. Fang, and R. Kumar. An integrated quad-core Opteron processor. In ISSCC '07: IEEE International Solid-State Circuits Conference, pages 102--103, February 2007.
- P. Dubey. A platform 2015 workload model: Recognition, mining and synthesis moves computers to the era of tera. Intel White Paper, Intel Corporation, 2005.
- H. Dybdahl and P. Stenstrom. An adaptive shared/private NUCA cache partitioning scheme for chip multiprocessors. In HPCA '07: 13th International Symposium on High Performance Computer Architecture, pages 2--12, February 2007.
- J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. Keckler. A NUCA substrate for flexible CMP cache sharing. In ICS '05: 19th Annual International Conference on Supercomputing, pages 31--40, June 2005.
- J. S. Kim, M. B. Taylor, J. Miller, and D. Wentzlaff. Energy characterization of a tiled architecture processor with on-chip networks. In ISLPED '03: International Symposium on Low Power Electronics and Design, pages 424--427, August 2003.
- D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The directory-based cache coherence protocol for the DASH multiprocessor. In ISCA '90: 17th Annual International Symposium on Computer Architecture, pages 148--159, May 1990.
- P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50--58, 2002.
- M. Martin, M. Hill, and D. Wood. Token coherence: Decoupling performance and correctness. In ISCA '03: 30th Annual International Symposium on Computer Architecture, pages 182--193, June 2003.
- M. Martin, D. J. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, and D. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput. Archit. News, 33(4):92--99, 2005.
- M. Monchiero, R. Canal, and A. Gonzalez. Power/performance/thermal design space exploration for multicore architectures. IEEE Transactions on Parallel and Distributed Systems, 19(5):666--681, May 2008.
- R. Mullins. Minimising dynamic power consumption in on-chip networks. International Symposium on System-on-Chip, pages 1--4, November 2006.
- M. Qureshi and Y. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO-39: 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 423--432, December 2006.
- N. Sakran, M. Yuffe, M. Mehalel, J. Doweck, E. Knoll, and A. Kovacs. The implementation of the 65nm dual-core 64b Merom processor. In ISSCC '07: IEEE International Solid-State Circuits Conference, pages 106--590, February 2007.
- K. Strauss, X. Shen, and J. Torrellas. Uncorq: Unconstrained snoop request delivery in embedded-ring multiprocessors. In MICRO-40: 40th Annual IEEE/ACM International Symposium on Microarchitecture, December 2007.
- D. Tarjan, S. Thoziyoor, and N. Jouppi. CACTI 4.0. Technical report, HP Labs Palo Alto, June 2006.
- S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar. An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS. In ISSCC '07: IEEE International Solid-State Circuits Conference, February 2007.
- H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik. Orion: A power-performance simulator for interconnection networks. In MICRO-35: 35th Annual IEEE/ACM International Symposium on Microarchitecture, pages 294--305, November 2002.
- M. Zhang and K. Asanovic. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In ISCA '05: 32nd Annual International Symposium on Computer Architecture, pages 336--345, June 2005.
Index Terms
- Distributed cooperative caching