ABSTRACT
As the number of cores increases on chip multiprocessors, coherence is fast becoming a central issue for multi-core performance. This is exacerbated by the fact that interconnection speeds are not scaling well with technology. This paper describes mechanisms to accelerate coherence for a multi-core architecture that has multiple private L2 caches and a scalable point-to-point interconnect between cores. These techniques exploit the differences in geometry between chip multiprocessors and traditional multiprocessor architectures.
Directory-based protocols have been proposed as a scalable alternative to snoop-based protocols. In this paper, we discuss implementations of coherence for CMPs, and we propose and evaluate a novel directory-based coherence scheme to improve the performance of parallel programs on such processors. Proximity-aware coherence accelerates read and write misses by initiating cache-to-cache transfers from the spatially closest sharer. This has the dual benefit of eliminating unnecessary accesses to off-chip memory and minimizing the distance that communicated data travels across the network. The proposed schemes result in speedups of up to 74.9% for our workloads.
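The core idea of choosing the spatially closest sharer can be illustrated with a minimal sketch. The abstract does not specify the topology, the distance metric, or the directory format, so everything below is an assumption for illustration only: cores are laid out row by row on a hypothetical 4x4 mesh, distance is the Manhattan hop count, ties go to the lowest core id, and the names `MESH_WIDTH`, `tile_coords`, `hop_distance`, and `closest_sharer` are invented here rather than taken from the paper.

```python
from typing import Iterable, Optional, Tuple

# Assumption: 16 cores on a 4x4 mesh, numbered row by row.
# The paper only states "a scalable point-to-point interconnect".
MESH_WIDTH = 4


def tile_coords(core_id: int, width: int = MESH_WIDTH) -> Tuple[int, int]:
    """Map a core id to hypothetical (x, y) mesh coordinates."""
    return core_id % width, core_id // width


def hop_distance(a: int, b: int, width: int = MESH_WIDTH) -> int:
    """Manhattan distance between two cores (assumed proximity metric)."""
    ax, ay = tile_coords(a, width)
    bx, by = tile_coords(b, width)
    return abs(ax - bx) + abs(ay - by)


def closest_sharer(
    requester: int, sharers: Iterable[int], width: int = MESH_WIDTH
) -> Optional[int]:
    """Pick the sharer nearest to the requester, or None if there is none.

    A directory would forward the miss to the returned core so that it
    supplies the line via a cache-to-cache transfer; None stands for
    falling back to off-chip memory.
    """
    candidates = [s for s in sharers if s != requester]
    if not candidates:
        return None
    # Break distance ties by lowest core id (arbitrary illustrative policy).
    return min(candidates, key=lambda s: (hop_distance(requester, s, width), s))


if __name__ == "__main__":
    # Core 5 misses on a line currently cached by cores 6 and 15.
    print(closest_sharer(5, [15, 6]))
```

In this sketch, core 6 is one hop from core 5 while core 15 is four hops away, so the request would be forwarded to core 6. A real design would also have to handle sharer state, in-flight invalidations, and directory storage limits, none of which are modeled here.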
Index Terms
- Proximity-aware directory-based coherence for multi-core processor architectures