Abstract
Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors' caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing, in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID.
Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.
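The controlled-replication idea above can be illustrated with a toy software model: per-core private tag arrays point into one shared data array, a first remote reference reads the existing on-chip copy without replicating, and only a re-reference triggers a replica near the requestor. This is a minimal sketch under assumed semantics, not the paper's hardware design; the class and method names (`SharedDataArray`, `Core.read`, `touched`) are hypothetical, and finite capacity, coherence state, and distance-based placement are deliberately omitted.

```python
# Toy sketch of controlled replication in a hybrid private-tag / shared-data
# cache (illustrative only; names and policy details are assumptions).

class SharedDataArray:
    """One shared pool of data frames, indexed by frame id."""
    def __init__(self):
        self.frames = {}        # frame_id -> data
        self.next_frame = 0

    def allocate(self, data):
        fid = self.next_frame
        self.next_frame += 1
        self.frames[fid] = data
        return fid

class Core:
    """A core with a private tag array mapping addresses to shared frames."""
    def __init__(self, name, shared):
        self.name = name
        self.shared = shared
        self.tags = {}          # addr -> frame_id (private tag array)
        self.touched = set()    # addrs read once via another core's frame

    def read(self, addr, peers, memory):
        if addr in self.tags:                       # private tag hit
            return self.shared.frames[self.tags[addr]]
        # Look for an existing on-chip copy via a peer's tag array.
        for peer in peers:
            if addr in peer.tags:
                fid = peer.tags[addr]
                if addr in self.touched:
                    # Re-reference: now replicate close to this core.
                    new_fid = self.shared.allocate(self.shared.frames[fid])
                    self.tags[addr] = new_fid
                    return self.shared.frames[new_fid]
                # First reference: read the existing copy, make no replica.
                self.touched.add(addr)
                return self.shared.frames[fid]
        # True miss: fetch from memory into a new frame.
        fid = self.shared.allocate(memory[addr])
        self.tags[addr] = fid
        return self.shared.frames[fid]
```

In this sketch, a block read once by a second core costs no extra frame; only demonstrated reuse spends capacity on a replica, which is the capacity-versus-latency tradeoff the abstract describes.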
Index Terms
- Optimizing Replication, Communication, and Capacity Allocation in CMPs
Published in ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture.