Optimizing Replication, Communication, and Capacity Allocation in CMPs

Published: 01 May 2005

Abstract

Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors' caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID.
Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.
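
The controlled replication idea can be illustrated with a toy model of private per-core tag arrays that point into one shared data array. This is only a sketch under stated assumptions, not the paper's design: the names `HybridCache`, `TagEntry`, and `replicate_after` are hypothetical, the replicate-on-second-use threshold is an assumed policy (the abstract says only that extra copies are avoided "in some cases"), and a linear search of the other cores' tags stands in for whatever coherence mechanism locates an existing on-chip copy.

```python
from dataclasses import dataclass


@dataclass
class TagEntry:
    frame: int     # index of the shared-array data frame this tag points to
    uses: int = 0  # reads by this core since the tag was installed


class HybridCache:
    """Toy model: private per-core tag arrays over one shared data array."""

    def __init__(self, num_cores, replicate_after=2):
        # replicate_after is an assumed threshold, not taken from the abstract.
        self.replicate_after = replicate_after
        self.tags = [{} for _ in range(num_cores)]  # tags[core][addr] -> TagEntry
        self.frames = []  # frames[i] = (addr, core whose nearby region holds it)

    def _alloc(self, addr, near_core):
        """Place a data copy in the region closest to near_core."""
        self.frames.append((addr, near_core))
        return len(self.frames) - 1

    def _find_on_chip(self, addr):
        """Return the frame of any existing on-chip copy, or None."""
        for core_tags in self.tags:
            if addr in core_tags:
                return core_tags[addr].frame
        return None

    def read(self, core, addr):
        entry = self.tags[core].get(addr)
        if entry is None:
            existing = self._find_on_chip(addr)
            if existing is None:
                frame = self._alloc(addr, core)  # off-chip miss: fill near requestor
                outcome = "miss"
            else:
                frame = existing  # point at the existing copy, make no new copy
                outcome = "remote-hit"
            self.tags[core][addr] = TagEntry(frame, uses=1)
            return outcome

        entry.uses += 1
        holder = self.frames[entry.frame][1]
        if holder != core and entry.uses >= self.replicate_after:
            # Reuse observed: now pay the capacity cost of a nearby replica.
            entry.frame = self._alloc(addr, core)
            return "replicated"
        return "local-hit" if holder == core else "remote-hit"
```

In this model, a block that another core reads only once never consumes a second data frame; a replica is created only after the reader shows reuse, which is the capacity-versus-latency tradeoff the abstract describes.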

Published in

ACM SIGARCH Computer Architecture News, Volume 33, Issue 2 (ISCA 2005)
May 2005, 531 pages
ISSN: 0163-5964
DOI: 10.1145/1080695

ISCA '05: Proceedings of the 32nd annual international symposium on Computer Architecture
June 2005, 541 pages
ISBN: 076952270X

Copyright © 2005 Authors

Publisher

Association for Computing Machinery
New York, NY, United States
