Abstract
Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors' caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing, in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID.
Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.
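The controlled-replication idea above can be illustrated with a toy software model: per-core private tag arrays point into one shared data array, a first remote reference reads the existing on-chip copy without replicating, and only a re-reference triggers a replica near the requestor. This is a minimal sketch under assumed semantics, not the paper's hardware design; the class and method names (`SharedDataArray`, `Core.read`, `touched`) are hypothetical, and finite capacity, coherence state, and distance-based placement are deliberately omitted.

```python
# Toy sketch of controlled replication in a hybrid private-tag / shared-data
# cache (illustrative only; names and policy details are assumptions).

class SharedDataArray:
    """One shared pool of data frames, indexed by frame id."""
    def __init__(self):
        self.frames = {}        # frame_id -> data
        self.next_frame = 0

    def allocate(self, data):
        fid = self.next_frame
        self.next_frame += 1
        self.frames[fid] = data
        return fid

class Core:
    """A core with a private tag array mapping addresses to shared frames."""
    def __init__(self, name, shared):
        self.name = name
        self.shared = shared
        self.tags = {}          # addr -> frame_id (private tag array)
        self.touched = set()    # addrs read once via another core's frame

    def read(self, addr, peers, memory):
        if addr in self.tags:                       # private tag hit
            return self.shared.frames[self.tags[addr]]
        # Look for an existing on-chip copy via a peer's tag array.
        for peer in peers:
            if addr in peer.tags:
                fid = peer.tags[addr]
                if addr in self.touched:
                    # Re-reference: now replicate close to this core.
                    new_fid = self.shared.allocate(self.shared.frames[fid])
                    self.tags[addr] = new_fid
                    return self.shared.frames[new_fid]
                # First reference: read the existing copy, make no replica.
                self.touched.add(addr)
                return self.shared.frames[fid]
        # True miss: fetch from memory into a new frame.
        fid = self.shared.allocate(memory[addr])
        self.tags[addr] = fid
        return self.shared.frames[fid]
```

In this sketch, a block read once by a second core costs no extra frame; only demonstrated reuse spends capacity on a replica, which is the capacity-versus-latency tradeoff the abstract describes.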
Index Terms
- Optimizing Replication, Communication, and Capacity Allocation in CMPs
Published in ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture.