Published in: The Journal of Supercomputing 3/2015

01.03.2015

DASC-DIR: a low-overhead coherence directory for many-core processors

Authors: Alberto Ros, Manuel E. Acacio


Abstract

Current trends point toward future many-core processors being implemented using the hardware-managed, implicitly addressed, coherent caches memory model. With this memory model, all on-chip storage is used for private and shared caches that are kept coherent by hardware. Communication between cores is performed by writing to and reading from shared memory, and a scalable point-to-point interconnection network is in charge of transmitting messages. Cache coherence in this context is guaranteed by means of a directory-based protocol. Unfortunately, it has been previously shown that the directory structure required to keep track of sharers can restrict the scalability of these designs due to its excessive area or energy requirements or, for a compressed directory, the increased coherence traffic that it can cause in some cases. On the other hand, in many-core architectures, memory blocks are commonly assigned to the banks of a NUCA shared cache by following a physical mapping. This mapping assigns blocks to cache banks in a round-robin fashion, thus neglecting the distance between the cores that most frequently access each block and the corresponding NUCA bank for the block. This issue impacts both cache access latency and the amount of on-chip network traffic generated, and it causes some area- and energy-efficient compressed directories to significantly increase the number of messages per coherence event, which ultimately degrades performance. In this work we propose an efficient and low-overhead coherence directory built around two main ingredients. The first is the use of the distance-aware round-robin mapping policy, an OS-managed policy that tries to map the pages accessed by a core to its closest (local) bank while imposing an upper bound on the deviation of the distribution of memory pages among cache banks, which lessens the number of off-chip accesses. The second is the use of a highly compressed directory structure that takes advantage of this mapping policy to represent sharers very compactly without increasing coherence network traffic. Simulation results for a 32-core architecture show that, compared to a full-map directory using the typical round-robin physical mapping policy, our proposal drastically reduces the size of the directory structure (and thus its area and energy requirements); at the same time, it does not increase coherence network traffic and achieves 6 % average savings in execution time.
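The mapping idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the 2D-mesh layout, the Manhattan distance metric, the first-touch trigger, the tie-breaking rule, and all names (`DistanceAwareMapper`, `max_deviation`, `map_page`) are assumptions made for illustration. The only properties taken from the abstract are that a page is placed in the bank closest to the accessing core when possible, and that the imbalance in page counts across banks stays bounded.

```python
class DistanceAwareMapper:
    """Toy model of a distance-aware round-robin page mapper (illustrative only)."""

    def __init__(self, mesh_width, mesh_height, max_deviation):
        if max_deviation < 1:
            raise ValueError("max_deviation must be at least 1")
        self.mesh_width = mesh_width
        # Hypothetical bound: no bank may hold more than max_deviation pages
        # above the least-loaded bank.
        self.max_deviation = max_deviation
        self.pages_per_bank = [0] * (mesh_width * mesh_height)
        self.page_to_bank = {}

    def _distance(self, tile_a, tile_b):
        # Manhattan distance between two tiles of the assumed 2D mesh.
        ax, ay = tile_a % self.mesh_width, tile_a // self.mesh_width
        bx, by = tile_b % self.mesh_width, tile_b // self.mesh_width
        return abs(ax - bx) + abs(ay - by)

    def map_page(self, page, core):
        # A page keeps the bank chosen on its first access.
        if page in self.page_to_bank:
            return self.page_to_bank[page]
        least = min(self.pages_per_bank)
        # Banks that can take one more page without exceeding the bound.
        candidates = [
            bank
            for bank, count in enumerate(self.pages_per_bank)
            if count - least < self.max_deviation
        ]
        # Pick the closest eligible bank; ties go to the lowest bank index.
        bank = min(candidates, key=lambda b: (self._distance(core, b), b))
        self.pages_per_bank[bank] += 1
        self.page_to_bank[page] = bank
        return bank


mapper = DistanceAwareMapper(mesh_width=4, mesh_height=2, max_deviation=2)
# Core 0 touches three new pages: two fit in its local bank, the third spills
# over to the nearest bank that is still under the bound.
print([mapper.map_page(page, core=0) for page in range(3)])  # [0, 0, 1]
```

In this sketch a smaller `max_deviation` spreads pages more evenly across banks, while a larger one keeps more pages in the local bank of the core that first touches them.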


Footnotes
1
To give a clearer picture of the impact that the compressed sharing code has on the results, we concentrate solely on the number of unnecessary coherence messages, leaving implementation-dependent details out of the comparison.
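As a rough illustration of what "unnecessary coherence messages" means for a compressed sharing code, the sketch below uses a coarse bit-vector, a generic compressed code in which one bit covers a group of cores. It is not the sharing code proposed in the article; the function names and the `group_size` parameter are hypothetical. A message counts as unnecessary when it is sent to a core that the code marks as a possible sharer but that does not actually hold the block.

```python
def coarse_vector_encode(sharers, num_cores, group_size):
    """Return one bit per group of cores; a bit is set if any core in the group shares the block."""
    if num_cores % group_size != 0:
        raise ValueError("num_cores must be a multiple of group_size")
    bits = [False] * (num_cores // group_size)
    for core in sharers:
        bits[core // group_size] = True
    return bits


def coarse_vector_decode(bits, group_size):
    """Return every core covered by a set bit (a superset of the real sharers)."""
    return {
        group * group_size + offset
        for group, bit in enumerate(bits)
        if bit
        for offset in range(group_size)
    }


def unnecessary_messages(sharers, num_cores, group_size):
    """Count messages sent to cores that do not actually hold the block."""
    bits = coarse_vector_encode(sharers, num_cores, group_size)
    targets = coarse_vector_decode(bits, group_size)
    return len(targets - set(sharers))


# Two real sharers on a 32-core chip.
print(unnecessary_messages({0, 5}, num_cores=32, group_size=4))  # 6
print(unnecessary_messages({0, 5}, num_cores=32, group_size=1))  # 0 (full map)
```

With `group_size=1` the code degenerates to a full-map bit-vector (32 bits per entry for 32 cores) and sends no unnecessary messages; with `group_size=4` the entry shrinks to 8 bits, at the cost of six extra messages in this example.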
 
Metadata
Title
DASC-DIR: a low-overhead coherence directory for many-core processors
Authors
Alberto Ros
Manuel E. Acacio
Publication date
01.03.2015
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 3/2015
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1325-4
