Skip to main content
Top
Published in: The Journal of Supercomputing 3/2015

01-03-2015

DASC-DIR: a low-overhead coherence directory for many-core processors

Authors: Alberto Ros, Manuel E. Acacio

Published in: The Journal of Supercomputing | Issue 3/2015

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Current trends point toward future many-core processors being implemented using the hardware-managed, implicitly addressed, coherent caches memory model. With this memory model, all on-chip storage is used for private and shared caches that are kept coherent by hardware. Communication between cores is performed by writing to and reading from shared memory, and a scalable point-to-point interconnection network is in charge of transmitting messages. Cache coherence in this context is guaranteed by means of a directory-based protocol. Unfortunately, it has been previously shown that the directory structure required to keep track of sharers can restrict the scalability of these designs due its excessive area or energy requirements, or for a compressed directory, the increased coherence traffic that in some cases it could cause. On the other hand, in many-core architectures, memory blocks are commonly assigned to the banks of a NUCA shared cache by following a physical mapping. This mapping assigns blocks to cache banks in a round-robin fashion, thus neglecting the distance between the cores that more frequently access every block and the corresponding NUCA bank for the block. This issue impacts both cache access latency and the amount of on-chip network traffic generated and causes that some area- and energy-efficient compressed directories significantly increase the number of messages per coherence event, which finally translates into degraded performance. In this work we propose an efficient and low-overhead coherence directory which is built around two main ingredients: the first is the use of the distance-aware round-robin mapping policy, an OS-managed policy which tries to map the pages accessed by a core to its closest (local) bank, at the same time it introduces an upper bound on the deviation of the distribution of memory pages among cache banks, which lessens the number of off-chip accesses. The second is the utilization of a very compressed directory structure which takes advantage of this mapping policy to represent sharers in a very compact way without increasing coherence network traffic. Simulation results for a 32-core architecture demonstrate that compared to a full-map directory using the typical round-robin physical mapping policy, our proposal drastically reduces the size of the directory structure (and thus, its area and energy requirements); at the same time, it does not increase coherence network traffic and 6 % average savings in execution time are achieved.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Footnotes
1
To have a clearer understanding of the impact that the used compressed sharing code has on the results, we concentrate solely on the number of unnecessary coherence messages, leaving implementation-dependant details out of the comparison.
 
Literature
1.
go back to reference Tendler JM, Dodson JS, Fields JS, Le H, Sinharoy B (2002) POWER4 system microarchitecture. IBM J Res Develop 46(1):5–25CrossRef Tendler JM, Dodson JS, Fields JS, Le H, Sinharoy B (2002) POWER4 system microarchitecture. IBM J Res Develop 46(1):5–25CrossRef
3.
go back to reference Kurian G, Miller JE, Psota J, Eastep J, Liu J, Michel J, Kimerling LC, Agarwal A (2010) Atac: a 1,000-core cache-coherent processor with on-chip optical network. In: 19th international conference on parallel architectures and compilation techniques (PACT), pp 477–488 Kurian G, Miller JE, Psota J, Eastep J, Liu J, Michel J, Kimerling LC, Agarwal A (2010) Atac: a 1,000-core cache-coherent processor with on-chip optical network. In: 19th international conference on parallel architectures and compilation techniques (PACT), pp 477–488
4.
go back to reference Zhang M, Asanović K (2005) Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors. In: 32nd international symposium on computer architecture (ISCA), pp 336–345 Zhang M, Asanović K (2005) Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors. In: 32nd international symposium on computer architecture (ISCA), pp 336–345
5.
go back to reference Kalla R, Sinharoy B, Starke WJ, Floyd M (2010) POWER7: IBMs next-generation server processor. IEEE Micro 30(2):7–15CrossRef Kalla R, Sinharoy B, Starke WJ, Floyd M (2010) POWER7: IBMs next-generation server processor. IEEE Micro 30(2):7–15CrossRef
6.
go back to reference Shah M, Barreh J, Brooks J, Golla R, Grohoski G, Gura N, Hetherington R, Jordan P, Luttrell M, Olson C, Saha B, Sheahan D, Spracklen L, Wynn A (2007) UltraSPARC T2: a highly-threaded, power-efficient, SPARC SoC. In: IEEE Asian solid-state circuits conference, pp 22–25 Shah M, Barreh J, Brooks J, Golla R, Grohoski G, Gura N, Hetherington R, Jordan P, Luttrell M, Olson C, Saha B, Sheahan D, Spracklen L, Wynn A (2007) UltraSPARC T2: a highly-threaded, power-efficient, SPARC SoC. In: IEEE Asian solid-state circuits conference, pp 22–25
7.
go back to reference Loh GH (2008) 3d-stacked memory architectures for multi-core processors. In: 35th international symposium on computer architecture (ISCA), pp 453–464 Loh GH (2008) 3d-stacked memory architectures for multi-core processors. In: 35th international symposium on computer architecture (ISCA), pp 453–464
8.
go back to reference Cho S, Jin L (2006) Managing distributed, shared L2 caches through OS-level page allocation. In: 39th IEEE/ACM international symposium on microarchitecture (MICRO), pp 455–465 Cho S, Jin L (2006) Managing distributed, shared L2 caches through OS-level page allocation. In: 39th IEEE/ACM international symposium on microarchitecture (MICRO), pp 455–465
9.
go back to reference Hardavellas N, Ferdman M, Falsafi B, Ailamaki A (2009) Reactive NUCA: near-optimal block placement and replication in distributed caches. In: 36th international symposium on computer architecture (ISCA), pp 184–195 Hardavellas N, Ferdman M, Falsafi B, Ailamaki A (2009) Reactive NUCA: near-optimal block placement and replication in distributed caches. In: 36th international symposium on computer architecture (ISCA), pp 184–195
10.
go back to reference Hughes CJ, Kim C, Chen Y-K (2010) Performance and energy implications of many-core caches for throughput computing. IEEE Micro 30(6):25–35CrossRef Hughes CJ, Kim C, Chen Y-K (2010) Performance and energy implications of many-core caches for throughput computing. IEEE Micro 30(6):25–35CrossRef
11.
go back to reference Ros A, Cintra M, Acacio ME, García JM (2009) Distance-aware round-robin mapping for large NUCA caches. In: 16th international conference on high performance computing (HiPC), pp 79–88 Ros A, Cintra M, Acacio ME, García JM (2009) Distance-aware round-robin mapping for large NUCA caches. In: 16th international conference on high performance computing (HiPC), pp 79–88
12.
go back to reference Kim C, Burger D, Keckler SW (2002) An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In: 10th international conference on architectural support for programming language and operating systems (ASPLOS), pp 211–222 Kim C, Burger D, Keckler SW (2002) An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In: 10th international conference on architectural support for programming language and operating systems (ASPLOS), pp 211–222
13.
go back to reference Das S, Fan A, Chen K-N, Tan CS, Checka N, Reif R (2004) Technology, performance, and computer-aided design of three-dimensional integrated circuits. In: International symposium on physical design, pp 108–115 Das S, Fan A, Chen K-N, Tan CS, Checka N, Reif R (2004) Technology, performance, and computer-aided design of three-dimensional integrated circuits. In: International symposium on physical design, pp 108–115
14.
go back to reference Acacio ME, González J, García JM, Duato J (2005) A two-level directory architecture for highly scalable cc-NUMA multiprocessors. IEEE Trans Parall Distrib Syst (TPDS) 16(1):67–79CrossRef Acacio ME, González J, García JM, Duato J (2005) A two-level directory architecture for highly scalable cc-NUMA multiprocessors. IEEE Trans Parall Distrib Syst (TPDS) 16(1):67–79CrossRef
15.
go back to reference Magnusson PS, Christensson M, Eskilson J, Forsgren D, Hallberg G, Hogberg J, Larsson F, Moestedt A, Werner B (2002) Simics: a full system simulation platform. IEEE Comput 35(2):50–58CrossRef Magnusson PS, Christensson M, Eskilson J, Forsgren D, Hallberg G, Hogberg J, Larsson F, Moestedt A, Werner B (2002) Simics: a full system simulation platform. IEEE Comput 35(2):50–58CrossRef
16.
go back to reference Martin MM, Sorin DJ, Beckmann BM, Marty MR, Xu M, Alameldeen AR, Moore KE, Hill MD, Wood DA (2005) Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset. Comput Architect News 33(4):92–99CrossRef Martin MM, Sorin DJ, Beckmann BM, Marty MR, Xu M, Alameldeen AR, Moore KE, Hill MD, Wood DA (2005) Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset. Comput Architect News 33(4):92–99CrossRef
17.
go back to reference Puente V, Gregorio JA, Beivide R (2002) SICOSYS: an integrated framework for studying interconnection network in multiprocessor systems. In: 10th Euromicro workshop on parallel, distributed and network-based processing, pp 15–22 Puente V, Gregorio JA, Beivide R (2002) SICOSYS: an integrated framework for studying interconnection network in multiprocessor systems. In: 10th Euromicro workshop on parallel, distributed and network-based processing, pp 15–22
18.
go back to reference Alameldeen AR, Wood DA (2003) Variability in architectural simulations of multi-threaded workloads. In: 9th international symposium on high-performance computer architecture (HPCA), pp 7–18 Alameldeen AR, Wood DA (2003) Variability in architectural simulations of multi-threaded workloads. In: 9th international symposium on high-performance computer architecture (HPCA), pp 7–18
19.
go back to reference Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. In: 22nd international symposium on computer architecture (ISCA), pp 24–36 Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. In: 22nd international symposium on computer architecture (ISCA), pp 24–36
20.
go back to reference Gupta A, Weber W-D, Mowry TC (1990) Reducing memory traffic requirements for scalable directory-based cache coherence schemes. In: International conference on parallel processing (ICPP), pp 312–321 Gupta A, Weber W-D, Mowry TC (1990) Reducing memory traffic requirements for scalable directory-based cache coherence schemes. In: International conference on parallel processing (ICPP), pp 312–321
21.
go back to reference Chaiken D, Kubiatowicz J, Agarwal A (1991) LimitLESS directories: a scalable cache coherence scheme. In: 4th international conference on architectural support for programming language and operating systems (ASPLOS), pp 224–234 Chaiken D, Kubiatowicz J, Agarwal A (1991) LimitLESS directories: a scalable cache coherence scheme. In: 4th international conference on architectural support for programming language and operating systems (ASPLOS), pp 224–234
22.
go back to reference Simoni R, Horowitz MA (2001) Dynamic pointer allocation for scalable cache coherence directories. In: International symposium on shared memory multiprocessing, pp 72–81 Simoni R, Horowitz MA (2001) Dynamic pointer allocation for scalable cache coherence directories. In: International symposium on shared memory multiprocessing, pp 72–81
23.
go back to reference Chishti Z, Powell MD, Vijaykumar TN (2003) Distance associativity for high-performance energy-efficient non-uniform cache architectures. In: 36th IEEE/ACM international symposium on microarchitecture (MICRO), pp 55–66 Chishti Z, Powell MD, Vijaykumar TN (2003) Distance associativity for high-performance energy-efficient non-uniform cache architectures. In: 36th IEEE/ACM international symposium on microarchitecture (MICRO), pp 55–66
24.
go back to reference Beckmann BM, Wood DA (2004) Managing wire delay in large chip-multiprocessor caches. In: 37th IEEE/ACM international symposium on microarchitecture (MICRO), pp 319–330 Beckmann BM, Wood DA (2004) Managing wire delay in large chip-multiprocessor caches. In: 37th IEEE/ACM international symposium on microarchitecture (MICRO), pp 319–330
25.
go back to reference Zhang M, Asanović K (Oct. 2005) Victim migration: dynamically adapting between private and shared CMP caches. Tech. rep, Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory Zhang M, Asanović K (Oct. 2005) Victim migration: dynamically adapting between private and shared CMP caches. Tech. rep, Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory
26.
go back to reference Lin J, Lu Q, Ding X, Zhang Z, Zhang X, Sadayappan P (2008) Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In: 14th international symposium on high-performance computer architecture (HPCA), pp 367–378 Lin J, Lu Q, Ding X, Zhang Z, Zhang X, Sadayappan P (2008) Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In: 14th international symposium on high-performance computer architecture (HPCA), pp 367–378
27.
go back to reference Awasthi M, Sudan K, Balasubramonian R, Carter J (2009) Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches. In: 15th international symposium on high-performance computer architecture (HPCA), pp 250–261 Awasthi M, Sudan K, Balasubramonian R, Carter J (2009) Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches. In: 15th international symposium on high-performance computer architecture (HPCA), pp 250–261
28.
go back to reference Chaudhuri M (2009) PageNUCA: selected policies for page-grain locality management in large shared chip-multiprocessor caches. In: 15th international symposium on high-performance computer architecture (HPCA), pp 227–238 Chaudhuri M (2009) PageNUCA: selected policies for page-grain locality management in large shared chip-multiprocessor caches. In: 15th international symposium on high-performance computer architecture (HPCA), pp 227–238
29.
go back to reference García-Guirado A, Fernández-Pascual R, Ros A, García JM (2012) Dapsco: distance-aware partially shared cache organization. ACM Trans Architech Code Opt (TACO) 8(4), 25:1–25:19 García-Guirado A, Fernández-Pascual R, Ros A, García JM (2012) Dapsco: distance-aware partially shared cache organization. ACM Trans Architech Code Opt (TACO) 8(4), 25:1–25:19
30.
go back to reference Li Y, Abousamra A, Melhem R, Jones AK (2010) Compiler-assisted data distribution for chip multiprocessors. In: 19th international conference on parallel architectures and compilation techniques (PACT), pp 501–512 Li Y, Abousamra A, Melhem R, Jones AK (2010) Compiler-assisted data distribution for chip multiprocessors. In: 19th international conference on parallel architectures and compilation techniques (PACT), pp 501–512
31.
go back to reference Li Y, Melhem RG, Jones AK (2012) Practically private: enabling high performance cmps through compiler-assisted data classification. In: 21st international conference on parallel architectures and compilation techniques (PACT), pp 231–240 Li Y, Melhem RG, Jones AK (2012) Practically private: enabling high performance cmps through compiler-assisted data classification. In: 21st international conference on parallel architectures and compilation techniques (PACT), pp 231–240
32.
go back to reference Ros A, Acacio ME, García JM (2008) Scalable directory organization for tiled CMP architectures. In: International conference on computer design (CDES), pp 112–118 Ros A, Acacio ME, García JM (2008) Scalable directory organization for tiled CMP architectures. In: International conference on computer design (CDES), pp 112–118
33.
go back to reference Zebchuk J, Srinivasan V, Qureshi MK, Moshovos A (2009) A tagless coherence directory. In: 42nd IEEE/ACM international symposium on microarchitecture (MICRO), pp 423–434 Zebchuk J, Srinivasan V, Qureshi MK, Moshovos A (2009) A tagless coherence directory. In: 42nd IEEE/ACM international symposium on microarchitecture (MICRO), pp 423–434
34.
go back to reference Cuesta B, Ros A, Gómez ME, Robles A, Duato J (2011) Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In: 38th international symposium on computer architecture (ISCA), pp 93–103 Cuesta B, Ros A, Gómez ME, Robles A, Duato J (2011) Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In: 38th international symposium on computer architecture (ISCA), pp 93–103
35.
go back to reference Ferdman M, Lotfi-Kamran P, Balet K, Falsafi B (2011) Cuckoo directory: a scalable directory for many-core systems. In: 17th international symposium on high-performance computer architecture (HPCA), pp 169–180 Ferdman M, Lotfi-Kamran P, Balet K, Falsafi B (2011) Cuckoo directory: a scalable directory for many-core systems. In: 17th international symposium on high-performance computer architecture (HPCA), pp 169–180
36.
go back to reference Sanchez D, Kozyrakis C (2012) SCD: a scalable coherence directory with flexible sharer set encoding. In: 18th international symposium on high-performance computer architecture (HPCA), pp 129–140 Sanchez D, Kozyrakis C (2012) SCD: a scalable coherence directory with flexible sharer set encoding. In: 18th international symposium on high-performance computer architecture (HPCA), pp 129–140
37.
go back to reference Kuskin J, Ofelt D, Heinrich M, Heinlein J, Simoni R, Gharachorloo K, Chapin J, Nakahira D, Baxter J, Horowitz MA, Gupta A, Rosenblum M, Hennessy JL (1994) The stanford FLASH multiprocessor. In: 21st international symposium on computer architecture (ISCA), pp 302–313 Kuskin J, Ofelt D, Heinrich M, Heinlein J, Simoni R, Gharachorloo K, Chapin J, Nakahira D, Baxter J, Horowitz MA, Gupta A, Rosenblum M, Hennessy JL (1994) The stanford FLASH multiprocessor. In: 21st international symposium on computer architecture (ISCA), pp 302–313
38.
go back to reference Agarwal A, Bianchini R, Chaiken D, Kranz D, Kubiatowicz J, Hong Lim B, Mackenzie K, Yeung D (1995) The MIT Alewife machine: architecture and performance. In: 22nd international symposium on computer architecture (ISCA), pp 2–13 Agarwal A, Bianchini R, Chaiken D, Kranz D, Kubiatowicz J, Hong Lim B, Mackenzie K, Yeung D (1995) The MIT Alewife machine: architecture and performance. In: 22nd international symposium on computer architecture (ISCA), pp 2–13
39.
go back to reference Agarwal A, Simoni R, Hennessy JL, Horowitz MA (1988) An evaluation of directory schemes for cache coherence. In: 15th international symposium on computer architecture (ISCA), pp 280–289 Agarwal A, Simoni R, Hennessy JL, Horowitz MA (1988) An evaluation of directory schemes for cache coherence. In: 15th international symposium on computer architecture (ISCA), pp 280–289
40.
go back to reference Mukherjee SS, Hill MD (1994) An evaluation of directory protocols for medium-scale shared-memory multiprocessors. In: 8th international conference on supercomputing (ICS), pp 64–74 Mukherjee SS, Hill MD (1994) An evaluation of directory protocols for medium-scale shared-memory multiprocessors. In: 8th international conference on supercomputing (ICS), pp 64–74
Metadata
Title
DASC-DIR: a low-overhead coherence directory for many-core processors
Authors
Alberto Ros
Manuel E. Acacio
Publication date
01-03-2015
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 3/2015
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1325-4

Other articles of this Issue 3/2015

The Journal of Supercomputing 3/2015 Go to the issue

Premium Partner