
The Journal of Supercomputing

Published in:

01-03-2015

DASC-DIR: a low-overhead coherence directory for many-core processors

Authors: Alberto Ros, Manuel E. Acacio

Published in: The Journal of Supercomputing | Issue 3/2015

Abstract

Current trends point toward future many-core processors being implemented using the hardware-managed, implicitly addressed, coherent caches memory model. With this memory model, all on-chip storage is used for private and shared caches that are kept coherent by hardware. Communication between cores is performed by writing to and reading from shared memory, and a scalable point-to-point interconnection network is in charge of transmitting messages. Cache coherence in this context is guaranteed by means of a directory-based protocol. Unfortunately, it has been previously shown that the directory structure required to keep track of sharers can restrict the scalability of these designs due to its excessive area or energy requirements or, for a compressed directory, the increased coherence traffic that it can cause in some cases. On the other hand, in many-core architectures, memory blocks are commonly assigned to the banks of a NUCA shared cache by following a physical mapping. This mapping assigns blocks to cache banks in a round-robin fashion, thus neglecting the distance between the cores that most frequently access each block and the NUCA bank to which the block is mapped. This issue impacts both cache access latency and the amount of on-chip network traffic generated, and it causes some area- and energy-efficient compressed directories to significantly increase the number of messages per coherence event, which ultimately translates into degraded performance. In this work we propose an efficient and low-overhead coherence directory built around two main ingredients. The first is the use of the distance-aware round-robin mapping policy, an OS-managed policy that tries to map the pages accessed by a core to its closest (local) bank while introducing an upper bound on the deviation of the distribution of memory pages among cache banks, which lessens the number of off-chip accesses. The second is the use of a highly compressed directory structure that takes advantage of this mapping policy to represent sharers very compactly without increasing coherence network traffic. Simulation results for a 32-core architecture demonstrate that, compared to a full-map directory using the typical round-robin physical mapping policy, our proposal drastically reduces the size of the directory structure (and thus its area and energy requirements); at the same time, it does not increase coherence network traffic, and it achieves average savings of 6% in execution time.
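The idea behind the distance-aware round-robin mapping policy can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `DistanceAwareMapper`, the 4x8 mesh layout, the row-major tile numbering, the Manhattan-distance metric, the tie-breaking rule, and the exact meaning of `max_deviation` are all assumptions made for illustration. The sketch only shows the two properties named in the abstract: a page goes to the bank closest to the requesting core, unless doing so would let that bank exceed a bound on how far its page count may deviate from the least-loaded bank.

```python
def manhattan(a, b, cols):
    """Hop count between tiles a and b on a 2D mesh with `cols` columns
    (row-major tile numbering is an assumption of this sketch)."""
    ax, ay = a % cols, a // cols
    bx, by = b % cols, b // cols
    return abs(ax - bx) + abs(ay - by)


class DistanceAwareMapper:
    """Hypothetical OS-level page-to-bank mapper (illustrative only)."""

    def __init__(self, rows, cols, max_deviation):
        self.cols = cols
        self.num_banks = rows * cols
        # Assumed bound: no bank may hold more than `max_deviation`
        # pages above the least-loaded bank.
        self.max_deviation = max_deviation
        self.pages_per_bank = [0] * self.num_banks
        self.page_to_bank = {}

    def map_page(self, page, core):
        """Return the bank for `page`, assigning it on first touch by `core`."""
        if page in self.page_to_bank:
            return self.page_to_bank[page]
        least = min(self.pages_per_bank)
        # Banks that can accept one more page without breaking the bound.
        candidates = [
            b for b in range(self.num_banks)
            if self.pages_per_bank[b] - least < self.max_deviation
        ]
        # Prefer the closest eligible bank; break ties by lower bank index.
        bank = min(candidates, key=lambda b: (manhattan(core, b, self.cols), b))
        self.pages_per_bank[bank] += 1
        self.page_to_bank[page] = bank
        return bank


mapper = DistanceAwareMapper(rows=4, cols=8, max_deviation=2)
# Core 0 touches five new pages: the first two stay local, the rest
# spill to the nearest banks once bank 0 reaches the bound.
print([mapper.map_page(p, core=0) for p in range(5)])  # [0, 0, 1, 1, 8]
```

With a very large `max_deviation` this sketch degenerates into a pure first-touch policy, and with `max_deviation=1` it approaches an even round-robin spread, which is the trade-off between locality and balanced bank usage that the abstract describes.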

To understand more clearly how the compressed sharing code affects the results, we concentrate solely on the number of unnecessary coherence messages, leaving implementation-dependent details out of the comparison.
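To make the notion of unnecessary coherence messages concrete, the following Python sketch counts them for a coarse-vector sharing code, a well-known compressed encoding that keeps one bit per group of cores. This is only an illustration: coarse vector is not the encoding proposed in the article, and the function names and the group size of 4 are assumptions chosen for the example.

```python
def coarse_vector_encode(sharers, group_size):
    """Return the set of group indices whose bit would be set,
    i.e. every group that contains at least one sharer."""
    return {core // group_size for core in sharers}


def coarse_vector_targets(groups, group_size):
    """Cores that receive a coherence message: every core in a marked group."""
    return {g * group_size + i for g in groups for i in range(group_size)}


def unnecessary_messages(sharers, group_size):
    """Messages sent to cores that do not actually hold the block."""
    groups = coarse_vector_encode(sharers, group_size)
    targets = coarse_vector_targets(groups, group_size)
    return len(targets - set(sharers))


# 32 cores with groups of 4: 8 bits per entry instead of the 32 bits
# a full-map bit vector needs.
print(unnecessary_messages({0, 5, 30}, group_size=4))    # 9
print(unnecessary_messages({0, 1, 2, 3}, group_size=4))  # 0
```

The second call shows why placement matters for a compressed code: when the sharers are clustered in one group, the compact encoding is exact and no extra messages are sent.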

Title: DASC-DIR: a low-overhead coherence directory for many-core processors
Authors: Alberto Ros, Manuel E. Acacio
Publication date: 01-03-2015
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 3/2015
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-014-1325-4
