Published in: The Journal of Supercomputing 3/2014

1 September 2014

Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems

Authors: Weiwei Fu, Tianzhou Chen, Chao Wang, Li Liu

Abstract

On-chip distributed memory systems have become an attractive solution for the massively parallel memory accesses expected in future many-core processors. However, the increasing number of on-chip cores and memory controllers inevitably introduces many remote memory accesses, which generate a large amount of on-chip traffic and put great pressure on the interconnect. This paper optimizes on-chip memory access traffic via runtime thread migration. We first analyze memory access behavior in multi-threaded applications and find that memory access targets and volumes remain similar over short periods, which makes runtime prediction feasible. Over long periods, however, the memory access targets shift considerably, which motivates us to move threads dynamically towards their data. Based on these observations, we propose a novel low-cost, distributed thread migration algorithm that adjusts thread placement in chains based on benefit estimation. We present the details of the workflow, including the triggering and arbitration of migration requests and the procedures for determining the migration chains. Simulation results show that our algorithm achieves a system performance speedup of 11.5% and reduces average memory access latency by 11.0%. It finds a small number of effective thread migrations that optimize on-chip memory access traffic with acceptable hardware and runtime overheads.
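
To make the idea of benefit estimation for a migration chain more concrete, the following is a minimal sketch and not the algorithm from the paper, whose details are not given in this abstract. It assumes a 2D mesh, uses hop count weighted by the access counts observed in the last period as the traffic cost, and models a chain as a rotation in which each thread moves to the core of the next thread in the chain. All names (`hops`, `traffic_cost`, `chain_benefit`, `migration_cost`) and the example data are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a core or memory controller in the mesh


def hops(a: Coord, b: Coord) -> int:
    """Manhattan distance between two mesh positions (assumed routing cost)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def traffic_cost(core: Coord, accesses: Dict[str, int],
                 controllers: Dict[str, Coord]) -> int:
    """Hop-weighted traffic of one thread if it ran on `core`.

    `accesses` maps a memory controller name to the access count seen in the
    last period, used here as the prediction for the next period.
    """
    return sum(n * hops(core, controllers[mc]) for mc, n in accesses.items())


def chain_benefit(chain: List[str], placement: Dict[str, Coord],
                  accesses: Dict[str, Dict[str, int]],
                  controllers: Dict[str, Coord],
                  migration_cost: int = 0) -> int:
    """Estimated traffic saved by rotating the threads in `chain`.

    Assumption: thread i moves to the core of thread i+1, and the last thread
    moves to the core of the first. A fixed `migration_cost` is charged per
    migrated thread.
    """
    benefit = 0
    for i, thread in enumerate(chain):
        src = placement[thread]
        dst = placement[chain[(i + 1) % len(chain)]]
        before = traffic_cost(src, accesses[thread], controllers)
        after = traffic_cost(dst, accesses[thread], controllers)
        benefit += before - after - migration_cost
    return benefit


# Hypothetical 4x4 mesh with two memory controllers in opposite corners.
controllers = {"mc0": (0, 0), "mc1": (3, 3)}
placement = {"A": (0, 0), "B": (3, 3)}
accesses = {"A": {"mc1": 100}, "B": {"mc0": 80}}

print(chain_benefit(["A", "B"], placement, accesses, controllers))      # 1080
print(chain_benefit(["A", "B"], placement, accesses, controllers, 50))  # 980
```

In this example each thread sits next to the controller the other one uses, so swapping them removes all remote traffic. A runtime policy along these lines would presumably accept a chain only when the estimated benefit is positive after the migration overhead is subtracted.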

Metadata
Title
Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems
Authors
Weiwei Fu
Tianzhou Chen
Chao Wang
Li Liu
Publication date
1 September 2014
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 3/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1240-8
