The Journal of Supercomputing

Published: 01-09-2014

Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems

Authors: Weiwei Fu, Tianzhou Chen, Chao Wang, Li Liu

Published in: The Journal of Supercomputing | Issue 3/2014

Abstract

On-chip distributed memory systems have become an attractive solution for the massively parallel memory accesses expected in future many-core processors. However, the growing number of on-chip cores and memory controllers inevitably introduces many remote memory accesses, which generate a large amount of on-chip traffic and put great pressure on the interconnect. This paper optimizes on-chip memory access traffic via runtime thread migration. We first analyze memory access behavior in multi-threaded applications and find that memory access targets and volumes remain similar over short periods, which makes runtime prediction feasible. Over long periods, however, the access targets shift considerably, which motivates us to dynamically move threads towards the data. Based on these observations, we propose a novel low-cost, distributed thread migration algorithm that adjusts thread placement in chains based on benefit estimation. We present the details of the workflow, including the triggering and arbitration of migration requests and the procedure for determining migration chains. Simulation results show that our algorithm achieves a system performance speedup of 11.5% and reduces average memory access latency by 11.0%. It finds a small number of effective thread migrations that optimize on-chip memory access traffic with acceptable hardware and runtime overheads.
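The abstract does not state the benefit formula, so the following is only an illustrative sketch of what a benefit estimate for a single migration might look like. It assumes a 2D mesh where the hop count between tiles is the Manhattan distance, a per-thread prediction of access counts per memory controller, and a fixed migration cost. All names (`hops`, `traffic_cost`, `migration_benefit`, `best_target`) and the cost model are hypothetical and are not taken from the paper; migration chains and request arbitration are not modeled.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]  # (x, y) position of a tile on an assumed 2D mesh


def hops(a: Coord, b: Coord) -> int:
    """Manhattan distance, used here as an assumed hop-count proxy."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def traffic_cost(core: Coord, accesses: Dict[Coord, int]) -> int:
    """Predicted traffic: accesses to each memory controller weighted by hops."""
    return sum(count * hops(core, mc) for mc, count in accesses.items())


def migration_benefit(src: Coord, dst: Coord, accesses: Dict[Coord, int],
                      migration_cost: int) -> int:
    """Hypothetical benefit: traffic saved by moving src -> dst, minus a fixed cost."""
    return traffic_cost(src, accesses) - traffic_cost(dst, accesses) - migration_cost


def best_target(src: Coord, candidates: List[Coord], accesses: Dict[Coord, int],
                migration_cost: int) -> Optional[Tuple[int, Coord]]:
    """Return (benefit, core) for the best candidate, or None if no move pays off."""
    if not candidates:
        return None
    scored = [(migration_benefit(src, c, accesses, migration_cost), c)
              for c in candidates]
    benefit, core = max(scored)
    return (benefit, core) if benefit > 0 else None


# Assumed prediction: the thread mostly accesses the controller at (3, 3).
predicted = {(3, 3): 100, (0, 0): 10}
choice = best_target((0, 0), [(1, 0), (2, 3)], predicted, migration_cost=50)
```

In this toy setting the thread at `(0, 0)` would be moved next to the controller it uses most, since the saved hop-weighted traffic outweighs the assumed migration cost. A candidate is rejected whenever the estimated benefit is not positive, which mirrors the idea of performing only a few migrations that pay off.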

Title: Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems
Authors: Weiwei Fu, Tianzhou Chen, Chao Wang, Li Liu
Publication date: 01-09-2014
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 3/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-014-1240-8
