Published in: The Journal of Supercomputing 12/2018

8 September 2018

Accelerated bulk memory operations on heterogeneous multi-core systems

Authors: JongHyuk Lee, Weidong Shi, JoonMin Gil


Abstract

Over the past few years, the traditional fixed-function graphics accelerator has evolved into a programmable general-purpose graphics processing unit, enabling general-purpose computing on GPUs (GPGPU). Recently, a more radical step has been taken in this direction: the integrated GPU, in which CPUs and GPUs are placed in the same package or even on the same die. In such a system-on-chip, the GPU occupies considerable silicon resources, yet it is unlikely to contribute to overall system performance when the system runs non-graphical workloads or non-GPGPU applications. This paper presents a novel approach that uses an integrated GPU to accelerate operations normally performed on CPUs, namely bulk memory operations such as memcpy and memcmp. Offloading bulk memory operations to the GPU has several benefits: (i) the throughput-oriented GPU outperforms the CPU in bulk memory operations; (ii) for on-die GPUs with a unified cache between the GPU and the CPU, the CPU can use the GPU private cache to store the moved data, reducing the CPU cache bottleneck; (iii) additional lightweight hardware can also support asynchronous offloads; and (iv) unlike prior art that uses a dedicated hardware copy engine (e.g., DMA), our approach utilizes as much of the GPU hardware as possible. Our results show that offloaded bulk memory operations run up to 4.3 times faster than on the CPU on micro-benchmarks while using fewer resources. In a cycle-based full-system simulation environment with eight real-world applications, five applications showed about 30% speedup and two showed about 20% speedup.
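To make the offloading idea concrete, here is a minimal sketch in C of how a copy wrapper might choose between the ordinary CPU path and an offload path. It is not the authors' implementation. The names `bulk_copy`, `offload_copy`, and `copy_slice`, the 64 KiB threshold, and the work-item count are all hypothetical. The "offload" path is only emulated on the CPU: it splits the buffer into contiguous slices the way parallel GPU threads might, but processes them one after another.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tuning values for illustration; not taken from the paper. */
#define OFFLOAD_THRESHOLD (64u * 1024u) /* bytes; smaller copies stay on the CPU */
#define NUM_WORK_ITEMS    8u            /* stand-in for the number of GPU threads */

/* Emulates one work item: copies its contiguous slice of the buffer. */
static void copy_slice(unsigned char *dst, const unsigned char *src,
                       size_t n, size_t item, size_t items)
{
    size_t chunk = (n + items - 1) / items; /* ceiling division */
    size_t begin = item * chunk;
    if (begin >= n)
        return;                             /* nothing left for this item */
    size_t len = (n - begin < chunk) ? n - begin : chunk;
    memcpy(dst + begin, src + begin, len);
}

/* Stand-in for the offload path. On real hardware each iteration would be
   an independent GPU thread; here the slices are copied sequentially. */
static void offload_copy(void *dst, const void *src, size_t n)
{
    for (size_t i = 0; i < NUM_WORK_ITEMS; ++i)
        copy_slice(dst, src, n, i, NUM_WORK_ITEMS);
}

/* Dispatch wrapper: returns 1 if the (emulated) offload path was taken,
   0 if the plain CPU memcpy was used. As with memcpy, the source and
   destination buffers must not overlap. */
int bulk_copy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n < OFFLOAD_THRESHOLD) {
        memcpy(dst, src, n);
        return 0;
    }
    offload_copy(dst, src, n);
    return 1;
}
```

The threshold reflects a general design consideration rather than a figure from the paper: handing work to another processing unit has a fixed cost, so small copies are typically better left on the CPU. The slices are independent of one another, which is the property that lets a copy be spread across many threads.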


Metadata
Title
Accelerated bulk memory operations on heterogeneous multi-core systems
Authors
JongHyuk Lee
Weidong Shi
JoonMin Gil
Publication date
8 September 2018
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 12/2018
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-018-2589-x
