
2016 | Original Paper | Book Chapter

Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors

Authors: Bin Ren, Nishkam Ravi, Yi Yang, Min Feng, Gagan Agrawal, Srimat Chakradhar

Published in: Languages and Compilers for Parallel Computing

Publisher: Springer International Publishing


Abstract

Manually orchestrating data transfers between the CPU and a coprocessor is cumbersome, particularly for multi-dimensional arrays and other multi-level pointer data structures that are common in scientific computations. This paper describes a system comprising both compile-time and runtime solutions to this problem, with the overarching goal of improving programmer productivity while maintaining performance.
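To make the burden concrete, here is a minimal, hypothetical C sketch (ours, not the paper's) of a row-wise allocated 2D array and the per-row transfers a programmer would otherwise have to orchestrate by hand; copy_row_to_device is a made-up stand-in for whatever low-level host-to-device copy primitive the target platform provides.

    #include <stdlib.h>

    /* Hypothetical stand-in for the platform's low-level host-to-device copy. */
    static void copy_row_to_device(const double *row, size_t n) {
        (void)row; (void)n; /* a real version would call into the offload runtime */
    }

    /* A 2D array built with multi-level pointers: every row is a separate
     * heap block, so the data is not contiguous in host memory. */
    static double **alloc_2d(size_t rows, size_t cols) {
        double **a = malloc(rows * sizeof *a);
        for (size_t i = 0; i < rows; i++)
            a[i] = malloc(cols * sizeof **a);
        return a;
    }

    /* Manual orchestration: each row needs its own transfer, and the host-side
     * pointer array itself is meaningless in the device's address space. */
    static void send_to_device(double **a, size_t rows, size_t cols) {
        for (size_t i = 0; i < rows; i++)
            copy_row_to_device(a[i], cols);
    }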
We find that the standard linearization method performs poorly on the coprocessor for non-uniform dimensions, owing to redundant data transfers and to the suppression of important compiler optimizations such as vectorization. The key contribution of this paper is a novel heap-linearization approach, referred to as partial linearization with pointer reset, that avoids modifying memory accesses and thereby preserves vectorization.
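As a rough illustration of the idea, the following C sketch (our reconstruction under stated assumptions, not code from the paper) shows what partial linearization with pointer reset could look like for a two-level array: the rows are packed into one contiguous heap block that can be shipped in a single transfer, and the existing row pointers are merely reset to point into that block, so accesses of the form a[i][j] remain untouched.

    #include <stdlib.h>
    #include <string.h>

    /* Sketch of partial linearization with pointer reset for a double**:
     * pack all rows into one contiguous buffer, then redirect the existing
     * row pointers into it. Accesses a[i][j] stay syntactically unchanged,
     * while the buffer can be moved between host and device in one transfer. */
    static double *partially_linearize(double **a, size_t rows, size_t cols) {
        double *buf = malloc(rows * cols * sizeof *buf);
        for (size_t i = 0; i < rows; i++) {
            memcpy(buf + i * cols, a[i], cols * sizeof *buf);
            free(a[i]);             /* release the old, non-contiguous row */
            a[i] = buf + i * cols;  /* pointer reset into the linearized heap */
        }
        return buf; /* single contiguous region to transfer */
    }

Presumably, a matching pointer-reset step is applied against the device-side copy of the buffer after the transfer, so that the unmodified a[i][j] accesses, and hence the compiler's vectorizer, continue to work on the coprocessor.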
We implement partial linearization with pointer reset as the compile-time solution, whereas the runtime solution is implemented as an enhancement to the MYO library. We evaluate our approach on multiple C benchmarks. Experimental results demonstrate that our best compile-time solution runs 2.5x-5x faster than the original runtime solution, and that the CPU-MIC code using it achieves a 1.5x-2.5x speedup over the 16-thread CPU version.

Footnotes
2
Due to the page limit, we omit some details of the runtime optimization and of the source-to-source transformation that integrates the two approaches, as well as all of our code examples, in this version. Please refer to our LCPC’15 conference version for more details: http://www.csc2.ncsu.edu/workshops/lcpc2015/lcpc15proc.pdf.
 
3
OpenACC: Directives for Accelerators. http://www.openacc-standard.org/.
 
Metadata
Title
Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors
Authors
Bin Ren
Nishkam Ravi
Yi Yang
Min Feng
Gagan Agrawal
Srimat Chakradhar
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-29778-1_11
