
2016 | Original Paper | Book Chapter

Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors

Authors: Bin Ren, Nishkam Ravi, Yi Yang, Min Feng, Gagan Agrawal, Srimat Chakradhar

Published in: Languages and Compilers for Parallel Computing

Publisher: Springer International Publishing


Abstract

Manually orchestrating data transfers between the CPU and a coprocessor is cumbersome, particularly for multi-dimensional arrays and other multi-level pointer data structures that are common in scientific computations. This paper describes a system comprising both compile-time and runtime solutions to this problem, with the overarching goal of improving programmer productivity while maintaining performance.
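To make the burden concrete, here is a minimal, hypothetical C sketch (ours, not the paper's) of a row-wise allocated 2D array and the per-row transfers a programmer would otherwise have to orchestrate by hand; copy_row_to_device is a made-up stand-in for whatever low-level host-to-device copy primitive the target platform provides.

    #include <stdlib.h>

    /* Hypothetical stand-in for the platform's low-level host-to-device copy. */
    static void copy_row_to_device(const double *row, size_t n) {
        (void)row; (void)n; /* a real version would call into the offload runtime */
    }

    /* A 2D array built with multi-level pointers: every row is a separate
     * heap block, so the data is not contiguous in host memory. */
    static double **alloc_2d(size_t rows, size_t cols) {
        double **a = malloc(rows * sizeof *a);
        for (size_t i = 0; i < rows; i++)
            a[i] = malloc(cols * sizeof **a);
        return a;
    }

    /* Manual orchestration: each row needs its own transfer, and the host-side
     * pointer array itself is meaningless in the device's address space. */
    static void send_to_device(double **a, size_t rows, size_t cols) {
        for (size_t i = 0; i < rows; i++)
            copy_row_to_device(a[i], cols);
    }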
We find that the standard linearization method performs poorly on the coprocessor for non-uniform dimensions, owing to redundant data transfers and to the suppression of important compiler optimizations such as vectorization. The key contribution of this paper is a novel heap-linearization approach, referred to as partial linearization with pointer reset, that avoids modifying memory accesses and thereby preserves vectorization.
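As a rough illustration of the idea, the following C sketch (our reconstruction under stated assumptions, not code from the paper) shows what partial linearization with pointer reset could look like for a two-level array: the rows are packed into one contiguous heap block that can be shipped in a single transfer, and the existing row pointers are merely reset to point into that block, so accesses of the form a[i][j] remain untouched.

    #include <stdlib.h>
    #include <string.h>

    /* Sketch of partial linearization with pointer reset for a double**:
     * pack all rows into one contiguous buffer, then redirect the existing
     * row pointers into it. Accesses a[i][j] stay syntactically unchanged,
     * while the buffer can be moved between host and device in one transfer. */
    static double *partially_linearize(double **a, size_t rows, size_t cols) {
        double *buf = malloc(rows * cols * sizeof *buf);
        for (size_t i = 0; i < rows; i++) {
            memcpy(buf + i * cols, a[i], cols * sizeof *buf);
            free(a[i]);             /* release the old, non-contiguous row */
            a[i] = buf + i * cols;  /* pointer reset into the linearized heap */
        }
        return buf; /* single contiguous region to transfer */
    }

Presumably, a matching pointer-reset step is applied against the device-side copy of the buffer after the transfer, so that the unmodified a[i][j] accesses, and hence the compiler's vectorizer, continue to work on the coprocessor.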
We implement partial linearization with pointer reset as the compile-time solution, whereas the runtime solution is implemented as an enhancement to the MYO library. We evaluate our approach on multiple C benchmarks. Experimental results demonstrate that our best compile-time solution runs 2.5x-5x faster than the original runtime solution, and that the CPU-MIC code using it achieves a 1.5x-2.5x speedup over the 16-thread CPU version.

Footnotes
2
Due to the page limit, we omit some details of the runtime optimization and of the source-to-source transformation that integrates the two approaches, as well as all of our code examples, in this version. Please refer to our LCPC’15 conference version for more details: http://www.csc2.ncsu.edu/workshops/lcpc2015/lcpc15proc.pdf.
 
3
OpenACC: Directives for Accelerators. http://www.openacc-standard.org/.
 
Metadata
Title
Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors
Authors
Bin Ren
Nishkam Ravi
Yi Yang
Min Feng
Gagan Agrawal
Srimat Chakradhar
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-29778-1_11
