2015 | OriginalPaper | Book chapter

A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X

Authors: A. A. Awan, K. Hamidouche, C. H. Chu, Dhabaleswar Panda

Published in: OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies

Publisher: Springer International Publishing

Abstract

An ever-increasing push for performance in the HPC arena has led to a multitude of hybrid architectures in both software and hardware for HPC systems. The Partitioned Global Address Space (PGAS) programming model has gained a lot of attention over the last couple of years. The main advantage of the PGAS model is the ease of programming provided by the abstraction of a single memory across the nodes of a cluster. OpenSHMEM implementations currently implement the OpenSHMEM 1.2 specification, which provides interfaces for one-sided, atomic, and collective operations. However, the recent trend in the HPC arena in general, and in the Message Passing Interface (MPI) community in particular, is to use Non-Blocking Collective (NBC) communication to efficiently overlap computation with communication and save precious CPU cycles.

This work is inspired by encouraging performance numbers for the NBC implementations of various MPI libraries. As the OpenSHMEM community has been discussing the use of non-blocking communication, in this paper we propose an NBC interface for OpenSHMEM and present its design, implementation, and performance evaluation. The NBC interface is modeled along the lines of the MPI NBC interface and requires minimal changes to the function signatures. We have designed and implemented this interface using the Unified Communication Runtime in MVAPICH2-X. In addition, we propose OpenSHMEM NBC benchmarks as an extension to the OpenSHMEM benchmarks available in the widely used OMB suite. Our performance evaluation shows that the proposed NBC implementation provides up to 96 percent overlap for different collectives with little NBC overhead.
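To make the overlap figure concrete, the sketch below computes an overlap percentage from three timings. It assumes one common definition used by non-blocking collective benchmarks: the share of the pure communication time that is hidden behind computation. The abstract does not state the exact formula used in the paper, so the function name `overlap_percent`, its parameters, and the sample timings are all hypothetical.

```python
def overlap_percent(t_pure_comm: float, t_compute: float, t_total: float) -> float:
    """Estimate communication/computation overlap as a percentage.

    Assumed definition (not taken from the paper):
      t_pure_comm: time of the non-blocking collective with no computation
                   between its start and its completion
      t_compute:   time of the computation alone
      t_total:     time of start + computation + completion together
    """
    if t_pure_comm <= 0:
        raise ValueError("t_pure_comm must be positive")
    # Communication time that was NOT hidden behind the computation.
    exposed = t_total - t_compute
    overlap = 100.0 * (1.0 - exposed / t_pure_comm)
    # Clamp so that timer noise cannot report less than 0 or more than 100.
    return max(0.0, min(100.0, overlap))


if __name__ == "__main__":
    # Hypothetical timings in microseconds, not measurements from the paper.
    print(overlap_percent(t_pure_comm=50.0, t_compute=100.0, t_total=102.0))
```

Under this definition, a total time equal to the computation time alone means the communication was fully hidden, while a total time equal to computation plus pure communication means nothing was overlapped.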

Title: A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X
Authors: A. A. Awan, K. Hamidouche, C. H. Chu, Dhabaleswar Panda
Publisher: Springer International Publishing
Book: OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies
Print ISBN: 978-3-319-26427-1
Electronic ISBN: 978-3-319-26428-8
Copyright year: 2015
DOI: https://doi.org/10.1007/978-3-319-26428-8_5
