Combining Data and Computation Distribution Directives for Hybrid Parallel Programming: A Transformation System

Authors: Rachid Habel, Frédérique Silber-Chaussumier, François Irigoin, Elisabeth Brunet, François Trahay

Published in: International Journal of Parallel Programming, Issue 6/2016 (01-12-2016)

Abstract

This paper describes dSTEP, a directive-based programming model for hybrid shared- and distributed-memory machines. The originality of our work lies in the definition and implementation of a unified high-level programming model addressing both data and computation distribution, providing particularly fine control of the computation. The goal is to improve programmer productivity while delivering good performance in terms of execution time and memory usage. We define a generic compilation scheme for computation mapping and communication generation, and implement it in a source-to-source compiler together with a runtime library. We provide a series of optimizations that improve the performance of the generated code, with a special focus on reducing communication time. We evaluate our solution on several scientific kernels as well as on the more challenging NAS BT benchmark, and compare our results with the hand-written Fortran MPI and UPC implementations. The results show, first, that the dSTEP directives make the non-trivial parallel execution of the NAS BT benchmark explicit; second, our generated MPI+OpenMP BT program achieves an 83.35 speedup over the original NAS OpenMP C benchmark on a hybrid cluster of 64 quad-core nodes (256 cores). Overall, our solution dramatically reduces the programming effort while providing good execution-time and memory-usage performance. This programming model is suitable for a large variety of machines, such as multi-core and accelerator clusters.
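The compiler described in the abstract emits hybrid MPI+OpenMP code from high-level distribution directives. As a minimal hand-written sketch of that target pattern, and not the paper's actual generated code, the following C kernel block-distributes the rows of a 2D array across MPI processes, exchanges halo rows before computing, and uses OpenMP for intra-node loop parallelism. The array size, the stencil, and the assumption that the row count divides evenly among processes are illustrative choices of ours.

```c
/* Illustrative sketch only: a hand-written hybrid MPI+OpenMP kernel of the
 * kind a distribution-directive compiler like dSTEP automates.
 * Not taken from the paper's generated code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024  /* global row count; assumed divisible by the process count */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block distribution: each process owns N/size rows, plus one halo
       row above and one below for the 5-point stencil. */
    int local = N / size;
    double (*a)[N] = malloc((size_t)(local + 2) * sizeof *a);
    double (*b)[N] = malloc((size_t)(local + 2) * sizeof *b);
    for (int i = 0; i < local + 2; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (double)(rank * local + i + j);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Halo exchange: the communication a distribution-aware compiler
       must generate before the computation can proceed. */
    MPI_Sendrecv(a[1],         N, MPI_DOUBLE, up,   0,
                 a[local + 1], N, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(a[local],     N, MPI_DOUBLE, down, 1,
                 a[0],         N, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Shared-memory parallelism inside the node: OpenMP over owned rows. */
    #pragma omp parallel for
    for (int i = 1; i <= local; i++)
        for (int j = 1; j < N - 1; j++)
            b[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j]
                            + a[i][j - 1] + a[i][j + 1]);

    if (rank == 0)
        printf("b[1][1] = %f\n", b[1][1]);

    free(a);
    free(b);
    MPI_Finalize();
    return 0;
}
```

Build with an MPI compiler wrapper and OpenMP enabled (e.g., mpicc -fopenmp) and launch with mpirun. The point of the dSTEP approach is precisely that the block distribution, halo sizing, and communication calls sketched above are derived by the compiler from directives instead of being written and maintained by hand.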


Metadata
Title
Combining Data and Computation Distribution Directives for Hybrid Parallel Programming: A Transformation System
Authors
Rachid Habel
Frédérique Silber-Chaussumier
François Irigoin
Elisabeth Brunet
François Trahay
Publication date
01-12-2016
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 6/2016
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-016-0428-3
